
The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"

License: Apache License 2.0

Python 99.77% Dockerfile 0.05% Shell 0.19%
deep-learning distillation mae pose-estimation pytorch self-supervised-learning vision-transformer

vitpose's Introduction

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation


Results | Updates | Usage | Todo | Acknowledge

This branch contains the PyTorch implementation of ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation and ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.

Web Demo

MAE Pre-trained model

  • The small-size MAE pre-trained model can be found in Onedrive.
  • The base, large, and huge MAE pre-trained models can be found in the MAE official repo.

Results from this repo on MS COCO val set (single-task training)

Using detection results from a detector that obtains 56 mAP on the person class. The configs here are used for both training and testing.
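
For reference, below is a minimal sketch of how such detector boxes are typically plugged into an mmpose-style top-down evaluation config. The field names follow the usual mmpose data_cfg convention, and the bbox_file name (the commonly distributed 56-AP person-detection file for COCO val2017) is an assumption here, not copied from this repo's configs.

# Hypothetical excerpt of a top-down COCO config (mmpose-style; names assumed):
data_cfg = dict(
    image_size=[192, 256],   # width x height, matching the 256x192 models above
    heatmap_size=[48, 64],
    use_gt_bbox=False,       # evaluate with detector boxes instead of ground-truth boxes
    det_bbox_thr=0.0,
    bbox_file='data/coco/person_detection_results/'
              'COCO_val2017_detections_AP_H_56_person.json',
)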

With classic decoder

Model Pretrain Resolution AP AR config log weight
ViTPose-S MAE 256x192 73.8 79.2 config log Onedrive
ViTPose-B MAE 256x192 75.8 81.1 config log Onedrive
ViTPose-L MAE 256x192 78.3 83.5 config log Onedrive
ViTPose-H MAE 256x192 79.1 84.1 config log Onedrive

With simple decoder

Model Pretrain Resolution AP AR config log weight
ViTPose-S MAE 256x192 73.5 78.9 config log Onedrive
ViTPose-B MAE 256x192 75.5 80.9 config log Onedrive
ViTPose-L MAE 256x192 78.2 83.4 config log Onedrive
ViTPose-H MAE 256x192 78.9 84.0 config log Onedrive

Results with multi-task training

Note: * There may be duplicate images between the CrowdPose training set and the validation images of other datasets, as discussed in issue #24. Please be careful when using these models for evaluation. We provide the results without the CrowdPose dataset for reference.

Human datasets (MS COCO, AIC, MPII, CrowdPose)

Results on MS COCO val set

Using detection results from a detector that obtains 56 mAP on the person class. Note the configs here are only for evaluation.

Model Dataset Resolution AP AR config weight
ViTPose-B COCO+AIC+MPII 256x192 77.1 82.2 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 78.7 83.8 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 79.5 84.5 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 81.0 85.6
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 77.5 82.6 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 79.1 84.1 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 79.8 84.8 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 75.8 82.6 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 77.0 82.6 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 78.6 84.1 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 79.4 84.8 config log | Onedrive

Results on OCHuman test set

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

Model Dataset Resolution AP AR config weight
ViTPose-B COCO+AIC+MPII 256x192 88.0 89.6 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 90.9 92.2 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 90.9 92.3 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 93.3 94.3
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 88.2 90.0 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 91.5 92.8 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 91.6 92.8 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 78.4 80.6 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 82.6 84.8 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 85.7 87.5 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 85.7 87.4 config log | Onedrive

Results on MPII val set

Using groundtruth bounding boxes. Note the configs here are only for evaluation. The metric is PCKh.

Model Dataset Resolution Mean config weight
ViTPose-B COCO+AIC+MPII 256x192 93.3 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 94.0 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 94.1 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 94.3
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 93.4 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 93.9 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 94.1 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 92.7 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 92.8 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 94.0 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 94.2 config log | Onedrive

Results on AI Challenger test set

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

Model Dataset Resolution AP AR config weight
ViTPose-B COCO+AIC+MPII 256x192 32.0 36.3 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 34.5 39.0 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 35.4 39.9 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 43.2 47.1
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 31.9 36.3 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 34.6 39.0 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 35.3 39.8 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 29.7 34.3 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 31.8 36.3 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 34.3 38.9 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 34.8 39.1 config log | Onedrive

Results on CrowdPose test set

Using a YOLOv3 human detector. Note the configs here are only for evaluation.

Model Dataset Resolution AP AP(H) config weight
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 74.7 63.3 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 76.6 65.9 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 76.3 65.6 config Onedrive

Animal datasets (AP10K, APT36K)

Results on AP-10K test set

Model Dataset Resolution AP config weight
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 71.4 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 74.5 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 80.4 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 82.4 config log | Onedrive

Results on APT-36K val set

Model Dataset Resolution AP config weight
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 74.2 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 75.9 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 80.8 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 82.3 config log | Onedrive

WholeBody dataset

Model Dataset Resolution AP config weight
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 54.4 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 57.4 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 60.6 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 61.2 config log | Onedrive

Transfer results on the hand dataset (InterHand2.6M)

Model Dataset Resolution AUC config weight
ViTPose+-S COCO+AIC+MPII+WholeBody 256x192 86.5 config Coming Soon
ViTPose+-B COCO+AIC+MPII+WholeBody 256x192 87.0 config Coming Soon
ViTPose+-L COCO+AIC+MPII+WholeBody 256x192 87.5 config Coming Soon
ViTPose+-H COCO+AIC+MPII+WholeBody 256x192 87.6 config Coming Soon

Updates

[2023-01-10] Update ViTPose+! It uses mixture-of-experts (MoE) strategies to jointly handle human, animal, and whole-body pose estimation tasks.

[2022-05-24] Upload the single-task training code, single-task pre-trained models, and multi-task pretrained models.

[2022-05-06] Upload the logs for the base, large, and huge models!

[2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on COCO test-dev set!

Applications of ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting | VSA | ViTDet

Usage

We use PyTorch 1.9.0 (or NGC Docker 21.06) and mmcv 1.3.9 for the experiments. Install mmcv and ViTPose as follows:

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .

After installing the two repos, install timm and einops:

pip install timm==0.4.9 einops
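
As a quick, optional sanity check that the environment is wired up (assuming the packages expose __version__ as usual), you can run:

python -c "import mmcv, mmpose, timm, einops; print(mmcv.__version__, mmpose.__version__, timm.__version__)"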

After downloading the pre-trained models, run the experiments with

# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH> --seed 0

# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch --seed 0
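
For example, a single-machine run that fine-tunes ViTPose-B on COCO with 8 GPUs might look like the following; the config path matches the one used elsewhere in this README, while the checkpoint location is illustrative and should point to wherever you saved the MAE-pretrained weights:

bash tools/dist_train.sh configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py 8 --cfg-options model.pretrained=/path/to/mae_pretrain_vit_base.pth --seed 0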

To test the performance of the pre-trained models, please run

bash tools/dist_test.sh <Config PATH> <Checkpoint PATH> <NUM GPUs>
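
For example, a hypothetical single-GPU evaluation of the downloaded ViTPose-B COCO weights (adjust the checkpoint path to wherever you stored the file):

bash tools/dist_test.sh configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py /path/to/vitpose-b.pth 1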

For ViTPose+ pre-trained models, please first re-organize the pre-trained weights using

python tools/model_split.py --source <Pretrained PATH>

Todo

This repo currently contains modifications including:

  • Upload configs and pretrained models

  • More models with SOTA results

  • Upload multi-task training config

Acknowledge

We acknowledge the excellent implementations of mmpose and MAE.

Citing ViTPose

For ViTPose

@inproceedings{
  xu2022vitpose,
  title={Vi{TP}ose: Simple Vision Transformer Baselines for Human Pose Estimation},
  author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
}

For ViTPose+

@article{xu2022vitpose+,
  title={ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation},
  author={Xu, Yufei and Zhang, Jing and Zhang, Qiming and Tao, Dacheng},
  journal={arXiv preprint arXiv:2212.04246},
  year={2022}
}

For ViTAE and ViTAEv2, please refer to:

@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}

vitpose's People

Contributors

ak391, annbless, seaman1900


vitpose's Issues

Training device?

Hi, I'd like to re-train this model on my own data; however, an out-of-memory error occurs even when samples_per_gpu is set to 1. I'm using an RTX 2080 Ti.

Where is the optimized attention block?

Thank you for open-sourcing this great repo.
Table 4 of the paper compares attention tricks such as window MSA and shifted-window MSA, but I can't find them in this repo.

Would you provide bottom-up-based pretrained weights of ViTPose?

Thanks for your research contribution and for publishing the code!

I will be using this model in a bottom-up keypoint estimation pipeline for my research.

Looking at the code, I found bottom-up inference code, but I could not find bottom-up pre-trained weights for ViTPose.

Would you provide bottom-up pre-trained weights?

The model and loaded state dict do not match exactly

Hi there,

First of all, thank you for reading this issue.

I am testing the following model and get the following error; it seems the config file does not match the pre-trained model. I am not sure what mistake I have made. Many thanks to anyone who could offer a hint.

Results from this repo on MS COCO val set (single-task training)

ViTPose-B | MAE | 256x192 | 75.8 | 81.1 | config | log | Onedrive (this is where I downloaded the .pth file)

I used the following command:
bash tools/dist_train.sh /home/zee/ViTPose/ViTPose/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py 1 --cfg-options model.pretrained=/home/zee/ViTPose/vitpose-b.pth --seed 0

WARNING:root:The model and loaded state dict do not match exactly

unexpected key in source state_dict: backbone.pos_embed, backbone.patch_embed.proj.weight, backbone.patch_embed.proj.bias, backbone.blocks.0.norm1.weight, backbone.blocks.0.norm1.bias, backbone.blocks.0.attn.qkv.weight, backbone.blocks.0.attn.qkv.bias, backbone.blocks.0.attn.proj.weight, backbone.blocks.0.attn.proj.bias, backbone.blocks.0.norm2.weight, backbone.blocks.0.norm2.bias, backbone.blocks.0.mlp.fc1.weight, backbone.blocks.0.mlp.fc1.bias, backbone.blocks.0.mlp.fc2.weight, backbone.blocks.0.mlp.fc2.bias, backbone.blocks.1.norm1.weight, backbone.blocks.1.norm1.bias, backbone.blocks.1.attn.qkv.weight, backbone.blocks.1.attn.qkv.bias, backbone.blocks.1.attn.proj.weight, backbone.blocks.1.attn.proj.bias, backbone.blocks.1.norm2.weight, backbone.blocks.1.norm2.bias, backbone.blocks.1.mlp.fc1.weight, backbone.blocks.1.mlp.fc1.bias, backbone.blocks.1.mlp.fc2.weight, backbone.blocks.1.mlp.fc2.bias, backbone.blocks.2.norm1.weight, backbone.blocks.2.norm1.bias, backbone.blocks.2.attn.qkv.weight, backbone.blocks.2.attn.qkv.bias, backbone.blocks.2.attn.proj.weight, backbone.blocks.2.attn.proj.bias, backbone.blocks.2.norm2.weight, backbone.blocks.2.norm2.bias, backbone.blocks.2.mlp.fc1.weight, backbone.blocks.2.mlp.fc1.bias, backbone.blocks.2.mlp.fc2.weight, backbone.blocks.2.mlp.fc2.bias, backbone.blocks.3.norm1.weight, backbone.blocks.3.norm1.bias, backbone.blocks.3.attn.qkv.weight, backbone.blocks.3.attn.qkv.bias, backbone.blocks.3.attn.proj.weight, backbone.blocks.3.attn.proj.bias, backbone.blocks.3.norm2.weight, backbone.blocks.3.norm2.bias, backbone.blocks.3.mlp.fc1.weight, backbone.blocks.3.mlp.fc1.bias, backbone.blocks.3.mlp.fc2.weight, backbone.blocks.3.mlp.fc2.bias, backbone.blocks.4.norm1.weight, backbone.blocks.4.norm1.bias, backbone.blocks.4.attn.qkv.weight, backbone.blocks.4.attn.qkv.bias, backbone.blocks.4.attn.proj.weight, backbone.blocks.4.attn.proj.bias, backbone.blocks.4.norm2.weight, backbone.blocks.4.norm2.bias, backbone.blocks.4.mlp.fc1.weight, backbone.blocks.4.mlp.fc1.bias, backbone.blocks.4.mlp.fc2.weight, backbone.blocks.4.mlp.fc2.bias, backbone.blocks.5.norm1.weight, backbone.blocks.5.norm1.bias, backbone.blocks.5.attn.qkv.weight, backbone.blocks.5.attn.qkv.bias, backbone.blocks.5.attn.proj.weight, backbone.blocks.5.attn.proj.bias, backbone.blocks.5.norm2.weight, backbone.blocks.5.norm2.bias, backbone.blocks.5.mlp.fc1.weight, backbone.blocks.5.mlp.fc1.bias, backbone.blocks.5.mlp.fc2.weight, backbone.blocks.5.mlp.fc2.bias, backbone.blocks.6.norm1.weight, backbone.blocks.6.norm1.bias, backbone.blocks.6.attn.qkv.weight, backbone.blocks.6.attn.qkv.bias, backbone.blocks.6.attn.proj.weight, backbone.blocks.6.attn.proj.bias, backbone.blocks.6.norm2.weight, backbone.blocks.6.norm2.bias, backbone.blocks.6.mlp.fc1.weight, backbone.blocks.6.mlp.fc1.bias, backbone.blocks.6.mlp.fc2.weight, backbone.blocks.6.mlp.fc2.bias, backbone.blocks.7.norm1.weight, backbone.blocks.7.norm1.bias, backbone.blocks.7.attn.qkv.weight, backbone.blocks.7.attn.qkv.bias, backbone.blocks.7.attn.proj.weight, backbone.blocks.7.attn.proj.bias, backbone.blocks.7.norm2.weight, backbone.blocks.7.norm2.bias, backbone.blocks.7.mlp.fc1.weight, backbone.blocks.7.mlp.fc1.bias, backbone.blocks.7.mlp.fc2.weight, backbone.blocks.7.mlp.fc2.bias, backbone.blocks.8.norm1.weight, backbone.blocks.8.norm1.bias, backbone.blocks.8.attn.qkv.weight, backbone.blocks.8.attn.qkv.bias, backbone.blocks.8.attn.proj.weight, backbone.blocks.8.attn.proj.bias, backbone.blocks.8.norm2.weight, backbone.blocks.8.norm2.bias, 
backbone.blocks.8.mlp.fc1.weight, backbone.blocks.8.mlp.fc1.bias, backbone.blocks.8.mlp.fc2.weight, backbone.blocks.8.mlp.fc2.bias, backbone.blocks.9.norm1.weight, backbone.blocks.9.norm1.bias, backbone.blocks.9.attn.qkv.weight, backbone.blocks.9.attn.qkv.bias, backbone.blocks.9.attn.proj.weight, backbone.blocks.9.attn.proj.bias, backbone.blocks.9.norm2.weight, backbone.blocks.9.norm2.bias, backbone.blocks.9.mlp.fc1.weight, backbone.blocks.9.mlp.fc1.bias, backbone.blocks.9.mlp.fc2.weight, backbone.blocks.9.mlp.fc2.bias, backbone.blocks.10.norm1.weight, backbone.blocks.10.norm1.bias, backbone.blocks.10.attn.qkv.weight, backbone.blocks.10.attn.qkv.bias, backbone.blocks.10.attn.proj.weight, backbone.blocks.10.attn.proj.bias, backbone.blocks.10.norm2.weight, backbone.blocks.10.norm2.bias, backbone.blocks.10.mlp.fc1.weight, backbone.blocks.10.mlp.fc1.bias, backbone.blocks.10.mlp.fc2.weight, backbone.blocks.10.mlp.fc2.bias, backbone.blocks.11.norm1.weight, backbone.blocks.11.norm1.bias, backbone.blocks.11.attn.qkv.weight, backbone.blocks.11.attn.qkv.bias, backbone.blocks.11.attn.proj.weight, backbone.blocks.11.attn.proj.bias, backbone.blocks.11.norm2.weight, backbone.blocks.11.norm2.bias, backbone.blocks.11.mlp.fc1.weight, backbone.blocks.11.mlp.fc1.bias, backbone.blocks.11.mlp.fc2.weight, backbone.blocks.11.mlp.fc2.bias, backbone.blocks.12.norm1.weight, backbone.blocks.12.norm1.bias, backbone.blocks.12.attn.qkv.weight, backbone.blocks.12.attn.qkv.bias, backbone.blocks.12.attn.proj.weight, backbone.blocks.12.attn.proj.bias, backbone.blocks.12.norm2.weight, backbone.blocks.12.norm2.bias, backbone.blocks.12.mlp.fc1.weight, backbone.blocks.12.mlp.fc1.bias, backbone.blocks.12.mlp.fc2.weight, backbone.blocks.12.mlp.fc2.bias, backbone.blocks.13.norm1.weight, backbone.blocks.13.norm1.bias, backbone.blocks.13.attn.qkv.weight, backbone.blocks.13.attn.qkv.bias, backbone.blocks.13.attn.proj.weight, backbone.blocks.13.attn.proj.bias, backbone.blocks.13.norm2.weight, backbone.blocks.13.norm2.bias, backbone.blocks.13.mlp.fc1.weight, backbone.blocks.13.mlp.fc1.bias, backbone.blocks.13.mlp.fc2.weight, backbone.blocks.13.mlp.fc2.bias, backbone.blocks.14.norm1.weight, backbone.blocks.14.norm1.bias, backbone.blocks.14.attn.qkv.weight, backbone.blocks.14.attn.qkv.bias, backbone.blocks.14.attn.proj.weight, backbone.blocks.14.attn.proj.bias, backbone.blocks.14.norm2.weight, backbone.blocks.14.norm2.bias, backbone.blocks.14.mlp.fc1.weight, backbone.blocks.14.mlp.fc1.bias, backbone.blocks.14.mlp.fc2.weight, backbone.blocks.14.mlp.fc2.bias, backbone.blocks.15.norm1.weight, backbone.blocks.15.norm1.bias, backbone.blocks.15.attn.qkv.weight, backbone.blocks.15.attn.qkv.bias, backbone.blocks.15.attn.proj.weight, backbone.blocks.15.attn.proj.bias, backbone.blocks.15.norm2.weight, backbone.blocks.15.norm2.bias, backbone.blocks.15.mlp.fc1.weight, backbone.blocks.15.mlp.fc1.bias, backbone.blocks.15.mlp.fc2.weight, backbone.blocks.15.mlp.fc2.bias, backbone.blocks.16.norm1.weight, backbone.blocks.16.norm1.bias, backbone.blocks.16.attn.qkv.weight, backbone.blocks.16.attn.qkv.bias, backbone.blocks.16.attn.proj.weight, backbone.blocks.16.attn.proj.bias, backbone.blocks.16.norm2.weight, backbone.blocks.16.norm2.bias, backbone.blocks.16.mlp.fc1.weight, backbone.blocks.16.mlp.fc1.bias, backbone.blocks.16.mlp.fc2.weight, backbone.blocks.16.mlp.fc2.bias, backbone.blocks.17.norm1.weight, backbone.blocks.17.norm1.bias, backbone.blocks.17.attn.qkv.weight, backbone.blocks.17.attn.qkv.bias, backbone.blocks.17.attn.proj.weight, 
backbone.blocks.17.attn.proj.bias, backbone.blocks.17.norm2.weight, backbone.blocks.17.norm2.bias, backbone.blocks.17.mlp.fc1.weight, backbone.blocks.17.mlp.fc1.bias, backbone.blocks.17.mlp.fc2.weight, backbone.blocks.17.mlp.fc2.bias, backbone.blocks.18.norm1.weight, backbone.blocks.18.norm1.bias, backbone.blocks.18.attn.qkv.weight, backbone.blocks.18.attn.qkv.bias, backbone.blocks.18.attn.proj.weight, backbone.blocks.18.attn.proj.bias, backbone.blocks.18.norm2.weight, backbone.blocks.18.norm2.bias, backbone.blocks.18.mlp.fc1.weight, backbone.blocks.18.mlp.fc1.bias, backbone.blocks.18.mlp.fc2.weight, backbone.blocks.18.mlp.fc2.bias, backbone.blocks.19.norm1.weight, backbone.blocks.19.norm1.bias, backbone.blocks.19.attn.qkv.weight, backbone.blocks.19.attn.qkv.bias, backbone.blocks.19.attn.proj.weight, backbone.blocks.19.attn.proj.bias, backbone.blocks.19.norm2.weight, backbone.blocks.19.norm2.bias, backbone.blocks.19.mlp.fc1.weight, backbone.blocks.19.mlp.fc1.bias, backbone.blocks.19.mlp.fc2.weight, backbone.blocks.19.mlp.fc2.bias, backbone.blocks.20.norm1.weight, backbone.blocks.20.norm1.bias, backbone.blocks.20.attn.qkv.weight, backbone.blocks.20.attn.qkv.bias, backbone.blocks.20.attn.proj.weight, backbone.blocks.20.attn.proj.bias, backbone.blocks.20.norm2.weight, backbone.blocks.20.norm2.bias, backbone.blocks.20.mlp.fc1.weight, backbone.blocks.20.mlp.fc1.bias, backbone.blocks.20.mlp.fc2.weight, backbone.blocks.20.mlp.fc2.bias, backbone.blocks.21.norm1.weight, backbone.blocks.21.norm1.bias, backbone.blocks.21.attn.qkv.weight, backbone.blocks.21.attn.qkv.bias, backbone.blocks.21.attn.proj.weight, backbone.blocks.21.attn.proj.bias, backbone.blocks.21.norm2.weight, backbone.blocks.21.norm2.bias, backbone.blocks.21.mlp.fc1.weight, backbone.blocks.21.mlp.fc1.bias, backbone.blocks.21.mlp.fc2.weight, backbone.blocks.21.mlp.fc2.bias, backbone.blocks.22.norm1.weight, backbone.blocks.22.norm1.bias, backbone.blocks.22.attn.qkv.weight, backbone.blocks.22.attn.qkv.bias, backbone.blocks.22.attn.proj.weight, backbone.blocks.22.attn.proj.bias, backbone.blocks.22.norm2.weight, backbone.blocks.22.norm2.bias, backbone.blocks.22.mlp.fc1.weight, backbone.blocks.22.mlp.fc1.bias, backbone.blocks.22.mlp.fc2.weight, backbone.blocks.22.mlp.fc2.bias, backbone.blocks.23.norm1.weight, backbone.blocks.23.norm1.bias, backbone.blocks.23.attn.qkv.weight, backbone.blocks.23.attn.qkv.bias, backbone.blocks.23.attn.proj.weight, backbone.blocks.23.attn.proj.bias, backbone.blocks.23.norm2.weight, backbone.blocks.23.norm2.bias, backbone.blocks.23.mlp.fc1.weight, backbone.blocks.23.mlp.fc1.bias, backbone.blocks.23.mlp.fc2.weight, backbone.blocks.23.mlp.fc2.bias, backbone.last_norm.weight, backbone.last_norm.bias, keypoint_head.deconv_layers.0.weight, keypoint_head.deconv_layers.1.weight, keypoint_head.deconv_layers.1.bias, keypoint_head.deconv_layers.1.running_mean, keypoint_head.deconv_layers.1.running_var, keypoint_head.deconv_layers.1.num_batches_tracked, keypoint_head.deconv_layers.3.weight, keypoint_head.deconv_layers.4.weight, keypoint_head.deconv_layers.4.bias, keypoint_head.deconv_layers.4.running_mean, keypoint_head.deconv_layers.4.running_var, keypoint_head.deconv_layers.4.num_batches_tracked, keypoint_head.final_layer.weight, keypoint_head.final_layer.bias

missing keys in source state_dict: pos_embed, patch_embed.proj.weight, patch_embed.proj.bias, blocks.0.norm1.weight, blocks.0.norm1.bias, blocks.0.attn.qkv.weight, blocks.0.attn.qkv.bias, blocks.0.attn.proj.weight, blocks.0.attn.proj.bias, blocks.0.norm2.weight, blocks.0.norm2.bias, blocks.0.mlp.fc1.weight, blocks.0.mlp.fc1.bias, blocks.0.mlp.fc2.weight, blocks.0.mlp.fc2.bias, blocks.1.norm1.weight, blocks.1.norm1.bias, blocks.1.attn.qkv.weight, blocks.1.attn.qkv.bias, blocks.1.attn.proj.weight, blocks.1.attn.proj.bias, blocks.1.norm2.weight, blocks.1.norm2.bias, blocks.1.mlp.fc1.weight, blocks.1.mlp.fc1.bias, blocks.1.mlp.fc2.weight, blocks.1.mlp.fc2.bias, blocks.2.norm1.weight, blocks.2.norm1.bias, blocks.2.attn.qkv.weight, blocks.2.attn.qkv.bias, blocks.2.attn.proj.weight, blocks.2.attn.proj.bias, blocks.2.norm2.weight, blocks.2.norm2.bias, blocks.2.mlp.fc1.weight, blocks.2.mlp.fc1.bias, blocks.2.mlp.fc2.weight, blocks.2.mlp.fc2.bias, blocks.3.norm1.weight, blocks.3.norm1.bias, blocks.3.attn.qkv.weight, blocks.3.attn.qkv.bias, blocks.3.attn.proj.weight, blocks.3.attn.proj.bias, blocks.3.norm2.weight, blocks.3.norm2.bias, blocks.3.mlp.fc1.weight, blocks.3.mlp.fc1.bias, blocks.3.mlp.fc2.weight, blocks.3.mlp.fc2.bias, blocks.4.norm1.weight, blocks.4.norm1.bias, blocks.4.attn.qkv.weight, blocks.4.attn.qkv.bias, blocks.4.attn.proj.weight, blocks.4.attn.proj.bias, blocks.4.norm2.weight, blocks.4.norm2.bias, blocks.4.mlp.fc1.weight, blocks.4.mlp.fc1.bias, blocks.4.mlp.fc2.weight, blocks.4.mlp.fc2.bias, blocks.5.norm1.weight, blocks.5.norm1.bias, blocks.5.attn.qkv.weight, blocks.5.attn.qkv.bias, blocks.5.attn.proj.weight, blocks.5.attn.proj.bias, blocks.5.norm2.weight, blocks.5.norm2.bias, blocks.5.mlp.fc1.weight, blocks.5.mlp.fc1.bias, blocks.5.mlp.fc2.weight, blocks.5.mlp.fc2.bias, blocks.6.norm1.weight, blocks.6.norm1.bias, blocks.6.attn.qkv.weight, blocks.6.attn.qkv.bias, blocks.6.attn.proj.weight, blocks.6.attn.proj.bias, blocks.6.norm2.weight, blocks.6.norm2.bias, blocks.6.mlp.fc1.weight, blocks.6.mlp.fc1.bias, blocks.6.mlp.fc2.weight, blocks.6.mlp.fc2.bias, blocks.7.norm1.weight, blocks.7.norm1.bias, blocks.7.attn.qkv.weight, blocks.7.attn.qkv.bias, blocks.7.attn.proj.weight, blocks.7.attn.proj.bias, blocks.7.norm2.weight, blocks.7.norm2.bias, blocks.7.mlp.fc1.weight, blocks.7.mlp.fc1.bias, blocks.7.mlp.fc2.weight, blocks.7.mlp.fc2.bias, blocks.8.norm1.weight, blocks.8.norm1.bias, blocks.8.attn.qkv.weight, blocks.8.attn.qkv.bias, blocks.8.attn.proj.weight, blocks.8.attn.proj.bias, blocks.8.norm2.weight, blocks.8.norm2.bias, blocks.8.mlp.fc1.weight, blocks.8.mlp.fc1.bias, blocks.8.mlp.fc2.weight, blocks.8.mlp.fc2.bias, blocks.9.norm1.weight, blocks.9.norm1.bias, blocks.9.attn.qkv.weight, blocks.9.attn.qkv.bias, blocks.9.attn.proj.weight, blocks.9.attn.proj.bias, blocks.9.norm2.weight, blocks.9.norm2.bias, blocks.9.mlp.fc1.weight, blocks.9.mlp.fc1.bias, blocks.9.mlp.fc2.weight, blocks.9.mlp.fc2.bias, blocks.10.norm1.weight, blocks.10.norm1.bias, blocks.10.attn.qkv.weight, blocks.10.attn.qkv.bias, blocks.10.attn.proj.weight, blocks.10.attn.proj.bias, blocks.10.norm2.weight, blocks.10.norm2.bias, blocks.10.mlp.fc1.weight, blocks.10.mlp.fc1.bias, blocks.10.mlp.fc2.weight, blocks.10.mlp.fc2.bias, blocks.11.norm1.weight, blocks.11.norm1.bias, blocks.11.attn.qkv.weight, blocks.11.attn.qkv.bias, blocks.11.attn.proj.weight, blocks.11.attn.proj.bias, blocks.11.norm2.weight, blocks.11.norm2.bias, blocks.11.mlp.fc1.weight, blocks.11.mlp.fc1.bias, blocks.11.mlp.fc2.weight, blocks.11.mlp.fc2.bias, 
blocks.12.norm1.weight, blocks.12.norm1.bias, blocks.12.attn.qkv.weight, blocks.12.attn.qkv.bias, blocks.12.attn.proj.weight, blocks.12.attn.proj.bias, blocks.12.norm2.weight, blocks.12.norm2.bias, blocks.12.mlp.fc1.weight, blocks.12.mlp.fc1.bias, blocks.12.mlp.fc2.weight, blocks.12.mlp.fc2.bias, blocks.13.norm1.weight, blocks.13.norm1.bias, blocks.13.attn.qkv.weight, blocks.13.attn.qkv.bias, blocks.13.attn.proj.weight, blocks.13.attn.proj.bias, blocks.13.norm2.weight, blocks.13.norm2.bias, blocks.13.mlp.fc1.weight, blocks.13.mlp.fc1.bias, blocks.13.mlp.fc2.weight, blocks.13.mlp.fc2.bias, blocks.14.norm1.weight, blocks.14.norm1.bias, blocks.14.attn.qkv.weight, blocks.14.attn.qkv.bias, blocks.14.attn.proj.weight, blocks.14.attn.proj.bias, blocks.14.norm2.weight, blocks.14.norm2.bias, blocks.14.mlp.fc1.weight, blocks.14.mlp.fc1.bias, blocks.14.mlp.fc2.weight, blocks.14.mlp.fc2.bias, blocks.15.norm1.weight, blocks.15.norm1.bias, blocks.15.attn.qkv.weight, blocks.15.attn.qkv.bias, blocks.15.attn.proj.weight, blocks.15.attn.proj.bias, blocks.15.norm2.weight, blocks.15.norm2.bias, blocks.15.mlp.fc1.weight, blocks.15.mlp.fc1.bias, blocks.15.mlp.fc2.weight, blocks.15.mlp.fc2.bias, blocks.16.norm1.weight, blocks.16.norm1.bias, blocks.16.attn.qkv.weight, blocks.16.attn.qkv.bias, blocks.16.attn.proj.weight, blocks.16.attn.proj.bias, blocks.16.norm2.weight, blocks.16.norm2.bias, blocks.16.mlp.fc1.weight, blocks.16.mlp.fc1.bias, blocks.16.mlp.fc2.weight, blocks.16.mlp.fc2.bias, blocks.17.norm1.weight, blocks.17.norm1.bias, blocks.17.attn.qkv.weight, blocks.17.attn.qkv.bias, blocks.17.attn.proj.weight, blocks.17.attn.proj.bias, blocks.17.norm2.weight, blocks.17.norm2.bias, blocks.17.mlp.fc1.weight, blocks.17.mlp.fc1.bias, blocks.17.mlp.fc2.weight, blocks.17.mlp.fc2.bias, blocks.18.norm1.weight, blocks.18.norm1.bias, blocks.18.attn.qkv.weight, blocks.18.attn.qkv.bias, blocks.18.attn.proj.weight, blocks.18.attn.proj.bias, blocks.18.norm2.weight, blocks.18.norm2.bias, blocks.18.mlp.fc1.weight, blocks.18.mlp.fc1.bias, blocks.18.mlp.fc2.weight, blocks.18.mlp.fc2.bias, blocks.19.norm1.weight, blocks.19.norm1.bias, blocks.19.attn.qkv.weight, blocks.19.attn.qkv.bias, blocks.19.attn.proj.weight, blocks.19.attn.proj.bias, blocks.19.norm2.weight, blocks.19.norm2.bias, blocks.19.mlp.fc1.weight, blocks.19.mlp.fc1.bias, blocks.19.mlp.fc2.weight, blocks.19.mlp.fc2.bias, blocks.20.norm1.weight, blocks.20.norm1.bias, blocks.20.attn.qkv.weight, blocks.20.attn.qkv.bias, blocks.20.attn.proj.weight, blocks.20.attn.proj.bias, blocks.20.norm2.weight, blocks.20.norm2.bias, blocks.20.mlp.fc1.weight, blocks.20.mlp.fc1.bias, blocks.20.mlp.fc2.weight, blocks.20.mlp.fc2.bias, blocks.21.norm1.weight, blocks.21.norm1.bias, blocks.21.attn.qkv.weight, blocks.21.attn.qkv.bias, blocks.21.attn.proj.weight, blocks.21.attn.proj.bias, blocks.21.norm2.weight, blocks.21.norm2.bias, blocks.21.mlp.fc1.weight, blocks.21.mlp.fc1.bias, blocks.21.mlp.fc2.weight, blocks.21.mlp.fc2.bias, blocks.22.norm1.weight, blocks.22.norm1.bias, blocks.22.attn.qkv.weight, blocks.22.attn.qkv.bias, blocks.22.attn.proj.weight, blocks.22.attn.proj.bias, blocks.22.norm2.weight, blocks.22.norm2.bias, blocks.22.mlp.fc1.weight, blocks.22.mlp.fc1.bias, blocks.22.mlp.fc2.weight, blocks.22.mlp.fc2.bias, blocks.23.norm1.weight, blocks.23.norm1.bias, blocks.23.attn.qkv.weight, blocks.23.attn.qkv.bias, blocks.23.attn.proj.weight, blocks.23.attn.proj.bias, blocks.23.norm2.weight, blocks.23.norm2.bias, blocks.23.mlp.fc1.weight, blocks.23.mlp.fc1.bias, blocks.23.mlp.fc2.weight, 
blocks.23.mlp.fc2.bias, last_norm.weight, last_norm.bias

Bottom up vs top down model

Hi, can someone explain how the bottom-up ViTPose model works? Can you give an example with ViTPose-B? I am interested in the smallest, fastest single-person pose model that still preserves decent accuracy on COCO. Would that be ViTPose-B in a bottom-up or top-down manner?

How to do inference on video by scripts?

Hey! I used the web demo from #20 to run inference on a video file, but it's super slow! I'm wondering if there are any scripts to do this?

Another question: let's say I have an input image of size 1080x640x3 that contains 10 people. The detector detects all of them, so after cropping and resizing, the actual data flowing into ViTPose is 10x3x256x192. And the speed reported in #4 (900 fps) is measured on each 256x192x3 crop. Am I correct?

Thanks in advance!

Training on test images when using CrowdPose?

Dear authors, thanks for the exciting work and I'd like to apologize in advance if I misunderstood.

As you may already know, CrowdPose dataset itself is constituted by cherry-picked crowd samples selected from MSCOCO, MPII and AIC, but CrowdPose did not specify if they treated train/val/test images from MSCOCO/MPII/AIC differently. They also re-annotated (presumably more accurately) these samples.

What we have noticed is that many of the test images in "MS COCO val set" are also present in the "CrowdPose train" and "CrowdPose train/val" splits. Although CrowdPose has renamed all their images, we have identified at least 181 images in "CrowdPose train/val" having the same md5 as images in "MS COCO val set".

For example, "108951.jpg" in "CrowdPose train" and "000000147740.jpg" in "MS COCO val set" are the same image with md5: f9fc120dc085166b30c08da3de333b69

We did not identify any image overlap between CrowdPose and MPII/AIC at the md5 level for either train or test images, possibly because CrowdPose did some preprocessing on the selected MPII/AIC images; but based on the finding for COCO, the possibility of such train-test overlap with MPII/AIC is notable. We have not yet checked whether "CrowdPose test" images are also present in the "COCO train set".

So if I did not miss anything, the model jointly trained on COCO+AIC+MPII+CrowdPose would have seen many of the test images (with labels, at least for COCO) during the training process, making the results untrustworthy.

Config question

Can you explain what nms_thr and oks_thr mean and what they do?
Thank you very much!
nms_thr=1.0

KeyError: 'ViT is not in the models registry'

I am trying to run top_down_video_demo_with_mmdet.py with the command:

python demo/top_down_video_demo_with_mmdet.py \
demo/mmdetection_cfg/yolov3_d53_320_273e_coco.py  \
https://download.openmmlab.com/mmdetection/v2.0/yolo/yolov3_d53_320_273e_coco/yolov3_d53_320_273e_coco-421362b6.pth \
configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_huge_coco_256x192.py \
../pretrained/ViTPose-H.pth \
--video-path ../UCF_Videos/Fighting/Fighting018_x264.mp4 \
--out-video-root ../output/test1

However, I am getting the following error:

Traceback (most recent call last):
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/models/detectors/top_down.py", line 48, in __init__
    self.backbone = builder.build_backbone(backbone)
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/models/builder.py", line 19, in build_backbone
    return BACKBONES.build(cfg)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 237, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 61, in build_from_cfg
    raise KeyError(
KeyError: 'ViT is not in the models registry'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/s2435462/HRC/ViTPose/demo/top_down_video_demo_with_mmdet.py", line 165, in <module>
    main()
  File "/home/s2435462/HRC/ViTPose/demo/top_down_video_demo_with_mmdet.py", line 76, in main
    pose_model = init_pose_model(
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/apis/inference.py", line 43, in init_pose_model
    model = build_posenet(config.model)
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/models/builder.py", line 39, in build_posenet
    return POSENETS.build(cfg)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 237, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: "TopDown: 'ViT is not in the models registry'"

I installed everything with these commands:

conda create -n open-mmlab python=3.9 -y
conda activate open-mmlab

conda install pytorch torchvision cudatoolkit=11.3 -c pytorch

git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .

pip install mmcv-full
pip install mmdet

rm -rf mmpose
git clone https://github.com/open-mmlab/mmpose.git
cd mmpose
pip install -r requirements.txt
pip install -e .

Can someone guide me on how to solve this?

About model size

Hi, I used the pre-trained model you provided for fine-tuning. Performance and speed are competitive, but the size of my model is about three times larger than yours. For example, my vitpose-b is about 1.x GB, while yours is 343 MB. How can I get a model of the same size?

How to load pre-trained model?

Hi, when loading the pre-trained model I used the "--resume-from" argument followed by a pre-trained model path, and I got an error message like this:

2022-06-10 10:51:38,626 - mmpose - INFO - load checkpoint from local path: models/epoch_1.pth
Traceback (most recent call last):
  File "/home/pose/codes/ViTPose/tools/train.py", line 195, in <module>
    main()
  File "/home/pose/codes/ViTPose/tools/train.py", line 184, in main
    train_model(
  File "/home/pose/codes/ViTPose/mmpose/apis/train.py", line 197, in train_model
    runner.resume(cfg.resume_from)
  File "/home/pose/codes/ViTPose/ViT_venv/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 364, in resume
    self._iter = checkpoint['meta']['iter']
KeyError: 'iter'

So what's the right way to load a pre-trained model? Thank you for your patience and time!

Will this model work with unseen data?

Will this model work with unseen data (in-the-wild pose estimation), or does it require further training beyond the COCO/AIC/MPII/CrowdPose datasets?

ONNX version of the model

Hi, thank you for the great work.

I am really impressed with your work. By any chance, could you release an ONNX version of the ViTPose model?
I tried it with ViTPose-B* but failed many times.

AttributeError: 'ConfigDict' object has no attribute 'data'

When I try to run the code below in a notebook ->

!bash tools/dist_test.sh /content/ViTPose/configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.yml /content/mask_rcnn_swin_tiny_patch4_window7_1x.pth 1

as you have mentioned in README.md ->

bash tools/dist_test.sh <Config PATH> <Checkpoint PATH> <NUM GPUs>

I get the error below ->

apex is not installed
apex is not installed
apex is not installed
/usr/local/lib/python3.7/dist-packages/mmcv/cnn/bricks/transformer.py:33: UserWarning: Fail to import MultiScaleDeformableAttention from mmcv.ops.multi_scale_deform_attn, You should install mmcv-full if you need this module.
warnings.warn('Fail to import MultiScaleDeformableAttention from '
Traceback (most recent call last):
  File "tools/test.py", line 184, in <module>
    main()
  File "tools/test.py", line 96, in main
    setup_multi_processes(cfg)
  File "/content/ViTPose/mmpose/utils/setup_env.py", line 30, in setup_multi_processes
    if 'OMP_NUM_THREADS' not in os.environ and cfg.data.workers_per_gpu > 1:
  File "/usr/local/lib/python3.7/dist-packages/mmcv/utils/config.py", line 513, in __getattr__
    return getattr(self._cfg_dict, name)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/utils/config.py", line 49, in __getattr__
    raise ex
AttributeError: 'ConfigDict' object has no attribute 'data'
Killing subprocess 743
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/test.py', '--local_rank=0', '/content/ViTPose/configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.yml', '/content/mask_rcnn_swin_tiny_patch4_window7_1x.pth', '--launcher', 'pytorch']' returned non-zero exit status 1.

About multitask log

Thanks for the great work! It helps me a lot in my own study. Could you please release the training log of the multi-task training? A .log file may be better than a .json file. Thanks again.

Keypoints absent from model output?

I've been trying to use the demo scripts but keep getting the following error:

Traceback (most recent call last):
  File "demo/top_down_video_demo_with_mmdet.py", line 165, in <module>
    main()
  File "demo/top_down_video_demo_with_mmdet.py", line 125, in main
    pose_results, returned_outputs = inference_top_down_pose_model(
  File "/home/nshah/work/packages/vitpose/mmpose/apis/inference.py", line 415, in inference_top_down_pose_model
    poses, heatmap = _inference_single_pose_model(
  File "/home/nshah/work/packages/vitpose/mmpose/apis/inference.py", line 307, in _inference_single_pose_model
    return result['preds'], result['output_heatmap']
KeyError: 'preds'

The model seemingly outputs only the heatmap and not the actual keypoint predictions. However, I noticed in some of the closed issues that people were able to get some of the demo scripts to work. I'm just wondering whether I'm missing something very obvious.

I'm using this config which does appear to have a keypoint head.

Top-down or bottom-up?

Hey! I was reading the paper; impressive stuff.

I was uncertain about what you actually predict, however. Do you first crop the humans and then do keypoint estimation (top-down, I guess)? Or do you predict all humans at once (bottom-up) and then predict a part-affinity map (or the like) along with the keypoints?

If it's the latter, what exactly does the model output?

Thank you in advance 🙏

Running the project

Hi, can anyone summarize the installation setup and the quick-start process (for instance, using the demo and running inference)? The instructions in the README.md are confusing for beginners. Thank you!

Model in video demo

Hello, I was wondering which model is used in the HuggingFace web demo for video? I would like to test it using scripts. Are weights provided for that particular model? Thanks

Running video demo

Hello,

I tried to run the video demo using mmdet:

python demo/top_down_pose_tracking_demo_with_mmdet.py ./demo/mmdetection_cfg/faster_rcnn_r50_fpn_coco.py ./faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py ./vitpose-b.pt --video-path ./test.MOV --out-video-root ./output_video/
but I get errors due to version incompatibilities between mmcv, mmdet, and the current ViTPose (or mmpose) version.

So here is what I do: I install mmcv from source (version 1.3.9, as recommended in the README of this repo) and mmdet from source as well (I tried the latest mmdet, mmdet==2.14.0 as recommended in mminstall.txt for mmpose 0.24.0: ['mmcv-full>=1.3.8', 'mmdet>=2.14.0', 'mmtrack>=0.6.0'], and mmdet==2.23.0).

Here is what I get with the following versions, for example (pip list):
mmcv 1.3.9
mmdet 2.14.0
mmpose 0.24.0

Note that I use this:
torch 1.11.0+cu113
torchvision 0.12.0+cu113

I got this error:

/home/ubuntu/venv/lib/python3.8/site-packages/mmcv/cnn/bricks/transformer.py:27: UserWarning: Fail to import ``MultiScaleDeformableAttention`` from ``mmcv.ops.multi_scale_deform_attn``, You should install ``mmcv-full`` if you need this module.
  warnings.warn('Fail to import ``MultiScaleDeformableAttention`` from '
Traceback (most recent call last):
  File "demo/top_down_pose_tracking_demo_with_mmdet.py", line 190, in <module>
    main()
  File "demo/top_down_pose_tracking_demo_with_mmdet.py", line 74, in main
    assert has_mmdet, 'Please install mmdet to run the demo.'
AssertionError: Please install mmdet to run the demo.

When I switch mmdet to 2.23.0, I get this error:

AssertionError: MMCV==1.3.9 is used but incompatible. Please install mmcv>=1.3.17, <=1.6.0.

I tried installing mmcv>=1.3.17, but that did not resolve the problem!

Can you please tell us which versions (of mmcv and mmdet) are recommended to run ViTPose on videos?

Inference speed

Thank you for the nice work! May I know if you have done any analysis or comparison of the model's inference speed?

Video demo with ViTPose-B

Hello, how do I run the video demo with ViTPose? Are the demo scripts using any ViTPose models? The demo page says: "Using mmdet for human bounding box detection. We provide a demo script to run mmdet for human detection, and mmpose for pose estimation." How do I use ViTPose for video pose estimation?

Demo code

Hi, I am very interested in your excellent work and would like to ask where I could find the code for the web demo? By the way, where can I get the quantitative intermediate outputs of this app (https://huggingface.co/spaces/Gradio-Blocks/ViTPose), such as detection boxes and keypoints? Looking forward to your reply!

Speed of Detection

Hi, I have only managed to get around 5 fps for the top-down 2D pose estimation model on a GTX 1660 GPU via demo/webcam.py with video testing. How can I speed up inference when using synchronous mode? Thank you!! :)

Use ViTPose with Jetson AGX Orin

Hi, thanks for the great work you have done on pose estimation. I used the deployment script pytorch2onnx.py to convert the model to ONNX and then used trtexec to convert it to an engine file, but the output heatmap is different when using TensorRT inference.

What is the full-window attention structure?

It is mentioned in the paper that the full-window attention structure is used to reduce the memory load, but I did not find an introduction to the full-window attention structure. I would like to ask how this structure is realized.

About Inference speed

Are you sure that this method is faster than HRNet?
I have tried both with YOLOv5 as the detector in TensorRT inference.
HRNet achieves around 30-35 fps, while ViTPose can reach 7 fps on the same video with TensorRT.
Inference tests I have conducted show that HRNet is 6-7x faster when using larger batch sizes for some reason (around 220 fps per target for fp16 and 450 fps for int8), while ViTPose achieves around 60 fps per target in TensorRT.

Testing on CPU

How to test on CPU?

Setting the number of GPUs to 0 doesn't work.

bash tools/dist_test.sh configs/body/2d_kpt_sview_rgb_img/deeppose/coco/res101_coco_256x192.py  ../weights/mae_pretrain_vit_base.pth 0

Error:

FutureWarning,
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 225, in launch_agent
    master_port=master_port,
  File "<string>", line 15, in __init__
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 87, in __post_init__
    assert self.local_world_size > 0
AssertionError

Model mismatch

Hi, I encountered a mismatch issue when training ViT-Base from the pre-trained MAE weights.
'The model and loaded state dict do not match exactly
unexpected key in source state_dict: cls_token, norm.weight, norm.bias
missing keys in source state_dict: last_norm.weight, last_norm.bias'

And
'fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git [...] -- [...]'
'

But the training did not stop. What actions should I take, other than simply loading from the pre-trained MAE weights?

About the code of the transformer block

Thank you for open-sourcing this great repo.
Hello, where is the code of the transformer block? I didn't find the corresponding code.

I would greatly appreciate it if you could spend some of your time on a reply.

Question about the file vitpose-l-simple.pth.

The model file vitpose-l-simple.pth that I downloaded cannot be loaded. I would like to confirm whether the problem is with my download or with the uploaded model itself?
[Screenshot 2022-09-02 013036]
And below is a screenshot of my error.
[image]
Looking forward to your reply!
