
The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"

License: Apache License 2.0

Python 99.77% Dockerfile 0.05% Shell 0.19%
deep-learning distillation mae pose-estimation pytorch self-supervised-learning vision-transformer

vitpose's Introduction

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation


Results | Updates | Usage | Todo | Acknowledge

This branch contains the PyTorch implementation of ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation and ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.

Web Demo

MAE Pre-trained model

  • The small-size MAE pre-trained model can be found in Onedrive.
  • The base, large, and huge MAE pre-trained models can be found in the MAE official repo.

Results from this repo on MS COCO val set (single-task training)

Using detection results from a detector that obtains 56 mAP on the person class. The configs here are used for both training and testing.
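
For reference, below is a minimal sketch of how such detector boxes are typically plugged into an mmpose-style top-down evaluation config. The field names follow the usual mmpose data_cfg convention, and the bbox_file name (the commonly distributed 56-AP person-detection file for COCO val2017) is an assumption here, not copied from this repo's configs.

# Hypothetical excerpt of a top-down COCO config (mmpose-style; names assumed):
data_cfg = dict(
    image_size=[192, 256],   # width x height, matching the 256x192 models above
    heatmap_size=[48, 64],
    use_gt_bbox=False,       # evaluate with detector boxes instead of ground-truth boxes
    det_bbox_thr=0.0,
    bbox_file='data/coco/person_detection_results/'
              'COCO_val2017_detections_AP_H_56_person.json',
)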

With classic decoder

Model Pretrain Resolution AP AR config log weight
ViTPose-S MAE 256x192 73.8 79.2 config log Onedrive
ViTPose-B MAE 256x192 75.8 81.1 config log Onedrive
ViTPose-L MAE 256x192 78.3 83.5 config log Onedrive
ViTPose-H MAE 256x192 79.1 84.1 config log Onedrive

With simple decoder

Model Pretrain Resolution AP AR config log weight
ViTPose-S MAE 256x192 73.5 78.9 config log Onedrive
ViTPose-B MAE 256x192 75.5 80.9 config log Onedrive
ViTPose-L MAE 256x192 78.2 83.4 config log Onedrive
ViTPose-H MAE 256x192 78.9 84.0 config log Onedrive

Results with multi-task training

Note: * There may be duplicate images between the CrowdPose training set and the validation images of other datasets, as discussed in issue #24. Please be careful when using these models for evaluation. We provide the results without the CrowdPose dataset for reference.

Human datasets (MS COCO, AIC, MPII, CrowdPose)

Results on MS COCO val set

Using detection results from a detector that obtains 56 mAP on the person class. Note the configs here are only for evaluation.

Model Dataset Resolution AP AR config weight
ViTPose-B COCO+AIC+MPII 256x192 77.1 82.2 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 78.7 83.8 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 79.5 84.5 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 81.0 85.6
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 77.5 82.6 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 79.1 84.1 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 79.8 84.8 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 75.8 82.6 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 77.0 82.6 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 78.6 84.1 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 79.4 84.8 config log | Onedrive

Results on OCHuman test set

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

Model Dataset Resolution AP AR config weight
ViTPose-B COCO+AIC+MPII 256x192 88.0 89.6 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 90.9 92.2 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 90.9 92.3 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 93.3 94.3
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 88.2 90.0 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 91.5 92.8 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 91.6 92.8 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 78.4 80.6 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 82.6 84.8 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 85.7 87.5 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 85.7 87.4 config log | Onedrive

Results on MPII val set

Using groundtruth bounding boxes. Note the configs here are only for evaluation. The metric is PCKh.

Model Dataset Resolution Mean config weight
ViTPose-B COCO+AIC+MPII 256x192 93.3 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 94.0 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 94.1 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 94.3
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 93.4 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 93.9 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 94.1 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 92.7 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 92.8 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 94.0 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 94.2 config log | Onedrive

Results on AI Challenger test set

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

Model Dataset Resolution AP AR config weight
ViTPose-B COCO+AIC+MPII 256x192 32.0 36.3 config Onedrive
ViTPose-L COCO+AIC+MPII 256x192 34.5 39.0 config Onedrive
ViTPose-H COCO+AIC+MPII 256x192 35.4 39.9 config Onedrive
ViTPose-G COCO+AIC+MPII 576x432 43.2 47.1
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 31.9 36.3 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 34.6 39.0 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 35.3 39.8 config Onedrive
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 29.7 34.3 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 31.8 36.3 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 34.3 38.9 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 34.8 39.1 config log | Onedrive

Results on CrowdPose test set

Using a YOLOv3 human detector. Note the configs here are only for evaluation.

Model Dataset Resolution AP AP(H) config weight
ViTPose-B* COCO+AIC+MPII+CrowdPose 256x192 74.7 63.3 config Onedrive
ViTPose-L* COCO+AIC+MPII+CrowdPose 256x192 76.6 65.9 config Onedrive
ViTPose-H* COCO+AIC+MPII+CrowdPose 256x192 76.3 65.6 config Onedrive

Animal datasets (AP10K, APT36K)

Results on AP-10K test set

Model Dataset Resolution AP config weight
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 71.4 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 74.5 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 80.4 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 82.4 config log | Onedrive

Results on APT-36K val set

Model Dataset Resolution AP config weight
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 74.2 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 75.9 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 80.8 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 82.3 config log | Onedrive

WholeBody dataset

Model Dataset Resolution AP config weight
ViTPose+-S COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 54.4 config log | Onedrive
ViTPose+-B COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 57.4 config log | Onedrive
ViTPose+-L COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 60.6 config log | Onedrive
ViTPose+-H COCO+AIC+MPII+AP10K+APT36K+WholeBody 256x192 61.2 config log | Onedrive

Transfer results on the hand dataset (InterHand2.6M)

Model Dataset Resolution AUC config weight
ViTPose+-S COCO+AIC+MPII+WholeBody 256x192 86.5 config Coming Soon
ViTPose+-B COCO+AIC+MPII+WholeBody 256x192 87.0 config Coming Soon
ViTPose+-L COCO+AIC+MPII+WholeBody 256x192 87.5 config Coming Soon
ViTPose+-H COCO+AIC+MPII+WholeBody 256x192 87.6 config Coming Soon

Updates

[2023-01-10] Update ViTPose+! It uses mixture-of-experts (MoE) strategies to jointly handle human, animal, and whole-body pose estimation tasks.

[2022-05-24] Upload the single-task training code, single-task pre-trained models, and multi-task pretrained models.

[2022-05-06] Upload the logs for the base, large, and huge models!

[2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on COCO test-dev set!

Applications of ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting | VSA | ViTDet

Usage

We use PyTorch 1.9.0 (or NGC Docker 21.06) and mmcv 1.3.9 for the experiments. Install mmcv and ViTPose as follows:

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .

After installing the two repos, install timm and einops:

pip install timm==0.4.9 einops
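
As a quick, optional sanity check that the environment is wired up (assuming the packages expose __version__ as usual), you can run:

python -c "import mmcv, mmpose, timm, einops; print(mmcv.__version__, mmpose.__version__, timm.__version__)"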

After downloading the pre-trained models, run the experiments with

# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH> --seed 0

# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch --seed 0
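
For example, a single-machine run that fine-tunes ViTPose-B on COCO with 8 GPUs might look like the following; the config path matches the one used elsewhere in this README, while the checkpoint location is illustrative and should point to wherever you saved the MAE-pretrained weights:

bash tools/dist_train.sh configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py 8 --cfg-options model.pretrained=/path/to/mae_pretrain_vit_base.pth --seed 0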

To test the performance of the pre-trained models, please run

bash tools/dist_test.sh <Config PATH> <Checkpoint PATH> <NUM GPUs>
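
For example, a hypothetical single-GPU evaluation of the downloaded ViTPose-B COCO weights (adjust the checkpoint path to wherever you stored the file):

bash tools/dist_test.sh configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py /path/to/vitpose-b.pth 1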

For ViTPose+ pre-trained models, please first re-organize the pre-trained weights using

python tools/model_split.py --source <Pretrained PATH>

Todo

This repo currently contains modifications including:

  • Upload configs and pretrained models

  • More models with SOTA results

  • Upload multi-task training config

Acknowledge

We acknowledge the excellent implementations of mmpose and MAE.

Citing ViTPose

For ViTPose

@inproceedings{
  xu2022vitpose,
  title={Vi{TP}ose: Simple Vision Transformer Baselines for Human Pose Estimation},
  author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
}

For ViTPose+

@article{xu2022vitpose+,
  title={ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation},
  author={Xu, Yufei and Zhang, Jing and Zhang, Qiming and Tao, Dacheng},
  journal={arXiv preprint arXiv:2212.04246},
  year={2022}
}

For ViTAE and ViTAEv2, please refer to:

@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}

vitpose's People

Contributors

ak391, annbless, seaman1900


vitpose's Issues

Training device?

Hi, I'd like to re-train this model on my own data; however, an out-of-memory error occurs even when samples_per_gpu is set to 1. I'm using an RTX 2080 Ti.

Where is the optimized attention block?

Thank you for open-sourcing this great repo.
Table 4 of the paper compares attention tricks such as window MSA and shifted-window MSA, but I can't find them in this repo.

Would you provide bottom-up-based pretrained weights of ViTPose?

Thanks for your research contribution and for publishing the code!

I will be using this model in a bottom-up keypoint estimation pipeline for my research.

Looking at the code, I found bottom-up inference code, but I could not find bottom-up pre-trained weights for ViTPose.

Would you provide bottom-up pre-trained weights?

The model and loaded state dict do not match exactly

Hi there,

First of all, thank you for reading this issue.

I am testing the following model and get the following error; it seems the config file does not match the pre-trained model. I am not sure what mistake I have made. Many thanks to anyone who could offer a hint.

Results from this repo on MS COCO val set (single-task training)

ViTPose-B | MAE | 256x192 | 75.8 | 81.1 | config | log | Onedrive (this is where I downloaded the .pth file)

I used the following command:
bash tools/dist_train.sh /home/zee/ViTPose/ViTPose/configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py 1 --cfg-options model.pretrained=/home/zee/ViTPose/vitpose-b.pth --seed 0

WARNING:root:The model and loaded state dict do not match exactly

unexpected key in source state_dict: backbone.pos_embed, backbone.patch_embed.proj.weight, backbone.patch_embed.proj.bias, backbone.blocks.0.norm1.weight, backbone.blocks.0.norm1.bias, backbone.blocks.0.attn.qkv.weight, backbone.blocks.0.attn.qkv.bias, backbone.blocks.0.attn.proj.weight, backbone.blocks.0.attn.proj.bias, backbone.blocks.0.norm2.weight, backbone.blocks.0.norm2.bias, backbone.blocks.0.mlp.fc1.weight, backbone.blocks.0.mlp.fc1.bias, backbone.blocks.0.mlp.fc2.weight, backbone.blocks.0.mlp.fc2.bias, backbone.blocks.1.norm1.weight, backbone.blocks.1.norm1.bias, backbone.blocks.1.attn.qkv.weight, backbone.blocks.1.attn.qkv.bias, backbone.blocks.1.attn.proj.weight, backbone.blocks.1.attn.proj.bias, backbone.blocks.1.norm2.weight, backbone.blocks.1.norm2.bias, backbone.blocks.1.mlp.fc1.weight, backbone.blocks.1.mlp.fc1.bias, backbone.blocks.1.mlp.fc2.weight, backbone.blocks.1.mlp.fc2.bias, backbone.blocks.2.norm1.weight, backbone.blocks.2.norm1.bias, backbone.blocks.2.attn.qkv.weight, backbone.blocks.2.attn.qkv.bias, backbone.blocks.2.attn.proj.weight, backbone.blocks.2.attn.proj.bias, backbone.blocks.2.norm2.weight, backbone.blocks.2.norm2.bias, backbone.blocks.2.mlp.fc1.weight, backbone.blocks.2.mlp.fc1.bias, backbone.blocks.2.mlp.fc2.weight, backbone.blocks.2.mlp.fc2.bias, backbone.blocks.3.norm1.weight, backbone.blocks.3.norm1.bias, backbone.blocks.3.attn.qkv.weight, backbone.blocks.3.attn.qkv.bias, backbone.blocks.3.attn.proj.weight, backbone.blocks.3.attn.proj.bias, backbone.blocks.3.norm2.weight, backbone.blocks.3.norm2.bias, backbone.blocks.3.mlp.fc1.weight, backbone.blocks.3.mlp.fc1.bias, backbone.blocks.3.mlp.fc2.weight, backbone.blocks.3.mlp.fc2.bias, backbone.blocks.4.norm1.weight, backbone.blocks.4.norm1.bias, backbone.blocks.4.attn.qkv.weight, backbone.blocks.4.attn.qkv.bias, backbone.blocks.4.attn.proj.weight, backbone.blocks.4.attn.proj.bias, backbone.blocks.4.norm2.weight, backbone.blocks.4.norm2.bias, backbone.blocks.4.mlp.fc1.weight, backbone.blocks.4.mlp.fc1.bias, backbone.blocks.4.mlp.fc2.weight, backbone.blocks.4.mlp.fc2.bias, backbone.blocks.5.norm1.weight, backbone.blocks.5.norm1.bias, backbone.blocks.5.attn.qkv.weight, backbone.blocks.5.attn.qkv.bias, backbone.blocks.5.attn.proj.weight, backbone.blocks.5.attn.proj.bias, backbone.blocks.5.norm2.weight, backbone.blocks.5.norm2.bias, backbone.blocks.5.mlp.fc1.weight, backbone.blocks.5.mlp.fc1.bias, backbone.blocks.5.mlp.fc2.weight, backbone.blocks.5.mlp.fc2.bias, backbone.blocks.6.norm1.weight, backbone.blocks.6.norm1.bias, backbone.blocks.6.attn.qkv.weight, backbone.blocks.6.attn.qkv.bias, backbone.blocks.6.attn.proj.weight, backbone.blocks.6.attn.proj.bias, backbone.blocks.6.norm2.weight, backbone.blocks.6.norm2.bias, backbone.blocks.6.mlp.fc1.weight, backbone.blocks.6.mlp.fc1.bias, backbone.blocks.6.mlp.fc2.weight, backbone.blocks.6.mlp.fc2.bias, backbone.blocks.7.norm1.weight, backbone.blocks.7.norm1.bias, backbone.blocks.7.attn.qkv.weight, backbone.blocks.7.attn.qkv.bias, backbone.blocks.7.attn.proj.weight, backbone.blocks.7.attn.proj.bias, backbone.blocks.7.norm2.weight, backbone.blocks.7.norm2.bias, backbone.blocks.7.mlp.fc1.weight, backbone.blocks.7.mlp.fc1.bias, backbone.blocks.7.mlp.fc2.weight, backbone.blocks.7.mlp.fc2.bias, backbone.blocks.8.norm1.weight, backbone.blocks.8.norm1.bias, backbone.blocks.8.attn.qkv.weight, backbone.blocks.8.attn.qkv.bias, backbone.blocks.8.attn.proj.weight, backbone.blocks.8.attn.proj.bias, backbone.blocks.8.norm2.weight, backbone.blocks.8.norm2.bias, 
backbone.blocks.8.mlp.fc1.weight, backbone.blocks.8.mlp.fc1.bias, backbone.blocks.8.mlp.fc2.weight, backbone.blocks.8.mlp.fc2.bias, backbone.blocks.9.norm1.weight, backbone.blocks.9.norm1.bias, backbone.blocks.9.attn.qkv.weight, backbone.blocks.9.attn.qkv.bias, backbone.blocks.9.attn.proj.weight, backbone.blocks.9.attn.proj.bias, backbone.blocks.9.norm2.weight, backbone.blocks.9.norm2.bias, backbone.blocks.9.mlp.fc1.weight, backbone.blocks.9.mlp.fc1.bias, backbone.blocks.9.mlp.fc2.weight, backbone.blocks.9.mlp.fc2.bias, backbone.blocks.10.norm1.weight, backbone.blocks.10.norm1.bias, backbone.blocks.10.attn.qkv.weight, backbone.blocks.10.attn.qkv.bias, backbone.blocks.10.attn.proj.weight, backbone.blocks.10.attn.proj.bias, backbone.blocks.10.norm2.weight, backbone.blocks.10.norm2.bias, backbone.blocks.10.mlp.fc1.weight, backbone.blocks.10.mlp.fc1.bias, backbone.blocks.10.mlp.fc2.weight, backbone.blocks.10.mlp.fc2.bias, backbone.blocks.11.norm1.weight, backbone.blocks.11.norm1.bias, backbone.blocks.11.attn.qkv.weight, backbone.blocks.11.attn.qkv.bias, backbone.blocks.11.attn.proj.weight, backbone.blocks.11.attn.proj.bias, backbone.blocks.11.norm2.weight, backbone.blocks.11.norm2.bias, backbone.blocks.11.mlp.fc1.weight, backbone.blocks.11.mlp.fc1.bias, backbone.blocks.11.mlp.fc2.weight, backbone.blocks.11.mlp.fc2.bias, backbone.blocks.12.norm1.weight, backbone.blocks.12.norm1.bias, backbone.blocks.12.attn.qkv.weight, backbone.blocks.12.attn.qkv.bias, backbone.blocks.12.attn.proj.weight, backbone.blocks.12.attn.proj.bias, backbone.blocks.12.norm2.weight, backbone.blocks.12.norm2.bias, backbone.blocks.12.mlp.fc1.weight, backbone.blocks.12.mlp.fc1.bias, backbone.blocks.12.mlp.fc2.weight, backbone.blocks.12.mlp.fc2.bias, backbone.blocks.13.norm1.weight, backbone.blocks.13.norm1.bias, backbone.blocks.13.attn.qkv.weight, backbone.blocks.13.attn.qkv.bias, backbone.blocks.13.attn.proj.weight, backbone.blocks.13.attn.proj.bias, backbone.blocks.13.norm2.weight, backbone.blocks.13.norm2.bias, backbone.blocks.13.mlp.fc1.weight, backbone.blocks.13.mlp.fc1.bias, backbone.blocks.13.mlp.fc2.weight, backbone.blocks.13.mlp.fc2.bias, backbone.blocks.14.norm1.weight, backbone.blocks.14.norm1.bias, backbone.blocks.14.attn.qkv.weight, backbone.blocks.14.attn.qkv.bias, backbone.blocks.14.attn.proj.weight, backbone.blocks.14.attn.proj.bias, backbone.blocks.14.norm2.weight, backbone.blocks.14.norm2.bias, backbone.blocks.14.mlp.fc1.weight, backbone.blocks.14.mlp.fc1.bias, backbone.blocks.14.mlp.fc2.weight, backbone.blocks.14.mlp.fc2.bias, backbone.blocks.15.norm1.weight, backbone.blocks.15.norm1.bias, backbone.blocks.15.attn.qkv.weight, backbone.blocks.15.attn.qkv.bias, backbone.blocks.15.attn.proj.weight, backbone.blocks.15.attn.proj.bias, backbone.blocks.15.norm2.weight, backbone.blocks.15.norm2.bias, backbone.blocks.15.mlp.fc1.weight, backbone.blocks.15.mlp.fc1.bias, backbone.blocks.15.mlp.fc2.weight, backbone.blocks.15.mlp.fc2.bias, backbone.blocks.16.norm1.weight, backbone.blocks.16.norm1.bias, backbone.blocks.16.attn.qkv.weight, backbone.blocks.16.attn.qkv.bias, backbone.blocks.16.attn.proj.weight, backbone.blocks.16.attn.proj.bias, backbone.blocks.16.norm2.weight, backbone.blocks.16.norm2.bias, backbone.blocks.16.mlp.fc1.weight, backbone.blocks.16.mlp.fc1.bias, backbone.blocks.16.mlp.fc2.weight, backbone.blocks.16.mlp.fc2.bias, backbone.blocks.17.norm1.weight, backbone.blocks.17.norm1.bias, backbone.blocks.17.attn.qkv.weight, backbone.blocks.17.attn.qkv.bias, backbone.blocks.17.attn.proj.weight, 
backbone.blocks.17.attn.proj.bias, backbone.blocks.17.norm2.weight, backbone.blocks.17.norm2.bias, backbone.blocks.17.mlp.fc1.weight, backbone.blocks.17.mlp.fc1.bias, backbone.blocks.17.mlp.fc2.weight, backbone.blocks.17.mlp.fc2.bias, backbone.blocks.18.norm1.weight, backbone.blocks.18.norm1.bias, backbone.blocks.18.attn.qkv.weight, backbone.blocks.18.attn.qkv.bias, backbone.blocks.18.attn.proj.weight, backbone.blocks.18.attn.proj.bias, backbone.blocks.18.norm2.weight, backbone.blocks.18.norm2.bias, backbone.blocks.18.mlp.fc1.weight, backbone.blocks.18.mlp.fc1.bias, backbone.blocks.18.mlp.fc2.weight, backbone.blocks.18.mlp.fc2.bias, backbone.blocks.19.norm1.weight, backbone.blocks.19.norm1.bias, backbone.blocks.19.attn.qkv.weight, backbone.blocks.19.attn.qkv.bias, backbone.blocks.19.attn.proj.weight, backbone.blocks.19.attn.proj.bias, backbone.blocks.19.norm2.weight, backbone.blocks.19.norm2.bias, backbone.blocks.19.mlp.fc1.weight, backbone.blocks.19.mlp.fc1.bias, backbone.blocks.19.mlp.fc2.weight, backbone.blocks.19.mlp.fc2.bias, backbone.blocks.20.norm1.weight, backbone.blocks.20.norm1.bias, backbone.blocks.20.attn.qkv.weight, backbone.blocks.20.attn.qkv.bias, backbone.blocks.20.attn.proj.weight, backbone.blocks.20.attn.proj.bias, backbone.blocks.20.norm2.weight, backbone.blocks.20.norm2.bias, backbone.blocks.20.mlp.fc1.weight, backbone.blocks.20.mlp.fc1.bias, backbone.blocks.20.mlp.fc2.weight, backbone.blocks.20.mlp.fc2.bias, backbone.blocks.21.norm1.weight, backbone.blocks.21.norm1.bias, backbone.blocks.21.attn.qkv.weight, backbone.blocks.21.attn.qkv.bias, backbone.blocks.21.attn.proj.weight, backbone.blocks.21.attn.proj.bias, backbone.blocks.21.norm2.weight, backbone.blocks.21.norm2.bias, backbone.blocks.21.mlp.fc1.weight, backbone.blocks.21.mlp.fc1.bias, backbone.blocks.21.mlp.fc2.weight, backbone.blocks.21.mlp.fc2.bias, backbone.blocks.22.norm1.weight, backbone.blocks.22.norm1.bias, backbone.blocks.22.attn.qkv.weight, backbone.blocks.22.attn.qkv.bias, backbone.blocks.22.attn.proj.weight, backbone.blocks.22.attn.proj.bias, backbone.blocks.22.norm2.weight, backbone.blocks.22.norm2.bias, backbone.blocks.22.mlp.fc1.weight, backbone.blocks.22.mlp.fc1.bias, backbone.blocks.22.mlp.fc2.weight, backbone.blocks.22.mlp.fc2.bias, backbone.blocks.23.norm1.weight, backbone.blocks.23.norm1.bias, backbone.blocks.23.attn.qkv.weight, backbone.blocks.23.attn.qkv.bias, backbone.blocks.23.attn.proj.weight, backbone.blocks.23.attn.proj.bias, backbone.blocks.23.norm2.weight, backbone.blocks.23.norm2.bias, backbone.blocks.23.mlp.fc1.weight, backbone.blocks.23.mlp.fc1.bias, backbone.blocks.23.mlp.fc2.weight, backbone.blocks.23.mlp.fc2.bias, backbone.last_norm.weight, backbone.last_norm.bias, keypoint_head.deconv_layers.0.weight, keypoint_head.deconv_layers.1.weight, keypoint_head.deconv_layers.1.bias, keypoint_head.deconv_layers.1.running_mean, keypoint_head.deconv_layers.1.running_var, keypoint_head.deconv_layers.1.num_batches_tracked, keypoint_head.deconv_layers.3.weight, keypoint_head.deconv_layers.4.weight, keypoint_head.deconv_layers.4.bias, keypoint_head.deconv_layers.4.running_mean, keypoint_head.deconv_layers.4.running_var, keypoint_head.deconv_layers.4.num_batches_tracked, keypoint_head.final_layer.weight, keypoint_head.final_layer.bias

missing keys in source state_dict: pos_embed, patch_embed.proj.weight, patch_embed.proj.bias, blocks.0.norm1.weight, blocks.0.norm1.bias, blocks.0.attn.qkv.weight, blocks.0.attn.qkv.bias, blocks.0.attn.proj.weight, blocks.0.attn.proj.bias, blocks.0.norm2.weight, blocks.0.norm2.bias, blocks.0.mlp.fc1.weight, blocks.0.mlp.fc1.bias, blocks.0.mlp.fc2.weight, blocks.0.mlp.fc2.bias, blocks.1.norm1.weight, blocks.1.norm1.bias, blocks.1.attn.qkv.weight, blocks.1.attn.qkv.bias, blocks.1.attn.proj.weight, blocks.1.attn.proj.bias, blocks.1.norm2.weight, blocks.1.norm2.bias, blocks.1.mlp.fc1.weight, blocks.1.mlp.fc1.bias, blocks.1.mlp.fc2.weight, blocks.1.mlp.fc2.bias, blocks.2.norm1.weight, blocks.2.norm1.bias, blocks.2.attn.qkv.weight, blocks.2.attn.qkv.bias, blocks.2.attn.proj.weight, blocks.2.attn.proj.bias, blocks.2.norm2.weight, blocks.2.norm2.bias, blocks.2.mlp.fc1.weight, blocks.2.mlp.fc1.bias, blocks.2.mlp.fc2.weight, blocks.2.mlp.fc2.bias, blocks.3.norm1.weight, blocks.3.norm1.bias, blocks.3.attn.qkv.weight, blocks.3.attn.qkv.bias, blocks.3.attn.proj.weight, blocks.3.attn.proj.bias, blocks.3.norm2.weight, blocks.3.norm2.bias, blocks.3.mlp.fc1.weight, blocks.3.mlp.fc1.bias, blocks.3.mlp.fc2.weight, blocks.3.mlp.fc2.bias, blocks.4.norm1.weight, blocks.4.norm1.bias, blocks.4.attn.qkv.weight, blocks.4.attn.qkv.bias, blocks.4.attn.proj.weight, blocks.4.attn.proj.bias, blocks.4.norm2.weight, blocks.4.norm2.bias, blocks.4.mlp.fc1.weight, blocks.4.mlp.fc1.bias, blocks.4.mlp.fc2.weight, blocks.4.mlp.fc2.bias, blocks.5.norm1.weight, blocks.5.norm1.bias, blocks.5.attn.qkv.weight, blocks.5.attn.qkv.bias, blocks.5.attn.proj.weight, blocks.5.attn.proj.bias, blocks.5.norm2.weight, blocks.5.norm2.bias, blocks.5.mlp.fc1.weight, blocks.5.mlp.fc1.bias, blocks.5.mlp.fc2.weight, blocks.5.mlp.fc2.bias, blocks.6.norm1.weight, blocks.6.norm1.bias, blocks.6.attn.qkv.weight, blocks.6.attn.qkv.bias, blocks.6.attn.proj.weight, blocks.6.attn.proj.bias, blocks.6.norm2.weight, blocks.6.norm2.bias, blocks.6.mlp.fc1.weight, blocks.6.mlp.fc1.bias, blocks.6.mlp.fc2.weight, blocks.6.mlp.fc2.bias, blocks.7.norm1.weight, blocks.7.norm1.bias, blocks.7.attn.qkv.weight, blocks.7.attn.qkv.bias, blocks.7.attn.proj.weight, blocks.7.attn.proj.bias, blocks.7.norm2.weight, blocks.7.norm2.bias, blocks.7.mlp.fc1.weight, blocks.7.mlp.fc1.bias, blocks.7.mlp.fc2.weight, blocks.7.mlp.fc2.bias, blocks.8.norm1.weight, blocks.8.norm1.bias, blocks.8.attn.qkv.weight, blocks.8.attn.qkv.bias, blocks.8.attn.proj.weight, blocks.8.attn.proj.bias, blocks.8.norm2.weight, blocks.8.norm2.bias, blocks.8.mlp.fc1.weight, blocks.8.mlp.fc1.bias, blocks.8.mlp.fc2.weight, blocks.8.mlp.fc2.bias, blocks.9.norm1.weight, blocks.9.norm1.bias, blocks.9.attn.qkv.weight, blocks.9.attn.qkv.bias, blocks.9.attn.proj.weight, blocks.9.attn.proj.bias, blocks.9.norm2.weight, blocks.9.norm2.bias, blocks.9.mlp.fc1.weight, blocks.9.mlp.fc1.bias, blocks.9.mlp.fc2.weight, blocks.9.mlp.fc2.bias, blocks.10.norm1.weight, blocks.10.norm1.bias, blocks.10.attn.qkv.weight, blocks.10.attn.qkv.bias, blocks.10.attn.proj.weight, blocks.10.attn.proj.bias, blocks.10.norm2.weight, blocks.10.norm2.bias, blocks.10.mlp.fc1.weight, blocks.10.mlp.fc1.bias, blocks.10.mlp.fc2.weight, blocks.10.mlp.fc2.bias, blocks.11.norm1.weight, blocks.11.norm1.bias, blocks.11.attn.qkv.weight, blocks.11.attn.qkv.bias, blocks.11.attn.proj.weight, blocks.11.attn.proj.bias, blocks.11.norm2.weight, blocks.11.norm2.bias, blocks.11.mlp.fc1.weight, blocks.11.mlp.fc1.bias, blocks.11.mlp.fc2.weight, blocks.11.mlp.fc2.bias, 
blocks.12.norm1.weight, blocks.12.norm1.bias, blocks.12.attn.qkv.weight, blocks.12.attn.qkv.bias, blocks.12.attn.proj.weight, blocks.12.attn.proj.bias, blocks.12.norm2.weight, blocks.12.norm2.bias, blocks.12.mlp.fc1.weight, blocks.12.mlp.fc1.bias, blocks.12.mlp.fc2.weight, blocks.12.mlp.fc2.bias, blocks.13.norm1.weight, blocks.13.norm1.bias, blocks.13.attn.qkv.weight, blocks.13.attn.qkv.bias, blocks.13.attn.proj.weight, blocks.13.attn.proj.bias, blocks.13.norm2.weight, blocks.13.norm2.bias, blocks.13.mlp.fc1.weight, blocks.13.mlp.fc1.bias, blocks.13.mlp.fc2.weight, blocks.13.mlp.fc2.bias, blocks.14.norm1.weight, blocks.14.norm1.bias, blocks.14.attn.qkv.weight, blocks.14.attn.qkv.bias, blocks.14.attn.proj.weight, blocks.14.attn.proj.bias, blocks.14.norm2.weight, blocks.14.norm2.bias, blocks.14.mlp.fc1.weight, blocks.14.mlp.fc1.bias, blocks.14.mlp.fc2.weight, blocks.14.mlp.fc2.bias, blocks.15.norm1.weight, blocks.15.norm1.bias, blocks.15.attn.qkv.weight, blocks.15.attn.qkv.bias, blocks.15.attn.proj.weight, blocks.15.attn.proj.bias, blocks.15.norm2.weight, blocks.15.norm2.bias, blocks.15.mlp.fc1.weight, blocks.15.mlp.fc1.bias, blocks.15.mlp.fc2.weight, blocks.15.mlp.fc2.bias, blocks.16.norm1.weight, blocks.16.norm1.bias, blocks.16.attn.qkv.weight, blocks.16.attn.qkv.bias, blocks.16.attn.proj.weight, blocks.16.attn.proj.bias, blocks.16.norm2.weight, blocks.16.norm2.bias, blocks.16.mlp.fc1.weight, blocks.16.mlp.fc1.bias, blocks.16.mlp.fc2.weight, blocks.16.mlp.fc2.bias, blocks.17.norm1.weight, blocks.17.norm1.bias, blocks.17.attn.qkv.weight, blocks.17.attn.qkv.bias, blocks.17.attn.proj.weight, blocks.17.attn.proj.bias, blocks.17.norm2.weight, blocks.17.norm2.bias, blocks.17.mlp.fc1.weight, blocks.17.mlp.fc1.bias, blocks.17.mlp.fc2.weight, blocks.17.mlp.fc2.bias, blocks.18.norm1.weight, blocks.18.norm1.bias, blocks.18.attn.qkv.weight, blocks.18.attn.qkv.bias, blocks.18.attn.proj.weight, blocks.18.attn.proj.bias, blocks.18.norm2.weight, blocks.18.norm2.bias, blocks.18.mlp.fc1.weight, blocks.18.mlp.fc1.bias, blocks.18.mlp.fc2.weight, blocks.18.mlp.fc2.bias, blocks.19.norm1.weight, blocks.19.norm1.bias, blocks.19.attn.qkv.weight, blocks.19.attn.qkv.bias, blocks.19.attn.proj.weight, blocks.19.attn.proj.bias, blocks.19.norm2.weight, blocks.19.norm2.bias, blocks.19.mlp.fc1.weight, blocks.19.mlp.fc1.bias, blocks.19.mlp.fc2.weight, blocks.19.mlp.fc2.bias, blocks.20.norm1.weight, blocks.20.norm1.bias, blocks.20.attn.qkv.weight, blocks.20.attn.qkv.bias, blocks.20.attn.proj.weight, blocks.20.attn.proj.bias, blocks.20.norm2.weight, blocks.20.norm2.bias, blocks.20.mlp.fc1.weight, blocks.20.mlp.fc1.bias, blocks.20.mlp.fc2.weight, blocks.20.mlp.fc2.bias, blocks.21.norm1.weight, blocks.21.norm1.bias, blocks.21.attn.qkv.weight, blocks.21.attn.qkv.bias, blocks.21.attn.proj.weight, blocks.21.attn.proj.bias, blocks.21.norm2.weight, blocks.21.norm2.bias, blocks.21.mlp.fc1.weight, blocks.21.mlp.fc1.bias, blocks.21.mlp.fc2.weight, blocks.21.mlp.fc2.bias, blocks.22.norm1.weight, blocks.22.norm1.bias, blocks.22.attn.qkv.weight, blocks.22.attn.qkv.bias, blocks.22.attn.proj.weight, blocks.22.attn.proj.bias, blocks.22.norm2.weight, blocks.22.norm2.bias, blocks.22.mlp.fc1.weight, blocks.22.mlp.fc1.bias, blocks.22.mlp.fc2.weight, blocks.22.mlp.fc2.bias, blocks.23.norm1.weight, blocks.23.norm1.bias, blocks.23.attn.qkv.weight, blocks.23.attn.qkv.bias, blocks.23.attn.proj.weight, blocks.23.attn.proj.bias, blocks.23.norm2.weight, blocks.23.norm2.bias, blocks.23.mlp.fc1.weight, blocks.23.mlp.fc1.bias, blocks.23.mlp.fc2.weight, 
blocks.23.mlp.fc2.bias, last_norm.weight, last_norm.bias

Bottom up vs top down model

Hi, can someone explain how the bottom-up ViTPose model works? Can you give an example with ViTPose-B? I am interested in the smallest, fastest single-person pose model that still preserves decent accuracy on COCO. Would that be ViTPose-B in a bottom-up or top-down manner?

How to do inference on video by scripts?

Hey! I used the web demo from #20 to run inference on a video file, but it's super slow! I'm wondering if there are any scripts to do this?

Another question: let's say I have an input image of size 1080x640x3 that contains 10 people. The detector detects all of them, so after cropping and resizing, the actual data flowing into ViTPose is 10x3x256x192. And the speed reported in #4 (900 fps) is measured on each 256x192x3 crop. Am I correct?

Thanks in advance!

Training on test images when using CrowdPose?

Dear authors, thanks for the exciting work and I'd like to apologize in advance if I misunderstood.

As you may already know, CrowdPose dataset itself is constituted by cherry-picked crowd samples selected from MSCOCO, MPII and AIC, but CrowdPose did not specify if they treated train/val/test images from MSCOCO/MPII/AIC differently. They also re-annotated (presumably more accurately) these samples.

What we have noticed is that many of the test images in "MS COCO val set" are also present in the "CrowdPose train" and "CrowdPose train/val" splits. Although CrowdPose has renamed all their images, we have identified at least 181 images in "CrowdPose train/val" having the same md5 as images in "MS COCO val set".

For example, "108951.jpg" in "CrowdPose train" and "000000147740.jpg" in "MS COCO val set" are the same image with md5: f9fc120dc085166b30c08da3de333b69

We did not identify any image overlap between CrowdPose and MPII/AIC at the md5 level for either train or test images, possibly because CrowdPose did some preprocessing on the selected MPII/AIC images; but based on the finding for COCO, the possibility of such train-test overlap with MPII/AIC is notable. We have not yet checked whether "CrowdPose test" images are also present in the "COCO train set".

So if I did not miss anything, the model jointly trained on COCO+AIC+MPII+CrowdPose would have seen many of the test images (with labels, at least for COCO) during the training process, making the results untrustworthy.

Config question

Can you explain what nms_thr and oks_thr mean and what they do?
Thank you very much!
nms_thr=1.0

KeyError: 'ViT is not in the models registry'

I am trying to run top_down_video_demo_with_mmdet.py with the command:

python demo/top_down_video_demo_with_mmdet.py \
demo/mmdetection_cfg/yolov3_d53_320_273e_coco.py  \
https://download.openmmlab.com/mmdetection/v2.0/yolo/yolov3_d53_320_273e_coco/yolov3_d53_320_273e_coco-421362b6.pth \
configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_huge_coco_256x192.py \
../pretrained/ViTPose-H.pth \
--video-path ../UCF_Videos/Fighting/Fighting018_x264.mp4 \
--out-video-root ../output/test1

However, I am getting the following error:

Traceback (most recent call last):
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/models/detectors/top_down.py", line 48, in __init__
    self.backbone = builder.build_backbone(backbone)
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/models/builder.py", line 19, in build_backbone
    return BACKBONES.build(cfg)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 237, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 61, in build_from_cfg
    raise KeyError(
KeyError: 'ViT is not in the models registry'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/s2435462/HRC/ViTPose/demo/top_down_video_demo_with_mmdet.py", line 165, in <module>
    main()
  File "/home/s2435462/HRC/ViTPose/demo/top_down_video_demo_with_mmdet.py", line 76, in main
    pose_model = init_pose_model(
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/apis/inference.py", line 43, in init_pose_model
    model = build_posenet(config.model)
  File "/home/s2435462/HRC/ViTPose/mmpose/mmpose/models/builder.py", line 39, in build_posenet
    return POSENETS.build(cfg)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 237, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/home/s2435462/.conda/envs/open-mmlab/lib/python3.9/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: "TopDown: 'ViT is not in the models registry'"

I installed everything with these commands:

conda create -n open-mmlab python=3.9 -y
conda activate open-mmlab

conda install pytorch torchvision cudatoolkit=11.3 -c pytorch

git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .

pip install mmcv-full
pip install mmdet

rm -rf mmpose
git clone https://github.com/open-mmlab/mmpose.git
cd mmpose
pip install -r requirements.txt
pip install -e .

Can someone guide me on how to solve this?

About model size

Hi, I used the pre-trained model you provided for fine-tuning. Performance and speed are competitive, but the size of my model is about three times larger than yours. For example, my vitpose-b is about 1.x GB, while yours is 343 MB. How can I get a model of the same size?

How to load pre-trained model?

Hi, when loading the pre-trained model I used the "--resume-from" argument followed by a pre-trained model path, and I got an error message like this:

2022-06-10 10:51:38,626 - mmpose - INFO - load checkpoint from local path: models/epoch_1.pth
Traceback (most recent call last):
  File "/home/pose/codes/ViTPose/tools/train.py", line 195, in <module>
    main()
  File "/home/pose/codes/ViTPose/tools/train.py", line 184, in main
    train_model(
  File "/home/pose/codes/ViTPose/mmpose/apis/train.py", line 197, in train_model
    runner.resume(cfg.resume_from)
  File "/home/pose/codes/ViTPose/ViT_venv/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 364, in resume
    self._iter = checkpoint['meta']['iter']
KeyError: 'iter'

So what's the right way to load a pre-trained model? Thank you for your patience and time!

Will this model work with unseen data?

Will this model work with unseen data (in-the-wild pose estimation), or does it require further training beyond the COCO/AIC/MPII/CrowdPose datasets?

ONNX version of the model

Hi, thank you for the great work.

I am really impressed with your work. By any chance, could you release an ONNX version of the ViTPose model?
I tried it with ViTPose-B* but failed many times.

AttributeError: 'ConfigDict' object has no attribute 'data'

When I try to run the code below in a notebook ->

!bash tools/dist_test.sh /content/ViTPose/configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.yml /content/mask_rcnn_swin_tiny_patch4_window7_1x.pth 1

as you have mentioned in README.md ->

bash tools/dist_test.sh <Config PATH> <Checkpoint PATH> <NUM GPUs>

I get the error below ->

apex is not installed
apex is not installed
apex is not installed
/usr/local/lib/python3.7/dist-packages/mmcv/cnn/bricks/transformer.py:33: UserWarning: Fail to import MultiScaleDeformableAttention from mmcv.ops.multi_scale_deform_attn, You should install mmcv-full if you need this module.
warnings.warn('Fail to import MultiScaleDeformableAttention from '
Traceback (most recent call last):
  File "tools/test.py", line 184, in <module>
    main()
  File "tools/test.py", line 96, in main
    setup_multi_processes(cfg)
  File "/content/ViTPose/mmpose/utils/setup_env.py", line 30, in setup_multi_processes
    if 'OMP_NUM_THREADS' not in os.environ and cfg.data.workers_per_gpu > 1:
  File "/usr/local/lib/python3.7/dist-packages/mmcv/utils/config.py", line 513, in __getattr__
    return getattr(self._cfg_dict, name)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/utils/config.py", line 49, in __getattr__
    raise ex
AttributeError: 'ConfigDict' object has no attribute 'data'
Killing subprocess 743
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/test.py', '--local_rank=0', '/content/ViTPose/configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.yml', '/content/mask_rcnn_swin_tiny_patch4_window7_1x.pth', '--launcher', 'pytorch']' returned non-zero exit status 1.

About multitask log

Thanks for the great work! It helps me a lot in my own study. Could you please release the training log of the multi-task training? A .log file may be better than a .json file. Thanks again.

Keypoints absent from model output?

I've been trying to use the demo scripts but keep getting the following error:

Traceback (most recent call last):
  File "demo/top_down_video_demo_with_mmdet.py", line 165, in <module>
    main()
  File "demo/top_down_video_demo_with_mmdet.py", line 125, in main
    pose_results, returned_outputs = inference_top_down_pose_model(
  File "/home/nshah/work/packages/vitpose/mmpose/apis/inference.py", line 415, in inference_top_down_pose_model
    poses, heatmap = _inference_single_pose_model(
  File "/home/nshah/work/packages/vitpose/mmpose/apis/inference.py", line 307, in _inference_single_pose_model
    return result['preds'], result['output_heatmap']
KeyError: 'preds'

The model seemingly outputs only the heatmap and not the actual keypoint predictions. However, I noticed in some of the closed issues that people were able to get some of the demo scripts to work. I'm just wondering whether I'm missing something very obvious.

I'm using this config which does appear to have a keypoint head.

Top-down or bottom-up?

Hey! I was reading the paper; impressive stuff.

I was uncertain about what you actually predict, however. Do you first crop the humans and then do keypoint estimation (top-down, I guess)? Or do you predict all humans at once (bottom-up) and then predict a part-affinity map (or the like) along with the keypoints?

If it's the latter, what exactly does the model output?

Thank you in advance 🙏

Running the project

Hi, can anyone summarize the installation setup and the quick-start process (for instance, using the demo and running inference)? The instructions in the README.md are confusing for beginners. Thank you!

Model in video demo

Hello, I was wondering which model is used in the HuggingFace web demo for video? I would like to test it using scripts. Are weights provided for that particular model? Thanks

Running video demo

Hello,

I tried to run the video demo using mmdet:

python demo/top_down_pose_tracking_demo_with_mmdet.py ./demo/mmdetection_cfg/faster_rcnn_r50_fpn_coco.py ./faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py ./vitpose-b.pt --video-path ./test.MOV --out-video-root ./output_video/
but I get errors due to version incompatibilities between mmcv, mmdet, and the current ViTPose (or mmpose) version.

So here is what I do: I install mmcv from source (version 1.3.9, as recommended in the README of this repo) and mmdet from source as well (I tried the latest mmdet, mmdet==2.14.0 as recommended in mminstall.txt for mmpose 0.24.0: ['mmcv-full>=1.3.8', 'mmdet>=2.14.0', 'mmtrack>=0.6.0'], and mmdet==2.23.0).

Here is what I get with the following versions, for example (pip list):
mmcv 1.3.9
mmdet 2.14.0
mmpose 0.24.0

Note that I use this:
torch 1.11.0+cu113
torchvision 0.12.0+cu113

I got this error:

/home/ubuntu/venv/lib/python3.8/site-packages/mmcv/cnn/bricks/transformer.py:27: UserWarning: Fail to import ``MultiScaleDeformableAttention`` from ``mmcv.ops.multi_scale_deform_attn``, You should install ``mmcv-full`` if you need this module.
  warnings.warn('Fail to import ``MultiScaleDeformableAttention`` from '
Traceback (most recent call last):
  File "demo/top_down_pose_tracking_demo_with_mmdet.py", line 190, in <module>
    main()
  File "demo/top_down_pose_tracking_demo_with_mmdet.py", line 74, in main
    assert has_mmdet, 'Please install mmdet to run the demo.'
AssertionError: Please install mmdet to run the demo.

When I switch mmdet to 2.23.0, I get this error:

AssertionError: MMCV==1.3.9 is used but incompatible. Please install mmcv>=1.3.17, <=1.6.0.

I tried installing mmcv>=1.3.17, but that did not resolve the problem!

Can you please tell us which versions (of mmcv and mmdet) are recommended to run ViTPose on videos?

Inference speed

Thank you for the nice work! May I know if you have done any analysis or comparison of the model's inference speed?

Video demo with ViTPose-B

Hello, how do I run the video demo with ViTPose? Are the demo scripts using any ViTPose models? The demo page says: "Using mmdet for human bounding box detection. We provide a demo script to run mmdet for human detection, and mmpose for pose estimation." How do I use ViTPose for video pose estimation?

Demo code

Hi, I am very interested in your excellent work and would like to ask where I could find the code for the web demo? By the way, where can I get the quantitative intermediate outputs of this app (https://huggingface.co/spaces/Gradio-Blocks/ViTPose), such as detection boxes and keypoints? Looking forward to your reply!

Speed of Detection

Hi, I have only managed to get around 5 fps for the top-down 2D pose estimation model on a GTX 1660 GPU via demo/webcam.py with video testing. How can I speed up inference when using synchronous mode? Thank you!! :)

Use ViTPose with Jetson AGX Orin

Hi, thanks for the great work you have done on pose estimation. I used the deployment script pytorch2onnx.py to convert the model to ONNX and then used trtexec to convert it to an engine file, but the output heatmap is different when using TensorRT inference.

What is the full-window attention structure?

It is mentioned in the paper that the full-window attention structure is used to reduce the memory load, but I did not find an introduction to the full-window attention structure. I would like to ask how this structure is realized.

About Inference speed

Are you sure that this method is faster than HRNet?
I have tried both with YOLOv5 as the detector in TensorRT inference.
HRNet achieves around 30-35 fps, while ViTPose can reach 7 fps on the same video with TensorRT.
Inference tests I have conducted show that HRNet is 6-7x faster when using larger batch sizes for some reason (around 220 fps per target for fp16 and 450 fps for int8), while ViTPose achieves around 60 fps per target in TensorRT.

Testing on CPU

How to test on CPU?

Setting the number of GPUs to 0 doesn't work.

bash tools/dist_test.sh configs/body/2d_kpt_sview_rgb_img/deeppose/coco/res101_coco_256x192.py  ../weights/mae_pretrain_vit_base.pth 0

Error:

FutureWarning,
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 225, in launch_agent
    master_port=master_port,
  File "<string>", line 15, in __init__
  File "/home/ahmad/Desktop/RedBuffer/BowlingAngle/venv/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 87, in __post_init__
    assert self.local_world_size > 0
AssertionError

Model mismatch

Hi, I encountered a mismatch issue when training ViT-Base from the pre-trained MAE weights.
'The model and loaded state dict do not match exactly
unexpected key in source state_dict: cls_token, norm.weight, norm.bias
missing keys in source state_dict: last_norm.weight, last_norm.bias'

And
'fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git [...] -- [...]'
'

But the training did not stop. What actions should I take, other than simply loading from the pre-trained MAE weights?

About the code of the transformer block

Thank you for open-sourcing this great repo.
Hello, where is the code of the transformer block? I didn't find the corresponding code.

I would greatly appreciate it if you could spend some of your time on a reply.

Question about the file vitpose-l-simple.pth.

The model file vitpose-l-simple.pth that I downloaded cannot be loaded. I would like to confirm whether the problem is with my download or with the uploaded model itself?
[Screenshot 2022-09-02 013036]
And below is a screenshot of my error.
[image]
Looking forward to your reply!
