qitaozhao / poseformerv2
The project is an official implementation of our paper "PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation".
License: MIT License
Thank you for sharing the code.
May I ask whether the camera intrinsics are used to preprocess the data (assuming we have access to 2D ground truth in pixel space)? As far as I understand, they are returned via "out_camera_params" but never used in the scripts, so I wanted to ask whether there is any other place where they might be needed.
I also wanted to ask about your recommended 2D pose estimator for running your code for inference and whether there is a plan to provide an inference script in the future.
Thanks in advance for your time.
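For readers with the same question: VideoPose3D-style pipelines, which this repository appears to build on, typically do not apply the intrinsics to the 2D input and only rescale pixel coordinates by the image width and height. A minimal sketch of that normalization, assuming the convention holds here; the function below is an illustrative plain-Python version, not the repository's code:

```python
def normalize_screen_coordinates(points, width, height):
    """Map pixel x in [0, width] to [-1, 1]; y is scaled by the same
    factor so that the aspect ratio is preserved."""
    return [
        (x / width * 2.0 - 1.0, y / width * 2.0 - height / width)
        for x, y in points
    ]


# Made-up pixel coordinates in a 1000 x 1000 image.
pixels = [(0.0, 0.0), (500.0, 500.0), (1000.0, 1000.0)]
normalized = normalize_screen_coordinates(pixels, width=1000, height=1000)
print(normalized)
```

Under this convention only the image resolution is needed for the 2D input, which would be consistent with the intrinsics going unused in the scripts.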
Hi! Thanks for your work! The MPJPE reported for MixSTE in your paper looks wrong to me. Shouldn't it be 40.9, which is better than yours?
Thank you very much for sharing!
As you mentioned in the article, PoseFormerV2 is reported to be more efficient than MHFormer by approximately 4 times. I have tried processing the same video using MHFormer and PoseFormerV2 using the demo sources provided here:
MHFormer: https://github.com/Vegetebird/MHFormer/blob/main/demo/vis.py
PoseFormerV2: https://github.com/QitaoZhao/PoseFormerV2/blob/main/demo/vis.py
For MHFormer, I used the default source code without any additional modifications.
For PoseFormerV2, I used the 9_81_46.0.bin model.
Both were run on an RTX 3090.
Results:
MHFormer: ~6.7 it/s
PoseFormerV2: ~2.7 it/s
Results when I commented out the drawing and file-saving part in the get_pose3D() function:
MHFormer: ~54.00 it/s
PoseFormerV2: ~30.63 it/s
Could you please explain why there is such a difference? Have I done something wrong?
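When comparing it/s between two demos, it helps to time only the model call, after a few warm-up iterations, so that drawing, file saving, and one-off setup costs do not dominate the number. A minimal stdlib sketch; `run_once` is a hypothetical stand-in for one forward pass, and on a GPU the device would also need to be synchronized before reading the clock:

```python
import time


def measure_throughput(run_once, iterations=50, warmup=5):
    """Return iterations per second for a zero-argument callable."""
    # Warm-up runs are excluded so one-off setup costs do not skew the result.
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def fake_forward():
    # Hypothetical placeholder workload; replace with the real model call.
    sum(i * i for i in range(10_000))


print(f"{measure_throughput(fake_forward):.1f} it/s")
```

Timing both models this way, with the same input length and batch size, makes it easier to tell whether a gap comes from the network itself or from the surrounding demo code.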
Thank you very much for your work. When I run this command:
python run_poseformer.py -d h36m -k gt \
    -c checkpoint -g 0 \
    --evaluate 27_243_45.2.bin \
    --render --viz-subject S11 \
    --viz-action Walking --viz-camera 0 \
    --viz-export output3d
I encountered the following issue:
Traceback (most recent call last):
  File "/home/dvlab/PoseFormerV2-main/run_poseformer.py", line 574, in <module>
    prediction = evaluate(gen, return_predictions=True)
  File "/home/dvlab/PoseFormerV2-main/run_poseformer.py", line 481, in evaluate
    cam = torch.from_numpy(cam.astype('float32'))
AttributeError: 'NoneType' object has no attribute 'astype'
I modified the code to print cameras and check whether it holds actual values:
test_generator = UnchunkedGenerator(cameras_valid, poses_valid, poses_valid_2d,
                                    pad=pad, causal_shift=causal_shift, augment=False,
                                    kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right)
class UnchunkedGenerator:
    def __init__(self, cameras, poses_3d, poses_2d, pad=0, causal_shift=0,
                 augment=False, kps_left=None, kps_right=None, joints_left=None, joints_right=None):
        assert poses_3d is None or len(poses_3d) == len(poses_2d)
        assert cameras is None or len(cameras) == len(poses_2d)
        print("cameras 3 :", cameras)
Its value is a list:
cameras 3 : [array([ 2.29009891e+00, 2.28756237e+00, 2.50830650e-02, 2.89029814e-02,
-2.07098916e-01, 2.47775182e-01, -3.07515031e-03, -9.75698873e-04,
-1.42447161e-03, 4.36662363e-01, 0.00000000e+00, -1.09528303e-02,
0.00000000e+00, 4.37146551e-01, -1.26348389e-02, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29935122e+00, 2.29518342e+00, 1.76972151e-02, 1.61298513e-02,
-1.94213629e-01, 2.40408540e-01, 6.81997556e-03, -1.61902665e-03,
-2.74089444e-03, 4.34905287e-01, 0.00000000e+00, -7.69661227e-03,
0.00000000e+00, 4.35695026e-01, -7.02769589e-03, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29828143e+00, 2.29759789e+00, 3.96317244e-02, 2.80535221e-03,
-2.08338186e-01, 2.55488008e-01, -2.46049743e-03, 1.48438697e-03,
-7.59999326e-04, 4.35107723e-01, 0.00000000e+00, -1.72440689e-02,
0.00000000e+00, 4.35237169e-01, -1.22099358e-03, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29102278e+00, 2.28954792e+00, 2.99364328e-02, 1.76403334e-03,
-1.98384091e-01, 2.18323678e-01, -8.94780736e-03, -5.87205577e-04,
-1.81336200e-03, 4.36486276e-01, 0.00000000e+00, -1.30668422e-02,
0.00000000e+00, 4.36767447e-01, -7.70472339e-04, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29009891e+00, 2.28756237e+00, 2.50830650e-02, 2.89029814e-02,
-2.07098916e-01, 2.47775182e-01, -3.07515031e-03, -9.75698873e-04,
-1.42447161e-03, 4.36662363e-01, 0.00000000e+00, -1.09528303e-02,
0.00000000e+00, 4.37146551e-01, -1.26348389e-02, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29935122e+00, 2.29518342e+00, 1.76972151e-02, 1.61298513e-02,
-1.94213629e-01, 2.40408540e-01, 6.81997556e-03, -1.61902665e-03,
-2.74089444e-03, 4.34905287e-01, 0.00000000e+00, -7.69661227e-03,
0.00000000e+00, 4.35695026e-01, -7.02769589e-03, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29828143e+00, 2.29759789e+00, 3.96317244e-02, 2.80535221e-03,
-2.08338186e-01, 2.55488008e-01, -2.46049743e-03, 1.48438697e-03,
-7.59999326e-04, 4.35107723e-01, 0.00000000e+00, -1.72440689e-02,
0.00000000e+00, 4.35237169e-01, -1.22099358e-03, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29102278e+00, 2.28954792e+00, 2.99364328e-02, 1.76403334e-03,
-1.98384091e-01, 2.18323678e-01, -8.94780736e-03, -5.87205577e-04,
-1.81336200e-03, 4.36486276e-01, 0.00000000e+00, -1.30668422e-02,
0.00000000e+00, 4.36767447e-01, -7.70472339e-04, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29009891e+00, 2.28756237e+00, 2.50830650e-02, 2.89029814e-02,
-2.07098916e-01, 2.47775182e-01, -3.07515031e-03, -9.75698873e-04,
-1.42447161e-03, 4.36662363e-01, 0.00000000e+00, -1.09528303e-02,
0.00000000e+00, 4.37146551e-01, -1.26348389e-02, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00]), array([ 2.29935122e+00, 2.29518342e+00, 1.76972151e-02, 1.61298513e-02,
-1.94213629e-01, 2.40408540e-01, 6.81997556e-03, -1.61902665e-03,
-2.74089444e-03, 4.34905287e-01, 0.00000000e+00, -7.69661227e-03,
0.00000000e+00, 4.35695026e-01, -7.02769589e-03, 0.00000000e+00,
0.00000000e+00, 1.00000000e+00])
I have no idea where the bug is.
Thanks for your help
Thanks for your awesome work. I also hope to use the MPI-INF-3DHP data set in my experiment.
How did you deal with MPI-INF-3DHP?
This problem has been bothering me for several days.
I would be very glad if you could help me.
Recently I tried to train the model with frame = 243, central frame = 27, and coefficient = 27. I expected a result close to 45.2 mm, but my model only reaches about 46.6 mm.
I would really like to know how to train a model similar to yours. Please tell me which settings I need to get this result.
Thank you very much for your help.
Great work! GPU utilization is very low when I run the code. How should I modify the config?
Hello, I am interested in your excellent work, but as a beginner I have a problem with visualization. Can you provide instructions for visualization? Thank you very much!
Great work! I would like to know how to add noise to test the robustness of the model, as in Figure 6 of your paper. Thank you!
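One common way to probe robustness is to perturb the 2D keypoints fed to the model with zero-mean Gaussian noise and watch how MPJPE changes as the standard deviation grows. A minimal sketch under that assumption; the exact noise protocol behind Figure 6 is not given in this thread, and `add_gaussian_noise` is an illustrative helper, not part of the repository:

```python
import random


def add_gaussian_noise(keypoints_2d, sigma, seed=None):
    """Return a copy of keypoints_2d with N(0, sigma^2) noise added to x and y.

    keypoints_2d[f][j] is an (x, y) tuple for frame f and joint j.
    """
    rng = random.Random(seed)
    return [
        [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in frame]
        for frame in keypoints_2d
    ]


# Made-up normalized keypoints: two frames, two joints each.
clean = [[(0.10, 0.20), (0.30, 0.40)], [(0.11, 0.21), (0.31, 0.41)]]
noisy = add_gaussian_noise(clean, sigma=0.01, seed=0)
print(noisy)
```

Sweeping `sigma` over a few values and re-running evaluation on the perturbed inputs gives an error-versus-noise curve.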
If I remember correctly, you took the structure of the PoseformerV1 spatial-temporal transformer and replaced the temporal transformer with a "DCT + LPF + Linear projection" module, right?
I'm curious why only the spatial transformer part has temporal embedding instead of spatial embedding.
Is it related to "reformulated as a Time-Frequency Feature Fusion module"? So it must have temporal embedding.
Please help me figure this out.
Thanks for your reply
I successfully reproduced the accuracy reported in the text for an input sequence of 27 (f=1, dct=3), but when I raise the RF to 81, the MPJPE only reaches about 48.7. I use a learning rate of 0.0008 and a batch size of 1024. What went wrong? Any answer would be appreciated!
Thanks. We did reproduce PoseFormerV1 but _not_ reproduce MixSTE. The reproduced PoseFormerV1 result (AUC & MPJPE) is worse than MixSTE, which is consistent with Human3.6M.
Originally posted by @QitaoZhao in #3 (comment)
Hello, sir. May I inquire about the methodology you utilize to calculate the value of MFLOPs?
When I enter this command:
python demo/vis.py --video sample_video.mp4
I get an error like this:
Traceback (most recent call last):
  File "demo/vis.py", line 5, in <module>
    from lib.hrnet.gen_kpts import gen_video_kpts as hrnet_pose
  File "/home/dvlab/PoseFormerV2-main/demo/lib/hrnet/gen_kpts.py", line 21, in <module>
    from lib.hrnet.lib.utils.inference import get_final_preds
  File "/home/dvlab/PoseFormerV2-main/demo/lib/hrnet/lib/utils/inference.py", line 17, in <module>
    from utils.transforms import transform_preds
ModuleNotFoundError: No module named 'utils.transforms'
However, I downloaded the whole repository from GitHub. I also checked whether any file is missing, but everything is in place.
Please tell me how to solve this problem.
Thanks.
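An error like this usually means that the directory containing the `utils` package is not on `sys.path` when the script starts, or that another installed module named `utils` is shadowing it. A minimal sketch of the first fix, assuming `utils/` lives under `demo/lib/hrnet/lib` as the traceback paths suggest; `prepend_to_sys_path` is a hypothetical helper:

```python
import os
import sys


def prepend_to_sys_path(directory):
    """Put the absolute form of `directory` first on sys.path, without duplicates."""
    path = os.path.abspath(directory)
    if path in sys.path:
        sys.path.remove(path)
    sys.path.insert(0, path)
    return path


# Assumed location of the package that provides `utils.transforms`.
hrnet_lib = prepend_to_sys_path(os.path.join("demo", "lib", "hrnet", "lib"))
print(hrnet_lib)
```

Because the path is resolved against the current working directory, this only helps when the script is launched from the repository root.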
Could you provide a script to export the model to ONNX, please?
Thank you for your great work!
I have a question about the Human3.6M dataset code. Why were the intrinsic parameters changed as follows:
cam['intrinsic'] = np.concatenate((cam['focal_length'],
cam['center'],
cam['radial_distortion'],
cam['tangential_distortion'],
[1/cam['focal_length'][0], 0, -cam['center'][0]/cam['focal_length'][0],
0, 1/cam['focal_length'][1], -cam['center'][1]/cam['focal_length'][1],
0, 0, 1]))
I checked the P-STMO project; its code is the same as in the VideoPose3D project:
cam['intrinsic'] = np.concatenate((cam['focal_length'],
cam['center'],
cam['radial_distortion'],
cam['tangential_distortion']))
Thanks
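For what it is worth, the nine extra values have the form of the inverse of the pinhole matrix K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], which maps pixel coordinates back to normalized camera rays. Why the authors append it is not stated in this thread. A small check of that algebra, using made-up focal length and center values:

```python
def intrinsic_matrix(fx, fy, cx, cy):
    """Pinhole intrinsic matrix K."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def inverse_intrinsic_matrix(fx, fy, cx, cy):
    """Same layout as the nine values appended in the snippet above."""
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


fx, fy, cx, cy = 2.0, 4.0, 0.5, 0.25  # made-up values, not from the dataset
product = matmul3(intrinsic_matrix(fx, fy, cx, cy), inverse_intrinsic_matrix(fx, fy, cx, cy))
print(product)  # numerically the 3x3 identity
```

The first nine entries (focal length, center, distortion) keep the VideoPose3D layout, so code that only reads those entries should be unaffected by the extra values.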
Hi, I am a student trying to use this project. First of all, thanks for the amazing project.
I am trying to find a Python API to run the model on my own in-the-wild videos. I saw the CLI way of doing it and tried it out; it worked pretty well:
python demo/vis.py --video sample_video.mp4
Is there a way of calling this from Python code? I want to customize it to return just the 3D coordinates (I suspect that saving the images is why inference is slow).
Edit: I figured out how the 2D points are obtained and then passed to get_pose3D, but I can't work out which part of this block of code holds the values of the 3D keypoints:
PoseFormerV2/common/model_poseformer.py, line 224 in 0c0b125
PoseFormerV2/common/model_poseformer.py, line 120 in 0c0b125
PoseFormerV2/common/model_poseformer.py, line 122 in 0c0b125
PoseFormerV2/common/model_poseformer.py, line 127 in 0c0b125
PoseFormerV2/common/model_poseformer.py, line 128 in 0c0b125
x[:, :f//2]
x[:, f//2:]
Are the frame numbers of x set incorrectly, or did I misunderstand? x[:, :f//2], and Spatial_feature belongs to x[:, f//2:]?
thanks for your help
Great work! I am wondering whether the model can run in real time with video stream input.
Thanks!
Great innovation. I noticed that the PoseFormer V1 experiment you replicated had a much better performance compared to the original article, while the MixSTE experiment had a significant decrease in performance. Is this a normal phenomenon?
Hi team, thanks for the work on this repo, it's quite fantastic. But where are the weights for PoseFormerV2?
Thank you for the amazing work! Can you please share the demo script to run on in-the-wild videos? Thanks
Hello, thank you very much for your great work, I wanted to make sure that to train to get the best results mentioned in the paper, the training code is
python run_poseformer.py -g 5 -k cpn_ft_h36m_dbb -frame 243 -frame-kept 27 -coeff-kept 27 -c checkpoint/243 -b 1024?
When I run pip install -r requirements.txt, some packages require Python 3.9. Does your project use Python 3.8?
Thank you very much for your work. I would like to know how the number of parameters and the amount of calculation of the model in your work is calculated?
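Parameter counts and FLOPs for transformer models are typically obtained either with a profiling tool or by summing per-layer costs by hand; the thread does not say which the authors used. A hand-count sketch for a single linear layer, assuming the convention that one multiply-accumulate (MAC) counts as one operation; the layer sizes are made up:

```python
def linear_params(in_features, out_features, bias=True):
    """Number of learnable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def linear_macs(num_tokens, in_features, out_features):
    """Multiply-accumulate operations when the layer is applied to num_tokens inputs."""
    return num_tokens * in_features * out_features


# Made-up sizes: 17 tokens (e.g. joints) projected from 32 to 64 channels.
params = linear_params(32, 64)
macs = linear_macs(17, 32, 64)
print(params, macs)  # 2112 34816
```

Some papers report MACs under the name FLOPs while others count two FLOPs per MAC, so the convention should be checked before comparing numbers across papers.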
Hello, great work! But I cannot find the torch_dct file. Can you upload it? Thanks a lot!
I cannot obtain the same experimental result 47.9 using the parameters suggested by the author. Do you need to pay attention to any further experimental details? I hope the author can inform me.
Thank you for your work! Are the released pretrained models suitable for inference on custom videos, or do the demos use models trained differently? Thanks!
I noticed in Table 2 that StridedTrans. [15] (TMM’22) has an MPJPE of 47.5 and 342.5 MFLOPs. This is far from the results reported in the StridedTrans. paper, where an input of 81 frames reaches 45.4 MPJPE and an input of 27 frames reaches 46.9 MPJPE while requiring only 128 MFLOPs.
Can you explain why you chose the result 47.5?
Thanks!
I'm having difficulty understanding what each point on the figure indicates. For example, why does PoseFormerV2 27xRF have three points? Was this data mentioned in the paper?
Also, under what conditions does PoseFormerV2 9xRF have an MPJPE of 46?
If I may jump to a conclusion: overall, PoseFormerV2 27xRF is the best model, except for PoseFormerV2 9xRF with an MPJPE of 46.
Is PoseFormerV2 9xRF with an MPJPE of 46 a special case?
Thank you for your reply
First, thanks for your great work on PoseFormerV2. I checked your code and found that, for evaluation on the MPI-INF-3DHP dataset, the calculation of the two metrics PCK and AUC is not defined. May I know how to calculate PCK and AUC for MPI-INF-3DHP? Thanks a lot!
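On MPI-INF-3DHP, PCK is commonly reported as the percentage of joints whose 3D error is within 150 mm, and AUC as the mean PCK over a range of thresholds (often 0 to 150 mm in 5 mm steps). A minimal sketch under those assumptions; the threshold settings should be checked against the evaluation protocol being compared with, and the helper names are illustrative:

```python
import math


def joint_errors(pred, target):
    """Euclidean distance (in mm) between matching predicted and target joints."""
    return [math.dist(p, t) for p, t in zip(pred, target)]


def pck(errors, threshold=150.0):
    """Percentage of joints with error at or below the threshold."""
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)


def auc(errors, max_threshold=150.0, step=5.0):
    """Mean PCK over thresholds 0, step, ..., max_threshold."""
    count = int(max_threshold / step) + 1
    thresholds = [i * step for i in range(count)]
    return sum(pck(errors, t) for t in thresholds) / len(thresholds)


# Made-up example: three joints with errors of 0, 100 and 200 mm.
pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
target = [(0.0, 0.0, 0.0)] * 3
errors = joint_errors(pred, target)
print(pck(errors), auc(errors))
```

In practice the errors from all joints and frames of the test set are pooled before the two metrics are computed.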
According to run_poseformer.py, lines 287 and 337 contain "inputs_3d[:, :, 0] = 0".
What does "inputs_3d[:, :, 0] = 0" mean?
Performance gets worse after removing it, but I can't understand why; it doesn't seem to have anything to do with training the model. I only know that the "MPJPE" calculated this way is not the same.
Thank you for your reply.
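Assuming `inputs_3d` has shape (batch, frames, joints, 3) as in VideoPose3D-style code, the third axis indexes joints, so `inputs_3d[:, :, 0] = 0` overwrites joint 0 (typically the root or pelvis) in every frame of every sequence. If the other joints were already stored relative to the root during data preparation, this drops the global trajectory so that MPJPE is measured on root-relative poses. A small sketch of that idea with nested lists and made-up coordinates:

```python
def zero_root_joint(batch):
    """Return a copy of batch with joint 0 set to the origin in every frame.

    batch[b][f][j] is an (x, y, z) tuple; joint 0 is assumed to be the root.
    """
    return [
        [[(0.0, 0.0, 0.0)] + list(frame[1:]) for frame in sequence]
        for sequence in batch
    ]


# One sequence, one frame, three joints. Joint 0 holds the global root position;
# joints 1 and 2 are assumed to be stored relative to the root already.
batch = [[[(1.5, 0.2, 4.0), (0.1, -0.3, 0.0), (-0.1, -0.3, 0.0)]]]
root_relative = zero_root_joint(batch)
print(root_relative)
```

If the root keeps its global position in the target while the model predicts root-relative poses, the root joint alone contributes a large error, which would be consistent with the worse numbers reported above.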
Why is the model code for training on the Human3.6M dataset different from the model code for training on the MPI-INF-3DHP dataset?