
Comments (6)

Tetsujinfr commented on June 5, 2024

After running a few tests, with lots of failures, I would like to share some observations for others trying to run this model on videos in the wild (some points may not be related to your problem, Michalis):

  • The current code stops if there is not strictly 1 person detected in a given frame, so transitions, text screens or crowds will halt everything. That's understandable, but the code needs to be tweaked if you don't want the computation to stop abruptly.
  • It seems to me that it fails if at least one hand with fingers is not detected, but I am happy to be corrected. At a minimum, the model is currently very sensitive to hand and finger detection. All clips in the paper's video show characters with their hands visible at all times and in crisp/sharp conditions (no or little blur).
  • The quality of video in the wild is often limited by compression, so the OpenPose detection may not recover full face/finger detail, and then this pose estimation model will not work or will not converge properly.
  • The model does not necessarily work for body poses that are not facing the camera, e.g. people showing their profile or their back to the camera.
  • To run the model fit on the full skeleton, the character must be fully in frame at all times, while in most videos the camera will at some point crop the lower body, forcing you to cut the video and run the full fit on one part and the upper-body model on the other.

So you can still get the code to work for videos in the wild, but the points above rule out a ton of candidates, and for the remaining ones you may need to edit heavily to strip out the sequences that break the code (intro/transition screens, sequences with no people or more than one person); a pre-filtering sketch follows below.
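
For anyone who wants to do that pre-filtering automatically, here is a minimal sketch, not part of the repository: it only assumes the standard OpenPose JSON output format, where each frame's file has a "people" array and, when hand detection is enabled, per-person "hand_left_keypoints_2d"/"hand_right_keypoints_2d" flat lists of x, y, confidence triples. The directory name and thresholds are placeholders.

```python
import json
from pathlib import Path

def frame_is_usable(json_path, min_hand_conf=0.2, min_hand_joints=10):
    """Return True if a frame has exactly one person with both hands detected."""
    data = json.loads(Path(json_path).read_text())
    people = data.get("people", [])
    if len(people) != 1:               # transition screens, crowds, empty frames
        return False
    person = people[0]
    for key in ("hand_left_keypoints_2d", "hand_right_keypoints_2d"):
        kps = person.get(key, [])
        # keypoints are stored as a flat [x, y, confidence, x, y, confidence, ...] list
        confident = sum(1 for c in kps[2::3] if c > min_hand_conf)
        if confident < min_hand_joints:
            return False
    return True

# Example: list the frames worth keeping before running the fitting stage.
keypoint_dir = Path("openpose_output")   # hypothetical OpenPose JSON directory
usable = [p.name for p in sorted(keypoint_dir.glob("*_keypoints.json"))
          if frame_is_usable(p)]
print(f"{len(usable)} usable frames")
```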

Btw, great piece of code anyway, love the comments, and so impressed by what you guys are doing with OpenPose as a backbone. When it works, the face and hand fitting is amazing!

Regarding performance, on my Core i5 with a 980 Ti it takes approximately the following per 1080p frame:

  • OpenPose: 0.3 s
  • PAF: 5 s
  • raw full body/face/hands fit: 6.7 s
  • tracked full body/face/hands fit: 27 s

(I did not time the ffmpeg video-to-frame extraction because it is fast and not very relevant here.)
So the total processing time per frame on my machine is approximately 12 s for the non-tracked model (0.3 + 5 + 6.7) and 32 s for the tracked version (0.3 + 5 + 27).

Below is a render where everything works: great face and hands/arms fit. The body pose is not ideal (it should be turned about 70° to the left instead of facing the camera), but I had to run the upper-body model, so that is understandable:
[render screenshot]


MichalisLazarou commented on June 5, 2024

Also, when I run OpenPose on this specific video it seems to work well, detecting the pose for the whole body.


xiangdonglai commented on June 5, 2024

I can think of 2 possible problems with this output, both related to the resolution of your input video. This input video seems to be very low resolution, so:
(1) Given the current way we estimate the absolute translation of the person in 3D space, the estimated person would end up extremely far away from the camera (very large z value), outside the rendering range of OpenGL. This means the person is there but won't be rendered by OpenGL. To fix this, instead of putting the image in the top-left corner, you can try resizing the video to 1080 in height or 1920 in width (whichever fits) and then feeding it into our pipeline (see the resizing sketch below).
(2) The input resolution definitely has an influence on the performance of our method, which is in general true for any computer vision algorithm. It is very unlikely that our method will recover hand pose at this resolution (we mention this in the discussion section of our paper), and our body network possibly won't work perfectly in this case either (but you should still see a person there; the current problem is definitely the distance, as explained in point 1).
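
As a concrete illustration of the resize suggested in point (1), here is a minimal sketch, not part of the repository, that upscales already-extracted frames so they fit within 1920x1080 while preserving the aspect ratio; the directory names are placeholders, and it assumes the frames were extracted as PNGs (the same idea can be applied to the video itself with ffmpeg's scale filter).

```python
import cv2
from pathlib import Path

MAX_W, MAX_H = 1920, 1080            # target box: 1920 wide or 1080 tall, whichever fits

src = Path("frames_lowres")          # hypothetical directory of extracted frames
dst = Path("frames_resized")
dst.mkdir(exist_ok=True)

for img_path in sorted(src.glob("*.png")):
    img = cv2.imread(str(img_path))
    h, w = img.shape[:2]
    s = min(MAX_W / w, MAX_H / h)    # scale so width reaches 1920 or height reaches 1080
    resized = cv2.resize(img, (round(w * s), round(h * s)),
                         interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(str(dst / img_path.name), resized)
```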


xiangdonglai commented on June 5, 2024

Please refer to here for what I mean about the rendering range of OpenGL.
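
For intuition about what "rendering range" means here, the following is a minimal sketch, not taken from the repository: it builds a standard OpenGL-style perspective projection (the z_far parameter plays the role of the Z_Max asked about in the next comment) and shows that a point farther than the far clipping plane falls outside normalized device coordinates and is therefore not drawn.

```python
import numpy as np

def perspective(fovy_deg, aspect, z_near, z_far):
    """Standard OpenGL-style perspective projection matrix."""
    f = 1.0 / np.tan(np.radians(fovy_deg) / 2.0)
    return np.array([
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (z_far + z_near) / (z_near - z_far),
         2.0 * z_far * z_near / (z_near - z_far)],
        [0.0, 0.0, -1.0, 0.0],
    ])

P = perspective(fovy_deg=45.0, aspect=16 / 9, z_near=0.1, z_far=100.0)

for depth in (50.0, 200.0):                     # distance in front of the camera
    p_eye = np.array([0.0, 0.0, -depth, 1.0])   # OpenGL camera looks down -z
    clip = P @ p_eye
    ndc_z = clip[2] / clip[3]                   # perspective divide
    rendered = -1.0 <= ndc_z <= 1.0             # outside [-1, 1] gets clipped
    print(f"depth={depth:6.1f}  ndc_z={ndc_z:+.4f}  rendered={rendered}")

# A person at depth 200 with z_far=100 is clipped away; raising z_far (Z_Max)
# extends the rendering range, at the cost of depth-buffer precision.
```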


Tetsujinfr commented on June 5, 2024

Assuming there is some minimum decent resolution, if one increases Z_Max, would that better capture people who are a bit farther away from the camera, without compromising the tracking?


xiangdonglai commented on June 5, 2024

Thank you for your interest in our code and for the great analysis of the results. Your comments are very true in general. Our code only works when the details of the hands are clearly visible in the images (a good test is to check whether OpenPose correctly produces its output). Trying to predict reasonable output in the blurry case is beyond the scope of this paper, as an optimization-based method will never be able to handle that scenario.

