
Comments (6)

Tetsujinfr commented on June 5, 2024

After running a few tests, with lots of failures, I would like to share some observations for others trying to run this model on videos in the wild (some points may not be related to your problem, Michalis):

  • The current code stops if there is not strictly 1 person detected in a given frame, so transitions, text screens or crowds will halt everything. That's understandable, but the code needs to be tweaked if you don't want the computation to stop abruptly.
  • It seems to me that it fails if at least one hand with fingers is not detected, but I am happy to be corrected. At a minimum, the model is currently very sensitive to hand and finger detection. All clips in the paper's video show characters with their hands visible at all times and in crisp/sharp conditions (no or little blur).
  • The quality of video in the wild is often limited by compression, so the OpenPose detection may not recover full face/finger detail, and then this pose estimation model will not work or will not converge properly.
  • The model does not necessarily work for body poses that are not facing the camera, e.g. people showing their profile or their back to the camera.
  • To run the model fit on the full skeleton, the character must be fully in frame at all times, while in most videos the camera will at some point crop the lower body, forcing you to cut the video and run the full fit on one part and the upper-body model on the other.

So you can still get the code to work for videos in the wild, but the points above rule out a ton of candidates, and for the remaining ones you may need to edit heavily to strip out the sequences that break the code (intro/transition screens, sequences with no people or more than one person); a pre-filtering sketch follows below.
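
For anyone who wants to do that pre-filtering automatically, here is a minimal sketch, not part of the repository: it only assumes the standard OpenPose JSON output format, where each frame's file has a "people" array and, when hand detection is enabled, per-person "hand_left_keypoints_2d"/"hand_right_keypoints_2d" flat lists of x, y, confidence triples. The directory name and thresholds are placeholders.

```python
import json
from pathlib import Path

def frame_is_usable(json_path, min_hand_conf=0.2, min_hand_joints=10):
    """Return True if a frame has exactly one person with both hands detected."""
    data = json.loads(Path(json_path).read_text())
    people = data.get("people", [])
    if len(people) != 1:               # transition screens, crowds, empty frames
        return False
    person = people[0]
    for key in ("hand_left_keypoints_2d", "hand_right_keypoints_2d"):
        kps = person.get(key, [])
        # keypoints are stored as a flat [x, y, confidence, x, y, confidence, ...] list
        confident = sum(1 for c in kps[2::3] if c > min_hand_conf)
        if confident < min_hand_joints:
            return False
    return True

# Example: list the frames worth keeping before running the fitting stage.
keypoint_dir = Path("openpose_output")   # hypothetical OpenPose JSON directory
usable = [p.name for p in sorted(keypoint_dir.glob("*_keypoints.json"))
          if frame_is_usable(p)]
print(f"{len(usable)} usable frames")
```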

Btw, great piece of code anyway, love the comments, and so impressed by what you guys are doing with OpenPose as a backbone. When it works, the face and hand fitting is amazing!

Regarding performance, on my Core i5 with a 980 Ti it takes approximately the following per 1080p frame:

  • OpenPose: 0.3 s
  • PAF: 5 s
  • raw full body/face/hands fit: 6.7 s
  • tracked full body/face/hands fit: 27 s

(I did not time the ffmpeg video-to-frame extraction because it is fast and not very relevant here.)
So the total processing time per frame on my machine is approximately 12 s for the non-tracked model (0.3 + 5 + 6.7) and 32 s for the tracked version (0.3 + 5 + 27).

Below is a render where everything works: great face and hands/arms fit. The body pose is not ideal (it should be turned about 70° to the left instead of facing the camera), but I had to run the upper-body model, so that is understandable:
[render screenshot]


MichalisLazarou commented on June 5, 2024

Also, when I run OpenPose on this specific video it seems to work well, detecting the pose for the whole body.


xiangdonglai commented on June 5, 2024

I can think of 2 possible problems with this output, both related to the resolution of your input video. This input video seems to be very low resolution, so:
(1) Given the current way we estimate the absolute translation of the person in 3D space, the estimated person would end up extremely far away from the camera (very large z value), outside the rendering range of OpenGL. This means the person is there but won't be rendered by OpenGL. To fix this, instead of putting the image in the top-left corner, you can try resizing the video to 1080 in height or 1920 in width (whichever fits) and then feeding it into our pipeline (see the resizing sketch below).
(2) The input resolution definitely has an influence on the performance of our method, which is in general true for any computer vision algorithm. It is very unlikely that our method will recover hand pose at this resolution (we mention this in the discussion section of our paper), and our body network possibly won't work perfectly in this case either (but you should still see a person there; the current problem is definitely the distance, as explained in point 1).
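
As a concrete illustration of the resize suggested in point (1), here is a minimal sketch, not part of the repository, that upscales already-extracted frames so they fit within 1920x1080 while preserving the aspect ratio; the directory names are placeholders, and it assumes the frames were extracted as PNGs (the same idea can be applied to the video itself with ffmpeg's scale filter).

```python
import cv2
from pathlib import Path

MAX_W, MAX_H = 1920, 1080            # target box: 1920 wide or 1080 tall, whichever fits

src = Path("frames_lowres")          # hypothetical directory of extracted frames
dst = Path("frames_resized")
dst.mkdir(exist_ok=True)

for img_path in sorted(src.glob("*.png")):
    img = cv2.imread(str(img_path))
    h, w = img.shape[:2]
    s = min(MAX_W / w, MAX_H / h)    # scale so width reaches 1920 or height reaches 1080
    resized = cv2.resize(img, (round(w * s), round(h * s)),
                         interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(str(dst / img_path.name), resized)
```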


xiangdonglai commented on June 5, 2024

Please refer to here for what I mean about the rendering range of OpenGL.
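
For intuition about what "rendering range" means here, the following is a minimal sketch, not taken from the repository: it builds a standard OpenGL-style perspective projection (the z_far parameter plays the role of the Z_Max asked about in the next comment) and shows that a point farther than the far clipping plane falls outside normalized device coordinates and is therefore not drawn.

```python
import numpy as np

def perspective(fovy_deg, aspect, z_near, z_far):
    """Standard OpenGL-style perspective projection matrix."""
    f = 1.0 / np.tan(np.radians(fovy_deg) / 2.0)
    return np.array([
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (z_far + z_near) / (z_near - z_far),
         2.0 * z_far * z_near / (z_near - z_far)],
        [0.0, 0.0, -1.0, 0.0],
    ])

P = perspective(fovy_deg=45.0, aspect=16 / 9, z_near=0.1, z_far=100.0)

for depth in (50.0, 200.0):                     # distance in front of the camera
    p_eye = np.array([0.0, 0.0, -depth, 1.0])   # OpenGL camera looks down -z
    clip = P @ p_eye
    ndc_z = clip[2] / clip[3]                   # perspective divide
    rendered = -1.0 <= ndc_z <= 1.0             # outside [-1, 1] gets clipped
    print(f"depth={depth:6.1f}  ndc_z={ndc_z:+.4f}  rendered={rendered}")

# A person at depth 200 with z_far=100 is clipped away; raising z_far (Z_Max)
# extends the rendering range, at the cost of depth-buffer precision.
```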


Tetsujinfr commented on June 5, 2024

Assuming there is some minimum decent resolution, if one increases Z_Max, would that better capture people who are a bit farther away from the camera, without compromising the tracking?


xiangdonglai commented on June 5, 2024

Thank you for your interest in our code and for the great analysis of the results. Your comments are very true in general. Our code only works when the details of the hands are clearly visible in the images (a good test is to check whether OpenPose correctly produces its output). Trying to predict reasonable output in the blurry case is beyond the scope of this paper, as an optimization-based method will never be able to handle that scenario.

