Reproducing results of Table 7 in paper (3D hand pose estimation)! about dex-ycb-toolkit HOT 8 CLOSED

nitba commented on August 15, 2024

Reproducing results of Table 7 in paper (3D hand pose estimation)!

from dex-ycb-toolkit.

Comments (8)

nitba commented on August 15, 2024

Thanks for your comments, @ychao-nvidia.
The paper did not mention that, for each test sample, you assume the GT bounding box coordinates and camera intrinsics are available when reporting the absolute errors.
Regarding the 100 mm absolute error, I ran my experiments assuming that bounding boxes are not available for test samples!

nitba commented on August 15, 2024

Hi @zc-alexfan

Do you have any idea about my question?

#3 (comment)

zc-alexfan commented on August 15, 2024

Sorry, I did not use it.

nitba commented on August 15, 2024

Hi @umariqb,

I would appreciate your comments on my issue.

ychao-nvidia commented on August 15, 2024

Answers to your questions:

Yes, we followed [31] for the baselines reported in Tab. 7. Therefore, yes, the input to the network is a 128 x 128 RGB image cropped around the bounding box.
We report absolute error by computing the 3D distance between the ground-truth and predicted joint positions.
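As a rough illustration of the absolute error described above, here is a minimal sketch that averages the per-joint Euclidean distance between predicted and ground-truth 3D joints. The function name `mean_joint_error` and the assumption that both arrays share the same units (e.g. mm) and camera frame are mine, not taken from the toolkit.

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints.

    pred, gt: arrays of shape (num_joints, 3) in the same units and
    the same (camera) coordinate frame. Hypothetical helper, not the
    toolkit's evaluation code.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-joint 3D distance, then average over joints.
    per_joint = np.linalg.norm(pred - gt, axis=-1)
    return float(per_joint.mean())
```

No root alignment is applied here, which is what makes the error "absolute" rather than root-relative.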

Two additional comments regarding [31] and absolute error:

[31] predicts a special "2.5D representation" (see [18]) for the hand pose, and then uses the bounding box coordinates and the camera intrinsics to convert this "2.5D representation" to the 3D pose. That said, the task of the network is only to predict (1) the 2D locations of keypoints within the input image and (2) the root-relative depths of keypoints (see [18]), which is reasonable for a cropped image input. With this "2.5D representation", converting to 3D pose is well-posed (see [18]).
Consequently, the input for this benchmark (for RGB-only) should really be (1) the full RGB image, (2) the bounding box coordinates, and (3) the camera intrinsics, not just the cropped image itself. You can see [31] as using (1) and (2) to get the input of their network (i.e., the cropped image).
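To make the role of the bounding box and the intrinsics concrete, here is a simplified sketch of lifting crop-space keypoints to 3D. It assumes the absolute depth of each joint is already known (for example, root depth plus root-relative depth); recovering the root depth from the 2.5D representation, as done in [18], is not shown. The function names, the `(x0, y0, x1, y1)` box layout, and the pinhole parameters `fx, fy, cx, cy` are illustrative assumptions.

```python
import numpy as np

def crop_to_image_coords(kp_crop, box, crop_size=128):
    """Map 2D keypoints from crop pixels back to full-image pixels.

    box: (x0, y0, x1, y1) of the cropped region in the full image
    (assumed layout). crop_size: side of the network input in pixels.
    """
    x0, y0, x1, y1 = box
    scale = np.array([(x1 - x0) / crop_size, (y1 - y0) / crop_size])
    return np.asarray(kp_crop, dtype=float) * scale + np.array([x0, y0])

def backproject(kp_img, depth, fx, fy, cx, cy):
    """Pinhole back-projection of image keypoints with known depth.

    kp_img: (N, 2) pixel coordinates in the full image.
    depth:  (N,) absolute depth per joint (assumed given here).
    """
    kp_img = np.asarray(kp_img, dtype=float)
    depth = np.asarray(depth, dtype=float)
    x = (kp_img[:, 0] - cx) * depth / fx
    y = (kp_img[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)
```

Without the box, the crop-space keypoints cannot be placed in the full image, and without the intrinsics they cannot be back-projected, which is why a cropped image alone is not enough to report absolute 3D error.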

namepllet commented on August 15, 2024

Hi @ychao-nvidia, I'd like to compare my results on the DexYCB dataset to your results (Table 7, 3D hand pose estimation).

It seems the ground-truth bounding box for the hand is not available at test time,

so which bounding box (maybe one derived from the 2D joint coordinates, or a detected bounding box?) did you use to crop the hand image at test time?

ychao-nvidia commented on August 15, 2024

As mentioned above, for hand pose estimation we assume the bounding box is given at test time.

We calculated a tight bounding box by [min(X), min(Y), max(X), max(Y)], where X (Y) is the 2D x (y) coordinates of the 21 hand joints provided in the ground truths.

We then cropped a square image region (1) with a center shared with this tight bounding box and (2) with a side length of l, where 0.7*l=max(w, h) and w and h are the width and height of the tight bounding box. We used this cropped image region as the input to our network.
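The cropping rule above can be sketched as follows. This is only an illustration of the stated formula: the function name `square_crop_box` and the `(x0, y0, x1, y1)` return layout are assumptions, and handling of crops that extend past the image border, as well as the resize to the network input size, is not covered.

```python
import numpy as np

def square_crop_box(joints_2d, ratio=0.7):
    """Square crop region around the 2D hand joints.

    joints_2d: (21, 2) array of ground-truth 2D joint coordinates.
    ratio: the tight box's longer side equals ratio * side length,
           i.e. 0.7 * l = max(w, h) as described in the thread.
    Returns (x0, y0, x1, y1) of the square region (assumed layout).
    """
    joints_2d = np.asarray(joints_2d, dtype=float)
    # Tight box: [min(X), min(Y), max(X), max(Y)]
    x_min, y_min = joints_2d.min(axis=0)
    x_max, y_max = joints_2d.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    side = max(w, h) / ratio
    # The square shares its center with the tight box.
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half = side / 2.0
    return (cx - half, cy - half, cx + half, cy + half)
```

For example, a tight box of width 70 and height 35 gives a square of side 70 / 0.7 = 100 centered on the tight box.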

namepllet commented on August 15, 2024

Thanks for the clear comments!
