
daveredrum / scanrefer

218.0 9.0 29.0 37.26 MB

[ECCV 2020] ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

Home Page: https://daveredrum.github.io/ScanRefer/

License: Other

Python 92.32% C++ 2.94% Cuda 4.33% C 0.41%
eccv computer-vision natural-language-processing 3d pytorch dataset deep-learning point-cloud visual-grounding

scanrefer's Issues

Requesting for download link

Thank you very much for making the dataset available for download! I hope to use the ScanRefer dataset for my research. However, I didn't receive any e-mail after I filled out the ScanRefer dataset form a few days ago. I tried sending an e-mail to [email protected], but the server bounced it. Please send me a reply; my e-mail address is [email protected].
Best wishes.

Question about the language-to-object classification loss (lang_loss)

Hi Zhenyu,

Thanks for your good work. However, I have a question about training the network.

[Plots of lang_loss and lang_acc attached]

First, I trained a model from scratch with the command python script/train.py --use_multiview --use_normal, which enables the language-to-object classification loss (lang_loss).

The lang_loss and lang_acc curves are shown in the attached plots.

It is very strange that the language loss on the validation set is much lower than the loss on the training set, which violates the basic principles of machine learning.

Also, the lang_acc is much higher on the validation set than on the training set, which confuses me even more.

Could you look into this problem? It would be of great help.
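For reference, a minimal sketch of how a language-to-object classification accuracy like lang_acc is typically computed (hypothetical tensors, not the repository's exact code):

import torch

def lang_cls_accuracy(lang_logits: torch.Tensor, object_cat: torch.Tensor) -> float:
    # lang_logits: (batch, num_classes) scores from the language classifier
    # object_cat:  (batch,) ground-truth semantic class of the referred object
    preds = lang_logits.argmax(dim=-1)                 # predicted class per description
    return (preds == object_cat).float().mean().item()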

About the multiview image selection

Hi Dave,

Thank you for your great work. I noticed that you used the ScanNet images from Ji Hou's 3DSIS for the multiview features.

The original paper says, "we use 5 RGB images for each training chunk (with image selection based on average 3D instance coverage)." I want to ask: do these images cover the whole scene, or do they cover all the objects mentioned in the language utterance, so that the target object can be identified from the images? Do you have any analysis of these images?

I want to tackle the task in an RGB-D-only setting. Do you have any idea how many frames are needed for the task? Thank you for your answer.
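For context, a rough sketch of how frames could be ranked by 3D instance coverage to pick a handful of views per chunk (a hypothetical greedy helper, not the 3DSIS selection code):

from typing import Dict, List, Set

def select_frames_by_coverage(frame_to_instances: Dict[str, Set[int]], k: int = 5) -> List[str]:
    # frame_to_instances maps a frame id to the set of 3D instance ids visible in it
    # (e.g. obtained by projecting the labeled mesh into each camera view)
    selected: List[str] = []
    covered: Set[int] = set()
    remaining = dict(frame_to_instances)
    for _ in range(k):
        if not remaining:
            break
        best = max(remaining, key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break                                      # no frame adds new instances
        covered |= remaining.pop(best)
        selected.append(best)
    return selected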

#6

The scene and the predicted boxes are unaligned

Hi all,
We tried to visualize the boxes and the scene together using the open3d package. We first generated all the box PLY files by running the predict function, and then we loaded the scene PLY file (scene0011_00_vh_clean_2.ply).

Please check the image below.
[Screenshot attached showing the scene and the predicted boxes misaligned]
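One common cause of this kind of offset (an assumption on my part, not a confirmed diagnosis) is that the predicted boxes live in ScanNet's axis-aligned coordinates while scene0011_00_vh_clean_2.ply is stored in the raw scan frame. A sketch of applying the axisAlignment matrix from the scene meta file before overlaying the boxes:

import numpy as np
import open3d as o3d

def load_axis_alignment(meta_file):
    # read the 4x4 axisAlignment matrix from the ScanNet scene<id>.txt meta file
    with open(meta_file) as f:
        for line in f:
            if line.startswith("axisAlignment"):
                values = [float(x) for x in line.split("=")[1].split()]
                return np.array(values).reshape(4, 4)
    return np.eye(4)  # fall back to identity if the entry is missing

mesh = o3d.io.read_triangle_mesh("scene0011_00_vh_clean_2.ply")
mesh.transform(load_axis_alignment("scene0011_00.txt"))  # move the raw scan into the axis-aligned frame
# ...then add the predicted box geometries and call o3d.visualization.draw_geometries([...])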

Dataset application

Hi, I have filled out the form and sent it to you by e-mail. Could you please reply? Thank you.

Evaluation degradation w.r.t. runtime and memory usage

Dear authors,

Thank you very much for your amazing work!
I have one question regarding evaluation performance: I noticed that the evaluation step takes almost as long as training for one epoch, which seems odd to me.
I also noticed that the same amount of GPU memory is reserved during both phases, i.e., training and evaluation.
It therefore seems that the evaluation loop could be optimized further.
I tried to optimize it myself, but unfortunately I couldn't find the root cause of this runtime degradation.

Thanks in advance!
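Not a definitive fix, but a common source of this pattern is running the validation loop with autograd enabled; a minimal sketch of wrapping evaluation in eval mode and torch.no_grad() to cut the memory footprint:

import torch

def evaluate(model, dataloader, device="cuda"):
    model.eval()                                  # disable dropout / use running BN statistics
    with torch.no_grad():                         # do not build the autograd graph during evaluation
        for data_dict in dataloader:
            data_dict = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in data_dict.items()}
            _ = model(data_dict)                  # forward pass only; compute metrics here
    model.train()                                 # restore training mode afterwards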

How does the evaluation work?

I notice that the definition of ref_acc (line 89, lib/eval_helper.py) checks whether the selected bounding box matches the proposal that has the maximum IoU with the target box.

However, in my understanding, the expected output of 3D visual grounding is a single bounding box for the given scene and language query. Is this metric therefore only an intermediate evaluation rather than the final one?
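For comparison, the final grounding metric ([email protected]/0.5IoU) is usually computed from the single highest-scoring box per query; a simplified sketch with hypothetical inputs, not the repository's eval_helper code:

import numpy as np

def grounding_accuracy(pred_corners, ref_scores, gt_corners, iou_fn, thresholds=(0.25, 0.5)):
    # pred_corners: (B, K, 8, 3) proposal box corners, ref_scores: (B, K) confidence per proposal,
    # gt_corners: (B, 8, 3) ground-truth box corners, iou_fn: callable returning a scalar 3D IoU
    hits = {t: 0 for t in thresholds}
    for boxes, scores, gt in zip(pred_corners, ref_scores, gt_corners):
        best = boxes[int(np.argmax(scores))]      # the single box the model commits to
        iou = iou_fn(best, gt)
        for t in thresholds:
            hits[t] += int(iou >= t)
    return {t: hits[t] / len(pred_corners) for t in thresholds}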

Evaluation

Hello,

I ran your code and am now running the evaluation. I see that there are two outputs for each case, i.e., iou_rate_0.25 and iou_rate_0.5. Do the results for iou_rate_0.25 and iou_rate_0.5 correspond to the results in Table 2?

"Unique" subset for validation

Hi @daveredrum ,

Hope you are doing fine.

Great work! I was wondering how to tell which samples fall into the "unique" split and which into the "multiple" one, as reported in Table 5 for example.

Would appreciate any pointers.

~Cheers!
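In case it helps others landing here, the paper describes "unique" samples as those whose target is the only object of its semantic class in the scene; a sketch of splitting samples that way (field names assumed from the ScanRefer JSON format):

from collections import defaultdict

def split_unique_multiple(samples):
    # samples: ScanRefer-style dicts with at least 'scene_id', 'object_id' and 'object_name'
    objects_per_class = defaultdict(set)
    for s in samples:
        objects_per_class[(s["scene_id"], s["object_name"])].add(s["object_id"])

    unique, multiple = [], []
    for s in samples:
        count = len(objects_per_class[(s["scene_id"], s["object_name"])])
        (unique if count == 1 else multiple).append(s)
    return unique, multiple

Note that this only counts objects mentioned in the ScanRefer annotations; the official split may instead count objects from the ScanNet instance labels.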

Something about data preparation

Dear author, I'm interested in your ScanRefer.
I have downloaded the ScanNet dataset but don't know where to get ScanRefer_filtered_train.json and ScanRefer_filtered_val.json.
Can you give me a hand? Thanks a lot.

Cannot reproduce the detection results

Hi Zhenyu,

Thanks for your wonderful work. However, I have a question about running your code.

First, I trained a model from scratch with the command python script/train.py --use_color. Then I tested the model using script/eval.py. I can reproduce the grounding results, but the detection mAP is 23.67, whereas the corresponding result in your Supplementary Material, Table 8 [h], is 29.91. The difference is quite large.

I also tested the pre-trained model provided by the VoteNet repo, which achieves 28.57. That is also lower than the 30.08 in Table 8 [a], but this difference is much smaller.

Could you look into this problem? It would be of great help.

Running into a problem when visualizing

Hi @daveredrum ,

I need some help here. After successfully training and evaluating with the commands
python scripts/train.py --use_color --batch_size 16 --gpu 1 and
python scripts/eval.py --folder 2020-12-30_16-34-15 --reference --use_color --no_nms --force --repeat 5 --gpu 1,
when I run
python scripts/visualize.py --folder 2020-12-30_16-34-15 --scene_id scene0011_00 --use_color --gpu 1
I get the following error message:

Traceback (most recent call last):
  File "scripts/visualize.py", line 471, in <module>
    visualize(args)
  File "scripts/visualize.py", line 445, in visualize
    dump_results(args, scanrefer, data, DC)
  File "scripts/visualize.py", line 351, in dump_results
    pred_ref_scores_softmax = F.softmax(data["cluster_ref"] * torch.argmax(data['objectness_scores'], 2).float() * data['pred_mask'], dim=1).detach().cpu().numpy()
KeyError: 'pred_mask'.

I've tried setting breakpoints and found that the variable 'data' indeed doesn't have the key 'pred_mask'. I have no idea how to deal with this problem.
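Not the author's intended fix, just a guess: 'pred_mask' looks like the NMS mask that parse_predictions() in lib/ap_helper.py normally writes into the data dict, so a workaround (purely as a sketch) would be to make sure that step runs before dump_results, or to fall back to an all-ones mask:

import torch

# hypothetical guard before the softmax line in dump_results:
if "pred_mask" not in data:
    # assume every proposal survives NMS when no mask was produced upstream
    data["pred_mask"] = torch.ones_like(data["objectness_scores"][..., 0])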

I've also attempted to use the pretrained model. However, I get a RuntimeError telling me that the model.pth file in XYZ_COLOR_LANGCLS is a zip archive, and the program just stops. Looking forward to your reply.

Thank you!

Question about get_3d_box(...) function in utils/box_util.py

Hi, @daveredrum,

I have a question about the implementation of get_3d_box(...) in utils/box_util.py https://github.com/daveredrum/ScanRefer/blob/a728f21feb73b8f491b7be50bd46d5c390d16fa7/utils/box_util.py#L290

I see that you changed the order of channels between box_size and center.

However, the get_3d_box() function is also called from parse_predictions() in lib/ap_helper.py https://github.com/daveredrum/ScanRefer/blob/a728f21feb73b8f491b7be50bd46d5c390d16fa7/lib/ap_helper.py#L40

In parse_predictions(), before get_3d_box() is called, pred_center has already been transformed into pred_center_upright_camera by flip_axis_to_camera(), which swaps the Y and Z axes of the point cloud data.

So I'm a little confused about the get_3d_box() implementation and hope for your explanation of the change compared with the original VoteNet.

Could you look into this problem? It would be of great help.
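For readers trying to follow the coordinate conventions being discussed, the VoteNet-style depth-to-camera flip looks roughly like this (an adapted sketch, not the exact repository code):

import numpy as np

def flip_axis_to_camera(pc):
    # convert points from X-right / Y-forward / Z-up (depth/world frame)
    # to X-right / Y-down / Z-forward (upright camera frame)
    flipped = pc.copy()
    flipped[..., [0, 1, 2]] = pc[..., [0, 2, 1]]  # swap the Y and Z channels
    flipped[..., 1] *= -1                         # the new Y axis points downward
    return flipped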

Requesting for download link

Hi, I filled out the form requesting access to the ScanRefer dataset last week. Could you please reply? Best regards.

GPU memory

Dear author:

Could I use a single 1080 Ti with 12GB of memory to reproduce your method? And how long does your model take to train?

Camera viewpoint correspondence to the object description

Hi Dave,

I was wondering whether there is a link between a ScanRefer reference query and a camera viewpoint for answering it. For example, if the query is "there is a fridge to the left of the bed", is this query attached to a particular camera viewpoint that contains both the fridge and the bed, so that the query can be grounded? Are the queries linked to camera viewpoints in some way, or is there an easy way to find that viewpoint?

To give you some context, our work is based on RGB-D images, so at test time our model expects an RGB-D image from the relevant camera viewpoint together with the query that needs to be answered.

Thank you

Visualization

[Screenshot of my visualization result attached]
This is my visualization result. How can I obtain a visualization like the one in your demo?

Thank you very much.

Missing Camera Information for Test Set

Hi, thanks for releasing the camera information. Could you also release the camera information for the test split? This would be a reasonable input to assume for indoor robotic applications.

Input Dimension

Dear author,

When I run python scripts/train.py --use_color, the shape of data_dict['point_clouds'] is torch.Size([8, 40000, 7]). Why is it not torch.Size([8, 40000, 6])?
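A plausible breakdown, assuming the dataloader appends a height-above-floor feature (as in VoteNet-style pipelines) on top of XYZ and RGB; this is a guess about the seventh channel, not a confirmed answer:

import numpy as np

# hypothetical composition of the 7 per-point channels with --use_color:
xyz = np.random.rand(40000, 3)                        # point coordinates
rgb = np.random.rand(40000, 3)                        # colors enabled by --use_color
floor = np.percentile(xyz[:, 2], 0.99)                # approximate floor level
height = xyz[:, 2:3] - floor                          # height-above-floor feature
point_cloud = np.concatenate([xyz, rgb, height], axis=1)
print(point_cloud.shape)                              # (40000, 7) -> batched to torch.Size([8, 40000, 7])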

Data Preparation

Hello,

I have filled out the form and sent it to you. Have you received my completed dataset form?

Thank you

Some issues about the pre-trained VoteNet

Hello,

First, thanks for your awesome work. I really admire it.

Still, I have a question. I noticed that you reported the performance of your model with the pre-trained VoteNet. As you mentioned in a previous issue, you modified the PointNet++ backbone implementation. I wonder whether you could release the pre-trained VoteNet models with different input dimensions (e.g., xyz, xyz+rgb, xyz+rgb+normal, xyz+multiview, etc.).

I would be grateful for your help.

Download links are down

Hi,

All links are down, along with the online dataset browser. While you work on a fix, could you provide a temporary download link for the camera poses you recently released (maybe via Google Drive)?

An utterance refers to more than one object

As can be seen below, in scene scene0011_00 (which is in the val split), the utterance for one chair is "This is a brown chair. There are many identical chairs setting around the table it sets at."
Obviously, at least four chairs match this utterance. Such ambiguous descriptions in the training set may provide some supervision signal to help the model learn vision-language alignment, but encountering such ambiguous descriptions in the validation set does not help us evaluate the model's performance.

[Screenshot of the scene showing several chairs that match the description]
