daveredrum / scanrefer
[ECCV 2020] ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
Home Page: https://daveredrum.github.io/ScanRefer/
License: Other
Thank you very much for opening up access to the dataset! I hope to use the ScanRefer dataset for my research, but I didn't receive any e-mails after I filled out the ScanRefer dataset form a few days ago. I tried sending an e-mail to [email protected], but the server bounced it. Please send me a reply; my e-mail address is [email protected].
Best wishes.
Hi, zhenyu,
Thanks for your good work. However, I have a question about training the network.
First, I trained a model from scratch with the command python script/train.py --use_multiview --use_normal
, which enables the language-to-object classification loss (lang_loss).
The lang_loss and lang_acc are shown in the pictures.
It is very strange that the language loss on the validation set is much lower than the loss on the training set, which violates the basic principles of machine learning.
Also, lang_acc is much higher on the validation set than on the training set, which confuses me even more.
Can you look into this problem? It would be of great help.
Hi Dave,
Thank you for your great work. I noticed that you used the ScanNet images from Ji Hou's 3DSIS for the multiview features.
The original paper says, "we use 5 RGB images for each training chunk (with image selection based on average 3D instance coverage)." I want to ask: do these images cover the whole scene? Or do they cover all the objects mentioned in the language utterance, so that the target object can be identified based on the images? Do you have any analysis of these images?
I want to do the task in an RGB-D-only setting. Do you have any idea how many frames are needed for the task? Thank you for your answer.
Hi, I have filled out the form and sent it to you by email. Could you give me a reply? Thank you.
Dears,
Thank you very much for your amazing work!
I have one question regarding evaluation performance: I noticed that the evaluation step takes almost as long as training one epoch, which seems odd to me.
I also noticed that the same amount of GPU memory is reserved during both phases, i.e., training and testing.
Thus, I feel the evaluation loop could be optimized further.
I tried to optimize it myself, but unfortunately I couldn't find the root cause of this runtime degradation.
Thanks in advance!
I noticed that the definition of ref_acc (line 89, lib/eval_helper.py) checks whether the selected bounding box matches the predicted box with maximum IoU against the target box.
However, in my understanding, the expected output of 3D visual grounding is to generate only one bounding box with respect to the input scene and language query. Thus, isn't this metric only an intermediate evaluation rather than the final one?
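To make the point concrete, here is a minimal sketch of the metric as I understand it, using axis-aligned boxes; aabb_iou3d and ref_acc are hypothetical helpers written for illustration, not the repo's actual implementation. The selected proposal counts as correct only if it is the proposal with the highest IoU against the ground-truth box:

```python
import numpy as np

def aabb_iou3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    min_corner = np.maximum(box_a[:3], box_b[:3])
    max_corner = np.minimum(box_a[3:], box_b[3:])
    inter = np.clip(max_corner - min_corner, 0, None).prod()  # overlap volume
    vol_a = (box_a[3:] - box_a[:3]).prod()
    vol_b = (box_b[3:] - box_b[:3]).prod()
    return inter / (vol_a + vol_b - inter)

def ref_acc(selected_idx, pred_boxes, gt_box):
    """1.0 iff the selected proposal is the one with maximum IoU w.r.t. the GT box."""
    ious = [aabb_iou3d(b, gt_box) for b in pred_boxes]
    return float(selected_idx == int(np.argmax(ious)))
```

Under this reading, the model could pick the best available proposal (ref_acc = 1) while that proposal still has low absolute IoU with the ground truth, which is why it feels like an intermediate rather than final metric.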
Hello,
I ran your code and am now running the evaluation. I see there are two outputs for each case, i.e., iou_rate_0.25 and iou_rate_0.5. Do the results for iou_rate_0.25 and iou_rate_0.5 correspond to the results in Table 2?
Hi @daveredrum ,
Hope you are doing fine.
Great work! I was wondering how to tell which samples fall in the "unique" split and which fall in the "multiple" one, as reported in Table 5, for example?
Would appreciate any pointers.
~Cheers!
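For what it's worth, here is a sketch of how the split could be derived from the released ScanRefer JSON, under my reading of the paper: a sample is "unique" if its scene contains exactly one object instance of the target class. Note this is an assumption; the official code additionally maps raw object names to ScanNet's 18-class label set, which this sketch skips.

```python
from collections import defaultdict

def tag_unique_multiple(samples):
    """Tag each ScanRefer sample as 'unique' or 'multiple' depending on whether
    its scene contains exactly one object instance of the target class.
    Simplified: compares raw object_name strings instead of mapped ScanNet labels."""
    # (scene_id, object_name) -> set of distinct object ids
    instances = defaultdict(set)
    for s in samples:
        instances[(s["scene_id"], s["object_name"])].add(s["object_id"])
    return [
        "unique" if len(instances[(s["scene_id"], s["object_name"])]) == 1 else "multiple"
        for s in samples
    ]
```

Running it over ScanRefer_filtered_val.json would then give a per-sample split label to aggregate the evaluation over.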
Dear author, I'm interested in your ScanRefer.
I have downloaded the ScanNet dataset but don't know where to get ScanRefer_filtered_train.json and ScanRefer_filtered_val.json.
Can you give me a hand? Thanks a lot.
Hi, zhenyu,
Thanks for your wonderful work. However, I have a question about running your code.
First, I trained a model from scratch with the command python script/train.py --use_color
. Then I tested the model using script/eval.py. I can reproduce the grounding result, but the detection mAP is 23.67, while the result in Table 8 [h] of your supplementary material is 29.91. The difference is quite large.
I also tested with the pre-trained model provided by the VoteNet repo, which gives 28.57. That is also lower than the 30.08 in Table 8 [a], but the difference is much smaller.
Can you look into this problem? It would be of great help.
Hi @daveredrum ,
I need some help here. After successfully training and evaluating with the commands
python scripts/train.py --use_color --batch_size 16 --gpu 1
and
python scripts/eval.py --folder 2020-12-30_16-34-15 --reference --use_color --no_nms --force --repeat 5 --gpu 1
,
when I run
python scripts/visualize.py --folder 2020-12-30_16-34-15 --scene_id scene0011_00 --use_color --gpu 1
I get the following error message:
Traceback (most recent call last):
  File "scripts/visualize.py", line 471, in <module>
    visualize(args)
  File "scripts/visualize.py", line 445, in visualize
    dump_results(args, scanrefer, data, DC)
  File "scripts/visualize.py", line 351, in dump_results
    pred_ref_scores_softmax = F.softmax(data["cluster_ref"] * torch.argmax(data['objectness_scores'], 2).float() * data['pred_mask'], dim=1).detach().cpu().numpy()
KeyError: 'pred_mask'
I've tried setting breakpoints and found that the variable 'data' indeed doesn't have the key 'pred_mask'. I have no idea how to deal with this problem.
I've also tried using the pretrained model, but I get a RuntimeError telling me that the model.pth file in XYZ_COLOR_LANGCLS is a zip archive, and the program just stops. Looking forward to your reply.
Thank you!
Hi, @daveredrum,
I have a question about the implementation of get_3d_box(...) in utils/box_util.py: https://github.com/daveredrum/ScanRefer/blob/a728f21feb73b8f491b7be50bd46d5c390d16fa7/utils/box_util.py#L290
I see that you changed the channel order between box_size and center.
However, get_3d_box() is also called in parse_predictions() in lib/ap_helper.py: https://github.com/daveredrum/ScanRefer/blob/a728f21feb73b8f491b7be50bd46d5c390d16fa7/lib/ap_helper.py#L40
In parse_predictions(), before get_3d_box() is called, pred_center has already been transformed into pred_center_upright_camera by flip_axis_to_camera(), which changes the Y/Z order of the point cloud data.
So I'm a little confused about the get_3d_box() implementation, and I hope you can explain the change compared with the original VoteNet.
Can you look into this problem? It would be of great help.
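For context, this is my understanding of the transform being discussed, a flip_axis_to_camera-style axis swap as in the original VoteNet utilities. It is sketched here from memory, so treat it as an assumption rather than the repo's exact code:

```python
import numpy as np

def flip_axis_to_camera(pc):
    """Flip from upright depth coords (X right, Y forward, Z up) to camera
    coords (X right, Y down, Z forward): cam (X, Y, Z) = depth (X, -Z, Y)."""
    pc2 = np.copy(pc)
    pc2[..., [0, 1, 2]] = pc2[..., [0, 2, 1]]  # swap the Y and Z channels
    pc2[..., 1] *= -1                          # negate the new Y (the old Z)
    return pc2

# maps [1, 2, 3] (depth coords) to [1, -3, 2] (camera coords)
out = flip_axis_to_camera(np.array([[1.0, 2.0, 3.0]]))
```

This Y/Z reordering before get_3d_box() is exactly why a matching reorder inside get_3d_box() (or the lack of one) changes which axis the heading angle and box extents refer to.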
Hi, I filled out the form requesting access to the ScanRefer dataset last week. Could you give me a reply? Best regards.
Dear author:
Could I reproduce your method with one 1080ti with 12GB of memory? And how long does your model take to train?
Hi Dave,
I was wondering if there is a link between a ScanRefer reference query and a camera viewpoint from which the question can be answered. For example, if the query is "there is a fridge to the left of the bed", is this query attached to a particular camera viewpoint that shows both the fridge and the bed, so that the query can be grounded? Are the queries linked to camera viewpoints in some way? Or is there an easy way to find that viewpoint?
To give you context, our work is based on RGB-D data, so at test time our model expects an RGB-D image from the relevant camera viewpoint together with the query to be answered.
Thank you
Hi, thanks for releasing the camera information. Can you also release the camera information for the test split? This would be a fair assumption to have for indoor robotic applications.
Dear author,
When I run python scripts/train.py --use_color, the shape of data_dict['point_clouds'] is torch.Size([8, 40000, 7]). Why is it not torch.Size([8, 40000, 6])?
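One plausible explanation, assuming ScanRefer follows the VoteNet convention of appending a height-above-floor feature to each point: xyz (3) + rgb (3) + height (1) = 7 channels. A hypothetical sketch of how that 7th channel appears:

```python
import numpy as np

# Sketch (assumption, not the repo's exact pipeline): combine xyz + rgb with a
# height feature, as the VoteNet data loaders do, yielding 7 channels per point.
num_points = 40000
xyz = np.random.rand(num_points, 3).astype(np.float32)
rgb = np.random.rand(num_points, 3).astype(np.float32)

# height above an estimated floor, e.g. a low percentile of the z values
floor = np.percentile(xyz[:, 2], 1)
height = (xyz[:, 2] - floor)[:, None]

point_cloud = np.concatenate([xyz, rgb, height], axis=1)
print(point_cloud.shape)  # (40000, 7)
```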
Hello,
I have filled out the form and sent it to you. Have you received my completed dataset form?
Thank you
Hello,
First, thanks for your awesome work. I really admire it.
Still, I have a question. I noticed that you reported the performance of your model with the pre-trained VoteNet. As you mentioned in a previous issue, you modified the PointNet++ backbone implementation. I wonder if you could release the pre-trained VoteNet models with different input dimensions (e.g., xyz, xyz+rgb, xyz+rgb+normal, xyz+multiview, etc.).
I would be grateful for your help.
Hi,
All links are down, along with the online dataset browser. While you work on a fix, could you provide a temporary download link (e.g., via Google Drive) for the camera poses you recently released?
As can be seen below, in scene scene0011_00, which is in the val split, the utterance for one chair is: "This is a brown chair. There are many identical chairs setting around the table it sets at."
Obviously, at least 4 chairs match this utterance. Such ambiguous descriptions in the training set may provide some supervision signal for learning vision-language alignment, but encountering them in the validation set does not help us evaluate the model's performance.