
[ECCV 2020] ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

Home Page: https://daveredrum.github.io/ScanRefer/

License: Other

Languages: Python 92.32%, C++ 2.94%, CUDA 4.33%, C 0.41%
Topics: eccv, computer-vision, natural-language-processing, 3d, pytorch, dataset, deep-learning, point-cloud, visual-grounding


ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

Introduction

We introduce the new task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, where the core idea is to learn a fused descriptor from 3D object proposals and encoded sentence embeddings. This learned descriptor then correlates the language expressions with the underlying geometric features of the 3D scan and facilitates the regression of the 3D bounding box of the target object. In order to train and benchmark our method, we introduce a new ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D.

Please also check out the project website at https://daveredrum.github.io/ScanRefer/.

For additional detail, please see the ScanRefer paper:
"ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language"
by Dave Zhenyu Chen, Angel X. Chang and Matthias Nießner
from Technical University of Munich and Simon Fraser University.

Changelog

01/20/2023: Released annotated viewpoints for descriptions!

11/11/2020: Updated the paper with improved results after bug fixes.

11/05/2020: Released pre-trained weights.

08/08/2020: Fixed the issue with lib/box_util.py.

08/03/2020: Fixed the issue with lib/solver.py and script/eval.py.

06/16/2020: Fixed the issue with multiview features.

01/31/2020: Fixed the issue with bad tokens.

01/21/2020: Released the ScanRefer dataset.

🌟 Benchmark Challenge 🌟

We provide the ScanRefer Benchmark Challenge for automatically benchmarking your model on the hidden test set! Learn more at our benchmark challenge website. After training the model, please download the benchmark data and put the unzipped ScanRefer_filtered_test.json under data/. Then run the following script to generate predictions:

python scripts/predict.py --folder <folder_name> --use_color

Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json. The generated predictions are stored in outputs/<folder_name>/pred.json. For submitting the predictions, please compress the pred.json as a .zip or .7z file and follow the instructions to upload your results.
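
If you prefer to do the packaging programmatically, here is a minimal sketch using Python's zipfile module (the submission rules on the benchmark website remain authoritative):

import zipfile

folder_name = "<folder_name>"  # the same output folder passed to predict.py
with zipfile.ZipFile(f"outputs/{folder_name}/pred.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(f"outputs/{folder_name}/pred.json", arcname="pred.json")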

Dataset

If you would like to access the ScanRefer dataset, please fill out this form. Once your request is accepted, you will receive an email with the download link.

Note: In addition to the language annotations in the ScanRefer dataset, you also need access to the original ScanNet dataset. Please refer to the ScanNet Instructions for more details.

Download the dataset by simply executing the wget command:

wget <download_link>

Data format

"scene_id": [ScanNet scene id, e.g. "scene0000_00"],
"object_id": [ScanNet object id (corresponds to "objectId" in ScanNet aggregation file), e.g. "34"],
"object_name": [ScanNet object name (corresponds to "label" in ScanNet aggregation file), e.g. "coffee_table"],
"ann_id": [description id, e.g. "1"],
"description": [...],
"token": [a list of tokens from the tokenized description] 

🌟 Annotated viewpoints 🌟

You can now download the viewpoints via this link. Once you've downloaded them, you can also play around with the viewpoints recorded during annotation.

Viewpoint format

"scene_id": [ScanNet scene id, e.g. "scene0000_00"],
"object_id": [ScanNet object id (corresponds to "objectId" in ScanNet aggregation file), e.g. "34"],
"object_name": [ScanNet object name (corresponds to "label" in ScanNet aggregation file), e.g. "coffee_table"],
"ann_id": [description id, e.g. "1"],
"id": "<scene_id>-<object_id>_<ann_id>"
"camera": {
    "position": [...] # camera position in the original ScanNet scene
    "rotation": [...] # camera rotation in the original ScanNet scene
    "lookat": [...] # the location that the camera is currently pointing at
}
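
Below is a minimal sketch for reading the viewpoint records and recovering the annotator's viewing direction; the file name is a placeholder, and the sketch assumes the download is a JSON list of records in the format above:

import json
import numpy as np

with open("data/scanrefer_viewpoints.json") as f:  # placeholder name; use the actual file from the download
    viewpoints = json.load(f)

vp = viewpoints[0]
position = np.array(vp["camera"]["position"])  # camera position in the original ScanNet scene
lookat = np.array(vp["camera"]["lookat"])      # the point the camera is aimed at
view_dir = lookat - position
view_dir = view_dir / np.linalg.norm(view_dir) # unit viewing direction of the annotator
print(vp["id"], view_dir)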

Setup

The code is tested on Ubuntu 16.04 LTS and 18.04 LTS with PyTorch 1.2.0 and CUDA 10.0. Earlier revisions had issues with newer PyTorch versions (>=1.3.0), so please make sure you have installed a supported version.

The code is now compatible with PyTorch 1.6! Please execute the following command to install PyTorch:

conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
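
If you want to verify the install before compiling the CUDA modules below, a quick optional check (not part of the official setup) is:

import torch
print(torch.__version__)          # expected: 1.6.0 (or 1.2.0 for the original setup)
print(torch.version.cuda)         # expected: 10.2 (or 10.0); None means a CPU-only build
print(torch.cuda.is_available())  # should be True before compiling the PointNet++ CUDA ops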

Install the necessary packages listed in requirements.txt:

pip install -r requirements.txt

After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
python setup.py install

Before moving on to the next step, please don't forget to set CONF.PATH.BASE in lib/config.py to the project root path.
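
For reference, the relevant line in lib/config.py looks roughly like the sketch below; only the path value needs to be edited, and the example path is of course a placeholder:

# lib/config.py (excerpt, illustrative)
CONF.PATH.BASE = "/home/username/ScanRefer"  # set this to the absolute path of your clone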

Data preparation

  1. Download the ScanRefer dataset and unzip it under data/.
  2. Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
  3. Download the ScanNetV2 dataset and put (or link) scans/ under data/scannet/scans/ (please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, there should be folders under data/scannet/scans/ containing the ScanNet scene data, with names like scene0000_00.

  4. Pre-process the ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following commands; roughly 3.8GB of free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py

After this step, you can check whether the processed scene data is valid by running the command below (a small programmatic sanity check is also sketched after this list):

python visualize.py --scene_id scene0000_00
  5. (Optional) Pre-process the multiview features from ENet.

    a. Download the ENet pretrained weights (1.4MB) and put them under data/.

    b. Download and decompress the extracted ScanNet frames (~13GB).

    c. Change the data paths in config.py marked with TODO accordingly.

    d. Extract the ENet features:

    python scripts/compute_multiview_features.py

    e. Project ENet features from ScanNet frames to point clouds; you need ~36GB to store the generated HDF5 database:

    python scripts/project_multiview_features.py --maxpool

    You can check whether the projections make sense by projecting the semantic labels from the images onto the target point cloud:

    python scripts/project_multiview_labels.py --scene_id scene0000_00 --maxpool
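
As mentioned in step 4, a quick programmatic sanity check of the prepared data might look like the sketch below; the per-scene file suffix written by batch_load_scannet_data.py is an assumption here, so adjust it to what you actually find under scannet_data/:

from glob import glob

scans = sorted(glob("data/scannet/scans/scene*"))                      # raw scenes from step 3
processed = sorted(glob("data/scannet/scannet_data/scene*_vert.npy"))  # assumed suffix from step 4

print(f"{len(scans)} raw scenes, {len(processed)} processed scenes")
assert scans, "no scenes found under data/scannet/scans/ -- check step 3"
assert processed, "no processed scenes found -- run batch_load_scannet_data.py (step 4)"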

Usage

Training

To train the ScanRefer model with RGB values:

python scripts/train.py --use_color

For more training options (like using preprocessed multiview features), please run scripts/train.py -h.
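
For example, to train with the preprocessed multiview features and point normals (the same flag combination as the xyz+multiview+normals entries in the table below):

python scripts/train.py --use_multiview --use_normal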

Evaluation

To evaluate the trained ScanRefer models, please find the folder under outputs/ named with the training timestamp and run:

python scripts/eval.py --folder <folder_name> --reference --use_color --no_nms --force --repeat 5

Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json.

Visualization

To visualize the localization results predicted by the trained ScanRefer model in a specific scene, please find the corresponding folder under outputs/ named with the training timestamp and run:

python scripts/visualize.py --folder <folder_name> --scene_id <scene_id> --use_color

Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json. The output .ply files will be stored under outputs/<folder_name>/vis/<scene_id>/.
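
If you prefer a programmatic viewer over opening the files one by one, here is a small sketch using Open3D (Open3D is not a project requirement; MeshLab or any other .ply viewer works just as well):

import glob
import open3d as o3d

vis_dir = "outputs/<folder_name>/vis/scene0000_00"  # replace <folder_name> and the scene id
meshes = [o3d.io.read_triangle_mesh(path) for path in glob.glob(f"{vis_dir}/*.ply")]
o3d.visualization.draw_geometries(meshes)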

Models

For reproducing our results in the paper, we provide the following training commands and the corresponding pre-trained models:

| Name | Command | Unique Acc@0.25IoU | Unique Acc@0.5IoU | Multiple Acc@0.25IoU | Multiple Acc@0.5IoU | Overall Acc@0.25IoU | Overall Acc@0.5IoU | Weights |
|------|---------|-------------------:|------------------:|---------------------:|--------------------:|--------------------:|-------------------:|---------|
| xyz | python scripts/train.py --no_lang_cls | 63.98 | 43.57 | 29.28 | 18.99 | 36.01 | 23.76 | weights |
| xyz+rgb | python scripts/train.py --use_color --no_lang_cls | 63.24 | 41.78 | 30.06 | 19.23 | 36.50 | 23.61 | weights |
| xyz+rgb+normals | python scripts/train.py --use_color --use_normal --no_lang_cls | 64.63 | 43.65 | 31.89 | 20.77 | 38.24 | 25.21 | weights |
| xyz+multiview | python scripts/train.py --use_multiview --no_lang_cls | 77.20 | 52.69 | 32.08 | 19.86 | 40.84 | 26.23 | weights |
| xyz+multiview+normals | python scripts/train.py --use_multiview --use_normal --no_lang_cls | 78.22 | 52.38 | 33.61 | 20.77 | 42.27 | 26.90 | weights |
| xyz+lobjcls | python scripts/train.py | 64.31 | 44.04 | 30.77 | 19.44 | 37.28 | 24.22 | weights |
| xyz+rgb+lobjcls | python scripts/train.py --use_color | 65.00 | 43.31 | 30.63 | 19.75 | 37.30 | 24.32 | weights |
| xyz+rgb+normals+lobjcls | python scripts/train.py --use_color --use_normal | 67.64 | 46.19 | 32.06 | 21.26 | 38.97 | 26.10 | weights |
| xyz+multiview+lobjcls | python scripts/train.py --use_multiview | 76.00 | 50.40 | 34.05 | 20.73 | 42.19 | 26.50 | weights |
| xyz+multiview+normals+lobjcls | python scripts/train.py --use_multiview --use_normal | 76.33 | 53.51 | 32.73 | 21.11 | 41.19 | 27.40 | weights |

If you would like to try out the pre-trained models, please download the model weights and extract the folder to outputs/. Note that the results are higher than those originally reported because of a few iterations of code refactoring and bug fixing.
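
For example, assuming the xyz+rgb+lobjcls weights were extracted to outputs/XYZ_COLOR_LANGCLS (this folder name is only an example; use whatever folder the downloaded archive contains), evaluation with the matching flags would be:

python scripts/eval.py --folder XYZ_COLOR_LANGCLS --reference --use_color --no_nms --force --repeat 5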

Citation

If you use the ScanRefer data or code in your work, please kindly cite our work and the original ScanNet paper:

@inproceedings{chen2020scanrefer,
    title={Scanrefer: 3d object localization in rgb-d scans using natural language},
    author={Chen, Dave Zhenyu and Chang, Angel X and Nie{\ss}ner, Matthias},
    booktitle={Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XX 16},
    pages={202--221},
    year={2020},
    organization={Springer}
}

@inproceedings{dai2017scannet,
    title={Scannet: Richly-annotated 3d reconstructions of indoor scenes},
    author={Dai, Angela and Chang, Angel X and Savva, Manolis and Halber, Maciej and Funkhouser, Thomas and Nie{\ss}ner, Matthias},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    pages={5828--5839},
    year={2017}
}

Acknowledgement

We would like to thank facebookresearch/votenet for the 3D object detection codebase and erikwijmans/Pointnet2_PyTorch for the CUDA accelerated PointNet++ implementation.

License

ScanRefer is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Copyright (c) 2020 Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner

scanrefer's People

Contributors

daveredrum, eamonn-zh, sherif-abdelkarim


scanrefer's Issues

An utterance refers to more than one object

As shown in the attached screenshot, in scene scene0011_00 (which is in the val split), the utterance for one chair is "This is a brown chair. There are many identical chairs setting around the table it sets at."
Obviously, at least four chairs match this utterance. Such ambiguous descriptions in the training set may provide some supervision signal to facilitate the model's learning of vision-language alignment, but encountering such ambiguous descriptions in the validation set does not help us evaluate the model's performance.


Running into a problem when visualizing

Hi @daveredrum ,

I need some help here. After successfully training and evaluating using the commands
python scripts/train.py --use_color --batch_size 16 --gpu 1 and
python scripts/eval.py --folder 2020-12-30_16-34-15 --reference --use_color --no_nms --force --repeat 5 --gpu 1,
when I run
python scripts/visualize.py --folder 2020-12-30_16-34-15 --scene_id scene0011_00 --use_color --gpu 1
I get the following error message:

Traceback (most recent call last):
  File "scripts/visualize.py", line 471, in <module>
    visualize(args)
  File "scripts/visualize.py", line 445, in visualize
    dump_results(args, scanrefer, data, DC)
  File "scripts/visualize.py", line 351, in dump_results
    pred_ref_scores_softmax = F.softmax(data["cluster_ref"] * torch.argmax(data['objectness_scores'], 2).float() * data['pred_mask'], dim=1).detach().cpu().numpy()
KeyError: 'pred_mask'.

I've tried setting breakpoints and found that the variable 'data' indeed doesn't have the key 'pred_mask'. I have no idea how to deal with this problem.

I've also attempted to use the pretrained model. However, I get a RuntimeError telling me that the model.pth file in XYZ_COLOR_LANGCLS is a zip archive, and the program just stops. Looking forward to your reply.

Thank you!

Cannot reproduce the detection results

Hi, zhenyu,

Thanks for your wonderful work. However, I got a question when running your code.

Firstly, I train a model from scratch with the command python script/train.py --use_color. Then I test the model using script/eval.py. I can reproduce the grounding result, but the detection mAP is 23.67, while this result in your paper's Supplementary Material Table 8 [h] is 29.91. The difference is quite large.

I also tested with the pre-trained model provided by the VoteNet repo, which gives 28.57. This is also lower than Table 8 [a] (30.08), but the difference is much smaller.

Could you check this problem? It would be of great help.

Camera viewpoint correspondence to the object description

Hi Dave,

I was wondering if there is a linkage between a ScanRefer reference query and a camera viewpoint for answering the question. For example, if the query is "there is a fridge to the left of the bed", is this query attached to a particular camera viewpoint that contains the fridge and the bed, so that the query can be grounded? Are the questions linked to camera viewpoints in some way? Or is there an easy way to find that viewpoint?

To give you some context, our work is based on RGB-D, so at test time our model expects an RGB-D image from the relevant camera viewpoint together with the query that needs to be answered.

Thank you

About the multiview image selection

Hi Dave,

Thank you for your great work. I noticed that you used the ScanNet images from Ji Hou's 3D-SIS for the multiview features.

The original paper says, "we use 5 RGB images for each training chunk (with image selection based on average 3D instance coverage)." I want to ask whether these images cover the whole scene, or whether they cover all the objects mentioned in the language utterance so that the target object can be identified based on the images. Do you have any analysis of these images?

I want to do the task in an RGB-D-only setting. Do you have any idea how many frames are needed for the task? Thank you for your answer.

#6

The scene and the predicted boxes are unaligned

Hi all,
We tried to visualize the boxes and the scene together using the open3d package. We first generated all the box .ply files by running the predict function, and then we took the scene .ply file (scene0011_00_vh_clean_2.ply).

Please check the attached screenshot.

Missing Camera Information for Test Set

Hi, thanks for releasing the camera information. Could you also release the camera information for the test split? This would be a fair assumption for indoor robotic applications.

Input Dimension

Dear author,

When I run python scripts/train.py --use_color, the dimension of data_dict['point_clouds'] is torch.Size([8, 40000, 7]). Why is the dimension not torch.Size([8, 40000, 6])?

Evaluation degradation w.r.t runtime and memory usage

Dear authors,

Thank you very much for your amazing work!
I have one question regarding the evaluation performance: I noticed that the evaluation step takes almost the same time as training one epoch, which seems strange to me.
I also noticed that the same amount of GPU memory is reserved during both phases, i.e., training and testing.
Thus, I feel the evaluation loop could be optimized further.
I tried to optimize it, but unfortunately I couldn't find the root cause of this degradation in runtime performance.

Thanks in advance!

GPU memory

Dear author:

Could I use one 1080ti with 12GB memory to reproduce your method? And how long does your model need to train?

Data preparation

Hello,

I have filled out the form and sent it to you. Have you received my dataset form?

Thank you

Visualization

This is my visualization result (see the attached screenshot). How can I obtain the visualization result shown in your demo?

Thank you very much.

"Unique" subset for validation

Hi @daveredrum ,

Hope you are doing fine.

Great work! I was wondering how I can tell which samples fall in the "unique" split and which fall in the "multiple" one, as suggested in Table 5, for example?

Would appreciate any pointers.

~Cheers!

Question about the language-to-object classification loss (lang_loss)

Hi Zhenyu,

Thanks for your good work. However, I have a question about training the network.

[plots of lang_loss and lang_acc during training]

Firstly, I train a model from scratch with the command python script/train.py --use_multiview --use_normal, which enables the language-to-object classification loss (lang_loss).

The lang_loss and lang_acc curves are shown in the plots above.

It is very strange that the language loss on the validation set is much lower than the loss on the training set, which violates the basic principles of machine learning.

Also, the lang_acc is much higher on the validation set than on the training set, which confuses me even more.

Can you check this problem? It would be of great help.

Something about data preparation

Dear author, I'm interested in your ScanRefer.
I have downloaded the ScanNet dataset but don't know where to get ScanRefer_filtered_train.json and ScanRefer_filtered_val.json.
Can you give me a hand? Thanks a lot.

Some issues about the pre-trained VoteNet

Hello,

First, thanks for your awesome work. I really admire it.

Still, I have a question. I noticed that you reported the performance of your model with the pre-trained VoteNet. As you mentioned in a previous issue, you modified the PointNet++ backbone implementation. I wonder if you could release the pre-trained VoteNet models with different input dimensions (e.g., xyz, xyz+rgb, xyz+rgb+normal, xyz+multiview, etc.).

I would be grateful for your help.

Evaluation

Hello,

I ran your code and am now doing the evaluation. I find that there are two outputs for each case, i.e., iou_rate_0.25 and iou_rate_0.5. Do the results of iou_rate_0.25 and iou_rate_0.5 correspond to the results in Table 2?

Download links are down

Hi,

All links are down, alongside the online dataset browser. While working on a fix, could you provide a temporary download link for the camera poses you recently released (maybe via Google Drive)?

Requesting a download link

Hi, I filled out the form requesting access to the ScanRefer dataset last week. Could you give me a reply? Best regards.

How does the evaluation work?

I notice that the definition of ref_acc (line 89, lib/eval_helper.py) calculates whether the selected bounding box matches the prediction box that has the maximum IoU with the target box.

However, in my understanding, the expected output of 3D visual grounding is to generate only one bounding box with respect to the input scene and language query. Is this metric therefore only an intermediate evaluation rather than the final evaluation?

Question about get_3d_box(...) function in utils/box_util.py

Hi, @daveredrum,

I have a question about the implementation of get_3d_box(...) in utils/box_util.py https://github.com/daveredrum/ScanRefer/blob/a728f21feb73b8f491b7be50bd46d5c390d16fa7/utils/box_util.py#L290

I see that you have changed the corresponding order of channels between box_size and center.

However, the get_3d_box() function is also called in function parse_predictions() in lib/ap_helper.py https://github.com/daveredrum/ScanRefer/blob/a728f21feb73b8f491b7be50bd46d5c390d16fa7/lib/ap_helper.py#L40

In the parse_predictions() function, before get_3d_box() is called, pred_center has been transformed into pred_center_upright_camera by flip_axis_to_camera(), which changes the Y,Z order of the point cloud data.

So I'm a little confused about the get_3d_box() implementation, and I hope you can explain the change compared with the original VoteNet.

Can you check this problem? It would be of great help.

Requesting a download link

Thank you very much for opening up the download of the dataset! I hope to use the ScanRefer dataset for my research, but I didn't receive any e-mails after I filled out the ScanRefer Dataset form a few days ago. I tried sending an e-mail to [email protected], but the server returned the e-mail. Please give me a reply; my e-mail address is [email protected].
Best wishes.

Dataset application

Hi, I have filled out the form and sent it to you by email. Could you give me a reply? Thank you.
