
[CVPR 2021] Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Home Page: https://daveredrum.github.io/Scan2Cap

License: Other

Python 95.94% C 0.16% C++ 1.15% Cuda 1.69% Shell 0.38% Cython 0.67%
computer-vision natural-language-processing 3d pytorch cvpr cvpr2021 scans deep-learning point-cloud caption-generation

scan2cap's Introduction

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Introduction

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e. relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU improvement).

Please also check out the project website here.

For additional detail, please see the Scan2Cap paper:
"Scan2Cap: Context-aware Dense Captioning in RGB-D Scans"
by Dave Zhenyu Chen, Ali Gholami, Matthias Nießner and Angel X. Chang
from Technical University of Munich and Simon Fraser University.
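For intuition only, below is a minimal, schematic sketch of a relational message-passing step over object proposals. It is not the implementation in this repository; the layer choices, names, and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class RelationMessagePassing(nn.Module):
    # Schematic only: one message-passing step over K proposals and their L local neighbors.
    def __init__(self, feat_dim=128):
        super().__init__()
        self.message = nn.Linear(2 * feat_dim, feat_dim)  # message from each (receiver, sender) pair
        self.update = nn.GRUCell(feat_dim, feat_dim)      # fuse the aggregated message into the node state

    def forward(self, feats, neighbor_idx):
        # feats: (K, D) proposal features; neighbor_idx: (K, L) indices of the L nearest proposals
        _, L = neighbor_idx.shape
        senders = feats[neighbor_idx]                     # (K, L, D)
        receivers = feats.unsqueeze(1).expand(-1, L, -1)  # (K, L, D)
        msg = self.message(torch.cat([receivers, senders], dim=-1)).mean(dim=1)  # (K, D)
        return self.update(msg, feats)                    # (K, D) updated proposal features

feats = torch.randn(256, 128)                    # e.g. 256 proposals with 128-d features
neighbor_idx = torch.randint(0, 256, (256, 10))  # e.g. num_locals = 10 neighbors per proposal
updated = RelationMessagePassing()(feats, neighbor_idx)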

News

  • [08/22/2022] We launched the Scan2Cap Dense Captioning Benchmark. Come check it out!
  • [08/22/2022] We released a new implementation of Scan2Cap with 1) 8x faster training time; 2) revised evaluation metrics; 3) a benchmark toolbox. Please see more details in the faster-captioning repo.

🌟 Benchmark Challenge 🌟

We provide the Scan2Cap Benchmark Challenge for benchmarking your model automatically on the hidden test set! Learn more at our benchmark challenge website. After training the model, please download the benchmark data and put the unzipped ScanRefer_filtered_test.json under data/. Then, you can run the following script to generate predictions:

python benchmark/predict.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10

Note that the flags must match the ones used for training. The training information is stored in outputs/<folder_name>/info.json, and the generated predictions are stored in outputs/<folder_name>/pred.json. To submit the predictions, please compress pred.json into a .zip or .7z file and follow the instructions to upload your results.
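For reference, a minimal packaging sketch using Python's zipfile module (assuming the default outputs/<folder_name>/pred.json layout described above; substitute your actual folder name):

import zipfile

folder = "outputs/<folder_name>"  # replace <folder_name> with your actual output folder
with zipfile.ZipFile(folder + "/pred.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(folder + "/pred.json", arcname="pred.json")  # keep the file name pred.json inside the archive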

Local Benchmarking on Val Set

Before submitting results on the test set to the official benchmark, you can also benchmark the performance on the val set. Run the following script to generate the ground truths for the val set first:

python scripts/build_benchmark_gt.py --split val

NOTE: don't forget to change the DATA_ROOT in scripts/build_benchmark_gt.py
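For example (the path below is hypothetical; point it at your own data directory):

# in scripts/build_benchmark_gt.py -- hypothetical value, adjust to your setup
DATA_ROOT = "/home/user/Scan2Cap/data"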

Generate the predictions on val set:

python benchmark/predict.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --test_split val

Evaluate the predictions on the val set:

python benchmark/eval.py --split val --path <path to predictions> --verbose

(Optional) Compile accelerated generalized IoU for faster evaluation:

python cython_compile.py build_ext --inplace

Data

ScanRefer

If you would like to access the ScanRefer dataset, please fill out this form. Once your request is accepted, you will receive an email with the download link.

Note: In addition to the language annotations in the ScanRefer dataset, you also need access to the original ScanNet dataset. Please refer to the ScanNet Instructions for more details.

Download the dataset by simply executing the wget command:

wget <download_link>

Scan2CAD

As learning the relative object orientations in the relational graph requires the CAD model alignment annotations from Scan2CAD, please refer to the Scan2CAD official release (you need ~8MB on your disk). Once the data is downloaded, extract the zip file under data/ and change the path to the Scan2CAD annotations (CONF.PATH.SCAN2CAD) in lib/config.py. As Scan2CAD doesn't cover all instances in ScanRefer, please download the mapping file and place it under CONF.PATH.SCAN2CAD. Parse the raw Scan2CAD annotations with the following command:

python scripts/Scan2CAD_to_ScanNet.py

Setup

Please execute the following command to install PyTorch 1.8:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

Install the necessary packages listed in requirements.txt:

pip install -r requirements.txt

And don't forget to refer to PyTorch Geometric to install the graph support.

After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
python setup.py install

Before moving on to the next step, please don't forget to set the project root path in CONF.PATH.BASE in lib/config.py.
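As a hedged example, the relevant lines in lib/config.py would look something like this (the paths are hypothetical placeholders for your own absolute paths):

# in lib/config.py -- hypothetical paths, adjust to your machine
CONF.PATH.BASE = "/home/user/Scan2Cap"                    # project root
CONF.PATH.SCAN2CAD = "/home/user/Scan2Cap/data/Scan2CAD"  # Scan2CAD annotations (see the Scan2CAD section above)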

Data preparation

  1. Download the ScanRefer dataset and unzip it under data/ - You might want to run python scripts/organize_scanrefer.py to organize the data a bit.
  2. Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
  3. Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (Please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.
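A quick, optional sanity check of the layout (a sketch assuming the paths above):

import os
scans = sorted(os.listdir("data/scannet/scans"))
print(len(scans), scans[:3])  # expect folder names like 'scene0000_00'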

  4. Pre-process the ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB of free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py

After this step, you can check if the processed scene data is valid by running:

python visualize.py --scene_id scene0000_00
  5. (Optional) Pre-process the multiview features from ENet.

    a. Download the ENet pretrained weights (1.4MB) and put it under data/

    b. Download and decompress the extracted ScanNet frames (~13GB).

    c. Change the data paths in config.py marked with TODO accordingly.

    d. Extract the ENet features:

    python scripts/compute_multiview_features.py

    e. Project ENet features from ScanNet frames to point clouds; you need ~36GB to store the generated HDF5 database:

    python scripts/project_multiview_features.py --maxpool

    You can check if the projections make sense by projecting the semantic labels from the images to the target point cloud:

    python scripts/project_multiview_labels.py --scene_id scene0000_00 --maxpool

Usage

End-to-End training for 3D dense captioning

Run the following script to start the end-to-end training of the Scan2Cap model using the multiview features and normals. For more training options, please run scripts/train.py -h:

python scripts/train.py --use_multiview --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50

The trained model as well as the intermediate results will be dumped into outputs/<output_folder>. To evaluate the model (@0.5IoU), please run the following script with <output_folder> changed accordingly; note that the arguments must match the ones used for training:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_caption --min_iou 0.5

Evaluating the detection performance:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection

You can even evaluate the pretrained object detection backbone:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection --eval_pretrained

If you want to visualize the results, please run this script to generate bounding boxes and descriptions for scene <scene_id> into outputs/<output_folder>:

python scripts/visualize.py --folder <output_folder> --scene_id <scene_id> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10

Note that you need to run python scripts/export_scannet_axis_aligned_mesh.py first to generate axis-aligned ScanNet mesh files.

3D dense captioning with ground truth bounding boxes

To experiment with the captioning performance given ground truth bounding boxes, you need to extract the box features with a pre-trained extractor. Pretrained extractors are already provided under pretrained/, but if you want to train a new one from scratch, run the following script:

python scripts/train_maskvotenet.py --batch_size 8 --epoch 200 --lr 1e-3 --wd 0 --use_multiview --use_normal

The pretrained model will be stored under outputs/<output_folder>. Before we proceed, you need to move <output_folder> to pretrained/ and rename it to XYZ_MULTIVIEW_NORMAL_MASKS_VOTENET; the name must reflect the features used during training, e.g. MULTIVIEW corresponds to --use_multiview.
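A minimal sketch of this step (the source folder name is a placeholder for your actual run folder):

import shutil
run_folder = "outputs/<output_folder>"  # the folder produced by train_maskvotenet.py above
shutil.move(run_folder, "pretrained/XYZ_MULTIVIEW_NORMAL_MASKS_VOTENET")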

After that, let's run the following script to extract the features for the ground truth bounding boxes. Note that the feature options must match the ones in the previous steps:

python scripts/extract_gt_features.py --batch_size 16 --epoch 100 --use_multiview --use_normal --train --val

The extracted features will be stored as an HDF5 database under <your-project-root>/gt_<dataset-name>_features. You need ~610MB of space on your disk.
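To sanity-check the extracted features, you can peek into the database with h5py; the path below is only a placeholder, since the exact file name depends on your configuration:

import h5py

db_path = "<your-project-root>/gt_<dataset-name>_features/<database>.hdf5"  # placeholder, use the actual file path
with h5py.File(db_path, "r") as f:
    for key in list(f.keys())[:5]:
        print(key, f[key])  # print the first few entries (datasets/groups) and their summaries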

Now the box features are ready - we're good to go! Next step: run the following command to start training the dense captioning pipeline with the extracted ground truth box features:

python scripts/train_pretrained.py --mode gt --batch_size 32 --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

For evaluating the model, run the following command:

python scripts/eval_pretrained.py --folder <output_folder> --mode gt --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

3D dense captioning with pre-trained VoteNet bounding boxes

If you would like to play around with the pre-trained VoteNet bounding boxes, you can directly use the pre-trained VoteNet under pretrained/. After picking the model you like, run the following command to extract the bounding boxes and associated box features:

python scripts/extract_votenet_features.py --batch_size 16 --epoch 100 --use_multiview --use_normal --train --val

Now the box features are ready. Next step: run the following command to start training the dense captioning pipeline with the extracted VoteNet boxes:

python scripts/train_pretrained.py --mode votenet --batch_size 32 --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

For evaluating the model, run the following command:

python scripts/eval_pretrained.py --folder <output_folder> --mode votenet --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

Experiments on ReferIt3D

Yes, of course you can use the ReferIt3D dataset for training and evaluation. Simply download the ReferIt3D dataset and unzip it under data/, then run the following command to convert it to the ScanRefer format:

python scripts/organize_referit3d.py

Then you can simply specify the dataset you would like to use via --dataset ReferIt3D in the aforementioned steps. Have fun!
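For example, end-to-end training on ReferIt3D would look like the following (assuming the same options as the end-to-end training command above):

python scripts/train.py --dataset ReferIt3D --use_multiview --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50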

2D Experiments

Please refer to Scan2Cap-2D for more information.

Citation

If you find our work helpful, please kindly cite our paper via:

@inproceedings{chen2021scan2cap,
  title={Scan2Cap: Context-aware Dense Captioning in RGB-D Scans},
  author={Chen, Zhenyu and Gholami, Ali and Nie{\ss}ner, Matthias and Chang, Angel X},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3193--3203},
  year={2021}
}

License

Scan2Cap is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Copyright (c) 2021 Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

scan2cap's People

Contributors

databaaz, daveredrum


scan2cap's Issues

question

where is the 'votenet_{}_predictions'?

How did you measure the accuracy in Table 3?

Hi Dave, may I ask how you measured the accuracy of category, attribute, and relation for the generated captions, as presented in Table 3 of the main paper? Thanks in advance!

About the results

I'm curious about the metrics you reported in your paper.

Did the four reported metrics (BLEU-4, CIDEr, METEOR, ROUGE-L) come from the model with the best CIDEr score on the validation set, or from the best of each metric on the validation set?

Why is the `nyu40id2class` of Scan2Cap different from that of these detection methods?

## VoteNet
(Pdb) DC18.nyu40id2class
{3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5, 9: 6, 10: 7, 11: 8, 12: 9, 14: 10, 16: 11, 24: 12, 28: 13, 33: 14, 34: 15, 36: 16, 39: 17}
(Pdb) len(DC18.nyu40id2class)
18
(Pdb) 

## Scan2Cap
(Pdb) self.dataset_config.nyu40id2class
{5: 2, 23: 17, 8: 5, 40: 17, 9: 6, 7: 4, 39: 17, 18: 17, 11: 8, 29: 17, 3: 0, 14: 10, 15: 17, 27: 17, 6: 3, 34: 15, 35: 17, 4: 1, 10: 7, 19: 17, 16: 11, 30: 17, 33: 14, 37: 17, 21: 17, 32: 17, 25: 17, 17: 17, 24: 12, 28: 13, 36: 16, 12: 9, 38: 17, 20: 17, 26: 17, 31: 17, 13: 17}
(Pdb) len(self.dataset_config.nyu40id2class)
37
(Pdb) 

How to learn relative orientations

Hi, your work is great.
Could you explain in detail why you use the CAD model alignment annotations to learn the relative object orientations in the relational graph?
I would appreciate it if you could answer.

Some puzzles about dataset processing

I once encountered a problem when preprocessing the ScanNetV2 dataset. I tried to solve it, but I'm not sure whether my solution is reasonable, so I'd like to discuss it with you.

When I executed the command python batch_load_scannet_data.py, an error occurred (screenshot omitted).

I read the file batch_load_scannet_data.py and found that it selects the corresponding folders in data/scannet/scans/ for processing according to the directory names listed in data/scannet/meta_data/scannetv2.txt, and saves the generated results in data/scannet/scannet_data/.

I don't know if my understanding is correct.

Then I read the file data/scannet/meta_data/scannetv2.txt and found that it contains 806 scenes, while the directory data/scannet/scans/ contains only 706 scenes for train and val. I think the problem is a mismatch between the two.

So I copied all the files from data/scannet/scans_test/ to data/scannet/scans/. After that, the command python batch_load_scannet_data.py ran normally.

I want to know whether this approach is correct. Looking forward to your reply.

Different experiment settings between the end-to-end Scan2Cap and the fixed-detector Scan2Cap

In the end-to-end Scan2Cap, the relation graph's input is the original proposals without NMS. However, in the fixed-detector Scan2Cap, the relation graph's input is the original proposals with NMS. I don't think that's a fair comparison.
I've run experiments with pre-fetched VoteNet features without NMS and used train_pretrained.py to train the fixed-detector model. The results show that the fixed-detector variant actually outperforms the end-to-end one.

Training time too long

Thanks for releasing the code! I really appreciate your hard work!

When I followed the instructions and ran the training command python scripts/train.py --use_multiview --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50, I found that it would take nearly 14 hours just to validate on the training set (training-log screenshot omitted). Is it normal for validation to take this long? I use one GeForce RTX 2080 GPU with 11GB of memory.

GPU configuration

Your work is really interesting! May I ask what GPU configuration you used in your experiments? How many GPUs and how much memory are required to run the experiments? Looking forward to your reply!

About "3D dense captioning with ground truth bounding boxes"

Hi, I have a question about using MaskVoteNet to get the visual features of the GT bounding boxes. In your code, you only extract one target object's feature. Could you explain how to get the features of all GT boxes? Of course, for the Scan2Cap task only one target object's feature is needed, but for the visual grounding task we need the features of all GT boxes. Thank you!

How to visualize it?

Why can't the description be displayed on the 3D object, and why can't the .json file be opened in the 3D view? Could you recommend software to open it? (Screenshots omitted.)

Unable to train

Hello, thanks for your work and code!

When I followed the instructions and ran the training command python scripts/train.py --use_multiview --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50, the output log still stayed at "epoch 1 starting..." after a week of training.

I think this is obviously unreasonable, but I don't know what the problem is. I use three GeForce RTX 2080 GPUs with 11GB of memory each.

How can we get the render-based/ folder used in Scan2Cap-2D?

Hi Scan2Cap team,

I've noticed that several folders like "projected-based/renders" and "render-based/renders" appear in conf.py, but I couldn't find these images in ScanNet, ScanRefer, or Scan2Cap. Is there a script to generate these images, or are they produced by another project?

Also, are rendered pictures with bounding boxes drawn on them used to generate the global feature? It's hard to tell without running the code.

Thanks,
Yui

about visualization

Hello! Thanks for your work.
Could you please tell me how to visualize the scene as shown in the README? Should I put the .ply file into MeshLab, or use other software?

Scan2Cap benchmark no response

Hi, I have submitted my Scan2Cap results to the online benchmark but got no response. Can you check the backend server and help me with that?

Thanks a lot

Caught KeyError in DataLoader worker process 0.

Hello, I am following the steps in "Setup" to try to run the code, but I encountered the following problem. Can you help me solve it?
My machine environment is: Ubuntu 18.04, 8GB machine memory, 12GB GPU (Tesla K80), CUDA 10.2, Python 3.7.0, PyTorch 1.8.0, etc.

python ./scripts/train.py --debug --use_color --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50

I typed the above command just for debugging and got an error (screenshot omitted).

In addition, I find that my machine memory is a little small, which may be one of the reasons why the program can't run. But, confusingly, the problem still occurs in debug mode, where only one piece of data is loaded. As shown in the (omitted) screenshot, I did not download the data in Step 5. Is that the cause of the problem?
Note that I replaced "--use_multiview" with "--use_color" in the command.

I am looking forward to your early reply. Thank you very much

Training time suddenly becomes longer.

Thank you very much for your work. I have encountered some problems here.

In the first 19 epochs, the mean_iter_time stayed at about 2 seconds.

But in the 20th epoch, the mean_iter_time suddenly became about a minute (screenshots omitted).

It puzzles me.

I have read another related answer about "Training time too long", but I don't know why there is such a big change after the 20th epoch.

I look forward to your answer and wish you success in your work.

Benchmark Challenge

Hi, I'm really interested in the Scan2Cap Dense Captioning Benchmark, but I have some problems.

I tried to run the following command:
python benchmark/predict.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10

Then I encountered an error (screenshot omitted).

I guess the reason for this error is that the ScanNet dataset does not provide point cloud data for scenes 707 to 806.

The file structure of the original ScanNet dataset I downloaded is shown in the (omitted) screenshot.

Is the ScanNet dataset I downloaded incomplete, or is there something wrong with the code?
