Official Implementation of IVLN-CE: Iterative Vision-and-Language Navigation in Continuous Environments

License: MIT License

Topics: computer-vision, habitat, python, research, robotics, vln

ivln-ce's Introduction

Iterative Vision-and-Language Navigation in Continuous Environments (IVLN-CE)

Jacob Krantz*, Shurjo Banerjee*, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason

[Project Page] [Paper] [IVLN Code]

This is the official implementation of Iterative Vision-and-Language Navigation (IVLN) in continuous environments, a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent’s memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes, where each tour consists of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. This repository implements the Iterative Room-to-Room in Continuous Environments (IR2R-CE) benchmark.

[Figure: the IVLN paradigm]
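To make the paradigm concrete, the following sketch shows the shape of the iterative evaluation loop: memory persists across the ordered episodes of a tour and is reset only between tours. This is an illustrative sketch only; tours, agent, env, and all of their methods are hypothetical names, not this repository's API.

# Illustrative sketch of tour-based (iterative) evaluation. All names here
# (tours, agent, env, and their methods) are hypothetical, not the repo's API.
def run_iterative_eval(tours, agent, env):
    for scene_id, episode_ids in tours.items():  # one tour of episodes per scene
        agent.reset_memory()                     # memory is cleared only between tours
        for episode_id in episode_ids:           # up to 100 ordered R2R episodes
            obs = env.reset(episode_id)          # new instruction and target path
            done = False
            while not done:
                action = agent.act(obs)          # may read and update persistent memory
                obs, done = env.step(action)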

Setup

This project is modified from the VLN-CE repository starting from this commit.

  1. Initialize the project
git clone --recurse-submodules git@github.com:jacobkrantz/Iterative-VLNCE.git
cd Iterative-VLNCE

conda env create -f environment.yml
conda activate ivlnce

Note: if you have runtime issues relating to torch-scatter, reinstall it with the CUDA-supported wheel that matches your installed PyTorch and CUDA versions. In my case, this was:

pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.2+cu113.html
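
To pick the matching wheel, check the installed PyTorch and CUDA versions first (the wheel index URL above encodes both):

# Print the installed PyTorch and CUDA versions to choose the matching
# torch-scatter wheel index, e.g. 1.10.2 and 11.3 -> torch-1.10.2+cu113.
import torch
print(torch.__version__, torch.version.cuda)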
  2. Download the Matterport3D scene meshes
# run with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
# Extract to: ./data/scene_datasets/mp3d/{scene}/{scene}.glb

download_mp.py must be obtained from the Matterport3D project webpage.

  3. Download the Room-to-Room episodes in VLN-CE format (link)
gdown https://drive.google.com/uc?id=1T9SjqZWyR2PCLSXYkFckfDeIs6Un0Rjm
# Extract to: ./data/datasets/R2R_VLNCE_v1-3/{split}/{split}.json.gz
  4. Download files that define tours of episodes:

| File | Download | Extract Path |
|---|---|---|
| Tour ordering | Link (1 MB) | data/tours.json |
| Target paths for t-nDTW eval | Link (132 MB) | data/gt_ndtw.json |
  5. [OPTIONAL] To run baseline models, the following weights are required:

| Weights | Download | Extract Path |
|---|---|---|
| ResNet Depth Encoder (DDPPO-trained) | Link (745 MB) | data/ddppo-models/{model}.pth |
| Semantics inference (RedNet) | Link (626 MB) | data/rednet_mp3d_best_model.pkl |
| Pre-trained MapCMA models | Link (608 MB) | data/checkpoints/{model}.pth |
| Pre-computed known maps | Link (78 MB) | data/known_maps/{semantic-src}/{scene}.npz |
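
With the downloads above in place, a quick sanity check of the expected data layout can help catch extraction mistakes. The script below is a sketch: the split name and the optional weight paths are assumptions, so adjust them to match what you actually downloaded.

# Sanity-check the data layout described in the setup steps above.
# The split used here and the optional paths are assumptions; adjust as needed.
from pathlib import Path

required = [
    "data/tours.json",
    "data/gt_ndtw.json",
    "data/scene_datasets/mp3d",
    "data/datasets/R2R_VLNCE_v1-3/val_unseen/val_unseen.json.gz",
]
optional = [  # only needed for the baseline models in step 5
    "data/ddppo-models",
    "data/rednet_mp3d_best_model.pkl",
    "data/checkpoints",
    "data/known_maps",
]
for path in required + optional:
    tag = "required" if path in required else "optional"
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"[{status}] ({tag}) {path}")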

Starter Code

The run.py script controls training and evaluation for all models:

python run.py \
  --exp-config path/to/experiment_config.yaml \
  --run-type {train | eval}

Config files exist for running each experiment detailed in the paper, both for training and for evaluation. The configs for running ground-truth semantics experiments are located in ivlnce_baselines/config/map_cma/gt_semantics and the configs for running predicted semantics experiments are located in ivlnce_baselines/config/map_cma/pred_semantics. Each subfolder {episodic, iterative, known} contains configs for training and evaluating a model with that mapping method. Following the numbered order of config .yaml files in each respective directory will train the model and evaluate it on all mapping modes. The unstructured memory models are represented in the ivlnce_baselines/config/latent_baselines folder.
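
For example, to see the intended run order for one setting, you can list its numbered configs. The directory below is one of those named above; the listing itself is just a convenience:

# List the numbered experiment configs for one setting, in intended run order.
from pathlib import Path

cfg_dir = Path("ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps")
for cfg in sorted(cfg_dir.glob("*.yaml")):
    print(cfg.name)  # e.g., 0_train_tf.yaml, 1_ftune_dagger.yaml, 2_eval_*.yaml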

Evaluating Pre-trained MapCMA Models

The naming convention of pre-trained MapCMA models is [semantics]_[training].pth, where semantics is either gt (ground-truth) or pred (predicted from RedNet) and training is the map construction method: either episodic (ep), iterative (it), or known (kn). Each can be evaluated with existing config files. For example, consider a model trained on predicted semantics and with iterative maps (pred_it.pth). To evaluate this model in the same setting, run:

python run.py \
  --run-type eval \
  --exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/2_eval_iterative.yaml \
  EVAL_CKPT_PATH_DIR data/checkpoints/pred_it.pth

Similarly, this model can be evaluated with known maps by swapping in the corresponding known-maps evaluation config (the exact filename follows the numbered naming convention of the configs in ivlnce_baselines/config/map_cma described above):

python run.py \
  --run-type eval \
  --exp-config path/to/known_maps_eval_config.yaml \
  EVAL_CKPT_PATH_DIR data/checkpoints/pred_it.pth

You can look through the configs in ivlnce_baselines/config/map_cma to find a particular training or evaluation configuration of interest.

Training Agents

The DaggerTrainer class is the standard trainer and supports teacher forcing or dataset aggregation (DAgger) of episodic data. We also include the IterativeCollectionDAgger trainer which builds maps iteratively and then trains agents episodically on those maps. The IterativeDAggerTrainer collects and trains models iteratively and is used to train unstructured memory models on IR2R-CE. All trainers inherit from BaseVLNCETrainer.
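
The roles described above can be summarized with the following stubs. Only the class names and their one-line roles come from this README; the stubs themselves (and any relationship beyond inheriting from BaseVLNCETrainer) are an illustrative summary, not the repository's code.

# Summary stubs of the trainers described above; illustrative only.
class BaseVLNCETrainer:
    """Shared training and evaluation machinery inherited by all trainers."""

class DaggerTrainer(BaseVLNCETrainer):
    """Standard trainer: teacher forcing or DAgger on episodic data."""

class IterativeCollectionDAgger(BaseVLNCETrainer):
    """Builds maps iteratively over tours, then trains agents episodically."""

class IterativeDAggerTrainer(BaseVLNCETrainer):
    """Collects data and trains iteratively; used for the unstructured-memory
    models on IR2R-CE."""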

Training MapCMA

Suppose you want to train a MapCMA model from scratch with predicted semantics and iterative maps, as was done in the paper. First, train on IR2R-CE + augmented tour data using teacher forcing:

python run.py \
  --run-type train \
  --exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/0_train_tf.yaml

Then, swap train for eval in --run-type to evaluate each checkpoint. Take the best-performing checkpoint and fine-tune it with DAgger on the IR2R-CE tours:

python run.py \
  --run-type train \
  --exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/1_ftune_dagger.yaml \
  IL.ckpt_to_load path/to/best/checkpoint.pth

Finally, evaluate each resulting checkpoint to find the best on the val_unseen split:

python run.py \
  --run-type eval \
  --exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/2_eval_iterative.yaml

While this tutorial walked through a single example, config sequences are provided for all models in the paper (both latent CMA and MapCMA).

Citation

If you find this work useful, please consider citing:

@article{krantz2022iterative,
  title={Iterative Vision-and-Language Navigation},
  author={Krantz, Jacob and Banerjee, Shurjo and Zhu, Wang and Corso, Jason and Anderson, Peter and Lee, Stefan and Thomason, Jesse},
  journal={arXiv preprint arXiv:2210.03087},
  year={2022},
}

License

This codebase is MIT licensed. Trained models and task datasets are considered data derived from the Matterport3D (MP3D) scene dataset. Matterport3D-based task datasets and trained models are distributed under the Matterport3D Terms of Use and the CC BY-NC-SA 3.0 US license.

ivln-ce's People

Contributors

jacobkrantz


Forkers

tikatoka

ivln-ce's Issues

Checkpoints for finetuning

Hi, thank you for your excellent work. I have some problems replicating the experiment. Could you please provide the best-performing checkpoint of the first-stage training, i.e., the checkpoint to be loaded before fine-tuning?

Thank you for your patience.

Data preparation

Thanks for the code! I have two questions about the data.

  1. What is the difference between {split}.json and {split}_gt.json?

    GT_PATH: data/datasets/R2R_VLNCE_v1-3_preprocessed/{split}/{split}_gt.json.gz

    DATA_PATH: data/datasets/R2R_VLNCE_v1-3_preprocessed/{split}/{split}.json.gz

  2. In the Google Drive link you provided, there is only R2R_VLNCE_v1-3. How can we obtain R2R_VLNCE_v1-3_preprocessed?
