This project is forked from peixinc/voicecontrolledrobot-var


[ICRA 2023] Learning Visual-Audio Representations for Voice-Controlled Robots

Home Page: https://sites.google.com/site/changpeixin/home/Research/learning-visual-audio-representations-for-voice-controlled-robots

License: MIT License


VoiceControlledRobot-VAR

This repository contains the code for our paper "Learning Visual-Audio Representations for Voice-Controlled Robots," published at ICRA 2023. For more details, please refer to the project website and the arXiv preprint. For experiment demonstrations, please see the YouTube video.

Abstract

Based on recent advancements in representation learning, we propose a novel pipeline for task-oriented voice-controlled robots with raw sensor inputs. Previous methods rely on a large number of labels and task-specific reward functions. Not only can such an approach hardly be improved after deployment, but it also generalizes poorly across robotic platforms and tasks. To address these problems, our pipeline first learns a visual-audio representation (VAR) that associates images with sound commands. The robot then learns to fulfill the sound command via reinforcement learning, using the reward generated by the VAR. We demonstrate our approach with various sound types, robots, and tasks. We show that our method outperforms previous work with far fewer labels, and we show in both simulated and real-world experiments that the system can self-improve in previously unseen scenarios given a reasonable number of newly labeled data.
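The two-stage idea above can be sketched in a few lines: the VAR embeds images and sound commands into a shared space, and the RL reward measures how well the current observation matches the commanded sound in that space. The function names and the cosine-similarity reward below are illustrative assumptions for this sketch, not the paper's exact reward formulation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def var_reward(image_embedding: np.ndarray, sound_embedding: np.ndarray) -> float:
    """RL reward: how well the current camera observation matches the
    commanded sound, measured in the shared visual-audio embedding space."""
    return cosine_similarity(image_embedding, sound_embedding)
```

Under this sketch, an observation whose embedding aligns with the command's embedding yields a reward near 1, and an unrelated observation yields a reward near 0.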

Setup

  1. Install the Python packages listed in requirements.txt.
  2. The sounddevice package requires an additional system library: sudo apt-get install libportaudio2
  3. We use the following sound datasets: Fluent Speech Dataset, GoogleCommand Dataset, NSynth, and UrbanSound8K. The sound data is located under the commonMedia folder. Note that we processed the sound data to be mono WAV with a 16 kHz sampling rate.
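The preprocessing in step 3 (mono WAV at 16 kHz) can be sketched with plain NumPy. The linear-interpolation resampler below is a stand-in assumption for illustration; in practice a dedicated resampler (e.g. librosa or sox) gives better quality.

```python
import numpy as np

TARGET_SR = 16000  # sampling rate expected by the pipeline

def to_mono_16k(samples: np.ndarray, sr: int) -> np.ndarray:
    """Convert an audio signal to mono at 16 kHz.

    samples: float array, shape (n,) for mono or (n, channels) for multi-channel.
    sr: original sampling rate in Hz.
    """
    if samples.ndim == 2:               # average channels -> mono
        samples = samples.mean(axis=1)
    if sr != TARGET_SR:                 # naive linear-interpolation resampling
        n_out = int(round(len(samples) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(x_new, x_old, samples)
    return samples
```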

Getting started

Structure

  • commonMedia: contains sound datasets
  • data: contains the data collected from the environment, VAR models, and the RL models.
  • Envs: contains the implementation of OpenAI Gym environments used in the paper. The Kuka environment is in Envs/pybullet. The iTHOR environment is in Envs/ai2thor. Each environment has a configuration file for the environment, the algorithm, and the deep model.
  • examples: contains important information about configuration
  • models: contains the implementation of the VAR, the RL model, and an RL algorithm.
  • VAR: contains functions that support pretext.py and RL.py
  • cfg.py: Change this file to select one of the four environments to run.
  • dataset.py: definition of the dataset and data loader.
  • pretext.py: run this file to collect triplets, train, and test the VAR.
  • RL.py: run this file to load the trained VAR and perform the RL training, testing and fine-tuning.
  • utils.py: contains some helper functions
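The triplet training that pretext.py performs can be illustrated with a minimal hinge-style triplet loss: a matching (image, sound) pair is pulled together while a mismatched pair is pushed apart. The margin value and Euclidean distance below are assumptions for illustration, not necessarily the choices made in this repository.

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 0.2) -> float:
    """Hinge-style triplet loss: pull the matching (image, sound) pair
    together and push the mismatched pair at least `margin` farther apart."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to the matching sample
    d_neg = np.linalg.norm(anchor - negative)  # distance to the mismatched sample
    return float(max(0.0, d_pos - d_neg + margin))
```

The loss is zero once the mismatched sample is at least `margin` farther from the anchor than the matching sample, so training focuses on triplets that are still confusable.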

Run the code

  1. Correctly set the configuration file. Please see the README.md in examples.
  2. VAR related:
python pretext.py
  3. RL related:
python RL.py

Citation

If you find the code or the paper useful for your research, please cite our paper:

@INPROCEEDINGS{chang2023learning,
  author={Chang, Peixin and Liu, Shuijing and McPherson, D. Livingston and Driggs-Campbell, Katherine},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)}, 
  title={Learning Visual-Audio Representations for Voice-Controlled Robots}, 
  year={2023},
  volume={},
  number={},
  pages={9508-9514},
  doi={10.1109/ICRA48891.2023.10161461}}

Credits

Other contributors:
Shuijing Liu

Part of the code is based on the following repositories:

[1] I. Kostrikov, “Pytorch implementations of reinforcement learning algorithms,” https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018.

Contact

If you have any questions or find any bugs, please feel free to open an issue or pull request.

