VSP-LLM (Visual Speech Processing incorporated with LLMs)

This is the PyTorch implementation of "Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing". The code builds on the AV-HuBERT codebase.

  • add colab demo

Introduction

We propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), that maximizes context modeling ability by drawing on the power of LLMs. VSP-LLM performs both visual speech recognition and visual speech translation, and the given instruction selects which task is run. A self-supervised visual speech model maps the input video into the input latent space of an LLM. Because consecutive input frames carry redundant information, we propose a novel deduplication method that uses visual speech units to reduce the number of embedded visual features. With this deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner.
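The deduplication idea can be illustrated with a minimal sketch. It assumes that consecutive frames assigned to the same visual speech unit are merged by averaging their features, and that the run lengths are already known. The function name `average_by_counts` and the toy data are hypothetical and are not taken from the repository.

```python
def average_by_counts(features, counts):
    """Merge consecutive frame features into one averaged vector per run.

    features: list of per-frame feature vectors (lists of floats).
    counts:   run lengths; counts[i] frames in a row share one speech unit.
    """
    if any(n <= 0 for n in counts):
        raise ValueError("run lengths must be positive")
    if sum(counts) != len(features):
        raise ValueError("run lengths must add up to the number of frames")

    reduced = []
    start = 0
    for n in counts:
        chunk = features[start:start + n]
        # Average each feature dimension over the frames in this run.
        reduced.append([sum(column) / n for column in zip(*chunk)])
        start += n
    return reduced


# Four frames; the first three share a unit, so they collapse into one vector.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
counts = [3, 1]
print(average_by_counts(features, counts))  # [[3.0, 4.0], [7.0, 8.0]]
```

In this toy case the LLM would receive two feature vectors instead of four, which is where the computational saving comes from.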

[Figure: vsr-vst]

Model checkpoint

You can find our model checkpoint here.

Demo

Try our VSP-LLM demo using colab

Preparation

conda create -n vsp-llm python=3.9 -y
conda activate vsp-llm
git clone https://github.com/Sally-SH/VSP-LLM.git
cd VSP-LLM
pip install -r requirements.txt
  • Download the pre-trained AV-HuBERT Large (LRS3 + VoxCeleb2) model from here.
  • Download LLaMA2-7B from here.

Data preprocessing

Follow the Auto-AVSR preparation steps to preprocess the LRS3 dataset. This creates two directories: lrs3_video_seg24s and lrs3_text_seg24s.
The LRS3 dataset manifest is provided here. Unzip the archive in dataset. Then replace the LRS3 path in the manifest file with the path to your preprocessed LRS3 dataset using the following command:

cd src/dataset
python replace_path.py --lrs3 /path/to/lrs3 --outdir /path/to/save

Generate visual speech unit and cluster counts file

If you want to create a "cluster_counts" file for deduplication on your own, follow the steps in clustering.
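As a rough illustration of what a cluster count represents, the sketch below derives run lengths from a sequence of frame-level speech unit IDs. It assumes each count is the number of consecutive frames assigned to the same unit. The function name `run_lengths` is hypothetical, and the actual file format is defined by the clustering scripts in the repository, not by this example.

```python
from itertools import groupby


def run_lengths(units):
    """Return the length of each run of identical consecutive unit IDs."""
    return [sum(1 for _ in run) for _, run in groupby(units)]


# Frame-level unit IDs for one short utterance (toy values).
units = [7, 7, 7, 2, 2, 7]
print(run_lengths(units))  # [3, 2, 1]
```

Note that the final `7` starts a new run: only adjacent repeats are merged, so the temporal order of the units is preserved.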

Dataset layout

.
├── lrs3
│     ├── lrs3_video_seg24s               # preprocessed video and audio data
│     └── lrs3_text_seg24s                # preprocessed text data
└── lrs3_dataset
      ├── train.tsv                       # List of audio and video paths for training
      ├── train.wrd                       # List of target labels for training
      ├── train.cluster_counts            # Cluster counts used to deduplicate speech units in training
      ├── test.tsv                        # List of audio and video paths for testing
      ├── test.wrd                        # List of target labels for testing
      └── test.cluster_counts             # Cluster counts used to deduplicate speech units in testing
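Before training, it can help to confirm that the manifest directory matches the layout above. The following is a small sketch, not part of the repository: the helper `missing_files` and the example directory name are hypothetical, and only the file names come from the layout shown here.

```python
from pathlib import Path

# File extensions expected for each split, per the layout above.
EXTENSIONS = ("tsv", "wrd", "cluster_counts")


def missing_files(dataset_dir, splits=("train", "test")):
    """List the expected manifest files that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [
        f"{split}.{ext}"
        for split in splits
        for ext in EXTENSIONS
        if not (root / f"{split}.{ext}").is_file()
    ]


if __name__ == "__main__":
    # "lrs3_dataset" is a placeholder; point this at your own manifest directory.
    missing = missing_files("lrs3_dataset")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All manifest files found.")
```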

Training

Open the training script (scripts/train.sh) and replace these variables:

# path to downloaded pre-trained avhubert
PRETRAINED_MODEL_PATH=???

# path to train dataset dir
DATA_PATH=???

# path to llama checkpoint
LLM_PATH=???

# path where output trained models will be located
OUT_PATH=???

Run the training script:

$ bash scripts/train.sh

Decoding

Open the decoding script (scripts/decode.sh) and replace these variables:

# language direction (e.g., "en" or "en-fr")
LANG=???

# path to the trained model
MODEL_PATH=???

# path to test dataset dir
DATA_PATH=???

# path to llama checkpoint
LLM_PATH=???

# path where decoding results and scores will be located
OUT_PATH=???

Run the decoding script:

$ bash scripts/decode.sh
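The language direction values can be read as follows. This sketch assumes that a single code such as "en" selects recognition and that a pair such as "en-fr" selects translation from the first language into the second. The helper `parse_direction` is hypothetical and only illustrates the convention; the decoding script handles the value in its own way.

```python
def parse_direction(lang):
    """Split a language direction string into task, source, and target."""
    parts = lang.split("-")
    if len(parts) == 1 and parts[0]:
        # One language code: transcribe speech in that language.
        return {"task": "recognition", "source": parts[0], "target": parts[0]}
    if len(parts) == 2 and all(parts):
        # Two codes: translate from the first language into the second.
        return {"task": "translation", "source": parts[0], "target": parts[1]}
    raise ValueError(f"unrecognized language direction: {lang!r}")


print(parse_direction("en"))
print(parse_direction("en-fr"))
```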
