This repository implements a deep learning model that embeds 2-second audio signals into text representations, and contains the code for training it. Because the model converts audio into text embeddings by segmenting the signal into 2-second intervals, it can be useful as a conditioning module for text-to-video generation models.
Clone the repository:

```shell
git clone https://github.com/jibin86/Audio-to-Text-Embedding.git
cd Audio-to-Text-Embedding
```
Create and activate the conda environment:

```shell
conda env create --file env.yaml
conda activate audio_emb
```
This code is built upon The Power of Sound (TPoS) and AudioGPT. Obtain the checkpoints for the audio extractor and the audio detector:
- Audio Extraction: pretrained weights can be found at the following link: link. Once downloaded, place the weights in the `pretrained_models` directory.
- Audio Detection:

  ```shell
  cd pretrained_models
  wget https://huggingface.co/Dongchao/pre_trained_model/resolve/main/audio_detection.pth
  ```
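Before running the pipeline, it can help to confirm that the checkpoints are actually in place. A minimal sketch — only `audio_detection.pth` is named in this README, so `EXPECTED` lists just that file; add the audio-extractor weights filename once downloaded:

```python
from pathlib import Path

# Checkpoints expected under pretrained_models/. Only audio_detection.pth
# is named in this README; extend this list with the audio-extractor
# weights once you know their filename.
EXPECTED = ["audio_detection.pth"]

def missing_checkpoints(root: str = "pretrained_models") -> list[str]:
    """Return the expected checkpoint files that are not present yet."""
    root_path = Path(root)
    return [name for name in EXPECTED if not (root_path / name).is_file()]
```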
Download the UnAV-100 videos and extract their audio tracks:

```shell
cd unav_dataset/scripts
python video_download.py
```

```shell
cd unav_dataset
python extract_audio.py
```
Segment the extracted audio into 2-second clips:

```shell
cd audio_encoder
python unav_segment_2sec.py
```
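The segmentation step can be sketched as follows. This is not the repository's implementation; the 16 kHz sample rate, mono input, and drop-last behaviour are assumptions about what `unav_segment_2sec.py` does:

```python
import numpy as np

def segment_2sec(wav: np.ndarray, sr: int = 16000) -> list[np.ndarray]:
    """Split a mono waveform into non-overlapping 2-second chunks,
    dropping any trailing partial chunk. Sample rate is an assumption."""
    chunk = 2 * sr
    return [wav[i:i + chunk] for i in range(0, len(wav) - chunk + 1, chunk)]

# A 5-second dummy signal yields two full 2-second segments.
segments = segment_2sec(np.zeros(5 * 16000, dtype=np.float32))
```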
Curate the train and test splits:

```shell
cd audio_encoder
python unav_curate.py --train_or_test train
python unav_curate.py --train_or_test test
```
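Conceptually, the curation step writes a per-split file list. A hypothetical sketch — the 90/10 ratio, the seeded shuffle, and the JSON output are assumptions; `unav_curate.py` may instead use a fixed split file:

```python
import json, random
from pathlib import Path

def curate(split: str, audio_dir: str, out_json: str,
           test_frac: float = 0.1) -> list[str]:
    """Write the file list for one split as JSON and return it.
    A fixed seed makes the train/test partition reproducible across calls."""
    files = sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
    rng = random.Random(0)  # fixed seed -> identical shuffle for both splits
    rng.shuffle(files)
    n_test = int(len(files) * test_frac)
    chosen = files[:n_test] if split == "test" else files[n_test:]
    Path(out_json).write_text(json.dumps(chosen))
    return chosen
```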
Generate text prompts for the train and test segments:

```shell
cd audio_detection
python make_prompt.py --audio_dir "../unav_dataset/data/unav100/audio_segments_2sec/train" --json_dir "../audio_encoder/text_prompt"
python make_prompt.py --audio_dir "../unav_dataset/data/unav100/audio_segments_2sec/test" --json_dir "../audio_encoder/text_prompt"
```
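`make_prompt.py` presumably pairs each 2-second segment with a text prompt derived from the audio-detection model. A hypothetical sketch of such an output format — the prompt template and JSON layout are assumptions, and in the repository the labels come from the detector rather than from the caller:

```python
import json
from pathlib import Path

def make_prompts(labels: dict[str, str], json_path: str) -> dict[str, str]:
    """Map segment names to text prompts and dump them as JSON.
    The 'the sound of ...' template is an assumption, not the repo's."""
    prompts = {name: f"the sound of {label}" for name, label in labels.items()}
    Path(json_path).write_text(json.dumps(prompts, indent=2))
    return prompts
```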
Train the audio encoder:

```shell
cd audio_encoder
python unav_train_audio_encoder_tpos.py
```
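The training objective can be illustrated with a toy version: learn a projection from audio features into a text-embedding space by minimising a reconstruction loss. The real script trains a TPoS-style deep encoder against text embeddings; the linear model, the dimensions, and the MSE loss below are simplifications for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((256, 64))     # toy audio features
true_proj = rng.standard_normal((64, 32)) / 8.0  # hidden ground-truth map
text_embs = audio_feats @ true_proj              # synthetic text-embedding targets

W = np.zeros((64, 32))                           # learned projection
lr = 0.1
initial_mse = float(np.mean((audio_feats @ W - text_embs) ** 2))
for _ in range(300):
    pred = audio_feats @ W
    # Gradient of mean squared error w.r.t. W
    grad = audio_feats.T @ (pred - text_embs) / len(audio_feats)
    W -= lr * grad
final_mse = float(np.mean((audio_feats @ W - text_embs) ** 2))
```

Gradient descent drives the projected audio features toward the target embeddings; the real encoder does the same at scale with a nonlinear network.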