TalkLip net

This repo is the official implementation of 'Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert', CVPR 2023.

Paper

Prerequisite

pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
Install AV-Hubert by following its installation instructions.
Install supplementary packages via pip install -r requirements.txt
Install ffmpeg. We adopt version=4.3.2. Please double-check the waveforms extracted from mp4 files: an extracted waveform should not begin with a run of zero-valued samples. If you use anaconda, you can refer to conda install -c conda-forge ffmpeg==4.2.3
Download the pre-trained face detector checkpoint and place it at face_detection/detection/sfd/s3fd.pth. Alternative link.
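
The check on extracted waveforms mentioned above can be sketched as follows. This is a minimal illustration, not part of the repository: the function name count_leading_zero_samples is hypothetical, and it assumes the extracted file is 16-bit PCM WAV. It only counts zero-valued samples at the start of the file; deciding how many are acceptable is left to you.

```python
import array
import wave


def count_leading_zero_samples(wav_path):
    """Count zero-valued samples at the start of a 16-bit PCM WAV file.

    Hypothetical helper (not from the TalkLip repo). For multi-channel
    audio the count is over interleaved samples, not frames.
    """
    with wave.open(wav_path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))

    count = 0
    for sample in samples:
        if sample != 0:
            break
        count += 1
    return count
```

A large count on a freshly extracted file suggests that the ffmpeg build in use pads the start of the audio, which is what the note above warns about.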

Dataset and pre-processing

Download LRS2 for training and evaluation. Note that we do not use the pretrain set.
Download LRW for evaluation.
To extract waveforms from mp4 files:

python preparation/audio_extract.py --filelist $filelist  --video_root $video_root --audio_root $audio_root

$filelist: a txt file containing the names of videos. We provide the filelist of the LRW test set as an example in the datalist directory.
$video_root: root directory of videos. In the LRS2 dataset, $video_root should contain directories like "639XXX". In the LRW dataset, $video_root should contain directories like "ABOUT".
$audio_root: root directory for saving waveforms
other optional arguments: please refer to audio_extract.py
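
To make the relationship between $filelist, $video_root and $audio_root concrete, here is a rough sketch of how one filelist entry could be turned into an ffmpeg call. It is not the code of audio_extract.py: the function name build_extract_cmd is hypothetical, and it assumes that filelist entries are relative paths without the .mp4 extension and that the audio is wanted as 16 kHz mono. The ffmpeg options used (-y, -i, -vn, -ac, -ar) are standard ones.

```python
import os


def build_extract_cmd(video_root, audio_root, name, sample_rate=16000):
    """Build an ffmpeg command extracting audio for one filelist entry.

    Hypothetical helper: `name` is assumed to be a relative path without
    extension; the mono / 16 kHz choice is an assumption as well.
    """
    video_path = os.path.join(video_root, name + ".mp4")
    wav_path = os.path.join(audio_root, name + ".wav")
    cmd = [
        "ffmpeg", "-y",           # overwrite an existing output file
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mono
        "-ar", str(sample_rate),  # target sampling rate in Hz
        wav_path,
    ]
    return cmd, wav_path
```

The returned list can be passed to subprocess.run(cmd, check=True) after creating the parent directory of wav_path.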

To detect bounding boxes in videos and save them:

python preparation/bbx_extract.py --filelist $filelist  --video_root $video_root --bbx_root $bbx_root --gpu $gpu

$bbx_root: a root directory for saving detected bounding boxes
$gpu: index of the GPU on which to run bbx_extract, for example 3.

To accelerate bbx_extract via multi-thread processing, you can use the bash script below. Before running it, revise the variables on its lines 2 to 9 to match your own machine:

sh preprocess.sh

$file_list_dir: a directory which contains train.txt, valid.txt, test.txt of LRS2 dataset
$num_thread: number of threads to use. Do not exceed 8 on a 24GB GPU or 4 on a 12GB GPU.
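
One way to picture multi-thread preprocessing is to split a filelist into $num_thread shards and process each shard in its own worker. The sketch below shows only the splitting step; it is an assumption about how such a split could be done, not a description of what preprocess.sh does, and split_filelist is a hypothetical name.

```python
def split_filelist(names, num_thread):
    """Split video names into `num_thread` roughly equal shards.

    Hypothetical helper: entry i goes to shard i % num_thread, so shard
    sizes differ by at most one.
    """
    if num_thread < 1:
        raise ValueError("num_thread must be at least 1")
    return [names[i::num_thread] for i in range(num_thread)]
```

Each shard can then be written to its own txt file and passed to bbx_extract.py through --filelist.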

Checkpoints

Model	Description	Link
TalkLip (g)	TalkLip net with the global audio encoder	Link
TalkLip (g+c)	TalkLip net with the global audio encoder and contrastive learning	Link
Lip reading observer 1	AV-hubert (large) fine-tuned on LRS2	Link
Lip reading observer 2	Conformer lip-reading network	Link
Lip reading expert	lip-reading network for training of talking face generation	Link

Train

python train.py --file_dir $file_list_dir --video_root $video_root --audio_root $audio_root \
--bbx_root $bbx_root --word_root $word_root --avhubert_root $avhubert_root --avhubert_path $avhubert_path \
--checkpoint_dir $checkpoint_dir --log_name $log_name --cont_w $cont_w --lip_w $lip_w --perp_w $perp_w \
--gen_checkpoint_path $gen_checkpoint_path --disc_checkpoint_path $disc_checkpoint_path

$file_list_dir: a directory which contains train.txt, valid.txt, test.txt of LRS2 dataset
$word_root: root directory of text annotation. Normally, it should be equal to $video_root, as LRS2 dataset puts a video file ".mp4" and its corresponding text file ".txt" in the same directory.
$avhubert_root: path to the root of avhubert (should look like xxx/av_hubert/avhubert)
$avhubert_path: path to the downloaded Lip reading expert checkpoint listed under Checkpoints
$checkpoint_dir: a directory for saving TalkLip checkpoints
$log_name: name of log file
$cont_w: weight of contrastive learning loss (default: 1e-3)
$lip_w: weight of lip reading loss (default: 1e-5)
$perp_w: weight of perceptual loss (default: 0.07)
$gen_checkpoint_path (optional): path of a generator checkpoint, if you want to resume training from a checkpoint
$disc_checkpoint_path (optional): path of a discriminator checkpoint, if you want to resume training from a checkpoint
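
If you launch training from a script, the argument list can be assembled so that the two optional checkpoint flags are passed only when resuming. This is a small sketch: build_train_cmd is a hypothetical helper, and only the flag names come from the command above.

```python
def build_train_cmd(required_args, gen_checkpoint_path=None,
                    disc_checkpoint_path=None):
    """Assemble the train.py command line as a list of strings.

    Hypothetical helper: `required_args` maps flag names (without the
    leading "--") to values; checkpoint flags are added only if given.
    """
    cmd = ["python", "train.py"]
    for flag, value in required_args.items():
        cmd.extend(["--" + flag, str(value)])
    if gen_checkpoint_path is not None:
        cmd.extend(["--gen_checkpoint_path", str(gen_checkpoint_path)])
    if disc_checkpoint_path is not None:
        cmd.extend(["--disc_checkpoint_path", str(disc_checkpoint_path)])
    return cmd
```

The result can be handed to subprocess.run(cmd, check=True); starting from scratch simply means leaving both checkpoint arguments as None.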

Note: Discriminator losses may occasionally diverge during training (rising to values close to 100). If this happens, stop the training and resume it from a reliable checkpoint.
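
A simple way to spot this situation automatically is to watch the most recent discriminator losses. The sketch below is not part of train.py: the function name, the threshold of 50 and the patience of 3 steps are all assumptions chosen for illustration, based only on the observation that diverged losses sit near 100.

```python
def discriminator_diverged(losses, threshold=50.0, patience=3):
    """Return True if the last `patience` losses all exceed `threshold`.

    Hypothetical check; `threshold` and `patience` are illustrative
    defaults, not values taken from the TalkLip code.
    """
    if len(losses) < patience:
        return False
    return all(loss > threshold for loss in losses[-patience:])
```

When the check fires, stop the run and restart it with --gen_checkpoint_path and --disc_checkpoint_path pointing at checkpoints saved before the losses blew up.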

Test

The following command synthesizes the videos used for quantitative evaluation in our paper.

python inf_test.py --filelist $filelist --video_root $video_root --audio_root $audio_root \
--bbx_root $bbx_root --save_root $syn_video_root --ckpt_path $talklip_ckpt --avhubert_root $avhubert_root

$syn_video_root: root directory for saving synthesized videos
$talklip_ckpt: a trained checkpoint of TalkLip net

Demo

inf_demo.py was updated on 4 April: the earlier version assumed that the height and width of the output video were the same when setting up cv2.VideoWriter(). Please ensure that the sampling rate of the input audio file is 16000 Hz.

If you want to reenact the lip movement of a video with a different speech, you can use the following command.

python inf_demo.py --video_path $video_file --wav_path $audio_file --ckpt_path $talklip_ckpt --avhubert_root $avhubert_root

$video_file: a video file (ending with .mp4)
$audio_file: an audio file (ending with .wav)
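
Before running the demo, the 16000 Hz requirement on $audio_file can be verified with the standard library. This is a minimal sketch with a hypothetical function name; Python's wave module reads PCM WAV files only, so other encodings will raise an error rather than return a result.

```python
import wave


def has_expected_sample_rate(wav_path, expected_hz=16000):
    """Return True if the PCM WAV file at `wav_path` uses `expected_hz`.

    Hypothetical helper (not from the TalkLip repo).
    """
    with wave.open(wav_path, "rb") as wf:
        return wf.getframerate() == expected_hz
```

If the check fails, resample the audio to 16000 Hz (for example with ffmpeg's -ar option) before passing it to inf_demo.py.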

Evaluation

Please follow the README.md in the evaluation directory.
