The training and evaluation code for the ViCo challenge (we achieved 3rd place in the first track; team name: metah).
- If the resolution/quality of the generated images is higher than that of the ground truth (for example, after post-processing with GFPGAN), the FID metric gets worse while the CPBD metric gets better, which is intuitive. You can refer to our experiment for more details (a minimal evaluation sketch also follows this list): https://github.com/audio-visual/talking-head-generation--vico-challenge-2023-/blob/main/eval/fid_cpbd_eval.ipynb
- A good render is very important. In our experiments, we found that adding head pose or cropping improves LipLMD and PoseL1, but the moving head distorts the face boundary and background, which hurts the visual-quality-related metrics. We look forward to seeing how the first- and second-place solutions address this issue.
- At the beginning, we made a mistake with the first-frame data: the first-frame image may not correspond exactly to the 3DMM coefficients. This mismatch can cause some images not to be driven correctly by the render (we only tried the face render; results may differ for other renders).
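To reproduce the FID/CPBD comparison quickly, here is a minimal sketch. It assumes the pytorch-fid and python-cpbd packages and that ground-truth and generated frames have already been extracted into two folders (folder names are placeholders); the actual experiment is in eval/fid_cpbd_eval.ipynb.

```python
# Minimal FID / CPBD evaluation sketch (assumes the pytorch-fid and python-cpbd
# packages; folder names are placeholders). See eval/fid_cpbd_eval.ipynb for the
# actual experiment.
from pathlib import Path

import numpy as np
from PIL import Image
from cpbd import compute as cpbd_compute                      # sharpness metric (higher = sharper)
from pytorch_fid.fid_score import calculate_fid_given_paths  # distribution distance (lower = better)

gt_dir, gen_dir = "frames/ground_truth", "frames/generated"

# FID compares the two image distributions, so GFPGAN-style post-processing that
# moves generated frames away from the ground truth tends to increase it.
fid = calculate_fid_given_paths([gt_dir, gen_dir], batch_size=32, device="cuda", dims=2048)

# CPBD only measures the blur of the generated frames, so sharpening improves it.
cpbd_scores = [cpbd_compute(np.array(Image.open(p).convert("L")))
               for p in sorted(Path(gen_dir).glob("*.png"))]

print(f"FID: {fid:.3f}  CPBD: {np.mean(cpbd_scores):.3f}")
```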
The left column collects good results and the right column collects bad results (head-pose driving, method 1).

good results | bad results |
---|---|
good_results_head_pose_method1.mp4 | bad_results_head_pose_method1.mp4 |
The left column collects good results and the right column collects bad results (lip-only driving).

good results | bad results |
---|---|
good_results_onlylip_method1.mp4 | bad_results_onlylip_method1.mp4 |
good_results_onlylip_method2.mp4 | bad_results_onlylip_method2.mp4 |
We actually propose two methods; either of them can achieve 3rd place.
Method | SSIM↑ | CPBD↑ | PSNR↑ | FID↓ | CSIM↑ | PoseL1↓ | ExpL1↓ | AVOffset→0 | AVConf↑ | LipLMD↓ |
---|---|---|---|---|---|---|---|---|---|---|
method1 | 0.613 | 0.204 | 17.811 | 28.829 | 0.540 | 0.101 | 0.151 | -1.733 | 2.541 | 12.192 |
method2 | 0.609 | 0.196 | 17.579 | 29.184 | 0.538 | 0.103 | 0.160 | -0.422 | 1.455 | 12.224 |
Due to competition time limits, the engineering code has not been cleaned up; it is provided for reference only.
checkpoints for face-render (from sadtalker):
- mapping.pth.tar https://drive.google.com/file/d/1fXggXOx1XPP799Ogc1Orv6RVIdbeXDwV/view?usp=drive_link
- facevid2vid.pth.tar https://drive.google.com/file/d/1q1VVz4VRVXmBzLpWtmn1NIeIPbBGVeVX/view?usp=drive_link
checkpoints for wav2lip:
checkpoints for 3dmm prediction transformer/lstm:
- emotion transformer: https://drive.google.com/file/d/1mOHW2eLrGNHIQIZsKuJ-z0EOF53gwCF_/view?usp=drive_link
- head motion lstm: https://drive.google.com/file/d/1ffef4k0n2Z7HraFiA2PrJKYwSNeoWpjk/view?usp=drive_link
https://drive.google.com/file/d/1NF7hbE9M-GABAZnGu7p7ZDYFRUaqxQxb/view?usp=drive_link
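If you prefer scripted downloads, something like the following works; the gdown package and the checkpoints/ output folder are assumptions, not part of the repo.

```python
# Hypothetical download helper (assumes the gdown package; output folder is an example).
import os
import gdown

os.makedirs("checkpoints", exist_ok=True)
urls = {
    "mapping.pth.tar": "https://drive.google.com/file/d/1fXggXOx1XPP799Ogc1Orv6RVIdbeXDwV/view?usp=drive_link",
    "facevid2vid.pth.tar": "https://drive.google.com/file/d/1q1VVz4VRVXmBzLpWtmn1NIeIPbBGVeVX/view?usp=drive_link",
}
for name, url in urls.items():
    # fuzzy=True lets gdown parse the Google Drive share link directly
    gdown.download(url=url, output=os.path.join("checkpoints", name), fuzzy=True)
```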
extract keypoints and the 3dmm coefficients for the first frames
Note: this step requires setting up the environment according to deep3d_pytorch. We actually use two separate environments: one for preprocessing (deep3d_pytorch) and one for everything else (sadtalker).
Our data structure is slightly different from the baseline. The baseline uses:
data/train/xx.mp4(.png)
We do not use a train or test subfolder:
data/xx.mp4(.png)
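If your data is still in the baseline layout, a small helper like the one below flattens it; this script is a hypothetical convenience, not part of the repo.

```python
# Hypothetical helper: flatten data/train and data/test into data/ (the layout we use).
import shutil
from pathlib import Path

data_root = Path("data")
for split in ("train", "test"):
    split_dir = data_root / split
    if split_dir.is_dir():
        for f in split_dir.iterdir():          # xx.mp4 / xx.png files
            shutil.move(str(f), str(data_root / f.name))
        split_dir.rmdir()                      # drop the now-empty split folder
```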
extract facial landmarks from first frames
python extract_kp_images.py \
--input_dir ../../data/talking_head/first_frames/ \
--output_dir ../../data/talking_head/keypoints/ \
--device_ids 0 \
--workers 2
extract coefficients for first frames
# note: we renamed the pretrained checkpoint folder to "official", hence --name=official
python face_recon_images.py \
--input_dir ../../data/talking_head/first_frames/ \
--keypoint_dir ../../data/talking_head/keypoints/ \
--output_dir ../../data/talking_head/recons/ \
--inference_batch_size 128 \
--name=official \
--epoch=20 \
--model facerecon
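To sanity-check the reconstruction output, you can inspect the saved coefficients. The snippet below assumes the recon script writes one .mat file of coefficients per image (as deep3d_pytorch does); the exact output format may differ in your setup.

```python
# Quick sanity check of the reconstructed 3DMM coefficients (assumes .mat output per frame).
from pathlib import Path
from scipy.io import loadmat

for mat_path in sorted(Path("../../data/talking_head/recons/").glob("**/*.mat"))[:3]:
    coeffs = loadmat(str(mat_path))
    shapes = {k: v.shape for k, v in coeffs.items() if not k.startswith("__")}
    print(mat_path.name, shapes)
```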
predict the 3dmm coefficients for the test audios and feed them to the render
python inference_transformer1.py
pass the rendered video to wav2lip
# cd wav2lip
python inference_dataset.py
extract keypoints and the 3dmm coefficients for the wav2lip generated videos
re-render videos using the coefficients obtained from step 4
python inference1_rotation_wav2lip.py
combine audio and generated video
python combine_video_audio.py
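If combine_video_audio.py is unavailable, the muxing can be reproduced roughly as follows; this is a sketch assuming ffmpeg is on the PATH, and the file names are placeholders rather than the repo's actual paths.

```python
# Rough equivalent of the mux step (assumes ffmpeg on PATH; paths are placeholders).
import subprocess

def mux(video_path: str, audio_path: str, out_path: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", video_path,   # re-rendered (silent) video
         "-i", audio_path,   # driving audio
         "-c:v", "copy",     # keep the rendered frames untouched
         "-c:a", "aac",
         "-shortest", out_path],
        check=True,
    )

mux("results/xx_rendered.mp4", "data/audio/xx.wav", "results/xx_final.mp4")
```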
Unfortunately, this part of the code has been lost, but it is overall very simple. For producing the training data, refer to:
prepare_traning_batches.py
Inspiration: the movement pattern of a person's lips is positively correlated with their facial appearance. For example, if a person's lips are large, their range of lip movement is also greater than that of people with small lips.
Training difference: we use the ArcFace model to extract face features from the first frame and feed them to the original emotion-prediction transformer.
The results show that this change helps to improve the final lip-speech consistency under limited training data (only 430 items).
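As a minimal sketch of the idea (shapes, layer sizes, and fusing the identity feature by addition are assumptions, not the exact competition model), the ArcFace embedding of the first frame can be injected into the audio-to-expression transformer like this:

```python
# Sketch only: identity-conditioned 3DMM expression prediction.
# Shapes and layer sizes are assumptions, not the exact competition model.
import torch
import torch.nn as nn

class IdentityConditionedExpNet(nn.Module):
    def __init__(self, audio_dim=80, id_dim=512, d_model=256, exp_dim=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.id_proj = nn.Linear(id_dim, d_model)        # ArcFace feature of the first frame
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(d_model, exp_dim)          # per-frame 3DMM expression coefficients

    def forward(self, audio_feats, id_feat):
        # audio_feats: (B, T, audio_dim); id_feat: (B, id_dim) from ArcFace
        x = self.audio_proj(audio_feats) + self.id_proj(id_feat).unsqueeze(1)
        return self.head(self.encoder(x))

net = IdentityConditionedExpNet()
exp = net(torch.randn(2, 100, 80), torch.randn(2, 512))   # -> (2, 100, 64)
```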