talklip's Introduction

TalkLip net

This repo is the official implementation of 'Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert', CVPR 2023.

Arxiv | Paper

🔥 News

  1. We have uploaded a Talking_face_demo.pptx to this repository, which contains some demo videos.
  2. Fixed the GPU out-of-memory error in train.py. Running train.py with a batch_size of 8 requires approximately 24GB of memory; in some rare cases it may need more than 24GB and trigger an error. We have resolved this issue with a try/except mechanism. -- 19/July/2023
  3. We have uploaded a checkpoint of the discriminator, as requested in the issue.
  4. We have uploaded an eval_lrs.sh to the evaluation folder, which allows you to evaluate all metrics on LRS2 with a single command.

Prerequisites

  1. pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
  2. Install AV-Hubert by following its installation instructions.
  3. Install supplementary packages via pip install -r requirement.txt
  4. Install ffmpeg. We adopt version 4.3.2. Please double-check the waveforms extracted from mp4 files: they should not begin with a prefix of zero samples (see the sanity-check sketch after this list). If you use Anaconda, you can refer to conda install -c conda-forge ffmpeg==4.2.3
  5. Download the pre-trained checkpoint of the face detector and put it at face_detection/detection/sfd/s3fd.pth. Alternative link.
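To verify step 4, the snippet below is a minimal sanity check, not part of the repository, that an extracted waveform is 16 kHz, mono, and does not begin with a long run of zero samples; the file path and the zero-run threshold are hypothetical.

import numpy as np
from scipy.io import wavfile

def check_waveform(path, max_leading_zeros=1600):  # 1600 samples = 0.1 s at 16 kHz
    sample_rate, wav = wavfile.read(path)
    assert sample_rate == 16000, f"unexpected sample rate: {sample_rate}"
    assert wav.ndim == 1, "expected a mono waveform"
    # index of the first non-zero sample; the whole length if the file is silent
    leading_zeros = int(np.argmax(wav != 0)) if np.any(wav != 0) else len(wav)
    assert leading_zeros < max_leading_zeros, "waveform starts with zeros; check your ffmpeg version"

check_waveform("example.wav")  # hypothetical path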

Dataset and pre-processing

  1. Download LRS2 for training and evaluation. Note that we do not use the pretrain set.
  2. Download LRW for evaluation.
  3. To extract waveforms from mp4 files:
python preparation/audio_extract.py --filelist $filelist  --video_root $video_root --audio_root $audio_root
  • $filelist: a txt file containing the names of videos. We provide the filelist of the LRW test set as an example in the datalist directory.
  • $video_root: root directory of videos. In the LRS2 dataset, $video_root should contain directories like "639XXX". In the LRW dataset, $video_root should contain directories like "ABOUT".
  • $audio_root: root directory for saving waveforms
  • other optional arguments: please refer to audio_extract.py
  4. To detect bounding boxes in videos and save them:
python preparation/bbx_extract.py --filelist $filelist  --video_root $video_root --bbx_root $bbx_root --gpu $gpu
  • $bbx_root: a root directory for saving detected bounding boxes

  • $gpu: the index of the GPU on which to run bbx_extract, e.g. 3.

    If you want to accelerate bbx_extract via multi-thread processing, you can use the following bash script (a Python sketch of the same idea follows this list):

    Please revise the variables on lines 2-9 of preprocess.sh so that the script is compatible with your own machine.

sh preprocess.sh
  • $file_list_dir: a directory which contains train.txt, valid.txt, test.txt of LRS2 dataset
  • $num_thread: the number of threads to use. Please do not let it exceed 8 with a 24GB GPU, or 4 with a 12GB GPU.
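For illustration, here is a minimal Python sketch of the same multi-process idea; it is not the contents of preprocess.sh. It splits a filelist into chunks and launches preparation/bbx_extract.py on each chunk with the arguments documented above. The chunk-file naming and the single-GPU assignment are assumptions.

import subprocess
from pathlib import Path

def parallel_bbx_extract(filelist, video_root, bbx_root, num_proc=4, gpu=0):
    names = Path(filelist).read_text().splitlines()
    chunk_size = (len(names) + num_proc - 1) // num_proc
    procs = []
    for i in range(num_proc):
        chunk = names[i * chunk_size:(i + 1) * chunk_size]
        if not chunk:
            continue
        # write a temporary per-process filelist (hypothetical naming scheme)
        chunk_file = Path(bbx_root) / f"filelist_part{i}.txt"
        chunk_file.write_text("\n".join(chunk) + "\n")
        procs.append(subprocess.Popen([
            "python", "preparation/bbx_extract.py",
            "--filelist", str(chunk_file),
            "--video_root", video_root,
            "--bbx_root", bbx_root,
            "--gpu", str(gpu),
        ]))
    for p in procs:
        p.wait()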

Checkpoints

Model | Description | Link
TalkLip (g) | TalkLip net with the global audio encoder | Link
TalkLip (g+c) | TalkLip net with the global audio encoder and contrastive learning | Link
Lip reading observer 1 | AV-Hubert (large) fine-tuned on LRS2 | Link
Lip reading observer 2 | Conformer lip-reading network | Link
Lip reading expert | Lip-reading network for the training of talking face generation | Link
Discriminator | Discriminator of GAN | Link

Train

Some AV-Hubert files need to be modified (xxx below denotes the parent directory of your av_hubert checkout):

rm xxx/av_hubert/avhubert/hubert_asr.py
cp avhubert_modification/hubert_asr_wav2lip.py xxx/av_hubert/avhubert/hubert_asr.py

rm xxx/av_hubert/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py
cp avhubert_modification/label_smoothed_cross_entropy_wav2lip.py xxx/av_hubert/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py

You can train with the following command.

python train.py --file_dir $file_list_dir --video_root $video_root --audio_root $audio_root \
--bbx_root $bbx_root --word_root $word_root --avhubert_root $avhubert_root --avhubert_path $avhubert_path \
--checkpoint_dir $checkpoint_dir --log_name $log_name --cont_w $cont_w --lip_w $lip_w --perp_w $perp_w \
--gen_checkpoint_path $gen_checkpoint_path --disc_checkpoint_path $disc_checkpoint_path
  • $file_list_dir: a directory which contains train.txt, valid.txt, test.txt of LRS2 dataset
  • $word_root: root directory of the text annotations. Normally it should be equal to $video_root, as the LRS2 dataset puts each video file (.mp4) and its corresponding text file (.txt) in the same directory.
  • $avhubert_root: root directory of AV-Hubert (should look like xxx/av_hubert)
  • $avhubert_path: path of the Lip reading expert checkpoint listed above (download it first)
  • $checkpoint_dir: a directory for saving TalkLip checkpoints
  • $log_name: name of log file
  • $cont_w: weight of contrastive learning loss (default: 1e-3)
  • $lip_w: weight of lip reading loss (default: 1e-5)
  • $perp_w: weight of perceptual loss (default: 0.07)
  • $gen_checkpoint_path (optional): enter the path of a generator checkpoint if you want to resume training from a checkpoint
  • $disc_checkpoint_path (optional): enter the path of a discriminator checkpoint if you want to resume training from a checkpoint

Note: The discriminator loss may occasionally diverge during training (approaching 100). If this happens, please stop training and resume from a reliable checkpoint.

Test

The command below synthesizes videos for the quantitative evaluation in our paper.

python inf_test.py --filelist $filelist --video_root $video_root --audio_root $audio_root \
--bbx_root $bbx_root --save_root $syn_video_root --ckpt_path $talklip_ckpt --avhubert_root $avhubert_root
  • $filelist: a txt file containing names of all test files, e.g. xxx/mvlrs_v1/test.txt.
  • $syn_video_root: root directory for saving synthesized videos
  • $talklip_ckpt: a trained checkpoint of TalkLip net

Demo

I updated inf_demo.py on 4 April, as it previously assumed that the height and width of output videos are the same when setting cv2.VideoWriter(). Please ensure the sampling rate of the input audio file is 16000 Hz.

If you want to reenact the lip movement of a video with a different speech, you can use the following command.

python inf_demo.py --video_path $video_file --wav_path $audio_file --ckpt_path $talklip_ckpt --avhubert_root $avhubert_root
  • $video_file: a video file (ending with .mp4)
  • $audio_file: an audio file (ending with .wav)

Please ensure that the input audio has only one channel (mono). A conversion sketch is given below.
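The following is a minimal sketch, not part of the repository, that converts an arbitrary audio file to the mono, 16000 Hz WAV the demo expects by calling ffmpeg; the input and output paths are hypothetical.

import subprocess

def to_mono_16k(src, dst):
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-ac", "1",       # one audio channel (mono)
        "-ar", "16000",   # 16000 Hz sampling rate
        dst,
    ], check=True)

to_mono_16k("speech.mp3", "speech_16k.wav")  # hypothetical paths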

Evaluation

Please follow the README.md in the evaluation directory.

Citation

@inproceedings{wang2023seeing,
  title={Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert},
  author={Wang, Jiadong and Qian, Xinyuan and Zhang, Malu and Tan, Robby T and Li, Haizhou},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14653--14662},
  year={2023}
}


talklip's Issues

[BUG] A bug I found in the function audio_visual_pad

Excellent work, but I found some exceptions when running the demo. The original code is as follows:

def audio_visual_pad(audio_feats, video_feats):
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        video_feats = torch.repeat_interleave(video_feats, repeat, dim=0)
    diff = len(audio_feats) - len(video_feats)
    video_feats = video_feats[:diff]
    return video_feats, repeat, diff

In my opinion, this code processes the audio features and video features to make them equal in length. It checks the value of diff to decide whether the video features need to be repeated: if diff is greater than 0, the audio features are longer than the video features, so the video features are repeated until they are at least as long as the audio features. The code then recomputes the length difference and slices the video features so that their length matches the audio features, since after the repeat operation the video features may be longer than the audio features. However, the final slicing should first check whether the video features are actually longer than the audio features; otherwise, if the input video features and audio features are exactly equal in length, diff is 0 and video_feats[:0] returns a meaningless (empty) result.
This is my modified version:

def audio_visual_pad(audio_feats, video_feats):
    diff = len(audio_feats) - len(video_feats)
    repeat = 1
    if diff > 0:
        repeat = math.ceil(len(audio_feats) / len(video_feats))
        video_feats = torch.repeat_interleave(video_feats, repeat, dim=0)
    diff = len(audio_feats) - len(video_feats)
    if diff < 0:
        video_feats = video_feats[:diff]
    return video_feats, repeat, diff

Looking forward to your reply!

Average Confidence value range 1~2?

I used the LSE-C from your code. I got 1.88 for my result, 2.00 for the Wav2Lip inference result, and 1.89 for the Audio2Head inference result. But from what I have seen in other papers, the score range is 6~7. What is wrong with my quantitative LSE-C result?

The output video frame count increases unexpectedly

I have tried with inf_demo.py, but I found that the frame count of the output video was doubled.

The input video file is 10s/25fps/250frames, but I found the duration of the output video file is 20s/25fps/501frames.

I found that the length of the audio features array is 501.

Maybe the audio/video frames are not aligned in my case. I am not sure if there are any fps/sample-rate constraints in your project.

Waiting for your reply, thank you.


You can find my input/output video/audio files at the following link.

talklip-issue.zip

I run the inf_demo.py with the following command:
python inf_demo.py --video_path ./input.mp4 --wav_path ./input.wav --ckpt_path ./checkpoints/global_contrastive.pth --avhubert_root /root/workspace/av_hubert

ffmpeg version is 4.2.3 [screenshot].

Some debug logs [screenshot].

Project Page ?

Hello, I'm asking whether there is going to be a project page for this repo, to see the quality of the visual results of this method and compare it to others such as Wav2Lip. If not, could you upload one of the results so I can see the quality of this method?

For example, you could upload either of these videos [screenshot].

Discriminator Forward Pass

I'd like to thank the authors for their incredible effort.

I have a question regarding the discriminator forward pass, specifically the get_lower_half() function. Shouldn't this function return the lower half along both width and height dimensions? It seems to be doing it along one axis only, so the returned tensor would be of shape: [N, 3, 48, 96]. I would appreciate your clarification on this!

[Bug] TypeError: 'NoneType' object is not subscriptable in "utils/data_avhubert.py"

Hi! I encountered the following error when using the demo for inference:

Traceback (most recent call last):
  File "inf_demo.py", line 284, in <module>
    synt_demo(fa, device, model, args)
  File "inf_demo.py", line 248, in synt_demo
    for j, im in enumerate(processed_img[0]):
TypeError: 'NoneType' object is not subscriptable

I checked that processed_img comes from the emb_roi2im function in utils/data_avhubert.py, which I suspect is missing a return statement.

def emb_roi2im(pickedimg, imgs, bbxs, pre, device):
    trackid = 0
    height, width, _ = imgs[0][0].shape
    for i in range(len(pickedimg)):
        idimg = pickedimg[i]
        imgs[i] = imgs[i].float().to(device)
        for j in range(len(idimg)):
            bbx = bbxs[i][idimg[j]]
            if bbx[2] > width: bbx[2] = width
            if bbx[3] > height: bbx[3] = height
            resize2ori = transforms.Resize([bbx[3] - bbx[1], bbx[2] - bbx[0]])
            try:
                resized = resize2ori(pre[trackid + j] * 255.).permute(1, 2, 0)
                imgs[i][idimg[j]][bbx[1]:bbx[3], bbx[0]:bbx[2], :] = resized
            except:
                print(bbx, resized.shape)
                import sys
                sys.exit()
        trackid += len(idimg)

Based on my intuition, I added return imgs at the end of the function, and the inference code then ran normally.

Effect of FaceFormer in the paper

Hi, I noticed that the qualitative results you showed in the paper include FaceFormer. But as far as I know, FaceFormer is a 3D-mesh animation algorithm.

May I ask which codebase you used in the paper? Did you directly use the official FaceFormer code, adjust the video encoder and decoder, and retrain it?


Looking forward to your reply.
Best.

how to install avhubert?

When I try to run the demo, it shows the error "no module avhubert", but I don't know how to install avhubert.

Request for sharing the pre-trained discriminator weights

Hi there,

Congrats on this awesome work!

I have been trying to fine-tune the model on a custom dataset, which needs to access the pre-trained discriminator model. Could you please share its weights?

Also, do I need the lip observer models during training?

Thanks in advance!

about output file tmp.avi and tmp.mp4

Good morning,

This is a good project. I just gave it a try and ran the following command:

python3 inf_demo.py --video_path ./input.mp4 --wav_path ./voice.wav --ckpt_path ./global_contrastive.pth --avhubert_root ./avhubert

It ran without error, but the output file tmp.avi is just 21 KB and cannot be played properly, and tmp.mp4 is 0 KB. Could you suggest what I did wrong, or what I missed? Thanks.

checkpoint

Sorry, I'm slightly confused by the wording of the document. Just to be clear: no pre-trained checkpoint is available, correct?

Has anybody come across this error?

@Sxjdwang
python3 inf_demo.py --video_path ./data/jtest.mp4 --wav_path ./data/jtest.wav --ckpt_path ./global_contrastive.pth --avhubert_root ./av_hubert/
Traceback (most recent call last):
  File "inf_demo.py", line 280, in <module>
    synt_demo(fa, device, model, args)
  File "inf_demo.py", line 234, in synt_demo
    prediction, _ = model(sample, inps, idAudio, spectrogram.shape[0])
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/ai/lzhh/digital_human/TalkLip/models/talklip.py", line 103, in forward
    enc_out = self.audio_encoder(**sample["net_input"])
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "./av_hubert/avhubert/hubert_asr.py", line 386, in forward
    x, padding_mask = self.w2v_model.extract_finetune(**w2v_args)
  File "./av_hubert/avhubert/hubert.py", line 704, in extract_finetune
    features_audio = self.forward_features(src_audio, modality='audio') # features: [B, F, T]
  File "./av_hubert/avhubert/hubert.py", line 541, in forward_features
    features = extractor(source)
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "./av_hubert/avhubert/hubert.py", line 327, in forward
    x = self.proj(x.transpose(1, 2))
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/paddle/python3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (250x160 and 104x768)

About lip-reading expert

Thank you very much for your work.
I would like to ask how you trained the lip-reading expert; I did not find this in your code or paper.
Did you fine-tune AV-Hubert on LRS2, and is the resulting checkpoint the lip-reading expert?
And what are the lip-reading observer weights responsible for?
Finally, thank you again for your work.

Runtime error with long videos >30s

With short videos everything is OK, but with more than ~20 s I get an error:
Video - 480x840 30fps Windows 11, 1080ti
RuntimeError: CUDA out of memory. Tried to allocate 3.29 GiB (GPU 0; 11.00 GiB total capacity; 7.10 GiB already allocated; 1.09 GiB free; 8.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

There is a bug that is ignored in the wav data processing.

If the input audio is multi-channel, the loaded wav data will have shape [16k*t, X], where X is the number of channels.
Then using L160 to extract the spectrogram inflates the temporal length by a factor of X.

https://github.com/Sxjdwang/TalkLip/blob/main/inf_demo.py#L160

So users need to ensure that the input audio has only one channel,

ffmpeg -i input.wav -ac 1 -ar 16000 output.wav  # -ac sets the number of audio channels

or revise L160 as follows.

import numpy as np
from python_speech_features import logfbank

if len(wav_data.shape) > 1:  # multi-channel input: use only the first channel
    audio_feats = logfbank(wav_data[:, 0], samplerate=sample_rate).astype(np.float32)  # [T, F]
else:
    audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]

Some issues hit the same bug, such as #7.

Additionally, train.py ignores this potential situation too.

Severe Blur in the mouth area

Dear Sir or Madam,

Thanks for making this projects open-sourced. Appreciate that.

But I find that I cannot get a sensible result. Most of the time there is severe blur in the mouth area, as the following video shows.

learn-english-00083.mp4

I assume this is because only one reference identity input is used, and it must be either open-mouth or closed-mouth. So within a single generation pass, the network cannot obtain both open-mouth and closed-mouth identity features of the face, which leads to heavy blur.

Please correct me if I was wrong.

Demo code failed to run

After loading the pre-trained weights when running the demo, an error message appears indicating that the pre-trained parameters do not match. How can I solve this problem?
[screenshot]

[Question] How to output long sequence video demo?

May I ask whether you have tried long-sequence inference? I am referring to a video input with a resolution of 96x96 and a frame rate of 25 fps that is longer than 30 seconds. I ran your demo code and got abnormal results, which confused me. Do I need to modify anything? Looking forward to your reply!

After executing an epoch during training, this error will appear. Has anyone encountered it?

Traceback (most recent call last):
  File "/data/wwp/TalkLip-main/train.py", line 531, in train
    average_sync_loss, valid_log = eval_model(data_loader['test'], avhubert, criterion, global_step, device, model['gen'], model['disc'], args.cont_w, recon_loss)
  File "/data/wwp/TalkLip-main/train.py", line 592, in eval_model
    lip_loss, sample_size, logs, enc_out = criterion(avhubert, sample)
  File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/wwp/TalkLip-main/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
    net_output = model(**sample["net_input"])
  File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/wwp/TalkLip-main/avhubert/hubert_asr.py", line 494, in forward
    ft = self.freeze_finetune_updates <= self.num_updates
  File "/data/anaconda3/envs/wwp_talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'

Is this a typo or a bug?

In the paper, the implementation details indicate that

Audio wavforms are preprocessed to mel-spectrogram with hop and window lengths, and mel bins are 12.5 ms, 50 ms, and 80.

But the hop and window lengths and the number of mel bins are 10 ms, 25 ms, and 26 in the function 'def fre_audio' of "inf_demo.py" and "class Talklipdata".

# train.py
L231: audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]
# inf_demo.py
L160: audio_feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]

The code uses the library's default values.
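For reference, the sketch below only illustrates the discrepancy described above; it is not a suggested change to the repository. It contrasts the python_speech_features defaults used in the code (10 ms hop, 25 ms window, 26 mel bins) with a call using the values stated in the paper. The file path is hypothetical, and a larger nfft is passed because a 50 ms window at 16 kHz exceeds the library's default FFT size.

import numpy as np
from python_speech_features import logfbank
from scipy.io import wavfile

sample_rate, wav_data = wavfile.read("example.wav")  # hypothetical 16 kHz mono file

# What train.py / inf_demo.py actually compute (library defaults):
# winstep=0.01 (10 ms hop), winlen=0.025 (25 ms window), nfilt=26 mel bins
audio_feats_default = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, F]

# The values stated in the paper: 12.5 ms hop, 50 ms window, 80 mel bins
audio_feats_paper = logfbank(
    wav_data, samplerate=sample_rate,
    winstep=0.0125, winlen=0.05, nfilt=80, nfft=1024,
).astype(np.float32)  # [T', 80]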

the face in output video is blurred

Hi, thanks for your great work! I tested TalkLip with my own video, but the generated face in the output video is blurred and shows a clear border with the background. The resolution of my test video is 1600x900.

IndexError: list index out of range

After starting training, this error is reported. The training step at which it appears is sometimes over 2000 and sometimes several hundred.
[screenshot]

Calculating WER using AV_hubert

Hi,
I am trying to calculate your model's WER performance on VoxCeleb2 using your provided AV_Hubert evaluation scripts. However, I do not understand what the '../datalist/test_partial.tsv' file in toavhform.py refers to.

Is 'datalist/' a directory in LRS2?
If so, can you please tell me what the expected output should be from toavhform.py so I can write it for voxceleb?

Also, I'm assuming ground truth will have to be provided from us to calculate WER?

Task state has no factory for attribute target_dictionary

My avhubert runs without any problem, but when I use the fairseq bundled with it to run the full training, it reports an error: AttributeError: Task state has no factory for attribute target_dictionary. May I ask which version of fairseq you are using? My fairseq version is 1.0.0a0+afc77bd.

High resolution video cause cuda out of memory

Dear authors, thank you for your wonderful project!
When trying to run inf_demo.py on a 10 s (30 fps) video with a resolution of 1920x1080, I encountered a CUDA out-of-memory error. I was running the script on an NVIDIA RTX A6000 with 48GB of memory, which I thought was enough for inference on a short 1080p video. Could you tell me what I am missing?

training script

Hello,

Thank you for the great work! Any news on when the training script will be released?

AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'

Running the training script according to the README, the error is as follows:

Traceback (most recent call last):
  File "train.py", line 740, in <module>
    train(device, {'gen': imGen, 'disc': imDisc}, avhubert, criterion, {'train': train_data_loader, 'test': test_data_loader},
  File "train.py", line 531, in train
    average_sync_loss, valid_log = eval_model(data_loader['test'], avhubert, criterion, global_step, device, model['gen'], model['disc'], args.cont_w, recon_loss)
  File "train.py", line 592, in eval_model
    lip_loss, sample_size, logs, enc_out = criterion(avhubert, sample)
  File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/ljy/TalkLip/av_hubert/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 79, in forward
    net_output = model(**sample["net_input"])
  File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/ljy/TalkLip/av_hubert/avhubert/hubert_asr.py", line 494, in forward
    ft = self.freeze_finetune_updates <= self.num_updates
  File "/data/anaconda3/envs/talklip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'num_updates'

Can someone help to see where the problem is? Thanks!
The code to run the script is as follows:
python train.py --file_dir /data/wwp/dataset/LRS2 --video_root /data/wwp/dataset/LRS2/mvlrs_v1/main --audio_root /data/wwp/dataset/LRS2/valid_audio \
  --bbx_root /data/wwp/dataset/LRS2/valid_bbx --word_root /data/wwp/dataset/LRS2/mvlrs_v1/main --avhubert_root ./av_hubert/avhubert --avhubert_path /data/ljy/checkpoints/TalkLip/lip_reading_expert.pt \
  --checkpoint_dir ./checkpoints/ --log_name log_talklip_01 --n_epoch 10 --ckpt_interval 50

lip_loss too large

Why is my lip_loss > 90 even for ground-truth videos?
Shouldn't the LabelSmoothedCrossEntropyCriterion in fairseq be near -log(1/n)?

Is there a way to reduce the use of GPU memory?

I have reduced the batch size for face detection, but it seems not enough when running av_hubert. Is there any way to fix this?
Traceback (most recent call last):
  File "inf_demo.py", line 280, in <module>
    synt_demo(fa, device, model, args)
  File "inf_demo.py", line 237, in synt_demo
    processed_img = emb_roi2im([idAudio], imgs, bbxs, prediction, device)
  File "/data/home/ss/TalkLip/utils/data_avhubert.py", line 174, in emb_roi2im
    imgs[i] = imgs[i].float().to(device)
RuntimeError: CUDA out of memory. Tried to allocate 23.75 GiB (GPU 0; 14.76 GiB total capacity; 960.37 MiB already allocated; 4.07 GiB free; 9.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

LRS2 and LRW permission request

It seems my university's email fails to send the request to the email address abroad. I would appreciate it if someone kind could share the dataset, or a few samples from it, with me. Thank you very much!

Contact email:[email protected]

Inconsistency Between Input and Output Video

Hi, great work!
I have a question. I managed to run the inference script that you provided.
However, I observed that the output dubbed video and the input video are no longer synced.
That is, if I combine these two videos with ffmpeg or any other package, the output dubbed video lags behind the original video.
For reference, I am attaching the input video and the output dubbed video (top:input, bottom:dubbed) concatenated with the input video.

Input Video
https://github.com/Sxjdwang/TalkLip/assets/26086758/588107c0-ac33-4e4a-9bc2-41e06eb3699f

Output Dubbed Video
https://github.com/Sxjdwang/TalkLip/assets/26086758/aa83e095-72be-4f45-9618-62913dddd362

Do you have any idea about what would be the reason for that?

Thx in advance!
