huangjackson / v2vt Goto Github PK

View Code? Open in Web Editor NEW

18.0 4.0 2.0 62.7 MB

Video to video translation via few shot voice cloning & audio-based lip sync

License: MIT License

Python 98.75% C++ 0.17% Cuda 1.09%

v2vt's Introduction

v2vt's People

Contributors

Stargazers

Watchers

Forkers

yeayee ryuukeisyou

v2vt's Issues

Improve logging

Is your feature request related to a problem? Please describe.
Logging style and method is not universal across project (e.g. tts uses Python standard logging module, lipsync uses print).

Describe the solution you'd like
Use same style and method throughout the entire project to improve user experience.

Improve face detection via RetinaFace

Is your feature request related to a problem? Please describe.
When a face isn't detected in a frame by GPEN/face_detect, LipSync inference fails (#1).
Often happens when face's resolution is either too large in the video, or has different proportions.

Describe the solution you'd like
Improve face detection using batched RetinaFace. Train model on larger dataset to improve detection for large faces.

Additional context
GPEN face detection code
OpenTalker/video-retalking#14 (comment)

Match speed of original video during TTS

Is your feature request related to a problem? Please describe.
The translated speech should be similar in speed and timing to the original video's speech so that speech and video content is relevant.

Describe the solution you'd like
During transcription, include timestamps. Use timestamps during TTS to ensure relative similarity in speed and timing.

Describe alternatives you've considered
Group audio by cuts (made in #1) or pauses in audio (end of sentence). Speed up/slow down output audio so that each group fits in the same amount of time as the original video.

LipSync inference fails when video doesn't have face in all frames

Describe the bug
When running lipsync inference with a video that doesn't have a face in all frames (or when GPEN/face_detect is unable to detect a face in all frames), error UnboundLocalError: local variable 'mask_sharp' referenced before assignment is thrown.

To Reproduce
Steps to reproduce the behavior:

Download video without a face in all frames (e.g. https://youtu.be/AT1bO_nlxHY?si=tBtVKsmc_N3eEPvC&t=78 at 1:18) as video.mp4

Run LipSyncInference with video, using audio from video (saved as audio.wav)

from lipsync.inference import LipSyncInference
lsi = LipSyncInference('./video.mp4', './audio.wav')
lsi.run()

Expected behavior
Frames without a face are skipped (cut out during preprocessing?)

Screenshots/Logs

landmark Det:: 100%|████████████████████████████████████████████████▊| 1937/1943 [00:51<00:00, 149.12it/s]No face detected in this image
No face detected in this image
No face detected in this image
No face detected in this image
No face detected in this image
No face detected in this image
landmark Det:: 100%|██████████████████████████████████████████████████| 1943/1943 [00:51<00:00, 37.96it/s] 
[Step 2] Running 3DMM extraction: 100%|██████████████████████████████| 1943/1943 [00:13<00:00, 146.59it/s] 
Using expression center
Load checkpoint from: C:\Users\Jackson\Projects\v2vt\lipsync\checkpoints\DNet.pt
Load checkpoint from: C:\Users\Jackson\Projects\v2vt\lipsync\checkpoints\LNet.pth
Load checkpoint from: C:\Users\Jackson\Projects\v2vt\lipsync\checkpoints\ENet.pth
[Step 3] Stabilizing expression in video: 100%|███████████████████████| 1943/1943 [02:23<00:00, 13.57it/s] 
[Step 4] Loading audio - 1941 chunks
[Step 5] Enhancing reference frames:  94%|██████████████████████████▎ | 1823/1941 [04:40<00:18,  6.50it/s] 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Jackson\Projects\v2vt\lipsync\inference.py", line 345, in run
    pred, _, _ = enhancer.process(
  File "C:\Users\Jackson\Projects\v2vt\lipsync\third_part\GPEN\gpen_face_enhancer.py", line 123, in process    mask_sharp, (0, 0), sigmaX=1, sigmaY=1, borderType=cv2.BORDER_DEFAULT)
UnboundLocalError: local variable 'mask_sharp' referenced before assignment

Environment (please complete the following information):

OS: Windows 11 Pro
CPU/GPU: NVIDIA GeForce RTX 3070 Ti
Python: 3.9.19
Pytorch: 2.1.1
CUDA: 11.8

huangjackson / v2vt Goto Github PK

v2vt's Introduction

v2vt

Features

Getting Started

Prerequisites

Manual Installation

Usage

Roadmap

Contributing

License

Contact

Acknowledgments

v2vt's People

Contributors

Stargazers

Watchers

Forkers

v2vt's Issues

Recommend Projects

Recommend Topics

Recommend Org