
co-speech-motion-generation's People

Contributors

thetempaccount

co-speech-motion-generation's Issues

How are the visualised OpenPose keypoints defined?

First question:
We printed the sizes of the keypoint arrays for the different body parts from the JSON files:
pose_keypoints_2d 75 = 25*3
face_keypoints_2d 204 = 68*3
hand_right_keypoints_2d 63 = 21*3
hand_left_keypoints_2d 63 = 21*3
The pose and hand counts match OpenPose, but the face count does not. The keypoint JSON files generated by this project contain 68 face keypoints, 21 keypoints per hand, and 25 body keypoints, whereas the keypoints generated by the open-source OpenPose project have 70 face points. How did you adjust this?
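(For reference, the standard OpenPose face model outputs 70 points, where the last two are the pupils; one common way to match a 68-point layout is simply to drop those two. A minimal sketch under that assumption; the file name is illustrative and this is not taken from this repo:)

import json

# Illustrative path; any per-frame OpenPose JSON will do.
with open('frame_000000_keypoints.json') as f:
    frame = json.load(f)

person = frame['people'][0]
face = person['face_keypoints_2d']            # 70 points * (x, y, confidence) = 210 values
person['face_keypoints_2d'] = face[:68 * 3]   # assumed fix: drop the 2 pupil points, leaving 68*3 = 204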
Second question:
[attached screenshot]
In the visualisation, this person's body has only 12 points (I counted them), far fewer than 25, yet the keypoint file is still 25*3 in size. How should I interpret this? Does it mean the extra keypoint entries are invalid? If so, when training a vid2vid model to render a real person, wouldn't the mismatched points distort the person in the video?
That is exactly what happens in my attempt below; how did you solve it? (The two images below are the corresponding OpenPose rendering and the generated frame; in the OpenPose image, lower-body points jump up to the torso and the hands are joined together.)
[attached screenshots: OpenPose rendering and generated frame]
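(On the second question: each OpenPose keypoint is stored as an (x, y, confidence) triple, and joints that were not detected are typically written as (0, 0, 0), so a 25*3 array can still correspond to far fewer visible points. A hedged sketch of filtering such entries before drawing; the path and threshold are illustrative:)

import json
import numpy as np

with open('frame_000000_keypoints.json') as f:
    person = json.load(f)['people'][0]

pose = np.array(person['pose_keypoints_2d']).reshape(25, 3)  # (x, y, confidence) per joint
valid = pose[:, 2] > 0.1          # undetected joints are usually written as (0, 0, 0)
visible = pose[valid, :2]
print(f'{valid.sum()} of 25 body joints detected')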

I hope you can help; thank you very much!

TextGrid

The Speech2Gesture dataset initially only has wav files. How are the corresponding TextGrid files generated?
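(One common way to obtain TextGrid files from plain wav files, assuming you first have a transcript for each clip, is to run the Montreal Forced Aligner (MFA), which writes one TextGrid per utterance. A rough sketch, assuming MFA 2.x is installed and each wav has a same-named .lab transcript next to it; paths and model names are illustrative, not this repo's pipeline:)

import subprocess

corpus_dir = 'corpus_dir'      # contains speaker folders with paired speech_001.wav / speech_001.lab files
output_dir = 'textgrids'

# Download a pretrained English acoustic model and dictionary (adjust for other languages).
subprocess.run(['mfa', 'model', 'download', 'acoustic', 'english_us_arpa'], check=True)
subprocess.run(['mfa', 'model', 'download', 'dictionary', 'english_us_arpa'], check=True)

# Align: writes one .TextGrid per wav into output_dir.
subprocess.run(['mfa', 'align', corpus_dir, 'english_us_arpa', 'english_us_arpa', output_dir], check=True)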

How are the `checker_stats` in `checker.py` calculated?

I have been trying to create my own dataset for training. I saw that stats are used in the automatic labelling of the training data for each speaker, but I could not find how they were computed. How are the `checker_stats` in `checker.py` calculated?

Origin of the TextGrid files

Hello, were the TextGrid files for each audio clip manually annotated by you? Could you also provide the Speech2Gesture dataset you used for testing?

About generation time

Hello! How long does generation take? For example, how long does it take to produce the result for a 10-second audio clip? In other words, what is the real-time performance like?

About exp_name parameter

Hello, I ran the infer.sh file and got the following message. It seems to expect an "exp_name"; may I ask what I should fill in? Thanks:

usage: infer.py [-h] [--gpu GPU] [--save_dir SAVE_DIR] --exp_name EXP_NAME
--speakers SPEAKERS [SPEAKERS ...] [--seed SEED]
[--use_template] [--template_length TEMPLATE_LENGTH] [--infer]
[--model_path MODEL_PATH] [--same_initial]
[--initial_pose_file INITIAL_POSE_FILE]
[--audio_file AUDIO_FILE] [--textgrid_file TEXTGRID_FILE]
[--resume] [--pretrained_pth PRETRAINED_PTH]
[--style_layer_norm] [--config_file CONFIG_FILE]
infer.py: error: argument --exp_name: expected one argument

Error in readme.md

In the "Visualise the generated motions" part:

bash visualse.sh => bash visualise.sh

I have some questions about gesture visualisation with vid2vid training and would appreciate your help.

I recorded a half-body video of myself, extracted frames from it, and ran the OpenPose library to generate the corresponding per-frame JSON files, which I used as train_openpose and train_img in the vid2vid training set. I then extracted the generated gesture keypoints from this project's code and wrote them out as OpenPose-format JSON files to use as test_openpose in the vid2vid test set; for test_img I simply reused the same number of frames from train_img. I trained on a single GPU from 256 to 512 and then to 1024, but the results have always been very poor.
I would like to ask: what training parameters did you use? Did you use DensePose during training? If so, how did you generate DensePose at test time? Finally, how did you prepare the test_img data? Did you also just substitute train_img for it?
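(For anyone following the same route of writing generated keypoints as OpenPose-format JSON: the per-frame layout below follows the standard OpenPose output structure. This is a minimal illustrative sketch, not this repo's actual export code; the paths, function name, and zero-filled keypoints are placeholders.)

import json

def write_openpose_json(path, pose_25x3, face_70x3, hand_l_21x3, hand_r_21x3):
    # Each *_x3 argument is a flat list of (x, y, confidence) triples.
    frame = {
        'version': 1.3,   # assumption: matches recent OpenPose output
        'people': [{
            'person_id': [-1],
            'pose_keypoints_2d': list(pose_25x3),
            'face_keypoints_2d': list(face_70x3),
            'hand_left_keypoints_2d': list(hand_l_21x3),
            'hand_right_keypoints_2d': list(hand_r_21x3),
        }],
    }
    with open(path, 'w') as f:
        json.dump(frame, f)

# Illustrative call: zero-filled keypoints for a single frame.
write_openpose_json('test_openpose/frame_000000_keypoints.json',
                    [0.0] * 25 * 3, [0.0] * 70 * 3, [0.0] * 21 * 3, [0.0] * 21 * 3)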

My training parameters are attached below. I look forward to your reply and would be very grateful for any guidance!
———— Round 1: 256 training parameters ————
python train.py --name my_new_pose_256_g1 --dataroot datasets/my_pose
--dataset_mode pose --input_nc 3 --ngf 64 --num_D 2
--loadSize 384 --fineSize 256 --resize_or_crop randomScaleHeight_and_scaledCrop
--max_frames_per_gpu 4 --n_frames_total 12 --max_t_step 4
--niter 5 --niter_decay 5 --no_first_img --openpose_only
--checkpoints_dir ./checkpoint --add_face_disc
———— Round 2: 512 training parameters ————
python train.py --name my_new_pose_256_512_g1 --dataroot datasets/my_pose
--dataset_mode pose --input_nc 3 --ngf 64 --num_D 3 --n_scales_spatial 2
--resize_or_crop randomScaleHeight_and_scaledCrop --loadSize 768 --fineSize 512
--no_first_img --n_frames_total 12 --max_frames_per_gpu 2 --max_t_step 4
--niter_fix_global 3 --niter 5 --niter_decay 5
--lr 0.0001 --openpose_only --add_face_disc --checkpoints_dir ./checkpoint
--load_pretrain checkpoint/my_new_pose_256_g1
———— Round 3: 1024 training parameters ————
python train.py --name my_new_pose_512_1024_g1 --dataroot datasets/my_pose
--dataset_mode pose --input_nc 3 --ngf 64 --ndf 32 --n_scales_spatial 3 --num_D 4
--resize_or_crop randomScaleHeight_and_scaledCrop --loadSize 1536 --fineSize 1024
--no_first_img --n_frames_total 12 --max_t_step 4 --add_face_disc
--niter_fix_global 3 --niter 5 --niter_decay 5 --lr 0.00005
--openpose_only --checkpoints_dir ./checkpoint
--load_pretrain checkpoint/my_new_pose_256_512_g1
———— Test parameters ————
python test.py --name my_new_pose_256_g1 --dataroot datasets/my_pose
--dataset_mode pose --input_nc 3 --ngf 64 --resize_or_crop scaleHeight
--loadSize 256 --no_first_img --openpose_only --remove_face_labels
--checkpoints_dir ./checkpoint --add_face_disc


python test.py --name my_new_pose_256_512_g1 --dataroot datasets/my_pose
--dataset_mode pose --input_nc 3 --n_scales_spatial 2 --ngf 64
--resize_or_crop scaleHeight --loadSize 512 --no_first_img
--openpose_only --remove_face_labels --checkpoints_dir ./checkpoint --add_face_disc


python test.py --name my_new_pose_512_1024_g1 --dataroot datasets/my_pose
--dataset_mode pose --input_nc 3 --n_scales_spatial 3 --ngf 64
--resize_or_crop scaleHeight --loadSize 1024 --no_first_img
--openpose_only --remove_face_labels --checkpoints_dir ./checkpoint --add_face_disc
During the second and third training rounds I get output like this:
[screenshots of the training output]
The final result looks like this (very poor):
[screenshot of the final result]

About running on CPU

Good morning

I tried to modify the demo.py script to make it run on the CPU, but it seems some other parts of the code also need changes for CPU. May I ask how I can get it to run on the CPU? Thanks.
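(In case it helps: the usual PyTorch changes for CPU-only inference are to load checkpoints with map_location and replace .cuda() calls with .to(device). This is a generic sketch under that assumption, not this repo's actual code; the commented names are illustrative.)

import torch

device = torch.device('cpu')

# Load a checkpoint that was saved on GPU onto the CPU (path taken from this repo's infer example).
state_dict = torch.load('pose_dataset/ckpt/ckpt-99.pth', map_location=device)

# Wherever the code calls model.cuda() or tensor.cuda(), use .to(device) instead, e.g.:
# model = MyModel().to(device)          # MyModel is a placeholder name
# audio_feat = audio_feat.to(device)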

What about 3D?

Do you think there is a way to convert the keypoints to 3D?

Exception at "Visualise the generated motions"

Following the documentation, the "Visualise the generated motions" step produces the following output:
(csmg) xht@xht-Z590-GAMING-X:~/SourceCode/Co-Speech-Motion-Generation/src$ bash visualise.sh
making video
Traceback (most recent call last):
  File "visualise/visualise_generation_res.py", line 36, in <module>
    from visualise.draw_utils import *
  File "/home/xht/SourceCode/Co-Speech-Motion-Generation/src/visualise/draw_utils.py", line 5, in <module>
    os.environ['OMP_NUM_THREADS']=1 #TODO: test
  File "/home/xht/.conda/envs/csmg/lib/python3.7/os.py", line 686, in __setitem__
    value = self.encodevalue(value)
  File "/home/xht/.conda/envs/csmg/lib/python3.7/os.py", line 756, in encode
    raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
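(The error itself is straightforward: os.environ values must be strings, so the offending line in draw_utils.py can be changed to use a string value:)

import os

os.environ['OMP_NUM_THREADS'] = '1'   # was: = 1, which raises "str expected, not int"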

I have already changed data_root in the JSON files under /config and in /pose_dataset/ckpt/freeMo.json to local paths. After "Visualise the generated motions" failed, I tried to run "Generate motions for a speaker in test_audios" directly, and it reports that the freeMo.json file cannot be found:
(csmg) xht@xht-Z590-GAMING-X:~/SourceCode/Co-Speech-Motion-Generation/src$ bash infer.sh \
    pose_dataset/ckpt/ckpt-99.pth \
    pose_dataset/ckpt/freeMo.json \
    test \
    Enric_Sala
/home/xht/SourceCode/Co-Speech-Motion-Generation/src/nets/graph_definition.py:35: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
[0, 9, 11]
Traceback (most recent call last):
  File "scripts/infer.py", line 150, in <module>
    main()
  File "scripts/infer.py", line 140, in main
    config = load_JsonConfig(args.config_file)
  File "/home/xht/SourceCode/Co-Speech-Motion-Generation/src/trainer/config.py", line 15, in load_JsonConfig
    with open(json_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'pose_dataset/ckpt/freeMo.json'

TextGrid files

Hello, how were the TextGrid files for each audio clip obtained? Could you provide the code or method?

Applying the Speech2Gesture dataset

How can the original Speech2Gesture dataset be applied to this model? Could you provide the code for the intermediate data-processing steps?

Is the speech text necessary during inference?

Hello,

Thanks for posting this awesome work!

I noticed that in the given 'sample_audio' folder, each wav file is paired with a .TextGrid file (generated by the MFA aligner), and the speech text is used to generate the label indicating whether there is a pose-mode change in that second.

What if I want to test the model with my own wav file, which does not come with a paired text file? Did you use an ASR model to obtain the speech text?

Thanks,
Haozhou.
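(One possible route for wav files without transcripts, as an assumption rather than the authors' confirmed pipeline, is to transcribe the audio with an off-the-shelf ASR model such as OpenAI Whisper and then pass the transcript through a forced aligner such as MFA to get the TextGrid. A minimal sketch of the ASR step; the file names are illustrative:)

import whisper  # pip install openai-whisper

model = whisper.load_model('base')            # small pretrained model
result = model.transcribe('my_speech.wav')    # illustrative file name
transcript = result['text']

# The transcript can then be saved as a .lab/.txt file next to the wav
# and passed through a forced aligner (e.g. MFA) to produce the TextGrid.
with open('my_speech.lab', 'w') as f:
    f.write(transcript.strip())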

I have run into many problems that I cannot solve while reproducing this project.

The problems are as follows:
1. How are the keypoints for a new speaker generated?
2. If I want to generate speech gestures for a new person, I need to obtain lab, wav, and TextGrid files (via MFA phoneme alignment) from that person's previous gesture videos. Do I also need to run OpenPose to obtain keypoints?
3. Is the pretrained model you provide generic? (I mean: it does not need to be tied to a particular speaker in order to generate that speaker's motion style.)
I would be very grateful if you could provide a way to contact you to discuss these issues. I have been working on this project for half a month and there are still many things I do not understand.

Pretrained model

Hello, could you provide the pretrained model trained on the Speech2Gesture dataset?

Train for Chinese speech

I want to train on Chinese speech, but I don't know how to convert the speech videos into the format used for training. Could you publish the processing code for raw video?
Another question that confuses me is how to generate the figure in the middle of the picture below; following the README, I have only completed the generation of the figure on the left side.
[attached screenshot]
It would also be great if you could provide a way to contact you for discussion.
Thanks so much.
