thetempaccount / co-speech-motion-generation
Freeform Body Motion Generation from Speech
Hello,
Thanks for posting this awesome work!
I was wondering: in the given 'sample_audio' folder, each wav file is paired with a .TextGrid file (generated by the MFA aligner), and the speech text is used to generate the label indicating whether there is a pose mode change in each second.
What if I want to test the model with my own wav file, which does not come with a paired text file? Did you use an ASR model to get the speech text?
Thanks,
Haozhou.
Good morning
I tried to modify the demo.py script to make it run on CPU, but it seems some other parts still need changes for CPU. May I ask how I can get it to run on CPU? Thanks.
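For reference, the usual pattern for porting a CUDA-only PyTorch script to CPU is sketched below. The commented-out lines are assumptions about how demo.py loads its model, not the repo's actual code; only the checkpoint path is taken from these issues:

```python
import torch

# Pick CPU automatically when no GPU is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# map_location lets a checkpoint that was saved on GPU deserialize onto CPU.
# (The path is the sample checkpoint mentioned in these issues; the model
# object is hypothetical.)
# state = torch.load('pose_dataset/ckpt/ckpt-99.pth', map_location=device)
# model.load_state_dict(state)

# Then replace every hard-coded .cuda() call with .to(device), on both the
# model and the input tensors:
# model = model.to(device)
# audio = audio.to(device)

print(device)
```

In practice this means searching demo.py (and the modules it imports) for `.cuda()` and `torch.load(...)` calls and routing them all through one `device` variable.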
Thanks for your great work. In the visualization part of the inference code, keypoints are connected by edges, but how do I get the keypoint name for each index?
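For reference, OpenPose's BODY_25 model (which matches the 25 body points used in these json files) assigns the following names to the indices. This is OpenPose's documented ordering, not something defined by this repo:

```python
# OpenPose BODY_25 keypoint names, indexed 0-24.
BODY_25_NAMES = [
    "Nose", "Neck",
    "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist",
    "MidHip",
    "RHip", "RKnee", "RAnkle",
    "LHip", "LKnee", "LAnkle",
    "REye", "LEye", "REar", "LEar",
    "LBigToe", "LSmallToe", "LHeel",
    "RBigToe", "RSmallToe", "RHeel",
]

assert len(BODY_25_NAMES) == 25
print(BODY_25_NAMES[4])  # → RWrist
```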
How can the original Speech2Gesture dataset be applied to your model? Could you provide the code for the intermediate data-processing steps?
In the "Visualise the generated motions" part:
bash visualse.sh => bash visualise.sh
Question 1:
We printed the sizes of the keypoint arrays for the different body parts from the generated files:
pose_keypoints_2d: 75 = 25*3
face_keypoints_2d: 204 = 68*3
hand_right_keypoints_2d: 63 = 21*3
hand_left_keypoints_2d: 63 = 21*3
The pose and hand counts match OpenPose, but the face count does not. In the json files generated by this project the face has 68 keypoints, each of the two hands has 21, and the body has 25; however, the keypoint files produced by the open-source OpenPose project contain 70 face points. How did you adjust for this difference?
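One plausible explanation (an assumption on my part, not confirmed by the authors): OpenPose's face model outputs 70 points, namely the 68 standard facial landmarks plus 2 pupil points at indices 68 and 69, so dropping the last two points yields the 68-point layout seen in these json files:

```python
import numpy as np

# Simulated OpenPose face output: 70 points, each (x, y, confidence).
face_70 = np.zeros((70, 3))

# Drop the two pupil points (indices 68, 69) to keep the 68 standard landmarks.
face_68 = face_70[:68]

print(face_68.size)  # → 204, i.e. 68 * 3, matching face_keypoints_2d
```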
Question 2:
The visualized person's body has only 12 points (I counted), far fewer than 25, yet the keypoint file still has size 25*3. How should I understand this? Does it mean the extra entries in the keypoint data are invalid? If so, when training a vid2vid model to render a real person, wouldn't the point mismatch distort the generated person?
That is exactly what happened in my attempt below; how did you solve it? (The two images are the corresponding OpenPose output and the generated animation: in the OpenPose image the lower-body points have jumped upward and the hands are joined together.)
I would really appreciate your help, thank you very much!
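One possible reading (my assumption, not the authors' statement): only an upper-body subset of BODY_25 is actually generated, and the unused slots are zero-filled so the json keeps OpenPose's 25*3 layout. A 12-point upper-body subset of BODY_25 would look like this:

```python
import numpy as np

# Hypothetical 12-point upper-body subset of BODY_25:
# nose, neck, right arm, left arm, eyes, ears.
UPPER_BODY = [0, 1, 2, 3, 4, 5, 6, 7, 15, 16, 17, 18]

pose = np.zeros((25, 3))    # full BODY_25 layout: (x, y, confidence) per point
pose[UPPER_BODY, 2] = 1.0   # only the generated points get confidence > 0

# Zero-confidence lower-body points should be filtered out before rendering
# (or vid2vid retrained on the same subset) to avoid distorted limbs.
print(int((pose[:, 2] > 0).sum()))  # → 12
```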
Hello, were the TextGrid files paired with each audio clip annotated manually by you? Could you also share the Speech2Gesture dataset you tested on?
The Speech2Gesture dataset initially contains only wav files; how do you generate the corresponding TextGrid files?
Hello, could you provide the pretrained model you trained on the Speech2Gesture dataset?
I saw that the data obtained with this download method only contains a few identities, so I suspect it is an incomplete TED Gesture dataset. If it is not complete, could you share the complete dataset, or the data-processing pipeline and its reference? Looking forward to your reply.
Hello, how was the TextGrid file for each audio clip obtained? Could you provide the code or the method?
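For what it's worth, the Montreal Forced Aligner's standard workflow produces one TextGrid per wav from wav + transcript pairs. The corpus path below is illustrative, and each wav needs a matching transcript file; this is MFA's documented usage, not the authors' confirmed pipeline:

```shell
# Download a pretrained English acoustic model and dictionary (MFA 2.x).
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa

# corpus/ must contain clip.wav together with clip.lab (its transcript);
# one clip.TextGrid is written to aligned/ per wav.
mfa align corpus/ english_us_arpa english_us_arpa aligned/
```

For audio without transcripts, an ASR model would have to produce the transcript first, which matches the ASR question raised earlier in these issues.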
Thank you very much.
Hello, I ran the infer.sh file and got the following message; it seems to expect an "exp_name". May I ask what I should fill in? Thanks:
usage: infer.py [-h] [--gpu GPU] [--save_dir SAVE_DIR] --exp_name EXP_NAME
--speakers SPEAKERS [SPEAKERS ...] [--seed SEED]
[--use_template] [--template_length TEMPLATE_LENGTH] [--infer]
[--model_path MODEL_PATH] [--same_initial]
[--initial_pose_file INITIAL_POSE_FILE]
[--audio_file AUDIO_FILE] [--textgrid_file TEXTGRID_FILE]
[--resume] [--pretrained_pth PRETRAINED_PTH]
[--style_layer_norm] [--config_file CONFIG_FILE]
infer.py: error: argument --exp_name: expected one argument
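Judging from the usage string, an invocation like the following should satisfy the parser. The argument values are only illustrative (taken from the sample paths that appear elsewhere in these issues), and --exp_name itself appears to be a free-form run label:

```shell
python scripts/infer.py \
    --exp_name test \
    --speakers Enric_Sala \
    --config_file pose_dataset/ckpt/freeMo.json \
    --model_path pose_dataset/ckpt/ckpt-99.pth \
    --infer
```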
My questions are the following:
1. How are the keypoints for a new speaker generated?
2. I want to generate speech gestures for a new person. From that person's existing gesture videos I need to obtain the lab, wav, and TextGrid files (via MFA phoneme alignment); do I also need to run OpenPose to extract keypoints?
3. Is the pretrained model you provide generic? (I mean: it does not require specifying a particular speaker in order to generate that speaker's motion style.)
If you could share a contact for discussing these questions, I would be very grateful! I have spent half a month on this project and there are still many parts I do not understand.
Do you think there is a way to convert the keypoints to 3d?
I recorded a half-body video of myself, extracted frames, and used the OpenPose library to generate json files for the images, which I used as train_openpose and train_img in the vid2vid training set. I then extracted the gesture keypoints generated by this project's code, wrote them out as OpenPose-format json files to use as test_openpose in the vid2vid test set, and reused the same number of train_img images as test_img. I trained on a single GPU from 256 to 512 to 1024 resolution, but the results have been consistently very poor.
I would like to ask: what training parameters did you use? Did you use DensePose during training? If so, how did you generate DensePose inputs at test time? Finally, how did you prepare the test_img data? Did you also just substitute train_img?
My training parameters are attached below. I hope to hear back from you, and I sincerely appreciate your guidance!
———— Round 1: 256-resolution training parameters ————
python train.py --name my_new_pose_256_g1 --dataroot datasets/my_pose \
  --dataset_mode pose --input_nc 3 --ngf 64 --num_D 2 \
  --loadSize 384 --fineSize 256 --resize_or_crop randomScaleHeight_and_scaledCrop \
  --max_frames_per_gpu 4 --n_frames_total 12 --max_t_step 4 \
  --niter 5 --niter_decay 5 --no_first_img --openpose_only \
  --checkpoints_dir ./checkpoint --add_face_disc
———— Round 2: 512-resolution training parameters ————
python train.py --name my_new_pose_256_512_g1 --dataroot datasets/my_pose \
  --dataset_mode pose --input_nc 3 --ngf 64 --num_D 3 --n_scales_spatial 2 \
  --resize_or_crop randomScaleHeight_and_scaledCrop --loadSize 768 --fineSize 512 \
  --no_first_img --n_frames_total 12 --max_frames_per_gpu 2 --max_t_step 4 \
  --niter_fix_global 3 --niter 5 --niter_decay 5 \
  --lr 0.0001 --openpose_only --add_face_disc --checkpoints_dir ./checkpoint \
  --load_pretrain checkpoint/my_new_pose_256_g1
———— Round 3: 1024-resolution training parameters ————
python train.py --name my_new_pose_512_1024_g1 --dataroot datasets/my_pose \
  --dataset_mode pose --input_nc 3 --ngf 64 --ndf 32 --n_scales_spatial 3 --num_D 4 \
  --resize_or_crop randomScaleHeight_and_scaledCrop --loadSize 1536 --fineSize 1024 \
  --no_first_img --n_frames_total 12 --max_t_step 4 --add_face_disc \
  --niter_fix_global 3 --niter 5 --niter_decay 5 --lr 0.00005 \
  --openpose_only --checkpoints_dir ./checkpoint \
  --load_pretrain checkpoint/my_new_pose_256_512_g1
———— Test parameters ————
python test.py --name my_new_pose_256_g1 --dataroot datasets/my_pose \
  --dataset_mode pose --input_nc 3 --ngf 64 --resize_or_crop scaleHeight \
  --loadSize 256 --no_first_img --openpose_only --remove_face_labels \
  --checkpoints_dir ./checkpoint --add_face_disc
python test.py --name my_new_pose_256_512_g1 --dataroot datasets/my_pose \
  --dataset_mode pose --input_nc 3 --n_scales_spatial 2 --ngf 64 \
  --resize_or_crop scaleHeight --loadSize 512 --no_first_img \
  --openpose_only --remove_face_labels --checkpoints_dir ./checkpoint --add_face_disc
python test.py --name my_new_pose_512_1024_g1 --dataroot datasets/my_pose \
  --dataset_mode pose --input_nc 3 --n_scales_spatial 3 --ngf 64 \
  --resize_or_crop scaleHeight --loadSize 1024 --no_first_img \
  --openpose_only --remove_face_labels --checkpoints_dir ./checkpoint --add_face_disc
This output appears during the second and third training rounds.
The final result looks like this (very poor).
I want to train on Chinese speech, but I don't know how to convert the speech videos into the format used for training. Could you publish the processing code for raw videos?
Another question that confuses me is how to generate the middle figure in the picture below. Following the README, I have completed generation of the figure on the left side of the picture.
It would also help if you could provide a comparison method for our discussion.
Thanks so much.
Hi, author!
I noticed that the generated json data is not in the same format as OpenPose's. Is there an interface for drawing the generated json data as a continuous sequence that I simply haven't found? If convenient, could you tell me where it is?
Hello! How long does generation take? For example, for a 10-second audio clip, how long does it take to produce a result? What I really want to know is: how is the real-time performance?
After configuring everything per the documentation up to "Visualise the generated motions", I get the following output:
(csmg) xht@xht-Z590-GAMING-X:~/SourceCode/Co-Speech-Motion-Generation/src$ bash visualise.sh
making video
Traceback (most recent call last):
File "visualise/visualise_generation_res.py", line 36, in <module>
from visualise.draw_utils import *
File "/home/xht/SourceCode/Co-Speech-Motion-Generation/src/visualise/draw_utils.py", line 5, in <module>
os.environ['OMP_NUM_THREADS']=1 #TODO: test
File "/home/xht/.conda/envs/csmg/lib/python3.7/os.py", line 686, in __setitem__
value = self.encodevalue(value)
File "/home/xht/.conda/envs/csmg/lib/python3.7/os.py", line 756, in encode
raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
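The failing line in draw_utils.py is a genuine bug: os.environ only accepts string values, so assigning the integer 1 raises exactly this TypeError. Quoting the value fixes it:

```python
import os

# os.environ values must be str; the int 1 triggers
# "TypeError: str expected, not int".
os.environ['OMP_NUM_THREADS'] = '1'   # was: os.environ['OMP_NUM_THREADS'] = 1

print(os.environ['OMP_NUM_THREADS'])  # → 1
```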
I have already changed data_root both in the json files under /config and in the /pose_dataset/ckpt/freeMo.json file to local paths. After "Visualise the generated motions" failed, I tried running "Generate motions for a speaker in test_audios" directly, and it reports that the corresponding freeMo.json file cannot be found:
(csmg) xht@xht-Z590-GAMING-X:~/SourceCode/Co-Speech-Motion-Generation/src$ bash infer.sh \
pose_dataset/ckpt/ckpt-99.pth \
pose_dataset/ckpt/freeMo.json \
test \
Enric_Sala
/home/xht/SourceCode/Co-Speech-Motion-Generation/src/nets/graph_definition.py:35: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
[0, 9, 11]
Traceback (most recent call last):
File "scripts/infer.py", line 150, in <module>
main()
File "scripts/infer.py", line 140, in main
config = load_JsonConfig(args.config_file)
File "/home/xht/SourceCode/Co-Speech-Motion-Generation/src/trainer/config.py", line 15, in load_JsonConfig
with open(json_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'pose_dataset/ckpt/freeMo.json'
As the title says: how is the body motion diversity score computed? I didn't find any code for it in your repository, nor details in the paper. Thanks.
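The repo does not appear to ship this metric. A common definition in the gesture-generation literature (a sketch of one such definition, not necessarily this paper's exact formula) is the average pairwise L2 distance between N motions sampled for the same audio clip:

```python
import numpy as np

def diversity_score(samples):
    """Average pairwise L2 distance between generated motions.

    samples: array of shape (N, T, D) -- N sampled motions for one audio
    clip, T frames, D pose dimensions. Higher means more diverse.
    """
    n = samples.shape[0]
    flat = samples.reshape(n, -1)           # one vector per sampled motion
    dists = [np.linalg.norm(flat[i] - flat[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

# Identical samples have zero diversity; distinct samples score > 0.
same = np.zeros((4, 10, 6))
print(diversity_score(same))  # → 0.0
```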
I have been trying to create my own dataset for training, and I saw that stats are used in the automatic labeling of the training data for each speaker, but I could not find how they were computed. How are the checker_stats in checker.py calculated?