
SATO: Stable Text-to-Motion Framework

Home Page: https://sato-team.github.io/Stable-Text-to-Motion-Framework/

License: Apache License 2.0


Wenshuo Chen*, Hongru Xiao*, Erhang Zhang*, Lijie Hu, Lei Wang, Mengyuan Liu, Chen Chen


Existing Challenges

A fundamental challenge inherent in text-to-motion tasks stems from the variability of textual inputs. Even when conveying similar or identical meanings and intentions, texts can exhibit considerable variations in vocabulary and structure due to individual user preferences or linguistic nuances. Despite the considerable advancements made in text-to-motion models, we find a notable weakness: all of them demonstrate instability in prediction when encountering minor textual perturbations, such as synonym substitutions. In the following demonstration, we showcase the instability of predictions generated by prior methods when presented with different user inputs conveying identical semantic meaning.

Original text: A man kicks something or someone with his left leg.

| T2M-GPT | MDM | MoMask |
| :---: | :---: | :---: |
| (gif) | (gif) | (gif) |

Perturbed text: A human boots something or someone with his left leg.

| T2M-GPT | MDM | MoMask |
| :---: | :---: | :---: |
| (gif) | (gif) | (gif) |
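For illustration, the kind of perturbation shown above amounts to word-level synonym substitution. The snippet below is a hypothetical toy example (the dictionary and `perturb` function are ours, not the paper's perturbation pipeline):

```python
# Toy illustration of synonym-substitution perturbation (not the paper's
# actual pipeline): swap individual words for synonyms, keeping the meaning.
synonyms = {"man": "human", "kicks": "boots"}  # hypothetical toy dictionary

def perturb(text: str) -> str:
    """Replace each word with its listed synonym, if any."""
    return " ".join(synonyms.get(word, word) for word in text.lower().split())

print(perturb("A man kicks something or someone with his left leg."))
# -> "a human boots something or someone with his left leg."
```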

Motivation

(Motivation figure)

The model's inconsistent outputs are accompanied by unstable attention patterns. We further elucidate the experimental findings above: when perturbed text is input, the model exhibits unstable attention, often neglecting critical text elements necessary for accurate motion prediction. This instability makes it harder to encode the text into consistent embeddings, leading to a cascade of temporal motion generation errors.

Our Approach

(Approach overview figure)

Attention Stability. For the original text input, we can observe the model's attention vector over the text. This vector reflects the model's attention ranking, indicating how important each word is to the text encoder's prediction. A stable attention vector should maintain a consistent ranking even after perturbation.

Prediction Robustness. Even with stable attention, we still cannot guarantee stable results, because a perturbation can change the text embedding even when the attention vectors stay similar. This requires us to impose a further restriction on the model's predictions: in the face of perturbations, the model's prediction should remain consistent with the original distribution, i.e., the output should be robust to perturbations.

Balancing the Accuracy-Robustness Trade-off. Accuracy and robustness are naturally in tension. Our objective is to bolster stability while minimizing the decline in model accuracy, thereby mitigating catastrophic errors arising from input perturbations. Consequently, we need a mechanism that upholds the model's performance on the original input. A sketch of how these three ingredients might fit together follows.
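To make the three ideas concrete, here is a minimal PyTorch sketch of one way such an objective could be assembled. This is our own illustrative formulation, not the paper's exact loss: attention stability is approximated by a KL term between the original and perturbed attention distributions (a soft proxy for consistent ranking), prediction robustness by a KL term between the two output distributions, and the accuracy anchor is the ordinary task loss on the original input.

```python
import torch.nn.functional as F

def stability_objective(attn_orig, attn_pert, pred_orig, pred_pert,
                        task_loss, w_attn=1.0, w_pred=1.0):
    """Illustrative SATO-style objective (hypothetical, not the paper's loss).

    attn_*: attention weights over text tokens (probabilities summing to 1).
    pred_*: model output logits for the original / perturbed text.
    task_loss: ordinary training loss computed on the original input.
    """
    # 1) Attention stability: pull the perturbed attention distribution
    #    toward the original one (soft proxy for a consistent ranking).
    attn_term = F.kl_div(attn_pert.log(), attn_orig, reduction="batchmean")
    # 2) Prediction robustness: keep the perturbed prediction close to the
    #    original output distribution.
    pred_term = F.kl_div(F.log_softmax(pred_pert, dim=-1),
                         F.softmax(pred_orig, dim=-1),
                         reduction="batchmean")
    # 3) Accuracy anchor: preserve performance on the unperturbed input.
    return task_loss + w_attn * attn_term + w_pred * pred_term
```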

Quantitative evaluation on the HumanML3D and KIT-ML datasets

(Evaluation results table)

Visualization

Original text: person is walking normally in a circle.

| T2M-GPT | MDM | MoMask | SATO |
| :---: | :---: | :---: | :---: |
| (gif) | (gif) | (gif) | (gif) |

Perturbed text: human is walking usually in a loop.

| T2M-GPT | MDM | MoMask | SATO |
| :---: | :---: | :---: | :---: |
| (gif) | (gif) | (gif) | (gif) |

Explanation: T2M-GPT, MDM, and MoMask all fail to walk in a loop.

Original text: a person uses his right arm to help himself to stand up.

| T2M-GPT | MDM | MoMask | SATO |
| :---: | :---: | :---: | :---: |
| (gif) | (gif) | (gif) | (gif) |

Perturbed text: A human utilizes his right arm to help himself to stand up.

| T2M-GPT | MDM | MoMask | SATO |
| :---: | :---: | :---: | :---: |
| (gif) | (gif) | (gif) | (gif) |

Explanation: T2M-GPT, MDM, and MoMask all lack the action of transitioning from squatting to standing up, resulting in a catastrophic error.

How to Use the Code

Setup and Installation

Clone the repository:

git clone https://github.com/sato-team/Stable-Text-to-motion-Framework.git

Create a fresh conda environment and install all the dependencies:

conda env create -f environment.yml
conda activate SATO

The code was tested on Python 3.8 and PyTorch 1.8.1.

Dependencies

Download the feature extractor and GloVe embeddings used for evaluation:

bash dataset/prepare/download_extractor.sh
bash dataset/prepare/download_glove.sh

Quick Start

A quick reference guide for using our code is provided in quickstart.ipynb.

Datasets

We use two 3D human motion-language datasets: HumanML3D and KIT-ML. For both datasets, you can find the details as well as the download links on their respective project pages. We perturbed the input texts of these two datasets; you can access the perturbed text dataset through the following link. Taking HumanML3D as an example, the dataset structure should look like this:

./dataset/HumanML3D/
├── new_joint_vecs/
├── texts/ # Replace the 'texts' folder from the original dataset with the 'texts' folder from our perturbed dataset (see the helper sketch below).
├── Mean.npy 
├── Std.npy 
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
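As noted in the tree above, the original `texts/` folder must be replaced with the perturbed one. Below is a small hypothetical helper for that swap; the paths are assumptions, so adjust them to wherever you unpacked the perturbed texts:

```python
# Hypothetical helper: swap the original HumanML3D 'texts' folder for the
# perturbed one. Both paths below are assumptions, not fixed by the repo.
import shutil
from pathlib import Path

dataset_dir = Path("./dataset/HumanML3D")
perturbed_texts = Path("./perturbed/HumanML3D/texts")  # hypothetical download path

original_texts = dataset_dir / "texts"
if original_texts.exists():
    # Keep the original annotations around as a backup.
    shutil.move(str(original_texts), str(dataset_dir / "texts_original"))
shutil.copytree(perturbed_texts, original_texts)
```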

Train

We will release the training code soon.

Evaluation

You can download the pretrained models from this link.

python eval_t2m.py --resume-pth pretrained/vq_best.pth --resume-trans pretrained/net_best_fid.pth --clip_path pretrained/clip_best.pth

Acknowledgements

We appreciate help from:

Citing

If you find this code useful for your research, please consider citing the following paper:

@misc{chen2024sato,
      title={SATO: Stable Text-to-Motion Framework}, 
      author={Wenshuo Chen and Hongru Xiao and Erhang Zhang and Lijie Hu and Lei Wang and Mengyuan Liu and Chen Chen},
      year={2024},
      eprint={2405.01461},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


