FETV_eval

This repo describes how to reproduce the evaluation results in the FETV paper.
If you only want to evaluate a new text-to-video (T2V) generation model using the FETV benchmark and UMT-based metrics, please refer to the folder UMT.

1. Video Collection

1.1 Collect Generated Videos

We evaluate four T2V models: CogVideo, Text2Video-Zero, ModelScopeT2V and ZeroScope.

The generated videos and processed video frames are provided in this Hugging Face dataset. Download them to datas/videos.

The folder is structured as follows:

datas/videos
	├── cogvideo
	│   ├── videos
	│   └── 16frames_uniform   
	├── cogvideo_fid
	├── text2video-zero
	├── text2video-zero_fid
	├── modelscope-t2v
	├── modelscope-t2v_fid
	├── zeroscope
	└── zeroscope_fid

cogvideo/videos contains 619 videos generated from the prompts in datas/fetv_data.json. cogvideo/16frames_uniform contains uniformly sampled frames of these videos. cogvideo_fid contains 2,055 videos generated from the prompts in datas/sampled_prompts_for_fid_fvd/prompts_gen.json, which are used to compute FID and FVD.

If you want to generate the videos yourself, follow these steps:

Generate and Process Videos

  • Enter the corresponding folder, set up the environment, and download the models as instructed in the README file. Then, run bash run.sh.
  • By default, the input prompt file is the original FETV data file placed in the datas folder of every T2V model. To generate videos for computing FID and FVD, change the input file path in run.sh to datas/prompts_gen.json (for Text2Video-Zero, ModelScopeT2V and ZeroScope) or datas/prompts_gen_cn.txt (for CogVideo).
  • To process the generated videos (".mp4" or ".gif") into frames of ".jpg" files, run
python utils/video2frames.py \
--video_ext .mp4 \
--frm_num 16 \
--video_root_path /path/to/the/generated/videos \
--target_root_path /path/to/the/processed/frames \
--sampling_strategy uniform

To compute CLIPScore, BLIPScore and FID, we adopt the "uniform" frame sampling strategy. To compute FVD, we adopt the "offset" frame sampling strategy following stylegan-v. The processed frames are structured as follows:

path_of_processed_frames
	├── sent0
	│   ├── frame0.jpg
	│   ├── frame1.jpg
	│   ...
	│   └── frame15.jpg
	├── sent1
	...
	└── sent618
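
For reference, here is a minimal sketch of how the two frame-index sampling strategies might select indices; this is an assumption for illustration, and the actual logic lives in utils/video2frames.py:

    # A sketch of the "uniform" and "offset" sampling strategies
    # (an assumption, not the exact code in utils/video2frames.py).
    import numpy as np

    def sample_frame_indices(total_frames, frm_num=16, strategy="uniform"):
        """Return frm_num frame indices from a video with total_frames frames."""
        if strategy == "uniform":
            # Evenly spaced indices spanning the whole video.
            return np.linspace(0, total_frames - 1, frm_num).astype(int).tolist()
        if strategy == "offset":
            # A contiguous clip starting at a random offset, the way
            # stylegan-v-style FVD clips are commonly sampled.
            start = np.random.randint(0, max(1, total_frames - frm_num + 1))
            return list(range(start, start + frm_num))
        raise ValueError(f"unknown strategy: {strategy}")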

1.2 Collect Reference Real Videos

  • Download the real videos from MSR-VTT using the script provided by ClipBERT, then copy them to real_videos using utils/collect_real_video.py.
  • Download the real videos from WebVid to real_videos using the script utils/collect_real_video.py.
  • Cut some videos using the script utils/cut_video.py.
  • To collect the real videos for computing FID and FVD, run utils/collect_real_video.py again, changing the video_id_file from datas/fetv_data.json to datas/sampled_prompts_for_fid_fvd/prompts_real.json. The videos will be saved to datas/videos/real_videos_fid.
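
For illustration, a minimal sketch of what this collection step might look like; the actual logic is in utils/collect_real_video.py, and the "video_id" field name and source directory below are assumptions:

    # Hypothetical sketch of collecting reference videos by ID.
    # The "video_id" field and source directory are assumptions.
    import json, shutil
    from pathlib import Path

    src_root = Path("/path/to/downloaded/real/videos")  # placeholder
    dst_root = Path("datas/videos/real_videos_fid")
    dst_root.mkdir(parents=True, exist_ok=True)

    with open("datas/sampled_prompts_for_fid_fvd/prompts_real.json") as f:
        entries = json.load(f)

    for entry in entries:
        src = src_root / f"{entry['video_id']}.mp4"  # assumed field name
        if src.exists():
            shutil.copy(src, dst_root / src.name)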

NOTE: You can also contact [email protected] to access the reference real videos.

2. Manual Evaluation

The manual evaluation results from three human evaluators can be found in manual_eval_results. By running python utils/visualization_manual_results.py, these results can be visualized in the form of radar plots:

(Radar plots: results of static and temporal video quality, and results of video-text alignment.)

We also release our manual evaluation instruction with carefully designed rating-level definitions and examples. We hope this helps improve inter-annotator agreement when evaluating T2V generation models.

3. Automatic Evaluation

3.1 Video-Text Alignment

3.1.1 CLIPScore and BLIPScore

Run the following command to compute CLIPScore and BLIPScore:

  python auto_eval.py \
  	--eval_model ViT-B/32 \
  	--blip_config BLIP/blip_config.yaml \
  	--prompt_file datas/fetv_data.json \
  	--gen_path datas/videos/modelscope-t2v/16frames_uniform \
  	--t2v_model modelscope-t2v \
  	--is_clip_ft false \
  	--save_results true
  • --gen_path: The path to a T2V model's generated videos, processed into image frames.
  • --eval_model: The model used to compute CLIPScore. By default, the original ViT-B/32 CLIP model is used. To compute CLIPScore-ft, set --eval_model to the path of a CLIP checkpoint fine-tuned on MSR-VTT (we follow the implementation from CLIP4CLIP). Note: the checkpoint should be converted via utils/transform_ckpt.py before computing CLIPScore-ft.
  • --t2v_model: The name of the text-to-video generation model, used to name the result file.
  • --is_clip_ft: Whether to use the fine-tuned CLIP model; set to 'true' when computing CLIPScore-ft.
  • The results will be saved to auto_eval_results/CLIPScore and auto_eval_results/BLIPScore.
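
For intuition, here is a minimal sketch of frame-level CLIPScore using the original openai/CLIP package, assuming the video-level score is the average over its sampled frames (auto_eval.py may aggregate differently):

    # Minimal frame-level CLIPScore sketch (average-over-frames
    # aggregation is an assumption, not necessarily what auto_eval.py does).
    from pathlib import Path
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def clip_score(prompt, frame_dir):
        frames = sorted(Path(frame_dir).glob("*.jpg"))
        images = torch.stack([preprocess(Image.open(p)) for p in frames]).to(device)
        text = clip.tokenize([prompt]).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(images)
            txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity of each frame to the prompt, averaged over frames.
        return (img_feat @ txt_feat.T).mean().item()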

3.1.2 UMTScore

Please refer to the folder UMT for how to compute the UMTScore.

3.1.3 Correlation between Automatic Metrics and Humans

To compute the correlation between automatic and human judgements of video-text alignment, run

  python auto_human_correlation.py

The correlation results between automatic metrics and human judgements will be printed to the console.
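
Conceptually, the script compares automatic scores against human ratings using rank correlations. A toy sketch of that kind of computation, with made-up numbers (the exact correlation variants used by auto_human_correlation.py may differ):

    # Toy rank-correlation sketch; scores below are made-up values, and
    # the exact correlation variants in auto_human_correlation.py may differ.
    from scipy.stats import kendalltau, spearmanr

    auto_scores = [0.31, 0.27, 0.35, 0.22]   # e.g., CLIPScore per video (toy values)
    human_scores = [3.0, 2.3, 4.1, 1.8]      # e.g., mean human alignment rating (toy values)

    tau, tau_p = kendalltau(auto_scores, human_scores)
    rho, rho_p = spearmanr(auto_scores, human_scores)
    print(f"Kendall tau={tau:.3f} (p={tau_p:.3f}), Spearman rho={rho:.3f} (p={rho_p:.3f})")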

3.2 Video Quality

3.2.1 FID

To compute FID, run

  python compute_fid.py \
    --model modelscope-t2v \
    --fg_fid
  • --model determines which T2V generation model to evaluate.
  • --fg_fid determines whether to compute fine-grained FID for different categories (e.g., people, animals) or FID over the entire FETV benchmark.

The results will be saved to auto_eval_results/fid_results and auto_eval_results/fid_fg_results, respectively.
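
For background, FID is the Fréchet distance between Gaussians fitted to image features of real and generated frames. A minimal sketch of that distance (compute_fid.py presumably builds on a standard implementation such as pytorch-fid):

    # Frechet distance between two Gaussians (mu, sigma), the core of FID.
    import numpy as np
    from scipy import linalg

    def frechet_distance(mu1, sigma1, mu2, sigma2):
        """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2))."""
        diff = mu1 - mu2
        covmean = linalg.sqrtm(sigma1 @ sigma2)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # drop tiny imaginary parts from sqrtm
        return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

FVD (Section 3.2.2) applies the same distance to features from a video network rather than per-frame image features; the stylegan-v implementation uses I3D features.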

3.2.2 FVD

To compute FVD over the entire FETV benchmark, enter the folder stylegan-v and run

    bash run_fvd_modelscope-t2v.sh

Change modelscope-t2v to evaluate different T2V generation models. The results will be saved to auto_eval_results/fvd_results.

To compute FVD of different categories, run

    python compute_fg_fvd.py \
      --model modelscope-t2v
  • --model determines which T2V generation model to evaluate.

The results will be saved to auto_eval_results/fvd_fg_results.

3.2.3 FVD-UMT

Please refer to the folder UMT for how to compute the FVD-UMT.

3.2.4 Correlation between Automatic Metrics and Humans

To visualize the automatic and human ranking of T2V models in terms of video quality, run

    python plot_fid_fvd_human.py

To visualize the fine-grained results in different categories, run

    python plot_fid_fvd_human_fg.py

To visualize the effect of video sample number on FID and FVD, run

    python plot_fid_fvd_numvideo.py


