
llyx97 / fetv


[NeurIPS 2023 Datasets and Benchmarks] "FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation", Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou


fetv's Introduction

FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

¹Peking University  ²Huawei Noah’s Ark Lab

News 🚀

[2023-12] Released the evaluation code in FETV-EVAL.

[2023-11] Updated with more detailed information about the FETV data and evaluation results.

Overview

FETV Benchmark

FETV consists of a diverse set of text prompts, categorized along three orthogonal aspects: major content, attribute control, and prompt complexity. This enables fine-grained evaluation of T2V generation models.

Data Instances

All FETV data are available in the file fetv_data.json. Each line is one data instance, formatted as follows:

{
  "video_id": "1006807024",
  "prompt": "A mountain stream",
  "major content": {
    "spatial": ["scenery & natural objects"],
    "temporal": ["fluid motions"]
  },
  "attribute control": {
    "spatial": null,
    "temporal": null
  },
  "prompt complexity": ["simple"],
  "source": "WebVid",
  "video_url": "https://ak.picdn.net/shutterstock/videos/1006807024/preview/stock-footage-a-mountain-stream.mp4",
  "unusual type": null
}
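Because each line of fetv_data.json holds one instance, the file can be read as JSON Lines. The sketch below is a minimal loader under that assumption (UTF-8 encoding, one JSON object per line); `parse_fetv_lines` and `load_fetv` are hypothetical helper names, not functions from the FETV code.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterable


def parse_fetv_lines(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse JSON Lines text: one FETV data instance (a JSON object) per line.

    Hypothetical helper; assumes the one-object-per-line layout described above.
    """
    instances = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing newline at end of file
            continue
        try:
            instances.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_no}: invalid JSON ({err.msg})") from err
    return instances


def load_fetv(path: str | Path = "fetv_data.json") -> list[dict[str, Any]]:
    """Load all FETV instances from disk (assumes UTF-8 encoding)."""
    with open(path, encoding="utf-8") as f:
        return parse_fetv_lines(f)
```

JSON `null` values such as "attribute control" → "spatial" become Python `None`. If the local file matches the statistics reported below, `len(load_fetv())` should equal the number of prompts.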

Figures: Temporal Major Contents; Temporal Attributes to Control; Spatial Major Contents; Spatial Attributes to Control

Data Fields

  • "video_id": The identifier of the video in the original dataset that the prompt comes from.
  • "prompt": The text prompt for text-to-video generation.
  • "major content": The major content described by the prompt, given separately for the "spatial" and "temporal" dimensions.
  • "attribute control": The attribute(s) that the prompt aims to control, given separately for the "spatial" and "temporal" dimensions (null if none).
  • "prompt complexity": The complexity level of the prompt.
  • "source": The original dataset the prompt comes from: "WebVid", "MSRVTT" or "ours".
  • "video_url": The URL of the reference video.
  • "unusual type": The type of unusual combination the prompt involves. Only available for data instances with "source": "ours".
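Since "major content" and "attribute control" are nested by spatial/temporal dimension while "prompt complexity" is a flat list, selecting a fine-grained subset needs a small amount of null handling. A minimal sketch, assuming the field layout of the example instance above; `get_labels` and `select_by_label` are hypothetical names.

```python
from __future__ import annotations

from typing import Any


def get_labels(instance: dict[str, Any], aspect: str, dim: str | None = None) -> list[str]:
    """Return the category labels of one FETV instance (hypothetical helper).

    aspect: "major content", "attribute control" or "prompt complexity".
    dim: "spatial" or "temporal" for the first two aspects; None for "prompt complexity".
    A null (None) value yields an empty list.
    """
    value = instance.get(aspect)
    if dim is not None:
        # the nested aspects map "spatial"/"temporal" to a list of labels or null
        value = value.get(dim) if value else None
    return list(value) if value else []


def select_by_label(instances, aspect, label, dim=None):
    """Keep the instances whose labels for (aspect, dim) include `label`."""
    return [inst for inst in instances if label in get_labels(inst, aspect, dim)]
```

For example, `select_by_label(data, "major content", "fluid motions", "temporal")` would return the prompts whose temporal major content includes fluid motions.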

Dataset Statistics

FETV contains 619 text prompts. The per-category counts do not sum to 619 because a data instance can belong to multiple categories.
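The reason the counts exceed the number of prompts is that labels are multi-valued: an instance with two spatial major-content labels is counted under both. A minimal counting sketch, assuming the JSON layout shown above; the label names in the usage below ("people", "animals") are illustrative only, and `category_distribution` is a hypothetical name.

```python
from collections import Counter


def category_distribution(instances, aspect, dim=None):
    """Count how many instances carry each label for one aspect (hypothetical helper).

    An instance with several labels contributes once to each of them,
    so the total over labels can exceed len(instances).
    """
    counts = Counter()
    for inst in instances:
        value = inst.get(aspect)
        if dim is not None:
            value = value.get(dim) if value else None
        # set() so a label repeated within one instance is counted only once
        counts.update(set(value or []))
    return counts


data = [
    {"major content": {"spatial": ["people", "animals"], "temporal": None}},
    {"major content": {"spatial": ["people"], "temporal": ["fluid motions"]}},
]
spatial = category_distribution(data, "major content", "spatial")
```

Here two instances yield three spatial labels in total, which mirrors why the category totals in FETV do not add up to 619.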

Manual Evaluation of Text-to-video Generation Models

We evaluate four T2V models: CogVideo, Text2Video-zero, ModelScopeT2V and ZeroScope. The generated videos and the ground-truth videos are manually evaluated from four perspectives: static quality, temporal quality, overall alignment and fine-grained alignment. Examples of generated videos and manual ratings can be found here.
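Summarizing manual ratings usually means averaging scores per model and per perspective. The sketch below assumes a hypothetical record format of `(model, perspective, score)` tuples with numeric scores; the actual FETV rating files and rating scale are not described in this section.

```python
from collections import defaultdict
from statistics import mean

# The four evaluation perspectives named above.
PERSPECTIVES = ("static quality", "temporal quality", "overall alignment", "fine-grained alignment")


def mean_ratings(ratings):
    """Average manual scores per (model, perspective).

    `ratings` is an iterable of (model, perspective, score) tuples; this
    record format is an assumption for illustration, not FETV's file format.
    """
    buckets = defaultdict(list)
    for model, perspective, score in ratings:
        if perspective not in PERSPECTIVES:
            raise ValueError(f"unknown perspective: {perspective!r}")
        buckets[(model, perspective)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Keeping the (model, perspective) pair as the key makes it easy to pivot the result into a model-by-perspective table.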

Results of static and temporal video quality

Results of video-text alignment

Diagnosis of Automatic Text-to-video Generation Metrics

We develop automatic metrics for video quality and video-text alignment based on the UMT model, which exhibit higher correlation with human judgments than existing metrics.
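Correlation with humans is measured by pairing each video's metric score (e.g. UMTScore, CLIPScore or BLIPScore) with its human rating. This section does not say which coefficient is used; as an assumption, the sketch below computes Kendall's tau-b, which adjusts for ties (common with discrete human ratings). It is a plain O(n²) illustration; in practice `scipy.stats.kendalltau` (tau-b by default) is the usual choice.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between metric scores x and human ratings y (O(n^2) sketch)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both: excluded from numerator and both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    # pairs untied in x = C + D + ties in y only; pairs untied in y = C + D + ties in x only
    denom = math.sqrt((concordant + discordant + ties_y_only) * (concordant + discordant + ties_x_only))
    if denom == 0:
        return float("nan")  # one sequence is constant, so tau-b is undefined
    return (concordant - discordant) / denom
```

A higher tau-b for one metric than another on the same videos indicates that its ordering of videos agrees more often with the human ordering.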

Figure: Video-text alignment evaluation correlation with human ratings

Figure: Video-text alignment ranking correlation with human ratings

PS: The video-text correlation results above differ slightly from the previous version because we fixed some bugs in calculating BLIPScore and CLIPScore. The advantage of UMTScore is more pronounced in the updated results.

Figures: Video-text alignment ranking example; Video quality ranking correlation with human ratings
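Ranking correlation compares how a metric orders the T2V models with how human ratings order them. As an assumption (the exact procedure is not given here), the sketch ranks models by their average scores and computes Spearman's rho, i.e. the Pearson correlation of the ranks, with tied values receiving averaged ranks. `average_ranks` and `spearman_rho` are hypothetical names.

```python
import math


def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 0-based positions i..j correspond to ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation between the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # all values tied: correlation undefined
    return cov / (sx * sy)
```

To use it, list the four models in a fixed order and pass their average metric scores as `x` and their average human ratings as `y`; a value near 1 means the metric ranks the models as humans do.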

Todo

  • Upload evaluation code.

License

This dataset is released under the CC BY 4.0 license.

Citation

@article{liu2023fetv,
  title   = {FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation},
  author  = {Yuanxin Liu and Lei Li and Shuhuai Ren and Rundong Gao and Shicheng Li and Sishuo Chen and Xu Sun and Lu Hou},
  year    = {2023},
  journal = {arXiv preprint arXiv:2311.01813}
}


fetv's Issues

Which UMT model is used in UMTScore?

Hi, thanks for your great work. I would like to confirm which UMT model (fine-tuning stage) you used to calculate UMTScore: the video-text retrieval model or the VQA model?

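Classification and Selection of Prompts

Hello, I am very interested in how prompts are classified and selected in your work.
First, how are the WordNet synsets used? I don't quite understand. For example, 'animal.n.01' contains ['animal', 'animate_being', 'beast', 'brute', 'creature', 'fauna']. For a given prompt, do we iterate over every word in it and check whether it matches an entry in this list?
Second, I noticed that each attribute has relatively few words. Can they really classify prompts effectively? For example, under the color attribute, the Key Phrases/Words list contains only 'white'.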
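The matching procedure the question describes can be sketched as follows. This illustrates the question's own proposal and is not confirmed to be how FETV classifies prompts. Multi-word lemma names such as 'animate_being' use underscores for spaces, so each lemma is matched as a contiguous run of tokens; `tokenize` and `matches_synset` are hypothetical names.

```python
import re

# Lemma names of the WordNet synset 'animal.n.01', as quoted in the question.
ANIMAL_LEMMAS = ["animal", "animate_being", "beast", "brute", "creature", "fauna"]


def tokenize(text):
    """Lowercase and split a prompt into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matches_synset(prompt, lemma_names):
    """True if any lemma (underscores = word boundaries) occurs as a contiguous token run."""
    tokens = tokenize(prompt)
    for lemma in lemma_names:
        parts = lemma.lower().split("_")
        n = len(parts)
        if any(tokens[i:i + n] == parts for i in range(len(tokens) - n + 1)):
            return True
    return False
```

Exact token matching misses inflected forms: "Two animals play" does not match "animal". Handling plurals would need a lemmatization step (for example, NLTK's `WordNetLemmatizer`) before comparison.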
