
showlab / UniVTG


[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding

Home Page: https://arxiv.org/abs/2307.16715

License: MIT License

Languages: Python 97.32%, Shell 2.68%
Topics: highlight-detection, moment-retrieval, pretraining, video-grounding, video-language, video-summarization

UniVTG's Introduction

UniVTG (ICCV'23)


TL;DR: The first video temporal grounding pretraining model, unifying diverse temporal annotations to power moment retrieval (interval), highlight detection (curve), and video summarization (point).

[Figure: UniVTG overview]

📢 News

  • [2023.10.15] Uploaded the CLIP teacher scripts to create scalable pseudo annotations.
  • [2023.8.22] Code cleanup; added training/inference instructions and uploaded all downstream checkpoints.
  • [2023.8.6] Created the Hugging Face Space demo!
  • [2023.7.31] Released the arXiv paper, code, checkpoints, and Gradio demo.

📝 Todo

  • Connect UniVTG with LLMs, e.g., ChatGPT.
  • Upload all downstream checkpoints.
  • Upload all pretraining checkpoints.

🌟 Run on your video:

To support practical usage, we release the following checkpoints:

They can be run on a single GPU with less than 4 GB of memory and are highly efficient: temporal grounding takes less than 1 second even for a 10-minute video.

| Video Enc. | Text Enc. | Pretraining | Fine-tuning | Checkpoints |
| --- | --- | --- | --- | --- |
| CLIP-B/16 | CLIP-B/16 | 4M | - | Google Drive |
| CLIP-B/16 | CLIP-B/16 | 4M | QVHL + Charades + NLQ + TACoS + ActivityNet + DiDeMo | Google Drive |

  1. Download a checkpoint and put it in the directory results/omni.

  2. Download the example videos from here and put them under examples/.

  3. Run python3 main_gradio.py --resume ./results/omni/model_best.ckpt

Examples: [YouTube video] [Egocentric video] [Charades video]

⚙️ Preparation

Please find instructions in install.md to set up the environment and datasets.

📦 Model Zoo

Download the checkpoints listed in model.md to reproduce the benchmark results.

🚀 Training & Inference

We use Slurm for job scheduling; if you do not use a Slurm system, you may need to slightly modify the code to adapt it to your environment.

Pretraining (multi-gpu)

Large-scale pretraining: bash scripts/pretrain.sh

Multi-datasets co-training: bash scripts/cotrain.sh

Downstream (single-gpu)

Pass --resume to initialize the model with pretrained weights. Refer to our model zoo for detailed parameter settings.

Training: bash scripts/qvhl_pretrain.sh

Pass --eval_init and --n_epoch=0 to evaluate the checkpoint specified by --resume.

Inference: bash scripts/qvhl_inference.sh

CLIP teacher to create scalable pseudo labels

  1. Download the Open Images V6 class list from https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv.

  2. Convert it to JSON with python3 teacher/csv2json.py, then extract the textual class features with python3 teacher/label2feature.py.

  3. Run python3 teacher/clip2labels.py to generate pseudo labels (before this, you should already have extracted the video features).
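For intuition, here is a minimal sketch of the idea behind such a CLIP teacher; it is not the repo's actual teacher/clip2labels.py, and the function name, threshold, and clip length below are made-up placeholders. It compares each clip's visual feature against the class text features and keeps contiguous runs of clips whose similarity exceeds a threshold as pseudo intervals.

```python
import numpy as np

def clip_teacher_pseudo_labels(video_feats, class_feats, threshold=0.25, clip_len=2.0):
    """Toy sketch of CLIP-teacher pseudo labeling (not the repo's implementation).

    video_feats: (T, D) per-clip CLIP visual features.
    class_feats: (C, D) CLIP text features of the class names.
    Returns a list of (class_idx, start_sec, end_sec, score) pseudo intervals.
    """
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = class_feats / np.linalg.norm(class_feats, axis=1, keepdims=True)
    sim = v @ t.T  # (T, C) cosine similarity of every clip against every class

    pseudo = []
    for c in range(sim.shape[1]):
        above = np.append(sim[:, c] > threshold, False)  # pad so every run closes
        start = None
        for i, flag in enumerate(above):
            if flag and start is None:
                start = i                                 # a run of "present" clips begins
            elif not flag and start is not None:
                score = float(sim[start:i, c].mean())
                pseudo.append((c, start * clip_len, i * clip_len, score))
                start = None
    return pseudo
```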

🎨 Visualization

If you want to draw visualizations like those in our paper, simply run python3 plot/qvhl.py to generate the corresponding figures, providing the prediction JSONs (you can download them from the Model Zoo).
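If you prefer to build a custom plot instead, a minimal sketch along the following lines should work. It assumes the prediction files are JSONL in the usual moment_detr-style format, with pred_relevant_windows as [start, end, score] triples and pred_saliency_scores as one score per clip; treat these field names and the clip_len default as assumptions.

```python
import json
import matplotlib.pyplot as plt

def plot_prediction(jsonl_path, qid, clip_len=2.0):
    """Plot the saliency curve and the top predicted moment for one query (sketch)."""
    with open(jsonl_path) as f:
        preds = [json.loads(line) for line in f]
    pred = next(p for p in preds if p["qid"] == qid)

    saliency = pred["pred_saliency_scores"]           # one score per clip
    times = [i * clip_len for i in range(len(saliency))]
    st, ed, score = pred["pred_relevant_windows"][0]  # highest-scoring moment

    plt.figure(figsize=(8, 2.5))
    plt.plot(times, saliency, label="saliency")
    plt.axvspan(st, ed, alpha=0.3, label=f"top moment ({score:.2f})")
    plt.xlabel("time (s)")
    plt.title(pred.get("query", f"qid {qid}"))
    plt.legend()
    plt.tight_layout()
    plt.show()
```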


🎓 Citation

If you find our work helpful, please cite our paper.

@misc{lin2023univtg,
      title={UniVTG: Towards Unified Video-Language Temporal Grounding}, 
      author={Kevin Qinghong Lin and Pengchuan Zhang and Joya Chen and Shraman Pramanick and Difei Gao and Alex Jinpeng Wang and Rui Yan and Mike Zheng Shou},
      year={2023},
      eprint={2307.16715},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

✉️ Contact

This repo is maintained by Kevin. Questions and discussions are welcome via [email protected] or by opening an issue.

😊 Acknowledgement

This codebase is built on moment_detr, HERO_Video_Feature_Extractor, and UMT.

We thank the authors for their open-source contributions.

UniVTG's People

Contributors

eltociear, melroy89, noahschiro, qinghonglin


UniVTG's Issues

Training Detail for Fine-tuning?

Thanks for your work. Some training details are not described very clearly in the README, so I would like to ask: what are the downstream tasks, training parameters, and corresponding training methods?

Problem about the demo

First question: when I run the code through my own interface and click Step 2, it shows an error.

Second question: when I run the code through my own interface using the video from your demo, the result is different from yours.

AttributeError: 'Textbox' object has no attribute 'style'

```
INFO:main.config - Loaded model saved at epoch 14 from checkpoint: ./results/model_best.ckpt
Traceback (most recent call last):
  File "/content/UniVTG/main_gradio.py", line 212, in <module>
    input_message = gr.Textbox(show_label=False, placeholder="Enter text query and press enter", visible=True).style(container=False)
AttributeError: 'Textbox' object has no attribute 'style'
```
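A likely cause is a Gradio version newer than the one the repo was written against: recent Gradio releases removed the component .style() method and moved its options into the constructor. A minimal sketch of the corresponding change in main_gradio.py, assuming that is indeed the cause:

```python
import gradio as gr

# Assumed fix for Gradio versions where Component.style() was removed:
# pass the layout option directly to the constructor instead of chaining .style().
input_message = gr.Textbox(
    show_label=False,
    placeholder="Enter text query and press enter",
    visible=True,
    container=False,  # previously .style(container=False)
)
```

Alternatively, installing the Gradio version the repo was developed with (if it is pinned in the requirements) avoids the API change altogether.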

Getting wrong results for queries irrelevant to the video

Hi, this is great work, but I do find some problems when using this model.

When I give a text query for something that does not exist at all in the video (e.g., when I ask "driving a car" on a video about a girl reading a book, without any car), the model always returns a time window that is completely irrelevant to the query. How can I solve this problem and make the model output nothing if the query is completely irrelevant to the video?

Questions about the metrics in the paper

Thanks for your amazing work. When I read the metrics on Charades-STA, I find that the R@0.5 and R@0.7 of Moment-DETR are not consistent with what its paper reports.

One more question: which split of the TACoS dataset did you evaluate on (val or test)?

Dataset feature files

Hi authors - Thanks for this great work. I fine-tuned the moment retrieval model on my own dataset and would love to do the evaluation on some of the datasets included in your work. I am wondering if you plan to share the CLIP video and text feature files of any or all of QVHighlights, Charades-STA, TACoS, YoutubeHL, TVSum datasets. Thanks.

model size mismatch

Hi, I have finished running the inference code. I can run the first model (PT) successfully, but when I use the second model (PT+FT), I get several errors like the one below:

[Screenshot of the errors, 2023-08-07 20:28]

Would you please update it? Thanks.

Results on QVHighlights val set

Hi,

Thanks for sharing the code for your amazing work!!! Can you share the results on the QVHighlights val set?

Thanks,
Goutham

What is the version of SlowFast (R50)?

I tried to extract video features with SlowFast 8x8 R50 on the TVSum dataset, but they differ somewhat from the features in the feature files you provided.

Given features: XzYM3PfTM4w.npz

Features extracted by me with SlowFast 8x8 R50

The extracted features ended up not working well when I tested them.

About timestamp calculation in DatasetMR

model_inputs["timestamp"] = ( (torch.arange(0, ctx_l) + self.clip_len / 2) / ctx_l).unsqueeze(1).repeat(1, 2)

Greetings.
Shouldn't 0.5 rather than clip_len / 2 be used here?
As I understand it, we need to compute the center timestamp of each clip, and ctx_l is the number of clips in the video. So to get the center of each clip, shouldn't we always use 0.5, since torch.arange(0, ctx_l) is just the list of integers 0..ctx_l-1? Something like this:

model_inputs["timestamp"] = ( (torch.arange(0, ctx_l) + 0.5) / ctx_l).unsqueeze(1).repeat(1, 2)

Discrepancy between the video length and the supposed highlight timestamp.

Hi,

first of all, thanks for releasing the code and model!

While testing the model on huggingface, I noticed a discrepancy between the video length and the supposed highlight timestamp.

The highlight should appear at 54s, but the video is only 30s long.

Do you have any idea why this might happen? (See image below)

Thanks!

[screenshot]

License problem

Disclaimer: I'm not a lawyer. However, I don't think a section in the README saying the license is MIT is sufficient to make the work here covered under MIT, and it might be possible to argue that this repo is actually copyrighted (as this is the default state for software in the absence of a license).

I would suggest creating a LICENSE file that includes the MIT text. I will submit a PR here shortly.

H5py dataset

Thanks for your great work!
By the way, does training run much faster with an h5py dataset than with the normal format?
Can you release the scripts for converting the format?

Installing requirements

Hi! Thanks for your nice research and codes.

I'm trying to set up the environment to run your code, but it doesn't work because many requirements end with @ file:///~~

My terminal said:
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/croot/aiohttp_1670009560265/work'

How can I fix it?
+++
In addition, likely for the same reason as the first issue, I changed the import section of video_extractor.py to:

```python
import pdb
import torch as th
import math
import numpy as np
import torch
from run_on_video.video_loader import VideoLoader
from torch.utils.data import DataLoader
import argparse
from run_on_video.preprocessing import Preprocessing
import torch.nn.functional as F
from tqdm import tqdm
import os
import sys
from run_on_video import clip
```

Questions about fine-tuning

Hi @QinghongLin, many thanks for sharing this great work! I was wondering: when fine-tuning UniVTG on downstream datasets without curve (highlight) labels (e.g., NLQ, Charades-STA, TACoS), did you still use the "CLIP teacher" method to obtain pseudo labels? In other words, are the results of UniVTG and UniVTG w/ PT in Table 3 obtained using pseudo highlight labels?

Waiting for your code update

RuntimeError: Given normalized_shape=[2818], expected input with shape [*, 2818], but got input of size[1, 157, 514]

Implementation on QVHighlights

Hi, I wonder if the results on QVHighlights were obtained by running train_mr.py? I face the same data-format problem when it comes to eval.py. Thank you! 😊

Code for CLIP_teacher

Hi, I could not find the code implementation of the CLIP teacher. Did you upload this part? 😊

Reproduce of VSLNet on TACoS.

Thank you for your excellent work!
Table 3 in the paper reports the performance of VSLNet on TACoS. Since reproducing VSLNet is actually quite difficult, could you provide the code and checkpoint you used for your VSLNet experiment?

I would appreciate it if you could answer my questions. 🥰

Loss parameters for downstream tasks

Hi, I got bad results when training on moment retrieval without loading the pretrained model. I wonder whether the hyperparameters in 'qvhl_pretrain.sh' are correct? 🙂
b_loss_coef=10 g_loss_coef=0 eos_coef=0.1 f_loss_coef=10 s_loss_intra_coef=0 s_loss_inter_coef=0

Dimension error when running the QFVS downstream task

When I use this checkpoint from https://github.com/showlab/UniVTG/blob/main/model.md and run the QFVS downstream task, I hit the following issue:

At first there is no batch-size dimension in the UniVTG forward pass, so it runs smoothly. But at some point the input suddenly becomes 4-dimensional, with the batch-size dimension first, so it gets stuck at the concat step:

https://github.com/showlab/UniVTG/blob/main/model/univtg.py#L119
https://github.com/showlab/UniVTG/blob/main/main/dataset_qfvs.py#L253C6-L253C6

with this error: RuntimeError: Tensors must have same number of dimensions: got 3 and 4

I wonder whether there is anything wrong with my script; I use the QFVS data from https://github.com/showlab/UniVTG/blob/main/install.md

Here is my script (renamed, since the original extension could not be uploaded):
qfvs_pretrain.txt

Thank you for your kind support!

Question about raw videos

Hi @QinghongLin! May I ask where I can find the raw videos for the TACoS and QFVS datasets? It seems that the original authors did not release them. Thank you!

hdf5 file does not exist

This is great work!
When I ran this project, I encountered a problem while running the "qvhl_pretrain.sh" script: the Traceback indicates that an hdf5 file does not exist, and indeed it is not present in the downloaded dataset.

Problem of pretrain

Thanks for your great work. I used the code and parameters provided by the authors to pretrain, but I found that during training loss_b collapses to a very small value very quickly, and the zero-shot results on the QVHighlights val split are also very poor. What could be the reason for this?
2024_03_25_02_53_16 [Epoch] 001 [Loss] loss_b 0.0051 loss_g 0.4086 loss_f 0.1647 loss_s_inter 1.1177 loss_s_intra 1.1780 loss_overall 2.8742
2024_03_25_03_50_17 [Epoch] 002 [Loss] loss_b 0.0005 loss_g 0.3896 loss_f 0.1610 loss_s_inter 1.0108 loss_s_intra 1.1777 loss_overall 2.7396
2024_03_25_04_47_42 [Epoch] 003 [Loss] loss_b 0.0004 loss_g 0.3850 loss_f 0.1593 loss_s_inter 0.9755 loss_s_intra 1.1755 loss_overall 2.6957
2024_03_25_05_44_57 [Epoch] 004 [Loss] loss_b 0.0004 loss_g 0.3820 loss_f 0.1583 loss_s_inter 0.9541 loss_s_intra 1.1737 loss_overall 2.6685
2024_03_25_06_42_11 [Epoch] 005 [Loss] loss_b 0.0004 loss_g 0.3802 loss_f 0.1577 loss_s_inter 0.9385 loss_s_intra 1.1724 loss_overall 2.6491
2024_03_25_07_39_30 [Epoch] 006 [Loss] loss_b 0.0003 loss_g 0.3787 loss_f 0.1573 loss_s_inter 0.9261 loss_s_intra 1.1711 loss_overall 2.6336
2024_03_25_08_36_49 [Epoch] 007 [Loss] loss_b 0.0003 loss_g 0.3774 loss_f 0.1570 loss_s_inter 0.9158 loss_s_intra 1.1702 loss_overall 2.6208
2024_03_25_09_34_08 [Epoch] 008 [Loss] loss_b 0.0003 loss_g 0.3763 loss_f 0.1568 loss_s_inter 0.9068 loss_s_intra 1.1693 loss_overall 2.6094
2024_03_25_10_31_21 [Epoch] 009 [Loss] loss_b 0.0003 loss_g 0.3750 loss_f 0.1566 loss_s_inter 0.8992 loss_s_intra 1.1686 loss_overall 2.5997
2024_03_25_11_28_37 [Epoch] 010 [Loss] loss_b 0.0003 loss_g 0.3742 loss_f 0.1564 loss_s_inter 0.8920 loss_s_intra 1.1678 loss_overall 2.5907

RuntimeError: Given normalized_shape=[2818], expected input with shape [*, 2818], but got input of size[1, 298, 514]

Regardless of whether I run SlowFast R50 + CLIP-B/16 or SlowFast R50 + CLIP-B/16 (QVHL + Charades + NLQ + TACoS + ActivityNet + DiDeMo), I get this error:
```
Total number of frames: 298
Traceback (most recent call last):
  File "/opt/disk1/UniVTG/main_gradio.py", line 180, in <module>
    forward(vtg_model, "./examples/", 'A man takes a photo on the bottom of the sea and sees a lot of fish.')
  File "/opt/disk1/UniVTG/main_gradio.py", line 91, in forward
    output = model(src_vid=src_vid, src_txt=src_txt, src_vid_mask=src_vid_mask, src_txt_mask=src_txt_mask)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/disk1/UniVTG/model/univtg.py", line 107, in forward
    src_vid = self.input_vid_proj(src_vid)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/disk1/UniVTG/model/univtg.py", line 402, in forward
    x = self.LayerNorm(x)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/root/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Given normalized_shape=[2818], expected input with shape [*, 2818], but got input of size [1, 298, 514]
```
The video I deployed is youtube.mp4.
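One guess at the arithmetic behind the mismatch (an assumption, not a confirmed diagnosis): the selected checkpoint's input projection expects SlowFast and CLIP features concatenated per clip, while the demo appears to be feeding CLIP-only features. The dimensions below are assumed typical values, not read from the repo.

```python
# Hypothetical breakdown of the shapes in the error (assumed feature dimensions):
clip_dim     = 512    # CLIP-B/16 visual features
slowfast_dim = 2304   # SlowFast R50 features (2048 slow + 256 fast)
extra_dim    = 2      # extra channels appended to each clip feature (e.g. timestamps)

print(slowfast_dim + clip_dim + extra_dim)  # 2818 -> what the checkpoint's LayerNorm expects
print(clip_dim + extra_dim)                 # 514  -> what the model actually received
```

If that is indeed the cause, extracting SlowFast features for the video as well, or switching to a CLIP-only checkpoint, should make the dimensions line up.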

Training Detail for Pretrain

Hello, thanks for your nice work. I want to confirm: is the pretrained model validated on the val set of the QVHighlights dataset, and is the checkpoint selected by comparing [email protected]? Also, could you please share the log file for pretraining?

Chinese CLIP

Hello, if the current "clip" module is replaced with "Chinese-CLIP", will it affect UniVTG's accuracy?

Questions on UniVTG

Hi, congratulations on your great success! I have two questions about UniVTG:

  1. ActivityNet Captions is one of the most commonly used datasets in video moment retrieval, but I don't find results on this dataset in the paper. Have you tested UniVTG on it?
  2. I tried your online demo and found that the model gives completely different predictions for two identical text inputs. Why is this happening?

Thanks!

Video summarization

First of all, I have to say you have built something really cool. I've tried the demo a bit and it works pretty decently considering how hard the task is.
However, I'm interested in the video summarization part, and I can't figure out how to do it.

Is there any inference example for that part?
