
showlab / UniVTG


[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding

Home Page: https://arxiv.org/abs/2307.16715

License: MIT License

Languages: Python 97.32%, Shell 2.68%
Topics: highlight-detection, moment-retrieval, pretraining, video-grounding, video-language, video-summarization

UniVTG's Introduction

UniVTG (ICCV'23)


TL;DR: The first video temporal grounding pretraining model, unifying diverse temporal annotations to power moment retrieval (interval), highlight detection (curve), and video summarization (point).

[Figure: UniVTG overview]

📢 News

  • [2023.10.15] Uploaded the CLIP teacher scripts to create scalable pseudo annotations.
  • [2023.8.22] Code cleanup; added training/inference instructions and uploaded all downstream checkpoints.
  • [2023.8.6] Created the Hugging Face Space demo!
  • [2023.7.31] Released the arXiv paper, code, checkpoints, and Gradio demo.

📝 Todo

  • Connect UniVTG with LLMs, e.g., ChatGPT.
  • Upload all downstream checkpoints.
  • Upload all pretraining checkpoints.

🌟 Run on your video:

To support practical usage, we release the following checkpoints:

They can be run on a single GPU with less than 4 GB of memory and are highly efficient: temporal grounding takes less than 1 second even for a 10-minute video.

| Video Enc. | Text Enc. | Pretraining | Fine-tuning | Checkpoints |
| --- | --- | --- | --- | --- |
| CLIP-B/16 | CLIP-B/16 | 4M | - | Google Drive |
| CLIP-B/16 | CLIP-B/16 | 4M | QVHL + Charades + NLQ + TACoS + ActivityNet + DiDeMo | Google Drive |

  1. Download a checkpoint and put it in the directory results/omni.

  2. Download the example videos from here and put them under examples/.

  3. Run python3 main_gradio.py --resume ./results/omni/model_best.ckpt

Examples: [YouTube video] [Egocentric video] [Charades video]

⚙️ Preparation

Please find instructions in install.md to set up the environment and datasets.

📦 Model Zoo

Download the checkpoints listed in model.md to reproduce the benchmark results.

🚀 Training & Inference

We use Slurm for job scheduling; if you do not use a Slurm system, you may need to slightly modify the code to adapt it to your environment.

Pretraining (multi-gpu)

Large-scale pretraining: bash scripts/pretrain.sh

Multi-datasets co-training: bash scripts/cotrain.sh

Downstream (single-gpu)

Pass --resume to initialize the model with pretrained weights. Refer to our model zoo for detailed parameter settings.

Training: bash scripts/qvhl_pretrain.sh

Pass --eval_init and --n_epoch=0 to evaluate the checkpoint specified by --resume.

Inference: bash scripts/qvhl_inference.sh

CLIP teacher to create scalable pseudo labels

  1. Download the Open Images V6 class list from https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv.

  2. Convert it to JSON with python3 teacher/csv2json.py, then extract the textual class features with python3 teacher/label2feature.py.

  3. Run python3 teacher/clip2labels.py to generate pseudo labels (before this, you should already have extracted the video features).
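For intuition, here is a minimal sketch of the idea behind such a CLIP teacher; it is not the repo's actual teacher/clip2labels.py, and the function name, threshold, and clip length below are made-up placeholders. It compares each clip's visual feature against the class text features and keeps contiguous runs of clips whose similarity exceeds a threshold as pseudo intervals.

```python
import numpy as np

def clip_teacher_pseudo_labels(video_feats, class_feats, threshold=0.25, clip_len=2.0):
    """Toy sketch of CLIP-teacher pseudo labeling (not the repo's implementation).

    video_feats: (T, D) per-clip CLIP visual features.
    class_feats: (C, D) CLIP text features of the class names.
    Returns a list of (class_idx, start_sec, end_sec, score) pseudo intervals.
    """
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = class_feats / np.linalg.norm(class_feats, axis=1, keepdims=True)
    sim = v @ t.T  # (T, C) cosine similarity of every clip against every class

    pseudo = []
    for c in range(sim.shape[1]):
        above = np.append(sim[:, c] > threshold, False)  # pad so every run closes
        start = None
        for i, flag in enumerate(above):
            if flag and start is None:
                start = i                                 # a run of "present" clips begins
            elif not flag and start is not None:
                score = float(sim[start:i, c].mean())
                pseudo.append((c, start * clip_len, i * clip_len, score))
                start = None
    return pseudo
```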

🎨 Visualization

If you want to draw visualizations like those in our paper, simply run python3 plot/qvhl.py to generate the corresponding figures, providing the prediction JSONs (you can download them from the Model Zoo).
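If you prefer to build a custom plot instead, a minimal sketch along the following lines should work. It assumes the prediction files are JSONL in the usual moment_detr-style format, with pred_relevant_windows as [start, end, score] triples and pred_saliency_scores as one score per clip; treat these field names and the clip_len default as assumptions.

```python
import json
import matplotlib.pyplot as plt

def plot_prediction(jsonl_path, qid, clip_len=2.0):
    """Plot the saliency curve and the top predicted moment for one query (sketch)."""
    with open(jsonl_path) as f:
        preds = [json.loads(line) for line in f]
    pred = next(p for p in preds if p["qid"] == qid)

    saliency = pred["pred_saliency_scores"]           # one score per clip
    times = [i * clip_len for i in range(len(saliency))]
    st, ed, score = pred["pred_relevant_windows"][0]  # highest-scoring moment

    plt.figure(figsize=(8, 2.5))
    plt.plot(times, saliency, label="saliency")
    plt.axvspan(st, ed, alpha=0.3, label=f"top moment ({score:.2f})")
    plt.xlabel("time (s)")
    plt.title(pred.get("query", f"qid {qid}"))
    plt.legend()
    plt.tight_layout()
    plt.show()
```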


🎓 Citation

If you find our work helpful, please cite our paper.

@misc{lin2023univtg,
      title={UniVTG: Towards Unified Video-Language Temporal Grounding}, 
      author={Kevin Qinghong Lin and Pengchuan Zhang and Joya Chen and Shraman Pramanick and Difei Gao and Alex Jinpeng Wang and Rui Yan and Mike Zheng Shou},
      year={2023},
      eprint={2307.16715},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

✉️ Contact

This repo is maintained by Kevin. Questions and discussions are welcome via [email protected] or by opening an issue.

😊 Acknowledgement

This codebase is built on moment_detr, HERO_Video_Feature_Extractor, and UMT.

We thank the authors for their open-source contributions.

UniVTG's People

Contributors

eltociear, melroy89, noahschiro, qinghonglin


UniVTG's Issues

Training Detail for Fine-tuning?

Thanks for your work. Some training details are not described very clearly in the README, so I would like to ask: what are the downstream tasks, training parameters, and corresponding training methods?

Problem about the demo

First question: when I run the code through my own interface and click Step 2, it shows an error.

Second question: when I run the code through my own interface using the video from your demo, the result is different from yours.

AttributeError: 'Textbox' object has no attribute 'style'

```
INFO:main.config - Loaded model saved at epoch 14 from checkpoint: ./results/model_best.ckpt
Traceback (most recent call last):
  File "/content/UniVTG/main_gradio.py", line 212, in <module>
    input_message = gr.Textbox(show_label=False, placeholder="Enter text query and press enter", visible=True).style(container=False)
AttributeError: 'Textbox' object has no attribute 'style'
```
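A likely cause is a Gradio version newer than the one the repo was written against: recent Gradio releases removed the component .style() method and moved its options into the constructor. A minimal sketch of the corresponding change in main_gradio.py, assuming that is indeed the cause:

```python
import gradio as gr

# Assumed fix for Gradio versions where Component.style() was removed:
# pass the layout option directly to the constructor instead of chaining .style().
input_message = gr.Textbox(
    show_label=False,
    placeholder="Enter text query and press enter",
    visible=True,
    container=False,  # previously .style(container=False)
)
```

Alternatively, installing the Gradio version the repo was developed with (if it is pinned in the requirements) avoids the API change altogether.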

Getting wrong results for queries irrelevant to the video

Hi, this is great work, but I do find some problems when using this model.

When I give a text query for something that does not exist at all in the video (e.g., when I ask "driving a car" on a video about a girl reading a book, without any car), the model always returns a time window that is completely irrelevant to the query. How can I solve this problem and make the model output nothing if the query is completely irrelevant to the video?

Questions about the metrics in the paper

Thanks for your amazing work. When I read the metrics on Charades-STA, I find that the R@0.5 and R@0.7 of Moment-DETR are not consistent with what its paper reports.

One more question: which split of the TACoS dataset did you evaluate on (val or test)?

Dataset feature files

Hi authors - Thanks for this great work. I fine-tuned the moment retrieval model on my own dataset and would love to do the evaluation on some of the datasets included in your work. I am wondering if you plan to share the CLIP video and text feature files of any or all of QVHighlights, Charades-STA, TACoS, YoutubeHL, TVSum datasets. Thanks.

model size mismatch

Hi, I have finished running the inference code. I can run the first model (PT) successfully, but when I use the second model (PT+FT), I get several errors like the one below:

[Screenshot of the errors, 2023-08-07 20:28]

Would you please update it? Thanks.

Results on QVHighlights val set

Hi,

Thanks for sharing the code for your amazing work!!! Can you share the results on the QVHighlights val set?

Thanks,
Goutham

What is the version of SlowFast (R50)?

I tried to extract video features with SlowFast 8x8 R50 on the TVSum dataset, but they differ somewhat from the features in the feature files you provided.

Given features: XzYM3PfTM4w.npz

Features extracted by me with SlowFast 8x8 R50

The extracted features ended up not working well when I tested them.

About timestamp calculation in DatasetMR

model_inputs["timestamp"] = ( (torch.arange(0, ctx_l) + self.clip_len / 2) / ctx_l).unsqueeze(1).repeat(1, 2)

Greetings.
Shouldn't 0.5 rather than clip_len / 2 be used here?
As I understand it, we need to compute the center timestamp of each clip, and ctx_l is the number of clips in the video. So to get the center of each clip, shouldn't we always use 0.5, since torch.arange(0, ctx_l) is just the list of integers 0..ctx_l-1? Something like this:

model_inputs["timestamp"] = ( (torch.arange(0, ctx_l) + 0.5) / ctx_l).unsqueeze(1).repeat(1, 2)

Discrepancy between the video length and the supposed highlight timestamp.

Hi,

first of all, thanks for releasing the code and model!

While testing the model on huggingface, I noticed a discrepancy between the video length and the supposed highlight timestamp.

The highlight should appear at 54s, but the video is only 30s long.

Do you have any idea why this might happen? (See image below)

Thanks!

[screenshot]

License problem

Disclaimer: I'm not a lawyer. However, I don't think a section in the README saying the license is MIT is sufficient to make the work here covered under MIT, and it might be possible to argue that this repo is actually copyrighted (as this is the default state for software in the absence of a license).

I would suggest creating a LICENSE file that includes the MIT text. I will submit a PR here shortly.

H5py dataset

Thanks for your great work!
By the way, does training run much faster with an h5py dataset than with the normal format?
Can you release the scripts for converting the format?

Installing requirements

Hi! Thanks for your nice research and codes.

I'm trying to set up the environment to run your code, but it doesn't work because many requirements end with @ file:///~~

My terminal said:
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/croot/aiohttp_1670009560265/work'

How can I fix it?
+++
In addition, likely for the same reason as the first issue, I changed the import section of video_extractor.py to:

```python
import pdb
import torch as th
import math
import numpy as np
import torch
from run_on_video.video_loader import VideoLoader
from torch.utils.data import DataLoader
import argparse
from run_on_video.preprocessing import Preprocessing
import torch.nn.functional as F
from tqdm import tqdm
import os
import sys
from run_on_video import clip
```

Questions about fine-tuning

Hi @QinghongLin, many thanks for sharing this great work! I was wondering: when fine-tuning UniVTG on downstream datasets without curve (highlight) labels (e.g., NLQ, Charades-STA, TACoS), did you still use the "CLIP teacher" method to obtain pseudo labels? In other words, are the results of UniVTG and UniVTG w/ PT in Table 3 obtained using pseudo highlight labels?

Waiting for your code update

RuntimeError: Given normalized_shape=[2818], expected input with shape [*, 2818], but got input of size[1, 157, 514]

Implementation on QVHighlights

Hi, I wonder if the results on QVHighlights were obtained by running train_mr.py? I face the same data-format problem when it comes to eval.py. Thank you! 😊

Code for CLIP_teacher

Hi, I could not find the code implementation of the CLIP teacher. Did you upload this part? 😊

Reproduce of VSLNet on TACoS.

Thank you for your excellent work!
Table 3 in the paper reports the performance of VSLNet on TACoS. Since reproducing VSLNet is actually quite difficult, could you provide the code and checkpoint you used for your VSLNet experiment?

I would appreciate it if you could answer my questions. 🥰

Loss parameters for downstream tasks

Hi, I got bad results when training on moment retrieval without loading the pretrained model. I wonder whether the hyperparameters in 'qvhl_pretrain.sh' are correct? 🙂
b_loss_coef=10 g_loss_coef=0 eos_coef=0.1 f_loss_coef=10 s_loss_intra_coef=0 s_loss_inter_coef=0

Dimension error when running the QFVS downstream task

When I use this checkpoint from https://github.com/showlab/UniVTG/blob/main/model.md and run the QFVS downstream task, I hit the following issue:

At first there is no batch-size dimension in the UniVTG forward pass, so it runs smoothly. But at some point the input suddenly becomes 4-dimensional, with the batch-size dimension first, so it gets stuck at the concat step:

https://github.com/showlab/UniVTG/blob/main/model/univtg.py#L119
https://github.com/showlab/UniVTG/blob/main/main/dataset_qfvs.py#L253C6-L253C6

with this error: RuntimeError: Tensors must have same number of dimensions: got 3 and 4

I wonder whether there is anything wrong with my script; I use the QFVS data from https://github.com/showlab/UniVTG/blob/main/install.md

Here is my script (renamed, since the original extension could not be uploaded):
qfvs_pretrain.txt

Thank you for your kind support!

Question about raw videos

Hi @QinghongLin! May I ask where I can find the raw videos for the TACoS and QFVS datasets? It seems that the original authors did not release them. Thank you!

hdf5 file does not exist

This is great work!
When I ran this project, I encountered a problem while running the "qvhl_pretrain.sh" script: the Traceback indicates that an hdf5 file does not exist, and indeed it is not present in the downloaded dataset.

Problem of pretrain

Thanks for your great work. I used the code and parameters provided by the authors to pretrain, but I found that during training loss_b collapses to a very small value very quickly, and the zero-shot results on the QVHighlights val split are also very poor. What could be the reason for this?
2024_03_25_02_53_16 [Epoch] 001 [Loss] loss_b 0.0051 loss_g 0.4086 loss_f 0.1647 loss_s_inter 1.1177 loss_s_intra 1.1780 loss_overall 2.8742
2024_03_25_03_50_17 [Epoch] 002 [Loss] loss_b 0.0005 loss_g 0.3896 loss_f 0.1610 loss_s_inter 1.0108 loss_s_intra 1.1777 loss_overall 2.7396
2024_03_25_04_47_42 [Epoch] 003 [Loss] loss_b 0.0004 loss_g 0.3850 loss_f 0.1593 loss_s_inter 0.9755 loss_s_intra 1.1755 loss_overall 2.6957
2024_03_25_05_44_57 [Epoch] 004 [Loss] loss_b 0.0004 loss_g 0.3820 loss_f 0.1583 loss_s_inter 0.9541 loss_s_intra 1.1737 loss_overall 2.6685
2024_03_25_06_42_11 [Epoch] 005 [Loss] loss_b 0.0004 loss_g 0.3802 loss_f 0.1577 loss_s_inter 0.9385 loss_s_intra 1.1724 loss_overall 2.6491
2024_03_25_07_39_30 [Epoch] 006 [Loss] loss_b 0.0003 loss_g 0.3787 loss_f 0.1573 loss_s_inter 0.9261 loss_s_intra 1.1711 loss_overall 2.6336
2024_03_25_08_36_49 [Epoch] 007 [Loss] loss_b 0.0003 loss_g 0.3774 loss_f 0.1570 loss_s_inter 0.9158 loss_s_intra 1.1702 loss_overall 2.6208
2024_03_25_09_34_08 [Epoch] 008 [Loss] loss_b 0.0003 loss_g 0.3763 loss_f 0.1568 loss_s_inter 0.9068 loss_s_intra 1.1693 loss_overall 2.6094
2024_03_25_10_31_21 [Epoch] 009 [Loss] loss_b 0.0003 loss_g 0.3750 loss_f 0.1566 loss_s_inter 0.8992 loss_s_intra 1.1686 loss_overall 2.5997
2024_03_25_11_28_37 [Epoch] 010 [Loss] loss_b 0.0003 loss_g 0.3742 loss_f 0.1564 loss_s_inter 0.8920 loss_s_intra 1.1678 loss_overall 2.5907

RuntimeError: Given normalized_shape=[2818], expected input with shape [*, 2818], but got input of size[1, 298, 514]

Regardless of whether I run SlowFast R50 + CLIP-B/16 or SlowFast R50 + CLIP-B/16 (QVHL + Charades + NLQ + TACoS + ActivityNet + DiDeMo), I get this error:
```
Total number of frames: 298
Traceback (most recent call last):
  File "/opt/disk1/UniVTG/main_gradio.py", line 180, in <module>
    forward(vtg_model, "./examples/", 'A man takes a photo on the bottom of the sea and sees a lot of fish.')
  File "/opt/disk1/UniVTG/main_gradio.py", line 91, in forward
    output = model(src_vid=src_vid, src_txt=src_txt, src_vid_mask=src_vid_mask, src_txt_mask=src_txt_mask)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/disk1/UniVTG/model/univtg.py", line 107, in forward
    src_vid = self.input_vid_proj(src_vid)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/disk1/UniVTG/model/univtg.py", line 402, in forward
    x = self.LayerNorm(x)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.local/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/root/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Given normalized_shape=[2818], expected input with shape [*, 2818], but got input of size [1, 298, 514]
```
The video I deployed is youtube.mp4.
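One guess at the arithmetic behind the mismatch (an assumption, not a confirmed diagnosis): the selected checkpoint's input projection expects SlowFast and CLIP features concatenated per clip, while the demo appears to be feeding CLIP-only features. The dimensions below are assumed typical values, not read from the repo.

```python
# Hypothetical breakdown of the shapes in the error (assumed feature dimensions):
clip_dim     = 512    # CLIP-B/16 visual features
slowfast_dim = 2304   # SlowFast R50 features (2048 slow + 256 fast)
extra_dim    = 2      # extra channels appended to each clip feature (e.g. timestamps)

print(slowfast_dim + clip_dim + extra_dim)  # 2818 -> what the checkpoint's LayerNorm expects
print(clip_dim + extra_dim)                 # 514  -> what the model actually received
```

If that is indeed the cause, extracting SlowFast features for the video as well, or switching to a CLIP-only checkpoint, should make the dimensions line up.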

Training Detail for Pretrain

Hello, thanks for your nice work. I want to confirm: is the pretrained model validated on the val set of the QVHighlights dataset, and is the checkpoint selected by comparing [email protected]? Also, could you please share the log file for pretraining?

Chinese CLIP

Hello, if the current "clip" module is replaced with "Chinese-CLIP", will it affect UniVTG's accuracy?

Questions on UniVTG

Hi, congratulations on your great success! I have two questions about UniVTG:

  1. ActivityNet Captions is one of the most commonly used datasets in video moment retrieval, but I don't find results on this dataset in the paper. Have you tested UniVTG on it?
  2. I tried your online demo and found that the model gives completely different predictions for two identical text inputs. Why is this happening?

Thanks!

Video summarization

First of all, I have to say you have built something really cool. I've tried the demo a bit and it works pretty decently considering how hard the task is.
However, I'm interested in the video summarization part, and I can't figure out how to do it.

Is there any inference example for that part?
