wangxiao5791509 / tnl2k_evaluation_toolkit

Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark (CVPR 2021)

Home Page: https://sites.google.com/view/langtrackbenchmark/

Language: MATLAB (100.00%)
Topics: tracking-by-natural-language, visual-tracking, tracking-algorithm, single-object-tracking

tnl2k_evaluation_toolkit's Introduction

TNL2K_Evaluation_Toolkit

Xiao Wang*, Xiujun Shu*, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, Feng Wu, Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark, IEEE CVPR 2021 (* denotes equal contribution). [Paper] [Project] [Slides] [TNL2K-BaiduYun (Code: pclt)] [SOT Paper List] [Demo Video (Youtube)] [COVE] [Video (in Chinese)]

News:

  • 2022.06.24 Update the language descriptions for videos INF_womanpink, NBA2k_Kawayi_video_15, INF_woman46, and Fight_video_6-Done [Link]
  • 2022.04.17 Update the damaged image on BaiduYun: test_015_Sord_video_Q01_done\00000492.png [link]
  • 2022.01.14 Update links for Google Drive.
  • 2021.10.13 Update links for OneDrive.

Abstract:

Tracking by natural language specification is an emerging research topic that aims to locate the target object in a video sequence based on its language description. Compared with traditional bounding-box (BBox) based tracking, this setting guides object tracking with high-level semantic information, addresses the ambiguity of the BBox, and links local and global search together organically. These benefits may bring more flexible, robust, and accurate tracking performance in practical scenarios. However, existing natural-language-initialized trackers are developed and compared on benchmark datasets proposed for tracking-by-BBox, which cannot reflect the true power of tracking-by-language. In this work, we propose a new benchmark specifically dedicated to tracking-by-language, including a large-scale dataset and strong, diverse baseline methods. Specifically, we collect 2,000 video sequences (containing 1,244,340 frames in total and a vocabulary of 663 words) and split them into 1,300/700 for training/testing, respectively. For each video, we densely annotate one sentence in English and the corresponding bounding boxes of the target object. A strong baseline method based on an adaptive local-global search scheme is proposed for future works to compare against. We believe this benchmark will greatly boost related research on natural-language-guided tracking.

How to Download the TNL2K Dataset?

Currently, the dataset can be downloaded from BaiduYun or OneDrive:

1. Download from BaiduYun:

  Link: https://pan.baidu.com/s/1Joc5DqJUwGb4cGiFeh5Iug (Code: pclt) 

2. Download from OneDrive: Click [here]

Note: The annotations of 12 videos in the training subset have been modified to be more accurate. Please update these videos with the [new annotations].
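
For convenience, a minimal MATLAB sketch of this replacement step, assuming the update unpacks into a folder that mirrors the training-set layout (the newRoot/oldRoot paths and the groundtruth.txt file name are assumptions):

% Overwrite the old annotation files with the corrected ones, sequence by sequence.
newRoot = './new_annotations';     % assumed: where the updated annotations were unpacked
oldRoot = './TNL2K_train_subset';  % assumed: the original training subset
files = dir(fullfile(newRoot, '**', 'groundtruth.txt'));
for i = 1:numel(files)
    [~, seqName] = fileparts(files(i).folder);  % name of the sequence folder
    copyfile(fullfile(files(i).folder, files(i).name), ...
             fullfile(oldRoot, seqName, files(i).name));
end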

Tutorial for the Evaluation Toolkit:

  1. Clone this repository:
git clone https://github.com/wangxiao5791509/TNL2K_evaluation_toolkit
  2. Unzip the annotation files needed for evaluation:
cd annos && tar -zxvf ./annos.tar.gz
  3. Download the benchmark results from [Benchmark-Results] and unpack them:
tar -zxvf ./tracking_results_TNL2K.tar.gz
  4. Open MATLAB and run the script:
Evaluate_TNL2K_dataset.m
  5. Wait for the evaluation to finish and view the final result plots.

Acknowledgement

This code is adapted from the evaluation toolkit of [LaSOT].

[Language-Tracking-Paper-List]

Question and Answer

Q1. What is the difference between the proposed "tracking by natural language" and "video grounding"? A1. As noted in papers [a, b], the video grounding task requires the machine to watch a video and localize the starting and ending times of the video segment that corresponds to a given query. In contrast, our proposed task focuses on locating the target spatially in each video frame.

[a] Gao, Jiyang, et al. "TALL: Temporal activity localization via language query." Proceedings of the IEEE International Conference on Computer Vision. 2017. [Paper]

[b] Zeng, Runhao, et al. "Dense regression network for video grounding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. [Paper]

Q2. Is it reasonable and fair to compare language-assisted trackers with bbox-initialized trackers? A2. As mentioned in our paper, standard visual tracking ignores the semantic information of the target object, and it is hard to judge which target a human wants to track when only a bounding box is provided in the first frame (for example, the player who currently controls the ball vs. one specific, fixed player). That said, TNL2K can also be used for the standard evaluation of bbox-based trackers: the initialization bbox is the same for all trackers, so comparing trackers under the same initialization setting is fair.

Q3. What can language-based tracking be used for? A3. For example, police searching for a suspect or a vehicle in a huge amount of video footage.

Citation:

If you find this work useful for your research, please cite the following paper:

@InProceedings{wang2021tnl2k,
    author    = {Wang, Xiao and Shu, Xiujun and Zhang, Zhipeng and Jiang, Bo and Wang, Yaowei and Tian, Yonghong and Wu, Feng},
    title     = {Towards More Flexible and Accurate Object Tracking With Natural Language: Algorithms and Benchmark},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {13763-13773}
}

If you have any questions about this work, please contact me via [email protected] or [email protected].


tnl2k_evaluation_toolkit's Issues

How can I get precision plots?

When I set norm_dst=true and ranking_type='AUC', the toolkit generates Norm. Prec. and Success plots that match the plots in your paper.
But the Prec. plot is different when I set norm_dst=false and ranking_type='threshold'.
The score of SiamRCNN is 0.528 in your paper, but I got 0.561.
However, I found that the score of Ours is the same as that reported in your paper.
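
For reference, the two settings under comparison (the variable names follow the issue text; where exactly they appear inside Evaluate_TNL2K_dataset.m is an assumption):

% Setting reported to match the paper (normalized precision and success plots):
norm_dst = true;  ranking_type = 'AUC';
% Setting whose precision scores differ from the paper:
norm_dst = false; ranking_type = 'threshold';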

Results reported in the paper

Hello,

I would like to ask whether all the methods reported in the paper were tuned on the TNL2K training set.

Best

About the absent annotation of training set

Hi, thanks for your work. I downloaded your dataset from OneDrive and found that there are no absent or attribute annotations for the training set, which differs from the description in the paper. Is this expected?

About downloading the benchmark results

Hello Professor Wang! Could you provide another way to download the benchmark results, for example via Baidu Netdisk or Google Drive? I have tried the method you provided many times, but the download never succeeds. Many thanks!

Typos in the dataset

I found several typos in the captions of the dataset.

sking -> skiing
nake, neaked -> naked
solider -> soldier
soliders -> soldiers
sord -> sword
sords -> swords
gril -> grill
seond -> second
occlued -> occluded
controled, contrled -> controlled
baskbetball, baskball -> basketball
unbrella, umbrealla -> umbrella
coffe -> coffee
stoller -> stroller
lef -> left
inher -> in her
againest -> against
anoter -> another
vieo -> video
mourse -> mouse
palyer -> player
tract -> track
mouth shad -> ? (I guess it is meant to be "mask")
yello -> yellow
wo -> we
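
A minimal MATLAB sketch for patching such typos in bulk, assuming each sequence folder carries a language.txt with the description (the folder layout and file name are assumptions; review the changes before overwriting anything):

% Patch the language description of every training sequence in place.
root  = './TNL2K_train_subset';
fixes = {'sking','skiing'; 'solider','soldier'; 'soliders','soldiers'; ...
         'sord','sword'; 'sords','swords'; 'yello','yellow'};  % extend with the full list above
seqs = dir(root);
for i = 1:numel(seqs)
    if ~seqs(i).isdir || startsWith(seqs(i).name, '.'), continue; end
    f = fullfile(root, seqs(i).name, 'language.txt');
    if ~isfile(f), continue; end
    txt = fileread(f);
    for k = 1:size(fixes, 1)
        % \< and \> restrict the match to whole words, so 'yello' does not touch 'yellow'.
        txt = regexprep(txt, ['\<' fixes{k,1} '\>'], fixes{k,2});
    end
    fid = fopen(f, 'w'); fwrite(fid, txt); fclose(fid);
end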

Input resolution

First of all, thanks for your excellent work on language-guided tracking.
Now I have a question: what is the input resolution in your experiments?

Different frame numbers between annotations and images

I found some cases where the frame counts differ between the bbox annotations and the images.
Cases in the training set:
{file_name: [frames in annotation, frames of images]}:
{'Assassin_video_5-Done': [1080, 785],
'BF5_FireGUN_video_01_done': [1266, 633],
'Baseball_game_002-Done': [714, 713],
'Boat_video1-Done': [570, 470],
'Cat_BBC_video_03-Done': [145, 133],
'catNight_video_B04_done': [1518, 151],
'Fight_video_02-Done': [916, 914],
'Fight_video_04-Done': [817, 816],
'Joker_video_X02_done': [2139, 1268],
'maliao_4-Done': [1609, 1608],
'monitor_bikeyellow': [930, 929],
'videoPlayer_video_01_done': [2009, 2008]}

I haven't checked the test set.
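
A minimal MATLAB sketch that reproduces this check, assuming each sequence folder holds an imgs/ directory of PNG frames and a groundtruth.txt with one box per line (both names are assumptions about the released layout):

% Report every training sequence whose bbox count differs from its frame count.
root = './TNL2K_train_subset';
seqs = dir(root);
for i = 1:numel(seqs)
    if ~seqs(i).isdir || startsWith(seqs(i).name, '.'), continue; end
    nFrames = numel(dir(fullfile(root, seqs(i).name, 'imgs', '*.png')));
    boxes   = readmatrix(fullfile(root, seqs(i).name, 'groundtruth.txt'));
    if size(boxes, 1) ~= nFrames
        fprintf('%s: %d boxes vs. %d frames\n', seqs(i).name, size(boxes, 1), nFrames);
    end
end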

Code of TNL2K algorithm

Hi Wang,
Do you have any plan to release the code of the "Towards More Flexible and Accurate Object Tracking with Natural Language" algorithm?
Thanks.

Annotation length is sometimes longer than the video length

In the following 12 videos out of the 1,300 training videos, the annotation length is longer than the video length.

Video #Frames #BBoxes
Assassin_video_5-Done 785 1080
BF5_FireGUN_video_01_done 633 1266
Baseball_game_002-Done 713 714
Boat_video1-Done 470 570
Cat_BBC_video_03-Done 133 145
Fight_video_02-Done 914 916
Fight_video_04-Done 816 817
Joker_video_X02_done 1268 2139
catNight_video_B04_done 151 1518
maliao_4-Done 1608 1609
monitor_bikeyellow 929 930
videoPlayer_video_01_done 2008 2009

For those videos, which part of the annotations should we be using?
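
Until the annotations are officially corrected, one possible stopgap (purely an assumption on our side, not guidance from the authors) is to keep only the first N boxes, where N is the number of frames on disk, as in this minimal MATLAB sketch (paths and file names are assumptions):

% Truncate each affected annotation file to the number of frames on disk.
root = './TNL2K_train_subset';
bad  = {'Assassin_video_5-Done', 'BF5_FireGUN_video_01_done'};  % extend with the full table above
for i = 1:numel(bad)
    gt      = fullfile(root, bad{i}, 'groundtruth.txt');
    boxes   = readmatrix(gt);
    nFrames = numel(dir(fullfile(root, bad{i}, 'imgs', '*.png')));
    writematrix(boxes(1:min(nFrames, size(boxes, 1)), :), gt);  % keep only the first rows
end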

Corrupted files on Google Drive

Hello,

When unzipping files TNL2K_test_subset_p5.zip, TNL2K_test_subset_p3.zip, TNL2K_test_subset_p2.zip, I got the following error:

End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.

Could you check again if you have uploaded everything for the test folder?

I used "unzip -qq FILENAME" to unzip the files. Did you use a different tool, such as gunzip or tar?

Thank you!
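
To narrow down which archives are actually intact, a minimal MATLAB sketch (the file names come from the issue above; everything else is an assumption):

% Try to extract each part into a throwaway folder and report which ones fail.
parts = {'TNL2K_test_subset_p2.zip', 'TNL2K_test_subset_p3.zip', 'TNL2K_test_subset_p5.zip'};
for i = 1:numel(parts)
    try
        names = unzip(parts{i}, tempname);  % MATLAB's built-in unzip
        fprintf('%s: OK (%d entries)\n', parts{i}, numel(names));
    catch err
        fprintf('%s: failed (%s)\n', parts{i}, err.message);
    end
end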

GoogleDrive cannot download

Dear Wang,

Thanks for your excellent work!
When I use the Google Drive link to download the TNL2K dataset, the website says the link cannot be found.

Thank you!

The result comparison

Nice work. Were Siam R-CNN and the other trackers only tested on TNL2K, or were they trained on it as well?
I am asking about the results for "Tracking by Bounding Box Only".

TNL2K object classes

How many object classes are there in the TNL2K dataset? There are 70 object classes in LaSOT, and I wonder how many there are in TNL2K.
