
Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark (CVPR 2021)

Home Page: https://sites.google.com/view/langtrackbenchmark/


TNL2K_Evaluation_Toolkit

Xiao Wang*, Xiujun Shu*, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, Feng Wu, Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark, IEEE CVPR 2021 (* denotes equal contribution). [Paper] [Project] [Slides] [TNL2K-BaiduYun (Code: pclt)] [SOT Paper List] [Demo Video (Youtube)] [COVE] [中文视频]

News:

  • 2022.06.24 Updated the language descriptions for videos INF_womanpink, NBA2k_Kawayi_video_15, INF_woman46, and Fight_video_6-Done [Link]
  • 2022.04.17 Updated the damaged image on BaiduYun: test_015_Sord_video_Q01_done\00000492.png [link]
  • 2022.01.14 Updated the links for Google Drive.
  • 2021.10.13 Updated the links for OneDrive.

Abstract:

Tracking by natural language specification is a newly rising research topic that aims to locate the target object in a video sequence based on its language description. Compared with traditional bounding box (BBox) based tracking, this setting guides object tracking with high-level semantic information, addresses the ambiguity of BBox, and links local and global search organically together. These benefits may bring more flexible, robust, and accurate tracking performance in practical scenarios. However, existing natural language initialized trackers are developed and compared on benchmark datasets proposed for tracking-by-BBox, which cannot reflect the true power of tracking-by-language. In this work, we propose a new benchmark specifically dedicated to tracking-by-language, including a large-scale dataset and strong, diverse baseline methods. Specifically, we collect 2k video sequences (containing a total of 1,244,340 frames and 663 words) and split them 1300/700 for training/testing, respectively. We densely annotate one English sentence and the corresponding bounding boxes of the target object for each video. A strong baseline method based on an adaptive local-global-search scheme is proposed for future works to compare against. We believe this benchmark will greatly boost related research on natural language guided tracking.

How to Download the TNL2K Dataset?

Currently, the dataset can be downloaded from BaiduYun and OneDrive:

1. Download from BaiduYun:

  Link: https://pan.baidu.com/s/1Joc5DqJUwGb4cGiFeh5Iug (Code: pclt) 

2. Download from OneDrive: Click [here]

Note: The annotations of 12 videos in the training subset have been revised to improve accuracy. Please update these videos with the [new annotations].

Tutorial for the Evaluation Toolkit:

  1. Clone this GitHub repository:
git clone https://github.com/wangxiao5791509/TNL2K_evaluation_toolkit
  2. Extract the related files for evaluation:
cd annos && tar -zxvf ./annos.tar.gz
  3. Download the benchmark results from [Benchmark-Results] and extract them:
tar -zxvf ./tracking_results_TNL2K.tar.gz
  4. Open MATLAB and run the script:
Evaluate_TNL2K_dataset.m
  5. Wait for the final results: fig-1
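
For readers who want to sanity-check numbers outside MATLAB, here is a minimal Python sketch of the overlap-based success score that LaSOT-style toolkits typically report. It is not taken from this repository. The function names, the `[x, y, w, h]` box format, and the choice of 21 evenly spaced thresholds are assumptions made for illustration, so the MATLAB script remains the reference for official results.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: [x, y, width, height]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def success_curve(pred: List[Box], gt: List[Box],
                  num_thresholds: int = 21) -> List[float]:
    """Fraction of frames whose IoU exceeds each threshold in [0, 1]."""
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and equally long")
    overlaps = [iou(p, g) for p, g in zip(pred, gt)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]


def success_auc(pred: List[Box], gt: List[Box]) -> float:
    """Area under the success curve, approximated by its mean value."""
    curve = success_curve(pred, gt)
    return sum(curve) / len(curve)


if __name__ == "__main__":
    # Hypothetical two-frame sequence: one perfect box, one shifted box.
    gt_boxes = [[0, 0, 10, 10], [0, 0, 10, 10]]
    pred_boxes = [[0, 0, 10, 10], [5, 0, 10, 10]]
    print(success_auc(pred_boxes, gt_boxes))
```

Averaging the curve is a common approximation of the area under it when the thresholds are evenly spaced.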

Acknowledgement

This code is adapted from the evaluation toolkit of [LaSOT].

[Language-Tracking-Paper-List]

Question and Answer

Q1. What is the difference between the proposed "tracking by natural language" and "video grounding"?
A1. As noted in papers [a, b], the video grounding task requires the machine to watch a video and localize the start and end times of the video segment that corresponds to the given query. In contrast, our proposed task focuses on localizing the target object spatially in each video frame.

[a] Gao, Jiyang, et al. "Tall: Temporal activity localization via language query." Proceedings of the IEEE international conference on computer vision. 2017. [Paper]

[b] Zeng, Runhao, et al. "Dense regression network for video grounding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. [Paper]
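
The distinction drawn in A1 comes down to the shape of the output. The sketch below uses hypothetical Python types to illustrate it; they are not drawn from the code of this toolkit or of papers [a, b]. The class names, field names, and `(x, y, w, h)` box layout are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class GroundingResult:
    """Video grounding: one temporal segment for the whole video."""
    start_time: float  # seconds
    end_time: float    # seconds


@dataclass
class TrackingResult:
    """Tracking by natural language: one bounding box per frame."""
    boxes: List[Tuple[float, float, float, float]]  # assumed (x, y, w, h)


def describe(result: Union[GroundingResult, TrackingResult]) -> str:
    """Summarize what each task returns for a language query."""
    if isinstance(result, GroundingResult):
        return f"segment {result.start_time:.1f}s-{result.end_time:.1f}s"
    return f"{len(result.boxes)} per-frame boxes"


if __name__ == "__main__":
    print(describe(GroundingResult(3.0, 7.5)))
    print(describe(TrackingResult([(0.0, 0.0, 1.0, 1.0)] * 3)))
```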

Q2. Is it reasonable and fair to compare language-assisted trackers with bbox-initialized trackers?
A2. As mentioned in our paper, standard visual tracking ignores the semantic information of the target object, and it is hard to judge which target a human wants to track when only the first-frame bounding box is provided (for example, the player who currently controls the ball vs. one specific, fixed player). TNL2K can therefore also be used for the standard evaluation of bbox-based trackers. Because the initial bbox is the same for all trackers, comparing them under the same initialization setting is fair.

Q3. What can language-based tracking be used for?
A3. For example, police searching for a suspect or a vehicle in a large volume of video.

Citation:

If you find this work useful for your research, please cite the following paper:

@InProceedings{wang2021tnl2k,
    author    = {Wang, Xiao and Shu, Xiujun and Zhang, Zhipeng and Jiang, Bo and Wang, Yaowei and Tian, Yonghong and Wu, Feng},
    title     = {Towards More Flexible and Accurate Object Tracking With Natural Language: Algorithms and Benchmark},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {13763-13773}
}

If you have any questions about this work, please contact me via [email protected] or [email protected].
