Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Official PyTorch implementation of the paper Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation. Currently, only the inference code is available; the training code will be released later.

Overview

In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. We develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers. Specifically, we first introduce a single spatio-temporal Transformer block to process the frame-wise DINO features and establish spatio-temporal dependencies in the form of self-attention. Subsequently, utilizing these attention maps, we implement hierarchical clustering to generate object segmentation masks. Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and particularly excels in complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19.
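
The clustering idea described above can be illustrated with a toy example. The following is only a minimal sketch, not the released implementation: it assumes a small hand-written attention matrix, symmetrizes it into an affinity, and applies plain average-linkage agglomerative clustering. The function name `cluster_tokens` and the toy values are hypothetical and chosen for illustration.

```python
from itertools import combinations


def cluster_tokens(attn, num_clusters):
    """Greedy average-linkage agglomerative clustering of tokens.

    attn: square list of lists; attn[i][j] is how strongly token i
    attends to token j. Returns one cluster label per token.
    """
    n = len(attn)
    # Symmetrize so the affinity does not depend on attention direction.
    aff = [[0.5 * (attn[i][j] + attn[j][i]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Average pairwise affinity between two clusters.
        return sum(aff[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the highest average affinity.
        # combinations() yields index pairs with a < b, so deleting b
        # after the merge leaves index a valid.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy attention map (hypothetical values) for 5 tokens:
# tokens 0-2 attend mostly to each other, as do tokens 3-4.
toy_attn = [
    [0.40, 0.30, 0.25, 0.03, 0.02],
    [0.30, 0.40, 0.25, 0.03, 0.02],
    [0.25, 0.30, 0.40, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.50, 0.40],
    [0.02, 0.03, 0.05, 0.40, 0.50],
]
labels = cluster_tokens(toy_attn, num_clusters=2)
print(labels)
```

With this toy input, tokens 0 to 2 end up in one cluster and tokens 3 and 4 in another. In the actual method the attention maps come from the spatio-temporal Transformer block over DINO features; the merging criterion used there may differ from the average linkage assumed here.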

[Project Page](coming soon) [arXiv] [PDF]

Usage

Requirements

  • pytorch 1.12
  • torchvision
  • opencv-python
  • cvbase
  • einops
  • kornia
  • tensorboardX
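
As a convenience, the list above can be checked before running the code. The snippet below is a small sketch that only tests whether each package is importable; it does not verify versions such as pytorch 1.12. The helper name `missing_packages` is hypothetical, and the mapping uses the usual import names (`torch` for pytorch, `cv2` for opencv-python).

```python
import importlib.util

# Package name from the requirements list -> module name used in `import`.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "torchvision": "torchvision",
    "opencv-python": "cv2",
    "cvbase": "cvbase",
    "einops": "einops",
    "kornia": "kornia",
    "tensorboardX": "tensorboardX",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the package names whose module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```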

Data preparation

Pretrained Model

  • Download our model checkpoint based on DINO-ViT-S/8 pretrained on YT-VOS 2016 [google drive]

Downstream Evaluation

After downloading the pretrained model, you can run the inference code by executing:

bash start_eval.sh

Ensure that basepath and davis_path are set to your DAVIS data path. The script then reports the final performance on DAVIS-2017-Unsupervised. The default resolution is 320 480, chosen as a trade-off between segmentation accuracy and inference speed. Feel free to set a smaller resolution, e.g., 192 384, for faster inference and evaluation, or a larger resolution for higher accuracy.
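
To get a rough feel for the speed side of this trade-off, one can count the patch tokens per frame. The sketch below assumes the two numbers are height and width in pixels, and that the ViT-S/8 backbone splits each frame into non-overlapping 8x8 patches. It ignores any class token and the cross-frame attention, so treat the numbers as an approximation rather than a measurement. The function name `tokens_per_frame` is hypothetical.

```python
def tokens_per_frame(height, width, patch_size=8):
    """Number of non-overlapping patch tokens for one frame."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


for height, width in [(320, 480), (192, 384)]:
    n = tokens_per_frame(height, width)
    # Pairwise self-attention entries grow with the square of the token count.
    print(f"{height}x{width}: {n} tokens, {n * n} attention entries per frame")
```

Under these assumptions the default resolution gives 2400 tokens per frame and the smaller one 1152, so the per-frame attention map shrinks by more than a factor of four.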

Visualization

We visualize results on video sequences with occlusions. Our model handles partial or complete object occlusion, where an object disappears in some frames and reappears in later ones.

Acknowledgement

Our code is partly based on the implementation of Motion Grouping. We sincerely thank the authors for their significant contribution. If you have any questions regarding the paper or code, please don't hesitate to send us an email or raise an issue.

Citation

If our code assists your work, please consider citing:

@article{ding2023betrayed,
  title={Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation},
  author={Ding, Shuangrui and Qian, Rui and Xu, Haohang and Lin, Dahua and Xiong, Hongkai},
  journal={arXiv preprint arXiv:2311.17893},
  year={2023}
}
