
idea-research / grounded-sam-2


Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2

Home Page: https://arxiv.org/abs/2401.14159

License: Apache License 2.0

Shell 0.01% Python 1.95% Dockerfile 0.01% Jupyter Notebook 97.83% C++ 0.02% Cuda 0.19% Makefile 0.01%

grounded-sam-2's Introduction

Grounded SAM 2

Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Grounding DINO 1.5, Florence-2 and SAM 2.

🔥 Project Highlight

In this repo, we support the following demos with simple implementations:

  • Ground and Segment Anything with Grounding DINO, Grounding DINO 1.5 & 1.6 and SAM 2
  • Ground and Track Anything with Grounding DINO, Grounding DINO 1.5 & 1.6 and SAM 2
  • Detect, Segment and Track Visualization based on the powerful supervision library.

Grounded SAM 2 does not introduce significant methodological changes compared to Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. Both approaches leverage the capabilities of open-world models to address complex visual tasks. Consequently, we try to simplify the code implementation in this repository, aiming to enhance user convenience.

intro_video.mp4

News

  • 2024/08/20: Support the Florence-2 SAM 2 Image Demo, which includes dense region captioning, object detection, phrase grounding, and a cascaded auto-labeling pipeline (caption + phrase grounding).
  • 2024/08/09: Support grounding and tracking new objects throughout whole videos. This feature is still under development. Credits to Shuo Shen.
  • 2024/08/07: Support Custom Video Inputs: users only need to submit their own video file (e.g. an .mp4 file) with specific text prompts to get an impressive demo video.

Contents

Installation

Download the pretrained SAM 2 checkpoints:

cd checkpoints
bash download_ckpts.sh

Download the pretrained Grounding DINO checkpoints:

cd gdino_checkpoints
bash download_ckpts.sh

Installation without docker

Install the PyTorch environment first. We use python=3.10, torch>=2.3.1, torchvision>=0.18.1, and cuda-12.1 in our environment to run this demo. Please follow the instructions here to install both the PyTorch and TorchVision dependencies; installing them with CUDA support is strongly recommended. You can install the latest version of PyTorch as follows:

pip3 install torch torchvision torchaudio

Since a CUDA compilation environment is needed to compile the Deformable Attention operator used in Grounding DINO, check whether the CUDA environment variables are set correctly (refer to Grounding DINO Installation for more details). If you want to build a local GPU environment for Grounding DINO to run Grounded SAM 2, you can set the environment variable manually as follows:

export CUDA_HOME=/path/to/cuda-12.1/
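
Before compiling the CUDA extensions, it can help to sanity-check the environment. The following minimal snippet (a suggested check, not part of this repo) assumes PyTorch is already installed:

import os
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())   # should be True on a GPU machine
print("torch built with CUDA:", torch.version.cuda)   # e.g. "12.1"
print("CUDA_HOME:", os.environ.get("CUDA_HOME"))      # should point to your CUDA toolkit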

Install Segment Anything 2:

pip install -e .

Install Grounding DINO:

pip install --no-build-isolation -e grounding_dino

Installation with docker

Build the Docker image and run the Docker container:

cd Grounded-SAM-2
make build-image
make run

After executing these commands, you will be inside the Docker environment. The working directory within the container is set to: /home/appuser/Grounded-SAM-2

Once inside the Docker environment, you can start the demo by running:

python grounded_sam2_tracking_demo.py

Grounded SAM 2 Demos

Grounded SAM 2 Image Demo (with Grounding DINO)

Note that Grounding DINO is already supported in Hugging Face Transformers, so we provide two ways to run the Grounded SAM 2 model (a minimal sketch of the Hugging Face route follows this list):

  • Use the Hugging Face API to run inference with Grounding DINO (simple and clear):
python grounded_sam2_hf_model_demo.py

Note

🚨 If you encounter network issues while using the HuggingFace model, you can resolve them by setting the appropriate mirror source as export HF_ENDPOINT=https://hf-mirror.com

  • Load a local pretrained Grounding DINO checkpoint and run inference with the original Grounding DINO API (make sure you have already downloaded the pretrained checkpoint):
python grounded_sam2_local_demo.py
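
For reference, the Hugging Face route boils down to roughly the following. This is a minimal sketch only (it assumes the IDEA-Research/grounding-dino-tiny checkpoint on the Hub and the SAM 2 checkpoint downloaded above); see grounded_sam2_hf_model_demo.py for the actual demo code:

import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Grounding DINO via the Hugging Face API
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
gdino = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny").to(device)

image = Image.open("notebooks/images/cars.jpg").convert("RGB")
text = "car. tire."  # class names should be lowercase and separated by periods

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = gdino(**inputs)
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
input_boxes = results[0]["boxes"].cpu().numpy()  # (n, 4) boxes in xyxy format

# the SAM 2 image predictor takes the boxes as prompts
sam2_model = build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt", device=device)
sam2_predictor = SAM2ImagePredictor(sam2_model)
sam2_predictor.set_image(np.array(image))
masks, scores, logits = sam2_predictor.predict(
    point_coords=None, point_labels=None, box=input_boxes, multimask_output=False,
)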

Grounded SAM 2 Image Demo (with Grounding DINO 1.5 & 1.6)

We have already released our most capable open-set detection models, Grounding DINO 1.5 & 1.6, which can be combined with SAM 2 for stronger open-set detection and segmentation. You can apply for the API token first and run Grounded SAM 2 with Grounding DINO 1.5 as follows:

Install the latest DDS cloudapi:

pip install dds-cloudapi-sdk

Apply for your API token on our official website: request API token.

python grounded_sam2_gd1.5_demo.py

Grounded SAM 2 Video Object Tracking Demo

Based on the strong tracking capability of SAM 2, we can combine it with Grounding DINO for open-set object segmentation and tracking. You can run the following script to get the tracking results with Grounded SAM 2:

python grounded_sam2_tracking_demo.py
  • The tracking results for each frame will be saved in ./tracking_results
  • The video will be saved as children_tracking_demo_video.mp4
  • You can modify this file with different text prompts and video clips to get more tracking results.
  • For simplicity, we only prompt the first video frame with Grounding DINO.

Support Various Prompt Type for Tracking

We support different types of prompts for the Grounded SAM 2 tracking demo (a rough sketch of the mask-prompt flow follows the list):

  • Point Prompt: In order to get stable segmentation results, we reuse the SAM 2 image predictor to get the prediction mask for each object based on the Grounding DINO box outputs, then uniformly sample points from the prediction mask as point prompts for the SAM 2 video predictor.
  • Box Prompt: We directly use the box outputs from Grounding DINO as box prompts for the SAM 2 video predictor.
  • Mask Prompt: We use the SAM 2 mask prediction results based on the Grounding DINO box outputs as mask prompts for the SAM 2 video predictor.
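
As a rough illustration of the mask-prompt path, the flow looks like the following sketch. It is simplified (the actual logic lives in grounded_sam2_tracking_demo.py) and assumes the video frames are available as a folder of JPEG files (placeholder path below) and that masks holds the per-object boolean masks produced by the SAM 2 image predictor from the Grounding DINO boxes:

import torch
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")
# placeholder path: a folder of JPEG frames extracted from the video (00000.jpg, 00001.jpg, ...)
inference_state = video_predictor.init_state(video_path="./path/to/jpeg_frames")

# `masks` holds the per-object boolean masks from the SAM 2 image predictor step on frame 0
for obj_id, mask in enumerate(masks, start=1):
    video_predictor.add_new_mask(
        inference_state=inference_state,
        frame_idx=0,                  # we only prompt the first frame
        obj_id=obj_id,
        mask=torch.from_numpy(mask),
    )

# propagate the prompts through the rest of the video
video_segments = {}
for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
        obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
        for i, obj_id in enumerate(out_obj_ids)
    }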

Grounded SAM 2 Tracking Pipeline

Grounded SAM 2 Video Object Tracking Demo (with Grounding DINO 1.5 & 1.6)

We also support a video object tracking demo based on our stronger Grounding DINO 1.5 model and SAM 2. You can try the following demo after applying for the API keys for running Grounding DINO 1.5:

python grounded_sam2_tracking_demo_with_gd1.5.py

Grounded SAM 2 Video Object Tracking Demo with Custom Video Input (with Grounding DINO)

Users can upload their own video file (e.g. assets/hippopotamus.mp4) and specify custom text prompts for grounding and tracking with Grounding DINO and SAM 2 by using the following script:

python grounded_sam2_tracking_demo_custom_video_input_gd1.0_hf_model.py

Grounded SAM 2 Video Object Tracking Demo with Custom Video Input (with Grounding DINO 1.5 & 1.6)

Users can upload their own video file (e.g. assets/hippopotamus.mp4) and specify custom text prompts for grounding and tracking with Grounding DINO 1.5 and SAM 2 by using the following script:

python grounded_sam2_tracking_demo_custom_video_input_gd1.5.py

You can specify the following parameters in this file (a generic sketch of how a custom video is split into frames follows the parameter block):

VIDEO_PATH = "./assets/hippopotamus.mp4"
TEXT_PROMPT = "hippopotamus."
OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
API_TOKEN_FOR_GD1_5 = "Your API token" # api token for G-DINO 1.5
PROMPT_TYPE_FOR_VIDEO = "mask" # using SAM 2 mask prediction as prompt for video predictor
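
Under the hood, SAM 2's video predictor works on a directory of JPEG frames, so a custom .mp4 is first split into frames. Conceptually it comes down to something like the following generic sketch (not the repo's exact code; the output folder name is a placeholder):

import os
import cv2

VIDEO_PATH = "./assets/hippopotamus.mp4"
SOURCE_FRAME_DIR = "./custom_video_frames"   # placeholder output folder
os.makedirs(SOURCE_FRAME_DIR, exist_ok=True)

cap = cv2.VideoCapture(VIDEO_PATH)
frame_idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # the video predictor expects frames named by index, e.g. 00000.jpg, 00001.jpg, ...
    cv2.imwrite(os.path.join(SOURCE_FRAME_DIR, f"{frame_idx:05d}.jpg"), frame)
    frame_idx += 1
cap.release()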

After running our demo code, you can get the tracking results as follows:

15eb13c713bd51469ce4b7952b321ad1.mp4

The tracking visualization results will be saved automatically to OUTPUT_VIDEO_PATH.

Warning

We initialize the box prompts on the first frame of the input video. If you want to start from a different frame, you can modify ann_frame_idx in our code.

Grounded-SAM-2 Video Object Tracking with Continuous ID (with Grounding DINO)

In the above demos, we only prompt Grounded SAM 2 on a specific frame, which makes it hard to pick up new objects that appear later in the video. In this demo, we try to find new objects and assign them new IDs across the whole video. This feature is still under development and is not yet fully stable.

Users can upload their own video files and specify custom text prompts for grounding and tracking using the Grounding DINO and SAM 2 frameworks. To do this, execute the script:

python grounded_sam2_tracking_demo_with_continuous_id.py

You can customize various parameters including:

  • text: The grounding text prompt.
  • video_dir: Directory containing the video files.
  • output_dir: Directory to save the processed output.
  • output_video_path: Path for the output video.
  • step: Frame stepping for processing.
  • box_threshold: Box threshold for the Grounding DINO model.
  • text_threshold: Text threshold for the Grounding DINO model.

Note: this method supports only the mask prompt type; a rough sketch of the stepped grounding-and-tracking loop is shown below.
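
Conceptually, the continuous-ID demo re-runs Grounding DINO every step frames, assigns fresh IDs to objects that were not seen before, and then lets SAM 2 propagate for at most step frames before repeating. The following is a heavily simplified sketch of that loop: the real logic is in grounded_sam2_tracking_demo_with_continuous_id.py, frame_names/text/tracked_objects/video_predictor/inference_state are assumed to be set up as in the earlier sketches, and ground_frame, segment_boxes, assign_ids, and save_frame_results are hypothetical helpers standing in for the per-frame grounding, segmentation, ID matching, and saving code:

# simplified sketch of the stepped loop (not the repo's exact code)
for start_frame_idx in range(0, len(frame_names), step):
    # 1) ground the current frame with Grounding DINO and segment it with the SAM 2 image predictor
    boxes, labels = ground_frame(frame_names[start_frame_idx], text)      # hypothetical helper
    masks = segment_boxes(frame_names[start_frame_idx], boxes)            # hypothetical helper

    # 2) match the new masks against already-tracked objects; unmatched masks get fresh object IDs
    object_ids = assign_ids(masks, tracked_objects)                       # hypothetical helper

    # 3) add the mask prompts and propagate for at most `step` frames
    for obj_id, mask in zip(object_ids, masks):
        video_predictor.add_new_mask(inference_state, start_frame_idx, obj_id, mask)

    for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(
        inference_state, max_frame_num_to_track=step, start_frame_idx=start_frame_idx
    ):
        save_frame_results(out_frame_idx, out_obj_ids, out_mask_logits)   # hypothetical helper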

After running our demo code, you can get the tracking results as follows:

track_new_object_demo.mp4

If you want to try Grounding DINO 1.5 model, you can run the following scripts after setting your API token:

python grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py

Grounded-SAM-2 Video Object Tracking with Continuous ID plus Reverse Tracking (with Grounding DINO)

With reverse tracking added, this method can cover the whole lifetime of each object:

python grounded_sam2_tracking_demo_with_continuous_id_plus.py

Grounded SAM 2 Florence-2 Demos

Grounded SAM 2 Florence-2 Image Demo

In this section, we will explore how to integrate the feature-rich and robust open-source models Florence-2 and SAM 2 to develop practical applications.

Florence-2 is a powerful vision foundation model from Microsoft that supports a series of vision tasks via special task prompts (task_prompt), including but not limited to the tasks below (a minimal single-task call is sketched after the table):

Task | Task Prompt | Text Input | Task Introduction
Object Detection | <OD> | ✗ | Detect main objects with a single category name
Dense Region Caption | <DENSE_REGION_CAPTION> | ✗ | Detect main objects with a short description
Region Proposal | <REGION_PROPOSAL> | ✗ | Generate proposals without category names
Phrase Grounding | <CAPTION_TO_PHRASE_GROUNDING> | ✓ | Ground the main objects in the image mentioned in the caption
Referring Expression Segmentation | <REFERRING_EXPRESSION_SEGMENTATION> | ✓ | Ground the object most related to the text input
Open Vocabulary Detection and Segmentation | <OPEN_VOCABULARY_DETECTION> | ✓ | Ground any object with text input
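
For reference, a single Florence-2 task call via Hugging Face looks roughly like the following sketch (it assumes the microsoft/Florence-2-large checkpoint; the repo's run_florence2 helper presumably wraps similar steps):

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

image = Image.open("./notebooks/images/cars.jpg").convert("RGB")
task_prompt = "<OD>"  # plain object detection needs no extra text input
inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device, dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# for <OD>, the parsed result contains "bboxes" and "labels", which can be passed to SAM 2 as box prompts
parsed = processor.post_process_generation(generated_text, task=task_prompt, image_size=image.size)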

By integrating Florence-2 with SAM 2, we can build a strong vision pipeline to solve complex vision tasks. You can try the following scripts to run the demos:

Note

🚨 If you encounter network issues while using the HuggingFace model, you can resolve them by setting the appropriate mirror source as export HF_ENDPOINT=https://hf-mirror.com

Object Detection and Segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline object_detection_segmentation \
    --image_path ./notebooks/images/cars.jpg

Dense Region Caption and Segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline dense_region_caption_segmentation \
    --image_path ./notebooks/images/cars.jpg

Region Proposal and Segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline region_proposal_segmentation \
    --image_path ./notebooks/images/cars.jpg

Phrase Grounding and Segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline phrase_grounding_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "The image shows two vintage Chevrolet cars parked side by side, with one being a red convertible and the other a pink sedan, \
            set against the backdrop of an urban area with a multi-story building and trees. \
            The cars have Cuban license plates, indicating a location likely in Cuba."

Referring Expression Segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline referring_expression_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "The left red car."

Open-Vocabulary Detection and Segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline open_vocabulary_detection_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "car <and> building"
  • Note that if you want to detect multiple objects, you should separate them with <and> in your input text.

Grounded SAM 2 Florence-2 Image Auto-Labeling Demo

Florence-2 can be used as an automatic image annotator by cascading its captioning capability with its grounding capability.

Task | Task Prompt | Text Input
Caption + Phrase Grounding | <CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | ✗
Detailed Caption + Phrase Grounding | <DETAILED_CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | ✗
More Detailed Caption + Phrase Grounding | <MORE_DETAILED_CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | ✗

You can try the following script to run these demos:

Caption to Phrase Grounding

python grounded_sam2_florence2_autolabel_pipeline.py \
    --image_path ./notebooks/images/groceries.jpg \
    --pipeline caption_to_phrase_grounding \
    --caption_type caption
  • You can specify caption_type to control the granularity of the caption; for a more detailed caption, try --caption_type detailed_caption or --caption_type more_detailed_caption. (A rough sketch of the cascade follows.)
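
The cascade itself is just two Florence-2 calls chained together: the caption from the first call becomes the text input of the phrase-grounding call, and the resulting boxes are handed to SAM 2 as box prompts. Below is a minimal sketch, reusing the Florence-2 model/processor, numpy, image, and the SAM 2 sam2_predictor set up in the earlier snippets, with run_florence2 standing in for the repo's helper of the same name (its exact return format is assumed here):

# Step 1: caption the image (swap in "<DETAILED_CAPTION>" or "<MORE_DETAILED_CAPTION>" for finer captions)
caption_task = "<CAPTION>"
caption = run_florence2(caption_task, None, model, processor, image)[caption_task]

# Step 2: ground the phrases of the generated caption back onto the image
grounding_task = "<CAPTION_TO_PHRASE_GROUNDING>"
grounding = run_florence2(grounding_task, caption, model, processor, image)[grounding_task]
boxes, labels = grounding["bboxes"], grounding["labels"]

# Step 3: use the grounded boxes as box prompts for the SAM 2 image predictor
sam2_predictor.set_image(np.array(image))
masks, scores, _ = sam2_predictor.predict(
    point_coords=None, point_labels=None, box=np.array(boxes), multimask_output=False,
)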

Citation

If you find this project helpful for your research, please consider citing the following BibTeX entry.

@misc{ravi2024sam2segmentimages,
      title={SAM 2: Segment Anything in Images and Videos}, 
      author={Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman Rädle and Chloe Rolland and Laura Gustafson and Eric Mintun and Junting Pan and Kalyan Vasudev Alwala and Nicolas Carion and Chao-Yuan Wu and Ross Girshick and Piotr Dollár and Christoph Feichtenhofer},
      year={2024},
      eprint={2408.00714},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00714}, 
}

@article{liu2023grounding,
  title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
  author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
  journal={arXiv preprint arXiv:2303.05499},
  year={2023}
}

@misc{ren2024grounding,
      title={Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection}, 
      author={Tianhe Ren and Qing Jiang and Shilong Liu and Zhaoyang Zeng and Wenlong Liu and Han Gao and Hongjie Huang and Zhengyu Ma and Xiaoke Jiang and Yihao Chen and Yuda Xiong and Hao Zhang and Feng Li and Peijun Tang and Kent Yu and Lei Zhang},
      year={2024},
      eprint={2405.10300},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{ren2024grounded,
      title={Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks}, 
      author={Tianhe Ren and Shilong Liu and Ailing Zeng and Jing Lin and Kunchang Li and He Cao and Jiayu Chen and Xinyu Huang and Yukang Chen and Feng Yan and Zhaoyang Zeng and Hao Zhang and Feng Li and Jie Yang and Hongyang Li and Qing Jiang and Lei Zhang},
      year={2024},
      eprint={2401.14159},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{kirillov2023segany,
  title={Segment Anything}, 
  author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv:2304.02643},
  year={2023}
}

@misc{jiang2024trex2,
      title={T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy}, 
      author={Qing Jiang and Feng Li and Zhaoyang Zeng and Tianhe Ren and Shilong Liu and Lei Zhang},
      year={2024},
      eprint={2403.14610},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

grounded-sam-2's People

Contributors

rentainhe, shuoshende, morgantitcher, eltociear, wfram


grounded-sam-2's Issues

Real-Time Inference

I noticed that in all demo scripts the video is first processed as a whole and then propagated. Would it be possible to run on a video in real time?

Thank you in advance!

Making SAM 2 run 2x faster

I was pretty amazed with SAM 2 when it came out given all the work I do with video. My company works a ton with it and we decided to take a crack at optimizing it, and we made it run 2x faster than the original pipeline!

Unlike LLMs, video models are notorious for incredibly inefficient file reading, storage, and writing which makes them much slower than they need to be.

We wrote a bit about our work here and thought we'd share with the community:
https://www.sievedata.com/blog/meta-segment-anything-2-sam2-introduction

CUDA out of memory

Hi,
I am trying to run python grounded_sam2_tracking_demo.py
I am using an NVIDIA RTX 4090, but it is giving me a CUDA out-of-memory error. Which parameters can be reduced to solve the memory issue? Can you please let me know?

TIA

Tracking ID varied

Hi, thanks for the great work! I've tried out the example of continuous tracking with DINO. However, it seems that the tracking IDs vary a lot compared to SAM 2 itself. Could you let me know where I could improve this?

error when testing a picture that has a low resolution?

I used a 294x78 PNG to test grounded_sam2_hf_model_demo.py, but I get the following errors. Any solution?

  File "D:\project\Grounded-SAM-2\sam2\sam2_image_predictor.py", line 417, in _predict
    low_res_masks, iou_predictions, _, _ = self.model.sam_mask_decoder(
  File "D:\program\anaconda3\envs\llama_adapter\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\program\anaconda3\envs\llama_adapter\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\project\Grounded-SAM-2\sam2\modeling\sam\mask_decoder.py", line 136, in forward
    masks, iou_pred, mask_tokens_out, object_score_logits = self.predict_masks(
  File "D:\project\Grounded-SAM-2\sam2\modeling\sam\mask_decoder.py", line 203, in predict_masks
    assert image_embeddings.shape[0] == tokens.shape[0]
AssertionError

Assertion Error

I have modified the grounded_sam2_local_demo.py file to predict from a video file. I found that grounding_dino/groundingdino/util/inference.py has the following function:

def load_image(image_path: str) -> Tuple[np.array, torch.Tensor]:
    transform = T.Compose(
        [
            T.RandomResize([800], max_size=1333),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image_source = Image.open(image_path).convert("RGB")
    image = np.asarray(image_source)
    image_transformed, _ = transform(image_source, None)
    return image, image_transformed

Here, the height of the image is resized to 800 with a maximum size of 1333. However, if I change the height to 400 and the max_size to 600 (which maintains the aspect ratio), I get an error like this:

I reduced the image size to get a higher FPS. How can I solve this issue? Moreover is there any other way to increase the FPS?

TIA

grounded_sam2_tracking_demo_with_continuous_id

Hi, I am reading your code in grounded_sam2_tracking_demo_with_continuous_id.py, in Line 157 as following:
for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state, max_frame_num_to_track=step, start_frame_idx=start_frame_idx):
I am wondering why the max_frame_num_to_track is set to step and the start_frame_idx is set to start_frame_idx?
Could you please give me more information? Thanks!

ImportError: flash_attn not found when running grounded_sam2_florence2_image_demo.py

Hello,

Thank you for your excellent work. When I try to run the following command:
python grounded_sam2_florence2_image_demo.py --pipeline object_detection_segmentation --image_path ./notebooks/images/cars.jpg
it gives me an error:
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run pip install flash_attn

My environment is as follows:
CUDA=12.1
torch=2.3.1
transformers=4.33.2

Could you provide guidance on how to resolve this error?


Why is there an assert?

assert text_input is None, "Text input should not be none when calling object detection pipeline."
Why should text_input be None? Can I change text_input to a different prompt word? For example, if I only want to detect wheels, can I set text_input='wheel'? I've tried this, but it didn't work.

"""
Pipeline-1: Object Detection + Segmentation
"""
def object_detection_and_segmentation(
    florence2_model,
    florence2_processor,
    sam2_predictor,
    image_path,
    task_prompt="<OD>",
    text_input=None,
    output_dir=OUTPUT_DIR
):
    assert text_input is None, "Text input should not be none when calling object detection pipeline."
    # run florence-2 object detection in demo
    image = Image.open(image_path).convert("RGB")
    results = run_florence2(task_prompt, text_input, florence2_model, florence2_processor, image)

While running grounded_sam2_local_demo.py, I am facing the following error:

`UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3609.)
final text_encoder_type: bert-base-uncased

UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)

UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.

UserWarning: None of the inputs have requires_grad=True. Gradients will be None
(sam2) opencvuniv@opencvuniv:~/.../Grounded-SAM2/Grounded-SAM-2$ python grounded_sam2_local_demo.py --user_reentrant=False
UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3609.)
final text_encoder_type: bert-base-uncased

UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)

UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.

UserWarning: None of the inputs have requires_grad=True. Gradients will be None
`

OOM occurs when adding certain box prompts.

Dear IDEA-Research Team,

I am working on continuity detection in indoor scenes and have been using a video predictor for this task. Specifically, I first tagged each image using RAM and selected the appropriate bounding boxes with GroundingDINO. These bounding boxes were then used as prompts for SAM2.

However, I encounter an out-of-memory (OOM) error when too many box prompts are entered. Is there something wrong with the way I am using this, and is there a better way to accomplish this continuity detection?

Looking forward to your reply!

Best regards,
Island

GroundingDINO Import Error

Hi Grounded-SAM-2 team,

Huge thanks for this fantastic work!

I believe there are some bad imports with regard to Grounding DINO -- e.g., in groundingdino/utils/util.py, imports use grounding_dino.groundingdino instead of groundingdino as in the original Grounded SAM repo (see here).

Is there a reason for this? This causes imports to fail unless the current working directory is in the root Grounded-SAM-2 directory.

Thanks!!

Support image prompt

Use case: replace logo or text in the video. Input: old logo, video (with old logo), new logo; output: video with new logo

Perhaps directly using mask prompt or box prompt in SAM 2 (instead of Uniform Point Sampling)?

Thanks for the great work! Chiming in to provide more details regarding SAM 2 prompting:

Regarding mask prompts in SAM 2:

  • SAM 2 also supports using masks as a prompt -- We have an interface add_new_mask for it in the video predictor class SAM2VideoPredictor

Regarding box prompts in SAM 2:

Maybe we could directly use the box prompt or mask prompt instead of Uniform Point Sampling in this case?

prompt type for video issue

Hi, thanks for the great work!

I am trying to obtain point tracking results instead of segmentation masks. Even though I set PROMPT_TYPE_FOR_VIDEO to "point", the output video and tracking results are still segmentation masks. Is there any way to obtain points instead of masks?

Thank you in advance!

Predict from a video file

Hi,
When I ran python grounded_sam2_local_demo.py,
the result was good with the prompt text="car. road."
grounded_sam2_annotated_image_with_mask

But when I modified the code to read frames from a video file in a loop:

import cv2
import torch
import numpy as np
import supervision as sv
from torchvision.ops import box_convert
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from grounding_dino.groundingdino.util.inference import load_model, load_image, predict
import time
import os

# Environment settings
# Use bfloat16 only where supported

# Build SAM2 image predictor
sam2_checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
sam2_model = build_sam2(model_cfg, sam2_checkpoint, device="cuda")
sam2_predictor = SAM2ImagePredictor(sam2_model)

# Build Grounding DINO model
model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
grounding_model = load_model(
    model_config_path="grounding_dino/groundingdino/config/GroundingDINO_SwinT_OGC.py", 
    model_checkpoint_path="gdino_checkpoints/groundingdino_swint_ogc.pth",
    device=device
)

# Setup the input text prompt for Grounding DINO
text = "road. car."
output_dir = "test"
os.makedirs(output_dir, exist_ok=True)

# Capture video
video_path = 'notebooks/videos/indy.mp4'
cap = cv2.VideoCapture(video_path)
frame_num = 0

while cap.isOpened():
    start_time = time.time()
    ret, frame = cap.read()
    if not ret:
        break

    #time.sleep(0.1)

    # Convert the frame to the required format for processing
    image_source, image = load_image(frame)
   

    sam2_predictor.set_image(image_source)

    boxes, confidences, labels = predict(
        model=grounding_model,
        image=image,
        caption=text,
        box_threshold=0.35,
        text_threshold=0.25
    )

    # Process the box prompt for SAM2
    h, w, _ = frame.shape
    boxes = boxes * torch.Tensor([w, h, w, h])
    input_boxes = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()

    # Enable mixed precision only for the specific block
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        if torch.cuda.get_device_properties(0).major >= 8:
            # Enable tfloat32 for Ampere GPUs
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True

        # Perform SAM2 prediction within the mixed precision context
        masks, scores, logits = sam2_predictor.predict(
            point_coords=None,
            point_labels=None,
            box=input_boxes,
            multimask_output=False,
        )

    # Post-process the output of the model to get the masks, scores, and logits for visualization
    if masks.ndim == 4:
        masks = masks.squeeze(1)

    confidences = confidences.numpy().tolist()
    class_names = labels
    class_ids = np.array(list(range(len(class_names))))

    labels = [
        f"{class_name} {confidence:.2f}"
        for class_name, confidence
        in zip(class_names, confidences)
    ]

    # Calculate FPS
    end_time = time.time()
    fps = 1 / (end_time - start_time)
    
    # Visualize image with supervision API
    detections = sv.Detections(
        xyxy=input_boxes,  # (n, 4)
        mask=masks.astype(bool),  # (n, h, w)
        class_id=class_ids
    )

    box_annotator = sv.BoxAnnotator()
    annotated_frame = box_annotator.annotate(scene=frame.copy(), detections=detections)

    label_annotator = sv.LabelAnnotator()
    annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)

    mask_annotator = sv.MaskAnnotator()
    annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
    mask_image_save_path = os.path.join(output_dir, f"{frame_num:04d}_mask.jpg")

    cv2.imwrite(mask_image_save_path, annotated_frame)
    print(f"FPS for frame {frame_num}: {fps:.2f}")

    frame_num += 1

cap.release()
cv2.destroyAllWindows()

the result has become very bad

0002_mask

What is the reason? Can you please help??

TIA

How to divide the road lane

I tried some prompts, such as 'lane divider', 'lane_line', and 'road_markings', but they all include the ground instead of just the lane lines. How can I get a mask that contains only the lane lines?

How to track new objects?

The current code only supports adding new objects at tracking initialization; there is no way to track newly appearing objects while preserving the original tracking state. Could you provide some ideas for modifying the logic? Thanks a lot!

Florence-2 vs GroundingDino + SAM2

Hello, thanks for the awesome collection of demos and code.
I wonder if you have benchmarks or comparisons of the text-grounded segmentation capabilities of Grounding DINO vs Florence-2? While testing both with SAM 2, my qualitative perception is that Florence-2 is more precise, matching more tokens with boundaries, and it is also able to detect a more diverse set of objects using its base model without fine-tuning.
At the same time, I wasn't able to extract confidence levels from the specific bboxes generated by Florence-2.

About SAM2 Prompt

Dear IDEA-Research Team,

It is not strictly correct to say that SAM2 only supports points as prompts. Based on my review of the SAM2 source code yesterday, it also supports masks as prompts. To verify this, I created Grounded-SAM2-Tracking (https://github.com/ShuoShenDe/Grounded-Sam2-Tracking.git) as an example, which validated the feasibility of using grounded-sam plus SAM2 for video tracking. Additionally, I have implemented continuous segmentation and tracking using Grounded-Segment-Anything and SAM2, ensuring the stability of object IDs.

As the code for this project is still being modified, do not hesitate to contact me if you have any concerns or feedback.

Best regards,
Shuo Shen
