Set-of-Mark Prompting for GPT-4V - Visual Prompting for Vision!

๐Ÿ‡ [Read our arXiv Paper] ย  ๐ŸŽ [Project Page]

Jianwei Yang*⚑, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao

* Core Contributors     ⚑ Project Lead

We present Set-of-Mark (SoM) prompting, which simply overlays a number of spatial and speakable marks on an image to unleash the visual grounding abilities of the strongest LMM, GPT-4V.

method2_xyz
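
The core idea above, numbering image regions and deciding where each number is drawn, can be illustrated with a minimal sketch. It assumes the regions arrive as binary masks (nested lists of 0/1) and places each mark at the region pixel nearest the centroid. `mark_position` and `assign_marks` are hypothetical names used only for illustration, and the SoM toolbox may choose mark locations differently.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask: 1 inside the region, 0 outside


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) of the region pixel closest to the region's centroid."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("empty mask")
    centroid_r = sum(r for r, _ in pixels) / len(pixels)
    centroid_c = sum(c for _, c in pixels) / len(pixels)
    # Snap to a pixel that belongs to the region, so the mark never lands
    # outside a concave (for example L-shaped) region.
    return min(pixels, key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Number the regions 1..N and pick where each number would be drawn."""
    return {index: mark_position(mask) for index, mask in enumerate(masks, start=1)}
```

Snapping to an in-region pixel is a design choice of this sketch: a plain centroid can fall outside a concave region, which would make the mark ambiguous.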

🔥 News

  • [10/23] We released the SoM toolbox code for generating set-of-mark prompts for GPT-4V. Try it out!

๐Ÿ—’๏ธ Todos

  • Release vision benchmarks used in our paper

🔗 Related links

Fascinating applications of SoM and LMMs:

Our method draws on the following models to generate the set of marks:

  • Mask DINO: State-of-the-art closed-set image segmentation model
  • OpenSeeD: State-of-the-art open-vocabulary image segmentation model
  • GroundingDINO: State-of-the-art open-vocabulary object detection model
  • SEEM: Versatile, promptable, interactive and semantic-aware segmentation model
  • Semantic-SAM: Segment and recognize anything at any granularity
  • Segment Anything: Segment anything

We are standing on the shoulders of the giant GPT-4V (playground)!

🚀 Quick Start

  • Install segmentation packages

    # install SEEM
    pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
    # install SAM
    pip install git+https://github.com/facebookresearch/segment-anything.git
    # install Semantic-SAM
    pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
    # install Deformable Convolution for Semantic-SAM
    cd ops && sh make.sh && cd ..

    # common error fix:
    python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
    pip install mpi4py

  • Download the pretrained models

    sh download_ckpt.sh

  • Run the demo

    python demo_som.py

And you will see this interface:

som_toolbox

Potential solutions for some common issues:

👉 Comparing standard GPT-4V and its combination with SoM Prompting

teaser_github

๐Ÿ“ SoM Toolbox for image partition

method3_xyz

Users can select which granularity of masks to generate, and which mode to use between automatic (top) and interactive (bottom). A higher alpha blending value (0.4) is used for better visualization.
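
As a rough illustration of what the alpha blending value controls, the sketch below blends a mask color over a single image pixel. It assumes the common convention `out = (1 - alpha) * image + alpha * mask`, under which a higher alpha makes the mask more visible. `blend_pixel` is a hypothetical helper written for this example, not a function from the toolbox.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def blend_pixel(image_px: RGB, mask_px: RGB, alpha: float = 0.4) -> RGB:
    """Blend a mask color over an image pixel with weight alpha on the mask."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Per-channel linear interpolation, rounded back to integer intensities.
    return tuple(round((1 - alpha) * i + alpha * m) for i, m in zip(image_px, mask_px))


# An orange pixel under a blue mask at alpha = 0.4
print(blend_pixel((200, 100, 0), (0, 0, 255)))  # (120, 60, 102)
```

With `alpha=0.0` the image pixel is returned unchanged, and with `alpha=1.0` the mask color fully replaces it.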

🦄 Interleaved Prompt

SoM enables interleaved prompts which include textual and visual content. The visual content can be represented using the region indices.

Screenshot 2023-10-18 at 10 06 18
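
One way to picture an interleaved prompt is as a sequence of text fragments and region indices that is flattened into a single string. The sketch below assumes a bracketed notation such as `[3]` for "the region marked 3"; both that notation and the `render_prompt` helper are hypothetical and chosen only for illustration.

```python
from typing import List, Union

Segment = Union[str, int]  # str = plain text, int = index of a marked region


def render_prompt(segments: List[Segment]) -> str:
    """Join text fragments and region references into one prompt string."""
    parts = []
    for segment in segments:
        if isinstance(segment, int):
            # Hypothetical notation: refer to a marked region by its index.
            parts.append(f"[{segment}]")
        else:
            parts.append(segment.strip())
    return " ".join(parts)


prompt = render_prompt(["Compare", 3, "with", 7, "and describe their spatial relationship."])
print(prompt)  # Compare [3] with [7] and describe their spatial relationship.
```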

๐ŸŽ–๏ธ Mark types used in SoM

method4_xyz

🌋 Evaluation tasks examples

Screenshot 2023-10-18 at 10 12 18

Use case

🌷 Grounded Reasoning and Cross-Image Reference

Screenshot 2023-10-18 at 10 10 41

Compared with GPT-4V without SoM, adding marks enables GPT-4V to ground its reasoning in detailed contents of the image (left). Clear cross-image object references are observed on the right.

๐Ÿ•๏ธ Problem Solving

Screenshot 2023-10-18 at 10 18 03

Case study on solving a CAPTCHA. Without SoM, GPT-4V gives a wrong answer with the wrong number of squares; after SoM prompting, it finds the correct squares and their corresponding marks.

๐Ÿ”๏ธ Knowledge Sharing

Screenshot 2023-10-18 at 10 18 44

Case study on an image of a dish. With the original image, GPT-4V does not produce a grounded answer. With SoM prompting, GPT-4V not only names the ingredients but also matches them to the corresponding regions.

🕌 Personalized Suggestion

Screenshot 2023-10-18 at 10 19 12

SoM-prompted GPT-4V gives very precise suggestions, while the original one fails and even hallucinates foods, e.g., soft drinks.

🌼 Tool Usage Instruction

Screenshot 2023-10-18 at 10 19 39

Likewise, GPT-4V with SoM can provide thorough tool usage instructions, teaching users the function of each button on a controller. Note that this image is not fully labeled, yet GPT-4V can also provide information about the unlabeled buttons.

🌻 2D Game Planning

Screenshot 2023-10-18 at 10 20 03

GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.

🕌 Simulated Navigation

Screenshot 2023-10-18 at 10 21 24

🌳 Results

We conduct experiments on various vision tasks to verify the effectiveness of SoM. Results show that GPT-4V+SoM outperforms specialist models on most vision tasks and is comparable to MaskDINO on COCO panoptic segmentation.

main_results

โœ’๏ธ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{yang2023setofmark,
      title={Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V}, 
      author={Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
      journal={arXiv preprint arXiv:2310.11441},
      year={2023},
}
