taskswithcode / grit

This project is forked from jialianw/grit.

GRiT: A Generative Region-to-text Transformer for Object Understanding (https://arxiv.org/abs/2212.00280)

License: MIT License
Object detection and object captioning - examples created using the published model

Table of contents

- Dense captioning
  - Dense captioning of images - use case 1
- Object Detection - working case
  - Object Detection - use case 2
- Object Detection - behaviour of mapping objects to other known classes
  - Object Detection - incorrectly mapping objects to other known classes (due to the absence of those classes in the training set, not an inherent deficiency of the model)

Notes on using GRiT+Detic with ChatGPT
Captioning of detected objects with GRiT captures aspects of object features beyond just the object class name. This is a consequence of the captioning component receiving object features in the form of image patches. For instance, in some pictures the captioning describes the detected "sky" as clear versus dark, etc.
This is perhaps a distinguishing factor of GRiT's approach compared to a model like Detic, which outputs only object classes. Detic's object detection may appear better than GRiT's on the images tested, but only because of differences in the training sets. If we combine the two models' outputs and pass them to ChatGPT, we get a rich overall scene summary. Images were selected from Pexels, a royalty-free site.
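The combination described above can be sketched as a small prompt-composition step. This is a hypothetical illustration, not code from either project: the function name, the tuple shapes, and the example regions are all made up for demonstration.

```python
# Hypothetical sketch: merge GRiT dense captions and Detic detections
# into a single text prompt for ChatGPT. Field shapes are illustrative.

def build_scene_prompt(grit_regions, detic_objects):
    """grit_regions: list of (caption, box); detic_objects: list of (class_name, box)."""
    lines = ["Dense captions (GRiT):"]
    for caption, box in grit_regions:
        lines.append(f"- {caption} at {box}")
    lines.append("Detected objects (Detic):")
    for name, box in detic_objects:
        lines.append(f"- {name} at {box}")
    lines.append("Describe the overall scene in one paragraph.")
    return "\n".join(lines)

prompt = build_scene_prompt(
    [("a clear blue sky", (0, 0, 640, 200))],
    [("tree", (100, 180, 300, 480))],
)
```

The resulting text can then be pasted into the ChatGPT playground as a scene description request.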

Related Links
- Twitter thread showing the performance of GRiT+Detic with ChatGPT. ChatGPT's output quality on Winograd schemas improves when it is given bounding-box information along with GRiT's dense-captioning output.
- Twitter thread comparing this model to Detic with ChatGPT. GRiT's dense-captioning output improves ChatGPT's narrative compared to Detic, which only detects classes with bounding boxes.
- A Hugging Face app 🤗 that uses Detic to output object classes with coordinates and dimensions, which can be cut-and-pasted into the ChatGPT playground for dense captioning.


A Generative Region-to-text Transformer for Object Understanding

GRiT is a general and open-set object understanding framework that localizes objects and describes them with any style of free-form texts it was trained with, e.g., class names, descriptive sentences (including object attributes, actions, counts and many more).

GRiT: A Generative Region-to-text Transformer for Object Understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
¹State University of New York at Buffalo, ²Microsoft
arXiv technical report (PDF)

Installation

Please follow Installation instructions.

Object Understanding Demo - One Model, Two Tasks

Download the GRiT model, or use the following command to download it:

mkdir models && cd models
wget https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth && cd ..

The downloaded GRiT model was jointly trained on the dense captioning and object detection tasks. The same trained model can output either rich descriptive sentences or short class names by varying the --test-task flag. Play with it as follows! 🤩🤩🤩

Output for Dense Captioning (rich descriptive sentences)

python demo.py --test-task DenseCap --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml  --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth

Output for Object Detection (short class names)

python demo.py --test-task ObjectDet --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml  --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth

Output images will be saved under the visualization folder.

Benchmark Inference and Evaluation

Please follow dataset preparation instructions to download datasets.

Download our trained models and put them to models/ for evaluation.

Object Detection on COCO 2017 Dataset

| Model | val AP | test-dev AP | Download |
| --- | --- | --- | --- |
| GRiT (ViT-B) | 53.7 | 53.8 | model |
| GRiT (ViT-L) | 56.4 | 56.6 | model |
| GRiT (ViT-H) | 60.4 | 60.4 | model |

To evaluate the trained GRiT on COCO 2017 val, run:

# GRiT (ViT-B)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet --eval-only MODEL.WEIGHTS models/grit_b_objectdet.pth
# GRiT (ViT-L)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_L_ObjectDet.yaml --output-dir-name ./output/grit_l_objectdet --eval-only MODEL.WEIGHTS models/grit_l_objectdet.pth
# GRiT (ViT-H)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_H_ObjectDet.yaml --output-dir-name ./output/grit_h_objectdet --eval-only MODEL.WEIGHTS models/grit_h_objectdet.pth

Dense Captioning on VG Dataset

| Model | mAP | Download |
| --- | --- | --- |
| GRiT (ViT-B) | 15.5 | model |

To test on VG test set, run:

python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth

It will save the inference results to output/grit_b_densecap/vg_instances_results.json. We use the official VG dense captioning evaluation codebase to report results. We didn't integrate the evaluation code into our project because it is written in Lua. To evaluate on VG, please follow the original codebase's instructions and run the evaluation there. We're happy to discuss, in our issues section, any problems you encounter when using their code.

Training

To save training memory, we use DeepSpeed, which works well with activation checkpointing in distributed training.

To train on a single machine node, run:

python train_deepspeed.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet

To train on multiple machine nodes, run:

python train_deepspeed.py --num-machines 4 --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet

Acknowledgement

Our code is in part based on Detic, CenterNet2, detectron2, GIT, and transformers. We thank the authors and appreciate their great works!

Citation

If you find our work interesting and would like to cite it, please use the following BibTeX entry.

@article{wu2022grit,
  title={GRiT: A Generative Region-to-text Transformer for Object Understanding},
  author={Wu, Jialian and Wang, Jianfeng and Yang, Zhengyuan and Gan, Zhe and Liu, Zicheng and Yuan, Junsong and Wang, Lijuan},
  journal={arXiv preprint arXiv:2212.00280},
  year={2022}
}
