alcoholrithm / oscar_scripts
Scripts for running inference with Oscar on Image Captioning and VQA tasks
License: MIT License
Hi, thanks for this amazing work!
I've tried to use your code to generate captions on raw images with bottom-up features, but I'm getting bad results. I moved the notebook into a Python script and ran inference on one image with the 'OursL(XE)' model you posted in your MODEL ZOO. The features and detected classes I print look fine, but the generated caption has nothing to do with the image.
This is the image I'm using:
This is the result I get after running the code:
python test.py
Config 'workspace/detectron2/configs/VG-Detection/faster_rcnn_R_101_C4_caffe.yaml' has no VERSION. Assuming it to be compatible with latest v2.
Modifications for VG in RPN (modeling/proposal_generator/rpn.py):
Use hidden dim 512 instead fo the same dim as Res4 (1024).
Modifications for VG in RoI heads (modeling/roi_heads/roi_heads.py):
1. Change the stride of conv1 and shortcut in Res5.Block1 from 2 to 1.
2. Modifying all conv2 with (padding: 1 --> 2) and (dilation: 1 --> 2).
For more details, please check 'https://github.com/peteanderson80/bottom-up-attention/blob/master/models/vg/ResNet-101/faster_rcnn_end2end_final/test.prototxt'.
/home/jhurtado/anaconda3/envs/Alcoholrithm/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
[{'class': 'dog'}, {'class': 'frisbee'}, {'class': 'leg'}, {'class': 'leg'}, {'class': 'head'}, {'class': 'shadow'}, {'class': 'eye'}, {'class': 'eyes'}, {'class': 'beach'}, {'class': 'snow'}, {'class': 'dog'}, {'class': 'ear'}, {'class': 'eye'}, {'class': 'paw'}, {'class': 'legs'}, {'class': 'water'}, {'class': 'nose'}, {'class': 'ground'}, {'class': 'ground'}, {'class': 'water'}, {'class': 'shadow'}, {'class': 'ground'}, {'class': 'snow'}, {'class': 'ground'}, {'class': 'snow'}, {'class': 'head'}, {'class': 'mouth'}, {'class': 'sand'}]
[[array([[0.36432788, 0.23231444, 0. , ..., 0.99799365, 0.31319785,
0.9006105 ],
[0. , 0.01907516, 0. , ..., 0.337672 , 0.15087502,
0.07700826],
[0. , 0. , 0. , ..., 0.9307157 , 0.08133312,
0.28387186],
...,
[0.02281303, 0.04437782, 0.0105911 , ..., 0.37423557, 0.34131297,
0.36085242],
[0. , 0.00111979, 0. , ..., 0.31540072, 0.06328667,
0.08102156],
[0. , 0.93097776, 0.01253132, ..., 0.99827766, 0.82969713,
0.52183276]], dtype=float32), array([117, 123, 808, 808, 191, 683, 467, 546, 62, 176, 117, 274, 467,
786, 829, 183, 391, 465, 465, 183, 683, 465, 176, 465, 176, 191,
452, 326])]]
/home/jhurtado/image_captioning/Pruebas/MaravillasDeSergio/Oscar_Scripts/Oscar/oscar/modeling/modeling_utils.py:506: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
beam_id = idx // vocab_size
['a bunch of tools sitting on top of a pile of pens and pencils.']
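The floordiv UserWarning in the output above comes from the beam-search indexing line it quotes (`beam_id = idx // vocab_size`). A minimal sketch of the deprecation-safe replacement the warning itself suggests, using hypothetical stand-in values for the beam-search tensors:

```python
import torch

# Hypothetical flat top-k indices over (num_beams * vocab_size);
# the real tensors come from the beam-search step in modeling_utils.py.
vocab_size = 30522
idx = torch.tensor([45, 61050, 92000])

# Deprecated on tensors: idx // vocab_size. The explicit form below
# keeps the intended behavior (indices are non-negative, so floor
# and trunc agree here):
beam_id = torch.div(idx, vocab_size, rounding_mode='floor')
token_id = idx % vocab_size

print(beam_id.tolist())   # which beam each candidate came from: [0, 2, 3]
print(token_id.tolist())  # token index within the vocabulary: [45, 6, 434]
```

This only silences the warning; it should not change the generated captions, since the indices involved are non-negative.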
I'm thinking there might be a mismatch between the labels and the features, but I don't know how to fix it.
Do you think you can help me with this?
Thanks a lot in advance.
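One place a label/feature mismatch like this often hides is the region-feature preprocessing: Oscar-style captioning models typically consume the 2048-d region features concatenated with a few normalized box coordinates, and the feature rows must stay in the same order as the label list. A hedged sketch of that preprocessing, using random stand-in data (the exact layout in this repo's script may differ; `build_region_features` is a hypothetical helper):

```python
import numpy as np

def build_region_features(feats, boxes, im_h, im_w):
    """Concatenate 2048-d region features with normalized box geometry
    (x1, y1, x2, y2, w, h) -> 2054-d rows, VinVL-style. The exact
    layout is an assumption; check it against the repo's own code."""
    boxes = boxes.astype(np.float32)
    x1, y1, x2, y2 = boxes.T
    geom = np.stack([
        x1 / im_w, y1 / im_h, x2 / im_w, y2 / im_h,
        (x2 - x1) / im_w, (y2 - y1) / im_h,
    ], axis=1)
    return np.concatenate([feats, geom], axis=1)

# Stand-in detector output: one feature row and one box per region.
feats = np.random.rand(28, 2048).astype(np.float32)
boxes = np.tile(np.array([10, 20, 110, 220], dtype=np.float32), (28, 1))
labels = ['dog', 'frisbee'] + ['leg'] * 26  # parallel list of class names

out = build_region_features(feats, boxes, im_h=480, im_w=640)
assert out.shape[0] == len(labels), "feature/label count mismatch"
print(out.shape)  # (28, 2054)
```

If the count assertion fails, or the rows and labels are sorted differently by the two code paths, the caption model sees features attached to the wrong class names, which would produce captions unrelated to the image.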
Hi,
I was trying to get the image captioning code running from your repository. Could you point me to the location of the 'checkpoint-59-554820' model that you use?
Hi again!
I have been using your code and models with VinVL detections and getting good results so far, but when I try to use a checkpoint from the original Oscar repository I get terrible captions.
I'm trying the model trained on COCO for image captioning, finetuned with cross entropy, that they have available in their VinVLModelZoo.md.
This is the caption I got from your model finetuned with CIDEr optimization:
['a small brown bear standing on top of a sandy beach.']
This is the caption I get with the model from Oscar's repo:
['a large number of lights that are on a building.']
I also manually checked which detections VinVL was finding on this same image using their original repository, and got really good results:
I should also mention that I made a few changes to the code to run it on CPU instead of CUDA, since the model can't fit in the GPU I have available.
I want to check what results I can get with the best-performing model from their repository (41.0 BLEU-4 score), but I think I'm missing something.
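For reference, the usual CPU-fallback pattern in PyTorch is to pass `map_location` when loading the checkpoint, so CUDA-saved tensors land on CPU. A minimal self-contained sketch, using a tiny stand-in model and a temporary file in place of the real captioning checkpoint:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in model; the real captioning model is much larger.
model = nn.Linear(4, 2)

# Simulate a saved checkpoint (in practice this is the downloaded file).
ckpt_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pth')
torch.save(model.state_dict(), ckpt_path)

# map_location moves any CUDA tensors in the file onto CPU at load time.
device = torch.device('cpu')
state_dict = torch.load(ckpt_path, map_location=device)
model.load_state_dict(state_dict)
model.to(device).eval()

x = torch.randn(1, 4)
with torch.no_grad():
    y = model(x)
print(y.shape)  # torch.Size([1, 2])
```

As long as every `.cuda()` / `.to('cuda')` call in the script is also redirected to `device`, this change should not affect the captions themselves, only the runtime.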
Do you think you can help me with this?
Thanks for the help once again.