
orvit's Issues

How to do inference on a new video?

When I run inference on the validation set, it works and reports the prediction accuracy. That is expected, since we have annotations for the validation set. But for the test set or a new input video, detection has to be done before classification. Could you please advise how to run inference on a new video for which detected bounding boxes are unavailable?

Thanks in advance

Object Region Attention block

After RoIAlign, the code uses a different method than the one described in the paper: the paper applies the MLP after max pooling, whereas the code applies the MLP first and then max pooling. A toy comparison is sketched below.
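A minimal sketch of the two orderings (layer sizes and shapes here are my own placeholders, not taken from the repo):

```python
import torch
import torch.nn as nn

N, d, r = 4, 768, 7                     # objects, channel dim, RoIAlign grid
roi_feats = torch.randn(N, d, r, r)     # stand-in for RoIAlign output
mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

# Paper ordering (as described): max-pool over the r x r grid, then MLP.
pooled = roi_feats.flatten(2).max(dim=-1).values      # (N, d)
paper_tokens = mlp(pooled)

# Code ordering (as implemented): MLP on each grid cell, then max-pool.
cells = roi_feats.flatten(2).transpose(1, 2)          # (N, r*r, d)
code_tokens = mlp(cells).max(dim=1).values            # (N, d)
```

Since the MLP is nonlinear, the two orderings are not equivalent in general.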

Problem about the bad baseline result

Hello, we ran the baseline model on the Something-Else dataset under the ORViT framework, but the best accuracy we obtained is only 49.6, far from the 60.2 reported in the paper. We made no changes to the model's config file (Smthelse_ORViT-MF_224_16x4.yaml) except adjusting batch_size to 16 (4 GPUs).

Code for Diving48

Hi. Would it be possible to publish the corresponding code (e.g. the configs) for Diving48 as well? Thanks!

Detection boxes for the Something-Something v2 dataset

I found the detected bounding boxes for the Something-Something v2 dataset here (https://drive.google.com/drive/folders/1XqZC2jIHqrLPugPOVJxCH_YWa275PBrZ), but they are split across four JSON files, namely:
bounding_box_smthsmth_part1.json
bounding_box_smthsmth_part2.json
bounding_box_smthsmth_part3.json
bounding_box_smthsmth_part4.json
Could you please confirm whether these are the files I should supply in place of the detected boxes? If a different file is required, could you please provide the link to download it?

I guess the detection boxes have to be stored per video, named by video id, because I get this error:
No such file or directory: '../database/something-something/something-something-v2/detected_boxes/74225'

Is there a script that splits the detected boxes into one file per video? Something like the sketch below is what I have in mind.
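A minimal sketch, assuming each part file is a JSON object mapping video id to that video's annotations (the grouping key would need to change if the real schema is a flat list):

```python
import json
import os

parts = [f"bounding_box_smthsmth_part{i}.json" for i in range(1, 5)]
out_dir = "detected_boxes"
os.makedirs(out_dir, exist_ok=True)

for path in parts:
    with open(path) as f:
        data = json.load(f)
    # assumption: data maps video id -> annotations for that video
    for video_id, anns in data.items():
        # one file per video, named by video id (e.g. detected_boxes/74225)
        with open(os.path.join(out_dir, str(video_id)), "w") as f:
            json.dump(anns, f)
```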

Thanks in advance

Model weights on SthElse

Hi,
Thank you for the wonderful work. Do you have plans to release the model trained on Something-Else?

Extracting bounding boxes for the remaining SSv2 videos

Hi, for SSv2, the annotations zip folder contains the bounding boxes for ~180k videos. To obtain them for the rest of the videos, could you please provide details about the setup you used to produce the detections? Running a pretrained Faster R-CNN (detectron2) directly on the test set yields poor results (roughly the pipeline sketched below). Please advise on the network architecture, how to fine-tune, the hyperparameters, the class labels used, etc. It would also be helpful if you could provide the annotations for the remaining test videos.
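For reference, this is roughly what I tried: a stock COCO-pretrained Faster R-CNN from detectron2, applied per frame with no fine-tuning (the frame path is a placeholder):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # confidence cutoff
predictor = DefaultPredictor(cfg)

frame = cv2.imread("frame_0001.jpg")          # placeholder frame path
outputs = predictor(frame)
boxes = outputs["instances"].pred_boxes.tensor.cpu().numpy()
scores = outputs["instances"].scores.cpu().numpy()
```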
Thanks!

Training on the AVA dataset

Hello, thank you for your work. How can I use ORViT to train on the AVA dataset? Have you replicated the training of MViT on AVA? I found mvit.yaml in your configuration files.

Model file for run_net.py

Hi,

Thanks for the amazing work. Where can I find the model k600_motionformer_224_16x4.pyth? On the TimeSformer website, there doesn't seem to be a model with exactly the name referenced in configs/ORViT/Smthelse_ORViT-MF_224_16x4.yaml. Never mind, I found it here: Motionformer. The Motionformer link in README.md points to TimeSformer, which caused the confusion.

Thanks,
Nirat

Object Region Attention

Hello,
In the paper, it is mentioned that in the ORViT block, the object-region attention is carried out with different q, k, and v values: q is set to the patch tokens, while k and v are set to the concatenated tokens from the patches and the object regions.

$X \in \mathbb{R}^{THW \times d}$, $C \in \mathbb{R}^{T(HW+O) \times d}$

So, in the object-region attention, it should be (according to the paper): $Q = XW_q$, $K = CW_k$, $V = CW_v$.

However, in the code, I see that the concatenated tokens are passed to the trajectory attention module:

`all_tokens, thw = self.attn(`

Also, in the trajectory attention module (`class TrajectoryAttention(nn.Module)`), the q, k, and v values are all computed from the same concatenated tokens.

Can you please help me understand this? I can't find where the original patch tokens are set as q for the trajectory attention mechanism; a toy comparison of the two readings is sketched below.
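A single-head toy sketch of the two variants as I understand them (all shapes, sizes, and projection names are my own placeholders, not taken from the repo):

```python
import torch
import torch.nn as nn

B, THW, TO, d = 2, 1568, 32, 768       # patch tokens, object tokens, width
x = torch.randn(B, THW, d)             # patch tokens X
c = torch.cat([x, torch.randn(B, TO, d)], dim=1)  # concatenated tokens C

wq, wk, wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

# Paper's equations as I read them: Q from X only, K/V from C,
# so only the THW patch tokens are updated.
q, k, v = wq(x), wk(c), wv(c)
out_paper = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1) @ v
print(out_paper.shape)                 # torch.Size([2, 1568, 768])

# What the code seems to do: feed C itself to trajectory attention,
# so q, k, v are all projections of the same concatenated tokens.
# (Plain attention here; the trajectory-specific reweighting is omitted.)
q2, k2, v2 = wq(c), wk(c), wv(c)
out_code = torch.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1) @ v2
print(out_code.shape)                  # torch.Size([2, 1600, 768])
```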

Thanks :)

Regarding the objects selected as input

Hi, nice to meet you! Great work!
I wonder whether there is a reason for selecting 4 objects per frame in EPIC-Kitchens, given that a single frame can clearly contain more than 10 objects. In that case, how did you decide which 4 objects' information to incorporate into the model?
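For instance, is it something like keeping the top-scoring detections? A guess of mine (not taken from the repo):

```python
import numpy as np

def select_boxes(boxes, scores, num_obj=4):
    """Keep the num_obj highest-scoring boxes per frame, zero-padding
    when fewer are detected; one plausible policy, not necessarily yours."""
    order = np.argsort(scores)[::-1][:num_obj]
    out = np.zeros((num_obj, 4), dtype=np.float32)
    out[: len(order)] = boxes[order]
    return out
```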
