actiongenome's Introduction

Action Genome

This repo contains the README and code snippets for using the Action Genome dataset v1.0.

Prerequisite

To use the snippets in this repo, Python 3 and ffmpeg are required.

Get started

Download videos and annotations

Download Charades videos (scaled to 480p) from here and extract (or softlink) them under dataset/ag/videos.

Download Action Genome annotations and place them under dataset/ag/annotations.

Dump frames

We are not releasing the dumped frames from the Charades videos. Instead, you can download the Charades videos from here and dump the frames following the instructions below.

After placing all 480p videos under dataset/ag/videos, dump the frames into dataset/ag/frames:

python tools/dump_frames.py

The dumped frames are ~74GB. The dumping may take half a day to finish. Note that we have only annotated sampled frames (see the sampling strategy in our paper) rather than all frames. If you prefer to dump all frames, run:

python tools/dump_frames.py --all_frames
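
If you need to adapt the dumping step, here is a minimal sketch of how frames could be extracted with ffmpeg. The directory layout follows the README above, but the per-frame naming and the loop itself are assumptions for illustration, not the exact logic of tools/dump_frames.py.

# Sketch: dump every frame of each 480p video with ffmpeg.
# The six-digit .png naming follows the annotation tags (e.g. 000048.png),
# but the exact numbering scheme of tools/dump_frames.py may differ.
import os
import subprocess

VIDEO_DIR = "dataset/ag/videos"
FRAME_DIR = "dataset/ag/frames"

for video in sorted(os.listdir(VIDEO_DIR)):
    if not video.endswith(".mp4"):
        continue
    out_dir = os.path.join(FRAME_DIR, video)  # one folder of frames per video
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-i",
         os.path.join(VIDEO_DIR, video),
         os.path.join(out_dir, "%06d.png")],
        check=True,
    )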

Annotations structure

The object_bbox_and_relationship.pkl contains a dictionary structured like:

{...
    'VIDEO_ID/FRAME_ID':
        [...
            {
                'class': 'book',
                'bbox': (x, y, w, h),
                'attention_relationship': ['looking_at'],
                'spatial_relationship': ['in_front_of'],
                'contacting_relationship': ['holding', 'touching'],
                'visible': True,
                'metadata': 
                    {
                        'tag': 'VIDEO_ID/FRAME_ID',
                        'set': 'train'
                    }
            }
        ...]
...}

Note that 'visible' indicates whether the object involved in the interaction is visible in the frame.

The person_bbox.pkl contains the person bounding boxes for each frame. Here we release the Faster R-CNN-detected person boxes used in our paper. In the next version of the dataset, we will release manually labeled person boxes.

The frame_list.txt contains all frames we've labeled.

The object_classes.txt contains all classes of objects.

The relationship_classes.txt contains all classes of human-object relationships.
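
As a quick sanity check, the annotation files can be loaded as in the sketch below. The paths assume the layout described above; the comments on the dictionary contents follow the structure shown earlier and the issues further down.

# Sketch: load the Action Genome annotation files placed under dataset/ag/annotations.
import pickle

ANNO_DIR = "dataset/ag/annotations"

with open(f"{ANNO_DIR}/object_bbox_and_relationship.pkl", "rb") as f:
    object_anno = pickle.load(f)   # dict: 'VIDEO_ID/FRAME_ID' -> list of object dicts

with open(f"{ANNO_DIR}/person_bbox.pkl", "rb") as f:
    person_anno = pickle.load(f)   # per-frame person box info (see the bbox format issue below)

with open(f"{ANNO_DIR}/frame_list.txt") as f:
    frame_list = [line.strip() for line in f if line.strip()]

with open(f"{ANNO_DIR}/object_classes.txt") as f:
    object_classes = [line.strip() for line in f if line.strip()]

with open(f"{ANNO_DIR}/relationship_classes.txt") as f:
    relationship_classes = [line.strip() for line in f if line.strip()]

print(len(frame_list), "annotated frames")
first = frame_list[0]
print(first, object_anno[first])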


actiongenome's Issues

About object detection in this dataset

Hi! Thanks for your outstanding work. Is it particularly challenging to obtain high mAP on object detection with this dataset? Using Faster R-CNN, I can only achieve 11-12 AP on the validation set. Thanks!

Question about the dataset

Hi, I have a small question about how the objects for an annotated frame were determined. Did you first extract the list of objects involved in all the actions of a video and then have annotators label each frame to decide which objects from that list appear in it? Or did you only use the objects occurring in the actions whose intervals contain the frame?

question about bbox annotation

Hi, thanks for your wonderful work. Is there any additional operation, such as resizing the raw images, needed to match the bbox annotations? I find the boxes are shifted considerably when I try to visualize the annotations.

Question about the annotations

Thanks again for the wonderful work. Regarding the person and object annotations, I would appreciate it if you could clarify the questions below:

  • Is only a single person (the actor who is supposed to perform the action(s)) annotated for each clip, even if multiple persons appear?
  • Is only a single object of an object class annotated, even if there are multiple instances of the same class?
  • Is there any problem with the annotation? For example, the image below is the first annotated frame from 7H7PN.mp4 (7H7PN.mp4/000048.png); the upper-right command-line screenshot is the person box annotation and the lower-right one is the object & relationship annotations. If the annotated person is the one on the left-hand side who is taking things out of a bag, how can he simultaneously be sitting on the chair (from the bbox coordinates we know that is the one on the right-hand side) and sitting on the floor?

[image: annotated frame 7H7PN.mp4/000048.png with the person box and object/relationship annotations]

Frame Sampling

Hi, could you please provide the script showing how you uniformly sample 5 frames from a Charades action interval? I am trying to sample uniformly from the Charades action intervals myself, but my extracted frame indices do not match yours at all.

Thank you very much!
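
For reference, one obvious interpretation of "uniformly sample 5 frames from an action interval" is sketched below. As the question itself notes, this naive scheme does not necessarily reproduce the frame indices released with Action Genome; the fps value and rounding are assumptions.

# Sketch: naive uniform sampling of 5 frames from a Charades action interval
# given in seconds. Illustration only; not the official sampling script.
import numpy as np

def uniform_sample(start_sec, end_sec, fps=24, num_frames=5):
    start_f = int(round(start_sec * fps))
    end_f = int(round(end_sec * fps))
    return np.linspace(start_f, end_f, num_frames).round().astype(int).tolist()

print(uniform_sample(2.0, 10.0))  # -> [48, 96, 144, 192, 240]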

How to evaluate Recall@K in multi-label relationship of SGG?

Dear author,
I am reimplementing a model on this dataset. In my statistics, there are at most 5 edges between a subject and an object when I treat the graph as directional: the dataset always contains <object, spatial relationship, person> pairs as well as <person, attention/contacting relationship, object> pairs. So I am running into an evaluation problem. Could you explain the evaluation metrics in more detail? Thank you so much!
In the original SG evaluation, triplets are used for evaluation. For Action Genome, should I convert the multi-label relationships into several single-label triplets for evaluation?
Thank you so much!
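
For concreteness, the conversion the asker proposes might look like the sketch below. The field names come from the annotation format above, and the edge directions follow the asker's description; this is the asker's proposal, not an official evaluation protocol for Action Genome.

# Sketch: expand one frame's multi-label object annotations into
# (subject, predicate, object) triplets for a standard Recall@K evaluation.
def expand_to_triplets(frame_objects):
    triplets = []
    for obj in frame_objects:
        if not obj.get("visible"):
            continue
        name = obj["class"]
        # person -> attention/contacting relationship -> object
        for rel in (obj.get("attention_relationship") or []):
            triplets.append(("person", rel, name))
        for rel in (obj.get("contacting_relationship") or []):
            triplets.append(("person", rel, name))
        # object -> spatial relationship -> person (direction as described
        # in the question above; an assumption, not confirmed by the authors)
        for rel in (obj.get("spatial_relationship") or []):
            triplets.append((name, rel, "person"))
    return triplets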

Question about Charades Fewshot Split

Can you please provide or point me to where I can find details regarding the fewshot experiments on Charades?

  1. How is the 137/20 action split determined?
  2. How do we sample the k=[1,5,10] instances?

Please let us know if you can provide the exact split files or explain how the experimental setup was designed.

Thanks

Question about the 'visible' in annotations

Thank you for your work. When I load the annotations, I have some doubts about the 'visible' attribute. For example, the relation annotations of 50N4E.mp4/000682.png look like this:

[
{'class': 'light', 'bbox': None, 'attention_relationship': None, 'spatial_relationship': None, 'contacting_relationship': None, 'metadata': {'tag': '50N4E.mp4/light/000682', 'set': 'train'}, 'visible': False}, 
{'class': 'dish', 'bbox': None, 'attention_relationship': None, 'spatial_relationship': None, 'contacting_relationship': None, 'metadata': {'tag': '50N4E.mp4/dish/000682', 'set': 'train'}, 'visible': False}
]

Does this mean there are no bboxes or relations in 50N4E.mp4/000682.png? If so, can we simply ignore this frame at test time?

Question about the directional information

Hi Jingwei,

thanks for your work! I have a question about direction in the scene graph. In other scene graph datasets such as VG, relations are annotated as subject-relation-object triplets. In Action Genome, the relation labels are associated with objects, so how did you define the direction when evaluating R@X?
I also find that opposite relations can appear at the same time, e.g. {'class': value, 'bbox': value, 'spatial_relationship': ['in front of', 'behind']}. I guess this means human-in front of-object and object-behind-human (or the other way around). How can we recover this information? Waiting for your reply!

Thanks a lot

faster rcnn on ActionGenome

After training on AG, I found that the mAP of Faster R-CNN with a ResNet-101 backbone is quite low. Is this a problem on my end or with the dataset?

Please provide a pretrained model

Hello, happy new year! I would like to request a pretrained model for Action Genome, so that I can use it to predict scene graphs from videos for my further research.

Annotation with None

Thank you for providing the annotations!
There are quite a lot of None object annotations, e.g.

{'attention_relationship': None,
 'bbox': None,
 'class': 'sofa/couch',
 'contacting_relationship': None,
 'metadata': {'set': 'test', 'tag': 'BLLCM.mp4/sofa_couch/000394'},
 'spatial_relationship': None,
 'visible': False}

Are these annotations to be ignored?
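
Whether these entries should be ignored is the open question above; the mechanics of skipping them are trivial either way. A minimal sketch, assuming you do want to drop them:

# Sketch: drop object entries whose bbox/relationships are None,
# i.e. entries with 'visible': False. This only shows the filtering
# mechanics, not an official recommendation.
def keep_visible(frame_objects):
    return [obj for obj in frame_objects
            if obj.get("visible") and obj.get("bbox") is not None]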

bbox format for persons and objects

Is the bbox format for objects (x, y, w, h)? And are (x, y) the center coordinates here?

e.g. annotation for an object:

{'class': 'food',
  'bbox': (324.82430069930064,
   193.98318348318338,
   6.590909090909065,
   8.636363636363626),
  'attention_relationship': ['looking_at'],
  'spatial_relationship': ['in_front_of'],
  'contacting_relationship': ['holding'],
  'metadata': {'tag': '924QD.mp4/food/000067', 'set': 'train'},
  'visible': True}

while the annotation for a person is in (x1, y1, x2, y2) format:

{'bbox': array([[ 75.57577,  78.03209, 212.58168, 467.56796]], dtype=float32),
 'bbox_score': array([0.95631087], dtype=float32),
 'bbox_size': (270, 480),
 'bbox_mode': 'xyxy',
 'keypoints': array([[[168.54407 , 169.3401  ,   1.      ],
         [173.26842 , 170.01521 ,   1.      ],
         [ 85.193184,  96.091156,   1.      ],
         [180.01747 , 183.17976 ,   1.      ],
         [194.19049 , 201.40762 ,   1.      ],
         [168.54407 , 188.91817 ,   1.      ],
         [183.05455 , 212.54686 ,   1.      ],
         [ 98.016396, 198.03209 ,   1.      ],
         [ 99.36621 , 198.36964 ,   1.      ],
         [111.51451 , 114.65656 ,   1.      ],
         [109.4898  , 150.43715 ,   1.      ],
         [129.39952 , 376.5975  ,   1.      ],
         [164.15718 , 368.83377 ,   1.      ],
         [153.69614 , 181.82956 ,   1.      ],
         [153.02124 , 466.38654 ,   1.      ],
         [115.226494, 126.47091 ,   1.      ],
         [114.889046, 126.80846 ,   1.      ]]], dtype=float32),
 'keypoints_logits': array([[ 0.3934058 ,  1.2183307 ,  0.36741984,  1.7435464 ,  2.248969  ,
          3.1777701 ,  1.09344   ,  2.236632  ,  3.1861217 ,  2.8617258 ,
          1.0008469 ,  3.27955   ,  3.3649373 , -1.9560733 , -2.4075575 ,
         -0.4515944 , -1.1781657 ]], dtype=float32)}
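
Based on the two listings above (object boxes as (x, y, w, h), person boxes with 'bbox_mode': 'xyxy'), a small conversion helper might look like the sketch below. Treating (x, y) as the top-left corner is an assumption consistent with the README, not something confirmed in this thread.

# Sketch: convert an object bbox (x, y, w, h) to the (x1, y1, x2, y2)
# format used for person boxes. Assumption: (x, y) is the top-left corner.
def xywh_to_xyxy(bbox):
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

# Example with the 'food' annotation above (values rounded):
print(xywh_to_xyxy((324.82, 193.98, 6.59, 8.64)))
# -> (324.82, 193.98, 331.41, 202.62)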

Releasing baseline models

Hi @JingweiJ, thanks for the wonderful work. Do you plan to release the baseline models for the proposed tasks, i.e. (few-shot) action recognition and spatio-temporal scene graph prediction? That would greatly help researchers experiment on this dataset.

Question about reproduction

When I reproduce the results, I find that PredCls is really high even when I use random predicate scores. Is this a problem on my end or with the dataset?
