actiongenome's Introduction

Action Genome

This repo contains the README and code snippets for using the Action Genome dataset v1.0.

Prerequisite

To use the snippets in this repo, Python 3 and ffmpeg are required.

Get started

Download videos and annotations

Download Charades videos (scaled to 480p) from here and extract (or softlink) them under dataset/ag/videos.

Download Action Genome annotations and place them under dataset/ag/annotations.

Dump frames

We are not releasing the dumped frames from the Charades videos. Instead, you can download the Charades videos from here and dump the frames following the instructions below.

After placing all 480p videos under dataset/ag/videos, dump the frames into dataset/ag/frames:

python tools/dump_frames.py

The dumped frames are ~74GB. The dumping may take half a day to finish. Note that we have only annotated sampled frames (see the sampling strategy in our paper) rather than all frames. If you prefer to dump all frames, run:

python tools/dump_frames.py --all_frames
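
If you need to adapt the dumping step, here is a minimal sketch of how frames could be extracted with ffmpeg. The directory layout follows the README above, but the per-frame naming and the loop itself are assumptions for illustration, not the exact logic of tools/dump_frames.py.

# Sketch: dump every frame of each 480p video with ffmpeg.
# The six-digit .png naming follows the annotation tags (e.g. 000048.png),
# but the exact numbering scheme of tools/dump_frames.py may differ.
import os
import subprocess

VIDEO_DIR = "dataset/ag/videos"
FRAME_DIR = "dataset/ag/frames"

for video in sorted(os.listdir(VIDEO_DIR)):
    if not video.endswith(".mp4"):
        continue
    out_dir = os.path.join(FRAME_DIR, video)  # one folder of frames per video
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-i",
         os.path.join(VIDEO_DIR, video),
         os.path.join(out_dir, "%06d.png")],
        check=True,
    )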

Annotations structure

The object_bbox_and_relationship.pkl contains a dictionary structured like:

{...
    'VIDEO_ID/FRAME_ID':
        [...
            {
                'class': 'book',
                'bbox': (x, y, w, h),
                'attention_relationship': ['looking_at'],
                'spatial_relationship': ['in_front_of'],
                'contacting_relationship': ['holding', 'touching'],
                'visible': True,
                'metadata': 
                    {
                        'tag': 'VIDEO_ID/FRAME_ID',
                        'set': 'train'
                    }
            }
        ...]
...}

Note that 'visible' indicates whether the object involved in the interaction is visible in the frame.

The person_bbox.pkl contains the person bounding boxes for each frame. Here we release the Faster R-CNN-detected person boxes used in our paper. In the next version of the dataset, we will release manually labeled person boxes.

The frame_list.txt contains all frames we've labeled.

The object_classes.txt contains all classes of objects.

The relationship_classes.txt contains all classes of human-object relationships.
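
As a quick sanity check, the annotation files can be loaded as in the sketch below. The paths assume the layout described above; the comments on the dictionary contents follow the structure shown earlier and the issues further down.

# Sketch: load the Action Genome annotation files placed under dataset/ag/annotations.
import pickle

ANNO_DIR = "dataset/ag/annotations"

with open(f"{ANNO_DIR}/object_bbox_and_relationship.pkl", "rb") as f:
    object_anno = pickle.load(f)   # dict: 'VIDEO_ID/FRAME_ID' -> list of object dicts

with open(f"{ANNO_DIR}/person_bbox.pkl", "rb") as f:
    person_anno = pickle.load(f)   # per-frame person box info (see the bbox format issue below)

with open(f"{ANNO_DIR}/frame_list.txt") as f:
    frame_list = [line.strip() for line in f if line.strip()]

with open(f"{ANNO_DIR}/object_classes.txt") as f:
    object_classes = [line.strip() for line in f if line.strip()]

with open(f"{ANNO_DIR}/relationship_classes.txt") as f:
    relationship_classes = [line.strip() for line in f if line.strip()]

print(len(frame_list), "annotated frames")
first = frame_list[0]
print(first, object_anno[first])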


actiongenome's Issues

About object detection in this dataset

Hi! Thanks for your outstanding work. Is it particularly challenging to obtain high mAP on object detection with this dataset? Using Faster R-CNN, I can only achieve 11-12 AP on the validation set. Thanks!

Question about the dataset

Hi, I have a small question about how the objects for an annotated frame were determined. Did you first extract the list of objects involved in all the actions of a video and then have annotators label each frame to decide which objects from that list appear in it? Or did you only use the objects occurring in the actions whose intervals contain the frame?

question about bbox annotation

Hi, thanks for your wonderful work. Is there any additional operation, such as resizing the raw images, needed to match the bbox annotations? I find the boxes are shifted considerably when I try to visualize the annotations.

Question about the annotations

Thanks again for the wonderful work. Regarding the person and object annotations, I would appreciate it if you could clarify the questions below:

  • Is only a single person (the actor who is supposed to perform the action(s)) annotated for each clip, even if multiple persons appear?
  • Is only a single object of an object class annotated, even if there are multiple instances of the same class?
  • Is there any problem with the annotation? For example, the image below is the first annotated frame from 7H7PN.mp4 (7H7PN.mp4/000048.png); the upper-right command-line screenshot is the person box annotation and the lower-right one is the object & relationship annotations. If the annotated person is the one on the left-hand side who is taking things out of a bag, how can he simultaneously be sitting on the chair (from the bbox coordinates we know that is the one on the right-hand side) and sitting on the floor?

[image: annotated frame 7H7PN.mp4/000048.png with the person box and object/relationship annotations]

Frame Sampling

Hi, could you please provide the script showing how you uniformly sample 5 frames from a Charades action interval? I am trying to sample uniformly from the Charades action intervals myself, but my extracted frame indices do not match yours at all.

Thank you very much!
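
For reference, one obvious interpretation of "uniformly sample 5 frames from an action interval" is sketched below. As the question itself notes, this naive scheme does not necessarily reproduce the frame indices released with Action Genome; the fps value and rounding are assumptions.

# Sketch: naive uniform sampling of 5 frames from a Charades action interval
# given in seconds. Illustration only; not the official sampling script.
import numpy as np

def uniform_sample(start_sec, end_sec, fps=24, num_frames=5):
    start_f = int(round(start_sec * fps))
    end_f = int(round(end_sec * fps))
    return np.linspace(start_f, end_f, num_frames).round().astype(int).tolist()

print(uniform_sample(2.0, 10.0))  # -> [48, 96, 144, 192, 240]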

How to evaluate Recall@K in multi-label relationship of SGG?

Dear author,
I am reimplementing a model on this dataset. In my statistics, there are at most 5 edges between a subject and an object when I treat the graph as directional: the dataset always contains <object, spatial relationship, person> pairs as well as <person, attention/contacting relationship, object> pairs. So I am running into an evaluation problem. Could you explain the evaluation metrics in more detail? Thank you so much!
In the original SG evaluation, triplets are used for evaluation. For Action Genome, should I convert the multi-label relationships into several single-label triplets for evaluation?
Thank you so much!
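
For concreteness, the conversion the asker proposes might look like the sketch below. The field names come from the annotation format above, and the edge directions follow the asker's description; this is the asker's proposal, not an official evaluation protocol for Action Genome.

# Sketch: expand one frame's multi-label object annotations into
# (subject, predicate, object) triplets for a standard Recall@K evaluation.
def expand_to_triplets(frame_objects):
    triplets = []
    for obj in frame_objects:
        if not obj.get("visible"):
            continue
        name = obj["class"]
        # person -> attention/contacting relationship -> object
        for rel in (obj.get("attention_relationship") or []):
            triplets.append(("person", rel, name))
        for rel in (obj.get("contacting_relationship") or []):
            triplets.append(("person", rel, name))
        # object -> spatial relationship -> person (direction as described
        # in the question above; an assumption, not confirmed by the authors)
        for rel in (obj.get("spatial_relationship") or []):
            triplets.append((name, rel, "person"))
    return triplets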

Question about Charades Fewshot Split

Can you please provide or point me to where I can find details regarding the fewshot experiments on Charades?

  1. How is the 137/20 action split determined?
  2. How do we sample the k=[1,5,10] instances?

Please let us know if you can provide the exact split files or explain how the experimental setup was designed.

Thanks

Question about the 'visible' in annotations

Thank you for your work. When I load the annotations, I have some doubts about the 'visible' attribute. For example, the relation annotations of 50N4E.mp4/000682.png look like this:

[
{'class': 'light', 'bbox': None, 'attention_relationship': None, 'spatial_relationship': None, 'contacting_relationship': None, 'metadata': {'tag': '50N4E.mp4/light/000682', 'set': 'train'}, 'visible': False}, 
{'class': 'dish', 'bbox': None, 'attention_relationship': None, 'spatial_relationship': None, 'contacting_relationship': None, 'metadata': {'tag': '50N4E.mp4/dish/000682', 'set': 'train'}, 'visible': False}
]

Does this mean there are no bboxes or relations in 50N4E.mp4/000682.png? If so, can we simply ignore this frame at test time?

Question about the directional information

Hi Jingwei,

thanks for your work! I have a question about direction in the scene graph. In other scene graph datasets such as VG, relations are annotated as subject-relation-object triplets. In Action Genome, the relation labels are associated with objects, so how did you define the direction when evaluating R@X?
I also find that opposite relations can appear at the same time, e.g. {'class': value, 'bbox': value, 'spatial_relationship': ['in front of', 'behind']}. I guess this means human-in front of-object and object-behind-human (or the other way around). How can we recover this information? Waiting for your reply!

Thanks a lot

faster rcnn on ActionGenome

After training on AG, I found that the mAP of Faster R-CNN with a ResNet-101 backbone is quite low. Is this a problem on my end or with the dataset?

Please provide a pretrained model

Hello, happy new year! I would like to request a pretrained model for Action Genome, so that I can use it to predict scene graphs from videos for my further research.

Annotation with None

Thank you for providing the annotations!
There are quite a lot of None object annotations, e.g.

{'attention_relationship': None,
 'bbox': None,
 'class': 'sofa/couch',
 'contacting_relationship': None,
 'metadata': {'set': 'test', 'tag': 'BLLCM.mp4/sofa_couch/000394'},
 'spatial_relationship': None,
 'visible': False}

Are these annotations to be ignored?
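
Whether these entries should be ignored is the open question above; the mechanics of skipping them are trivial either way. A minimal sketch, assuming you do want to drop them:

# Sketch: drop object entries whose bbox/relationships are None,
# i.e. entries with 'visible': False. This only shows the filtering
# mechanics, not an official recommendation.
def keep_visible(frame_objects):
    return [obj for obj in frame_objects
            if obj.get("visible") and obj.get("bbox") is not None]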

bbox format for persons and objects

Is the bbox format for objects (x, y, w, h)? And are (x, y) the center coordinates here?

e.g. annotation for an object:

{'class': 'food',
  'bbox': (324.82430069930064,
   193.98318348318338,
   6.590909090909065,
   8.636363636363626),
  'attention_relationship': ['looking_at'],
  'spatial_relationship': ['in_front_of'],
  'contacting_relationship': ['holding'],
  'metadata': {'tag': '924QD.mp4/food/000067', 'set': 'train'},
  'visible': True}

while the annotation for a person is in (x1, y1, x2, y2) format:

{'bbox': array([[ 75.57577,  78.03209, 212.58168, 467.56796]], dtype=float32),
 'bbox_score': array([0.95631087], dtype=float32),
 'bbox_size': (270, 480),
 'bbox_mode': 'xyxy',
 'keypoints': array([[[168.54407 , 169.3401  ,   1.      ],
         [173.26842 , 170.01521 ,   1.      ],
         [ 85.193184,  96.091156,   1.      ],
         [180.01747 , 183.17976 ,   1.      ],
         [194.19049 , 201.40762 ,   1.      ],
         [168.54407 , 188.91817 ,   1.      ],
         [183.05455 , 212.54686 ,   1.      ],
         [ 98.016396, 198.03209 ,   1.      ],
         [ 99.36621 , 198.36964 ,   1.      ],
         [111.51451 , 114.65656 ,   1.      ],
         [109.4898  , 150.43715 ,   1.      ],
         [129.39952 , 376.5975  ,   1.      ],
         [164.15718 , 368.83377 ,   1.      ],
         [153.69614 , 181.82956 ,   1.      ],
         [153.02124 , 466.38654 ,   1.      ],
         [115.226494, 126.47091 ,   1.      ],
         [114.889046, 126.80846 ,   1.      ]]], dtype=float32),
 'keypoints_logits': array([[ 0.3934058 ,  1.2183307 ,  0.36741984,  1.7435464 ,  2.248969  ,
          3.1777701 ,  1.09344   ,  2.236632  ,  3.1861217 ,  2.8617258 ,
          1.0008469 ,  3.27955   ,  3.3649373 , -1.9560733 , -2.4075575 ,
         -0.4515944 , -1.1781657 ]], dtype=float32)}
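
Based on the two listings above (object boxes as (x, y, w, h), person boxes with 'bbox_mode': 'xyxy'), a small conversion helper might look like the sketch below. Treating (x, y) as the top-left corner is an assumption consistent with the README, not something confirmed in this thread.

# Sketch: convert an object bbox (x, y, w, h) to the (x1, y1, x2, y2)
# format used for person boxes. Assumption: (x, y) is the top-left corner.
def xywh_to_xyxy(bbox):
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

# Example with the 'food' annotation above (values rounded):
print(xywh_to_xyxy((324.82, 193.98, 6.59, 8.64)))
# -> (324.82, 193.98, 331.41, 202.62)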

Releasing baseline models

Hi @JingweiJ, thanks for the wonderful work. Do you plan to release the baseline models for the proposed tasks, i.e. (few-shot) action recognition and spatio-temporal scene graph prediction? That would greatly help researchers experiment on this dataset.

Question about reproduction

When I reproduce the results, I find that PredCls is really high even when I use random predicate scores. Is this a problem on my end or with the dataset?
