
cmhungsteve / sstda

Stars: 154 · Watchers: 10 · Forks: 22 · Size: 1.17 MB

[CVPR 2020] Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation (PyTorch)

Home Page: https://arxiv.org/abs/2003.02824

License: MIT License

Python 81.90% Shell 18.10%
cvpr2020 pytorch domain-adaptation domain-discrepancy temporal-dynamics video action-segmentation self-supervised-learning video-understanding

sstda's Introduction

Hi there 👋

My name is Min-Hung (Steve) Chen (陳敏弘 in Chinese). I am a Senior Research Scientist at NVIDIA Research Taiwan, working on Vision+X Multi-Modal AI. I received my Ph.D. degree from Georgia Tech, advised by Prof. Ghassan AlRegib and in collaboration with Prof. Zsolt Kira. Before joining NVIDIA, I worked on Biometric Research for Cognitive Services as a Research Engineer II at Microsoft Azure AI, and on Edge-AI Research as a Senior AI Engineer at MediaTek.

My research interests center on Multi-Modal AI, including Vision-Language, Video Understanding, Cross-Modal Learning, Efficient Tuning, and Transformers. I am also interested in Learning without Full Supervision, including domain adaptation, transfer learning, continual learning, X-supervised learning, etc.

[Update] I released a comprehensive paper list for Vision Transformer & Attention to facilitate related research. Feel free to check it out (a ★STAR would be appreciated)!

[Personal Website][LinkedIn][Twitter][Google Scholar][Resume]


sstda's People

Contributors

cmhungsteve


sstda's Issues

Is there a demo?

Hi @cmhungsteve, I recently heard about action segmentation and became interested in it, so I searched GitHub and found this great repo. Is there a demo, or some other way to get a quick, intuitive sense of action segmentation? Thanks!

Question about the feature shape

Thanks for your great work!
I downloaded the features for every video and loaded the .npy files. The shape of each feature array is (2048, frame_count). Since the feature dimension of each clip is 2048, why isn't the shape (frame_count, 2048)?
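For reference, the released layout is simply channel-major, and a single transpose converts it to frame-major. A minimal sketch (a zero array stands in for the loaded .npy file; `some_video.npy` is a hypothetical file name):

```python
import numpy as np

# Stand-in for `feat = np.load("some_video.npy")` -- a zero array with
# the same layout as the released features: one 2048-D column per frame.
frame_count = 300
feat = np.zeros((2048, frame_count), dtype=np.float32)

feat_t = feat.T                    # frame-major view, no data copy
print(feat.shape, feat_t.shape)    # (2048, 300) (300, 2048)
```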

Pretrained model

Hi, would you share the pretrained model for testing purposes? I imagine training would take several days even on a machine equipped with more than four high-end GPU cards.

Qualitative results

Hi Steve,

Can you share the code or script that you use to produce those visualization results?

Incomplete dataset Download

Hello,
Thank you so much for the files and code you uploaded. However, every time I download the Dataset folder, it is incomplete. Could you please split the Dataset into several smaller compressed files and upload them? Thank you very much for your cooperation.
Best regards

I3D Feature

Each video feature has dimension (2048, X). Is 2048 the feature dimension of each extracted frame? Or is the video divided into X frames whose features are then stacked into (2048, X)?
I3D does not seem to be able to extract features from a single frame, so I would like to know how you extract per-frame features from a video. Could you provide the feature-extraction code? Thank you very much!
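On the single-frame point: a common way to get one 2048-D vector per frame from a clip-based model is to run it on a temporal window centred at each frame and stack the outputs column-wise. The sketch below illustrates this with a placeholder `clip_model` (an assumption for illustration, not this repo's actual extraction code; a real pretrained I3D backbone would replace it):

```python
import numpy as np

def clip_model(clip):
    # Placeholder for a pretrained I3D backbone that maps a
    # (T, H, W, C) clip to a single 2048-D feature vector.
    return np.zeros(2048, dtype=np.float32)

def per_frame_features(video, window=16):
    # video: (num_frames, H, W, C). Pad temporally so every frame can
    # sit at the centre of a `window`-frame clip, then stack the
    # per-frame outputs column-wise into a (2048, num_frames) array.
    half = window // 2
    padded = np.concatenate([
        np.repeat(video[:1], half, axis=0),                 # repeat first frame
        video,
        np.repeat(video[-1:], window - half - 1, axis=0),   # repeat last frame
    ])
    feats = [clip_model(padded[i:i + window]) for i in range(len(video))]
    return np.stack(feats, axis=1)

video = np.zeros((40, 8, 8, 3), dtype=np.float32)   # tiny dummy video
print(per_frame_features(video).shape)              # (2048, 40)
```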

Change of test set

Hi there, great work and thanks for sharing the code.

I understand that you use the test set as the target domain. Does that mean any change to the test set would require retraining to get better results? What are your thoughts on possible solutions to this?

Thanks and looking forward to your reply.

Results of fewer labeled training data

Hi, I would like to thank you for the refreshing paper.
I have a question regarding the experiments with fewer labeled training data (Table 4 in the main paper and Table 8 in the Appendix). I wonder whether the results with 65% of labeled training data were obtained by setting ratio_source or ratio_label_source to 65%.
To my understanding:
(1) ratio_source: drops both frame features and labels.
(2) ratio_label_source: drops labels only. The dropped labels won't be used in the TCN cross-entropy loss; however, the frame features will still be used in the adversarial loss for domain prediction.
I thought the results of Table 4 were obtained with ratio_source=65%, as the paper says "we drop labeled frames from source domains with uniform sampling for training".
However, the appendix also mentions "The additional trained data are all unlabeled, so they cannot be directly trained with standard prediction loss. There we propose SSTDA to exploit unlabeled data" and "achieve performance with this strong baseline using only 65% of labels for training", which instead suggests that the results were obtained with ratio_label_source=65%.
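The distinction between the two flags can be sketched with a uniform-sampling keep mask (a toy illustration under assumed semantics, not the repository's actual implementation):

```python
import numpy as np

def uniform_keep_mask(num_frames, ratio):
    # Keep `ratio` of the frames, uniformly spaced across the video.
    kept = np.linspace(0, num_frames - 1,
                       int(round(num_frames * ratio))).astype(int)
    mask = np.zeros(num_frames, dtype=bool)
    mask[kept] = True
    return mask

frames = np.zeros((100, 2048))              # dummy features, one row per frame
mask = uniform_keep_mask(len(frames), 0.65)

# ratio_source=0.65: both the features and the labels are dropped.
src_frames = frames[mask]                   # only 65 frames survive at all

# ratio_label_source=0.65: all 100 feature rows stay (still usable by the
# adversarial/SSTDA losses); only the cross-entropy loss is restricted to
# the frames where `labelled` is True.
labelled = mask
print(src_frames.shape[0], labelled.sum())  # 65 65
```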
Thank you in advance and please correct me if there is any misunderstanding.
Regards

Video Feature Extraction

How are the I3D features processed into dimension (XXX, 2048)? Does XXX represent the number of video frames? Do these I3D features use RGB only, or both RGB and flow?
