
avt's Introduction

Anticipative Video Transformer

Ranked first in the Action Anticipation task of the CVPR 2021 EPIC-Kitchens Challenge! (entry: AVT-FB-UT)


[project page] [paper]

If this code helps with your work, please cite:

R. Girdhar and K. Grauman. Anticipative Video Transformer. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

@inproceedings{girdhar2021anticipative,
    title = {{Anticipative Video Transformer}},
    author = {Girdhar, Rohit and Grauman, Kristen},
    booktitle = {ICCV},
    year = 2021
}

Installation

The code was tested on an Ubuntu 20.04 cluster, with each server consisting of 8 V100 16GB GPUs.

First clone the repo and set up the required packages in a conda environment. You might need to make minor modifications here if some packages are no longer available. In most cases they should be replaceable by more recent versions.

$ git clone --recursive git@github.com:facebookresearch/AVT.git
$ conda env create -f env.yaml python=3.7.7
$ conda activate avt

Set up RULSTM codebase

If you plan to use EPIC-Kitchens datasets, you might need the train/test splits and evaluation code from RULSTM. This is also needed if you want to extract RULSTM predictions for test submissions.

$ cd external
$ git clone git@github.com:fpv-iplab/rulstm.git; cd rulstm
$ git checkout 57842b27d6264318be2cb0beb9e2f8c2819ad9bc
$ cd ../..

Datasets

The code expects the data in the DATA/ folder. You can also symlink it to a different folder on a faster/larger drive. Inside, it should contain the following folders:

  1. videos/ which will contain raw videos
  2. external/ which will contain pre-extracted features from prior work
  3. extracted_features/ which will contain other extracted features
  4. pretrained/ which will contain pretrained models, e.g., from TIMM

The paths to these datasets are set in files like conf/dataset/epic_kitchens100/common.yaml so you can also update the paths there instead.
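
For convenience, here is a minimal sketch of my own (not part of the repo) for creating this layout from the repository root; alternatively, symlink DATA to a larger drive as mentioned above:

from pathlib import Path
# import os; os.symlink("/big/drive/AVT_DATA", "DATA")  # optional: symlink to a larger drive instead

# Create the folders listed above
for sub in ["videos", "external", "extracted_features", "pretrained/TIMM"]:
    Path("DATA", sub).mkdir(parents=True, exist_ok=True)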

EPIC-Kitchens

To train only the AVT-h on top of pre-extracted features, you can download the features from RULSTM into DATA/external/rulstm/RULSTM/data_full for EK55 and DATA/external/rulstm/RULSTM/ek100_data_full for EK100. If you plan to train models on features extracted from an irCSN-152 model finetuned from IG65M features, you can download our pre-extracted features from here into DATA/extracted_features/ek100/ig65m_ftEk100_logits_10fps1s/rgb/ or here into DATA/extracted_features/ek55/ig65m_ftEk55train_logits_25fps/rgb/.

To train AVT end-to-end, you need to download the raw videos from EPIC-Kitchens. They can be organized as you wish, but this is how my folders are organized (since I first downloaded EK55 and then the remaining new videos for EK100):

DATA
├── videos
│   ├── EpicKitchens
│   │   └── videos_ht256px
│   │       ├── train
│   │       │   ├── P01
│   │       │   │   ├── P01_01.MP4
│   │       │   │   ├── P01_03.MP4
│   │       │   │   ├── ...
│   │       └── test
│   │           ├── P01
│   │           │   ├── P01_11.MP4
│   │           │   ├── P01_12.MP4
│   │           │   ├── ...
│   │           ...
│   ├── EpicKitchens100
│   │   └── videos_extension_ht256px
│   │       ├── P01
│   │       │   ├── P01_101.MP4
│   │       │   ├── P01_102.MP4
│   │       │   ├── ...
│   │       ...
│   ├── EGTEA/101020/videos/
│   │   ├── OP01-R01-PastaSalad.mp4
│   │   ...
│   └── 50Salads/rgb/
│       ├── rgb-01-1.avi
│       ...
├── external
│   └── rulstm
│       └── RULSTM
│           ├── egtea
│           │   ├── TSN-C_3_egtea_action_CE_flow_model_best_fcfull_hd
│           │   ...
│           ├── data_full  # (EK55)
│           │   ├── rgb
│           │   ├── obj
│           │   └── flow
│           └── ek100_data_full
│               ├── rgb
│               ├── obj
│               └── flow
└── extracted_features
    ├── ek100
    │   └── ig65m_ftEk100_logits_10fps1s
    │       └── rgb
    └── ek55
        └── ig65m_ftEk55train_logits_25fps
            └── rgb

If you use a different organization, you would need to edit the train/val dataset files, such as conf/dataset/epic_kitchens100/anticipation_train.yaml. Sometimes the values are overridden in the TXT config files, so you might need to change them there too. The root property takes a list of folders where the videos can be found, and it will search through all of them in order for a given video. Note that we resized the EPIC videos to 256px height for faster processing; you can use the sample_scripts/resize_epic_256px.sh script for that.
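
For reference, the resize amounts to rescaling each video to a 256px height while keeping the aspect ratio; a hedged sketch of the equivalent operation (the provided script handles this for all videos; the file names below are illustrative and ffmpeg is assumed to be installed):

import subprocess

# scale=-2:256 keeps the aspect ratio and forces a 256px height;
# illustrative input/output paths, not a substitute for the provided script.
subprocess.run([
    "ffmpeg", "-i", "P01_01.MP4",
    "-vf", "scale=-2:256",
    "DATA/videos/EpicKitchens/videos_ht256px/train/P01/P01_01.MP4",
], check=True)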

Please see docs/DATASETS.md for setting up other datasets.

Training and evaluating models

If you want to train AVT models, you will need pre-trained models from timm. We have experiments that use the following models:

$ mkdir DATA/pretrained/TIMM/
$ wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch16_224_in21k-e5005f0a.pth -O DATA/pretrained/TIMM/jx_vit_base_patch16_224_in21k-e5005f0a.pth
$ wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth -O DATA/pretrained/TIMM/jx_vit_base_p16_224-80ecf9dd.pth
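
Optionally, you can sanity-check that both downloads completed; a truncated file typically fails later with "EOFError: Ran out of input" when loaded. This is a small sketch of my own, not part of the repo:

import torch

# Each checkpoint should load as a plain state dict of weight tensors.
for name in ["jx_vit_base_patch16_224_in21k-e5005f0a.pth",
             "jx_vit_base_p16_224-80ecf9dd.pth"]:
    state = torch.load(f"DATA/pretrained/TIMM/{name}", map_location="cpu")
    print(name, len(state), "entries")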

The code uses hydra 1.0 for configuration, with the submitit plugin for launching jobs via SLURM. We provide a launch.py script that is a wrapper around the training scripts and can run jobs locally or launch distributed jobs. The configuration overrides for a specific experiment are defined by a TXT file. You can run a config with:

$ python launch.py -c expts/01_ek100_avt.txt

where expts/01_ek100_avt.txt can be replaced by any TXT config file.

By default, the launcher will submit the job to a SLURM cluster. However, you can run it locally using one of the following options:

  1. -g to run locally in debug mode with 1 GPU and 0 workers. This allows you to place pdb.set_trace() calls to debug interactively.
  2. -l to run locally using all available GPUs on the local machine.

This will run training, with validation every few epochs. You can also run only testing using the -t flag. When testing a pre-trained model, don't forget to set the checkpoint to load weights from, using something like this in the TXT experiment config:

train.init_from_model=[[path/to/checkpoint.pth]]

The outputs will be stored in OUTPUTS/<path to config>. This includes tensorboard files that you can use to visualize training progress.

Model Zoo

EPIC-Kitchens-100

Backbone         | Head  | Class-mean Recall@5 (Actions) | Config                         | Model
AVT-b (IN21K)    | AVT-h | 14.9                          | expts/01_ek100_avt.txt         | link
TSN (RGB)        | AVT-h | 13.6                          | expts/02_ek100_avt_tsn.txt     | link
TSN (Obj)        | AVT-h | 8.7                           | expts/03_ek100_avt_tsn_obj.txt | link
irCSN152 (IG65M) | AVT-h | 12.8                          | expts/04_ek100_avt_ig65m.txt   | link

Late fusing predictions

For comparison to methods that use multiple modalities, you can late-fuse predictions from multiple models using functions from notebooks/utils.py. For example, to compute the late-fused performance reported in Table 3 (val) as AVT+ (which obtains 15.9 recall@5 for actions):

from notebooks.utils import *
CFG_FILES = [
    ('expts/01_ek100_avt.txt', 0),
    ('expts/03_ek100_avt_tsn_obj.txt', 0),
]
WTS = [2.5, 0.5]
print_accuracies_epic(get_epic_marginalize_late_fuse(CFG_FILES, weights=WTS)[0])
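
Conceptually, the late fusion is a weighted combination of each model's per-segment class scores before metrics are computed; here is a toy sketch of the idea only (the helper above additionally handles marginalizing over verb/noun/action spaces, which this sketch does not):

import numpy as np

def late_fuse(scores_per_model, weights):
    """Weighted sum of per-model scores; each array is (num_segments, num_classes)."""
    return sum(w * s for w, s in zip(weights, scores_per_model))

# Illustrative shapes and the weights used above
fused = late_fuse([np.random.rand(4, 10), np.random.rand(4, 10)], [2.5, 0.5])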

Please see docs/MODELS.md for test submission and models on other datasets.

License

This codebase is released under the license terms specified in the LICENSE file. Any imported libraries, datasets or other code follows the license terms set by respective authors.

Acknowledgements

The codebase was built on top of facebookresearch/VMZ. Many thanks to Antonino Furnari, Fadime Sener and Miao Liu for help with prior work.

avt's People

Contributors

anirudh257, rohitgirdhar


avt's Issues

fps of extracted irCSN feature

Hi. Thanks for your good work on AVT. I have some questions about your extracted irCSN features. What is the fps of the provided ek100/ig65m_ftEk100_logits_10fps1s/rgb/data.mdb and ek55/ig65m_ftEk55train_logits_25fps/rgb/data.mdb? It looks strange to me...

Unable to reproduce val results

Hi @rohitgirdhar, I'm trying to test the irCSN-152 (IG65M) model for EK-55. I used the model https://dl.fbaipublicfiles.com/avt/checkpoints/expts/10_ek55_avt_ig65m.txt/0/checkpoint.pth and the config expts/10_ek55_avt_ig65m.txt, and added these lines to the config:

test_only=true
train.init_from_model=[[${cwd}/DATA/models/10_ek55_avt_ig65m.pth]]

However, I'm getting

[2021-10-05 12:37:04,999][root][INFO] - Reading from resfiles
[2021-10-05 12:37:11,072][func.train][INFO] - []
[2021-10-05 12:37:11,073][root][INFO] - iter_time: 0.294328
[2021-10-05 12:37:11,073][root][INFO] - data_time: 0.135377
[2021-10-05 12:37:11,074][root][INFO] - loss: 6.164686
[2021-10-05 12:37:11,074][root][INFO] - acc1/action: 7.351763
[2021-10-05 12:37:11,074][root][INFO] - acc5/action: 19.931891
[2021-10-05 12:37:11,074][root][INFO] - cls_action: 6.134162
[2021-10-05 12:37:11,074][root][INFO] - feat: 0.030524

which is far from the reported 14.4 and 31.7 top-1/5 performance. Do you know what might be wrong here?

About EK55.

Hi, thanks for your work. I'm confused about the details of fusing your results on the EPIC-KITCHENS-55 test set. Could you please give more explanation? Also, it would be very kind of you to provide your submission file for the EK55 test set.

Data preparation (EK)

Hi There!
As mentioned, for faster processing all videos are resized to 256px height.
However, when running evaluation on the provided ckpt (expts/09_ek55_avt.txt), I got different results on the resized/original-size versions. Were the reported top-{1,5} accuracies computed on the original-size videos?
My results are:
original videos' size:

[2022-02-23 18:09:48,322][root][INFO] - acc1/action: 12.369477
[2022-02-23 18:09:48,323][root][INFO] - acc5/action: 29.859437

on the resized videos:
[2022-02-23 18:09:56,238][root][INFO] - acc1/action: 11.646586
[2022-02-23 18:09:56,238][root][INFO] - acc5/action: 26.666666

I might have missed something?
Thank you very much for your work!

Is there any alternative implementation for 'transformers.GPT2Model'?

Hi, Rohit! May I ask whether I can replace the model definition that uses the 'transformers' package with some other package or code? That would solve the issue described below.
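
For example, would something along these lines, built only from torch primitives, be a reasonable substitute? (Just a sketch of what I have in mind, not your implementation.)

import torch
import torch.nn as nn

# A torch-only causal transformer over frame features; each position can only
# attend to itself and earlier positions via the additive -inf mask.
class CausalDecoder(nn.Module):
    def __init__(self, dim=768, n_head=12, n_layer=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_head)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layer)

    def forward(self, feats):  # feats: (time, batch, dim)
        t = feats.size(0)
        mask = torch.triu(
            torch.full((t, t), float('-inf'), device=feats.device), diagonal=1)
        return self.encoder(feats, mask=mask)

out = CausalDecoder()(torch.randn(10, 2, 768))  # -> (10, 2, 768)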

The 'transformers' package calls the 'tokenizers' package, which requires GLIBC_2.18. My cluster only has an old version of GLIBC, and I have no sudo access to upgrade it. I got the following error when running it on the cluster:

File "/home/zhaoc/a100/tools/miniconda3/envs/avt2/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/models/__init__.py", line 19, in <module>
 File "/home/zhaoc/a100/tools/miniconda3/envs/avt2/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/models/mt5/__init__.py", line 31, in <module>
 File "/home/zhaoc/a100/tools/miniconda3/envs/avt2/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/models/t5/tokenization_t5.py", line 26, in <module>
 File "/home/zhaoc/a100/tools/miniconda3/envs/avt2/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/tokenization_utils.py", line 25, in <module>
 File "/home/zhaoc/a100/tools/miniconda3/envs/avt2/lib/python3.7/site-packages/transformers-4.2.2-py3.8.egg/transformers/tokenization_utils_base.py", line 87, in <module>
 File "/home/zhaoc/a100/tools/miniconda3/envs/avt2/lib/python3.7/site-packages/tokenizers/__init__.py", line 79, in <module>
   from .tokenizers import (
ImportError: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /home/zhaoc/a100/tools/miniconda3/envs/avt2/lib/python3.7/site-packages/tokenizers/tokenizers.cpython-37m-x86_64-linux-gnu.so)

Thanks.
Chen

Long-term anticipation

Hi, thanks for repo!
Excuse me, could you please tell me how to perform long-term anticipation as shown in the paper? Have I missed this part in the code, or is it not implemented here yet?

Couple questions about classification loss

Hi @rohitgirdhar,

Thanks for your great work -- I found it very interesting and plan to use it in my work! I was hoping to clear up exactly how the loss functions are working with the feature decoding since I was a little confused:

The decoder at each timestep from 1..t outputs features (in a causal manner), which are then passed through a linear layer to obtain predicted frame features. Another linear layer on top of this then predicts a distribution over action classes, so we have t action predictions. Do the predictions for timestep 1 use the action label from timestep 2? As I understand it, the predictions at timestep t represent the action at timestep t+1 (the next action we want to anticipate). Based on the implementation, I was wondering whether the classification loss is computed against the next frame's labels, so that the first frame's label is not used? Sorry if this is confusing; I hope you can help clear up my understanding!
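
To make my understanding concrete, here is a toy sketch of the shifted-target loss I am describing (my own illustration, not the repo's code):

import torch
import torch.nn.functional as F

# Next-step supervision: the prediction at timestep t is compared against the
# label at timestep t+1, so the last output anticipates the future action and
# the first frame's label is never used as a target.
B, T, C = 2, 10, 5                       # batch, timesteps, action classes
logits = torch.randn(B, T, C)            # per-timestep predictions from the decoder
labels = torch.randint(0, C, (B, T))     # per-timestep ground-truth actions
loss = F.cross_entropy(logits[:, :-1].reshape(-1, C), labels[:, 1:].reshape(-1))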

-l seems to go into deadlock

Hi, I'm able to train models with -g, but when I try to train on my local machine with multiple GPUs with -l, the model doesn't start training after the commands for hydra are printed, and a few jobs are spawned only on GPU 0. It seems to be some deadlock caused by submitit/hydra. Do you know what is going on here? Any help is greatly appreciated!

Can't create video dataset

Hi,
I am trying to run the code for the EK100 dataset, but I notice that it is not able to compute any video clips for dataset creation, as here:

_dataset = hydra.utils.instantiate(dataset_cfg,

The _dataset variable is an EPICKitchens class object, since the _target_ variable defined here is of the same type:

_target_: datasets.epic_kitchens.EPICKitchens

Therefore it is not able to execute this line:

_dataset.video_clips.compute_clips(data_cfg.num_frames,

and it executes the except branch after this.

Can you help me in this regard? Maybe I am doing something wrong!
Thanks :)

How to print the noun/verb class accuracies for EGTEA and Epic-Kitchens?

I am trying to replicate the results for Epic-Kitchens and EGTEA by training end-to-end on the entire dataset. I downloaded the videos from the server and put them in the same paths as mentioned in the repository. But when I check the logs, I can't see the noun/verb class accuracies.

[2021-12-12 07:11:42,764][root][INFO] - Reading from resfiles
[2021-12-12 07:11:42,764][root][INFO] - Reading from resfiles
[2021-12-12 07:11:48,536][root][INFO] - iter_time: 5.330609
[2021-12-12 07:11:48,537][root][INFO] - data_time: 4.533271
[2021-12-12 07:11:48,537][root][INFO] - loss: 10.175363
[2021-12-12 07:11:48,537][root][INFO] - acc1/action: 7.338158
[2021-12-12 07:11:48,537][root][INFO] - acc5/action: 20.124647
[2021-12-12 07:11:48,537][root][INFO] - cls_action: 5.797386
[2021-12-12 07:11:48,537][root][INFO] - past_cls_action: 4.045731
[2021-12-12 07:11:48,537][root][INFO] - feat: 0.332246
[2021-12-12 07:11:48,557][func.train][INFO] - []
[2021-12-12 07:11:48,558][root][INFO] - iter_time: 5.330609
[2021-12-12 07:11:48,558][root][INFO] - data_time: 4.533271
[2021-12-12 07:11:48,558][root][INFO] - loss: 10.175363
[2021-12-12 07:11:48,558][root][INFO] - acc1/action: 7.338158
[2021-12-12 07:11:48,558][root][INFO] - acc5/action: 20.124647
[2021-12-12 07:11:48,558][root][INFO] - cls_action: 5.797386
[2021-12-12 07:11:48,559][root][INFO] - past_cls_action: 4.045731
[2021-12-12 07:11:48,559][root][INFO] - feat: 0.332246
[2021-12-12 07:11:48,566][func.train][INFO] - Logging: resseting all meters
[2021-12-12 07:11:48,566][func.train][INFO] - Logging: resseting all meters done, updating epoch to %d
[2021-12-12 07:11:48,566][func.train][INFO] - Logging: resseting all meters
[2021-12-12 07:11:48,566][func.train][INFO] - Logging: resseting all meters done, updating epoch to %d
[2021-12-12 07:11:48,569][func.train][INFO] - Storing checkpoints every 60.00 mins

What changes need to be made to print the noun/verb accuracies as well?

Can't produce results for pre-trained TSN; takes a long time

Hi,

I was wondering how long it would take to evaluate a pre-trained TSN model on EK100 data? I ran it for a long time, and it won't stop or produce a result. I'm using 4 12GB GPUs with the following settings:

  • batch_size=1
  • train.num_epoch=0 (no training, just eval)
  • +model.future_predictor.n_head=4
  • +model.future_predictor.n_layer=4
  • +model.future_predictor.inter_dim=64

However, my run never outputs anything new after this log:
[screenshot of the last log lines]

How long would a pre-trained model take just to evaluate?

I just want to see how well the pre-trained model works and reproduce the same numbers, but the program just keeps running; no more logs are produced after this screenshot. I even reduced the batch size, heads, layers, etc., so that it should run faster.

Doubts regarding the Causality

@rohitgirdhar Thanks for this excellent work and releasing the code.

I went through the code but I am unable to see where the masking of inputs occurs. In

outputs = self.gpt_model(inputs_embeds=feats,
you directly pass the entire input features to the GPT model. Do the masking and the per-timestep outputs get taken care of by GPT itself?
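
For reference, HuggingFace's GPT2Model applies causal self-attention internally, so the output at each timestep depends only on that timestep and earlier ones; a standalone toy example with illustrative sizes (not the repo's configuration):

import torch
from transformers import GPT2Config, GPT2Model

# GPT-2's attention is causal by construction: position t attends only to <= t.
gpt = GPT2Model(GPT2Config(n_embd=256, n_layer=2, n_head=4))
feats = torch.randn(2, 10, 256)                      # (batch, time, feature_dim)
hidden = gpt(inputs_embeds=feats).last_hidden_state  # (2, 10, 256), causal per timestep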

Generate submitted file

We can only see the accuracy on the validation set when we run
python launch.py -c expts/02_ek100_avt_tsn.txt -l -t.

Can you tell us how to generate the submission file for the EK100 challenge?

Unable to reproduce '01_ek100_avt.txt' val result

Hi @rohitgirdhar, I'm trying to reproduce the experiment '01_ek100_avt.txt'. After training, I read my evaluation results using your example code in README.md, which is from notebooks.utils, and I got these outputs:

[('expts/01_ek100_avt.txt', 0)] Accuracies verb/noun/action: 32.3 77.3 22.3 51.8 13.6 32.8
[('expts/01_ek100_avt.txt', 0)] Mean class top-1 accuracies verb/noun/action: 6.0 7.9 1.5
[('expts/01_ek100_avt.txt', 0)] Recall@5 verb/noun/action: 22.3 28.7 12.0
[('expts/01_ek100_avt.txt', 0)] Recall@5 many shot verb/noun/action: nan nan 12.0
[('expts/01_ek100_avt.txt', 0)] Recall@5 tail verb/noun/action: 13.9 19.3 8.9
[('expts/01_ek100_avt.txt', 0)] Recall@5 unseen verb/noun/action: 27.5 26.0 12.6

My reproduced result is 12.0, lower than your reported result of 14.9. Am I doing the evaluation process correctly?

The only thing I changed in training is the data-reading interface. Could you please help me check whether this could be the reason my reproduction fails? The alteration is as below.

I wrote a PictureReader to read frames iteratively in place of the DefaultReader:

class PictureReader(Reader):
    def forward(self, video_path, start, end, fps, df_row, **kwargs):
        del df_row
        # Convert start/end times (in seconds) to 1-indexed frame numbers
        start = int(max(start * fps, 1))
        end = int(end * fps)
        if end <= start:
            return torch.Tensor()

        # Frames are stored as JPEGs in a folder named after the video
        video_path = video_path[:-4]
        video = []
        for i in range(start, end):
            picture = torchvision.io.read_image(
                video_path + '/frame_{:010d}.jpg'.format(i))
            video.append(picture.permute(1, 2, 0))  # CHW -> HWC
        return torch.stack(video), {}, {}

As for get_frame_rate(), I got the fps from the annotation df, so I added a line in __init__ of class EPICKitchens:
df['fps'] = df['start_frame'] / df['start']

Regarding sampling method and strategy

Hi again :)
I have some questions regarding the frame sampling method; it would be highly appreciated if you could clear them up... :)
Looking at the expt. 09_ek55_avt settings, the parameters are:

  • tau_o: 20sec
  • original fps (of EK): 60
  • req_fps: 1
  • frames_per_clip: 10
  • sampling_strategy: "last_clip"

As I see it, in practice, out of a 20-second input clip (1200 frames), only 10 frames are sampled, where each frame represents a 1-second sub-clip. As a result (due to req_fps=1 and sampling_strategy='last_clip'), the model only looks at the last 10 seconds. Is that correct? If so, what is the actual role of tau_o?
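
To spell out my reading of these settings, here is my own arithmetic (not the repo's code):

# With req_fps=1 and sampling_strategy='last_clip', the 10 sampled frames
# would cover only the final 10 seconds of the 20-second (1200-frame) window.
tau_o, orig_fps, req_fps, frames_per_clip = 20, 60, 1, 10
clip_span_s = frames_per_clip / req_fps                 # 10 s of video actually seen
start_s = tau_o - clip_span_s                           # 'last_clip' -> the last 10 s
sampled = [int((start_s + i / req_fps) * orig_fps) for i in range(frames_per_clip)]
print(sampled)  # [600, 660, 720, 780, 840, 900, 960, 1020, 1080, 1140]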

Thank you very much!

Unable to initialize from the pretrained model

Hi,
I am facing an issue while initializing with the pretrained model for the experiment 13_50s_avt.txt (added logs at the end). I believe that the environment setup is correct since the model is running (training without pretrained weights). I have followed the steps mentioned in the readme to download the pretrained models. It would be really helpful if you could share any insight to debug this.
Thank you.

Logs

[2021-10-19 22:09:20,194][func.train][WARNING] - Could not init from /user/AVT-main/DATA/pretrained/TIMM/jx_vit_base_patch16_224_in21k-e5005f0a.pth: []
[2021-10-19 22:09:20,195][func.train][WARNING] - Unused keys in /user/AVT-main/DATA/pretrained/TIMM/jx_vit_base_patch16_224_in21k-e5005f0a.pth: ['pre_logits.fc.bias', 'pre_logits.fc.weight', 'head.bias', 'head.weight']

Cannot train/test with video data

How can I train/test with video data?

When I run python launch.py -c expts/09_ek55_avt.txt -t -g, I obtain the warning "No video_clips present":

[2022-02-08 17:15:53,778][func.train][INFO] - Computing clips...
[2022-02-08 17:15:53,779][func.train][WARNING] - No video_clips present
[2022-02-08 17:15:53,779][func.train][INFO] - Created dataset with 23430 elts

I have downloaded and cropped the videos and set up the same folder structure as in the README file.

Executing hasattr(_dataset, 'video_clips') results in False. How do I add video_clips to _dataset so that compute_clips in datasets/data.py executes properly?

Question about Object/Image Features

Hi,
I was just wondering how exactly the object features are used in the model? At each timestep, does the model consider both image features and object features concatenated? Are you familiar with how to extract these object features from the raw frames (e.g., will it work on lower-resolution images)? Thanks!

Recall / Accuracy Questions on Paper

Hi Rohit, just had a few questions when re-reading the paper:

  1. In Table 4, it seems that accuracy with the newly proposed model is worse, but recall is better (compared to other models); do you have some insight into this tradeoff? Your loss is optimizing for accuracy, so I'm just curious whether there's anything else you might be doing.

  2. Do you have any results for AVT-h on AVT-b without fine-tuning AVT-b? If not, do you have any idea how it performs without fine-tuning?

  3. Why does replacing the backbone with AVT-b significantly boost recall, but not top-1 accuracy?

Data Augmentation and End-to-end Training

Hi,

I'm wondering if you have any results regarding performance without data augmentation and end-to-end training? If not, do you know how much the augmentations helped in terms of overfitting and performance? Currently, I'm trying to run longer time-horizon experiments without augmentations, but they don't seem to perform much better than the 2-second-horizon experiments. Thank you!

Job exiting automatically after a while

I am training the Epic-55 dataset end-to-end only on the videos. I followed all the steps in the repository and was able to set up the training. My model trains for a while but gets canceled automatically.

I get this error:

[screenshot of the error]

The submitted batch file is:

#SBATCH --cpus-per-task=10
#SBATCH --error=/scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_0_log.err
#SBATCH --gres=gpu:v100:4,nvme:100
#SBATCH --job-name=AVT
#SBATCH --mem=200GB
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --open-mode=append
#SBATCH --output=/scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_0_log.out
#SBATCH --partition=gpu
#SBATCH --signal=USR1@120
#SBATCH --time=4320
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --output /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_%t_log.out --error /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_%t_log.err --unbuffered /scratch/project_2000255/anwer/antic_trans/AVT/avt_env/bin/python -u -m submitit.core._submit /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j

Is this issue due to using a different number of nodes than specified in the config? I don't have 4 nodes of 8 GPUs.

The input/output feature dimensions of Transformer Encoder and Causal Transformer Decoder?

Hi, thanks for your great project!
I am wondering about the input/output feature dimensions of the Transformer Encoder. The description in Section 4.1 of the paper says the input/output feature dimensions are both 768D; is that right? However, Section 4.4 says the input feature dimension of the Causal Transformer Decoder is 2048D. What is the output feature dimension of the Causal Transformer Decoder? And is there a dimension conversion (768D -> 2048D) before the Causal Transformer Decoder?

Problem when reproducing with Breakfast/50Salad data

Hi @rohitgirdhar, thank you for the great idea in video anticipation.

I'm reproducing your work with Breakfast/50 Salads data, and I've faced 2 problems.

  1. Do I have to put the Breakfast videos and the 50 Salads videos in the same folder?

  2. While training with 50 Salads, ffmpeg returns an error: marker does not match f_code (screenshot below). Training goes on despite this error, but how can I handle it? I've tried converting the .avi files to .mp4 and resizing them, but it didn't work. Is this because of a codec problem?
    [screenshot of the ffmpeg error]

Thank you

Reproduce models using avt-b on EK55

Hi,
Thanks for sharing the code for this great work!
I have been struggling a bit to reproduce the results for expt. 09_ek55_avt.txt. I simply run python launch.py -l -c 09_ek55_avt.txt using the pretrained ViT-16 from TIMM and a batch size of 16 on 8 GPUs. The results I get:

[2021-10-27 08:15:14,840][root][INFO] - loss: 9.935580
[2021-10-27 08:15:14,842][root][INFO] - acc1/action: 9.955929
[2021-10-27 08:15:14,844][root][INFO] - acc5/action: 23.895375
[2021-10-27 08:15:14,844][root][INFO] - cls_action: 5.674763
[2021-10-27 08:15:14,847][root][INFO] - past_cls_action: 3.956052
[2021-10-27 08:15:14,848][root][INFO] - feat: 0.304765

In addition, I downloaded the corresponding checkpoint and ran test-only using test_only=true.
The result I got:

[2021-10-27 16:17:13,831][root][INFO] - loss: 9.941379
[2021-10-27 16:17:13,831][root][INFO] - acc1/action: 9.867363
[2021-10-27 16:17:13,831][root][INFO] - acc5/action: 23.814309
[2021-10-27 16:17:13,832][root][INFO] - cls_action: 5.677493
[2021-10-27 16:17:13,832][root][INFO] - past_cls_action: 3.959041
[2021-10-27 16:17:13,832][root][INFO] - feat: 0.304845

Worth mentioning that I train and test on the resized version of EK.

Am I doing something wrong?
BTW, what is the actual difference between 09_ek55_avt.txt and 09_ek55_avt_forAR.txt?

Thank you in advance!

Can I train AVT on RGB frames?

Hi, the official EPIC-Kitchens release provides both the RGB frames and the raw video sources. I want to ask: can I train your model on the RGB frames, since the raw videos are too large for me to download?
If yes, would it lead to lower performance?

thanks!

How to get attention visualizations for AVT?

Hi there, thanks so much for the interesting work!

I'm a bit confused about how to reproduce the precise attention map visualizations in the AVT paper, showing which frames attend strongly to which other frames in AVT-h. Is there any code available for reproducing those visualizations?

Thanks in advance for your help!
Vineet

Use RULSTM feature

Hi, I am trying to use the RULSTM features as the backbone, but I'm not sure which expts config is required for it.

Questions about Paper Results

Hi,

Do the results in Table 2 use the full loss function (L_cls, L_feat)? Also, in Table 7, do you use only the RGB modality?

Loading pretrained ViT base as a backbone

Hi there!
I was wondering what the difference is between the ViT ckpt file you supplied and simply loading timm's vit_base_patch16_224 via timm.create_model with the pretrained=True flag. Basically, they were both pretrained on IN1k. Am I correct?
Thanks!

EOFError: Ran out of input

Traceback (most recent call last):
File "train_net.py", line 44, in
main() # pylint: disable=no-value-for-parameter # Uses hydra
File "lib/python3.7/site-packages/hydra/main.py", line 95, in decorated_main
config_name=config_name,
File "lib/python3.7/site-packages/hydra/_internal/utils.py", line 401, in _run_hydra
overrides=overrides,
File "lib/python3.7/site-packages/hydra/_internal/utils.py", line 458, in _run_app
lambda: hydra.run(
File "lib/python3.7/site-packages/hydra/_internal/utils.py", line 222, in run_and_report
raise ex
File "lib/python3.7/site-packages/hydra/_internal/utils.py", line 219, in run_and_report
return func()
File "lib/python3.7/site-packages/hydra/_internal/utils.py", line 461, in
overrides=overrides,
File "lib/python3.7/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "lib/python3.7/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "lib/python3.7/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "train_net.py", line 35, in main
getattr(func, cfg.train.fn).main(cfg)
File "AVT/func/train.py", line 689, in main
init_model(model_to_init, ckpt_path, ckpt_modules_to_keep, logger)
File "AVT/func/train.py", line 467, in init_model
checkpoint = torch.load(ckpt_path, map_location="cpu")
File "lib/python3.7/site-packages/torch/serialization.py", line 795, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "lib/python3.7/site-packages/torch/serialization.py", line 1002, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

I meet this problem when I load the checkpoint and I can't figure it out, so I wonder why this happens.
Here is the checkpoint:
AVT/DATA/pretrained/TIMM/jx_vit_base_p16_224-80ecf9dd.pth

Some questions about future_prediction.py

Hi,

I'm trying to re-implement the model from the paper, and for a time horizon of 2 seconds I'm able to reach the same recall; however, increasing that time horizon didn't result in an increase in accuracy or recall. I noticed that here:

self.assign_to_centroids = assign_to_centroids

there's a block of code on kmeans/centroids that is not described in the paper, so I'm unsure whether this is the insight I'm missing for longer horizons. Would you be able to help shed some light on what this code is doing? Additionally, do you have any tips on how to see an increase in accuracy/recall when increasing the horizon time?

Env Create Conflicts

Hi,

Your work is very impressive. Actually, I am a beginner in this field. I tried to use the env.yaml file that you provide to create a conda env through conda env create -f env.yaml python=3.7.7, but it reports that conflicts were found. What should I do?

Thank you in advance!
