
untrimmednet's Introduction

UntrimmedNet for Action Recognition and Detection

We provide the code and models for our CVPR paper (arXiv preprint):

  UntrimmedNets for Weakly Supervised Action Recognition and Detection
  Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool
  in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Updates

  • October 16th, 2018
    • Release the learned models trained only on the train set of the ActivityNet1.2 dataset. Note that our previously released ActivityNet models were trained on the train+val set.
  • September 19th, 2017
    • Release the learned models on the THUMOS14 and ActivityNet1.2 datasets.
  • August 20th, 2017
    • Release the model protos.

Guide

The training of UntrimmedNet is composed of three steps:

  • Step 1: extract action proposals (or shot boundaries) for each untrimmed video. We provide a sample of detected shot boundaries on ActivityNet (v1.2) under the folders data/anet1.2/anet_1.2_train_window_shot/ and data/anet1.2/anet_1.2_val_window_shot/.
  • Step 2: construct file lists for training and validation. There are two file lists: one containing the file path, number of frames, and label; the other containing the shot file path and number of frames (examples are in the folder data/anet1.2/; see the sketch after this list).
  • Step 3: train UntrimmedNets using our modified caffe: https://github.com/yjxiong/caffe/tree/untrimmednet
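
For Step 2, a minimal sketch of generating the two file lists might look like the following; the frame-directory layout, frame-file extension, output file names, and label mapping here are all assumptions for illustration, not part of the released code.

  import os

  frame_root = 'data/anet1.2/frames'      # assumed layout: one folder of extracted frames per video
  video_labels = {'3HHAEmr0Q34': 3}       # assumed mapping: video name -> class index

  with open('anet_1.2_train_rgb_list.txt', 'w') as vid_list, \
          open('anet_1.2_train_shot_list.txt', 'w') as shot_list:
      for video, label in sorted(video_labels.items()):
          frame_dir = os.path.join(frame_root, video)
          num_frames = len([f for f in os.listdir(frame_dir) if f.endswith('.jpg')])
          # file list 1: file path, number of frames, and label
          vid_list.write('%s %d %d\n' % (frame_dir, num_frames, label))
          # file list 2: shot file path and number of frames
          shot_file = 'data/anet1.2/anet_1.2_train_window_shot/%s_window_shot.txt' % video
          shot_list.write('%s %d\n' % (shot_file, num_frames))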

The testing of UntrimmedNet for action recognition is based on a temporal sliding window and top-k pooling.
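
As an illustration of this testing scheme, here is a minimal sketch of sliding-window scoring with top-k pooling; the window length, stride, and k below are illustrative values, not necessarily the settings used in the paper.

  import numpy as np

  def sliding_window_topk(frame_scores, win_len=15, stride=15, k=3):
      # frame_scores: (num_frames, num_classes) array of per-frame class scores
      windows = []
      for start in range(0, max(1, len(frame_scores) - win_len + 1), stride):
          # average-pool the scores inside each temporal window
          windows.append(frame_scores[start:start + win_len].mean(axis=0))
      windows = np.stack(windows)          # (num_windows, num_classes)
      k = min(k, len(windows))
      # top-k pooling: per class, average the k highest window scores
      topk = np.sort(windows, axis=0)[-k:]
      return topk.mean(axis=0)             # (num_classes,) video-level score

  video_score = sliding_window_topk(np.random.rand(300, 101))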

The testing of UntrimmedNet for action detection is based on a simple baseline (see the code in matlab/).
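
A hedged sketch of such a baseline is given below. The video-level score gate (0.1) and the relative frame-score mask follow the thresholding visible in the matlab code quoted in the issues further down; grouping the masked frames into contiguous segments is an assumption of this sketch.

  import numpy as np

  def detect(c_score, f_score, threshold=0.5):
      # c_score: (num_classes,) video-level class scores
      # f_score: (num_frames, num_classes) frame-level class scores
      detections = []
      for cls in range(len(c_score)):
          if c_score[cls] <= 0.1:          # gate on the video-level score, as in the matlab snippet
              continue
          scores = f_score[:, cls]
          mask = scores > scores.max() * threshold
          # group consecutive masked frames into (start, end, class) segments (assumption)
          start = None
          for t, on in enumerate(mask):
              if on and start is None:
                  start = t
              elif not on and start is not None:
                  detections.append((start, t, cls))
                  start = None
          if start is not None:
              detections.append((start, len(mask), cls))
      return detections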

Downloads

You can download our trained models on the THUMOS14 and ActivityNet datasets using the scripts scripts/get_reference_model_thumos.sh and scripts/get_reference_model_anet.sh.

untrimmednet's People

Contributors

lmwang9527, wanglimin


untrimmednet's Issues

prefetch data error during training

Dear Limin,

While training the optical flow network, I got an error: training quit while prefetching data in the SequenceDataLayer.
The log is as follows:
*** Aborted at 1540548898 (unix time) try "date -d @1540548898" if you are using GNU date ***
PC: @ 0x7f46cdec8f3d caffe::SequenceDataLayer<>::InternalThreadEntry()
*** SIGFPE (@0x7f46cdec8f3d) received by PID 29760 (TID 0x7f467dc61700) from PID 18446744072869416765; stack trace: ***
@ 0x7f46cce4f4b0 (unknown)
@ 0x7f46cdec8f3d caffe::SequenceDataLayer<>::InternalThreadEntry()
@ 0x7f46cca015d5 (unknown)
@ 0x7f46cc7da6ba start_thread
@ 0x7f46ccf2141d clone
@ 0x0 (unknown)

The solver settings are as follows:
net: "../models/four_class/temporal_102_class_hard_bn_inception_train_val.prototxt"

# testing parameter
test_iter: 2710
test_interval: 500
test_initialization: true

# output
display: 100
average_loss: 100
snapshot: 500
snapshot_prefix: "../models/four_class/flow_finetune/temporal_untrimmednet_hard_bn_inception_average_seg3_top3"
debug_info: false

# learning rate
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: [10000, 15000, 20000]
max_iter: 40000
iter_size: 1

# parameter of SGD
momentum: 0.9
weight_decay: 0.0005
clip_gradients: 20

# GPU setting
solver_mode: GPU
device_id: [1,0]
richness: 200

I fine-tuned the network from the weights anet1.2_temporal_untrimmednet_hard_bn_inception.caffemodel, with batch_size = 5.

Could you help indicate what's wrong in my settings?
Thanks very much!

Post-processing scheme for detection

Hi there,

I'm running into issues when trying to understand your detection scheme.

if c_score > 0.1
mask = (f_score > max(f_score)*threshold);

From my understanding, c_score is the video-level classification score after aggregating the classification scores with attention, which you also used for recognition.

Could you please help me understand where this 0.1 comes from? Furthermore, I have used your released model to extract test_score.mat. In order to reproduce the results shown in the paper, I'm wondering if I can just feed it into your matlab module without modification?

About the batch size.

Hi, thanks for sharing the code!
I'm a little confused about the batch size. In your paper, the batch size is set to 256. However, the batch size in the prototxt is set to 7, and the actual batch size should be doubled because iter_size=2.
(1) What exactly is the batch size used for producing the results in your paper?
(2) Let's say the batch size is m; does it mean that there are m videos in a batch, where each video contains 7x3 snippets?

Trained model for AN v1.3?

Thanks much for releasing this amazing work.

I am wondering, besides AN v1.2, are there UntrimmedNet models trained on AN v1.3 that can be shared? Thanks!

About the training data organization details of THUMOS'14

When training on THUMOS14, trimmed videos are simply regarded as proposals.
Does that mean we need to sample N trimmed videos (where N is the number of proposals in an untrimmed video) to form a training sample?
Am I right?
Thanks !

THUMOS 14 Input Resolution

The THUMOS proto file indicates that the network takes 224x224 as input, but the THUMOS 14 test video files have a resolution of 320x180.

What operations did you perform to get a 224 crop?

The Number of Flow Frames and RGB Frames

In your file lists anet_1.2_untrimmed_train_tvl1_list.txt and anet_1.2_untrimmed_train_rgb_list.txt, the number of flow frames is always one less than the number of RGB frames. However, when I extract flow and RGB images with the TSN toolkit, I find there are just as many RGB images as flow images. What is the reason behind this?

label tool

Do you know of any convenient temporal annotation tool? Thank you.

How to implement L1-normalization for label vectors during the training process?

Thank you very much; I benefit a lot from reading your code. In Section 3.3, you point out that the L1-norm is applied to the label vector. However, in the code https://github.com/yjxiong/caffe/blob/untrimmednet/src/caffe/layers/sequence_data_layer.cpp (line 42: infile >> filename >> length >> label), it seems that the sequence_data_layer allows only one label for each video_source. How, then, do you implement L1-normalization for label vectors during the training process?
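
For what it's worth, the L1-normalization described in Section 3.3 amounts to dividing a multi-hot label vector by its sum; a minimal sketch, where the multi-hot encoding and class count are assumptions:

  import numpy as np

  num_classes = 101                 # assumed class count
  labels = [2, 7]                   # a video annotated with two action classes
  y = np.zeros(num_classes)
  y[labels] = 1.0
  y /= np.abs(y).sum()              # L1 normalization: the entries now sum to 1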

About the detection test

Hi, Limin:
Can you help me with the following questions?
1. I want to get the attention of each frame, as shown in Figure 6 of your paper; how can I do that?
2. In the detection test, what does the file "test_score.mat" in "thumos_detection.m" mean? If it is the same as "video_scores" in "eval_net.py" of TSN, then how do I define flow_test_attention and rgb_test_attention in "thumos_detection.m" at 6# and 12#, which are not calculated in TSN?
3. Do you have python code to share?

the problem about sampling

Hi, Limin
Could you release the code for shot-based sampling to get the clip proposals?
Thank you very much!

training problem on temporal model

Hi, I have met a problem when training the temporal model on my own data: the accuracy equals 0 from beginning to end, as in "Test net output #0: accuracy = 0". But training the spatial model works well. Can you tell me where I should check and revise?
Thanks.

problem about downloading the pretrained model

Hello, there is something wrong with the server hosting the pretrained model, so I cannot connect to it. Also, can you share your test code for generating test_score.mat? Thanks very much!

Code for extracting the test score

Hi, Limin.
I want to reproduce the results of temporal action detection. I found that the following file is needed:
('~/code/temporal-segment-networks/THUMOS14_evalkit_20150930/test_score.mat')
I guess this file is too big to share. Can you share the code or scripts used to extract the "test score" file? The current code is used for video-level action recognition.
Thanks!

Schedule of this project

Hi wanglimin,
I read your paper in CVPR 2017, congratulations!
I wonder how this project is progressing. I hope to see your nice work soon.

How many frames are sampled every 15 frames in the action detection process?

In Section 4 of your paper, you said "For more precise localization, we perform test every 15 frames". I am wondering how many frames are sampled every 15 frames in the action detection process; is it the same as in action recognition (i.e., 1 frame of RGB and 5 frames of flow)?

Ask for detailed guidance to reproduce the reported results

Hi @wanglimin and @yjxiong ,
Thanks very much for your open source code and models! Could you please provide a detailed README about the whole process, and the other materials necessary to reproduce the UntrimmedNet results? For example, the scripts to pre-process the data and generate "anet_1.2_train_window_shot/3HHAEmr0Q34_window_shot.txt". That would be of great help to those of us interested in your paper.
Thanks in advance!

About testset results of THUMOS14

Hi, Limin.
Where can I find the THUMOS14 frame-level results?
('~/code/temporal-segment-networks/THUMOS14_evalkit_20150930/test_score.mat')
Thanks!

About selection module

Is the shape of the soft attention weights in Equations (2) and (4) (number of clip proposals, 1)?
Thanks in advance!
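
If it helps, here is a minimal numpy sketch of softmax attention over the N clip proposals of a video, producing weights of the shape asked about; this illustrates the shapes only and is not the released implementation.

  import numpy as np

  def soft_attention(e):
      # e: (N, 1) raw attention scores for the N clip proposals of one video
      e = e - e.max()               # for numerical stability
      w = np.exp(e)
      return w / w.sum()            # (N, 1) weights that sum to 1

  weights = soft_attention(np.random.randn(8, 1))
  print(weights.shape)              # (8, 1)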

Performance of Spatial network

Hi,
I am working on a similar project, and I want to cite your work in our paper.
I need to know the performance of your spatial network on the THUMOS14 classification and detection tasks.
Could you please tell me about that?
Thanks in advance.

About the Pre-trained caffemodel only on ActivityNet_v1.2 training set.

Dear Limin Wang:

Thanks for your awesome work.

I am wondering whether the pre-trained ActivityNet_1.2 caffemodels you provided were trained on the training+validation sets or just the training set.

If it is the former, would you please provide a caffemodel pre-trained only on the ActivityNet_1.2 training set?

Looking forward to your reply!

Thanks a lot!

temporal annotations of action instances

Hi, Limin
"We do not have precise temporal annotations of action instances in training" in the paper.

But, as far as I know, the Test_Data and Validation_Data in THUMOS 2014 do have annotations.
Does that mean these annotations are not used in the training phase?

Thank you very much, looking forward to your reply!

How to set the network proto for Thumos14?

To adapt the network proto from ActivityNet1.2 to THUMOS14, I modified the dimensions of fc_action and the subsequent layers (100 → 21). However, loading the trained model on THUMOS14 then goes wrong.
Thus, I am wondering how to set the network proto for THUMOS14, and whether it would be convenient for you to release the network proto for THUMOS14.
Looking forward to your feedback, and thank you for your kind attention to this matter.
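
One common workaround for such shape mismatches (an assumption here, not the authors' recipe) is to rename the re-dimensioned layer so Caffe initializes it from scratch instead of trying to load the 100-way weights; a sketch using pycaffe's protobuf bindings, with the file and layer names assumed from this repo:

  from caffe.proto import caffe_pb2
  from google.protobuf import text_format

  net = caffe_pb2.NetParameter()
  with open('spatial_untrimmednet_soft_bn_inception_train_val.prototxt') as f:
      text_format.Merge(f.read(), net)

  for layer in net.layer:
      if layer.name == 'fc_action':
          # Caffe copies pretrained weights by layer name, so a new name means
          # this 21-way layer is freshly initialized; top/bottom blob names can stay.
          layer.name = 'fc_action_thumos'

  with open('thumos_train_val.prototxt', 'w') as f:
      f.write(text_format.MessageToString(net))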

About video-level recognition result

Hi, I'm confused: is the video-level recognition result just one label for an untrimmed video, or more? I ask because I find there are videos with more than one label in the THUMOS14 test data. With how many labels will UntrimmedNet recognize such videos, and how is that implemented?

"caffe.LayerParameter" has no field named "sequence_data_param"

It seems this version of caffe (https://github.com/yjxiong/caffe.git) does not have "sequence_data_param"; could anyone tell me which caffe version I should use? Thanks.

The error is as follows:
# sh spatial_untrimmednet_soft_bn_inception_run.sh
...
[libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 11:23: Message type "caffe.LayerParameter" has no field named "sequence_data_param".
F1126 16:51:34.954509 18856 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: ../models/spatial_untrimmednet_soft_bn_inception_train_val.prototxt
*** Check failure stack trace: ***
@ 0x7f51c72bf5cd google::LogMessage::Fail()
@ 0x7f51c72c1433 google::LogMessage::SendToLog()
@ 0x7f51c72bf15b google::LogMessage::Flush()
@ 0x7f51c72c1e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f51c77b30f1 caffe::ReadNetParamsFromTextFileOrDie()
@ 0x7f51c7784baf caffe::Solver<>::InitTrainNet()
@ 0x7f51c7785dbe caffe::Solver<>::Init()
@ 0x7f51c7785f86 caffe::Solver<>::Solver()
@ 0x40fec0 caffe::GetSolver<>()
@ 0x408103 train()
@ 0x405eb0 main
@ 0x7f51c62dd830 __libc_start_main
@ 0x4063c9 _start
@ (nil) (unknown)
Aborted (core dumped)
Done.
