
bsn-boundary-sensitive-network.pytorch's People

Contributors

v-iashin, wzmsltw


bsn-boundary-sensitive-network.pytorch's Issues

New dataset

Hi there, thanks for releasing your code. I've gone through it with the intention of adding a new dataset and, as far as I can tell, the main thing that needs to be done is to generate the video_anno file, a large JSON whose per-video entries consist of duration_second, duration_frame, feature_frame, and annotations.

I understand that the annotations field is meant to be a list of {'label': , 'segment': [start, end]}, but can you verify what the other three are meant to be? It's not clear whether duration_second is relative to a normalized FPS or is just the raw timestamp in the video. It's also unclear what the difference is between duration_frame and feature_frame.

In what units are the start and end of segment, i.e. are they relative to the actual time in the video or to a normalized time?

Additionally, I will not be rescaling each video to 100 frames. It seems like you did that for ActivityNet, but the paper doesn't mention anything similar for Thumos. What was your strategy for Thumos?

Finally, what's the story with video_df in _get_base_data? It seems to load the full dataset in every time; that's 11G uncompressed. Is this right?
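To make the format questions above concrete, here is the entry structure I am inferring. The video name, label, and all field semantics are my assumptions, not anything confirmed by the authors:

```python
# A sketch of one video_anno entry as I currently understand it.
# ASSUMPTIONS: duration_second is raw video time, duration_frame is
# the raw frame count, and feature_frame is the frame count actually
# covered by the extracted features. None of this is confirmed.
video_anno = {
    "v_example": {                     # hypothetical video id
        "duration_second": 120.5,      # assumed: real seconds
        "duration_frame": 3615,        # assumed: raw frame count
        "feature_frame": 3600,         # assumed: frames covered by features
        "annotations": [
            {"label": "LongJump",      # hypothetical class label
             "segment": [10.2, 35.7]}, # assumed: seconds, not frames
        ],
    }
}

print(sorted(video_anno["v_example"]))
```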

pem_low_iou_thres default value

In opts.py, the default value of 'pem_low_iou_thres' is 2.2. I suppose this is a small mistake; do you mean 0.2?

    parser.add_argument(
        '--pem_low_iou_thres',
        type=float,
        default=2.2)

Thanks.
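For reference, assuming 0.2 was indeed the intended value (my guess, not a confirmed fix), the corrected definition would read:

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed fix: 0.2 is a plausible low-IoU threshold for sampling
# negative proposals, whereas 2.2 can never be reached by an IoU.
parser.add_argument(
    '--pem_low_iou_thres',
    type=float,
    default=0.2)

opt = parser.parse_args([])
print(opt.pem_low_iou_thres)  # 0.2
```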

Testing framework on THUMOS

@cinjon how is your progress trying this code on THUMOS?

If I understand this correctly, I have to extract the snippet-level features using TSN (https://github.com/yjxiong/anet2016-cuhk). But anet2016-cuhk is pretrained on ActivityNet, so you first have to finetune the network on THUMOS, then extract the snippet-level features from THUMOS, and finally do the TEM, PGM and PEM training? Is this correct?

Originally posted by @tobiascz in #12 (comment)

Feature extraction questions

Hi there, I am trying to train this on another dataset and am getting a bit stuck trying to figure out how exactly you extracted the features using TSN.

I am using the mmaction repository (which is what the authors of the TSN library you suggest themselves recommend), and the approach in that repository is to oversample by first computing the crops and flips and then running those through the model.

I noticed in #3 that you said you don't remember whether you used oversample or not. Has that changed by any chance? It would save me a lot of time if you could recall it.

Also, I noticed that the size of the features in the provided CSV were each of size 400. That seems small given that the TSN outputs features of size 1024 out of the box. Is there some other setting you used to get size 400 features?

Thanks for your help.
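For anyone else hitting the 1024-vs-400 question: one way to get 400-dim features from larger TSN outputs would be a linear projection such as PCA. This is only my assumption about the reduction step, not the authors' confirmed pipeline; the shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 1024))   # 500 snippets, 1024-dim TSN features

# PCA via SVD of the centered feature matrix, keeping 400 components.
centered = feats - feats.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:400].T            # project onto top-400 components

print(reduced.shape)  # (500, 400)
```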

Error occurs when training PEM: The size of tensor a (1600) must match the size of tensor b (8000) at non-singleton dimension 0

When I train the PEM module, an error occurs as follows.
But the first time I trained this module, around a week ago, it worked well.
Could someone help solve this?


		Traceback (most recent call last):
		  File "main.py", line 297, in <module>
		    main(opt)
		  File "main.py", line 269, in main
		    BSN_Train_PEM(opt)
		  File "main.py", line 170, in BSN_Train_PEM
		    test_PEM(test_loader,model,epoch,writer,opt)
		  File "main.py", line 104, in test_PEM
		    iou_loss = PEM_loss_function(PEM_output,label_iou,model,opt)
		  File "/lvjc/project/BSN-boundary-sensitive-network.pytorch/loss_function.py", line 71, in PEM_loss_function
		    iou_loss = F.smooth_l1_loss(anchors_iou,match_iou)
		  File "/lvjc/envs/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py", line 2113, in smooth_l1_loss
		    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
		  File "/lvjc/envs/anaconda2/lib/python2.7/site-packages/torch/functional.py", line 49, in broadcast_tensors
		    return torch._C._VariableFunctions.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (1600) must match the size of tensor b (8000) at non-singleton dimension 0
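For reference, the error just says that the predictions and targets handed to smooth_l1_loss have different lengths (1600 vs 8000), which usually points at stale PGM proposal files rather than the loss itself. A pure-Python illustration of why the loss refuses these shapes (my own sketch, not the repo's code):

```python
def smooth_l1(pred, target):
    """Elementwise smooth-L1 mean; refuses mismatched lengths, as PyTorch does."""
    if len(pred) != len(target):
        raise RuntimeError(
            f"The size of tensor a ({len(pred)}) must match "
            f"the size of tensor b ({len(target)})")
    out = []
    for p, t in zip(pred, target):
        d = abs(p - t)
        # quadratic inside |d| < 1, linear outside
        out.append(0.5 * d * d if d < 1 else d - 0.5)
    return sum(out) / len(out)

try:
    smooth_l1([0.0] * 1600, [0.0] * 8000)
except RuntimeError as e:
    print(e)  # mirrors the reported error message
```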

feature of thumos14

I am stuck preparing data with sliding windows for THUMOS14. Can anyone send me the BSN code for THUMOS14? My email is [email protected]. I would really appreciate it!

Please post training curves for Thumos

Here is what mine look like using the provided code on the Thumos dataset, converted to PyTorch. I am trying to debug this because I don't think these are correct, given the poor test results.

Training curve. The line to the right is the same as the line to the left, except it uses 4 GPUs with a corresponding 4x batch size and 4x learning rate.
train_total_loss-VS-step

Test curve. Note how this never goes down; it only shows overfitting to Train.
test_total_loss-VS-step

Error: Sizes of tensors must match except in dimension 0. Got 1000 and 723 in dimension 1

Everything works well until I run this step.

python main.py --module PEM --mode inference

Here's the output result:

PEM inference start
validation subset video numbers: 4728
Traceback (most recent call last):
  File "main.py", line 298, in <module>
    main(opt)
  File "main.py", line 276, in main
    BSN_inference_PEM(opt)
  File "main.py", line 221, in BSN_inference_PEM
    for idx,(video_feature,video_xmin,video_xmax,video_xmin_score,video_xmax_score) in enumerate(test_loader):
  File "/lvjc/envs/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/lvjc/envs/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/lvjc/envs/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/lvjc/envs/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/lvjc/envs/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1000 and 723 in dimension 1
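This traceback means default_collate tried to stack proposal tensors of different lengths (1000 vs 723) into one batch. Two common workarounds (my suggestions, not a confirmed fix from the authors) are running inference with batch_size=1, or padding each sample to a common length before stacking. A pure-Python sketch of the padding idea:

```python
def pad_batch(samples, pad_value=0.0):
    """Pad variable-length 1-D samples to the longest length so they stack."""
    max_len = max(len(s) for s in samples)
    return [s + [pad_value] * (max_len - len(s)) for s in samples]

# Two videos with different proposal counts, as in the traceback.
batch = pad_batch([[0.9] * 1000, [0.8] * 723])
print([len(s) for s in batch])  # [1000, 1000]
```

A real collate_fn would also return the original lengths so the padding can be masked out downstream.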

Is the weight initialization correct?

The weight_init code for TEM in models.py is:

    @staticmethod
    def weight_init(m):
        if isinstance(m, nn.Conv2d):
            init.xavier_normal(m.weight)
            init.constant(m.bias, 0)

However, there are no nn.Conv2d modules in TEM, so this is never triggered. What is this supposed to be?
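My guess (unconfirmed) is that the check was meant to target nn.Conv1d, which TEM does use; with the non-deprecated in-place initializers it would read:

```python
import torch
import torch.nn as nn
from torch.nn import init

def weight_init(m):
    # Assumed fix: TEM is built from nn.Conv1d layers, so match those.
    # xavier_normal_ / constant_ are the in-place, non-deprecated variants.
    if isinstance(m, nn.Conv1d):
        init.xavier_normal_(m.weight)
        init.constant_(m.bias, 0)

conv = nn.Conv1d(400, 512, kernel_size=3)
weight_init(conv)
print(bool(torch.all(conv.bias == 0)))  # True
```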

Is oversample used for feature extraction?

Thank you for your work!
When I used TSN to extract features, I found that oversample was used in the code, which results in the output of 'fc-action' being 10*200 (for a single image input).
So I have to ask the following questions:

  1. Is oversample used for feature extraction? If so, is the average taken at the end?
  2. Or is center-crop (w: 224, h: 224, c: 3) adopted when extracting features from an image (340×256×3) or a flow stack (340×256×10)?
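If oversampling was used, the "average at the end" in question 1 would amount to a mean over the 10 crop/flip outputs, e.g. (my sketch of that interpretation, not confirmed):

```python
import numpy as np

rng = np.random.default_rng(0)
# fc-action output for one image with 10-crop oversampling: 10 x 200
oversampled = rng.standard_normal((10, 200))

snippet_feature = oversampled.mean(axis=0)  # average over the 10 crops/flips
print(snippet_feature.shape)  # (200,)
```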

A question about TEM module loss

Hello, Mr. Lin, many thanks; the code has been very useful for me.
Sorry to be a bother.
When I try to run it, the TEM testing loss stays almost constant from epoch 1 to 20, so the module learns nothing.
What might cause this? Is there any advice?
Looking forward to your reply, thank you very much!

feature extraction in THUMOS14

Thanks for the excellent work for the community! I have two points of confusion I hope can be answered:

  1. When reading the paper, I found that the two-stream network released at NIPS was used, but when I read the code here, TSN is used. So which one do we use?
  2. I found that the process differs between THUMOS14 and ActivityNet in Feature extraction questions #14. Could you send me the code for THUMOS14, please? My email is [email protected]

Expected time to train on Thumos?

Training on Thumos seems to go extremely quickly with the data loader that you sent out. There appear to be only ~2500 examples in the resulting dataset, which takes only ~3-5 minutes to train for 20 epochs. Is this correct?

Real time video detection

Hi,
Thanks for the code! I just wanted to ask whether BSN (Boundary-Sensitive Network) as well as BMN (Boundary-Matching Network) can be applied to real-time action detection in videos?

Training procedure

As far as I can tell, the training procedure with this repo is to first train the TEM, then generate the PGMs, then use those to train the PEM. Do you then repeat or do you do it just once?

Is that possible to release the detection demo?

Great work! Thank you for sharing!

This work is for generating action proposals.
You mentioned in your paper that, for temporal action detection on ActivityNet-1.3, you adopt the top-1 video-level classification results generated by the method of [Zhao, Y., et al., CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2017], and use the confidence scores of your proposals for retrieving detection results. I did not get how you do it. Could you please answer the following questions:

  1. The paper of Zhao, Y., et al., CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2017 mentions different methods for different tasks. Which method did you adopt? Is it SSN (Structured Segment Networks)?

  2. BSN will generate at least 100 proposals. Will you choose all proposals for action classification?

  3. Assume you only choose the top-k (e.g. k = 2, 3, 4, 5) proposals output by BSN for action classification. For each selected proposal, you do video-level action classification. Does that mean that for a proposal that starts on frame m and ends on frame n, you will generate only one action label?

  4. I wonder if it is possible to release your detection demo. That will be awesome!
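My current reading of that part of the paper (questions 2 and 3 above) is that each proposal keeps its BSN confidence and is assigned the video's top-1 class, so the detection score is a simple product. A sketch of that interpretation, with invented example values; this is an assumption, not the authors' confirmed code:

```python
# Hypothetical inputs: BSN proposals with confidences, plus the
# top-1 video-level class from the CUHK & ETHZ & SIAT classifier.
proposals = [
    {"segment": [10.2, 35.7], "score": 0.92},
    {"segment": [40.0, 55.3], "score": 0.74},
]
top1_label, top1_prob = "LongJump", 0.88

detections = [
    {"segment": p["segment"],
     "label": top1_label,
     "score": p["score"] * top1_prob}   # fuse proposal and class confidence
    for p in proposals
]
print(round(detections[0]["score"], 4))  # 0.8096
```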

PCA for extracted feature?

Thanks for your impressive work. I note that the feature extracted by TSN is 3072-dim but your default input dimension is 400. Did you use PCA for dimensionality reduction?

TEM module

Great job for action proposal.

  1. From reading your paper, I am wondering whether Conv2d is better than Conv1d in the TEM module for action-score regression?

  2. Can overlapping the sliding windows over the TEM feature sequence give us better scores?

Looking forward to your reply!
