ok1zjf / vasnet
PyTorch implementation of the ACCV 2018-AIU2018 paper Video Summarization with Attention
License: MIT License
@ok1zjf Hello,
I see there are two fields in the dataset, gtscore and gtsummary. Can you explain the difference and significance of these? I see that you have used gtscore in your code while calculating the losses. In the README you write that gtscore holds the frame-level importance scores and is used for the regression loss, while gtsummary is the ground-truth summary used for the likelihood loss. I couldn't understand this part.
Thank you.
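As a rough illustration of the distinction described in the README quote above (a minimal sketch with made-up tensors, not the repository's training code): gtscore holds continuous frame-level importance scores and would feed a regression (MSE) loss on the predicted scores, whereas gtsummary holds binary keyframe labels and would feed a likelihood-style (binary cross-entropy) loss.

import torch
import torch.nn as nn

n_frames = 300
pred_scores = torch.rand(1, n_frames)                   # model's per-frame scores in [0, 1]
gtscore = torch.rand(1, n_frames)                       # continuous frame-level importance scores
gtsummary = (torch.rand(1, n_frames) > 0.85).float()    # binary ground-truth summary labels

regression_loss = nn.MSELoss()(pred_scores, gtscore)    # regression loss on gtscore
likelihood_loss = nn.BCELoss()(pred_scores, gtsummary)  # likelihood loss on gtsummary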
I can't get the datasets. I don't know whether it's a network error on my side or a problem with the URL.
@ok1zjf
Greetings @ok1zjf,
In one of your replies to an issue about VASNet, you hinted that change points (cps) were provided in the dataset you used. I am assuming you were referring only to the SumMe and TVSum datasets.
I assume you used this dataset:
https://app.box.com/s/4lq3xkv9n536ns2vutvfa26p6p53i7dv
Could you please clarify what parameters were used for the KTS cpd_auto() function for the OVP and YouTube datasets?
Thanks
Sam
I have read the paper and it seems that you are treating this as a regression task, not a classification task.
I know that the final labels are binary while the ground-truth summary values are continuous between 0 and 1.
My question is: since you are using a sigmoid output and an F-score metric, shouldn't this be called a classification model rather than regression? And if so, how is using an MSE loss suitable in this case?
Hi @ok1zjf and @electroncastle,
First of all, thank you for sharing the code of your paper on video summarization.
I have found that Kernel Temporal Segmentation (KTS) takes a long time to segment a video longer than 3 minutes. Do you know any alternative algorithm or method that would speed up segmentation for longer videos, say 5 to 15 minutes?
Any suggestions would be highly appreciated.
Regards,
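One possible direction, offered only as a hedged suggestion rather than anything from the authors: KTS works on an n x n kernel matrix over the sampled frames, so its cost grows quickly with sequence length; temporally downsampling the features before segmentation and rescaling the detected change points afterwards can cut the runtime. A minimal sketch, assuming the KTS cpd_auto() function (cpd_auto.py) is importable and using stand-in random features; the ncp and vmax values are guesses:

import numpy as np
from cpd_auto import cpd_auto   # KTS code, assumed available as cpd_auto.py

features = np.random.rand(4000, 1024).astype(np.float32)   # per-frame CNN features (stand-in)
rate = 4                                                    # keep every 4th feature vector
sub = features[::rate]

K = np.dot(sub, sub.T)                                      # kernel (Gram) matrix for KTS
cps, _ = cpd_auto(K, len(sub) // 20, 1)                     # ncp and vmax here are assumptions
cps_full_rate = np.asarray(cps) * rate                      # map change points back to the full frame rate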
Hello, I noticed that two layers, self.kb and self.kc, are not used in your model code, so I commented them out. However, the F-score dropped after commenting them out. Do you know why?
self.att = SelfAttention(input_size=self.m, output_size=self.m)
self.ka = nn.Linear(in_features=self.m, out_features=1024)
self.kb = nn.Linear(in_features=self.ka.out_features, out_features=1024)
self.kc = nn.Linear(in_features=self.kb.out_features, out_features=1024)
self.kd = nn.Linear(in_features=self.ka.out_features, out_features=1)
Also, how can I use the model to summarize new videos?
Many thanks.
We have trained the model and obtained the tar.pth files but are unable to progress further. Can you guide us from here onward?
Hi,
For change point detection, what should I input to KTS? A flattened image of dimension HxW, or should I use some feature extraction method so that each image becomes an N-dimensional input? What is used in this paper to preprocess the images/frames?
I have the code for the unsupervised version of DR-DSN, and now I want to run it with augmented data, but I can't work it out. Could you share that code with me? Thank you very much!
In the paper, formula (6) reads W c_t + x_t, but in Figure 2 the residual (yellow) line links to x_(t+1).
How do I create an h5 file of the generated summary? I want to compare the machine summary with the ground truth visually.
Hi,
What is the reason behind multiplying Q by 0.06? Thanks.
I have a question about the features:
Do I need additional preprocessing of the video frames when using GoogLeNet to extract video features? For example, normalization and other operations, or can I directly resize the original video frames and pass them through the network to obtain features?
What should I do if I want to use ResNet for feature extraction?
Thank you!
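A hedged sketch of one common way to do this with torchvision (my guess at a typical setup, not necessarily the preprocessing used to build the provided datasets): resize and center-crop to 224x224, normalize with the ImageNet mean and std, and take GoogLeNet's 1024-d pooled output by dropping its classifier. The same pattern applies to ResNet; for example ResNet-50 yields a 2048-d pooled feature, so the downstream model's input size would need to change from 1024 accordingly.

import torch
import torch.nn as nn
import numpy as np
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing expected by torchvision models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.googlenet(pretrained=True)
model.fc = nn.Identity()          # drop the classifier; the output is the 1024-d pooled feature
model.eval()

frame = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))   # stand-in for one decoded video frame
with torch.no_grad():
    feat = model(preprocess(frame).unsqueeze(0))                   # shape (1, 1024)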
I cannot get the data and pretrained model through the URL you provided :(
In this block of the training code, I don't understand loss_att and its use.
loss_att = 0
loss = criterion(y, target)
loss = loss + loss_att
Could you tell me how you got the CNN features? It seems that your code doesn't use any CNN model; did the CNN features come from the pre-processed video dataset?
Hi,
I thought I would try my luck here...
I am working on your VASNet paper as part of a Deep Learning seminar. For a few days now I have been trying to reproduce the confusion matrix of attention weights for TVSum video 7 from split 2, which is exactly the matrix you show in the paper.
Unfortunately I get a completely different figure, and I'm running out of ideas. Here is my code:
import h5py
import torch
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn import preprocessing

self_attention = SelfAttention()  # your SelfAttention(nn.Module)
self_attention.load_state_dict(torch.load(SELFATT_MODEL_FILE))  # SelfAttention model from split 2

with h5py.File(TVSUM_DATASET_FILE, 'r') as f:
    video_7 = f['video_7']
    features = video_7['features'][...]

features = torch.from_numpy(features)
y, weights = self_attention(features)
weights = weights.detach().cpu().numpy()

# Values were normalized to range 0-1 across the matrix.
weights = preprocessing.minmax_scale(weights, feature_range=(0, 1), axis=0, copy=True)
weights_df = pd.DataFrame(weights)   # wrap in a DataFrame for the seaborn heatmap below

fig, ax = plt.subplots()
ax.xaxis.tick_top()
heatmap = sn.heatmap(
    weights_df,
    xticklabels=50,
    yticklabels=50,
    cmap="YlGnBu")
plt.show()
And this is what the Confusion Matrix looks like:
It definitely loads the correct model for TVSum split 2, and the attention weights produced are definitely the same too; I checked them against your VASNet.
Does anyone have any idea why it looks like this?
def __init__(self, apperture=-1, ignore_itself=False, input_size=1024, output_size=1024):
self.apperture = apperture
Here apperture is set to -1, and in the forward pass you check:
if self.apperture > 0:
    # Set attention to zero to frames further than +/- apperture from the current one
When does this condition ever apply?
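For what it's worth, with the default apperture=-1 the condition is never true, so the branch is simply skipped. Below is a hypothetical illustration (not code from this repository) of what such a check could do when apperture is set to a positive value: attention logits between frames more than apperture steps apart are masked out before the softmax, giving local attention.

import torch

n, apperture = 6, 2
logits = torch.randn(n, n)                               # raw attention logits
idx = torch.arange(n)
far = (idx[None, :] - idx[:, None]).abs() > apperture    # True where |i - j| > apperture
logits = logits.masked_fill(far, float('-inf'))          # suppress distant frame pairs
weights = torch.softmax(logits, dim=-1)                  # each frame attends only to nearby frames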
Please explain the following part of the code, where x and y are added:
vasnet_model.py
class VASNet(nn.Module):
    .....

    def forward(self, x, seq_len):
        m = x.shape[2]  # Feature size

        # Place the video frames to the batch dimension to allow for batch arithm. operations.
        # Assumes input batch size = 1.
        x = x.view(-1, m)
        y, att_weights_ = self.att(x)
        y = y + x  # -- what is the reason behind this step? Please explain.
Hi, I used this network on my own video dataset; I used the pretrained model summe_aug_splits_1_0.5858084051379817.tar.pth to get frame scores.
Is it right that if a frame has a higher score, it is more important and contains more semantic information, making it a key frame for a video summarization or video classification task?
The download link for the data and models is invalid; can you provide new links? @ok1zjf
I trained the project successfully by following the README, but the result.txt file that saves the results doesn't contain any data.
Hello, I ran into some questions when trying to test my own video.
I need to get the following parameters:
machine_summary = generate_summary(probs, cps, num_frames, nfps, positions)
So I need to run cpd_auto.py:
def cpd_auto(K, ncp, vmax, desc_rate=1, **kwargs):
But I don't know the meaning of the input parameters such as K, ncp, and vmax. Could you tell me how you use them?
Also, it only returns two values, cps and scores2. How can I get the remaining parameters for generate_summary, such as positions and nfps?
Looking forward to your reply. Thanks a lot!
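Since this depends on preprocessing that can only be guessed at, here is a hedged sketch of one plausible way to get from subsampled-frame features to the cps, nfps, and positions arguments of generate_summary(), using stand-in random features and a 15-frame sampling stride (both assumptions); the ncp and vmax values passed to cpd_auto() are likewise assumptions, not the authors' settings.

import numpy as np
from cpd_auto import cpd_auto   # the KTS function quoted above, from cpd_auto.py

n_frames = 3000                                   # num_frames: total frames of the original video
positions = np.arange(0, n_frames, 15)            # positions: indices of the subsampled frames (assumed 15-frame stride)
features = np.random.rand(len(positions), 1024).astype(np.float32)   # stand-in CNN features of those frames

K = np.dot(features, features.T)                  # K: kernel (Gram) matrix over the sampled features
max_ncp = len(positions) // 10                    # ncp: an assumed upper bound on the number of change points
cp_idx, scores2 = cpd_auto(K, max_ncp, 1)         # vmax=1 here is an assumption
boundaries = np.concatenate(([0], np.asarray(cp_idx, dtype=int), [len(positions)]))

cps, nfps = [], []                                # cps: [start, end] frame ranges; nfps: frames per segment
for i in range(len(boundaries) - 1):
    start = positions[boundaries[i]]
    end = positions[boundaries[i + 1]] - 1 if boundaries[i + 1] < len(positions) else n_frames - 1
    cps.append([start, end])
    nfps.append(end - start + 1)
cps = np.array(cps)

# machine_summary = generate_summary(probs, cps, n_frames, nfps, positions)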
With the provided splits, I cannot reproduce the results when training from scratch. I get F1 scores of 61.33% and 47.79% for TVSum and SumMe respectively. I am thinking that maybe the seed of 1234 is not the correct seed for reproducing these results from scratch?
Thanks!
Hi, I noticed that the last step of the self-attention calculation doesn't seem quite right:
att_weights_ = nn.functional.softmax(logits, dim=-1)
weights = self.dropout(att_weights_)
y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)
So here the softmax probabilities are calculated along dim=-1, which is the column direction.
But then the weighted sum is taken along the row direction, according to this line:
y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)
I think we should do something like this instead:
y = torch.matmul(weights,V)
What do you think?
I hope I'm the one to be corrected.
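A tiny numerical check of the point raised above (illustrative only): the expression in the repository equals weights.T @ V, which combines the rows of V using a column of the softmax matrix, whereas weights @ V uses a row, which is what actually sums to 1 when the softmax is taken along dim=-1.

import torch

n, d = 4, 3
V = torch.randn(n, d)
weights = torch.softmax(torch.randn(n, n), dim=-1)       # each row of `weights` sums to 1

y_repo = torch.matmul(V.transpose(1, 0), weights).transpose(1, 0)   # equals weights.T @ V
y_alt = torch.matmul(weights, V)                                    # equals weights @ V

print(torch.allclose(y_repo, torch.matmul(weights.transpose(1, 0), V)))  # True
print(torch.allclose(y_repo, y_alt))   # False in general: y_alt[i] mixes rows of V with row i of
                                       # `weights` (sums to 1); y_repo[i] uses column i, which need not sum to 1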
@ok1zjf How do I interpret the dataset features? For example, the first video in TVSum has 10597 frames and the shape of user_summary is (20, 10597). The shape of features is (707, 1024). How do we get the 707? What does it signify? Also, how do I train my model (LSTM) with these features? What should be the input and label?
Thank you.
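On the shape question, the 707 presumably comes from temporal subsampling of the 10597 frames (10597 / 15 is roughly 707, consistent with keeping about every 15th frame), though that is an inference, not something confirmed by the repository. On the training question, below is a minimal sketch of one possible setup (an assumption, not the authors' recipe): the input is the (707, 1024) feature sequence and the label is a per-step target of the same length, such as gtscore; stand-in random tensors are used here.

import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, input_size=1024, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):                                   # x: (batch=1, n_steps, 1024)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)       # per-step scores in [0, 1]

model = FrameScorer()
features = torch.randn(1, 707, 1024)                        # e.g. the TVSum video's feature matrix
gtscore = torch.rand(1, 707)                                # frame-level importance targets
loss = nn.MSELoss()(model(features), gtscore)
loss.backward()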
I'm trying to generate the summary outputs for TVSum and SumMe. Based on your answer to #2, I require the following:
Could you please share this information?