
vasnet's People

Contributors

electroncastle, ok1zjf


vasnet's Issues

gtscore and gtsummary

@ok1zjf Hello,
I see there are two fields in the dataset, gtscore and gtsummary. Can you explain the difference and the significance of each? I see that you use gtscore in your code when calculating the loss. In the README you write that gtscore holds the frame-level importance scores used for the regression loss, while gtsummary is the ground-truth summary used for the likelihood loss. I couldn't quite follow this part.
Thank you.
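For reference, a minimal sketch of what the two fields typically contain (the file and key names assume the eccv16-style h5 layout these datasets follow; adjust to your copy):

import h5py

# gtscore: continuous per-frame importance (regression target),
# gtsummary: 0/1 keyframe indicator (likelihood / classification target).
with h5py.File('eccv16_dataset_tvsum_google_pool5.h5', 'r') as f:
    gtscore = f['video_1/gtscore'][...]      # floats, one per subsampled frame
    gtsummary = f['video_1/gtsummary'][...]  # zeros and ones, same length
    print(gtscore.shape, gtscore.min(), gtscore.max())
    print(set(gtsummary.astype(int).tolist()))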

Loss Function and Evaluation Metric Choice

I have read the paper, and it seems that you are treating this as a regression task rather than a classification task.
I know that the final labels are binary and that the ground-truth summary scores are continuous values between 0 and 1.
My question is: since you are using a sigmoid output and an F-score metric, shouldn't this be called a classification model rather than a regression model?
And if so, how is using an MSE loss suitable in this case?

  • I tried to replace MSE with BCE, but I got slightly worse results.
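For concreteness, the swap mentioned above amounts to roughly the following (a standalone sketch, not the repo's training loop):

import torch
import torch.nn as nn

y = torch.sigmoid(torch.randn(1, 300))   # per-frame scores from a sigmoid output
gtscore = torch.rand(1, 300)             # continuous frame-level importance targets

mse = nn.MSELoss()(y, gtscore)           # regression view, as described in the question
bce = nn.BCELoss()(y, gtscore)           # likelihood view; BCELoss also accepts soft targets in [0, 1]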

Alternate algorithm to KTS to speed up the segmentation process

Hi @ok1zjf and @electroncastle ,

First of all, thank you for sharing the code for your video summarization paper.

I have found that Kernel Temporal Segmentation (KTS) takes a long time to segment a video longer than 3 minutes. Do you know of an alternative algorithm or method that would speed up segmentation for longer videos, say 5 to 15 minutes?

Any suggestions would be highly appreciated.

Regards,

Question About The Model

Hello, I noticed that two layers, self.kb and self.kc, are not used in your model code, so I commented them out. However, the F-score dropped after I commented them out. Do you know why?
self.att = SelfAttention(input_size=self.m, output_size=self.m)
self.ka = nn.Linear(in_features=self.m, out_features=1024)
self.kb = nn.Linear(in_features=self.ka.out_features, out_features=1024)
self.kc = nn.Linear(in_features=self.kb.out_features, out_features=1024)
self.kd = nn.Linear(in_features=self.ka.out_features, out_features=1)

input to KTS

Hi,

For change-point detection, what should I input to KTS? A flattened H×W image, or features extracted from each frame so that the input becomes an N-dimensional vector? What is used in this paper to preprocess the images/frames?
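For context, KTS works on a kernel (Gram) matrix built from per-frame descriptors rather than on raw pixels. A minimal sketch (using the 1024-d deep features shipped in the h5 files is an assumption about the preprocessing, not taken from the authors' scripts):

import numpy as np
from cpd_auto import cpd_auto  # KTS code; adjust the import path to your setup

features = np.random.rand(707, 1024)   # placeholder: one deep-feature vector per (subsampled) frame
K = np.dot(features, features.T)       # kernel / Gram matrix that KTS consumes
cps, _ = cpd_auto(K, features.shape[0] // 2, 1)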

Can you offer the code of DR-DSNsup?

I have the code for the unsupervised version of DR-DSN, and now I want to run it with the augmented data, but I can't work it out. Could you share the code with me? Thank you very much!

create result H5

How do I create an h5 file of the generated summary? I want to compare the machine summary with the ground truth visually.
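Not an official answer, but a minimal sketch of writing a summary into an h5 file with h5py (the key names are assumptions):

import h5py
import numpy as np

machine_summary = np.random.randint(0, 2, size=10597)  # placeholder: 0/1 output of generate_summary(...)

with h5py.File('machine_summaries.h5', 'w') as f:
    f.create_dataset('video_1/machine_summary', data=machine_summary)
# The result can then be loaded alongside user_summary from the dataset h5 for a visual comparison.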

datasets

I have a question about the features:
Do I need any additional processing of the video frames when using GoogLeNet to extract features, for example normalization and similar operations, or do I just resize the original frames and feed them to the network to obtain the features?
What should I do if I want to use ResNet for feature extraction?
Thank you!
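For what it's worth, a sketch of a typical extraction pipeline (the resize/crop sizes and ImageNet normalization are assumptions about standard practice, not the authors' exact preprocessing). Swapping in ResNet-50 gives 2048-d features instead of GoogLeNet's 1024-d:

import torch
import torchvision.models as models
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()               # drop the classifier, keep the 2048-d pooled features
resnet.eval()

def extract(frame_pil):
    """frame_pil: a PIL.Image frame. Returns a 2048-d feature vector."""
    with torch.no_grad():
        return resnet(preprocess(frame_pil).unsqueeze(0)).squeeze(0)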

Regarding the loss_att

In this block of the training code, I don't understand loss_att and what it is used for.

loss_att = 0
loss = criterion(y, target)
loss = loss + loss_att

Can't reproduce confusion matrix of attention weights for TvSum video 7, test split 2

Hi,

I thought I would try my luck here...

I am working on your VASNet paper as part of a deep learning seminar. For a few days I have been trying to reproduce the confusion matrix of attention weights for TVSum video 7 from split 2, which is exactly the matrix you show in the paper.

Unfortunately I get a completely different figure, and I'm running out of ideas. Here is my code:

import h5py
import torch
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn import preprocessing

self_attention = SelfAttention()  # your SelfAttention(nn.Module)
self_attention.load_state_dict(torch.load(SELFATT_MODEL_FILE))  # SelfAttention model from split 2
self_attention.eval()  # make sure dropout (if any) is disabled for inference

with h5py.File(TVSUM_DATASET_FILE, 'r') as f:

    video_7 = f['video_7']
    features = video_7['features'][...]

    features = torch.from_numpy(features)

    y, weights = self_attention(features)

    weights = weights.detach().cpu().numpy()

    # Values were normalized to range 0-1 across the matrix.
    weights = preprocessing.minmax_scale(weights, feature_range=(0, 1), axis=0, copy=True)
    weights_df = pd.DataFrame(weights)  # was undefined in the original snippet

    fig, ax = plt.subplots()
    ax.xaxis.tick_top()

    heatmap = sn.heatmap(
        weights_df,
        xticklabels=50,
        yticklabels=50,
        cmap="YlGnBu")

    plt.show()

And this is what the Confusion Matrix looks like:

[image: confusion_matrix]

It definitely loads the correct model for TVSum split 2, and the attention weights it produces are definitely the same too; I checked them against your VASNet.

Does anyone have any idea why it looks like this?

Please explain the following code part of adding x and y

vasnet_model.py
class VASNet(nn.Module):
    .....
    def forward(self, x, seq_len):

        m = x.shape[2]  # Feature size

        # Place the video frames to the batch dimension to allow for batch arithm. operations.
        # Assumes input batch size = 1.
        x = x.view(-1, m)
        y, att_weights_ = self.att(x)

        y = y + x  # <-- what is the reason behind this step? Please explain.
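For context, y = y + x has the form of a residual (skip) connection, adding the attention block's input back to its output. A minimal standalone illustration of the pattern (not the repo's code):

import torch
import torch.nn as nn

x = torch.randn(300, 1024)   # (frames, feature_dim)
sublayer = nn.Identity()     # stand-in for the self-attention block
y = sublayer(x)
y = y + x                    # residual: output = sublayer(x) + x, as in Transformer encoders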

Question about frame score

Hi, I use this network on my own video dataset; I used the pretrained model summe_aug_splits_1_0.5858084051379817.tar.pth to get frame scores.
Is it right that if a frame has a higher score, that frame is more important and contains more semantic information, so it makes a better key frame for a video summarization or video classification task?

The result.txt has no data

I trained the project successfully by following the README, but the result.txt file that should save the results doesn't contain any data.

Test my own video

Hello, I ran into some questions when trying to test my own video.
I need to get the following parameters:

    machine_summary = generate_summary(probs, cps, num_frames, nfps, positions)

So I need to run cpd_auto.py:

    def cpd_auto(K, ncp, vmax, desc_rate=1, **kwargs):

But I don't know the meaning of the input parameters such as K, ncp, and vmax. Could you tell me how you use them?
Also, it only returns two values, cps and scores2. How can I get the remaining parameters of generate_summary, such as positions and nfps?
Looking forward to your reply. Thanks a lot!
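For what it's worth, here is a rough sketch of how these pieces are usually wired together in eccv16-style preprocessing; the parameter values and the segment mapping are assumptions, not taken from this repo. K is the kernel (Gram) matrix of per-frame features, ncp is an upper bound on the number of change points, and vmax weights the penalty on the number of segments; positions and nfps can then be derived from the subsampling step and the change points:

import numpy as np
from cpd_auto import cpd_auto  # KTS code; adjust the import path to your setup

# Placeholders: features of every 15th frame and the original frame count.
features = np.random.rand(707, 1024)
num_frames = 10597
positions = np.arange(0, num_frames, 15)     # original-frame index of each feature row

K = np.dot(features, features.T)             # K: Gram (kernel) matrix of the frame features
max_cp = features.shape[0] // 2              # ncp: upper bound on the number of change points
cps, scores2 = cpd_auto(K, max_cp, 1)        # vmax: penalty weight on the number of segments

# Map change points (subsampled indices) to [start, end] pairs in original-frame coordinates.
boundaries = np.concatenate(([0], cps, [len(positions)])).astype(int)
starts = positions[boundaries[:-1]]
ends = np.append(positions[boundaries[1:-1]] - 1, num_frames - 1)
change_points = np.stack([starts, ends], axis=1)
nfps = (change_points[:, 1] - change_points[:, 0] + 1).tolist()  # frames per segment

probs = np.random.rand(len(positions))       # placeholder: per-step scores from the model
# machine_summary = generate_summary(probs, change_points, num_frames, nfps, positions)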

Can't reproduce results training

With the provided splits, I cannot reproduce the results when training from scratch. I get F1 scores of 61.33% and 47.79% for TVSum and SumMe respectively. Could it be that the seed of 1234 is not the correct seed for reproducing these results from scratch?
Thanks!

Issue regarding the last step of self attention (weighted sum step)

Hi, I noticed that the last step of the self-attention calculation doesn't seem right:

att_weights_ = nn.functional.softmax(logits, dim=-1)       
weights = self.dropout(att_weights_)     
y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)

So here the softmax probabilities are computed along dim -1, i.e. across the columns within each row.
But then the weighted sum is taken along the row direction, according to this line:

y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)

I think we should do something like this:

y = torch.matmul(weights, V)

What do you think?
I hope I'm the one who needs to be corrected.
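A quick standalone check of how the two expressions relate (a sketch, not the repo's code): the existing line computes (V^T W)^T, which equals W^T V, while the proposed fix computes W V; the two only coincide when the weight matrix is symmetric.

import torch

torch.manual_seed(0)
V = torch.randn(5, 8)
W = torch.softmax(torch.randn(5, 5), dim=-1)   # row-stochastic, generally not symmetric

y_repo = torch.matmul(V.transpose(1, 0), W).transpose(1, 0)   # = W^T V
y_fix = torch.matmul(W, V)                                    # = W V

print(torch.allclose(y_repo, torch.matmul(W.transpose(1, 0), V)))  # True: the existing line equals W^T V
print(torch.allclose(y_repo, y_fix))                               # False in general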

Interpreting Datasets

@ok1zjf How do I interpret the dataset features? For example, the first video in TVSum has 10597 frames and the shape of user_summary is (20, 10597). The shape of features is (707, 1024). How do we get the 707? What does it signify? Also, how do I train my model (LSTM) with these features? What should be the input and label?
Thank you.
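For what it's worth: if the features follow the eccv16-style subsampling of every 15th frame (an assumption), then ceil(10597 / 15) = 707, i.e. one 1024-d feature row per sampled frame, and the picks/positions field (if present) lists the exact frame indices. A minimal sketch of training a sequence model on these features with gtscore as the per-step target (file and key names are assumptions):

import h5py
import torch
import torch.nn as nn

with h5py.File('eccv16_dataset_tvsum_google_pool5.h5', 'r') as f:
    feats = torch.from_numpy(f['video_1/features'][...]).float()    # (707, 1024)
    target = torch.from_numpy(f['video_1/gtscore'][...]).float()    # (707,)

lstm = nn.LSTM(input_size=1024, hidden_size=256, batch_first=True)
head = nn.Linear(256, 1)

out, _ = lstm(feats.unsqueeze(0))                          # (1, 707, 256)
scores = torch.sigmoid(head(out)).squeeze(-1).squeeze(0)   # (707,) frame scores
loss = nn.MSELoss()(scores, target)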

TVSum and SumMe: Mapping video id to key

I'm trying to generate the summary outputs for TVSum and SumMe. Based on your answer to #2, I require the following:

  1. FPS at which the frames were extracted
  2. Mapping of the video id/name in the TVSum and SumMe datasets to the video ID used in your h5 files

Could you please share this information?
