ok1zjf / vasnet
PyTorch implementation of the ACCV 2018-AIU2018 paper Video Summarization with Attention
License: MIT License
@ok1zjf Hello,
I see there are two fields in the dataset, gtscore and gtsummary. Can you explain the difference and significance of these? I see that you have used gtscore in your code while calculating the losses. In the README you write that gtscore holds the frame-level importance scores and is used for the regression loss, while gtsummary is the ground-truth summary used for the likelihood loss. I couldn't understand this part.
Thank you.
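As a rough illustration of the distinction described in the README quote above (a minimal sketch with made-up tensors, not the repository's training code): gtscore holds continuous frame-level importance scores and would feed a regression (MSE) loss on the predicted scores, whereas gtsummary holds binary keyframe labels and would feed a likelihood-style (binary cross-entropy) loss.

import torch
import torch.nn as nn

n_frames = 300
pred_scores = torch.rand(1, n_frames)                   # model's per-frame scores in [0, 1]
gtscore = torch.rand(1, n_frames)                       # continuous frame-level importance scores
gtsummary = (torch.rand(1, n_frames) > 0.85).float()    # binary ground-truth summary labels

regression_loss = nn.MSELoss()(pred_scores, gtscore)    # regression loss on gtscore
likelihood_loss = nn.BCELoss()(pred_scores, gtsummary)  # likelihood loss on gtsummary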
I can't get the datasets. I don't know whether it's a network error on my side or a problem with the URL.
@ok1zjf
Greetings @ok1zjf,
In one of your replies to an issue about VASNet, you hinted that change points (cps) were provided in the dataset you used. I am assuming you were referring only to the SumMe and TVSum datasets.
I assume you used this dataset:
https://app.box.com/s/4lq3xkv9n536ns2vutvfa26p6p53i7dv
Could you please clarify what parameters were used for the KTS cpd_auto() function for the OVP and YouTube datasets?
Thanks
Sam
I have read the paper and it seems that you are treating this as a regression task, not a classification task.
I know that the final labels are binary while the ground-truth summary values are continuous between 0 and 1.
My question is: since you are using a sigmoid output and an F-score metric, shouldn't this be called a classification model rather than regression? And if so, how is using an MSE loss suitable in this case?
Hi @ok1zjf and @electroncastle,
First of all, thank you for sharing the code of your paper on video summarization.
I have found that Kernel Temporal Segmentation (KTS) takes a long time to segment a video longer than 3 minutes. Do you know any alternative algorithm or method that would speed up segmentation for longer videos, say 5 to 15 minutes?
Any suggestions would be highly appreciated.
Regards,
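One possible direction, offered only as a hedged suggestion rather than anything from the authors: KTS works on an n x n kernel matrix over the sampled frames, so its cost grows quickly with sequence length; temporally downsampling the features before segmentation and rescaling the detected change points afterwards can cut the runtime. A minimal sketch, assuming the KTS cpd_auto() function (cpd_auto.py) is importable and using stand-in random features; the ncp and vmax values are guesses:

import numpy as np
from cpd_auto import cpd_auto   # KTS code, assumed available as cpd_auto.py

features = np.random.rand(4000, 1024).astype(np.float32)   # per-frame CNN features (stand-in)
rate = 4                                                    # keep every 4th feature vector
sub = features[::rate]

K = np.dot(sub, sub.T)                                      # kernel (Gram) matrix for KTS
cps, _ = cpd_auto(K, len(sub) // 20, 1)                     # ncp and vmax here are assumptions
cps_full_rate = np.asarray(cps) * rate                      # map change points back to the full frame rate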
Hello, I noticed that two layers, self.kb and self.kc, are not used in your model code, so I commented them out. However, the F-score dropped after commenting them out. Do you know why?
self.att = SelfAttention(input_size=self.m, output_size=self.m)
self.ka = nn.Linear(in_features=self.m, out_features=1024)
self.kb = nn.Linear(in_features=self.ka.out_features, out_features=1024)
self.kc = nn.Linear(in_features=self.kb.out_features, out_features=1024)
self.kd = nn.Linear(in_features=self.ka.out_features, out_features=1)
Also, how can I use the model to summarize new videos?
Many thanks.
We have trained the model and obtained the tar.pth files but are unable to progress further. Can you guide us from here onward?
Hi,
For change point detection, what should I input to KTS? A flattened image of dimension HxW, or should I use some feature extraction method so that each image becomes an N-dimensional input? What is used in this paper to preprocess the images/frames?
I have the code for the unsupervised version of DR-DSN, and now I want to run it with augmented data, but I can't work it out. Could you share that code with me? Thank you very much!
In the paper, formula (6) reads W c_t + x_t, but in Figure 2 the residual (yellow) line links to x_(t+1).
How do I create an h5 file of the generated summary? I want to compare the machine summary with the ground truth visually.
Hi,
What is the reason behind multiplying Q by 0.06? Thanks.
I have a question about the features:
Do I need additional preprocessing of the video frames when using GoogLeNet to extract video features? For example, normalization and other operations, or can I directly resize the original video frames and pass them through the network to obtain features?
What should I do if I want to use ResNet for feature extraction?
Thank you!
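A hedged sketch of one common way to do this with torchvision (my guess at a typical setup, not necessarily the preprocessing used to build the provided datasets): resize and center-crop to 224x224, normalize with the ImageNet mean and std, and take GoogLeNet's 1024-d pooled output by dropping its classifier. The same pattern applies to ResNet; for example ResNet-50 yields a 2048-d pooled feature, so the downstream model's input size would need to change from 1024 accordingly.

import torch
import torch.nn as nn
import numpy as np
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing expected by torchvision models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.googlenet(pretrained=True)
model.fc = nn.Identity()          # drop the classifier; the output is the 1024-d pooled feature
model.eval()

frame = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))   # stand-in for one decoded video frame
with torch.no_grad():
    feat = model(preprocess(frame).unsqueeze(0))                   # shape (1, 1024)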
I cannot get the data and pretrained model through the URL you provided :(
In this block of the training code, I don't understand loss_att and its use.
loss_att = 0
loss = criterion(y, target)
loss = loss + loss_att
Could you tell me how you got the CNN features? It seems that your code doesn't use any CNN model; did the CNN features come from the pre-processed video dataset?
Hi,
I thought I would try my luck here...
I am working on your VASNet paper as part of a Deep Learning seminar. For a few days now I have been trying to reproduce the confusion matrix of attention weights for TVSum video 7 from split 2, which is exactly the matrix you show in the paper.
Unfortunately I get a completely different figure, and I'm running out of ideas. Here is my code:
import h5py
import torch
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn import preprocessing

self_attention = SelfAttention()  # your SelfAttention(nn.Module)
self_attention.load_state_dict(torch.load(SELFATT_MODEL_FILE))  # SelfAttention model from split 2

with h5py.File(TVSUM_DATASET_FILE, 'r') as f:
    video_7 = f['video_7']
    features = video_7['features'][...]

features = torch.from_numpy(features)
y, weights = self_attention(features)
weights = weights.detach().cpu().numpy()

# Values were normalized to range 0-1 across the matrix.
weights = preprocessing.minmax_scale(weights, feature_range=(0, 1), axis=0, copy=True)
weights_df = pd.DataFrame(weights)   # wrap in a DataFrame for the seaborn heatmap below

fig, ax = plt.subplots()
ax.xaxis.tick_top()
heatmap = sn.heatmap(
    weights_df,
    xticklabels=50,
    yticklabels=50,
    cmap="YlGnBu")
plt.show()
And this is what the Confusion Matrix looks like:
It definitely loads the correct model for TVSum split 2, and the attention weights produced are definitely the same too; I checked them against your VASNet.
Does anyone have any idea why it looks like this?
def __init__(self, apperture=-1, ignore_itself=False, input_size=1024, output_size=1024):
self.apperture = apperture
Here apperture is set to -1, and in the forward pass you check:
if self.apperture > 0:
    # Set attention to zero to frames further than +/- apperture from the current one
When does this condition ever apply?
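For what it's worth, with the default apperture=-1 the condition is never true, so the branch is simply skipped. Below is a hypothetical illustration (not code from this repository) of what such a check could do when apperture is set to a positive value: attention logits between frames more than apperture steps apart are masked out before the softmax, giving local attention.

import torch

n, apperture = 6, 2
logits = torch.randn(n, n)                               # raw attention logits
idx = torch.arange(n)
far = (idx[None, :] - idx[:, None]).abs() > apperture    # True where |i - j| > apperture
logits = logits.masked_fill(far, float('-inf'))          # suppress distant frame pairs
weights = torch.softmax(logits, dim=-1)                  # each frame attends only to nearby frames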
Please explain the following part of the code, where x and y are added:
vasnet_model.py
class VASNet(nn.Module):
    .....

    def forward(self, x, seq_len):
        m = x.shape[2]  # Feature size

        # Place the video frames to the batch dimension to allow for batch arithm. operations.
        # Assumes input batch size = 1.
        x = x.view(-1, m)
        y, att_weights_ = self.att(x)
        y = y + x  # -- what is the reason behind this step? Please explain.
Hi, I used this network on my own video dataset; I used the pretrained model summe_aug_splits_1_0.5858084051379817.tar.pth to get frame scores.
Is it right that if a frame has a higher score, it is more important and contains more semantic information, making it a key frame for a video summarization or video classification task?
The download link for the data and models is invalid; can you provide new links? @ok1zjf
I trained the project successfully by following the README, but the result.txt file that saves the results doesn't contain any data.
Hello, I ran into some questions when trying to test my own video.
I need to get the following parameters:
machine_summary = generate_summary(probs, cps, num_frames, nfps, positions)
So I need to run cpd_auto.py:
def cpd_auto(K, ncp, vmax, desc_rate=1, **kwargs):
But I don't know the meaning of the input parameters such as K, ncp, and vmax. Could you tell me how you use them?
Also, it only returns two values, cps and scores2. How can I get the remaining parameters for generate_summary, such as positions and nfps?
Looking forward to your reply. Thanks a lot!
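Since this depends on preprocessing that can only be guessed at, here is a hedged sketch of one plausible way to get from subsampled-frame features to the cps, nfps, and positions arguments of generate_summary(), using stand-in random features and a 15-frame sampling stride (both assumptions); the ncp and vmax values passed to cpd_auto() are likewise assumptions, not the authors' settings.

import numpy as np
from cpd_auto import cpd_auto   # the KTS function quoted above, from cpd_auto.py

n_frames = 3000                                   # num_frames: total frames of the original video
positions = np.arange(0, n_frames, 15)            # positions: indices of the subsampled frames (assumed 15-frame stride)
features = np.random.rand(len(positions), 1024).astype(np.float32)   # stand-in CNN features of those frames

K = np.dot(features, features.T)                  # K: kernel (Gram) matrix over the sampled features
max_ncp = len(positions) // 10                    # ncp: an assumed upper bound on the number of change points
cp_idx, scores2 = cpd_auto(K, max_ncp, 1)         # vmax=1 here is an assumption
boundaries = np.concatenate(([0], np.asarray(cp_idx, dtype=int), [len(positions)]))

cps, nfps = [], []                                # cps: [start, end] frame ranges; nfps: frames per segment
for i in range(len(boundaries) - 1):
    start = positions[boundaries[i]]
    end = positions[boundaries[i + 1]] - 1 if boundaries[i + 1] < len(positions) else n_frames - 1
    cps.append([start, end])
    nfps.append(end - start + 1)
cps = np.array(cps)

# machine_summary = generate_summary(probs, cps, n_frames, nfps, positions)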
With the provided splits, I cannot reproduce the results when training from scratch. I get F1 scores of 61.33% and 47.79% for TVSum and SumMe respectively. I am thinking that maybe the seed of 1234 is not the correct seed for reproducing these results from scratch?
Thanks!
Hi, I noticed that the last step of the self-attention calculation doesn't seem quite right:
att_weights_ = nn.functional.softmax(logits, dim=-1)
weights = self.dropout(att_weights_)
y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)
So here the softmax probabilities are calculated along dim=-1, which is the column direction.
But then the weighted sum is taken along the row direction, according to this line:
y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)
I think we should do something like this instead:
y = torch.matmul(weights,V)
What do you think?
I hope I'm the one to be corrected.
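A tiny numerical check of the point raised above (illustrative only): the expression in the repository equals weights.T @ V, which combines the rows of V using a column of the softmax matrix, whereas weights @ V uses a row, which is what actually sums to 1 when the softmax is taken along dim=-1.

import torch

n, d = 4, 3
V = torch.randn(n, d)
weights = torch.softmax(torch.randn(n, n), dim=-1)       # each row of `weights` sums to 1

y_repo = torch.matmul(V.transpose(1, 0), weights).transpose(1, 0)   # equals weights.T @ V
y_alt = torch.matmul(weights, V)                                    # equals weights @ V

print(torch.allclose(y_repo, torch.matmul(weights.transpose(1, 0), V)))  # True
print(torch.allclose(y_repo, y_alt))   # False in general: y_alt[i] mixes rows of V with row i of
                                       # `weights` (sums to 1); y_repo[i] uses column i, which need not sum to 1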
@ok1zjf How do I interpret the dataset features? For example, the first video in TVSum has 10597 frames and the shape of user_summary is (20, 10597). The shape of features is (707, 1024). How do we get the 707? What does it signify? Also, how do I train my model (LSTM) with these features? What should be the input and label?
Thank you.
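On the shape question, the 707 presumably comes from temporal subsampling of the 10597 frames (10597 / 15 is roughly 707, consistent with keeping about every 15th frame), though that is an inference, not something confirmed by the repository. On the training question, below is a minimal sketch of one possible setup (an assumption, not the authors' recipe): the input is the (707, 1024) feature sequence and the label is a per-step target of the same length, such as gtscore; stand-in random tensors are used here.

import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, input_size=1024, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):                                   # x: (batch=1, n_steps, 1024)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)       # per-step scores in [0, 1]

model = FrameScorer()
features = torch.randn(1, 707, 1024)                        # e.g. the TVSum video's feature matrix
gtscore = torch.rand(1, 707)                                # frame-level importance targets
loss = nn.MSELoss()(model(features), gtscore)
loss.backward()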
I'm trying to generate the summary outputs for TVSum and SumMe. Based on your answer to #2, I require the following:
Could you please share this information?