
Comments (22)

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi, thank you for your question.
I only use the pre-trained VGG-16 model to extract features; I don't back-propagate gradients through it, both for efficiency and to prevent overfitting.
So, in the cnn_util.py script, I subtract ilsvrc_2012_mean.npy, the mean image of the ImageNet dataset.

In other words, if I were fine-tuning the VGG model itself, I would subtract the mean of the video frames instead; here I only use it to extract features.
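For reference, the mean subtraction can be sketched as follows (a minimal sketch; `preprocess_frame` and the array shapes are illustrative assumptions, not the exact cnn_util.py code):

```python
import numpy as np

def preprocess_frame(frame_bgr, mean_image):
    """Subtract the ImageNet mean (e.g. loaded from ilsvrc_2012_mean.npy)
    before feeding a frame to the frozen VGG-16 feature extractor."""
    # mean_image is (3, H, W) in Caffe's channel-first BGR layout;
    # collapse it to one value per channel and broadcast over the frame.
    channel_mean = mean_image.mean(axis=(1, 2)).reshape(3, 1, 1)
    return frame_bgr.astype(np.float32) - channel_mean
```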

from s2vt.

chenxinpeng commented on May 24, 2024

@hhxxttxsh
And, this is my understanding and explanation.

By the way, in the paper A Hierarchical Approach for Generating Descriptive Image Paragraphs, the authors preprocess images in the same way.

You can read Section 4.5, Transfer Learning, of that paper.

hhxxttxsh commented on May 24, 2024

Thank you for the reply.
In the training process of the original authors' code, the encoder LSTM is fed a batch of fc7 frame features of size (1000, 32, 4096). Within the batch, adjacent frames seem not to be correlated, because they have been shuffled before this step. I am wondering how this can make sense, since time steps t and t+1 may not even belong to the same video.

Do you also have this similar implementation in your code? How do you organize your batch for training the encoder-decoder?

Thanks!

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,
I think you may have misunderstood the original code. In the original model.py:

import numpy as np

# Shuffle only the order of the videos (the DataFrame rows).
index = list(train_data.index)
np.random.shuffle(index)
train_data = train_data.ix[index]  # .ix is deprecated; use train_data.loc[index] in modern pandas

This only shuffles the order of the videos; the order of frames within each video is NOT changed.
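A toy check of that behaviour (hypothetical data; in the real code each row of train_data corresponds to one video):

```python
import numpy as np
import pandas as pd

# One row per video; each cell holds that video's ordered frame features (toy values).
train_data = pd.DataFrame(
    {"frames": [[0, 1, 2], [10, 11, 12], [20, 21, 22]]},
    index=["vid_a", "vid_b", "vid_c"],
)

index = list(train_data.index)
np.random.shuffle(index)            # the videos change order...
train_data = train_data.loc[index]  # (.ix in the old code is deprecated)

# ...but each video's frames stay in temporal order.
assert all(f == sorted(f) for f in train_data["frames"])
```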

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,
I know what you mean...

In the original paper, S. Venugopalan et al. report an experiment with shuffled frames. From Section 4.3, Experimental details of our model:

We also experiment with randomly re-ordered input frames to verify that S2VT learns temporal-sequence information.

The METEOR score of that experiment is 28.2%, very close to the result without randomly re-ordering the input frames. You can see this in Table 2 of the paper.

And I think the features of frames within a video are similar, so shuffling them has only a small influence.

In my code, I don't shuffle the frames.

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,

I have also tested the source code by S. Venugopalan et al., and the generated sentences are reasonable. I haven't trained the model myself; I just used the trained model downloaded from the authors.

And I only trained the model on the YouTube (MSVD) videos.

hhxxttxsh commented on May 24, 2024

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,

When I use only the RGB features extracted with the VGG model, I get a METEOR score of 28.1; you can see this in the README.md of my code.
And I split the dataset into only two parts, a training set and a testing set. I think that is why I didn't get the reported METEOR score of 29.2.

Do you use the same training, validation, and testing split as S. Venugopalan et al.?

That said, I don't think we need to reproduce the exact results of the original paper.
The most important thing is to understand the idea of the paper. :)
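For reference, Venugopalan et al. split MSVD's 1970 clips into 1200 training, 100 validation, and 670 test videos; a sketch of such a three-way split (the helper name is hypothetical):

```python
def split_msvd(video_ids):
    """Three-way MSVD split as reported by S. Venugopalan et al.:
    1200 train / 100 validation / 670 test (assumes a fixed id order)."""
    assert len(video_ids) == 1970, "MSVD has 1970 clips"
    return video_ids[:1200], video_ids[1200:1300], video_ids[1300:]

train, val, test = split_msvd(["vid%d" % i for i in range(1, 1971)])
```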

hhxxttxsh commented on May 24, 2024

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,

As I suspected, the authors used extra data. Thank you for pointing out this trick.

My implementation got 28.2, slightly worse than but close to 29.2.

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

Hi,

I don't think there is such a huge difference between Caffe and TensorFlow.

Yes, TensorFlow is a fairly low-level framework, so if you don't need fine-grained control over your network, I suggest learning Keras, a higher-level framework built on top of TensorFlow; then you can go deeper into TensorFlow itself.
By the way, as of TensorFlow 1.0.0, Keras has already been integrated into TensorFlow.

I used Torch until six months ago. I like Torch more than TensorFlow, but Lua is a niche language, so I switched from Torch to TensorFlow.

hhxxttxsh commented on May 24, 2024

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

Hi,

Of course it is possible to implement S2VT with Keras. I also recommend a paper: https://www.aclweb.org/anthology/C/C16/C16-1005.pdf . It combines the S2VT model with an attention mechanism.
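For what it's worth, here is a minimal Keras sketch of an S2VT-style encoder-decoder (a simplified single-layer variant, not the paper's exact stacked two-LSTM scheme; all sizes and names are illustrative):

```python
from tensorflow.keras import layers, Model

T, FEAT, VOCAB, HID = 80, 4096, 10000, 512  # frames, fc7 dim, vocab size, hidden units

frames = layers.Input(shape=(T, FEAT), name="fc7_features")
captions = layers.Input(shape=(None,), dtype="int32", name="caption_tokens")

# Encoder LSTM reads the frame features in temporal order.
_, h, c = layers.LSTM(HID, return_state=True)(frames)

# Decoder LSTM generates the caption, initialized from the encoder's final state.
emb = layers.Embedding(VOCAB, HID)(captions)
dec = layers.LSTM(HID, return_sequences=True)(emb, initial_state=[h, c])
logits = layers.Dense(VOCAB)(dec)  # per-time-step vocabulary scores

model = Model([frames, captions], logits)
```

Training would pair the shifted caption tokens with a softmax cross-entropy loss over the logits, as in the usual sequence-to-sequence setup.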

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh

Hi,

That's OK. I'm still working on image and video captioning.

My e-mail: [email protected]

jozefmorvay commented on May 24, 2024

@hhxxttxsh @chenxinpeng Hi, both of you. Were you able to implement this in Keras? I will make my own attempt in the coming weeks, and I would appreciate a how-to guide, as I am completely new to DL.
