
Comments (22)

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi, thank you for your question.
I only use the pre-trained VGG-16 model to extract features; I don't back-propagate gradients through it, both for efficiency and to prevent overfitting.
So, in the cnn_util.py script, I subtract ilsvrc_2012_mean.npy, the mean image of the ImageNet dataset.

In other words, if I were fine-tuning the VGG model itself, I would subtract the mean of the video frames instead; here I only use it to extract features.
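For reference, the mean subtraction can be sketched as follows (a minimal sketch; `preprocess_frame` and the array shapes are illustrative assumptions, not the exact cnn_util.py code):

```python
import numpy as np

def preprocess_frame(frame_bgr, mean_image):
    """Subtract the ImageNet mean (e.g. loaded from ilsvrc_2012_mean.npy)
    before feeding a frame to the frozen VGG-16 feature extractor."""
    # mean_image is (3, H, W) in Caffe's channel-first BGR layout;
    # collapse it to one value per channel and broadcast over the frame.
    channel_mean = mean_image.mean(axis=(1, 2)).reshape(3, 1, 1)
    return frame_bgr.astype(np.float32) - channel_mean
```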

from s2vt.

chenxinpeng commented on May 24, 2024

@hhxxttxsh
And, this is my understanding and explanation.

By the way, in the paper A Hierarchical Approach for Generating Descriptive Image Paragraphs, the authors preprocess images in the same way.

You can read Section 4.5, Transfer Learning, of that paper.

hhxxttxsh commented on May 24, 2024

Thank you for the reply.
In the training process of the original authors' code, the encoder LSTM is fed a batch of fc7 frame features of size (1000, 32, 4096). Within the batch, adjacent frames seem not to be correlated, because they have been shuffled before this step. I am wondering how this can make sense, since time steps t and t+1 may not even belong to the same video.

Do you also have this similar implementation in your code? How do you organize your batch for training the encoder-decoder?

Thanks!

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,
I think you may have misunderstood the original code. In the original model.py:

import numpy as np

# Shuffle only the order of the videos (the DataFrame rows).
index = list(train_data.index)
np.random.shuffle(index)
train_data = train_data.ix[index]  # .ix is deprecated; use train_data.loc[index] in modern pandas

This only shuffles the order of the videos; the order of frames within each video is NOT changed.
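A toy check of that behaviour (hypothetical data; in the real code each row of train_data corresponds to one video):

```python
import numpy as np
import pandas as pd

# One row per video; each cell holds that video's ordered frame features (toy values).
train_data = pd.DataFrame(
    {"frames": [[0, 1, 2], [10, 11, 12], [20, 21, 22]]},
    index=["vid_a", "vid_b", "vid_c"],
)

index = list(train_data.index)
np.random.shuffle(index)            # the videos change order...
train_data = train_data.loc[index]  # (.ix in the old code is deprecated)

# ...but each video's frames stay in temporal order.
assert all(f == sorted(f) for f in train_data["frames"])
```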

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,
I know what you mean...

In the original paper, S. Venugopalan et al. report an experiment with shuffled frames. From Section 4.3, Experimental details of our model:

We also experiment with randomly re-ordered input frames to verify that S2VT learns temporal-sequence information.

The METEOR score of that experiment is 28.2%, very close to the result without randomly re-ordering the input frames. You can see this in Table 2 of the paper.

And I think the features of frames within a video are similar, so shuffling them has only a small influence.

In my code, I don't shuffle the frames.

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,

I have also tested the source code by S. Venugopalan et al., and the generated sentences are reasonable. I haven't trained the model myself; I just used the trained model downloaded from the authors.

And I only trained the model on the YouTube (MSVD) videos.

hhxxttxsh commented on May 24, 2024

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,

When I use only the RGB features extracted with the VGG model, I get a METEOR score of 28.1; you can see this in the README.md of my code.
And I split the dataset into only two parts, a training set and a testing set. I think that is why I didn't get the reported METEOR score of 29.2.

Do you use the same training, validation, and testing split as S. Venugopalan et al.?

That said, I don't think we need to reproduce the exact results of the original paper.
The most important thing is to understand the idea of the paper. :)
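For reference, Venugopalan et al. split MSVD's 1970 clips into 1200 training, 100 validation, and 670 test videos; a sketch of such a three-way split (the helper name is hypothetical):

```python
def split_msvd(video_ids):
    """Three-way MSVD split as reported by S. Venugopalan et al.:
    1200 train / 100 validation / 670 test (assumes a fixed id order)."""
    assert len(video_ids) == 1970, "MSVD has 1970 clips"
    return video_ids[:1200], video_ids[1200:1300], video_ids[1300:]

train, val, test = split_msvd(["vid%d" % i for i in range(1, 1971)])
```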

hhxxttxsh commented on May 24, 2024

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh
Hi,

As I suspected, the authors used extra data. Thank you for pointing out this trick.

My implementation got 28.2, slightly worse than but close to 29.2.

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

Hi,

I don't think there is such a huge difference between Caffe and TensorFlow.

Yes, TensorFlow is a fairly low-level framework, so if you don't need fine-grained control over your network, I suggest learning Keras, a higher-level framework built on top of TensorFlow; then you can go deeper into TensorFlow itself.
By the way, as of TensorFlow 1.0.0, Keras has already been integrated into TensorFlow.

I used Torch until six months ago. I like Torch more than TensorFlow, but Lua is a niche language, so I switched from Torch to TensorFlow.

hhxxttxsh commented on May 24, 2024

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

Hi,

Of course it is possible to implement S2VT with Keras. I also recommend a paper: https://www.aclweb.org/anthology/C/C16/C16-1005.pdf . It combines the S2VT model with an attention mechanism.
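For what it's worth, here is a minimal Keras sketch of an S2VT-style encoder-decoder (a simplified single-layer variant, not the paper's exact stacked two-LSTM scheme; all sizes and names are illustrative):

```python
from tensorflow.keras import layers, Model

T, FEAT, VOCAB, HID = 80, 4096, 10000, 512  # frames, fc7 dim, vocab size, hidden units

frames = layers.Input(shape=(T, FEAT), name="fc7_features")
captions = layers.Input(shape=(None,), dtype="int32", name="caption_tokens")

# Encoder LSTM reads the frame features in temporal order.
_, h, c = layers.LSTM(HID, return_state=True)(frames)

# Decoder LSTM generates the caption, initialized from the encoder's final state.
emb = layers.Embedding(VOCAB, HID)(captions)
dec = layers.LSTM(HID, return_sequences=True)(emb, initial_state=[h, c])
logits = layers.Dense(VOCAB)(dec)  # per-time-step vocabulary scores

model = Model([frames, captions], logits)
```

Training would pair the shifted caption tokens with a softmax cross-entropy loss over the logits, as in the usual sequence-to-sequence setup.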

hhxxttxsh commented on May 24, 2024

chenxinpeng commented on May 24, 2024

@hhxxttxsh

Hi,

That's OK. I'm still working on image and video captioning.

My e-mail: [email protected]

jozefmorvay commented on May 24, 2024

@hhxxttxsh @chenxinpeng Hi, both of you. Were you able to implement this in Keras? I will make my own attempt in the coming weeks, and I would appreciate a how-to guide, as I am completely new to DL.
