I’m currently a Master’s student at the Institute of Software, Chinese Academy of Sciences.
My recent research interests lie in the fields of:
- Computer Vision
- Multimodality
- Large Language Models
Obsidian, Zotero, Things, Notion, ...
[ICCV 2023] Accurate and Fast Compressed Video Captioning
Home Page: https://arxiv.org/abs/2309.12867
License: MIT License
I trained the model on a single Nvidia RTX 4090 using the default config settings. However, the results on the test set are significantly worse than those reported in the paper, e.g. CIDEr on the MSVD dataset drops from 113.0 to 101.5.
I also set the accumulation step to 32 in the config to match the batch size of 64 used in the paper, but it did not seem to help.
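For anyone checking a similar setup, the following is a minimal sketch of the arithmetic behind gradient accumulation. It assumes a per-step (micro) batch of 2, which is only inferred from 64 / 32 and is not stated in the post, and it uses a toy squared-error loss rather than the repository's training code. All function names are hypothetical. It shows that accumulated gradients match the large-batch gradient only when each micro-batch contribution is scaled by 1 / steps.

```python
def accumulation_steps_for(target_batch: int, micro_batch: int) -> int:
    """Number of accumulation steps needed to reach the target batch size."""
    if target_batch % micro_batch != 0:
        raise ValueError("target batch must be a multiple of the micro-batch")
    return target_batch // micro_batch


def sample_grad(w: float, x: float, y: float) -> float:
    """Gradient of the toy loss (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def full_batch_grad(w: float, data: list) -> float:
    """Mean gradient over the whole batch in a single step."""
    return sum(sample_grad(w, x, y) for x, y in data) / len(data)


def accumulated_grad(w: float, data: list, micro_batch: int) -> float:
    """Mean gradient built up from equally sized micro-batches."""
    steps = accumulation_steps_for(len(data), micro_batch)
    total = 0.0
    for i in range(steps):
        chunk = data[i * micro_batch:(i + 1) * micro_batch]
        chunk_grad = sum(sample_grad(w, x, y) for x, y in chunk) / len(chunk)
        total += chunk_grad / steps  # scale each micro-batch by 1 / steps
    return total


# Toy data only, 64 samples to mirror the batch size mentioned above.
data = [(float(i), 2.0 * i + 1.0) for i in range(64)]
print(accumulation_steps_for(64, 2))  # 32
print(full_batch_grad(0.5, data), accumulated_grad(0.5, data, 2))
```

If the gradients agree but the metric still differs, the gap more likely comes from something other than the effective batch size, such as the learning-rate schedule (which counts optimizer steps) or layers that depend on per-step batch statistics.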
I found that certain versions of CUDA and PyTorch cause segfaults, dynamic link library errors, and similar failures, which makes the code impossible to reproduce. I hope the author can provide details of a working environment (the versions of the required libraries).
Thank you.
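When reporting or comparing environments, a short version dump makes mismatches easier to spot. The sketch below uses only the Python standard library; the package names passed in are examples and are not taken from the repository's requirements.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect the Python, OS, and installed package versions into a dict."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Example package names only; adjust to the libraries actually in use.
for key, value in environment_report(["torch", "torchvision", "numpy"]).items():
    print(f"{key}: {value}")
```

The CUDA toolkit and driver versions are not Python packages, so they need to be reported separately.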
Thank you for sharing your code.
Could you please provide additional details regarding the inference speed calculation in Fig. 2 and Table 3? I am a bit confused.
Regarding Table 3, where the inference time for your model is listed as 178 ms, could you specify whether this is the time to generate a caption for a single video file?
Additionally, I would appreciate clarification on whether the time costs of IO operations and frame extraction are excluded from these calculations.
Lastly, the videos in MSRVTT have different numbers of frames, so how was this handled in Table 3? For your model, how many frames per video are considered?
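To make the question concrete, here is a minimal sketch of one way per-video latency could be measured with loading and inference timed separately. It is not the authors' measurement protocol. `load_fn` and `infer_fn` are hypothetical stand-ins for frame extraction and caption generation.

```python
import time
from statistics import mean


def time_per_video(videos, load_fn, infer_fn, warmup=1):
    """Return mean per-video IO and inference time in milliseconds."""
    # Warm-up runs are excluded so one-off setup cost does not skew the mean.
    for video in videos[:warmup]:
        infer_fn(load_fn(video))

    io_ms, infer_ms = [], []
    for video in videos:
        t0 = time.perf_counter()
        frames = load_fn(video)   # decoding / frame extraction
        t1 = time.perf_counter()
        infer_fn(frames)          # caption generation only
        t2 = time.perf_counter()
        io_ms.append((t1 - t0) * 1e3)
        infer_ms.append((t2 - t1) * 1e3)

    return {
        "io_ms": mean(io_ms),
        "infer_ms": mean(infer_ms),
        "total_ms": mean(io_ms) + mean(infer_ms),
    }


# Placeholder callables so the sketch runs without a model or video files.
def fake_load(path):
    return list(range(8))  # pretend 8 frames were sampled


def fake_infer(frames):
    return "a caption"


print(time_per_video(["a.mp4", "b.mp4", "c.mp4"], fake_load, fake_infer))
```

On a GPU the clock should only be read after the device has finished its work, for example by calling `torch.cuda.synchronize()` before each timestamp, otherwise asynchronous kernels make inference look faster than it is.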
I converted the videos using the method you provided. Some videos in a batch then failed to load (with num_workers=12, 4 workers succeeded and 8 failed). The failing videos are:
./dataset/msvd/videos_240_h264_keyint_60/Nd45qJn61Dw_0_10.avi
./dataset/msvd/videos_240_h264_keyint_60/5P6UU6m3cqk_57_75.avi
./dataset/msvd/videos_240_h264_keyint_60/PD6eQY7yCfw_32_37.avi
./dataset/msvd/videos_240_h264_keyint_60/77iDIp40m9E_159_181.avi
./dataset/msvd/videos_240_h264_keyint_60/9Wr48VFhZH8_45_50.avi
./dataset/msvd/videos_240_h264_keyint_60/HxRK-WqZ5Gk_30_50.avi
./dataset/msvd/videos_240_h264_keyint_60/UgUFP5baQ9Y_0_7.avi
./dataset/msvd/videos_240_h264_keyint_60/PqSZ89FqpiY_65_75.avi
I also converted these failing videos to mp4 but got the same error.
I wonder if there is something wrong with the MSVD dataset or cv_reader (I can train normally on MSRVTT).
Your help will be greatly appreciated.
Thanks for your work!
Could you please tell me how I can use the checkpoints you released?
Hello, thank you very much for publishing such high-quality code. When I run it on my own video dataset, the program's memory usage is very high, even though my workstation has 128 GB of RAM. This may of course be related to the size of my videos. Is there a way to reduce the memory usage by modifying the config?
Thanks for your work!
Could you upload the model's pretrained checkpoint file?
I want to test with the weights file to caption video input.
Thank you
Hello, thank you very much for your open-source code. I've been working on reproducing your code recently and applying it to my personal dataset. However, I've encountered an issue where the process gets stuck after completing one epoch, with no error reported. I hope to get your help. Thank you very much!
Dear sir, I am having trouble installing cv_reader because I do not have sudo permission and my Linux server cannot connect to the external network. Is there a suitable Docker image or some other way to prepare the environment? Thank you, sir!
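In the code you provided, it seems that no validation set is used. For example, your MSVD_caption.json file appears to contain only train and test splits. Isn't it somewhat unfair to select the best training checkpoint using the test set? None of the three datasets seems to be evaluated on a validation set. I look forward to your reply.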
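One common way to avoid selecting on the test set is to hold out part of the training videos and pick the checkpoint by its score there. The sketch below illustrates that idea only. It does not reflect how the repository's split files are organised, the function names are hypothetical, and the video IDs and scores are placeholders rather than real results.

```python
import random


def split_train_val(video_ids, val_fraction=0.1, seed=0):
    """Hold out a reproducible fraction of training videos for validation."""
    ids = sorted(video_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


def select_checkpoint(val_scores):
    """Return the epoch with the highest validation score."""
    return max(val_scores, key=val_scores.get)


# Placeholder IDs and scores, for illustration only.
train_ids, val_ids = split_train_val([f"vid{i}" for i in range(100)])
best_epoch = select_checkpoint({1: 0.10, 2: 0.30, 3: 0.20})
print(len(train_ids), len(val_ids), best_epoch)
```

The test set is then evaluated once, using only the checkpoint chosen on the held-out videos.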