Hi Pritam, thank you very much for your amazing work. I have some questions about the datasets used in this work. For the pretraining datasets (K400, AudioSet, and Kinetics-Sound), do you always use both the audio and visual streams, and do the videos always contain an audio stream? I am trying K400, and I found that some videos are missing the audio stream. In addition, for the downstream datasets such as UCF-101 and HMDB-51, do you use audio-visual pairs, or only the visual stream for evaluation? It seems that the video files in UCF-101 do not always contain an audio stream. Thank you very much.
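For reference, this is roughly how I checked which K400/UCF-101 files are missing audio: I parse ffprobe's JSON output and look for a stream with `codec_type == "audio"`. This is just a sketch; the helper names and the sample dictionaries below are my own, and `probe_file` assumes ffmpeg/ffprobe is installed.

```python
import json
import subprocess

def has_audio_stream(probe: dict) -> bool:
    """Return True if the ffprobe JSON output contains an audio stream."""
    return any(s.get("codec_type") == "audio" for s in probe.get("streams", []))

def probe_file(path: str) -> dict:
    """Run ffprobe on a video file and return its stream metadata as a dict.

    Assumes ffprobe (part of ffmpeg) is on the PATH.
    """
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-print_format", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

# Example ffprobe-style outputs (hand-written, for illustration):
video_only = {"streams": [{"codec_type": "video", "codec_name": "h264"}]}
audio_visual = {"streams": [{"codec_type": "video"}, {"codec_type": "audio"}]}

print(has_audio_stream(video_only))    # False
print(has_audio_stream(audio_visual))  # True
```

Running `has_audio_stream(probe_file(path))` over a dataset directory makes it easy to count how many clips lack audio.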
Hi, I am planning to do research based on your model. I found that in the paper you cited (Look, Listen and Learn), Kinetics-Sound has 34 classes, while 32 of those classes are used in your research. Could you provide the list of categories you used? Many thanks for considering my request.
Hello, I am planning to do research based on this model. Could you release the entire code? Would it be available within this month? Thank you in advance.