macil_sd's Introduction

MACIL_SD

[ACM MM 2022] Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Jiashuo Yu*, Jinyu Liu*, Ying Cheng, Rui Feng, Yuejie Zhang (* equal contribution)

Paper

Overview

Results

Our model achieves state-of-the-art results on the XD-Violence dataset while keeping the parameter count low.

| Method | Modality | AP (%) | Params |
| --- | --- | --- | --- |
| Ours (light) | Audio & Visual | 82.17 | 0.347M |
| Ours (full) | Audio & Visual | 83.40 | 0.678M |

XD-Violence Dataset & Features

The audio and visual features of the XD-Violence dataset can be downloaded at this link. Note that in this paper, only the RGB and VGGish features are required. You can download the RGB.zip, RGBTest.zip, and vggish-features.zip and unzip them into the data/ folder.
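
The unzip step can be scripted. The following is a minimal sketch, assuming the three archives have already been downloaded into one local folder. The helper name `extract_features` and the `download_dir` argument are hypothetical and not part of this repository, and the internal layout of the archives is not checked.

```python
import zipfile
from pathlib import Path

# Archive names as listed in the paragraph above.
ARCHIVES = ("RGB.zip", "RGBTest.zip", "vggish-features.zip")


def extract_features(download_dir, data_dir="data"):
    """Unzip each downloaded archive into data_dir.

    Returns the list of archive names that were not found in download_dir.
    """
    download_dir, data_dir = Path(download_dir), Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in ARCHIVES:
        archive = download_dir / name
        if not archive.is_file():
            # Record the gap instead of failing, so all problems show up at once.
            missing.append(name)
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)
    return missing
```

Calling `extract_features("downloads")` from the repository root would place the extracted files under `data/` and return any archive names still missing.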

Requirements

python==3.7.11  
torch==1.6.0  
cuda==10.1  
numpy==1.17.4

Note that the reported results are obtained by training on a single Tesla V100 GPU. We observe that different GPU types and torch/cuda versions can lead to slightly different results.
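
Since version differences can shift results, it may help to compare the running environment against the pinned versions before training. This is a minimal sketch; the helper names `installed_versions` and `mismatches` are hypothetical and not part of this repository, and the CUDA version is not checked here.

```python
import importlib
import sys

# Versions pinned in the Requirements list above (CUDA omitted).
PINNED = {"python": "3.7.11", "torch": "1.6.0", "numpy": "1.17.4"}


def installed_versions():
    """Collect versions from the running interpreter; missing packages map to None."""
    found = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in ("torch", "numpy"):
        try:
            version = importlib.import_module(name).__version__
            # Drop a local build suffix such as "+cu101" before comparing.
            found[name] = str(version).split("+")[0]
        except ImportError:
            found[name] = None
    return found


def mismatches(pinned, found):
    """Return {name: (pinned_version, found_version)} for every entry that differs."""
    return {
        name: (want, found.get(name))
        for name, want in pinned.items()
        if found.get(name) != want
    }
```

Calling `mismatches(PINNED, installed_versions())` returns an empty dict when everything matches the pinned versions.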

Training

python main.py --model_name=macil_sd

Testing

python infer.py --model_dir=macil_sd.pkl

Citation

If you find our work interesting and useful, please consider citing it.

@article{yu2022macil,
  title={Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection},
  author={Jiashuo Yu and Jinyu Liu and Ying Cheng and Rui Feng and Yuejie Zhang},
  journal={arXiv preprint arXiv:2207.05500},
  year={2022}
}  

License

This project is released under the MIT License.

Acknowledgements

The code is based on XDVioDet and RTFM. We sincerely thank the authors for their efforts. If you have further questions, please contact us at [email protected] and [email protected].

macil_sd's Issues

ValueError, probable mismatch in dataset

Question

Hi @JustinYuu, I hope this finds you well,
We downloaded the dataset from XD-Violence, as mentioned in the README, converted it into list files (attached below) using make_list.py in the list directory, and stored them in the appropriate places. When I then ran main.py unchanged from GitHub, I got a ValueError: cannot reshape array of size 408544 into shape (1047,1024) (screenshot attached for reference). I also printed the values of f_v and f_a (attached).

env used
python==3.6.9
torch==1.6.0
cuda==10.2
numpy==1.17.4

We also didn't change anything in option.py.
Please suggest what the issue could be; it would be of great help.

Attached screenshots: MACIL_SD_error, audio_test list, audio list, f_a shape after process_feat, f_v shape after process_feat, original shape of f_v, original shape of f_a, rgb_test list, rgb list.

![original shape of f_a](https://github.com/JustinYuu/MACIL_SD/assets/96651265/503530ff-1bdf-4ec3-9cb9-1b4e3e4d73e3)

Thank You!
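
As a quick sanity check on the numbers in the error message, the sketch below tests whether an element count can form whole rows of a given feature width. It only does the arithmetic and does not identify the root cause; the helper name `rows_if_reshapeable` is hypothetical and not part of the repository.

```python
def rows_if_reshapeable(size, feature_dim):
    """Return the row count if `size` elements form whole rows of `feature_dim`, else None."""
    rows, remainder = divmod(size, feature_dim)
    return rows if remainder == 0 else None


size = 408544          # element count reported in the ValueError
target = (1047, 1024)  # shape the reshape call asked for

# Elements the target shape needs: 1047 * 1024 = 1072128, far more than 408544.
print(target[0] * target[1])

# None here means the array cannot be viewed as (N, 1024) for any N,
# because 408544 is not a multiple of 1024.
print(rows_if_reshapeable(size, 1024))
```

Because the loaded array does not hold a whole number of 1024-dimensional rows, the file being read is probably not a 1024-dimensional feature file at all. One thing worth checking, as a guess only, is whether each list file points at the feature type the loader expects.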

about the code of the reference work "Pang et al."

Thanks for your excellent work and open-source code.
In the paper, you say "For the multimodal baseline [43], we remove the mutual loss and multimodal fusion modules and leverage the vanilla attention-based variant (‡) for comparison.", but I can't find released code for baseline [43], Pang et al., "Violence Detection in Videos Based on Fusing Visual and Audio Information."
Could you please share the link you found?
Thank you very much.

About visualizing the XD-Violence datasets.

Dear Author,

Thank you for sharing your code, the proposed solution is truly inspiring.

I am currently conducting research on weakly supervised video anomaly detection, similar to your published study "Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection", and am also using the XD-Violence dataset. After downloading the corresponding raw videos from the official XD-Violence website, I found that many video files were corrupted.

I am writing this letter to inquire whether you might have a complete set of raw videos for both the training and testing from the XD-Violence dataset.

I apologize for the disturbance and greatly appreciate your assistance.

Thank you very much.

Feature extraction

Hi @JustinYuu and @ljy6666666,

Thanks for the great work! The paper is very concise and easy to understand.

You've adopted the same feature extraction procedure as the other state-of-the-art models: 16-frame visual segments at 24 fps, and overlapped 960-ms audio segments with 96 × 64 bins. This means that the audio and visual segments do not match completely (only their end times are aligned).

What's the reasoning behind this? Is there any advantage in using overlapped audio segments over, e.g., using 24-frame visual segments at 25 fps and thus completely matching the time span of the VGGish features?
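
To make the timing in this question concrete, here is a small sketch of the segment-duration arithmetic, using only the frame counts, frame rates, and 960-ms window quoted above. The helper name `segment_seconds` is hypothetical.

```python
from fractions import Fraction


def segment_seconds(num_frames, fps):
    """Duration in seconds covered by num_frames consecutive frames at fps."""
    return Fraction(num_frames, fps)


VGGISH_WINDOW = Fraction(960, 1000)  # 960 ms audio segment

current = segment_seconds(16, 24)    # 16 frames at 24 fps -> 2/3 s
proposed = segment_seconds(24, 25)   # 24 frames at 25 fps -> 24/25 s

# The current visual segment is shorter than the audio window,
# while the proposed one spans exactly the same 0.96 s.
print(float(current), current == VGGISH_WINDOW)
print(float(proposed), proposed == VGGISH_WINDOW)
```

Exact fractions are used instead of floats so the equality checks are not affected by rounding.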
