justinyuu / macil_sd

[ACM MM 2022] Modality-aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

License: MIT License

Python 100.00%

macil_sd's Introduction

MACIL_SD

[ACM MM 2022] Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Jiashuo Yu*, Jinyu Liu*, Ying Cheng, Rui Feng, Yuejie Zhang (* equal contribution)

Paper

Overview

Results

Our model achieves state-of-the-art results on the XD-Violence dataset while keeping the parameter count low.

Method	Modality	AP (%)	Params
Ours (light)	Audio & Visual	82.17	0.347M
Ours (full)	Audio & Visual	83.40	0.678M

XD-Violence Dataset & Features

The audio and visual features of the XD-Violence dataset can be downloaded at this link. Note that in this paper, only the RGB and VGGish features are required. You can download the RGB.zip, RGBTest.zip, and vggish-features.zip and unzip them into the data/ folder.
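The unzip step can be scripted. The following is a minimal sketch, not part of the repository: the archive names are taken from the paragraph above, while the function name `extract_archives`, the `download_dir` argument, and the skip-if-missing behaviour are illustrative assumptions. It also assumes that the directory layout inside each archive is what the list files expect once extracted under data/.

```python
import zipfile
from pathlib import Path

# Archive names come from the README; everything else here is illustrative.
ARCHIVES = ("RGB.zip", "RGBTest.zip", "vggish-features.zip")


def extract_archives(download_dir, data_dir="data", archives=ARCHIVES):
    """Unzip each archive found in download_dir into data_dir.

    Returns the names of the archives that were extracted. Missing
    archives are skipped with a message instead of raising an error.
    """
    download_dir = Path(download_dir)
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    for name in archives:
        archive = download_dir / name
        if not archive.is_file():
            print(f"skipping {name}: not found in {download_dir}")
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)
        extracted.append(name)
    return extracted
```

For example, `extract_archives("downloads")` would unpack whichever of the three archives are present in a hypothetical downloads/ folder into data/ and return their names.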

Requirements

python==3.7.11  
torch==1.6.0  
cuda==10.1  
numpy==1.17.4

Note that the reported results are obtained by training on a single Tesla V100 GPU. We observe that different GPU types and torch/cuda versions can lead to slightly different results.

Training

python main.py --model_name=macil_sd

Testing

python infer.py --model_dir=macil_sd.pkl

Citation

If you find our work interesting and useful, please consider citing it.

@article{yu2022macil,
  title={Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection},
  author={Jiashuo Yu and Jinyu Liu and Ying Cheng and Rui Feng and Yuejie Zhang},
  journal={arXiv preprint arXiv:2207.05500},
  year={2022}
}

License

This project is released under the MIT License.

Acknowledgements

The code is based on XDVioDet and RTFM. We sincerely thank the authors for their efforts. If you have further questions, please contact us at [email protected] and [email protected].

macil_sd's Issues

Does the single-modality model use only visual features?

Why is the MA-CIL loss value so large? Isn't this loss supposed to be better the closer it is to 0?

ValueError, probable mismatch in dataset

Question

Hi @JustinYuu, I hope this finds you well,
We downloaded the dataset from XD-Violence, as mentioned in the README, converted it into list files (attached below) using make_list.py in the list directory, and stored them in the appropriate places. When I then ran the code (main.py) exactly as it is on GitHub, I got a ValueError: cannot reshape array of size 408544 into shape (1047,1024) (screenshot attached for reference). I also tried printing the values of f_v and f_a (attached).

Environment used:
python==3.6.9
torch==1.6.0
cuda==10.2
numpy==1.17.4

We also didn't change anything in option.py.
Please suggest what the issue could be; it would be a great help.

![original shape of f_a](https://github.com/JustinYuu/MACIL_SD/assets/96651265/503530ff-1bdf-4ec3-9cb9-1b4e3e4d73e3)

Thank You!
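The numbers in the error message can be checked directly. The sketch below is only a sanity check on the arithmetic, not a diagnosis of the cause; the helper name `check_reshape` is hypothetical, and the only inputs are the size and shape quoted in the ValueError above.

```python
def check_reshape(size, shape):
    """Compare a flat element count against the element count of `shape`."""
    expected = 1
    for dim in shape:
        expected *= dim
    # How many full rows of the last dimension fit, and what is left over.
    full_rows, leftover = divmod(size, shape[-1])
    return {
        "actual": size,
        "expected": expected,
        "fits": size == expected,
        "full_rows": full_rows,
        "leftover": leftover,
    }


# Values copied from the error message in this issue.
report = check_reshape(408544, (1047, 1024))
print(report)
```

A non-zero `leftover` means the loaded array is not a whole number of 1024-dimensional rows, so it may be worth confirming which feature file each entry in the generated list files actually points to.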

About the code of the reference work "Pang et al."

Thanks for your excellent work and open-source code.
In the paper, you say "For the multimodal baseline [43], we remove the mutual loss and multimodal fusion modules and leverage the vanilla attention-based variant (‡) for comparison.", but I can't find the released code for baseline [43], Pang et al., "Violence Detection in Videos Based on Fusing Visual and Audio Information."
Could you please share the link you found?
Thank you very much.

About visualizing the XD-Violence dataset

Dear Author,

Thank you for sharing your code, the proposed solution is truly inspiring.

I am currently conducting research on weakly supervised video anomaly detection, similar to your published study "Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection", and I am also using the XD-Violence dataset. After downloading the corresponding raw videos from the official XD-Violence website, I found that many video files were corrupted.

I am writing to ask whether you might have a complete set of raw videos for both the training and test splits of the XD-Violence dataset.

I apologize for the disturbance and greatly appreciate your assistance.

Thank you very much.

Visualization

Feature extraction

Hi @JustinYuu and @ljy6666666,

Thanks for the great work! The paper is very concise and easy to understand.

You've adopted the same feature extraction procedure as the other state-of-the-art models: 16-frame visual segments at 24 fps, and overlapped 960-ms audio segments with 96 × 64 bins. This means that the audio and visual segments do not match completely (only their end times are aligned).

What's the reasoning behind this? Is there any advantage in using overlapped audio segments over, e.g., using 24-frame visual segments at 25 fps and thus completely matching the time span of the VGGish features?
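To make the mismatch concrete, the segment durations can be worked out from the figures quoted above. This is a small arithmetic sketch: the helper name `segment_seconds` is hypothetical, and the overlap value assumes one audio segment is extracted per visual segment, which the question implies but does not state.

```python
def segment_seconds(num_frames, fps):
    """Duration in seconds of a clip of num_frames sampled at fps."""
    return num_frames / fps


VGGISH_SECONDS = 0.96  # 960 ms audio segment, as quoted in the question

current = segment_seconds(16, 24)   # 16-frame segments at 24 fps
proposed = segment_seconds(24, 25)  # 24-frame segments at 25 fps

# Assumption: one audio segment per visual segment, so consecutive audio
# segments start `current` seconds apart and overlap by the difference.
overlap = VGGISH_SECONDS - current

print(f"16 frames @ 24 fps: {current:.4f} s")
print(f"24 frames @ 25 fps: {proposed:.4f} s")
print(f"overlap between neighbouring audio segments: {overlap:.4f} s")
```

Under these assumptions, the current visual segment covers about two thirds of a second, while the proposed 24-frame segment at 25 fps covers the same 0.96 s as one VGGish segment.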

How to reproduce the accuracy in the paper

The best result I get is 81.1 with mode mix2. How can I reproduce the accuracy reported in the paper?