taoruijie / talknet-asd

ACM MM 2021: 'Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection'

License: MIT License

Topics: active-speaker-detection, audio-visual, multimedia, awesome-asd

talknet-asd's Introduction

Is someone talking? TalkNet: Audio-Visual Active Speaker Detection Model

This repository contains the code for our ACM MM 2021 (oral) paper, TalkNet, an active speaker detection model that detects whether the face on screen is speaking or not. [Paper] [Video_English] [Video_Chinese].

Updates:

A new demo page. Thanks to mvoodarla for the contribution!

[Figure: overall.png]

  • Awesome ASD: Papers about active speaker detection in recent years.

  • TalkNet in AVA-ActiveSpeaker dataset: The code to preprocess the AVA-ActiveSpeaker dataset, train TalkNet on the AVA train set, and evaluate it on the AVA val/test sets.

  • TalkNet in TalkSet and Columbia ASD dataset: The code to generate TalkSet, an ASD dataset in the wild based on VoxCeleb2 and LRS3, train TalkNet on TalkSet, and evaluate it on the Columbia ASD dataset.

  • An ASD Demo with pretrained TalkNet model: An end-to-end script to detect and mark the speaking face using the pretrained TalkNet model.


Dependencies

Start by building the environment:

conda create -n TalkNet python=3.7.9 anaconda
conda activate TalkNet
pip install -r requirement.txt

Or start from an existing environment:

pip install -r requirement.txt

TalkNet in AVA-ActiveSpeaker dataset

Data preparation

The following script can be used to download and prepare the AVA dataset for training.

python trainTalkNet.py --dataPathAVA AVADataPath --download 

AVADataPath is the folder where the AVA dataset and its preprocessing outputs will be saved; the details can be found here. Please read them carefully.

Training

Then you can train TalkNet on AVA end-to-end by using:

python trainTalkNet.py --dataPathAVA AVADataPath

  • exps/exps1/score.txt: output score file
  • exps/exp1/model/model_00xx.model: trained model
  • exps/exps1/val_res.csv: predictions for the val set

Pretrained model

Our pretrained model achieves 92.3 mAP on the validation set; you can verify this by using:

python trainTalkNet.py --dataPathAVA AVADataPath --evaluation

The pretrained model will automatically be downloaded to TalkNet_ASD/pretrain_AVA.model. It achieves 90.8 mAP on the test set.


TalkNet in TalkSet and Columbia ASD dataset

Data preparation

We find that it is challenging to apply the model trained on AVA to videos outside AVA (the reason is here, Q3.1). So we built TalkSet, an active speaker detection dataset in the wild, based on VoxCeleb2 and LRS3.

We do not plan to upload this dataset since we only modified existing data rather than collecting it. In the TalkSet folder we provide .txt files describing which files we used to generate TalkSet and their ASD labels. You can generate TalkSet yourself if you are interested in training an ASD model in the wild.

We also provide our TalkNet model pretrained on TalkSet. You can evaluate it on the Columbia ASD dataset or on other raw videos in the wild.

Usage

A model pretrained on TalkSet will be downloaded to TalkNet_ASD/pretrain_TalkSet.model when you run the following script:

python demoTalkNet.py --evalCol --colSavePath colDataPath

The Columbia ASD dataset and its labels will also be downloaded into colDataPath. Finally you will get the following F1 results:

Name  Bell   Boll   Lieb   Long   Sick   Avg.
F1    98.1   88.8   98.7   98.0   97.7   96.3

(This result differs slightly from the one in our paper because we retrained the model; the average F1 is very similar.)


An ASD Demo with pretrained TalkNet model

Data preparation

We built an end-to-end script to detect and extract the active speaker from a raw video using our model pretrained on TalkSet.

You can put the raw video (.mp4 and .avi both work) into the demo folder, e.g. 001.mp4.

Usage

python demoTalkNet.py --videoName 001

A model pretrained on TalkSet will be downloaded to TalkNet_ASD/pretrain_TalkSet.model. The structure of the output results can be found here.

You will get the output video demo/001/pyavi/video_out.avi, in which active speakers are marked with green boxes and non-active speakers with red boxes.

If you want to evaluate using the CPU only, modify the demoTalkNet.py and talkNet.py files: change every cuda to cpu, then replace line 83 in talkNet.py with loadedState = torch.load(path, map_location=torch.device('cpu')).
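For reference, a minimal sketch of a device-agnostic variant (not the repository's exact code): it falls back to CPU automatically, and the same idea applies wherever the scripts call .cuda().

    import torch

    # Pick CPU when CUDA is unavailable; apply the same change wherever
    # the scripts move tensors or modules with .cuda().
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    path = 'pretrain_TalkSet.model'        # checkpoint path as used in talkNet.py
    loadedState = torch.load(path, map_location=device)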


Citation

Please cite the following if our paper or code is helpful to your research.

@inproceedings{tao2021someone,
  title={Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection},
  author={Tao, Ruijie and Pan, Zexu and Das, Rohan Kumar and Qian, Xinyuan and Shou, Mike Zheng and Li, Haizhou},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  pages = {3927--3935},
  year={2021}
}

I have summarized some potential FAQs. You can also check the issues on GitHub for other questions I have answered.

This is my first open-source work; please let me know how I can further improve this repository or if there is anything wrong in our work. Thanks for your support!

Acknowledgements

We studied many useful projects during our coding process, including:

The structure of the project layout and the audio encoder are learned from this repository.

The demo for visualization is modified from this repository.

The AVA data download code is learned from this repository.

The model for the visual frontend is learned from this repository.

Thanks to these authors for open-sourcing their code!

Cooperation

If you are interested in working on this topic and have ideas to implement, I am glad to collaborate and contribute my experience and knowledge in this area. Please contact me at [email protected].


talknet-asd's Issues

Minimum length of the audio and video feature

At line 233 in demoTalkNet.py, should it be:
length = min((audioFeature.shape[0] - audioFeature.shape[0] % 4) / 100, videoFeature.shape[0]/25)
since the first term computes the length of the audio in seconds?

Thanks!
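A hypothetical illustration of the units in question, assuming ~100 audio feature frames per second (10 ms hop) and 25 video fps, i.e. a 4:1 ratio; the numbers are made up:

    # Illustrative numbers only: align audio (100 frames/s) and video (25 fps).
    audio_frames = 407                                    # MFCC frames for one track
    video_frames = 101                                    # face-crop frames, same track
    audio_sec = (audio_frames - audio_frames % 4) / 100   # trim to a multiple of 4, then seconds
    video_sec = video_frames / 25                         # seconds
    length = min(audio_sec, video_sec)                    # 4.04 s for both terms here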

face detection models

First, I would like to thank you for making this amazing work and code publicly available. This work will definitely help my future research.

I have a minor question about the face detection model: I am wondering whether a better or even state-of-the-art face detector could improve the demo's performance. Would it be compatible with TalkNet-ASD if I used a different face detection model?

thank you!

FileNotFoundError After Running python demoTalkNet.py --videoName 001 Command

Thank you for sharing your work.

I tried an ASD Demo with pretrained TalkNet model.
After I ran the command "python demoTalkNet.py --videoName 001",
I got the message "FileNotFoundError: [Errno 2] No such file or directory: 'demo\001\pycrop\demo\001\pycrop\00000.wav'"

Note: I used CPU only, so I had changed the code in demoTalkNet.py and talkNet.py by replacing all 'cuda' with 'cpu' and replaced line 83 in talkNet.py with loadedState = torch.load(path, map_location=torch.device('cpu')); then I ran the command "python demoTalkNet.py --videoName 001".

Question about download

When I download and preprocess the data, I come across a bunch of warnings:
"[h264 @ 0x55d29a46d080] mmco: unref short failure"

Is that okay?

Question about window length and hop size for spectrogram

Hi!
Thanks for the good repo.

I wonder why you adjust the window and hop size based on the fps value?

Of course, I also think this is not a critical issue, and it is a fine choice.

But from an audio point of view, even if the fps changes every time, the audio is recorded at a fixed sampling rate such as 16 kHz.

So I think we could convert the audio with just a fixed window length (e.g. 25 ms) and hop length (e.g. 16 ms).

Thanks for reading my question.
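For what it's worth, a sketch of the fixed-window alternative described above, using python_speech_features (the 25 ms / 10 ms values are the common defaults, not necessarily what the repository uses):

    import numpy as np
    from python_speech_features import mfcc

    sr = 16000
    signal = np.random.randn(sr * 3)               # dummy 3 s clip at 16 kHz
    # Fixed window/hop regardless of the video fps:
    feats = mfcc(signal, samplerate=sr, winlen=0.025, winstep=0.010, numcep=13)
    print(feats.shape)                             # roughly (300, 13): ~100 frames/s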

"Out of memory" issue

Thanks a lot for sharing your excellent work with us. Your code is really friendly and the downloading process is painless.

Here's the problem I encountered, while training talknet end-to-end with the command below:
python trainTalkNet.py --dataPathAVA AVADataPath
I got such error message:
RuntimeError: CUDA out of memory. Tried to allocate 1.87 GiB (GPU 0; 23.65 GiB total capacity; 6.17 GiB already allocated; 1.52 GiB free; 20.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
It seems the 24 GB memory of my Titan RTX is not sufficient for the training task. Since I am still new to the code, I have no idea where it can be revised to reduce GPU memory usage. Would you mind telling me where to find the relevant code?

Missing License file

Hi @TaoRuijie, we love the work you have done and that you posted it as open source. Could you also add a license file to the repository stating the project's open-source license? MIT would be great.

Thanks!

Generation of TalkSet/lists_in

As I want to generate TalkSet with different data from VoxCeleb2 and LRS3, I wonder if you could provide the code to generate the three lists (Vox_list.txt, LRS3_S_list.txt, LRS3_ST_list.txt). Thank you very much!

Tips to increase batch size

Hi,

Thanks for the clean code base and the paper. I was trying to replicate your training and found that I can use a maximum batch size of 1500 (beyond which I face a CUDA OOM error on a V100 GPU). Did you use, or can you recommend, any tips or hacks for reaching the batch size of 2500 used to train your model?

Thanks in advance!

High training loss on the TalkSet dataset

Hello, when you trained on the TalkSet dataset, was the training loss high? I found that during training my loss only drops to 0.8 at its lowest, while on other datasets it usually drops to 0.2. Is this normal?

About real-time detection

Hi @TaoRuijie,
Thanks for your great work!
I am trying to use TalkNet with a webcam, and I notice you have mentioned that this is possible (#7).
Unfortunately, I cannot figure out what you meant in that issue. Could you please give me some suggestions on how to modify the code in demoTalkNet.py?
By the way, my idea is to feed the input stream into the model every 2 seconds, though this would introduce a 2-second delay. What do you think about it?
Thanks for your attention!

ASD confidence/score

Hi there,

Thank you for releasing your code and models, very impressive results! :)

I am evaluating your TalkNet demo model on a new dataset. I now need to draw a precision-recall curve, so I was expecting to get something like a per-frame ASD confidence value in the range [0; 1] so that I can gradually vary the confidence threshold from 0% to 100% and compute a pair of precision and recall rates at each step.
Instead, TalkNet provides per-frame scores that seem to range from roughly -3 to +3, without clear max or min boundaries, where positive values indicate active frames and negative values indicate silent frames.

From my understanding, the last FC layer is followed by a softmax operation so shouldn't the output be expressed in terms of [0; 1] confidence?
Is there a way to convert the output score into a confidence? I was thinking of simply applying a Sigmoid function to the output score, but perhaps I am missing something.

Thank you again for your work, looking forward to hearing back from you!
Davide
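For other readers: a sketch of the monotonic mapping proposed above. Since the sigmoid preserves the ordering of scores, sweeping a threshold over these confidences yields the same precision-recall curve as thresholding the raw scores directly.

    import torch

    scores = torch.tensor([-2.8, -0.4, 0.1, 2.6])   # example per-frame demo scores
    confidence = torch.sigmoid(scores)              # squashed into (0, 1)
    print(confidence)                               # tensor([0.0573, 0.4013, 0.5250, 0.9309])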

Warning while evaluating on test set

Hi,

I encountered the following warning while running your code with my model. I searched but did not find any NFFT in your codebase. Is this something I can safely ignore, or can it degrade test performance?

WARNING:root:frame length (556) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.

No such file or directory: 'demo\\001\\pycrop\\demo\\001\\pycrop\\00000.wav'

Hello, I would like to ask what is wrong here. The problem is reported as follows:
PS D:\Pycharm\TalkNet-ASD-main> python demoTalkNet.py --videoName 001
2023-03-06 16:19:46 Extract the video and save in demo\001\pyavi\video.avi
2023-03-06 16:19:46 Extract the audio and save in demo\001\pyavi\audio.wav
2023-03-06 16:19:47 Extract the frames and save in demo\001\pyframes
VideoManager is deprecated and will be removed.
base_timecode argument is deprecated and has no effect.
demo\001\pyavi\video.avi - scenes detected 1
2023-03-06 16:19:48 Scene detection and save in demo\001\pywork
2023-03-06 16:20:09 Face detection and save in demo\001\pywork
2023-03-06 16:20:09 Face track and detected 8 tracks
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00, 1.57it/s]
2023-03-06 16:20:14 Face Crop and saved in demo\001\pycrop tracks
Model pretrain_TalkSet.model loaded from previous state!
0%| | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
File "demoTalkNet.py", line 458, in
main()
File "demoTalkNet.py", line 444, in main
scores = evaluate_network(files, args)
File "demoTalkNet.py", line 218, in evaluate_network
_, audio = wavfile.read(os.path.join(args.pycropPath, fileName + '.wav'))
File "D:\ProgramData\Anaconda3\envs\FYP\lib\site-packages\scipy\io\wavfile.py", line 647, in read
fid = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'demo\001\pycrop\demo\001\pycrop\00000.wav'

dataset

Thank you for sharing your code and models.
Could you share your Google Drive link?
Thanks very much!

Multi-GPU training

Can this code be trained on multiple GPUs? A single card's memory is too small, so I want to train on two 2080 Ti cards at the same time. I modified the model-loading part of the code as below, but it has no effect.

      class talkNet(nn.Module):
          def __init__(self, lr = 0.0001, lrDecay = 0.95, **kwargs):
              super(talkNet, self).__init__()
              #self.model = talkNetModel().cuda()
              # modified part
              self.model = talkNetModel()
              self.model = nn.DataParallel(self.model, device_ids=[0,1])
              self.model = self.model.cuda()
              self.model = self.model.module

              self.lossAV = lossAV().cuda()
              self.lossA = lossA().cuda()
              self.lossV = lossV().cuda()
              self.optim = torch.optim.Adam(self.parameters(), lr = lr)
              self.scheduler = torch.optim.lr_scheduler.StepLR(self.optim, step_size = 1, gamma=lrDecay)
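A likely reason the snippet has no effect: the line self.model = self.model.module immediately unwraps the DataParallel module again, so everything still runs on one GPU. A minimal sketch of keeping the wrapper (with a stand-in module, since this is illustrative rather than the repository's code):

    import torch
    import torch.nn as nn

    net = nn.Linear(128, 2)                     # stand-in for talkNetModel()

    # Keep the DataParallel wrapper; do not reassign model = model.module.
    model = nn.DataParallel(net, device_ids=[0, 1]).cuda()
    out = model(torch.randn(16, 128).cuda())    # the batch is split across both GPUs
    state = model.module.state_dict()           # unwrap only to save/load weights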

Talk Dataset

As I am working in mainland China, it is not easy to download with the given script:
python trainTalkNet.py --dataPathAVA AVADataPath --download

It seems the script does not check whether the related files already exist; each time it just re-downloads every file. Is there a better way to fix this?

dataset

Is there a way to deal with the slow dataset download inside China?
Also, def extract_audio in tool.py cannot extract the audio files.

What causes the gap in the TalkSet labels?

Thank you for your work and the very detailed explanation!
I have downloaded VoxCeleb2 & LRS3, used your generate_TalkSet.py code, and successfully generated TalkSet! But there seems to be a difference between the labels in the TAudio.txt I generated and the same file in lists_out. Does this gap have a big impact on training an 'in the wild' model?
My labels:
TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.05 0 5.05 0 0
TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.63 0 5.63 0 0
TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 5.95 0 5.95 0 0
Your labels:
TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.12 0 5.12 0 0
TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.69 0 5.69 0 0
TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 6.01 0 6.01 0 0

About downloading the pretrained model

Hello, after running the command as you described, the code reports the following error:

(pytorch) D:\project_code_home\V-A\TalkNet_ASD>python trainTalkNet.py --dataPathAVA AVA --evaluation
Traceback (most recent call last):
File "trainTalkNet.py", line 93, in <module>
main()
File "trainTalkNet.py", line 46, in main
visualPath=os.path.join(args.visualPathAVA, 'train'), **vars(args))
File "D:\project_code_home\V-A\TalkNet_ASD\dataLoader.py", line 95, in __init__
mixLst = open(trialFileName).read().splitlines()
FileNotFoundError: [Errno 2] No such file or directory: 'AVA\csv\train_loader.csv'

I tried both relative and absolute paths for dataPath, but it still fails to download. I am running on Windows; downloading the pretrained model should not depend on whether my computer has a GPU, right? Also, my network can reach Google in the browser, and downloading files from GitHub works fine.

Looking forward to your reply, thanks!

Could you please explain the detailed meaning of the csv file?

In test_loader.csv I saw many arrays that look like [1,1,1,1,1,1,1,1,1], each followed by a big number like 21231.

There are also many ids, such as video id, instance id, entity id, and label id.

I guess the video id is the id associated with each related video clip, and entity_box_x,y are the face bounding-box coordinates, but what are the meanings of the other ids?
Could you please give more details?

Really, thanks!

Evaluation score is 43.0% mAP

Hi.

First, thank you for publishing your source code.

I ran your code to check that it is working.

I had some trouble when downloading and preprocessing the datasets (some files were corrupted during downloading), but after re-downloading the problem was gone.

After preprocessing, I ran the evaluation code with your pre-trained model, but I got a 43.09% score on the val set. For the test set I got 100% (but I found an issue page about the test-set score, so that is okay).

So what is happening in my case?
Could the dataset be corrupted? I randomly checked video files, and there seems to be no problem with the video and audio.
Or is it a library version issue? I set up the environment with your requirements file.
I can't guess any reason for this problem.

Can you make a guess about this situation?

Thank you for reading, and I apologize for my poor English.

"List index out of range" error during evaluation

Thanks for sharing your excellent work! Your code is really friendly and easy to use.

While evaluating the performance of the pre-trained model with the command: python trainTalkNet.py --dataPathAVA AVADataPath --evaluation.

For context, I downloaded the required data but hadn't trained a model; I guess that's not necessary since we are using the downloaded pre-trained model here.

Then I got the following error message:

Downloading...
From: https://drive.google.com/uc?id=1NVIkksrD3zbxbDuDbPc_846bLfPSZcZm
To: /home/xiangdong/experiment/TalkNet_ASD/pretrain_AVA.model
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.2M/63.2M [00:20<00:00, 3.10MB/s]
12-23 12:26:36 Model para number = 15.01
Model pretrain_AVA.model loaded from previous state!
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8015/8015 [14:50<00:00, 9.00it/s]
Traceback (most recent call last):
File "trainTalkNet.py", line 86, in
main()
File "trainTalkNet.py", line 52, in main
mAP = s.evaluate_network(loader = valLoader, **vars(args))
File "/home/xiangdong/experiment/TalkNet_ASD/talkNet.py", line 75, in evaluate_network
mAP = float(str(subprocess.run(cmd, shell=True, capture_output =True).stdout).split(' ')[2][:5])
IndexError: list index out of range

100% mAP on test set

I evaluated the performance on the test set with the command below:
python trainTalkNet.py --dataPathAVA AVADataPath --evaluation --evalDataType test

And the result was:

03-23 11:28:11 Model para number = 15.01
Model pretrain_AVA.model loaded from previous state!
45%|████████████████████████████████████████████████████████████████████████▊ | 9663/21361 [12:25<19:26, 10.03it/s]WARNING:root:frame length (556) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
...
21361/21361 [25:31<00:00, 13.94it/s]
mAP 100.00%

Obviously that's too good to be true. I tried different models, but the result was the same. I wonder how to fix it.
BTW: the evaluation result on the validation set was correct.

Overlapped speech performance

Hi,
I checked this with a video in which two speakers speak at the same time. It can detect only one active speaker at a time.

About the test set

When I evaluate on the test set, the result is always mAP = 100%, while evaluating on the val set works fine. Has the author encountered this problem? I also noticed that in test_orig.csv all the labels are SPEAKING_AUDIBLE; I wonder if this is related.

Below is the output from the run:

WARNING:root:frame length (556) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
...
100%|███████████████████████████████████████████████████████████████████████████████████████████| 21361/21361 [1:07:15<00:00, 5.29it/s]

mAP 100.00%

how to download the sfd_face.pth

Thanks for sharing your excellent work! Your code is really friendly and easy to use.

While evaluating the performance of the pre-trained model with the command: python demoTalkNet.py --videoName 001

Then I got the following error message:
demo\001\pyavi\video.avi - scenes detected 1
2021-12-25 21:13:14 Scene detection and save in demo\001\pywork
Traceback (most recent call last):
File "demoTalkNet.py", line 458, in
main()
File "demoTalkNet.py", line 421, in main
faces = inference_video(args)
File "demoTalkNet.py", line 97, in inference_video
DET = S3FD(device='cuda')
File "D:\python project\TalkNet_ASD-main\model\faceDetector\s3fd_init_.py", line 27, in init
state_dict = torch.load(PATH, map_location=self.device)
File "D:\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 594, in load
with _open_file_like(f, 'rb') as opened_file:
File "D:\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "D:\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 211, in init
super(_open_file, self).init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'D:\python project\TalkNet_ASD-main\model/faceDetector/s3fd/sfd_face.pth'

How can I download sfd_face.pth?

Bug in audio loss computation?

Hi @TaoRuijie,

You have an audio classification loss in

nlossA = self.lossA.forward(outsA, labels)

and the labels you use there are the ASD labels (which basically say whether the voice matches the face or not). However, it is not possible to predict from audio alone which face is speaking, since there is no concept of speaker classes that is uniform or consistent throughout the dataset. Shouldn't the labels be voice-activity labels instead, i.e. one should first check whether, for a particular frame, any of the visible faces is an active speaker, and if so assign the frame a label of 1, and 0 otherwise?
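A small sketch of the relabeling proposed above (tensor names are hypothetical): collapse the per-face ASD labels into frame-level voice-activity labels, marking a frame as 1 if any visible face is speaking.

    import torch

    # face_labels: (numFaces, numFrames) binary ASD labels for one clip.
    face_labels = torch.tensor([[0, 1, 1, 0],
                                [0, 0, 1, 1]])
    # Frame-level voice activity: 1 if ANY face is an active speaker.
    vad_labels = face_labels.max(dim=0).values   # tensor([0, 1, 1, 1])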

missing file sfd_face.pth

Line 9 of 'TalkNet_ASD/model/faceDetector/s3fd/__init__.py' references the file 'sfd_face.pth', which does not exist in 'model/faceDetector/s3fd/'.

Did I fail to find it in this project, or is it just missing?

TalkSet data

Thank you for doing such an excellent job. I currently want to train a model on TalkSet and have already processed VoxCeleb2 and LRS3 following the provided generate_TalkSet code, but I lack the dataloader code for this part. Could you please share it?

Problem testing on TalkSet

Hello, I generated TalkSet according to generate_TalkSet and used the TalkSet pre-trained model you provided to test on the TalkSet validation set, but I found that in my code the mAP is only 55%, and I am very confused about this. I set the TAudio label to 1 and the rest to 0; is that how it should be set up? Would it be convenient to release the TalkSet dataloader?

ONNX export issue

Hi author,

I successfully ran your pretrained model on a custom video and it shows really good performance!
When I tried to export it to an ONNX model and run it, I got an error saying: Try match Slice_Slice_704:out0 failed, catch exception!

I think there is something wrong with my forward function, as shown below:
    def forward(self, audio, video):
        audio = audio.unsqueeze(1).transpose(2, 3)
        audio = self.audioEncoder(audio)

        B, T, W, H = video.shape
        video = video.view(B*T, 1, 1, W, H)
        video = (video / 255 - 0.4161) / 0.1688
        video = self.visualFrontend(video)
        video = video.view(B, T, 512)
        video = video.transpose(1, 2)
        video = self.visualTCN(video)
        video = self.visualConv1D(video)
        video = video.transpose(1, 2)

        audioEmbed = self.crossA2V(src = audio, tar = video)
        visualEmbed = self.crossV2A(src = video, tar = audio)

        outsAV = torch.cat((audioEmbed, visualEmbed), 2)
        outsAV = self.selfAV(src = outsAV, tar = outsAV)
        outsAV = torch.reshape(outsAV, (-1, 256))

        outsAV = outsAV.squeeze(0)
        outsAV = self.FC(outsAV)

        return outsAV

I am not sure this is correct. Would it be possible to provide the pretrained model in ONNX format?
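For completeness, a sketch of an export call against a forward(audio, video) signature like the one above. The shapes and names are illustrative, not the repository's exact ones, and the Slice error mentioned above may still require a different opset or runtime.

    import torch

    # 'model' is assumed to be the modified TalkNet module above, in eval mode.
    audio = torch.randn(1, 100, 13)            # (batch, audio frames, MFCC)
    video = torch.randn(1, 25, 112, 112)       # (batch, video frames, H, W)
    torch.onnx.export(
        model, (audio, video), 'talknet.onnx',
        input_names=['audio', 'video'], output_names=['scores'],
        dynamic_axes={'audio': {1: 'T_audio'}, 'video': {1: 'T_video'}},
        opset_version=12,
    )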

question

File "trainTalkNet.py", line 86, in
main()
File "trainTalkNet.py", line 35, in main
loader = train_loader(trialFileName = args.trainTrialAVA,
File "D:\python project\TalkNet_ASD-main\dataLoader.py", line 97, in init
sortedMixLst = sorted(mixLst, key=lambda data: (int(float(data.split('\t')[1])), int(float(data.split('\t')[-1]))),reverse=True)
File "D:\python project\TalkNet_ASD-main\dataLoader.py", line 97, in
sortedMixLst = sorted(mixLst, key=lambda data: (int(float(data.split('\t')[1])), int(float(data.split('\t')[-1]))),reverse=True)
ValueError: could not convert string to float: '20467",
请问这是什么原因,data格式是正确的

Question Regrading Online Inference

Hello,

I have been working on investigating online ASD inference, and I have been looking at the demo code.
If we are processing frame by frame in a controlled setting, what are the purpose and benefit of using scene detection?

Thank you very much!
Hiro

Dataloader of Talkset

Hello, will you open-source the dataloader you used when training on the synthesized TalkSet? Thank you!

h264 mmco: unref short failure

When working with extract_video_clips, the warning [h264 @ 0000023c835c4800] mmco: unref short failure appears at very high frequency.
Should I be concerned about this warning?

Getting NaN values for prediction

Hello everyone,

Thank you so much for sharing the code; it was very helpful. I was using the pre-trained model on a video, and I can tell that it smoothly detects all faces, but it highlights them all in red (not speaking) and doesn't show a number for the prediction; it simply shows NaN. Do you know what could be wrong?

Image attached: https://i.imgur.com/ssgjjP0.png

Is it possible to use different fps and sample rate than 25 and 16000

Hi, first of all, thank you for providing the code. I found it very helpful and have tested it on a set of YouTube videos with high-quality results.

One thing I am wondering is whether there is a minimal change that would let the code run inference at an arbitrary fps and sample rate, or is the model trained in a way that it can only run on a certain combination of (fps, sample rate)?

audio input size

Thanks for the contribution. From your paper, the input image size is 128x128 and the MFCC dimension is 13. What is the size of the MFCC in the temporal direction? I ask because your code inputs a different MFCC size at each iteration. How does the network manage a different audio input size at each iteration? I expected a uniform input size for the audio network.
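A hypothetical illustration of why the audio input size varies per iteration: the number of MFCC frames grows with the clip duration, and since the temporal layers operate along the time axis, no fixed audio length is required.

    import numpy as np
    from python_speech_features import mfcc

    for seconds in (1.0, 2.5):
        signal = np.random.randn(int(16000 * seconds))   # dummy 16 kHz clip
        feats = mfcc(signal, samplerate=16000, numcep=13)
        print(seconds, feats.shape)                      # frame count scales with duration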
