mravanelli / sincnet

SincNet is a neural architecture for efficiently processing raw audio samples.

License: MIT License

Python 100.00%
deep-learning audio waveform filtering cnn convolutional-neural-networks speaker-recognition speaker-verification speaker-identification speech-recognition

sincnet's Introduction

SincNet

SincNet is a neural architecture for processing raw audio samples. It is a novel Convolutional Neural Network (CNN) that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters.

In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application.
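To make this concrete, the band-pass kernel can be written as the difference of two low-pass sinc filters whose cutoffs are the only free parameters. The following is a minimal NumPy sketch of that idea (for illustration only; the actual learnable layer, sinc_conv/SincConv_fast, lives in dnn_models.py):

import numpy as np

def sinc_bandpass(f1_hz, f2_hz, kernel_len=251, fs=16000):
    # Band-pass FIR kernel fully determined by its low (f1) and high (f2) cutoffs.
    t = (np.arange(kernel_len) - (kernel_len - 1) / 2) / fs      # time axis centered at 0
    band_pass = 2 * f2_hz * np.sinc(2 * f2_hz * t) - 2 * f1_hz * np.sinc(2 * f1_hz * t)
    band_pass *= np.hamming(kernel_len)                          # smooth the kernel truncation
    return band_pass / np.abs(band_pass).max()

kernel = sinc_bandpass(f1_hz=300.0, f2_hz=3400.0)                # e.g., a telephone-band filter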

This repository releases a collection of code and utilities to perform speaker identification with SincNet. An example of speaker identification with the TIMIT database is provided. If you are interested in SincNet applied to speech recognition, you can take a look at the PyTorch-Kaldi GitHub repository (https://github.com/mravanelli/pytorch-kaldi).

Take a look at our video introduction to SincNet.

Cite us

If you use this code or part of it, please cite us!

Mirco Ravanelli, Yoshua Bengio, “Speaker Recognition from Raw Waveform with SincNet”, arXiv.

Prerequisites

  • Linux
  • Python 3.6/2.7
  • pytorch 1.0
  • pysoundfile (conda install -c conda-forge pysoundfile)
  • We also suggest using an Anaconda environment.

SpeechBrain

SincNet is also implemented in the SpeechBrain project (https://speechbrain.github.io/), and we encourage you to take a look at it! SpeechBrain is an all-in-one PyTorch-based speech processing toolkit that currently supports speech recognition, speaker recognition, SLU, speech enhancement, speech separation, and multi-microphone signal processing. It is designed to be flexible, easy to use, modular, and well documented. Check it out.

Updates

Feb 16, 2019:

  • We replaced the old "sinc_conv" with "SincConv_fast". The latter is 50% faster.
  • In the near future, we plan to support SincNet-based speaker-id within the PyTorch-Kaldi project (the current version of the project only supports SincNet for speech recognition experiments). This will allow users to perform speaker recognition experiments in a faster and much more flexible environment. This repository will nevertheless remain available as a showcase.

How to run a TIMIT experiment

Even though the code can be easily adapted to any speech dataset, in the following part of the documentation we provide an example based on the popular TIMIT dataset.

1. Run TIMIT data preparation.

This step is necessary to store a version of TIMIT in which start and end silences are removed and the amplitude of each speech utterance is normalized (a minimal sketch of this normalization is shown below). To do so, run the following command:

python TIMIT_preparation.py $TIMIT_FOLDER $OUTPUT_FOLDER data_lists/TIMIT_all.scp

where:

  • $TIMIT_FOLDER is the folder of the original TIMIT corpus
  • $OUTPUT_FOLDER is the folder in which the normalized TIMIT will be stored
  • data_lists/TIMIT_all.scp is the list of the TIMIT files used for training/testing the speaker-id system.
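Conceptually, this preparation step trims the start/end silences and peak-normalizes each utterance. A minimal sketch of the normalization part alone (the paths are hypothetical; the real logic is in TIMIT_preparation.py):

import numpy as np
import soundfile as sf

signal, fs = sf.read("TIMIT/train/dr1/fcjf0/si1027.wav")        # hypothetical input path
signal = signal / np.max(np.abs(signal))                        # peak amplitude normalization
sf.write("TIMIT_norm/train/dr1/fcjf0/si1027.wav", signal, fs)   # hypothetical output path (directory must exist)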

2. Run the speaker id experiment.

  • Modify the [data] section of the cfg/SincNet_TIMIT.cfg file according to your paths. In particular, set data_folder to the $OUTPUT_FOLDER specified during the TIMIT preparation. The other parameters of the config file belong to the following sections:
  1. [windowing], which defines how each sentence is split into smaller chunks.
  2. [cnn], which specifies the characteristics of the CNN architecture.
  3. [dnn], which specifies the characteristics of the fully-connected DNN architecture following the CNN layers.
  4. [class], which specifies the softmax classification part.
  5. [optimization], which reports the main hyperparameters used to train the architecture.
  • Once the cfg file is set up, you can run the speaker-id experiment using the following command:

python speaker_id.py --cfg=cfg/SincNet_TIMIT.cfg

The network might take several hours to converge (depending on the speed of your GPU). In our case, using an NVIDIA TITAN X, the full training took about 24 hours. If you run the code on a cluster, it is crucial to copy the normalized dataset onto the local node, since the current version of the code requires frequent access to the stored wav files. Note that several possible optimizations to improve the code speed are not implemented in this version, since they are out of the scope of this work.

3. Results.

The results are saved into the output_folder specified in the cfg file. In this folder, you can find a file (res.res) summarizing training and test error rates. The model model_raw.pkl is the SincNet model saved after the last iteration. Using the cfg file specified above, we obtain the following results:

epoch 0, loss_tr=5.542032 err_tr=0.984189 loss_te=4.996982 err_te=0.969038 err_te_snt=0.919913
epoch 8, loss_tr=1.693487 err_tr=0.434424 loss_te=2.735717 err_te=0.612260 err_te_snt=0.069264
epoch 16, loss_tr=0.861834 err_tr=0.229424 loss_te=2.465258 err_te=0.520276 err_te_snt=0.038240
epoch 24, loss_tr=0.528619 err_tr=0.144375 loss_te=2.948707 err_te=0.534053 err_te_snt=0.062049
epoch 32, loss_tr=0.362914 err_tr=0.100518 loss_te=2.530276 err_te=0.469060 err_te_snt=0.015152
epoch 40, loss_tr=0.267921 err_tr=0.076445 loss_te=2.761606 err_te=0.464799 err_te_snt=0.023088
epoch 48, loss_tr=0.215479 err_tr=0.061406 loss_te=2.737486 err_te=0.453493 err_te_snt=0.010823
epoch 56, loss_tr=0.173690 err_tr=0.050732 loss_te=2.812427 err_te=0.443322 err_te_snt=0.011544
epoch 64, loss_tr=0.145256 err_tr=0.043594 loss_te=2.917569 err_te=0.438507 err_te_snt=0.009380
epoch 72, loss_tr=0.128894 err_tr=0.038486 loss_te=3.009008 err_te=0.438005 err_te_snt=0.019481
....
epoch 320, loss_tr=0.033052 err_tr=0.009639 loss_te=4.076542 err_te=0.416710 err_te_snt=0.006494
epoch 328, loss_tr=0.033344 err_tr=0.010117 loss_te=3.928874 err_te=0.415024 err_te_snt=0.007215
epoch 336, loss_tr=0.033228 err_tr=0.010166 loss_te=4.030224 err_te=0.410034 err_te_snt=0.005051
epoch 344, loss_tr=0.033313 err_tr=0.010166 loss_te=4.402949 err_te=0.428691 err_te_snt=0.009380
epoch 352, loss_tr=0.031828 err_tr=0.009238 loss_te=4.080747 err_te=0.414066 err_te_snt=0.006494
epoch 360, loss_tr=0.033095 err_tr=0.009600 loss_te=4.254683 err_te=0.419954 err_te_snt=0.005772

Convergence is initially very fast (see the first 30 epochs). After that, the performance improvement slows down and oscillations in the sentence error rate appear. Despite these oscillations, an average improvement trend can still be observed over the subsequent epochs. In this experiment, we stopped training at epoch 360. The fields of the res.res file have the following meaning:

  • loss_tr: the average training loss (i.e., cross-entropy) computed at frame level.
  • err_tr: the classification error (measured at frame level) on the training data. Note that we split the speech signals into chunks of 200 ms with 10 ms overlap. The error is averaged over all the chunks of the training dataset.
  • loss_te: the average test loss (i.e., cross-entropy) computed at frame level.
  • err_te: the classification error (measured at frame level) on the test data.
  • err_te_snt: the classification error (measured at sentence level) on the test data. Note that we split the speech signal into chunks of 200 ms with 10 ms overlap. For each chunk, SincNet performs a prediction over the set of speakers. To compute this error rate, we average the chunk predictions and, for each sentence, vote for the speaker with the highest average probability (a minimal sketch is given below).
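A minimal sketch of this sentence-level voting (illustrative only; the actual logic lives in speaker_id.py):

import torch

def sentence_speaker(chunk_posteriors):
    # chunk_posteriors: [n_chunks, n_speakers] softmax outputs for one sentence.
    avg_post = chunk_posteriors.mean(dim=0)   # average the chunk-level predictions
    return int(torch.argmax(avg_post))        # speaker with the highest average probability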

You can find our trained model for TIMIT here.

Where is SincNet implemented?

To take a look at the SincNet implementation, open the file dnn_models.py and read the classes SincNet and sinc_conv and the function sinc.

How to use SincNet with a different dataset?

In this repository, we used the TIMIT dataset as a tutorial to show how SincNet works. With the current version of the code, you can easily use a different corpus. To do so, you should provide as input the corpus-specific wav files and your own labels. You should thus modify the paths in the *.scp files you find in the data_lists folder.

To assign the right label to each sentence, you also have to modify the dictionary "TIMIT_labels.npy". The labels are specified within a Python dictionary that contains sentence ids as keys (e.g., "si1027") and speaker ids as values. Each speaker id is an integer ranging from 0 to N_spks-1. In the TIMIT dataset, you can easily retrieve the speaker id from the path (e.g., train/dr1/fcjf0/si1027.wav is the sentence "si1027" uttered by the speaker "fcjf0"). For other datasets, you should build this dictionary of sentence-id/speaker-id pairs in a similar way, as sketched below.
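For illustration, a label dictionary for a new corpus could be built and saved like this (a sketch; the paths and speaker ids here are hypothetical, and the speaker parsing is corpus-specific):

import numpy as np

lab_dict = {
    "train/dr1/fcjf0/si1027.wav": 0,   # all sentences of speaker "fcjf0" map to id 0
    "train/dr1/fcjf0/si1657.wav": 0,
    "train/dr2/mabc0/sx108.wav": 1,    # hypothetical second speaker
}
np.save("data_lists/MY_labels.npy", lab_dict)

# Loading it back (as done for TIMIT_labels.npy); allow_pickle is needed by recent NumPy versions.
lab_dict = np.load("data_lists/MY_labels.npy", allow_pickle=True).item()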

You should then modify the config file (cfg/SincNet_TIMIT.cfg) according to your new paths. Remember also to change the field "class_lay=462" according to the number of speakers N_spks you have in your dataset.

The version of the LibriSpeech dataset used in the paper is available upon request. In our work, we used only 12-15 seconds of training material for each speaker, and we processed the original LibriSpeech sentences in order to perform amplitude normalization. Moreover, we used a simple energy-based VAD to remove silences at the beginning and end of each sentence, as well as to split into multiple chunks the sentences that contain longer silences (a rough sketch is given below).
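A rough sketch of such an energy-based VAD (illustrative only, not the exact preprocessing used in the paper):

import numpy as np

def energy_vad(signal, fs=16000, frame_ms=20, threshold_db=-35.0):
    # Mark a frame as speech if its energy is within threshold_db of the loudest frame.
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    return energy_db > (energy_db.max() + threshold_db)   # boolean speech mask per frame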

References

[1] Mirco Ravanelli, Yoshua Bengio, “Speaker Recognition from Raw Waveform with SincNet”, arXiv.

sincnet's People

Contributors

hbredin, mravanelli, rickychanhoyin, seungwonpark, vroger11


sincnet's Issues

speaker verification with GE2E loss

I used your SincNet architecture with the GE2E loss, but the loss value doesn't drop. I tried replacing the activation unit in the DNN with a logistic function, but it doesn't work either.
Can you give me some advice on why this happens? Or is it simply that this loss function doesn't work well with SincNet?

computed d-vectors aren't consistent across runs

Thanks for this great toolkit!

I've trained the speaker-id model and I'm now trying to extract the d-vectors for various wav files.

I notice that if I print the final utterance d-vector, I get different vector values each time:

For example, on the first run, I get:

[0.00486447 0.01101663 0.00926225 ... 0.0081592  0.00329391 0.01262286]

And the 2nd time I start the script, I get

[-6.3974774e-05 -1.0634900e-02 -1.0657464e-04 ...  1.9263949e-02
 -2.3915395e-03  4.1378587e-02]

And again:

[0.00949155 0.01689023 0.00099393 ... 0.01041446 0.01067957 0.01114707]

If I put the same audio file twice in the wav list, I get consistent values within a run, but always different values across runs.

Any clues? I get this either running on the GPU or the CPU.

Question about paper <LEARNING SPEAKER REPRESENTATIONS WITH MUTUAL INFORMATION>

Hello, I've been reading your paper and I'm a little curious about the calculation of the mutual information.
When converting the MI to the KL-based formulation, it involves a joint distribution and the product of two marginals.

But how should I understand that, when we train the network, we sample (z1, z2) as the joint distribution and (z1, zrand) as the other? What are the true joint distribution and the two marginal distributions? Thanks.

Configuration

I configured the following which I believe will allow me to load the software I need and run SincNet. The manufacturer said it would run but questioned why I bought so much memory and only one monitor. If you see holes that need plugging, I would appreciate any comments.
Thank you, Gerard

Workstation 7920
Intel Xeon Gold 6140 2.3GHz, 3.7GHz Turbo 18C, 10.4GT/s 3UPI, 25MB Cache, HT (140W) DDR4-2666
Ubuntu Linux 18.04
NVIDIA Quadro GV100, 32GB, 4 DP (Precision xx20 Towers)
384GB 12x32GB DDR4 2666MHz RDIMM ECC
M.2 256GB PCIe NVMe Class 40 Solid State Boot Drive
M.2 2TB PCIe NVMe Class 40 Solid State Drive
Keyboard, Mouse, Monitor
Ethernet LAN

How can I use SincNet for the speaker verification task?

Thank you for your contribution.
In the paper you mention the speaker verification performance, but in the repository I did not find any code related to speaker verification. Can you please explain how I can implement verification?

LayerNorm == torch.nn.InstanceNorm1d ?

I believe what you call LayerNorm is actually InstanceNorm1d in pytorch.

SincNet/dnn_models.py

Lines 112 to 123 in 488c982

class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

It is my understanding that Layer Normalization would actually have one weight/bias per sample in the sequence, while Instance Normalization only has one per channel. Do you confirm?
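A quick numerical check of this claim (a sketch; note the custom layer uses an unbiased std and adds eps outside the square root, so the two only match approximately):

import torch
import torch.nn as nn

x = torch.randn(4, 80, 1000)                       # [batch, channels, time]
inst_norm = nn.InstanceNorm1d(80, affine=False)

mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)                      # unbiased std, as in the custom LayerNorm
custom = (x - mean) / (std + 1e-6)                 # gamma=1, beta=0

print(torch.allclose(custom, inst_norm(x), atol=1e-2))   # approximately equal (expected True)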


Installation Update

As promised -- an update

Moved from mravanelli/pytorch-kaldi#88 to here at SincNet.
On pytorch-kaldi #88 I stated:
I am going to attempt to run the SincNet speaker-id experiment.
If it fails to run, I will look into more hardware. So ...

Attempt to install on Raspberry Pi 3B+

Requirements:
Linux
Python 3.6/2.7
pytorch 1.0
pysoundfile
anaconda

Bottom line toward Requirements:
(I am skipping the installation output except for the last line)

Linux: Raspbian GNU/Linux 9 (stretch) *** YES ***

Python 3.6/2.7 *** NO *** I have Python 3.5.3 -- Anaconda installed 3.4

pi@raspberrypi:~ $ conda install anaconda-client
Anaconda *** Maybe *** The following NEW packages will be INSTALLED:
anaconda-client: 1.0.2-py34_0
clyent: 0.4.0-py34_0
freetype: 2.5.2-2
jpeg: 8d-0
libpng: 1.6.17-0
libtiff: 4.0.2-1
pillow: 2.9.0-py34_0
pip: 7.1.2-py34_0
python-dateutil: 2.4.2-py34_0
pytz: 2015.4-py34_0
setuptools: 18.1-py34_0
six: 1.9.0-py34_0
wheel: 0.24.0-py34_0

pi@raspberrypi:~ $ sudo apt-get install pytorch
pytorch E: Unable to locate package pytorch
pi@raspberrypi:~ $ pip3 install pytorch
Exception: You tried to install "pytorch". The package named for PyTorch is "torch"
pi@raspberrypi:~ $ sudo apt-get install torch
RuntimeError: PyTorch does not currently provide packages for PyPI (see status at pytorch/pytorch#566).
Please follow the instructions at http://pytorch.org/ to install with miniconda instead.
pi@raspberrypi:~ $ conda install pytorch=0.4.1 -c pytorch
pytorch *** NO *** Error: No packages found in current linux-armv7l channels matching: pytorch 0.4.1*

pysoundfile ( conda install -c conda-forge pysoundfile)
pi@raspberrypi:~ $ conda install -c conda-forge pysoundfile
Fetching package metadata: ......
Solving package specifications:
Error: Could not find some dependencies for pysoundfile: cffi
pi@raspberrypi:~ $ conda install --channel https://conda.anaconda.org/poppy-project cffi
The following packages conflict with each other:
cffi
python 3.4*
pysoundfile *** NO ***

Toward Requirements: 1.5 out of 5

************************* PLAN B *************************

Order a new system

https://developer.nvidia.com/embedded/buy/jetson-nano-devkit $99
NVIDIA® Jetson Nano™ Developer Kit is a small, powerful computer that lets you run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing.
JetPack is compatible with NVIDIA’s world-leading AI platform for training and deploying AI software, and reduces complexity and effort for developers by supporting many popular AI frameworks, like TensorFlow, PyTorch, Caffe, and MXNet. It also includes a full desktop Linux environment and out-of-the-box support for a variety of popular peripherals, add-ons, and ready-to-use projects.

Technical Specifications
GPU 128-core Maxwell
CPU Quad-core ARM A57 @ 1.43 GHz
Memory 4 GB 64-bit LPDDR4 25.6 GB/s
Storage microSD (not included)
Video Encode 4K @ 30 | 4x 1080p @ 30 | 9x 720p @ 30 (H.264/H.265)
Video Decode 4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 18x 720p @ 30 (H.264/H.265)
Camera 1x MIPI CSI-2 DPHY lanes
Connectivity Gigabit Ethernet, M.2 Key E
Display HDMI 2.0 and eDP 1.4
USB 4x USB 3.0, USB 2.0 Micro-B
Others GPIO, I2C, I2S, SPI, UART
Mechanical 100 mm x 80 mm x 29 mm

https://www.adafruit.com/product/1995 $8
5V 2.4A Switching Power Supply

https://www.sandisk.com/home/memory-cards
SanDisk microSDHC™ 64GB SD Card $13

Existing USB Keyboard, USB Mouse, HDMI screen

Toward Requirements:
Linux: Ubuntu 18.04 LTS
Python 3.6
PyTorch 1.1

Watched a tutorial on configuring, booting up, and running some examples.
And another on Introduction to Deep Learning

Unknown out of 5 (I will update you in a week / 10 days)

SincNet Weights

Hi, can you please share the pre-trained model or its weights? :)

training on new dataset

While running speaker_id on my dataset, I am getting the following error:

Traceback (most recent call last):
  File "speaker_id.py", line 227, in <module>
    [inp,lab]=create_batches_rnd(batch_size,data_folder,wav_lst_tr,snt_tr,wlen,lab_dict,0.2)
  File "speaker_id.py", line 52, in create_batches_rnd
    sig_batch[i,:]=signal[snt_beg:snt_end]*rand_amp_arr[i]
ValueError: could not broadcast input array from shape (3200,2) into shape (3200)

trained model error

Hi, the trained model doesn't seem to work when running compute_d_vector:
Missing key(s) in state_dict: "conv.0.low_hz_", "conv.0.band_hz_"
the keys in checkpoint_load['CNN_model_par']:

conv.0.filt_b1
conv.0.filt_band
conv.1.weight
conv.1.bias
conv.2.weight
conv.2.bias
bn.0.weight
bn.0.bias
bn.0.running_mean
bn.0.running_var
bn.1.weight
bn.1.bias
bn.1.running_mean
bn.1.running_var
bn.2.weight
bn.2.bias
bn.2.running_mean
bn.2.running_var
ln.0.gamma
ln.0.beta
ln.1.gamma
ln.1.beta
ln.2.gamma
ln.2.beta
ln0.gamma
ln0.beta

Thanks a lot.

How to do the GPU parallel with this code?

Hello,

I can succeed in running this code with only one GPU, but I failed when I tried to use DataParallel to call several GPUs. How can I use several GPUs to speed up the training process?

The main code is listed below:

CNN_net=CNN(CNN_arch)
CNN_net.cuda()
CNN_net=nn.DataParallel(CNN_net)

DNN1_net=MLP(DNN1_arch)
DNN1_net.cuda()
DNN1_net=nn.DataParallel(DNN1_net)

DNN2_net=MLP(DNN2_arch)
DNN2_net.cuda()
DNN2_net=nn.DataParallel(DNN2_net)

The error info looks like this:

Traceback (most recent call last):
File "speaker_id.py", line 180, in
DNN1_arch = {'input_dim': CNN_net.out_dim,
File "/Work19/2017/xxx/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 535, in getattr
type(self).name, name))
AttributeError: 'DataParallel' object has no attribute 'out_dim'

Visualizing sinc filter kernels

Hi Mirco,

thanks for sharing this great work! I'm trying to visualize the weights of the sinc layer in time and frequency domain but I'm having trouble getting it right. Some filters don't look like a single bandpass but have different amplitudes for different frequencies (e.g., see the right example in the eleventh row in the figure below).

The code below just loads a trained model from the saved checkpoint, computes the filters in time domain, and visualizes them alongside their Fourier transform. I'm certain it's just a problem with visualizing them. If you could have a look or share the example from Figure 2 in the paper, that would be highly appreciated.

Thanks,
Benedikt

%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import math
import torch

sampling_frequency = 16000
num_filters = 80
filter_length = 251
frequency_scaling = sampling_frequency * 1.0

# Load trained model from checkpoint
torch_checkpoint = torch.load("exp/SincNet_TIMIT/model_raw.pkl", map_location='cpu')
low_hz = torch_checkpoint["CNN_model_par"]["conv.0.filt_b1"]
band_hz = torch_checkpoint["CNN_model_par"]["conv.0.filt_band"]

# Compute filter kernel. Except from renaming some variables, this should be the same as the original code.
def flip(x, dim):
    xsize = x.size()
    dim = x.dim() + dim if dim < 0 else dim
    x = x.contiguous()
    x = x.view(-1, *xsize[dim:])
    x = x.view(x.size(0), x.size(1), -1)[:,
        getattr(torch.arange(x.size(1) - 1, -1, -1), ('cpu', 'cuda')[x.is_cuda])().long(), :]
    return x.view(xsize)

def sinc(band, t_right):
    y_right = torch.sin(2 * math.pi * band * t_right) / (2 * math.pi * band * t_right)
    y_left = flip(y_right, 0)

    y = torch.cat([y_left, torch.autograd.Variable(torch.ones(1)), y_right])

    return y

filters = torch.autograd.Variable(torch.zeros((num_filters, filter_length)))
N = filter_length
t_right = torch.autograd.Variable(torch.linspace(1, (N - 1) / 2, steps=int((N - 1) / 2)) / sampling_frequency)

min_freq = 50.0
min_band = 50.0

filt_beg_freq = torch.abs(low_hz) + min_freq / frequency_scaling
filt_end_freq = filt_beg_freq + (torch.abs(band_hz) + min_band / frequency_scaling)

n = torch.linspace(0, N, steps=N)

# Filter window (hamming)
window = 0.54 - 0.46 * torch.cos(2 * math.pi * n / N)
window = torch.autograd.Variable(window.float())

for i in range(num_filters):
    low_pass1 = 2 * filt_beg_freq[i].float() * sinc(filt_beg_freq[i].float() * frequency_scaling, t_right)
    low_pass2 = 2 * filt_end_freq[i].float() * sinc(filt_end_freq[i].float() * frequency_scaling, t_right)
    band_pass = (low_pass2 - low_pass1)

    band_pass = band_pass / torch.max(band_pass)

    filters[i, :] = band_pass * window

filters = filters.view(num_filters, 1, filter_length)
filters = filters.detach().numpy()

# Visualize filter kernels (similar to https://gist.github.com/endolith/236567)
# Two filters and their Fourier transform per row.
num_cols = 4
num_rows = int(np.ceil(num_filters * 2 / num_cols))

fig, axes = plt.subplots(num_rows, num_cols, figsize=(9, 80))
for i in range(num_filters):
    spatial_ax = axes[(i * 2) // 4, (i * 2) % 4]
    frequency_ax = axes[(i * 2 + 1) // 4, (i * 2 + 1) % 4]
    
    weights = filters[i, 0, :]
    
    # Frequency computation
    ampl = 1/N * np.abs(np.fft.rfft(weights))
    
    # RFFT frequency bins
    freqs = np.fft.rfftfreq(N, 1/sampling_frequency)
    
    spatial_ax.plot(weights)
    frequency_ax.stem(freqs, ampl)
    
fig.tight_layout()

[Figure: visualize_weights (learned filter kernels and their Fourier transforms)]

Perform speaker identification

Hi, how would I perform inference for speaker identification using your implementation? For example, how would I get the predicted speaker for the TIMIT wavefile dr3/fjlr0/sa1.wav from your SincNet implementation?

Using my own database

Hi,
I wanted to run speaker_id with our own database. We will have train and test wav files. How do we go about doing this?

Thanks,
Sivam.

error in create_batches

Hi,
First of all, thanks for sharing your work :)

I use Google Colab to execute this project.

When I run !python3 "/content/keras-sincnet/speaker_id.py" --cfg=/content/keras-sincnet/cfg/SincNet_TIMIT.cfg

I get the following error:

File "/content/keras-sincnet/speaker_id.py", line 227, in
[inp,lab]=create_batches_rnd(batch_size,data_folder,wav_lst_tr,snt_tr,wlen,lab_dict,0.2)
File "/content/keras-sincnet/speaker_id.py", line 53, in create_batches_rnd
lab_batch[i]=lab_dict[wav_lst[snt_id_arr[i]]]
KeyError: 'TRAIN/DR1/FSJK1/SX305.WAV'

Can you give some guidelines on how to resolve this error?

Training on GPU got error on CPU decoding

I got this error while doing d-vector extraction. Is the d-vector the output of the 2nd CNN layer? The error seems to be due to the availability of the Torch GPU version and not the CPU one. How did you handle this in your case? Thanks! Below is the error message:

import torch
File "anaconda3/lib/python3.6/site-packages/torch/init.py", line 80, in
from torch._C import *
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

Question about gap between err_tr and err_te

I really appreciate you released the paper together with the source code. Also, I have a question on the performance gap between err_tr and err_te:

According to the result shown in res.res, after epoch 360, err_tr=0.009600 and err_te=0.419954. The gap between training and validation performance seems quite large. Does it mean the model suffers from some kind of overfitting problem?

Is the d-vector extracted strictly according to the original d-vector paper?

Table 2 of the paper compares the performance on the speaker verification task between SincNet with d-vectors and several other models. How is the d-vector extracted? Which of the following is correct?

  1. The naive way. Namely a random chunk (200 ms, not necessarily comes from the same utterance) of the speaker's audio is pre-processed (\x -> x / abs(max(y)), where x is the current chunk and y is whole signal), and then fed to the network, the output of the last hidden layer (i.e. the output of DNN1_net) is obtained, do the same for other audio chunks (thus many audio chunks are consumed), one would get many vectors, denoted d_1, d_2, ..., d_n, each d_i is L2 normalized to get d'_i and the final d-vector of this speaker is mean(d'_1, d'_2, ..., d'_n).

  2. The "original" way. A speaker is represented by a sequence of utterances, {O_i: i}, and each utterance consists of a sequence of frames, {o_j: j}. During enrollment/verification, each o_j is accompanied by its context (the original paper uses 30 frames to the left and 10 frames to the right) and fed to the network; the output of the last hidden layer is extracted, denoted a_j, and a_j is then L2 normalized. The d-vector of utterance O_i is sum_{j} {a_j}; the d-vector of the speaker is mean({d-vector of O_i: i}).

The d-vector paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf

PS: In both cases, I assume that the network used for extracting the d-vector is the same one produced by speaker_id.py.
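For concreteness, option 1 in code form would be roughly the following (a sketch with hypothetical tensors; dnn1_outputs stands for the last-hidden-layer outputs of the chunks of one speaker):

import torch
import torch.nn.functional as F

dnn1_outputs = torch.randn(100, 2048)                 # hypothetical [n_chunks, hidden_dim]
d_norm = F.normalize(dnn1_outputs, p=2, dim=1)        # L2-normalize each chunk embedding d_i
d_vector = d_norm.mean(dim=0)                         # speaker d-vector = mean of the d'_i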

voxCeleb1 and libri speech

Hi Mirco,
Thanks for the great work! I was wondering if you plan to share the data preparation recipe for voxCeleb1 and librispeech that can allow us to reproduce other experiments from your paper.

Re-training the trained SincNet on new dataset

Hi Mirco,
I have a task where I need to take a pre-trained SincNet and re-train it on our data. Both datasets are prepared according to your protocols, and the SincNet trained on the first dataset is available as well. Now, I want to remove the output layer of the trained network and use a new output layer corresponding to the new speakers from Dataset 2. What scripts do I need to modify to do this? Do you think such a strategy is better than combining both datasets and re-training on the composite dataset? In the real world, we get more data every few months, and re-training could be time consuming. So, I want to test this strategy of initializing from a previously trained network, hoping that it could converge faster. Thanks!

TIMIT_preparation.py

File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 51, in
copy_folder(in_folder,out_folder)
File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 36, in copy_folder
shutil.copytree(in_folder, out_folder, ignore=ig_f)
File "/home/administrator/Music/dell/lib/python3.5/shutil.py", line 303, in copytree
names = os.listdir(src)
NotADirectoryError: [Errno 20] Not a directory: '/home/administrator/Videos/SincNet-master/TIMIT_preparation.py'

I need a solution for the above-mentioned error.

How do I run the trained model file

@mravanelli
I have completed the training with my own dataset. Now I wanted to use the trained model to make predictions with wav files. How do I get the prediction? Can you please help?

Thanks!
Sivam.

TIMIT_preparation.py

Traceback (most recent call last):
File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 45, in
in_folder = sys.argv[1]
IndexError: list index out of range
Please give a solution to the above-mentioned error.

Utilization rate of the gpu

My data is very big, but the utilization rate of the GPU (24 GB) is very low during training: a batch size of 128 uses 1 GB out of 24 GB. I tried increasing the batch size to improve the utilization rate, but I worry about the impact on convergence. Do you have some way to improve the GPU utilization rate? Thank you.

Run TIMIT data preparation.

Hello, when I was doing this step the following problem occurred:
$ python TIMIT_preparation.py $TIMIT_FOLDER $OUTPUT_FOLDER data_lists/TIMIT_all.scp
Traceback (most recent call last):
File "TIMIT_preparation.py", line 41, in
out_folder=sys.argv[2]
IndexError: list index out of range

I hope you can give me some answers, thank you very much.

Rationale for dividing speech into chunks of 200ms with 10ms overlap

Hi, I am trying to check the performance of SincNet on the VoxCeleb dataset. I am wondering about the rationale for extracting 200 ms chunks of the signal during training, and also the 10 ms overlap that you use at test time. Does the model depend on this?

Can I use longer chunks like 3s of audio like the VoxCeleb paper seems to be using? Seeing that VoxCeleb is a much larger dataset?

how about generalization of sincNet ?

First, thanks for your contribution.
In your experiment, the Classification Error Rate (CER%) is used for the speaker-id task and the Equal Error Rate (EER%) for speaker verification. However, deep feature representations and similarity scores are now commonly used for speaker recognition, so perhaps the generalization of SincNet should also be examined when training on dataset A and testing on dataset B.

Arxiv Paper Link

Dear Mirco,

I couldn't find the link to arxiv paper. When will it be available?

Thanks.

Question about TIMIT_labels.npy and others

Hello,
First of all, thanks for sharing your work :)
I am new in speech recognition science field.

Wanted to ask couple of questions:

  1. TIMIT_labels.npy. Is there a way to track how ids (particular numbers) are assigned in the dictionary? I loaded the file from the Python command prompt. For example: 'train/dr5/fjxm0/sx311.wav': 267. How is 267 assigned (maybe there is a reference somewhere in the TIMIT dataset)? As I understand, the id in this case is fjxm0. So, is 267 important? Or, for example, can all occurrences of 267 in the file be changed to some new value?

  2. Can you give some guidelines about how to choose training and test sets for the model? E.g. what percentage for training, and testing?

  3. File: model_raw.pkl. Is this the file for the trained model? How can I use this file?

  4. As I understand, SincNet solves the 'speaker identification task'. What are the differences compared to a speech recognition task? How can I adapt SincNet for an individual speaker (e.g., to compare results for the same speaker speaking twice)? Maybe for an individual speaker, I need to design the test set in a way that only audio files for that speaker are included?

Looking forward to hearing from you :)

Best regards,
Andrius L.

Questions aiming experiments replication

I really appreciate you released the paper together with the source code. Also, I have few questions:

  1. I have plotted the "err_te" from "res.res" file and it seems to be different than Fig. 4 from the paper. Fig. 4 contains values below 0.4% FER, while the "res.res" file has a minimum value for "err_te" equal to 0.410. Why?
  2. If "err_te_snt" represents the classification error, that means the classification error is equal to 0.57% after 360 epochs (according to "res.res" file). In the paper it is equal to 0.85% (Table 1). Why?
  3. What is the experimental setup for speaker verification?

Thank you!

Adding new speakers/using transfer learning

Hello,

First of all, thanks for all the great work :)

I managed to reproduce the results from the paper using the TIMIT dataset and I am now thinking about the following scenario:

I have a dataset of 500 speakers, I trained the model on it, I get a good enough accuracy and the model can reliably identify one of those 500 speakers from an audio sample. Now I need to add one or more new speakers, let's say 5; the desired outcome is a model that can identify one of the now 505 speakers. This could be a case that repeats in the future, as I get more audio data.

I currently have these approaches in mind:

  1. Train the model from scratch every time I need to add new speakers. The disadvantage to this is that I don't leverage any accumulated knowledge from previous trainings.

  2. Use transfer learning somehow - load the weights from the "500 speakers" trained model and replace the softmax layer with one that has 505 classes, then train a few more epochs.

  3. Same as 2, except we also freeze all the layers except softmax.

How would you approach this? If 2 and 3 are viable options, how would you implement that? Would changing "class_lay" in the config to 505 and training with the new dataset be enough for 2? How would you approach freezing the non-softmax layers?

Thanks again,
Bogdan

Performance on larger datasets

Hi Mirco,

I'm curious if you had to make any adjustments to the structure of the model to handle the >2k speakers in LibriSpeech? I've been attempting to fit a model on >3k speakers, and the Sentence Classification Error doesn't drop below ~50%. I have ~3min of speech per speaker.

Could you provide the res.res file for LibriSpeech that you provided for TIMIT?

Improving SincNet results on TIMIT by adding reverberation

Not an issue, but just wanted to post that you can further decrease the sentence classification error by reverberating each training call and including them in the training dataset (effectively doubling the training size). The error drops to 0%. I will try with Libri as well

TIMIT (.wrd, .txt, .phn) file interpretations (numbers in front of the line)

Hello,

I want to find out more details about TIMIT database (in particular .TXT, .PHN and .WRD files):
For example (in folder train/dr1/FCJF0).

File SI1657.TXT i have the following:
0 45466 Or borrow some money from someone and go home by bus?

Question: What do the numbers '0' and '45466' refer to? Perhaps time durations in milliseconds?

File SI1657.WRD :
2120 3533 or
3533 8200 borrow
8200 12291 some
12291 15325 money
15325 18435 from
18435 25984 someone
25984 28960 and
28960 31000 go
31000 34599 home
34599 36200 by
36200 43480 bus

Question: What do the numbers (in the first two columns) refer to?

File SI1657.PHN (took a fragment) :
0 2120 h#
2120 2725 q

Question: What do the numbers (0, 2120 and 2120, 2725) refer to?

Another question: Would SincNet work if no .phn (phonetics) files are provided to the dataset?

Best regards,
Andrius L.

Taking the whole speech sequence as input without chunking

Hi Mirco, many thanks for the great work and for sharing the code! It's super useful for the work I am doing. So for the speaker identification task, each training sample has a length of 3200, since fs=16000; cw_len=200; wlen=int(fs*cw_len/1000.00)=3200. And for testing, voting over even smaller chunks is performed.

I have a speech classification task for which, I think, it's best to take the whole speech sequence as input without chunking. Currently, is it possible to use variable-length input sequences with SincNet at all? If not, I would pad each batch with zeros to the max length (in each batch). Would SincNet be affected by padded batches?

Oh, by the way, my speech sequences actually vary in length by a lot. What would be your suggestion, Mirco? Many thanks again!

AttributeError: module 'torch' has no attribute 'flip'

Hi, I am using torch 0.4.0 as mentioned in the README file and I get the following error. Is this because of a version problem, or do I need to install additional dependencies (apart from the ones mentioned in the README)?

Traceback (most recent call last):
File "speaker_id.py", line 228, in
pout=DNN2_net(DNN1_net(CNN_net(inp)))
File "/home/paperspace/anaconda3/envs/sincnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/paperspace/SincNet/dnn_models.py", line 448, in forward
x = self.drop[i](self.act[i](self.ln[i](F.max_pool1d(torch.abs(self.convi), self.cnn_max_pool_len[i]))))
File "/home/paperspace/anaconda3/envs/sincnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/paperspace/SincNet/dnn_models.py", line 144, in forward
band_pass_right= torch.flip(band_pass_left,dims=[1])
AttributeError: module 'torch' has no attribute 'flip'

How to divide the test set?

Hello!
Firstly, thanks for sharing the code of your paper, it's really a fantastic work!
But I'm quite confused about how to test my own model. When we test the model for speaker identification, we should divide the test set into two parts: some of the data is used for enrollment and the rest is used for testing.
For example, if there are ten sentences for each speaker, maybe it's not appropriate to use nine of the sentences for enrollment and one for testing, as the model may learn much from the nine sentences and easily make a correct prediction on the remaining one during the test. Thus, in this condition, the accuracy might be higher than it truly should be. But that's not what I want.
I read your code carefully, but didn't find the answer, sorry about that. :(
So could you please tell me the way to divide the test set?

Reproducing LibriSpeech results

Hi,

I've been able to reproduce to a very close degree the results of the TIMIT experiment. However, I believe to reproduce the LibriSpeech results, I'll need a bit more information if you don't mind. I've currently downloaded the clean 100 and 360 hour datasets as well as the 500 hour "other" dataset. This has about 100 fewer speakers than the number you reference in your paper. Could you provide the names of the Libri datasets from which you drew your speakers?

How did you preprocess them? I know in the paper you mention using only 12-15 s for training. Did you just take the first 15 s of each utterance (or less if the utterance was shorter)? If not, could you explain how you arrived at the training utterances as well as the preprocessing you did?

Could you also provide the training/testing data lists similar to TIMIT_train.scp and TIMIT_test.scp?

Again, this is aimed at trying to reproduce the Libri results, so as much specificity as possible would be great!

Thanks for all you've done!
