
cpc_audio's Introduction

CPC_audio

This code implements the Contrastive Predictive Coding algorithm on audio data, as described in the paper Unsupervised Pretraining Transfers Well Across Languages. This is an unsupervised method to train audio features directly from the raw waveform.

Moreover, this code also implements all the evaluation metrics used in the paper: phone and speaker linear separability, ABX scores, and cross-lingual transfer (phone error rate) on the Common Voice datasets. See the evaluation sections below.

Setup instructions

The installation is a tiny bit involved due to the torch-audio dependency.

0/ Clone the repo: git clone git@github.com:facebookresearch/CPC_audio.git && cd CPC_audio

1/ Install the libraries required by torch-audio (https://github.com/pytorch/audio):

  • MacOS: brew install sox
  • Linux: sudo apt-get install sox libsox-dev libsox-fmt-all

2/ conda env create -f environment.yml && conda activate cpc37

3/ Run setup.py: python setup.py develop

You can test your installation with: nosetests -d

CUDA driver

This setup is given for CUDA 9.2. If you use a different version of CUDA, please change the version of cudatoolkit in environment.yml accordingly. For more information on which cudatoolkit version to use, please check https://pytorch.org/

Standard datasets

We suggest training the model on either LibriSpeech or Libri-light.

How to run a session

To run a new training session, use the command below (a filled-in example follows the argument list):

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION

Where:

  • $PATH_AUDIO_FILES is the directory containing the audio files. The files should be arranged as below:
PATH_AUDIO_FILES  
│
└───speaker1
│   └───...
│         │   seq_11.{$EXTENSION}
│         │   seq_12.{$EXTENSION}
│         │   ...
│   
└───speaker2
    └───...
          │   seq_21.{$EXTENSION}
          │   seq_22.{$EXTENSION}

Please note that each speaker directory can contain an arbitrary number of subdirectories: the speaker label will always be retrieved from the top one. The names of the files aren't relevant. For a concrete example, you can look at the organization of the LibriSpeech dataset.

  • $PATH_CHECKPOINT_DIR is the directory where the checkpoints will be saved
  • $TRAINING_SET is a path to a .txt file containing the list of the training sequences (see here for an example)
  • $VAL_SET is a path to a .txt file containing the list of the validation sequences
  • $EXTENSION is the extension of each audio file
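
For example, assuming LibriSpeech train-clean-100 has been extracted under ~/data and your split files live in ~/data/splits (all paths here are illustrative, not prescribed by the code), a training run could look like:

python cpc/train.py --pathDB ~/data/LibriSpeech/train-clean-100 --pathCheckpoint ~/checkpoints/cpc_ls100 --pathTrain ~/data/splits/train_split.txt --pathVal ~/data/splits/val_split.txt --file_extension .flac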

Custom architectures

The code allows you to train a wide range of architectures. For example, to train the CPC method as described in van den Oord et al.'s paper, just run:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --normMode batchNorm --rnnMode linear

Or, if you want to train a model with an FFD prediction network instead of a transformer:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --rnnMode ffd --schedulerRamp 10

The --schedulerRamp option adds a learning rate ramp at the beginning of the training: it barely affects the performance of a model with a transformer predictor but is necessary with other models.

Launch cpc/train.py -h to see all the possible options.

How to restart a session

To restart a session from the last saved checkpoint just run

python cpc/train.py --pathCheckpoint $PATH_CHECKPOINT_DIR

How to run an evaluation session

All evaluation scripts can be found in cpc/eval/.

Linear separability:

After training, the CPC model can output high-level features for a variety of tasks. For an input audio file sampled at 16kHz, the provided baseline model will output 256-dimensional features every 10ms. We provide two linear separability tests, one for speakers and one for phonemes, in which a linear classifier is trained on top of the CPC features with aligned labels and evaluated on a held-out test set.

Train / Val splits as well as phone alignments for librispeech-100h can be found here.

Speaker separability:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT

Phone separability:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --pathPhone $PATH_TO_PHONE_LABELS

You can also concatenate the output features of several models by providing several checkpoints to the --load option. For example, the following command line:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET model1.pt model2.pt --pathCheckpoint $PATH_CHECKPOINT

will evaluate the speaker separability of the concatenation of the features from model1 and model2.

ABX score:

You can compute the ABX score on the ZeroSpeech 2017 dataset. To begin, download the dataset here. Then run the ABX evaluation on a given checkpoint with the command below (a filled-in example follows the option list):

python ABX.py from_checkpoint $PATH_CHECKPOINT $PATH_ITEM_FILE $DATASET_PATH --seq_norm --strict --file_extension .wav --out $PATH_OUT

Where:

  • $PATH_CHECKPOINT is the path pointing to the checkpoint to evaluate
  • $PATH_ITEM_FILE is the path to the .item file containing the triplet annotations
  • $DATASET_PATH path to the directory containing the audio files
  • $PATH_OUT path to the directory into which the results should be dumped
  • --seq_norm normalizes each batch of features across the time channel before computing ABX
  • --strict forces each batch of features to contain exactly the same number of frames
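
With the placeholders filled in, a run could look like the following; the checkpoint name, item file, and dataset directory are purely illustrative placeholders, so substitute the paths from your own ZeroSpeech 2017 download:

python ABX.py from_checkpoint checkpoints/checkpoint_60.pt $ZEROSPEECH_PATH/english_1s.item $ZEROSPEECH_PATH/english_wav --seq_norm --strict --file_extension .wav --out abx_results/english_1s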

Cross lingual transfer

To begin, download the Common Voice datasets here. You will also need to download our phoneme annotations and our train / val / test splits for each language here. Then unzip your data at $PATH_COMMON_VOICES. Unfortunately, the audio files in Common Voice don't have the same sampling rate as in LibriSpeech, so you'll need to convert them into 16kHz audio using the command:

DIR_CC=$PATH_COMMON_VOICES
for x in fr zh it ru nl sv es tr tt ky; do python cpc/eval/utils/adjust_sample_rate.py ${DIR_CC}/${x}/clips ${DIR_CC}/${x}/validated_phones_reduced.txt ${DIR_CC}/${x}/clips_16k; done
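
For a single language, the loop body expands to a single call; for example, for Spanish (es):

python cpc/eval/utils/adjust_sample_rate.py $PATH_COMMON_VOICES/es/clips $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $PATH_COMMON_VOICES/es/clips_16k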

You can now run the experiments described in the paper. To begin, you must train the linear classifier. You will find below the instructions for the Spanish dataset: you can run the experiments on any other dataset in the same fashion.

Frozen features

To run the training on frozen features with the one-hour dataset, just run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt  --pathVal $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt --freeze -o $OUTPUT_DIR

Fine tuning

The command to run the fine-tuning experiments on the 5-hour dataset is quite similar. For example, for Spanish you need to run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt --pathVal $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt -o $OUTPUT_DIR

PER

Once the training is done, you can compute the associated phone error rate (PER) on the test subset. To do so, just run:

python cpc/eval/common_voices_eval.py per $OUTPUT_DIR --pathVal $PATH_COMMON_VOICES/es/testSeqs_uniform_new_version.txt --pathPhone $PATH_COMMON_VOICES/es/validated_phones_reduced.txt

torch hub

This model is also available via torch.hub. For more details, have a look at hubconf.py.
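
A minimal loading sketch is shown below. Note that the entry-point name ('CPC_audio') and the 'pretrained' keyword are assumptions, so check hubconf.py (or the output of torch.hub.list) for the exact names and arguments before relying on them.

import torch

# Print the entry points actually exposed by hubconf.py
print(torch.hub.list('facebookresearch/CPC_audio'))

# Assumed entry-point name and keyword; verify against hubconf.py
model = torch.hub.load('facebookresearch/CPC_audio', 'CPC_audio', pretrained=True)
model.eval()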

Citations

Please consider citing this project in your publications if it helps your research.

@misc{rivire2020unsupervised,
    title={Unsupervised pretraining transfers well across languages},
    author={Morgane Rivière and Armand Joulin and Pierre-Emmanuel Mazaré and Emmanuel Dupoux},
    year={2020},
    eprint={2002.02848},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

License

CPC_audio is MIT licensed, as found in the LICENSE file.


cpc_audio's Issues

Which Common Voice version?

Thanks for open-sourcing the code to reproduce the results of the paper.

Which Common Voice version was used to produce the evaluation/test results? Was it Common Voice 1, 2, 3, or 4?

Phones.txt

Hi,

May I ask how to obtain validated_phones_reduced.txt, trainSeqs_5.0_uniform_new_version.txt and trainSeqs_5.0_uniform_new_version.txt? For Chinese, besides Taiwan (zh-TW), are there any relevant documents for mainland China (zh-CN)?

Looking forward to your reply, thank you.

Could you provide any instructions to preprocess the dataset?

Hello,

I am new to the CPC method and want to learn something from your marvelous code. However, I am still confused about how to preprocess the dataset. I downloaded the librispeech-train-clean-100 subset from the website but I did not know how to arrange it as shown below. It seems that this dataset only has training samples without labels. I am also not sure how to use the training/validation sequence lists and the Train / Val splits. Are there any detailed instructions?
PATH_AUDIO_FILES
│
└───speaker1
│   └───...
│         │   seq_11.{$EXTENSION}
│         │   seq_12.{$EXTENSION}
│         │   ...
│
└───speaker2
    └───...
          │   seq_21.{$EXTENSION}
          │   seq_22.{$EXTENSION}

librispeech-360 train/val split

Dear CPC audio authors,

Greetings! Thank you so much for making this codebase public! It significantly contributes to the community, especially for those interested in CPC!

May I gently ask whether you plan to release the train/val split .txt files for librispeech-360?

Thank you so much!

Sincerely,
Martin

After running 'nosetests -d', got 'TypeError: 'AudioMetaData' object is not subscriptable' in four tests

Hi,

After installing the model in Colab, I tried to run the installation test as instructed, and got the error message shown in the attachment.

Four tests reported a similar error:

ERROR: testDataLoader (cpc.unit_tests.TestDataLoader)
ERROR: testLoadData (cpc.unit_tests.TestDataLoader)
ERROR: testPartialLoader (cpc.unit_tests.TestDataLoader)
ERROR: testSeqLabels (cpc.unit_tests.TestPhonemParser)

In all four, the error was TypeError: 'AudioMetaData' object is not subscriptable.

Does anyone happen to know the reasons for this?

Thank you very much.

[attached screenshot: error message]

Negative Sampling

In the cpc/criterion/criterion.py file, between lines 181 and 201 (quoted below): the negative sampling part.
The paper says that negatives are sampled within speaker.

For extIdx, the total length is batchSize * self.negativeSamplingExt * windowSize. If it samples within each speaker, then the first self.negativeSamplingExt * windowSize elements of extIdx should always point to speaker 1 within this batch. If so, following the default settings, batchSize=8, negativeSamplingExt=128, windowSize=116, the values of the first self.negativeSamplingExt * windowSize (128 * 116) elements of extIdx should be in [0, 128), because the first 128 (116 training + 12 testing) samples in negExt are data of speaker 1. Similarly, the values of the second self.negativeSamplingExt * windowSize elements of extIdx should be in [128, 256).

However, when I check the values of the first self.negativeSamplingExt * windowSize elements of extIdx and the second self.negativeSamplingExt * windowSize elements of extIdx, both have the range [0, 1024) instead of [0, 128) and [128, 256) respectively. Sampling over [0, 1024) means sampling across all different speakers.

Can you please advise?

code:

batchIdx = torch.randint(low=0, high=batchSize, size=(self.negativeSamplingExt * windowSize * batchSize, ))

seqIdx = torch.randint(low=1, high=nNegativeExt, size=(self.negativeSamplingExt * windowSize * batchSize, ))

baseIdx = torch.arange(0, windowSize, device=encodedData.device)
baseIdx = baseIdx.view(1, 1, windowSize).expand(batchSize, self.negativeSamplingExt, windowSize)
seqIdx += baseIdx.contiguous().view(-1)
seqIdx = torch.remainder(seqIdx, nNegativeExt)

extIdx = seqIdx + batchIdx * nNegativeExt
negExt = negExt[extIdx].view(batchSize, self.negativeSamplingExt, windowSize, dimEncoded)
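
For reference, here is a small standalone sketch that reproduces the observation above. It uses the default values quoted in this issue (batchSize=8, negativeSamplingExt=128, windowSize=116) and assumes nNegativeExt=128; it is not an excerpt from the repository.

import torch

batchSize, negativeSamplingExt, windowSize, nNegativeExt = 8, 128, 116, 128

# Same index construction as the quoted snippet, without the model-specific tensors
batchIdx = torch.randint(low=0, high=batchSize, size=(negativeSamplingExt * windowSize * batchSize,))
seqIdx = torch.randint(low=1, high=nNegativeExt, size=(negativeSamplingExt * windowSize * batchSize,))
baseIdx = torch.arange(0, windowSize).view(1, 1, windowSize).expand(batchSize, negativeSamplingExt, windowSize)
seqIdx = torch.remainder(seqIdx + baseIdx.contiguous().view(-1), nNegativeExt)
extIdx = seqIdx + batchIdx * nNegativeExt

# The block of indices meant for the first batch element already spans the whole pool
firstBlock = extIdx[:negativeSamplingExt * windowSize]
print(firstBlock.min().item(), firstBlock.max().item())  # close to 0 and 1023, not bounded by [0, 128)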

Can't reproduce the results of the paper

Hi, team.
I am very grateful that you provide the code and data splits for your CPC audio paper (https://arxiv.org/abs/2002.02848).

First I tried to pretrain Mod. CPC on libri-100 and freeze the features for the Common Voice 1-hour ASR task. I got an average PER of 45.2% over 5 languages (es, fr, it, ru, tt), whereas 43.9% is reported in your paper (Table 3). I think my result is close (-1.3%) to what you reported, which seems reasonable.

But when I test the pre-trained features on the 5-hour Common Voice ASR tasks (es, fr, it, ru, tt), I only get an average PER (frozen features) of 42.5%, which has a big gap (-3.7%) with the reported PER (38.8%, Table 5 in the paper); when fine-tuning the features, the gap is even bigger: the average PER is 37.2% (reported as 31.0% in the paper).
Unfortunately, the 5-hour Common Voice ASR experiments also perform badly when training from scratch, with an average PER of 43.2%, far behind the 38.3% reported in your paper.

I would be very thankful if you could kindly provide more detailed hyper-parameters to help me reproduce your results.
In particular, I noticed you added an optional argument --LSTM in ./eval/common_voices_eval.py to add an LSTM layer before the linear softmax layer. I think it would significantly increase the model capacity and may lead to better performance; did you use it?
Thank you very much!

For now I used the default hyper-parameters for the Common Voice ASR transfer experiments:
--batchSize 8
--lr 2e-4
--nEpoch 30
--kernelSize 8
......

CPC pre-training early stopping method?

As the paper does not mention it: what criterion is used to decide when the pre-training iterations should be stopped? What kind of early stopping is used?

The paper only gives the number of iterations used to train CPC. It does not explain how this number of iterations was chosen.

Question about input normalization

Hi, I'm wondering whether the data is normalized to equal power, or whether the model receives inputs of different power during training. I checked the code and didn't find anything like that, and the paper doesn't mention it either.
That is to say, if I train the model with negative samples from different speakers, is there a chance that the model uses the speaker's volume to distinguish positive samples from negative samples?

Facing error when loading the checkpoints after training cpc/train.py

I have trained a model with cpc/train.py. When I evaluated it using cpc/eval/linear_separability.py with the checkpoints saved by cpc/train.py, I got the error "FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/CPC_audio/cpc/checkpoints/checkpoint_args.json'". After checking the checkpoint directory I found that no checkpoint_args.json file was saved; the code below also never saves checkpoint_args.json.
def run(trainDataset,
        valDataset,
        batchSize,
        samplingMode,
        cpcModel,
        cpcCriterion,
        nEpoch,
        pathCheckpoint,
        optimizer,
        scheduler,
        logs):

    print(f"Running {nEpoch} epochs")
    startEpoch = len(logs["epoch"])
    bestAcc = 0
    bestStateDict = None
    start_time = time.time()

    for epoch in range(startEpoch, nEpoch):

        print(f"Starting epoch {epoch}")
        utils.cpu_stats()

        trainLoader = trainDataset.getDataLoader(batchSize, samplingMode,
                                                 True, numWorkers=0)

        valLoader = valDataset.getDataLoader(batchSize, 'sequential', False,
                                             numWorkers=0)

        print("Training dataset %d batches, Validation dataset %d batches, batch size %d" %
              (len(trainLoader), len(valLoader), batchSize))

        locLogsTrain = trainStep(trainLoader, cpcModel, cpcCriterion,
                                 optimizer, scheduler, logs["logging_step"])

        locLogsVal = valStep(valLoader, cpcModel, cpcCriterion)

        print(f'Ran {epoch + 1} epochs '
              f'in {time.time() - start_time:.2f} seconds')

        torch.cuda.empty_cache()

        currentAccuracy = float(locLogsVal["locAcc_val"].mean())
        if currentAccuracy > bestAcc:
            bestStateDict = fl.get_module(cpcModel).state_dict()

        for key, value in dict(locLogsTrain, **locLogsVal).items():
            if key not in logs:
                logs[key] = [None for x in range(epoch)]
            if isinstance(value, np.ndarray):
                value = value.tolist()
            logs[key].append(value)

        logs["epoch"].append(epoch)

        if pathCheckpoint is not None \
                and (epoch % logs["saveStep"] == 0 or epoch == nEpoch-1):

            modelStateDict = fl.get_module(cpcModel).state_dict()
            criterionStateDict = fl.get_module(cpcCriterion).state_dict()

            fl.save_checkpoint(modelStateDict, criterionStateDict,
                               optimizer.state_dict(), bestStateDict,
                               f"{pathCheckpoint}_{epoch}.pt")
            utils.save_logs(logs, pathCheckpoint + "_logs.json")
