m3hrdadfi / soxan

Wav2Vec for speech recognition, speech classification, and audio classification

License: Apache License 2.0

Languages: Jupyter Notebook 99.61%, Python 0.39%

Topics: speech-emotion-recognition, emotion-recognition, automatic-speech-recognition, speech-recognition, speech-classification

soxan's Introduction

Soxan

In Persian, known as سخن (sokhan, meaning "speech").

This repository contains models, scripts, and notebooks that help you take full advantage of Wav2Vec 2.0 in your research. Below, I show how to train speech tasks on your own dataset and how to use the pretrained models.

How to train

This is just the beginning of the possible speech tasks. To start, the training script targets the speech emotion recognition problem.

Training - Notebook

| Task | Notebook |
|------|----------|
| Speech Emotion Recognition (Wav2Vec 2.0) | Open In Colab |
| Speech Emotion Recognition (Hubert) | Open In Colab |
| Audio Classification (Wav2Vec 2.0) | Open In Colab |

Training - CMD

# --model_mode can be "wav2vec2" or "hubert"
python3 run_wav2vec_clf.py \
    --pooling_mode="mean" \
    --model_name_or_path="lighteternal/wav2vec2-large-xlsr-53-greek" \
    --model_mode="wav2vec2" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --train_file=/path/to/train.csv \
    --validation_file=/path/to/dev.csv \
    --test_file=/path/to/test.csv \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --gradient_accumulation_steps=2 \
    --learning_rate=1e-4 \
    --num_train_epochs=5.0 \
    --evaluation_strategy="steps" \
    --save_steps=100 \
    --eval_steps=100 \
    --logging_steps=100 \
    --save_total_limit=2 \
    --do_eval \
    --do_train \
    --fp16 \
    --freeze_feature_extractor

Prediction

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
from src.models import Wav2Vec2ForSpeechClassification, HubertForSpeechClassification

model_name_or_path = "path/to/your-pretrained-model"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate

# For Wav2Vec 2.0 checkpoints:
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

# For Hubert checkpoints, use this line instead:
# model = HubertForSpeechClassification.from_pretrained(model_name_or_path).to(device)


def speech_file_to_array_fn(path, sampling_rate):
    speech_array, _sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{score * 100:.1f}%"}
               for i, score in enumerate(scores)]
    return outputs


path = "/path/to/disgust.wav"
outputs = predict(path, sampling_rate)    

Output:

[
    {'Emotion': 'anger', 'Score': '0.0%'},
    {'Emotion': 'disgust', 'Score': '99.2%'},
    {'Emotion': 'fear', 'Score': '0.1%'},
    {'Emotion': 'happiness', 'Score': '0.3%'},
    {'Emotion': 'sadness', 'Score': '0.5%'}
]

Demos

| Demo | Link |
|------|------|
| Speech To Text With Emotion Recognition (Persian) - soon | huggingface.co/spaces/m3hrdadfi/speech-text-emotion |

Models

| Dataset | Model |
|---------|-------|
| ShEMO: a large-scale validated database for Persian speech emotion detection | m3hrdadfi/wav2vec2-xlsr-persian-speech-emotion-recognition |
| ShEMO: a large-scale validated database for Persian speech emotion detection | m3hrdadfi/hubert-base-persian-speech-emotion-recognition |
| ShEMO: a large-scale validated database for Persian speech emotion detection | m3hrdadfi/hubert-base-persian-speech-gender-recognition |
| Speech Emotion Recognition (Greek) (AESDD) | m3hrdadfi/hubert-large-greek-speech-emotion-recognition |
| Speech Emotion Recognition (Greek) (AESDD) | m3hrdadfi/hubert-base-greek-speech-emotion-recognition |
| Speech Emotion Recognition (Greek) (AESDD) | m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition |
| Eating Sound Collection | m3hrdadfi/wav2vec2-base-100k-eating-sound-collection |
| GTZAN Dataset - Music Genre Classification | m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres |

soxan's People

Contributors: m3hrdadfi

soxan's Issues

Error in Preprocessing the data

I was using the Colab notebook to train a model with Wav2Vec2ForSpeechClassification. In the preprocessing step, when I run the following code:

train_dataset = train_dataset.map(
    preprocess_function,
    batch_size=10,
    batched=True,
)
eval_dataset = eval_dataset.map(
    preprocess_function,
    batch_size=10,
    batched=True,
)

I get the following error; I guess it has something to do with Hugging Face datasets:

0%|          | 0/1765 [00:00<?, ?ba/s]
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_29222/3011913806.py in <module>
      2 
      3 
----> 4 train_dataset = train_dataset.map(
      5     preprocess_function,
      6     batch_size=10,

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   1667 
   1668         if num_proc is None or num_proc == 1:
-> 1669             return self._map_single(
   1670                 function=function,
   1671                 with_indices=with_indices,

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    183         }
    184         # apply actual function
--> 185         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    186         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    187         # re-apply format to the output

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    395             # Call actual function
    396 
--> 397             out = func(self, *args, **kwargs)
    398 
    399             # Update fingerprint of in-place transforms + update in-place history of transforms

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc)
   2036                             else:
   2037                                 batch = cast_to_python_objects(batch)
-> 2038                                 writer.write_batch(batch)
   2039                 if update_data and writer is not None:
   2040                     writer.finalize()  # close_stream=bool(buf_writer is None))  # We only close if we are writing in a file

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
    401             typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
    402             typed_sequence_examples[col] = typed_sequence
--> 403         pa_table = pa.Table.from_pydict(typed_sequence_examples)
    404         self.write_table(pa_table, writer_batch_size)
    405 

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/datasets/arrow_writer.py in __arrow_array__(self, type)
    105                 out = numpy_to_pyarrow_listarray(self.data)
    106             else:
--> 107                 out = pa.array(self.data, type=type)
    108             if trying_type and out[0].as_py() != self.data[0]:
    109                 raise TypeError(

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.pyenv/versions/3.8.7/envs/bg_classifier/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Can only convert 1-dimensional array values
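
This ArrowInvalid error usually means the mapped preprocess_function returns multi-dimensional arrays (for example, stereo audio loaded as a [2, N] tensor) that PyArrow cannot serialize. Below is a minimal sketch of a preprocessing function that guarantees 1-D arrays; the names feature_extractor, target_sampling_rate, and the "path"/"emotion" columns follow the notebook's conventions and are assumptions here, not the exact repository code.

import torchaudio

def speech_file_to_array_fn(path):
    # Load the clip and downmix multi-channel audio to mono so the array is 1-D.
    speech_array, source_rate = torchaudio.load(path)
    if speech_array.shape[0] > 1:
        speech_array = speech_array.mean(dim=0, keepdim=True)
    resampler = torchaudio.transforms.Resample(source_rate, target_sampling_rate)
    return resampler(speech_array).squeeze().numpy()

def preprocess_function(examples):
    speech_list = [speech_file_to_array_fn(path) for path in examples["path"]]
    result = feature_extractor(speech_list, sampling_rate=target_sampling_rate)
    result["labels"] = examples["emotion"]  # assumed to be integer-encoded already
    return result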

use_amp not defined in CTCTrainer

2 frames
<ipython-input> in training_step(self, model, inputs)
     41         inputs = self._prepare_inputs(inputs)
     42
---> 43         if self.use_amp:
     44             with autocast():
     45                 loss = self.compute_loss(model, inputs)

AttributeError: 'CTCTrainer' object has no attribute 'use_amp'

AttributeError: 'CTCTrainer' object has no attribute 'scaler'
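
Both reports point at the same root cause: newer transformers releases removed the use_amp and scaler attributes from the base Trainer, which the notebook's custom CTCTrainer.training_step still reads. A minimal sketch of one workaround, assuming fp16 training on CUDA, is to recreate the two attributes after constructing the trainer as in the notebook:

import torch

# trainer = CTCTrainer(...)  # constructed exactly as in the notebook
trainer.use_amp = training_args.fp16 and torch.cuda.is_available()
trainer.scaler = torch.cuda.amp.GradScaler(enabled=trainer.use_amp)
trainer.train()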

Error while trying the regression model

C:\Users\XTEND\anaconda3\envs\ftorch_gpu\python.exe "C:/Program Files/JetBrains/PyCharm Community Edition 2022.3.2/plugins/python-ce/helpers/pydev/pydevd.py" --multiprocess --qt-support=auto --client 127.0.0.1 --port 63293 --file C:\Users\XTEND\PycharmProjects\Regression_Wav2vec\regression_model_train.py
Connected to pydev debugger (build 223.8617.48)
Dataset({
    features: ['name', 'path', 'emotion'],
    num_rows: 6925
})
Dataset({
    features: ['name', 'path', 'emotion'],
    num_rows: 1732
})
A regression problem with 3 items: [0, 1, 2]
C:\Users\XTEND\anaconda3\envs\ftorch_gpu\lib\site-packages\transformers\configuration_utils.py:380: UserWarning: Passing gradient_checkpointing to a config initialization is deprecated and will be removed in v5 Transformers. Using model.gradient_checkpointing_enable() instead, or if you are using the Trainer API, pass gradient_checkpointing=True in your TrainingArguments.
warnings.warn(
regression
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
The target sampling rate: 16000
Map: 100%|██████████| 100/100 [00:01<00:00, 60.78 examples/s]
Map: 100%|██████████| 100/100 [00:01<00:00, 56.42 examples/s]
Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at lighteternal/wav2vec2-large-xlsr-53-greek and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0%| | 0/60 [00:00<?, ?it/s]C:\Users\XTEND\anaconda3\envs\ftorch_gpu\lib\site-packages\torch\amp\autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
C:\Users\XTEND\anaconda3\envs\ftorch_gpu\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Traceback (most recent call last):
  File "C:\Users\XTEND\anaconda3\envs\ftorch_gpu\lib\contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "C:\Users\XTEND\anaconda3\envs\ftorch_gpu\lib\site-packages\accelerate\accelerator.py", line 988, in accumulate
    yield
  File "C:\Users\XTEND\anaconda3\envs\ftorch_gpu\lib\site-packages\transformers\trainer.py", line 1892, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "C:\Users\XTEND\PycharmProjects\Regression_Wav2vec\regression_model_train.py", line 456, in training_step
    self.scaler.scale(loss).backward()
AttributeError: 'CTCTrainer' object has no attribute 'scaler'
python-BaseException
0%| | 0/60 [00:18<?, ?it/s]

Process finished with exit code -1073741510 (0xC000013A: interrupted by Ctrl+C)

Data processing: issue due to resampled array shape

During data preprocessing I get an error due to the array sizes:

shape after resampling: torch.Size([1, 34176])
shape after resampling: torch.Size([2, 64000]) ...

To resolve this, should I flatten the arrays that have two rows?
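
Flattening a [2, 64000] tensor would concatenate the two channels in time and distort the signal; the usual fix is to downmix stereo to mono by averaging over the channel dimension. A minimal sketch:

import torch

stereo = torch.randn(2, 64000)  # stands in for a stereo clip after resampling
mono = stereo.mean(dim=0)       # torch.Size([64000]), ready for the feature extractor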

Running out of memory

Thanks for providing your great code. I ran the wav2vec emotion classification on Greek audio and it works great. However, I keep running into memory errors when changing the dataset.

I am trying to do emotion classification on the IEMOCAP dataset (https://sail.usc.edu/iemocap/), but it breaks due to running out of memory. I am changing nothing in your code except the dataset (which fits your pipeline well).

I made an SO post describing my issue. Do you have any ideas? I am running a p2.8xlarge instance with 100 GiB mounted, so I can't imagine I don't have enough compute.

SO post: https://stackoverflow.com/questions/68624392/running-out-of-memory-with-pytorch

Thanks!

Can't Load trained model

Thanks for your repositories.

I trained the model, but I don't know how to load it for prediction.
Can you help me?
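
The Prediction section above shows the full inference flow; in short, point from_pretrained at the training output directory (or one of its checkpoint-* subfolders). A minimal sketch, assuming you run from a clone of this repository and with a placeholder path:

import torch
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
from src.models import Wav2Vec2ForSpeechClassification

model_name_or_path = "/path/to/output"  # or a specific checkpoint directory
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)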

About the dataset creation and training speed

Hello @m3hrdadfi, sorry to disturb you. I created my own train.csv (4213 records) and dev.csv (527 records) and ran run_wav2vec_clf.py to train a music genre recognition model, but found that:

  1. The cached data is too large: my recordings are 16 kHz MP3s totalling less than 1 GB, but the files generated under ~/.cache/huggingface/datasets/csv/default-f524d204c50754f6/0.0.0/ take up more than 18 GB. Did you run into this? Moreover, the train_dataset.map stage takes more than 2 hours, which is very long.
  2. How can I use all 4 GPUs to train? I tried setting CUDA_VISIBLE_DEVICES=0,1,2,3 but it does not work, and GPU utilization is very low (see the launch sketch below).

Thanks a lot!
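
On the multi-GPU question: launching with plain python3 only wraps the model in DataParallel, which often leaves GPUs underutilized; the Hugging Face Trainer uses all visible GPUs more efficiently when started through torch.distributed. A minimal sketch of a 4-GPU launch, with the remaining arguments exactly as in the Training - CMD section above:

python -m torch.distributed.launch --nproc_per_node=4 run_wav2vec_clf.py \
    --model_mode="wav2vec2" \
    --output_dir=/path/to/output  # plus the remaining Training - CMD arguments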

Error during Training on private dataset

Morning,
I used your Speech Emotion Recognition (Wav2Vec 2.0) notebook with another dataset and got an error during training.
Could you help me please? The code and error are just below.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=finetune_output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="steps",  # or "epoch"
    gradient_accumulation_steps=1,
    num_train_epochs=50,
    fp16=True,
    save_steps=10,   # n_steps
    eval_steps=10,   # n_steps
    logging_steps=10,
    learning_rate=1e-4,
    save_total_limit=10,
)

trainer = CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)

trainer.train()
The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: language, audio_name, path.
***** Running training *****
  Num examples = 10769
  Num Epochs = 50
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 134650
/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
 [ 142/134650 1:17:58 < 1248:33:33, 0.03 it/s, Epoch 0.05/50]
| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10 | 0.698400 | 0.497485 | 0.813416 |
| 20 | 0.394700 | 0.291701 | 0.913778 |
| 30 | 0.225200 | 0.138921 | 0.951371 |
| 40 | 0.389500 | 0.137598 | 0.962752 |
| 50 | 0.373600 | 0.469463 | 0.878255 |
| 60 | 0.079500 | 0.144742 | 0.972237 |
| 70 | 0.213000 | 0.185833 | 0.969822 |
| 80 | 0.046400 | 0.295700 | 0.947405 |
| 90 | 0.003300 | 0.149647 | 0.979134 |
| 100 | 0.000800 | 0.124717 | 0.978617 |
| 110 | 0.313800 | 0.237750 | 0.958441 |
| 120 | 0.251000 | 0.166465 | 0.965166 |
| 130 | 0.032900 | 0.044269 | 0.989826 |
| 140 | 0.051600 | 0.061006 | 0.989826 |

Attempted to log scalar metric loss:
0.6984
Attempted to log scalar metric learning_rate:
9.999257333828444e-05
Attempted to log scalar metric epoch:
0.0
The following columns in the evaluation set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: language, audio_name, path.
***** Running Evaluation *****
  Num examples = 5799
  Batch size = 4
Attempted to log scalar metric eval_loss:
0.4974852204322815
Attempted to log scalar metric eval_accuracy:
0.8134161233901978
Attempted to log scalar metric eval_runtime:
296.3331
Attempted to log scalar metric eval_samples_per_second:
19.569
Attempted to log scalar metric eval_steps_per_second:
4.893
Attempted to log scalar metric epoch:
0.0
Saving model checkpoint to MODEL/wav2vec2-xlsr-speech-emotion-recognition_dropout-0.5_3/checkpoint-10
Configuration saved in MODEL/wav2vec2-xlsr-speech-emotion-recognition_dropout-0.5_3/checkpoint-10/config.json
Model weights saved in MODEL/wav2vec2-xlsr-speech-emotion-recognition_dropout-0.5_3/checkpoint-10/pytorch_model.bin
Configuration saved in MODEL/wav2vec2-xlsr-speech-emotion-recognition_dropout-0.5_3/checkpoint-10/preprocessor_config.json
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-3435b262f1ae> in <module>
----> 1 trainer.train()

/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1330                         tr_loss_step = self.training_step(model, inputs)
   1331                 else:
-> 1332                     tr_loss_step = self.training_step(model, inputs)
   1333 
   1334                 if (

<ipython-input-29-878b4353167f> in training_step(self, model, inputs)
     43         if self.use_amp:
     44             with autocast():
---> 45                 loss = self.compute_loss(model, inputs)
     46         else:
     47             loss = self.compute_loss(model, inputs)

/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1921         else:
   1922             labels = None
-> 1923         outputs = model(**inputs)
   1924         # Save past state if it exists
   1925         # TODO: this needs to be fixed and made cleaner later.

/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

<ipython-input-16-dd9fe3ea0f13> in forward(self, input_values, attention_mask, output_attentions, output_hidden_states, return_dict, labels)
     70     ):
     71         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
---> 72         outputs = self.wav2vec2(
     73             input_values,
     74             attention_mask=attention_mask,

/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py in forward(self, input_values, attention_mask, mask_time_indices, output_attentions, output_hidden_states, return_dict)
   1285 
   1286         hidden_states, extract_features = self.feature_projection(extract_features)
-> 1287         hidden_states = self._mask_hidden_states(
   1288             hidden_states, mask_time_indices=mask_time_indices, attention_mask=attention_mask
   1289         )

/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py in _mask_hidden_states(self, hidden_states, mask_time_indices, attention_mask)
   1228             hidden_states[mask_time_indices] = self.masked_spec_embed.to(hidden_states.dtype)
   1229         elif self.config.mask_time_prob > 0 and self.training:
-> 1230             mask_time_indices = _compute_mask_indices(
   1231                 (batch_size, sequence_length),
   1232                 mask_prob=self.config.mask_time_prob,

/anaconda/envs/azureml_py38_pytorch/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py in _compute_mask_indices(shape, mask_prob, mask_length, attention_mask, min_masks)
    240 
    241         # get random indices to mask
--> 242         spec_aug_mask_idx = np.random.choice(
    243             np.arange(input_length - (mask_length - 1)), num_masked_span, replace=False
    244         )

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'
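
This ValueError from _compute_mask_indices typically means some clips are shorter than SpecAugment's time-mask window after feature extraction. A minimal sketch of one workaround, filtering very short audio after the preprocessing step; MIN_SECONDS and target_sampling_rate are assumed names, and shrinking config.mask_time_length is an alternative:

MIN_SECONDS = 1.0  # hypothetical threshold; tune for your data

train_dataset = train_dataset.filter(
    lambda batch: [len(x) >= MIN_SECONDS * target_sampling_rate
                   for x in batch["input_values"]],
    batched=True,
)

# Alternatively, when building the config:
# config.mask_time_length = 2  # the wav2vec2 default is 10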

library import issue

I'm trying to use the model m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres. To do so, I'm following your guide at https://huggingface.co/m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres. I cloned the model locally. My problem is that this line of code:

model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

generates the error:

NameError: name 'Wav2Vec2ForSpeechClassification' is not defined

It seems that this class doesn't exist. How can I solve it?
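
Wav2Vec2ForSpeechClassification is not part of the transformers package; it is defined in this repository under src/models. A minimal sketch, assuming you run from a clone of this repository so that src is importable:

from src.models import Wav2Vec2ForSpeechClassification

model = Wav2Vec2ForSpeechClassification.from_pretrained(
    "m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres"
)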

AttributeError: 'Wav2Vec2Config' object has no attribute 'problem_type'

Hi, can you help me with this problem?
Thank you!


AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 trainer.train()

~/anaconda3/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1051                 raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
   1052
-> 1053         logger.info(f"Loading model from {resume_from_checkpoint}).")
   1054
   1055         if os.path.isfile(os.path.join(resume_from_checkpoint, CONFIG_NAME)):

<ipython-input> in training_step(self, model, inputs)
     45                 loss = self.compute_loss(model, inputs)
     46         else:
---> 47             loss = self.compute_loss(model, inputs)
     48
     49         if self.args.gradient_accumulation_steps > 1:

~/anaconda3/lib/python3.7/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1473         # Save model checkpoint
   1474         checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
-> 1475
   1476         if self.hp_search_backend is not None and trial is not None:
   1477             if self.hp_search_backend == HPSearchBackend.OPTUNA:

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729             _global_forward_hooks.values(),

<ipython-input> in forward(self, input_values, attention_mask, output_attentions, output_hidden_states, return_dict, labels)
     83         loss = None
     84         if labels is not None:
---> 85             if self.config.problem_type is None:
     86                 if self.num_labels == 1:
     87                     self.config.problem_type = "regression"

AttributeError: 'Wav2Vec2Config' object has no attribute 'problem_type'
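
problem_type ships with newer transformers config classes, so older installations raise this AttributeError; upgrading transformers is the cleanest fix. A minimal sketch of a workaround that restores the attribute on the config before training:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("lighteternal/wav2vec2-large-xlsr-53-greek")
if not hasattr(config, "problem_type"):
    config.problem_type = None  # the attribute newer versions define by default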

'Wav2Vec2FeatureExtractor' object has no attribute 'feature_extractor'

When running wav2vec training on my dataset, the following problem occurred at line 296 of the run_wav2vec_clf.py script:

Traceback (most recent call last):
  File "run_wav2vec_clf.py", line 490, in <module>
    main()
  File "run_wav2vec_clf.py", line 295, in main
    target_sampling_rate = feature_extractor.feature_extractor.sampling_rate
AttributeError: 'Wav2Vec2FeatureExtractor' object has no attribute 'feature_extractor'

The solution for me was to replace with the line:

target_sampling_rate = feature_extractor.sampling_rate

Hope this helps if anyone else has the same problem.

Error when loading tokenizer after fine-tuning

Hi, first of all, congrats on the repo, it's really useful!
I followed the Emotion recognition in Greek speech using Wav2Vec2.ipynb notebook.
After finishing training on my own data, I get the following error when trying to load the processor with:

processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)

The error:

OSError: Can't load tokenizer for '[/path/to/model/]checkpoint-860/'. If you were trying to load it from 'https://huggingface.co/models',
Otherwise, make sure '[/path/to/model/]checkpoint-860/' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

Checking the checkpoint folder, there is no tokenizer file in there; am I missing something? This is the content of the folder:
(screenshot of the checkpoint directory attached in the original issue)

PS: the model loads correctly with model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name)
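
The classification setup trains with a feature extractor only, so no tokenizer files are written to the checkpoint, and Wav2Vec2Processor (which bundles a tokenizer) cannot be loaded. Loading the feature extractor directly works; a minimal sketch with a placeholder path:

from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "/path/to/model/checkpoint-860/"
)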

multi decoder speech model

Hi there, I am planning to fine-tune XLS-R with multiple decoder heads (language detection, ASR, speech-to-IPA, gender identification, etc.). Do you know of any XLS-R, WavLM, or other speech model implementations, preferably on Hugging Face, that I could use to build multiple decoder heads on top of a single pretrained model so it handles all these tasks at once?

Demo of training gtzan music genre classifier

You have not included a demo notebook for the music genre classifier. I used your pretrained model to predict, and the prediction scores seem to be correct. Could you share your training process for the GTZAN dataset? If not, could you at least tell me which pretrained model you used for GTZAN? It couldn't have been the same one you used for modeling eating sounds, right?

cannot import name 'Wav2Vec2Processor'

An error occurred while running the code:
from transformers import AutoConfig, Wav2Vec2Processor, Wav2Vec2FeatureExtractor
ImportError: cannot import name 'Wav2Vec2Processor'
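
This ImportError usually means the installed transformers release predates the Wav2Vec2 classes; upgrading transformers generally resolves it. A minimal check:

import transformers

print(transformers.__version__)
# If the version is old, upgrade in your shell:
#   pip install --upgrade transformers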

Confirming eval and test sets

Hi @m3hrdadfi,

Thank you for the great repository.
I just want to confirm: in the Colab you provided, the evaluation and test sets are the same.
It is intended for demo only, right? Since the test set is included in the training process (as eval_dataset),
it is not a big surprise that the performance was high.
