
NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment

License: MIT License

speech-quality deep-learning interspeech icassp tts pytorch voice-conversion text-to-speech speech-synthesis quality-of-experience

nisqa's Introduction

NISQA: Speech Quality and Naturalness Assessment

+++ News: The NISQA model has recently been updated to NISQA v2.0. The new version offers multidimensional predictions with higher accuracy and allows for training and finetuning the model.

Speech Quality Prediction:
NISQA is a deep learning model/framework for speech quality prediction. The NISQA model weights can be used to predict the quality of a speech sample that has been sent through a communication system (e.g., telephone or video call). Besides overall speech quality, NISQA also provides predictions for the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness to give more insight into the cause of the quality degradation.

TTS Naturalness Prediction:
The NISQA-TTS model weights can be used to estimate the Naturalness of synthetic speech generated by a Voice Conversion or Text-To-Speech system (Siri, Alexa, etc.).

Training/Finetuning:
NISQA can be used to train new single-ended or double-ended speech quality prediction models with different deep learning architectures, such as CNN or DFF -> Self-Attention or LSTM -> Attention-Pooling or Max-Pooling. The provided model weights can also be used to finetune the trained model on new data or for transfer learning to a different regression task (e.g. quality estimation of enhanced speech, speaker similarity estimation, or emotion recognition).

Speech Quality Datasets:
We provide a large corpus of more than 14,000 speech samples with subjective speech quality and speech quality dimension labels.


For more information about the deep learning model structure, the training datasets, and the training options, see the NISQA paper and the Wiki.

Installation

To install the requirements, first install Anaconda and then run:

conda env create -f env.yml

This will create a new environment with the name "nisqa". Activate this environment to continue:

conda activate nisqa

Using NISQA

We provide examples for using NISQA to predict the quality of speech samples, to train a new speech quality model, and to evaluate the performance of a trained speech quality model.

Three different sets of model weights are available; the appropriate weights should be loaded depending on the domain:

Model                 | Prediction Output                                                | Domain             | Filename
NISQA (v2.0)          | Overall Quality, Noisiness, Coloration, Discontinuity, Loudness | Transmitted Speech | nisqa.tar
NISQA (v2.0) mos only | Overall Quality only (for finetuning/transfer learning)         | Transmitted Speech | nisqa_mos_only.tar
NISQA-TTS (v1.0)      | Naturalness                                                      | Synthesized Speech | nisqa_tts.tar

Prediction

There are three modes available to predict the quality of speech via command line arguments:

  • Predict a single file
  • Predict all files in a folder
  • Predict all files in a CSV table

Important: Select "nisqa.tar" to predict the quality of a transmitted speech sample and "nisqa_tts.tar" to predict the Naturalness of a synthesized speech sample.

To predict the quality of a single .wav file use:

python run_predict.py --mode predict_file --pretrained_model weights/nisqa.tar --deg /path/to/wav/file.wav --output_dir /path/to/dir/with/results

To predict the quality of all .wav files in a folder use:

python run_predict.py --mode predict_dir --pretrained_model weights/nisqa.tar --data_dir /path/to/folder/with/wavs --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results

To predict the quality of all .wav files listed in a csv table use:

python run_predict.py --mode predict_csv --pretrained_model weights/nisqa.tar --csv_file files.csv --csv_deg column_name_of_filepaths --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results

The results are printed to the console and saved to a CSV file in a given folder (optionally set with --output_dir). To speed up the prediction, the number of workers and the batch size of the PyTorch DataLoader can be increased (optionally set with --num_workers and --bs). For stereo files, --ms_channel can be used to select the audio channel.
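
As a small sketch of how the output could be consumed afterwards, the snippet below loads the results CSV with pandas. The file name "NISQA_results.csv" is an assumption (use whatever file run_predict.py writes into your --output_dir); the *_pred column names match the prediction columns quoted elsewhere in this repository.

    # Hedged sketch: inspect the prediction CSV written by run_predict.py.
    # The file name below is an assumed example, not a guaranteed output name.
    import pandas as pd

    results = pd.read_csv("/path/to/dir/with/results/NISQA_results.csv")
    # mos_pred, noi_pred, dis_pred, col_pred, loud_pred hold the per-file predictions
    print(results.filter(like="_pred").describe())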

Training

Finetuning / Transfer Learning

To use the model weights to finetune the model on a new dataset, only a CSV file with the filenames and labels is needed. The training configuration is controlled via a YAML file, and the finetuning can be started as follows:

python run_train.py --yaml config/finetune_nisqa.yaml
  • If the NISQA Corpus is used, only two arguments need to be updated in the YAML file and you are ready to go: the data_dir, pointing to the extracted NISQA_Corpus folder, and the output_dir, where the results should be stored.

  • If you use your own dataset or want to load the NISQA-TTS model, some other updates are needed.

    Your CSV file needs to contain at least three columns with the following names (a minimal pandas sketch for creating such a file is given at the end of this section):

    • db with the individual dataset names for each file
    • filepath_deg with the filepath to the degraded WAV file, either absolute or relative to data_dir (CSV column name can be changed in the YAML)
    • mos with the target labels (CSV column name can be changed in YAML)

    The finetune_nisqa.yaml needs to be updated as follows:

    • data_dir path to the main folder, which contains the CSV file and the datasets
    • output_dir path to output folder with saved model weights and results
    • pretrained_model filename of the pretrained model, either nisqa_mos_only.tar for natural speech or nisqa_tts.tar for synthesized speech
    • csv_file name of the CSV with filepaths and target labels
    • csv_deg CSV column name that contains filepaths (e.g. filepath_deg)
    • csv_mos_train and csv_mos_val CSV column names of the target value (e.g. mos)
    • csv_db_train and csv_db_val names of the datasets you want to use for training and validation. Dataset names must match entries in the db column.

See the comments in the YAML configuration file and the Wiki (not yet added) for more advanced training options. A good starting point would be to use the NISQA Corpus to get the training started with the standard configuration.
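
For users building their own label file, the following is a minimal, hypothetical pandas sketch of a CSV with the three required columns described above; all paths, dataset names, and labels are placeholders.

    # Hedged sketch: create a label CSV with the columns db, filepath_deg, and mos.
    import pandas as pd

    labels = pd.DataFrame({
        "db": ["my_dataset", "my_dataset"],                # dataset name per file
        "filepath_deg": ["my_dataset/wavs/file_01.wav",    # paths relative to data_dir
                         "my_dataset/wavs/file_02.wav"],
        "mos": [3.4, 4.1],                                 # subjective target labels
    })
    labels.to_csv("/path/to/data_dir/my_labels.csv", index=False)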

Training a new model

NISQA can also be used as a framework to train new speech quality models with different deep learning architectures. The general model structure is as follows (a simplified sketch is given after the list):

  1. Framewise model: CNN or Feedforward network
  2. Time-Dependency model: Self-Attention or LSTM
  3. Pooling: Average, Max, Attention or Last-Step-Pooling
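
To make the three stages concrete, here is a heavily simplified, generic PyTorch sketch of such a structure. It is not the repository's implementation, and all layer sizes and shapes are made-up placeholders.

    # Hedged sketch of the framewise -> time-dependency -> pooling structure.
    import torch
    import torch.nn as nn

    class FramewiseCNN(nn.Module):
        # Encodes each mel-spectrogram segment independently into a feature vector.
        def __init__(self, feat_dim=64):
            super().__init__()
            self.cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d((4, 4)))
            self.proj = nn.Linear(16 * 4 * 4, feat_dim)

        def forward(self, x):                               # x: (batch, n_wins, n_mels, width)
            b, n, h, w = x.shape
            feats = self.cnn(x.reshape(b * n, 1, h, w)).flatten(1)
            return self.proj(feats).reshape(b, n, -1)       # (batch, n_wins, feat_dim)

    class QualityModel(nn.Module):
        def __init__(self, feat_dim=64):
            super().__init__()
            self.framewise = FramewiseCNN(feat_dim)
            self.time_dep = nn.TransformerEncoder(          # self-attention time-dependency stage
                nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True), num_layers=2)
            self.att_pool = nn.Linear(feat_dim, 1)          # attention-pooling weights
            self.head = nn.Linear(feat_dim, 1)              # MOS regression head

        def forward(self, x):
            h = self.time_dep(self.framewise(x))            # (batch, n_wins, feat_dim)
            w = torch.softmax(self.att_pool(h), dim=1)      # per-segment attention weights
            return self.head((w * h).sum(dim=1))            # pooled features -> predicted MOS

    mos = QualityModel()(torch.randn(2, 100, 48, 15))       # 2 files, 100 spectrogram segments each
    print(mos.shape)                                        # torch.Size([2, 1])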

The framewise and time-dependency models can be skipped, for example to train an LSTM model without a CNN that uses the last time step for prediction. A second time-dependency stage can also be added, for example to build an LSTM-Self-Attention structure. The model structure can easily be controlled via the YAML configuration file. Training with the standard NISQA model configuration on the NISQA Corpus can be started as follows:

python run_train.py --yaml config/train_nisqa_cnn_sa_ap.yaml

If the NISQA Corpus is used, only data_dir (pointing to the unzipped NISQA_Corpus folder) and output_dir need to be updated in the YAML file. Otherwise, see the previous finetuning section for updating the YAML file when a custom dataset is used.

It is also possible to train other combinations of neural networks; for example, to train a model with an LSTM instead of Self-Attention, the train_nisqa_cnn_lstm_avg.yaml example configuration file is provided.

To train a double-ended model for full-reference speech quality prediction, the train_nisqa_double_ended.yaml configuration file can be used as an example. See the comments in the YAML files and the Wiki (not yet added) for more details on different possible model structures and advanced training options.

Evaluation

Trained models can be evaluated on a given dataset as follows (can also be used as a conformance test of the model installation):

python run_evaluate.py

Before running, the options and paths inside the Python script run_evaluate.py should be updated. If the NISQA Corpus is used, only the data_dir and output_dir paths need to be adjusted. Besides Pearson's correlation and RMSE, an RMSE after first-order polynomial mapping is also calculated. If a CSV file with per-condition labels is provided, the script will also output per-condition results and RMSE*. Optionally, correlation diagrams can be plotted. When run on the NISQA Corpus, the script should return the same results as in the NISQA paper.
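
As a rough illustration (not the repository's evaluation code), the metrics mentioned above could be computed as follows, assuming y_true are subjective MOS labels and y_pred the model predictions:

    # Hedged sketch of Pearson's correlation, RMSE, and RMSE after
    # first-order polynomial mapping of the predictions onto the labels.
    import numpy as np
    from scipy import stats

    def eval_metrics(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)

        pcc = stats.pearsonr(y_true, y_pred)[0]             # Pearson's correlation
        rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))     # plain RMSE

        # First-order (linear) mapping of the predictions onto the subjective
        # scale, then RMSE of the mapped predictions.
        coeffs = np.polyfit(y_pred, y_true, deg=1)
        rmse_map = np.sqrt(np.mean((y_true - np.polyval(coeffs, y_pred)) ** 2))

        return pcc, rmse, rmse_map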

NISQA Corpus

The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet-loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions.

For the download link and more details on the datasets and used source speech samples see the NISQA Corpus Wiki.

Paper and License

The NISQA code is licensed under MIT License.

The model weights (nisqa.tar, nisqa_mos_only.tar, nisqa_tts.tar) are provided under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.

The NISQA Corpus is provided under the original terms of the used source speech and noise samples. More information can be found in the NISQA Corpus Wiki.

Copyright © 2021 Gabriel Mittag
www.qu.tu-berlin.de

nisqa's People

Contributors

gabrielmittag


nisqa's Issues

Making .exe from anaconda

Hey, I need to make an .exe file from this project, simply to make it more portable and reduce its size. I tried to make one using PyInstaller, but the resulting .exe does not work for me and fails with an error like this:

Traceback (most recent call last):
File "run_predict.py", line 5, in <module>
ModuleNotFoundError: No module named 'nisqa'
[4816] Failed to execute script 'run_predict' due to unhandled exception!

Is it even possible to have an .exe of the prediction part of NISQA?

I have no problem running it inside of Anaconda, but that's really not what I want.

MOS results change depending on the audio sequence

Hi,

I got an interesting observation when using the default MOS predictor.
When testing on a sample with packet loss, I got a MOS score of 1.66996. The waveform is shown below:

[image: waveform]

Then I cut the first half of the waveform, which contains the packet loss, and pasted it at the end of the waveform as shown below:
[image: waveform]
I got a MOS score of 2.3836, a 0.7136 improvement relative to the original audio.

But for a human, these two audio clips sound basically the same, so the MOS predictions should not differ this much...

In the paper, the authors mention that there are several ways of pooling to deal with time-related information. I'm thinking that maybe that's the reason? Do you have any suggestions for avoiding this kind of situation?

Thanks!

Prediction runs slowly

Hi! Thank you so much for this project.
I'm trying to speed up the running time of the predictions, but even when running on GPU, it still takes a long time.
Is there any way to speed up the prediction? (I have tried to increase the data loader workers to 10 and batch size to 1000, but it didn't help).
Thank you!

Availability of the source code

Hi,
Thanks for this project. I was wondering if you were planning to make the source code available at some point?

We would like to test this architecture in a different context (evaluation of TTS speech quality), so it would be useful to have access to the code itself.

Thanks.

Using NISQA as a loss function

Hi, thank you very much for the work, the repo and the pretrained models.

Can we use NISQA trained weights (nisqa_tts) as a loss function to train a TTS/audio synthesis model?

Thanks in advance

CUDA device does not load the model

Hi,
I am trying to reach you about a problem I face: the model does not load onto the GPU during inference. The CPU is fast though, each file takes about 1 second to complete, but this is not the behaviour I want; I would like it to be a bit faster. Can you please give me a hint? I have even set self.dev to cuda in NISQA_DIM, but the model still runs on the CPU.
[image: screenshot]

Thanks in advance!

modular version

Hello,
First of all, thanks for your great and valuable work.
Is it possible to provide a modular version of NISQA, or to use NISQA as a module?
Best regards
@gabrielmittag

Do we need any other calibration before predicting MOS?

Hi,

Thanks for the great works!

I notice that issue #4 mentioned that the volume of the input audio might affect the result. Is there any other calibration or pre-processing that needs to be done for the wav files before applying run_predict.py to them? To be more specific, if I want to compare the MOS of speech audio recorded from different devices, is there any other calibration I need to apply? For example, Active Speech Level adjustment as in double-ended methods [1]. Thanks!

Reference:
[1] Section 3.2.1.4.: https://cdn.head-acoustics.com/fileadmin/data/global/Application-Notes/Telecom/3QUEST-Application-Note.pdf

RuntimeError: Could not infer dtype of numpy.float32

Hello, sorry to bother you. I have run into a problem. I configured the environment as described in the README file, but the program fails when I run run_predict_file.py with: RuntimeError: Could not infer dtype of numpy.float32. Are the versions of some libraries incorrect? I have tried several versions of numpy, but it doesn't work and displays the same error.
Looking forward to your reply!

TTS naturalness prediction based on which model

Hi,
Thanks for your wonderful work. I was wondering whether the TTS naturalness prediction model uses self-attention or CNN-LSTM, and whether the training datasets contain a Chinese corpus. Hope for your reply.
Best
ShengnanZhang

Question about the effectiveness of Chinese speech quality evaluation

Hi,
Thank you for your hard work. I have a question; it would be great if you could find some time to answer it.
The pre-trained model "NISQA (v2.0) mos only" you provided shows good results on English datasets. However, performance is limited on our self-made Chinese dataset. We collected 2000 speech samples with simulated distortions and obtained their MOS through crowdsourced scoring. However, the PCC of "NISQA (v2.0) mos only" is only 68% and the RMSE is 0.99.
Does the provided pre-trained model have strong restrictions on the type of speech?
Thanks
Yimingxiao

Audio input requirements

Are there any specific requirements for audio files to make the results of NISQA valid?
I couldn't find any documentation in this repo or the original paper describing the audio requirements, but I was hoping to use home-made recordings to evaluate the performance of speech enhancement algorithms. Can any audio be used while still giving valid results?

I've been running NISQA on some local files and have found that the MOS scores don't always correlate with subjectively listening to the files. Is there anything I should be doing to make these files valid for use in NISQA?
For example, are there requirements/recommendations on:

  • total duration of audio file;
  • proportion of speech and non-speech in the file;
  • level requirements;
  • suggested SNR for evaluation files (before speech enhancement is applied).

Interpretation of different metrics

Hi,
Thanks for your continued support and efforts on this subject. I wanted to ask a question about the different metrics:
'mos_pred': 1 worst, 5 best
'noi_pred': 1 best, 5 worst
'dis_pred': 1 best, 5 worst
'col_pred': 1 best, 5 worst
'loud_pred': 1 worst, 5 best

Is this understanding right?
thanks in advance.

Can the quality of front-end signal processing be evaluated?

Hi
Thank you very much for providing this project.
Can this project be used to evaluate the voice quality after front-end signal processing (AEC - acoustic echo cancellation, NS - noise suppression, AGC - automatic gain control)?
Is it necessary to fine-tune the model? Do you know where to find some open-source datasets focused on speech enhancement / noise suppression?
Thanks!

Utilizing finetuned weights

Hi, I've finetuned nisqa_tts.tar following the README instructions and obtained a new set of weights. However, when I try to use them for inference, I get the following error message.

python run_predict.py --mode predict_csv --pretrained_model weights/nisqa_custom.tar --csv_file /home/udesa_ubuntu/tesis/NISQA_analysis/NISQA_finetuning/test_set.csv --csv_deg stimuli --num_workers 0 --bs 10 --output_dir /home/udesa_ubuntu/tesis/NISQA_analysis/NISQA_finetuning/test

Device: cpu
Model architecture: NISQA
Loaded pretrained model from weights/nisqa_custom.tar
Traceback (most recent call last):
File "/home/udesa_ubuntu/tesis/NISQA/run_predict.py", line 42, in
nisqa = nisqaModel(args)
File "/home/udesa_ubuntu/tesis/NISQA/nisqa/NISQA_model.py", line 35, in init
self._loadDatasets()
File "/home/udesa_ubuntu/tesis/NISQA/nisqa/NISQA_model.py", line 738, in _loadDatasets
self._loadDatasetsCSVpredict()
File "/home/udesa_ubuntu/tesis/NISQA/nisqa/NISQA_model.py", line 818, in _loadDatasetsCSVpredict
csv_con_file_path = os.path.join(self.args['data_dir'], self.args['csv_con'])
File "/home/udesa_ubuntu/miniconda3/envs/nisqa/lib/python3.9/posixpath.py", line 90, in join
genericpath._check_arg_types('join', a, *p)
File "/home/udesa_ubuntu/miniconda3/envs/nisqa/lib/python3.9/genericpath.py", line 152, in _check_arg_types
raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'

max window length error for most audio files

I have started trying out this library on some of my audio files, but I am facing the error below for most of my wav files:

ValueError: n_wins 2540 > max_length 1300 --- ok.wav. Increase max window length ms_max_segments!

Could you please explain why this occurs?

It seems slow because some functions run on the CPU

y_hat_list = [ [model(xb.to(dev), n_wins.to(dev)).cpu().numpy(), yb.cpu().numpy()] for xb, yb, (idx, n_wins) in dl]

This function runs on the CPU and uses about 60% CPU (8 cores / 16 threads, R5 5700X).

Predicting a 10-second audio clip on an RTX 4060 Ti takes 0.26 seconds:

less than 0.05 seconds is spent in the neural network
more than 0.2 seconds is spent in the DataLoader

About the NISQA_lib.py

Hi,
First of all thank you for your fantastic research!
I read the NISQA_lib.py code and found that the model provided there is more like the model structure defined in "Quality Degradation Diagnosis for Voice Networks – Estimating the Perceived Noisiness, Coloration, and Discontinuity of Transmitted Speech", and does not include the per-frame quality.
Will you complete this code in the future?
Thanks!

Script quits unexpectedly (without errors) when trying to export model to ONNX

First of all, thanks for this invaluable resource :)

As for my issue: I would like to export the NISQA v2 model to ONNX so that I can use it for data evaluation within my TensorFlow-based environment without introducing PyTorch as a dependency (similarly to Microsoft's DNSMOS P.835). However, when attempting to run torch.onnx.export(), the script exits unexpectedly without throwing any error or raising an exception. This happens no matter which opset version I pick. Do you know if the issue is related to certain operations happening within the model?

Here is the code that I used. I just replace the content of export_dim with this and run the prediction script as usual.
If you're interested in further debugging the issue, I can upload my fork of the repo with the full script.

  x, y, (idx, n_wins) = ds[0]
  x = x.unsqueeze(0)
  x.requires_grad = True
  n_wins = torch.from_numpy(n_wins).unsqueeze(0)
  #n_wins.requires_grad = True
  model.eval()
  torch.onnx.export(model,                   # model being run
                  (x, n_wins),               # model input (or a tuple for multiple inputs)
                  "nisqa.onnx",              # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=14,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
  )
  print('Done')   # !!! this line never gets executed because the script quits unexpectedly !!!

Thanks in advance

upper bound and larger bound inconsistent with step sign

Hi, I'm trying to batch predict on some speech CSVs. At first, some audio segments were a bit lengthy, which caused the problem of #11, so I saved new models with the new max set to 5500 (nisqa_tts_5500 and nisqa_all_5500). The models seem to predict fine until the "upper bound and larger bound inconsistent with step sign" problem shows up.

What is the best value for overall quality? Higher? Lower? And what is the range?

Thank you so much for your contribution. There are five speech quality dimensions output by inference with the network. I want to make sure whether my understanding is correct for some metrics.

For better synthetic speech
MOS: higher, [0, 5]
Noisiness: lower, range?
Coloration: higher, range?
Discontinuity: lower, range?
Loudness: higher, range?

Thank you so much for your help. Have a nice day.

MOS prediction during runtime

I would like to use NISQA to predict MOS at runtime. Using run_predict.py is fine when I only have to predict MOS once, but I would like to avoid loading the NISQA model every time I want to calculate the MOS of a wave file. My use case is something like:

for i in range(num_steps):  # on the order of 10k steps
    wav, sr = librosa.load(path)
    ...
    # make some changes to wav
    ...
    mos = NISQA(wav)

It would be really nice to have a method that loads the NISQA model once and allows me to pass wav files at runtime. Is there already a way to do this?

Can't download NISQA Corpus

Hi,

I want to download the NISQA Corpus you just published, but the file I downloaded is 0 kB, and the download fails if I try again.
I think something is wrong with the download link. Could you check it?
Thank you.

Liu

License

Hi,
Amazing project! I was wondering if you might consider switching the license of your pretrained models to a more permissive license, since the rest of this project is MIT-licensed.
Thank you!

Bug in Alignment

There is a bug in Alignment in line 1267 when calculating the padding mask for the attention scores:

NISQA/nisqa/NISQA_lib.py

Lines 1264 to 1274 in 2c8917f

def _mask_attention(self, att, y, n_wins):
    mask = torch.arange(att.shape[2])[None, :] < n_wins[:, None].to('cpu').to(torch.long)
    mask = mask.unsqueeze(1).expand_as(att)
    att[~mask] = float("-Inf")

def forward(self, query, y, n_wins_y):
    if self.att is not None:
        att_score, sim = self.att(query, y)
        self._mask_attention(att_score, y, n_wins_y)
        y = self.apply_att(y, att_score)
    return y

The padded elements of att should be set to 0 instead of float("-Inf"). float("-Inf") is only appropriate if att is fed into a softmax afterwards, so that those positions receive 0 attention. However, in self.att(query, y) the softmax is already calculated for att (line 1273). Hence, it should be att[~mask] = 0. With the current implementation I get NaNs in ApplySoftAttention; with the proposed implementation, everything works as intended.
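
A toy, standalone illustration of this point (not code from NISQA_lib.py): masking with -Inf belongs before the softmax, whereas after the softmax the padded weights should simply be zeroed.

    import torch

    scores = torch.tensor([0.2, 1.5, -0.3, 0.0])    # raw attention scores for 4 frames
    mask = torch.tensor([True, True, True, False])  # last frame is padding

    # Intended use of -Inf: mask BEFORE the softmax -> padded weight becomes 0
    w_ok = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

    # Reported behaviour: softmax first, then -Inf -> -inf weights propagate downstream
    w_bug = torch.softmax(scores, dim=-1).masked_fill(~mask, float("-inf"))

    # Fix proposed in this issue: softmax first, then zero out the padded weights
    w_fix = torch.softmax(scores, dim=-1).masked_fill(~mask, 0.0)

    print(w_ok, w_bug, w_fix, sep="\n")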

Continuous metrics?

Hi Gabriel, thanks for making such a useful model!

I have 2 files: 1 denoised, and another noisy. In some cases when the quality drops below a certain threshold, I'd like it to move to whichever has the highest quality so to speak (excluding noisiness).

However, I'd like to do this in a smooth way. So my question is, is it possible to export continuous metrics?

Would I be right in my estimation that y_hat_list is a list of metrics? If so, how could I map this back onto the number of samples? Or would it even be accurate/advisable to do so in my case?

def predict_dim(model, ds, bs, dev, num_workers=0):     
    '''
    predict_dim: predicts MOS and dimensions of the given dataset with given 
    model. Used for NISQA_DIM model.
    '''        
    dl = DataLoader(ds,
                    batch_size=bs,
                    shuffle=False,
                    drop_last=False,
                    pin_memory=False,
                    num_workers=num_workers)
    model.to(dev)
    model.eval()
    with torch.no_grad():
        y_hat_list = [ [model(xb.to(dev), n_wins.to(dev)).cpu().numpy(), yb.cpu().numpy()] for xb, yb, (idx, n_wins) in dl]
    yy = np.concatenate( y_hat_list, axis=1 )
    
    y_hat = yy[0,:,:]
    y = yy[1,:,:]
    
    ds.df['mos_pred'] = y_hat[:,0].reshape(-1,1)
    ds.df['noi_pred'] = y_hat[:,1].reshape(-1,1)
    ds.df['dis_pred'] = y_hat[:,2].reshape(-1,1)
    ds.df['col_pred'] = y_hat[:,3].reshape(-1,1)
    ds.df['loud_pred'] = y_hat[:,4].reshape(-1,1)
    
    return y_hat, y

NISQA Double Ended pretrained weights?

Hi,

I am sorry if I have missed it somehow, but may I know how to use the double-ended NISQA model for quality assessment with a reference signal?

Best regards
Nabarun

Some questions about the sample rate of speech samples

Hi,

I am trying to use the NISQA Corpus dataset you published. I found that the speech samples in the NISQA Corpus use 48 kHz sampling, but their actual frequency content only covers 0-8 kHz (for some files only 0-4 kHz). In your code, you don't perform a down-sampling operation, so after the mel-spectrogram, won't there be many mel bins with a value of 0?
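
A quick standalone check of this point (toy code, not taken from NISQA): a 48 kHz signal whose content stays below 8 kHz leaves the upper mel bands of a full-band mel-spectrogram at the noise floor.

    import numpy as np
    import librosa

    sr = 48000
    t = np.arange(sr) / sr
    y = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)  # energy only below 8 kHz

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=48, fmax=sr / 2)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    print(mel_db.mean(axis=1).round(1))  # the upper mel bands sit at the dB floor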

And in your paper 'NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets', you write that the 'NISQA model ... can be applied to speech samples of any duration or sample rate without any preprocessing steps or level normalisation'. Can you point out which part of the code processes speech samples with different sampling rates? I didn't find it.

I will be very thankful for your answer.

Bests,
Liu

Is it possible to use the model as a function?

Thanks for sharing the repo.
I have tried to use the model based on the guidelines in the README and found it very useful.
I am wondering whether it is possible to call this model as a function that returns a scalar (the MOS) in Python, instead of using it only through the command prompt.
However, I am not good at programming; is there any advice on where to start?

n_wins

Hi, I have been going through the code and the paper to get a thorough understanding of the internal mechanisms of the architecture. One thing that has eluded me is n_wins in the forward pass of many modules.

Can anyone give an insight as to what this n_wins is and what an expected input might be, so I can simulate a forward pass with a random tensor?

Thank you

Question regarding the speech quality dimensions

Hi,

Thank you for your hard work, I have a few question, would be great if you could find some time to answer them.

  1. nisqa_tts.tar : Does this model take into account discontinuity in synthesized speech (long silence regions within a sentence)?
  2. nisqa_tts.tar : Does this model take into account gibberish (partially failed synthesis, some random gibberish synthesized)?
  3. nisqa.tar : Is it okay to use this model for evaluating the above mentioned discontinuity in synthesized speech? What is the level of discontinuity that the model can effectively evaluate? (for example random 2-5 second silence within a sentence)
  4. nisqa.tar : For the predicted values of the various dimensions, is the scale 0-5? And do all the predicted values follow a 'higher is better' order?

Thanks
Nabarun

mos_pred prediction scores seem strange

[images: screenshots of prediction scores]

The scores seem a bit off: I ran predictions on original audio and on converted audio, and the original songs all scored below 3, which should not be possible. Also, when predicting files in the same folder, the scores are surprisingly close. May I ask what could cause this and how it can be resolved?

The prediction results seem unreliable

Hi, thanks for the work. I am looking for a tool to filter bad audio out of an ASR corpus to obtain a TTS dataset. I tried this one, and what I care about is noise_pred and discontinuity_pred; I am using this tool on 16 kHz audio, so I ignored col_pred.
The test results are frustrating. I checked some samples and their scores, and they seem no better than random. Is the model trained on 48 kHz samples? Should we train a 16 kHz version?

Usage of PackedSequence class

In the framewise model, a new PackedSequence instance is generated from x_packed with x as the data argument (line 481 in nisqa_lib.py). However, this can lead to errors telling you never to instantiate PackedSequence directly but to use one of the helper functions instead, e.g. pack_padded_sequence. Since a PackedSequence is nothing more than a NamedTuple, you could circumvent this issue by using the _replace method:

This

    x = PackedSequence(
            x,
            batch_sizes = x_packed.batch_sizes,
            sorted_indices = x_packed.sorted_indices,
            unsorted_indices = x_packed.unsorted_indices
        )

becomes

    x = x_packed._replace(data=x)

A pretty minor thing but I ran into issues using PackedSequence directly before.

NISQA Corpus download issue

Hi, sorry to bother you. I have run into a problem.
I want to download the NISQA Corpus you just published. When the download reaches about 10 GB, it stops and I can't continue downloading; whether I use the mirror or the download link, the same problem occurs.
I think something is wrong with the download link. Could you check it?
Thank you.

magicmans

nisqa.exe result changes with volume

A note for future users of nisqa.exe: the results change depending on the audio clip volume. Experimentally, I got best results (= best correlation with MOS on a private dataset) with peak volume = 0 dB.
