py-lidbox / lidbox
End-to-end spoken language identification out of the box.
License: MIT License
Lines 101 to 106 in 49c27a0
Defining an end-to-end pipeline in yaml adds an unnecessary layer of complexity. Perhaps a single example pipeline could be supported, but any customization is easier to do with a custom Python script using the lidbox API.
utt2path, utt2label etc -> pandas.DataFrame
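The kind of conversion suggested here could be sketched as below; the dict contents are made up for illustration and are not lidbox's actual metadata:

```python
import pandas as pd

# Hypothetical lidbox-style metadata dicts (contents made up for illustration).
utt2path = {"utt1": "/data/utt1.wav", "utt2": "/data/utt2.wav"}
utt2label = {"utt1": "fi", "utt2": "sv"}

# One-liner replacement for manual dict-juggling: a DataFrame indexed by
# utterance id, with one column per metadata dict.
meta = pd.DataFrame({"path": utt2path, "label": utt2label})
print(meta)
```

With metadata in a single DataFrame, joins, filters, and per-label aggregations become one-liners instead of hand-written loops over parallel dicts.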
Are there any plans to train more languages, e.g. adding this dataset:
https://www.50languages.com/
If it helps I can provide MP3-files as ZIPs.
This might be because the structure of the downloaded .tar.gz files has changed since the script was first written. Currently, validated.tsv lies deeper inside the directory tree than the script assumes, hence the following error at runtime:
unpacking './downloads/br.tar.gz'
cut: ./data/br/validated.tsv: No such file or directory
error: unable to load list of paths from metadata file at './data/br/validated.tsv'
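A hedged workaround until the script is updated: search for validated.tsv recursively instead of assuming a fixed depth. The helper below is a sketch, not part of lidbox:

```python
from pathlib import Path

def find_validated_tsv(extracted_dir):
    """Return the first validated.tsv found anywhere under extracted_dir,
    regardless of how deeply the Common Voice archive nests it."""
    matches = sorted(Path(extracted_dir).rglob("validated.tsv"))
    if not matches:
        raise FileNotFoundError(f"no validated.tsv under {extracted_dir}")
    return matches[0]
```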
Hi,
Today I tried to use lidbox and ran into the following error:
File "D:\anaconda\envs\slid\lib\site-packages\lidbox\embed\sklearn_utils.py", line 6, in <module>
    from plda import Classifier as PLDAClassifier
ModuleNotFoundError: No module named 'plda'
Seems like a package is missing from the install. I installed lidbox through pip.
Thanks in advance!
Changing dataset metadata should always invalidate all existing signal caches. Compare e.g. config-file contents or save checksums of metadata alongside caches.
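One way the checksum idea could be sketched (helper names here are hypothetical, not lidbox API):

```python
import hashlib
import json

def metadata_checksum(metadata: dict) -> str:
    """Stable checksum of dataset metadata. Store this alongside the signal
    cache and rebuild the cache whenever the checksum changes."""
    # sort_keys makes the serialization order-independent, so the checksum
    # only changes when the metadata contents actually change.
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def cache_is_valid(saved_checksum: str, metadata: dict) -> bool:
    """True if the cached signals were built from this exact metadata."""
    return saved_checksum == metadata_checksum(metadata)
```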
For example, the x-vector architecture should be trainable on arbitrary-length input. Without ragged batches, this limits the batch size to 1. By supporting ragged batches, we could train with larger batch sizes.
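A minimal sketch of ragged batching in tf.data, assuming TF 2.x; the utterances below are toy 1-D stand-ins for variable-length feature sequences:

```python
import tensorflow as tf

# Hypothetical utterances of different lengths (1-D stand-ins for features).
utterances = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

ds = tf.data.Dataset.from_generator(
    lambda: iter(utterances),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.float32))

# Batch without padding or truncation: each batch is a RaggedTensor,
# so batch_size > 1 works even though the elements differ in length.
ragged_ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))

for batch in ragged_ds:
    print(batch.bounding_shape())
```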
There's a ridiculous amount of hand-written dict-juggling that could be one-liners in pandas.
https://github.com/py-lidbox/lidbox/tree/master/lidbox/features
Especially the correctness of DSP-related functions.
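A sketch of the kind of DSP correctness test meant here; `frame_signal` is a toy stand-in, not lidbox's actual implementation, and the test checks it against values that are obvious by inspection:

```python
import numpy as np

def frame_signal(signal, frame_length, frame_step):
    """Split a 1-D signal into overlapping frames (no padding)."""
    num_frames = 1 + (len(signal) - frame_length) // frame_step
    return np.stack([signal[i * frame_step:i * frame_step + frame_length]
                     for i in range(num_frames)])

# Unit-test style check on a signal where the expected frames are trivial
# to verify by hand: np.arange makes each sample equal to its own index.
signal = np.arange(10.0)
frames = frame_signal(signal, frame_length=4, frame_step=2)
assert frames.shape == (4, 4)
assert np.array_equal(frames[1], np.array([2.0, 3.0, 4.0, 5.0]))
```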
Line 81 in abc2a43
E.g. separate per-class metrics from summary metrics.
Hi, thanks for creating such a well documented project! I'm part of a university student group using this to train a model on (we hope) 20-30 languages commonly found in Australia. Your explanations in the examples section were incredibly useful, especially since none of us have any experience in this area. Please excuse any ignorance in the following questions.
We received the warnings below when running the project with the same datasets and code as your 'common-voice-small' example, but they didn't prevent model training from completing. Now that we're increasing the size of our dataset to include additional languages (totaling ~12 GB of audio), we're hitting predictable seg faults when caching the dataset or during model training. We're guessing the issues stem from the CUDA version installed on the university machines, which we have no control over. We're wondering if you've encountered these issues using lidbox, and/or if you have advice on circumventing them.
This is running on Ubuntu 20.04.6 LTS with an NVIDIA A40 w/ 45 GB RAM.
Memory leak in CUDA11.x
The warning below appears either when caching the dataset or, if we omit caching, when model training begins. Using nvidia-smi to monitor GPU memory shows that usage gradually increases until it reaches capacity, at which point the program seg faults. So it certainly seems the issue is caused by the cuFFT plan creation memory leak.
tensorflow/core/kernels/fft_ops.cc:472] The CUDA FFT plan cache capacity of 512 has been exceeded. This may lead to extra time being spent constantly creating new plans. For CUDA 11.x, there is also a memory leak in cuFFT plan creation which may cause GPU memory usage to slowly increase. If this causes an issue, try modifying your fft parameters to increase cache hits, or build TensorFlow with CUDA 10.x or 12.x, or use explicit device placement to run frequently-changing FFTs on CPU.
If we omit pre-caching the dataset, we proceed to training and see the above warning along with the following warning:
The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
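For reference, the ordering difference that warning refers to can be sketched with a toy tf.data pipeline (the numbers here are illustrative, not lidbox's pipeline):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Triggers the warning: cache() never sees the full dataset before take()
# truncates it, so the partial cache is discarded on every epoch.
bad = ds.cache().take(3).repeat(2)

# Recommended ordering: truncate first, then cache the now-complete result,
# so later epochs are served from the cache.
good = ds.take(3).cache().repeat(2)

print([int(x) for x in good])
```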
My assumption is that this second warning simply increases training time, because it affects access speed rather than the number of FFTs being executed. Is that the case?
Do you have any advice about how to modify our usage of Lidbox in order to minimise the effect of this memory leak? Until we resolve the seg fault issue we are running training on CPU, which works but is incredibly slow.
Thank you for your time,
Toby
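One workaround the TF warning text itself suggests, sketched under the assumption that the FFTs come from an STFT feature extractor: pin the FFT ops to the CPU so no cuFFT plans are created at all. The frame parameters below are illustrative, not lidbox's actual config.

```python
import tensorflow as tf

# 1 second of fake 16 kHz audio, batch of 1.
signal = tf.random.normal([1, 16000])

# Explicit device placement: running the STFT on CPU avoids cuFFT plan
# creation entirely, sidestepping the CUDA 11.x plan-creation leak.
with tf.device("/CPU:0"):
    stft = tf.signal.stft(signal, frame_length=400, frame_step=160,
                          fft_length=512)

print(stft.shape)
```

This trades some throughput on the feature-extraction step for stable GPU memory; the rest of the model can still run on the GPU.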
This error comes from running step 5 lidbox train-embeddings -v config.xvector-NB.yaml
in the Common Voice example:
...
2020-07-01 18:25:25.740 I lidbox.embeddings.sklearn_utils: Wrote embedding demo to './lidbox-cache/naive_bayes/common-voice-4-embeddings/figures/test/embeddings-PCA-2D.png'
2020-07-01 18:25:28.541 I lidbox.embeddings.sklearn_utils: Wrote embedding demo to './lidbox-cache/naive_bayes/common-voice-4-embeddings/figures/test/embeddings-PCA-3D.png'
2020-07-01 18:25:28.541 I lidbox.embeddings.sklearn_utils: Fitting with train_X (22794, 3) and train_y (22794,) classifier:
GaussianNB(priors=None, var_smoothing=1e-09)
Traceback (most recent call last):
File "/Users/knethil/.pyenv/versions/3.7.5/bin/lidbox", line 11, in <module>
load_entry_point('lidbox==0.5.0', 'console_scripts', 'lidbox')()
File "/Users/knethil/.pyenv/versions/3.7.5/lib/python3.7/site-packages/lidbox/__main__.py", line 36, in main
ret = command.run()
File "/Users/knethil/.pyenv/versions/3.7.5/lib/python3.7/site-packages/lidbox/cli.py", line 184, in run
metrics = lidbox.api.fit_embedding_classifier_and_evaluate_test_set(split2ds, split2meta, labels, config)
File "/Users/knethil/.pyenv/versions/3.7.5/lib/python3.7/site-packages/lidbox/api.py", line 305, in fit_embedding_classifier_and_evaluate_test_set
utt2prediction, utt2target = process_predictions(test_data["ids"], predictions["test"], "test")
File "/Users/knethil/.pyenv/versions/3.7.5/lib/python3.7/site-packages/lidbox/api.py", line 279, in process_predictions
utt2prediction = generate_worst_case_predictions_for_missed_utterances(utt2prediction, utt2target, labels)
File "/Users/knethil/.pyenv/versions/3.7.5/lib/python3.7/site-packages/lidbox/api.py", line 326, in generate_worst_case_predictions_for_missed_utterances
predictions = np.stack([p for _, p in utt2prediction])
File "<__array_function__ internals>", line 6, in stack
File "/Users/knethil/.pyenv/versions/3.7.5/lib/python3.7/site-packages/numpy/core/shape_base.py", line 423, in stack
raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
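The failing call is `np.stack` on an empty list, which always raises this error. A minimal reproduction plus a hedged guard (not the project's actual fix):

```python
import numpy as np

utt2prediction = []  # no missed utterances -> empty prediction list

# np.stack with zero arrays raises "need at least one array to stack".
try:
    np.stack([p for _, p in utt2prediction])
except ValueError as e:
    print(e)

# A guard that skips stacking when there is nothing to stack:
arrays = [p for _, p in utt2prediction]
predictions = np.stack(arrays) if arrays else np.empty((0, 0))
```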
I tried to follow the code, and this seems to be in the part where you process predictions of the NB classifier. Is there a way to bypass this training/prediction step and just get the x-vector embeddings from the trained model?
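In general, embeddings can be pulled from a trained Keras model by wrapping an intermediate layer as a new output; the sketch below uses a tiny made-up model, and the layer name "embedding" is hypothetical (check `model.summary()` for the real one in the x-vector model):

```python
import tensorflow as tf

def build_embedding_extractor(model, layer_name):
    """New model that maps the original inputs to an intermediate
    layer's output, bypassing any classifier head after it."""
    return tf.keras.Model(inputs=model.inputs,
                          outputs=model.get_layer(layer_name).output)

# Tiny stand-in model for demonstration; the real x-vector model is larger.
inputs = tf.keras.Input(shape=(8,))
x = tf.keras.layers.Dense(4, name="embedding")(inputs)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

extractor = build_embedding_extractor(model, "embedding")
emb = extractor(tf.zeros([3, 8]))
print(emb.shape)
```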