Comments (8)
There is http://bark.phon.ioc.ee/voxlingua107/ if you like
VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours. The average amount of data per language is 62 hours. However, the real amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.
from lidbox.
It would be nice to have some pre-trained models trained on different datasets at some point but I don't think it's going to happen any time soon. If your 50-language dataset is available under a free, public license, I can try to train some models at some point when I have time. The examples might be a useful starting point for training custom models on non-public datasets.
from lidbox.
A pre-trained model would be great.
The "50 language" dataset is Creative Commons licensed, so attribution would be required:
https://www.50languages.com/?user_lang=EN
If it helps I can provide a tarball containing the samples.
from lidbox.
Does this page contain all the samples? If so, I'm not sure the amount of data is enough. Deep-learning-based language identification models usually need at least 5 hours (preferably 20 hours) of speech per language before they become useful. Each language also needs at least 10 (preferably several hundred) different speakers.
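As a rough illustration of the thresholds mentioned above, here is a minimal sketch (not part of lidbox) that tallies hours of speech and distinct speakers per language and flags languages below 5 hours or 10 speakers. The record layout `(language, speaker_id, duration_seconds)` and the function names `summarize` and `too_small` are assumptions made for this example.

```python
from collections import defaultdict

# Thresholds taken from the comment above; treat them as rules of thumb.
MIN_HOURS = 5.0
MIN_SPEAKERS = 10


def summarize(utterances):
    """Return {language: (hours, num_speakers)}.

    `utterances` is an iterable of (language, speaker_id, duration_seconds)
    tuples. This layout is hypothetical; adapt it to your own metadata.
    """
    seconds = defaultdict(float)
    speakers = defaultdict(set)
    for language, speaker_id, duration in utterances:
        seconds[language] += duration
        speakers[language].add(speaker_id)
    return {
        language: (seconds[language] / 3600.0, len(speakers[language]))
        for language in seconds
    }


def too_small(summary, min_hours=MIN_HOURS, min_speakers=MIN_SPEAKERS):
    """Return the sorted languages that miss either threshold."""
    return sorted(
        language
        for language, (hours, num_speakers) in summary.items()
        if hours < min_hours or num_speakers < min_speakers
    )
```

Calling `too_small(summarize(records))` on a metadata listing gives a quick list of languages that probably need more data or more speakers before training.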
from lidbox.
You have to open each language-page separately.
If it helps, I have downloaded and sorted all ZIP files (3.5 GB):
http://doppelbauer.name/50languages.tar
The proper license:
http://creativecommons.org/licenses/by-nc-nd/3.0/us/
from lidbox.
Thanks for preparing the zip files. I downloaded the tar file and listened to some of the Finnish and Swedish samples. Unfortunately, there are too few speakers (only 1 or 2) to train a model on this dataset. Last year, I tried training a model on e-books, but there were simply not enough speakers to ensure the model learned the languages. Instead, the model learned only the speakers (or maybe the microphones). I'm quite sure this would also happen with this dataset, even with several hours of data per language.
In any case, this dataset covers a very large number of languages, and it could work nicely as an evaluation set for testing trained models on new data. I don't currently have any pre-trained models to test, and I'm not sure when I'll have a good one (I'm doing this in my free time). I'm currently working with the Common Voice datasets, and if I happen to get something useful working, I can let you know by commenting here.
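One common way to notice when a model has learned speakers rather than languages is to evaluate on speakers it never saw during training. The sketch below is only an illustration under stated assumptions, not lidbox code: the tuple layout `(utterance_id, speaker_id, language)` and the helper names `split_of` and `speaker_disjoint_split` are hypothetical. It assigns each speaker, and therefore all of that speaker's utterances, to exactly one split using a stable hash of the speaker ID.

```python
import hashlib


def split_of(speaker_id, test_fraction=0.2):
    """Map a speaker ID to 'train' or 'test' deterministically.

    A SHA-256 hash is used instead of Python's built-in hash() so the
    assignment stays the same across runs and machines.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def speaker_disjoint_split(utterances, test_fraction=0.2):
    """Split utterances so that no speaker appears in both sets.

    `utterances` is an iterable of (utterance_id, speaker_id, language)
    tuples (a hypothetical layout for this example).
    """
    splits = {"train": [], "test": []}
    for utterance in utterances:
        speaker_id = utterance[1]
        splits[split_of(speaker_id, test_fraction)].append(utterance)
    return splits["train"], splits["test"]
```

With only 1 or 2 speakers per language, a split like this can leave a language entirely in one set, which is another way of seeing why such a dataset is hard to train on.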
from lidbox.
You are right. Common Voice has about 50 languages.
Thanks for this project.
from lidbox.
@nshmyrev
Good catch.
from lidbox.