syhw / wer_are_we
Attempt at tracking the state of the art and recent results (bibliography) on speech recognition.
You may want to take a look at this dataset; I'd like to see the list include the research that will use it in the future:
https://voice.mozilla.org/
Hi, we have recently made some improvements on TIMIT:
Average PER of 15.58% (15.08% min.) on the core test set, using fMLLR features and a 4x1024 LSTM: http://arxiv.org/abs/1806.07974. It will be presented at TSD 2018 next week.
We further boosted the result with an NN ensemble and a regularization post-layer, to appear at SPECOM 2018 (18-22 September): average PER 14.84% (14.69% min.), https://arxiv.org/abs/1806.07186
In addition, we share ready-to-try Python scripts here:
https://github.com/OrcusCZ/NNAcousticModeling
To be fair, we also found a nice result of average PER 14.9% by Ravanelli, using fMLLR and an M-reluGRU-based NN: https://arxiv.org/abs/1710.00641
Thanks, Jan
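(For readers comparing these numbers: PER is the phone error rate, computed from a Levenshtein alignment of the hypothesis against the reference phone sequence,

PER = (S + D + I) / N

with S substitutions, D deletions, I insertions, and N the number of reference phones.)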
When I started this repo, I only put in numbers that I trusted: a (very) few of them I reproduced myself, or I knew they were reproducible. Now it seems (from the past year's issues and pull requests) that people want it to be exhaustive. In this regard, there are two broad families of solutions:
$your_suggestion
I don't want to be the gatekeeper for the community, but I do care about trustworthy numbers and reproducible science. Classic concerns are validating on the test set, having a language model that includes the test set, and plain human error. Still, that doesn't mean slightly bogus numbers (plainly wrong ones are still banished) are uninteresting; they should just be taken with a grain of salt. Otherwise I subscribe to "trust, but verify", a.k.a. optimism in the face of uncertainty. Thus, I am leaning towards 2. and adding a "trustability" column that gives (when one exists) an argument for the number. It can be a link to a GitHub repo, a paper reproducing it (e.g. OpenSeq2Seq for DeepSpeech 2 and wav2letter), a high number of citations, or a noteworthy citation. What do you think of that?
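Concretely, a row with such a column might look like this (the entries are made up for illustration, not a proposed final format):

```
| WER  | Paper                 | Published | Notes    | Trustability                     |
| ---- | --------------------- | --------- | -------- | -------------------------------- |
| 5.5% | Example et al. (2018) | May 2018  | CNN + LM | reproduced by OpenSeq2Seq (link) |
```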
I am also going to include baselines from Kaldi, unless a good argument against it comes up in #31.
#28 raises the question of my (lack of) responsiveness lately. If you're interested in helping maintain the repo and you adhere to the above, feel free to submit PRs, of course. A good PR is not just the number(s) and the paper title, but also a note explaining what is special or specific about the paper's approach. It's even better if your PR includes a note in slightly longer form than the "Notes" column, showing that you understood the paper.
I will also consider adding a few trusted maintainers with push access.
Let me know in the comments if you have suggestions on how to scale that better while informing readers about the trustworthiness of the results we list.
Hi there,
While the top TIMIT scores of 13.8% and 14.9% are reproducible, they use a non-standard evaluation in which silence phones are removed from both the reference and hypothesis transcripts (https://github.com/mravanelli/pytorch-kaldi/blob/6234b86df5ea65fe61091519d27358177b04a198/kaldi_decoding_scripts/local/score.sh). The result is a non-negligible decrease in PER. For reference, when Kaldi went back to including silences in its eval, here were its results: kaldi-asr/kaldi@bdd752b.
Best,
Sean
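To make the effect concrete, here is a minimal Python sketch (with made-up phone sequences and a hypothetical silence-marker set, not the actual scoring pipeline) of how filtering silence phones out of both reference and hypothesis lowers the measured PER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (DP, rolling rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def per(ref, hyp):
    """Phone error rate: edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)

SILENCE = {"sil", "pau"}  # hypothetical set of silence/pause markers

ref = "sil hh ah l ow sil".split()
hyp = "sil hh ah l ow".split()  # one silence phone missed

print(f"with silences:    {per(ref, hyp):.1%}")  # 1 edit / 6 phones = 16.7%
strip = lambda seq: [p for p in seq if p not in SILENCE]
print(f"without silences: {per(strip(ref), strip(hyp)):.1%}")  # 0 / 4 = 0.0%
```

Numbers scored with and without silence filtering are therefore not directly comparable, which is exactly the point above.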
I don't see anything new so far. How about big models in audio?
Hi, can you add some CER results for Mandarin datasets, e.g. AISHELL-1?
I intend to put in the best and baseline Kaldi results, e.g. from https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/RESULTS for WSJ, for all datasets. Those numbers are not always backed by publications, but I believe they are still indicative of what can be achieved / reproduced. Thoughts?
I think for the Librispeech results it would be good to mention the latest Kaldi TDNN-F results
https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/local/chain/tuning/run_tdnn_1d.sh
which are:
- test (== test-clean): 3.80
- test-other: 8.76
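If Kaldi baselines do get listed, a small script can collect them from a trained experiment directory. A sketch, assuming the usual `%WER 3.80 [ ... ]` lines that Kaldi's scoring writes into `best_wer` files (the experiment path below is hypothetical):

```python
import re
from pathlib import Path

WER_RE = re.compile(r"%WER\s+([\d.]+)")

def best_wers(exp_dir):
    """Map decode-directory name -> WER parsed from its best_wer file."""
    results = {}
    for f in Path(exp_dir).rglob("best_wer"):
        m = WER_RE.search(f.read_text())
        if m:
            # best_wer usually sits in <decode_dir>/scoring_kaldi/best_wer
            results[f.parent.parent.name] = float(m.group(1))
    return results

for name, wer in sorted(best_wers("exp/chain_cleaned/tdnn_1d_sp").items()):
    print(f"{name}: {wer:.2f}%")
```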
Hi, I noticed there are several issues and PRs with updated results.
Is this repo being maintained? I'd be happy to help as a maintainer or take over the repo if you need help :-)
Thanks for creating this repository. Are WER comparisons planned for other languages as well? They could be organized in separate .md files for each language.
We have hosted a free and public test set as well as a Kaldi training recipe for German for a couple of years: https://github.com/uhh-lt/kaldi-tuda-de
Several papers have started to use tuda-test as a benchmark, and it could be a candidate for a German WER benchmark.
https://arxiv.org/abs/2010.10504
we are able to achieve word-error-rates (WERs) 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs 1.7%/3.3%
Perhaps for the "Noise robust ASR" section:
https://www.microsoft.com/en-us/research/publication/recognizing-overlapped-speech-in-meetings-a-multichannel-separation-approach-using-neural-networks-2/
released a few days ago.
Kind regards
| 7.3% | ??.?% | [Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling](https://arxiv.org/pdf/1703.00096.pdf) | March 2017 | RNN + CTC + Gram-CTC acoustic model trained on SWB+Fisher+CH, N-gram |
Hi, I would find it useful to also have links to the datasets' papers (or the datasets' webpage when there's no paper). Thanks!
Good WER using an attention-based encoder-decoder (Conformer) on LibriSpeech:
https://arxiv.org/abs/2005.08100
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
Why did the updates stop?
Hi!
A recent paper (accepted at ICASSP 2019) presented a PER of 13.8% on TIMIT!
https://arxiv.org/pdf/1811.07453.pdf
Thank you for this helpful repo!
https://arxiv.org/abs/2108.06209
1.5/2.7 without noisy student training, 1.4/2.5 with self-training.
Hi,
I saw that the README.md does not include Wav2Letter yet.
Cheers.