VoxForge
The latest 20180611 builds of the English models were trained on over 800 hours of training material, including material with added noise and phone codec effects.
You can find download links to all our models and dicts here:
https://github.com/gooofy/zamia-speech#download
WER results for these models are not comparable to previous releases: from now on we measure WERs for speakers not in the training set, and we have also tried to make the language model more neutral (i.e. not over-represent prompts from the training material). The WER results should therefore give a more realistic assessment of the performance one can expect from our models without adaptation.
WER is 7.02% for the large Kaldi model and 7.84% for the embedded model.
WER for the continuous CMU Sphinx model is 25.4%.
We have also been quite busy cleaning up our scripts and documentation, so it should become easier to understand what we are doing here. The models come complete with example scripts and pre-compiled binary packages for various platforms; more information can be found in our getting-started guide here:
https://github.com/gooofy/zamia-speech#get-started-with-our-pre-trained-models
Please note that we have changed the tarball format of our models significantly, so you will have to use the latest 0.3.1 py-kaldi-asr wrappers with these models. The new tarball format allows for model adaptation:
https://github.com/gooofy/zamia-speech#model-adaptation
as well as automatic segmentation and transcript alignment of long audio recordings (e.g. librivox audiobooks):
https://github.com/gooofy/zamia-speech#audiobook-segmentation-and-transcription-kaldi
Comments, suggestions, and contributions are very welcome. For more information about the zamia-speech project, please visit http://zamia-speech.org/
--- (Edited on 6/23/2018 5:38 pm [GMT-0500] by guenter) ---
I have just released an updated version of the Kaldi Models which comes with improved noise resistance and tokenizer bugfixes resulting in slightly better WERs:
https://github.com/gooofy/zamia-speech#download
%WER 6.97 [ 53104 / 761856, 3598 ins, 14296 del, 35210 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0
%WER 7.78 [ 59271 / 761856, 4323 ins, 14974 del, 39974 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
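For readers unfamiliar with Kaldi's scoring output: each %WER line reports total errors over total reference words, with the error count broken down into insertions, deletions, and substitutions. A small Python sketch (with the figures from the lines above hard-coded for illustration) verifying the arithmetic:

```python
# Each Kaldi "%WER" line reports 100 * errors / ref_words, where
# errors = insertions + deletions + substitutions.
def wer_percent(ins, dels, subs, ref_words):
    """Word error rate in percent, as printed by Kaldi's scoring scripts."""
    return 100.0 * (ins + dels + subs) / ref_words

# Figures from the tdnn_sp line above: 3598 ins, 14296 del, 35210 sub, 761856 words.
print(round(wer_percent(3598, 14296, 35210, 761856), 2))  # 6.97
# Figures from the tdnn_250 line above: 4323 ins, 14974 del, 39974 sub.
print(round(wer_percent(4323, 14974, 39974, 761856), 2))  # 7.78
```

Note that 3598 + 14296 + 35210 = 53104, matching the bracketed error count in the first line.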
--- (Edited on 7/2/2018 11:34 am [GMT-0500] by guenter) ---
The latest 20180815 Kaldi models are trained on 1200 hours of recordings now that we have added the Mozilla Common Voice v1 corpus material. They are available for download in the usual places:
https://github.com/gooofy/zamia-speech#download
WERs are still good:
%WER 8.03 [ 65993 / 821583, 4460 ins, 18032 del, 43501 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0
%WER 9.03 [ 74192 / 821583, 5394 ins, 19016 del, 49782 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
A slight increase was to be expected, as the new training material has more diverse speakers and more noisy content, which should improve real-world unknown-speaker performance as well as noise resistance.
--- (Edited on 8/17/2018 7:34 am [GMT-0500] by guenter) ---
A new Zamia-Speech Kaldi nnet3-chain model based on factorized TDNN is available for download now here:
https://github.com/gooofy/zamia-speech#download
The new model is trained on the same dataset as the models from the 20180815 release but offers slightly better performance:
%WER 8.03 [ 65993 / 821583, 4460 ins, 18032 del, 43501 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0
%WER 9.03 [ 74192 / 821583, 5394 ins, 19016 del, 49782 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
%WER 7.54 [ 61946 / 821583, 3834 ins, 17569 del, 40543 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0
--- (Edited on 9/1/2018 10:16 am [GMT-0500] by guenter) ---
The latest r20190227 release of the English Zamia-Speech models for Kaldi has been trained on two additional corpora:
zamia_en 0:05:38
voxforge_en 72:16:35
cv_corpus_v1 247:32:39
librispeech 427:13:56
NEW: ljspeech 20:34:33
NEW: m_ailabs_en 43:31:54
%WER 7.98 [ 65664 / 822605, 5716 ins, 13364 del, 46584 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_8_0.0
%WER 9.06 [ 74542 / 822605, 6206 ins, 15879 del, 52457 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
%WER 7.55 [ 62115 / 822605, 4768 ins, 13736 del, 43611 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0
--- (Edited on 3/3/2019 4:45 am [GMT-0600] by guenter) ---
With the addition of the TED-LIUM 3 corpus and positive results from the auto-review process the r20190609 release of the English Zamia-Speech models for Kaldi has been trained on the largest amount of audio material yet (over 1100 hours):
zamia_en 0:05:38
voxforge_en 102:07:05
cv_corpus_v1 252:31:11
librispeech 450:49:09
ljspeech 23:13:54
m_ailabs_en 106:28:20
tedlium3 210:13:30
Additionally, 400 hours of noise-augmented audio derived from the above corpora were used (background noise and phone codecs):
voxforge_en_noisy 22:01:40
librispeech_noisy 119:03:26
cv_corpus_v1_noisy 78:57:16
cv_corpus_v1_phone 61:38:33
zamia_en_noisy 0:02:08
voxforge_en_phone 18:02:35
librispeech_phone 106:35:33
zamia_en_phone 0:01:11
In total, this release has been trained on over 1500 hours of audio material (training took over six weeks on a GeForce GTX 1080 Ti GPU).
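The stated totals can be checked by summing the per-corpus H:MM:SS durations. A small Python sketch (durations copied from the lists above, in the same order):

```python
def to_seconds(hms):
    """Parse a duration given as H:MM:SS into seconds."""
    h, m, s = (int(x) for x in hms.split(':'))
    return h * 3600 + m * 60 + s

# The seven clean corpora listed above.
clean = ['0:05:38', '102:07:05', '252:31:11', '450:49:09',
         '23:13:54', '106:28:20', '210:13:30']
# The eight noise/phone-codec augmented corpora listed above.
noisy = ['22:01:40', '119:03:26', '78:57:16', '61:38:33',
         '0:02:08', '18:02:35', '106:35:33', '0:01:11']

clean_h = sum(map(to_seconds, clean)) / 3600.0
noisy_h = sum(map(to_seconds, noisy)) / 3600.0
# Roughly 1145.5 clean + 406.4 augmented, i.e. over 1500 hours in total.
print(round(clean_h, 1), round(noisy_h, 1), round(clean_h + noisy_h, 1))
```

This confirms the figures in the text: over 1100 clean hours, about 400 augmented hours, over 1500 hours combined.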
Stats:
%WER 10.64 exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
%WER 8.84 exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0
%WER 5.80 exp/nnet3_chain/tdnn_fl/decode_test/wer_9_0.0
The tdnn_250 model is the smallest one, meant for use in embedded applications (i.e. RPi-3 class hardware); tdnn_f is our regular model; tdnn_fl is the tdnn_f model adapted to a larger language model (the results illustrate the importance of language-model domain adaptation, by the way).
Downloads:
https://github.com/gooofy/zamia-speech#asr-models
--- (Edited on 6/20/2019 3:37 am [GMT-0500] by guenter) ---