VoxForge
Hi,
I just started developing with sphinx4 (version: 5prealpha-snapshot). After some successful tests with the default English model, I tried to use the German VoxForge model. I downloaded the file cmusphinx-cont-voxforge-de-r20161117.tar.xz and tried to use it with sphinx4. On startup, the following error occurred:
18:12:20.936 INFO largeTrigramModel Loading n-gram language model from: file:vox_cont/etc/voxforge.lm.dmp
Exception in thread "main" java.lang.Error: Bad binary LM file magic number: 1701409364, not an LM dumpfile?
at edu.cmu.sphinx.linguist.language.ngram.large.BinaryLoader.readHeader(BinaryLoader.java:469)
at edu.cmu.sphinx.linguist.language.ngram.large.BinaryLoader.loadModelLayout(BinaryLoader.java:393)
at edu.cmu.sphinx.linguist.language.ngram.large.BinaryLoader.<init>(BinaryLoader.java:99)
at edu.cmu.sphinx.linguist.language.ngram.large.LargeNGramModel.allocate(LargeNGramModel.java:206)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:334)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:243)
at edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:103)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:164)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:52)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:39)
at de.martin.sphinxtest.TranscriberDemo.test1(TranscriberDemo.java:61)
at de.martin.sphinxtest.TranscriberDemo.main(TranscriberDemo.java:149)
------------------------------------------------------------------------
I use the following configuration code:
configuration.setAcousticModelPath("file:vox_cont/model_parameters/voxforge.cd_cont_6000");
configuration.setDictionaryPath("file:vox_cont/etc/voxforge.dic");
configuration.setLanguageModelPath("file:vox_cont/etc/voxforge.lm.dmp");
I also tried an older German VoxForge model from 2014. It runs without exceptions.
If anyone has an idea where the error lies, I would be grateful for any hint.
Thanks in advance
Martin
This LM is in trie format, so it should be named "voxforge.lm.bin". If you rename the file, it should load properly.
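As a quick sanity check, the "magic number" from the exception can be turned back into the bytes it was read from. The sketch below is only an illustration: MagicNumberDemo and decodeMagic are hypothetical names, not part of sphinx4, and it assumes the four bytes are interpreted in little-endian order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of sphinx4.
class MagicNumberDemo {

    // Turn a 32-bit "magic number" back into four bytes
    // (assumed little-endian) and render them as ASCII text.
    static String decodeMagic(int magic) {
        byte[] bytes = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(magic)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Value copied from the "Bad binary LM file magic number" error.
        System.out.println(decodeMagic(1701409364)); // prints "Trie"
    }
}
```

The decoded text suggests that the file starts with the word "Trie", which fits a trie LM being handed to the loader for the older .lm.dmp format.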
Thanks a lot! The error does not occur anymore.
Unfortunately, another error now occurs when calling recognizer.startRecognition(stream):
19:37:34.708 INFO trieNgramModel Loading n-gram language model from: file:vox_cont/etc/voxforge.lm.bin
2017-01-08 19:37:34 SCHWERWIEGEND de.martin.sphinxtest.TranscriberDemo main null
java.lang.NullPointerException
at edu.cmu.sphinx.linguist.language.ngram.trie.NgramTrieQuant.setTable(NgramTrieQuant.java:50)
at edu.cmu.sphinx.linguist.language.ngram.trie.BinaryLoader.readQuant(BinaryLoader.java:95)
at edu.cmu.sphinx.linguist.language.ngram.trie.NgramTrieModel.allocate(NgramTrieModel.java:225)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:334)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:243)
at edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:103)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:164)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:52)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:39)
at de.martin.sphinxtest.TranscriberDemo.test1(TranscriberDemo.java:61)
at de.martin.sphinxtest.TranscriberDemo.main(TranscriberDemo.java:149)
------------------------------------------------------------------------
Do you know a solution for this problem?
Ok, the LM is also corrupted.
I repackaged the files at
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/German/
Download them and try again. I verified that they work with the latest sphinx4.
The new files work, no crashes anymore. Thank you very much!
> Ok, the LM is also corrupted.
uh, that sounds bad. what did you do to fix it?
Make sure you are using the latest sphinxbase for the conversion. Also, it's better to avoid such a big LM; the extra size is useless. You can prune it to the size currently uploaded on the cmusphinx site with no accuracy drawbacks.
thanks for the quick reply. what tools/options would you recommend for lm pruning?
SRILM. You can use something like
ngram -prune 1e-9 -lm your.lm -write-lm your-pruned.lm
to reduce the LM size significantly.
ah, srilm, very cool :)
BTW: I was wondering whether I could or should build just one lm using srilm for both sphinx and kaldi instead of my current approach where I build a separate lm using cmuclmtk for sphinx.
Anyway, I will put this on my TODO list for the next iteration of the german model. Thanks again for your help and comments!