VoxForge
Hi,
I am trying to adapt the German VoxForge model (cmusphinx-de-voxforge-5.2.tar.gz, with the matching language model and dictionary).
I followed the adaptation guide on the CMU Sphinx homepage (http://cmusphinx.sourceforge.net/wiki/tutorialadapt), using MLLR adaptation.
Then I tested the result using pocketsphinx_batch and word_align.pl.
Unfortunately, the recognition accuracy dropped sharply, from 59% to 19%, so I am now looking for my mistake.
I've done the following steps to adapt:
1. I recorded 30 German utterances (16 kHz, mono) and created the corresponding .fileids and .transcription files.
2. Creating acoustic feature files:
sphinx_fe -argfile de-de/feat.params -samprate 16000 -c adapt30.fileids -di . -do . -ei wav -eo mfc -mswav yes
(I renamed the acoustic model directory to de-de)
3. Accumulating observation counts:
bw -hmmdir de-de -moddeffn de-de/mdef -ts2cbfn .cont. -feat 1s_c_d_dd -cmn current -agc none -dictfn voxforge.dic -ctlfn adapt30.fileids -lsnfn adapt30.transcription -accumdir .
(I renamed the dictionary to voxforge.dic)
4. Estimating the MLLR transform:
mllr_solve -meanfn de-de/means -varfn de-de/variances -outmllrfn mllr_matrix -accumdir .
5. Updating the means file:
mllr_transform -inmeanfn de-de/means -outmeanfn de-de/means-new -mllrmat mllr_matrix
(and renamed means-new to means)
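The two control files from step 1 can be generated from a single mapping of recording names to sentences. Below is a minimal Python sketch, assuming the layout shown in the adaptation tutorial: one file ID per line (without extension) in the .fileids file, and one line per utterance in the .transcription file, wrapped in <s> ... </s> and followed by the file ID in parentheses. The recording names and sentences are hypothetical placeholders, and the spelling of each word is assumed to match the dictionary.

```python
def build_control_files(recordings):
    """Return (fileids_text, transcription_text) for a dict of file ID -> sentence."""
    fileids_lines = []
    transcription_lines = []
    for file_id, sentence in recordings.items():
        # One file ID per line, without the .wav extension.
        fileids_lines.append(file_id)
        # Sentence markers around the text, file ID in parentheses at the end.
        transcription_lines.append(f"<s> {sentence} </s> ({file_id})")
    return "\n".join(fileids_lines) + "\n", "\n".join(transcription_lines) + "\n"


# Hypothetical recordings; replace with the real file names and sentences.
recordings = {
    "adapt_0001": "licht an",
    "adapt_0002": "licht aus",
}

fileids_text, transcription_text = build_control_files(recordings)
print(fileids_text, end="")
print(transcription_text, end="")
```

Generating both files from one mapping keeps the file IDs in the same order in both files, which is the easiest way to avoid mismatches between them.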
For testing I used the following command:
pocketsphinx_batch -adcin yes -cepdir wav -cepext .wav -ctl test.fileids -lm voxforge.lm.bin -dict voxforge.dic -hmm de-de -hyp test.hyp
Can you tell me if something is wrong with this approach? I'm aware that 30 recordings are not many, but as I understand it, the recognition rate should not drop this much.
I would be grateful for any hints.
Thanks in advance
Martin
You forgot
-lda de-de/feature_transform
in step 3.
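For reference, here is a small Python sketch that assembles the step 3 command with the missing option added. All flags and file names are taken from the posts above; only the position of -lda on the command line is my own choice, and the sketch merely prints the command rather than running it.

```python
import shlex

model_dir = "de-de"  # acoustic model directory, as renamed in the original post

bw_args = [
    "bw",
    "-hmmdir", model_dir,
    "-moddeffn", f"{model_dir}/mdef",
    "-ts2cbfn", ".cont.",
    "-feat", "1s_c_d_dd",
    "-cmn", "current",
    "-agc", "none",
    # The option that was missing in the original step 3 command:
    "-lda", f"{model_dir}/feature_transform",
    "-dictfn", "voxforge.dic",
    "-ctlfn", "adapt30.fileids",
    "-lsnfn", "adapt30.transcription",
    "-accumdir", ".",
]

# Join into a single shell-safe command line for copy and paste.
bw_command = shlex.join(bw_args)
print(bw_command)
```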
Many thanks, the recognition accuracy has now increased to 65%.
I guess this is a normal result for such a small number of recordings?
It is hard to give you advice on accuracy without seeing the data and understanding the whole situation. You need to provide the test set.
You can check our tutorial on tuning the accuracy for details.
I added an attachment with my training data and all generated files, with the exception of the model.
The result:
TOTAL Words: 40 Correct: 27 Errors: 14
TOTAL Percent correct = 67.50% Error = 35.00% Accuracy = 65.00%
TOTAL Insertions: 1 Deletions: 0 Substitutions: 13
(see the attachment for the complete result)
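The three percentages in this summary follow directly from the counts. A short Python sketch of the arithmetic, assuming the usual definitions (errors = insertions + deletions + substitutions, accuracy = 100% minus the error rate), which reproduce the numbers above:

```python
def alignment_summary(words, insertions, deletions, substitutions):
    """Derive the summary figures from the raw alignment counts."""
    errors = insertions + deletions + substitutions
    # Reference words that were neither deleted nor substituted count as correct.
    correct = words - deletions - substitutions
    percent_correct = 100.0 * correct / words
    error_rate = 100.0 * errors / words
    return {
        "correct": correct,
        "errors": errors,
        "percent_correct": percent_correct,
        "error_rate": error_rate,
        "accuracy": 100.0 - error_rate,
    }


summary = alignment_summary(words=40, insertions=1, deletions=0, substitutions=13)
print(summary)
```

Note that with only 40 reference words, a single word changes each figure by 2.5 percentage points, so results on a test set this small are coarse.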
I also created bigger test sets (around 100 words), but the accuracy does not improve. So I would like to be able to estimate whether it is worth testing with a few thousand words.
I am grateful for any help :)
If you have more adaptation data, you should use MAP adaptation rather than MLLR adaptation.
I would also use a smaller language model that is more specific to your application. With a generic language model, recognition is not going to be very accurate.
Thanks again. With a smaller language model containing only the needed words, the test results have improved significantly.
Now I'm trying to further improve the recognition rate in live mode (using LiveSpeechRecognizer). Word confusions still occur often there.
In this context, two more questions:
1. Is it possible to say in general whether it is worthwhile to adapt the same word with a large number of recordings? This would significantly increase the workload - so I would only like to do so if a clear recognition improvement is to be expected.
2. In addition to simple commands, I would like to be able to recognize spoken numbers. For example, the number 123 (German: "einhundertdreiundzwanzig"). In this case, could the word parts "ein", "hundert", "drei", "und", "zwanzig" be individually adapted and the entire word "einhundertdreiundzwanzig" only be given to the language model? Or does the word as a whole have to be adapted for a significant improvement in the recognition rate?
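To make the decomposition in question 2 concrete, here is a minimal Python sketch that splits a German number word into known parts. It does not answer whether adapting the parts is sufficient; it only shows how a compound could be broken down, for example to check which parts the adaptation recordings already cover. The part list is an illustrative, incomplete assumption and the function name is hypothetical.

```python
# Illustrative, incomplete list of German number parts (an assumption for this sketch).
NUMBER_PARTS = [
    "ein", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "zwanzig", "dreißig",
    "hundert", "tausend", "und",
]


def split_number_word(word, parts=NUMBER_PARTS):
    """Return one segmentation of word into known parts, or None if there is none."""
    if not word:
        return []
    # Try longer parts first; backtrack if the remainder cannot be segmented.
    for part in sorted(parts, key=len, reverse=True):
        if word.startswith(part):
            rest = split_number_word(word[len(part):], parts)
            if rest is not None:
                return [part] + rest
    return None


print(split_number_word("einhundertdreiundzwanzig"))
print(split_number_word("hallo"))
```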
> 1. Is it possible to say in general whether it is worthwhile to adapt the same word with a large number of recordings? This would significantly increase the workload - so I would only like to do so if a clear recognition improvement is to be expected.
> 2. In addition to simple commands, I would like to be able to recognize spoken numbers. For example, the number 123 (German: "einhundertdreiundzwanzig"). In this case, could the word parts "ein", "hundert", "drei", "und", "zwanzig" be individually adapted and the entire word "einhundertdreiundzwanzig" only be given to the language model? Or does the word as a whole have to be adapted for a significant improvement in the recognition rate?
Thanks again for the reply.
Currently I am dealing with the following problem: If words are not recognized correctly, word confusion often occurs. Since the language model contains only required commands, this often leads to the execution of a wrong command in my application.
Therefore I check each word with wordResult.getConfidence() (scaled to the range from 0 to 1 with getLogMath().logToLinear()). This value, however, does not appear to be very informative. Even if I just tap the microphone with my finger, a word with a confidence value > 0.99 is sometimes recognized. Correctly recognized words usually also have values > 0.99. Because of this, no useful filtering of recognized words seems possible.
Is this a known problem, or can this problem be reduced with more adaptation data?
> Is this a known problem, or can this problem be reduced with more adaptation data?