VoxForge
While I was trying to synchronize my testing set with the one used in the Sphinx experiments
(link: http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/AcousticModels/Sphinx/voxforge-en-r0_1_2.tar.gz)
I found that prompts for some of the speech files are missing from the repository
http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Prompts/Prompts.tgz
(downloaded 2-26-09)
I was able to recover some from transcription files found in the archive with the Sphinx experiments. I have uploaded them here:
http://liks.fav.zcu.cz/tomas/prompts_recovered.txt
With these prompts added to the ones in the repository, there are no more prompts missing for the Sphinx test set (although some of the speech files in the VoxForge corpus still have no prompts).
--- (Edited on 2/26/2009 4:35 am [GMT-0600] by tpavelka) ---
Hm, I failed to find out why it's missing. Probably during the update Ken forgot to add this particular speaker chunk to the prompts. Actually I don't think the master prompt file is useful at all; it's much easier to build prompts from the data available in each archive.
As far as I remember, some prompts still need correction, but I don't remember exactly which ones. Here is my list; I'm not sure if they have been corrected already:
corno1979-10102006
kylegoetz-10122006
corno1979-10102006-NR - bad PROMPTS
mfread* - no PROMPTS, just prompts.txt
douglaid-20080205 - vf-01 instead of vf-1
many PROMPTS files have ../../../Audio/MFCC/XXkHz_YYbit/MFCC_0_D/ inside
douglaid-20080203 - incorrect prompt line
mojomove411-20071102-poe/wav/iaf0007 KILTARTAN\342\200\231S - bad word
--- (Edited on 2/28/2009 4:17 am [GMT-0600] by nsh) ---
Hi tpavelka,
Sorry for the delay in getting back to you on this, been travelling...
I think that most (if not all...) of the prompts in your prompts_recovered.txt file were intentionally removed because there were problems with the audio. Please see the diffs for these changesets:
[10245] | 05/27/08 12:16:57 | kmaclean | removed prompt entries containing various errors (audio is still in …)
[10244] | 05/27/08 10:01:39 | kmaclean | removed prompt entries (because there is too much line noise on these …)
[10243] | 05/27/08 09:57:04 | kmaclean | removed prompt entries (because there is too much line noise on these …)
--- (Edited on 3/7/2009 3:34 pm [GMT-0500] by kmaclean) ---
Hi nsh,
Sorry for the delay in getting back to you on this, been travelling...
>Actually I don't think the master prompt file is useful at all; it's
>much easier to build prompts from the data available in each archive.
The master_prompts files are used by the nightly HTK acoustic model build. Sometimes, if the audio in a submission is noisy but not that bad (i.e. it might be useful for a noisy-environment corpus), I will add it to the repository (using the nightly build scripts, which add the prompts to the master_prompts file) and then remove it manually. This is all logged in Subversion. Trac's search is useful for finding these occurrences.
>As far as I remember, some prompts still need correction, but I don't
>remember exactly which ones. Here is my list; I'm not sure if they
>have been corrected already:
Here is a post (dated: 9/6/2008 7:06 am) where you listed all the bad prompts which you had found up to that point (which have not yet been corrected):
Ken
--- (Edited on 3/7/2009 4:10 pm [GMT-0500] by kmaclean) ---
> Here is a post (dated: 9/6/2008 7:06 am) where you listed all the bad prompts which you had found up to that point (which have not yet been corrected):
Thanks Ken!
I hope ecret will fix them now :)
--- (Edited on 3/7/2009 3:25 pm [GMT-0600] by nsh) ---
I'm still trying to train acoustic models with VoxForge and my results are still rather poor. I always thought that for speech corpora, the bigger the better, but I was able to achieve much better results with a much smaller corpus (about 13 hours). The two important differences were that the language was Czech (which I do not believe is easier to recognize than English, quite the opposite) and that the recordings were all read using the same microphone in a very quiet environment.
I will write about my results later but before that I still need to run a few more tests. While I was running HTK training I tried to figure out how much the quality of the recordings can affect recognition and how to use this information to improve recognition accuracy. I was thinking that if I was able to identify the bad recordings I could take them out of the training set and see if the recognition accuracy goes up.
The problem is how to automatically identify the "bad" recordings, since I wasn't up to listening to the whole 58 hours of VoxForge. I ran an experiment whose results are unfortunately rather inconclusive, but they might still be interesting. Here are the details:
I created a phoneme-only recognizer by having a dictionary containing only phonemes and a grammar (EBNF) that looks like this:
(sil < aa | ae | ah | ao | ... | zh > sil)
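As a sketch of how such a grammar can be generated automatically (the phone list below is an assumed subset used for illustration; the real grammar would list every phoneme in the dictionary):

```python
# Build an HTK-style EBNF grammar that accepts any sequence of phonemes
# bracketed by silence: (sil < p1 | p2 | ... > sil)
def phone_loop_grammar(phones):
    return "(sil < " + " | ".join(phones) + " > sil)"

# Assumed subset of the phone set, for illustration only.
phones = ["aa", "ae", "ah", "ao", "zh"]
print(phone_loop_grammar(phones))
# (sil < aa | ae | ah | ao | zh > sil)
```

The resulting string can be fed to HParse to produce the lattice used for phoneme-only recognition.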
With single mixture monophones I ran the recognition for the whole training set and got the recognition accuracy for every file in the set. Here is the histogram:
http://liks.fav.zcu.cz/tomas/plot_histogram.png
It looks like a normal distribution; I do not see any clear outlier groups that would indicate transcriptions that are way off. What I tried next was to take the recordings that were recognized with %Corr>70 and %Corr<30, make playlists of them, and listen to see if I could hear any patterns. Here's what I found:
%Corr>70
%Corr<30
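Once you have a per-file %Corr value (e.g. from running HResults per file), the histogram and the two playlists are a few lines of scripting. A sketch with synthetic scores (the file names and accuracy values below are made up for illustration):

```python
import random

# Synthetic per-file phoneme accuracies (%Corr); in practice these
# would come from the recognizer's per-file scoring output.
random.seed(0)
corr = {"file%04d" % i: random.gauss(50, 15) for i in range(1000)}

# Histogram with 10%-wide bins, clamped to [0, 100).
bins = [0] * 10
for c in corr.values():
    bins[min(max(int(c // 10), 0), 9)] += 1

# Playlists for the tails of the distribution.
good = [f for f, c in corr.items() if c > 70]
bad = [f for f, c in corr.items() if c < 30]
print(sum(bins), len(good), len(bad))
```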
I know there has been a lot of talk about whether the corpus should consist only of clean recordings or whether noise, foreign accents etc. should be present as well. I wonder if we could do experiments that would support either version (or show what the advantages and disadvantages of either approach are).
But to do this we would have to be able to classify the recordings by quality etc. I tried to extract this information from the READMEs but could not find anything useful. For example, the gender for 32% of the recordings is "unknown" (unless I missed something; you know how it is with scripts). And regarding dialect, there are many variants of English but little information about non-native speakers/foreign accents.
I know how to solve the gender problem automatically: train two recognizers, one on the known male recordings and one on the known female recordings. Compare the acoustic scores of these two recognizers on the unknown recordings and you can expect reasonable classification accuracy. If you average the results by user name, the accuracy should improve even further.
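That likelihood comparison can be sketched with a toy model: fit one diagonal Gaussian per gender on feature vectors (stand-ins for MFCC frames), score each unknown file under both models, and pick the higher one. All data below is synthetic; a real system would use the full acoustic models.

```python
import math, random

random.seed(1)
DIM = 4  # toy feature dimension (real MFCC vectors are larger)

def fit_diag_gauss(frames):
    """Per-dimension mean and variance from a list of feature vectors."""
    n = len(frames)
    mean = [sum(f[d] for f in frames) / n for d in range(DIM)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n + 1e-6
           for d in range(DIM)]
    return mean, var

def log_lik(frames, model):
    """Average per-frame log-likelihood, so file length does not matter."""
    mean, var = model
    ll = 0.0
    for f in frames:
        for d in range(DIM):
            ll += -0.5 * (math.log(2 * math.pi * var[d])
                          + (f[d] - mean[d]) ** 2 / var[d])
    return ll / len(frames)

def synth(center, n):
    return [[random.gauss(center, 1.0) for _ in range(DIM)] for _ in range(n)]

# "Train" on labeled synthetic data.
male = fit_diag_gauss(synth(-2.0, 500))
female = fit_diag_gauss(synth(+2.0, 500))

# Classify an unlabeled file (drawn from the "female" distribution here).
unknown = synth(+2.0, 50)
guess = "female" if log_lik(unknown, female) > log_lik(unknown, male) else "male"
print(guess)
```

Averaging these per-file decisions over all files from one user name is then a simple majority vote.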
As for the quality of recordings, pronunciation, foreign accents etc., this is more difficult to do automatically, any thoughts on this?
Also, I think there is the problem of balance: if there is too much speech from one user, the accuracy for that user will go up (see my results for "ralfherzog"). But how will this affect the accuracy for the others? The same problem (I guess) exists with the male/female ratio; someone once told me that female voices are harder to recognize, so your male/female ratio should be something like 40/60 (unfortunately I do not have any links to support this). As for foreign accents, you can recognize English with a strong French/Spanish/Indian accent, but only if you have sufficient training data for it; otherwise I guess this data will have a negative impact on your performance.
I know some of the issues I have addressed here are hard to solve. E.g. how do you convince more women to submit speech to VoxForge ;-) What I am interested in is whether some of these problems could be solved by the means of automatic speech recognition which is where I might be able to help some.
--- (Edited on 3/9/2009 4:19 am [GMT-0500] by Visitor) ---
Hey tpavelka
Yes, it's all true. Indeed the VoxForge corpus has many challenges and provides a lot of material for research. We would be really happy to see it turned into a research article to promote VoxForge in the academic community.
From a practical point of view I have two comments. First of all, to get the quality you could just use the forced alignment score, which is more straightforward. This should strip out the bad recordings automatically. I'm not sure if your training process includes that.
Second, for our system, which is oriented toward geeks, male voice recognition is probably more important :)
Third, for practical applications we probably need to drop Ralf's recordings from training. It would be interesting to see the performance of the model trained with and without them. In themselves they are an invaluable contribution, but I think they seriously affect the quality.
Another thing I've been thinking about recently: in the future VoxForge will be helpful for continuous dictation, but we currently don't cover the dialog transcription task at all. We need to find a way to encourage contributions of real dialogs that would help test and train such a model.
--- (Edited on 3/9/2009 5:15 am [GMT-0500] by nsh) ---
Would it not be possible to train a speech recognition engine to recognize the gender, accent, and quality of the speech, rather than what it was saying? I guess this is more like speaker recognition, as is sometimes used for access control.
It should be possible to cluster the recordings around different accents, qualities etc. and then generate different trained engines for each one.
Another thought I had was inspired by the RANSAC algorithm: take a small subset of the corpus and train on it, then test the accuracy on the rest of the corpus. Repeat for several different randomly selected small subsets. For each recording, build a score of how well it was recognized by training subsets not containing it. Use the best performers to build the master corpus.
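The scheme above can be sketched with a deliberately tiny stand-in: each "recording" is a single number, "training" is taking the subset mean, and the "recognition score" is closeness to that mean. This is only an illustration of the bookkeeping, not of acoustic model training:

```python
import random

random.seed(2)

# Toy stand-ins for recordings: one value per recording, with one
# deliberate outlier (a "bad" recording) at index 0.
recordings = [10.0] + [random.gauss(0.0, 1.0) for _ in range(49)]

scores = {i: [] for i in range(len(recordings))}
for _ in range(200):
    subset = random.sample(range(len(recordings)), 5)
    model = sum(recordings[i] for i in subset) / 5  # "training" = mean
    for i in range(len(recordings)):
        if i not in subset:
            # "Recognition score" = closeness to the trained model;
            # each recording is only scored by subsets excluding it.
            scores[i].append(-abs(recordings[i] - model))

# Aggregate: recordings with the worst average score are candidates
# for removal from the master corpus.
avg = {i: sum(s) / len(s) for i, s in scores.items()}
worst = min(avg, key=avg.get)
print(worst)  # index of the planted outlier
```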
--- (Edited on 3/9/2009 6:01 am [GMT-0500] by Visitor) ---
>We would be really happy to see it turned into a research article
>to promote VoxForge in the academic community.
Can I spam here? ;-) If you would like to visit Pilsen, we are organizing a conference in September,
www.tsdconference.org
The deadline is March 22nd, but it is very likely to be extended.
>From a practical point of view I have two comments. First of all,
>to get the quality you could just use the forced alignment score,
>which is more straightforward. This should strip out the bad
>recordings automatically. I'm not sure if your training process
>includes that.
The problem is that you search for the sentence with the highest posterior probability
P(W|O)=P(O|W)P(W)/P(O)
where P(O|W) is the acoustic model likelihood, P(W) is the language model prior and P(O) is the prior probability of the speech data. During the Viterbi search you can ignore P(O) because it is the same for all the paths you consider. But when comparing the scores of two different utterances you have to take P(O) into account. Unfortunately no one knows how to compute it, and the problem of assigning a confidence to a recognition result is a hard one.
My experience is that the acoustic score coming out of the Viterbi algorithm is pretty much useless (unless you have a really big mismatch between transcription and the actual utterance). Results from phoneme only recognizer are a bit better, but (as I have shown in the experiment) not by much.
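One common workaround (a standard confidence-scoring trick, not something claimed in this thread) is to approximate P(O) by the score of an unconstrained phone loop, so the confidence becomes a per-frame log-likelihood ratio between the forced alignment and the phone loop. A toy calculation with made-up scores:

```python
# Per-frame acoustic log-likelihoods for one utterance (made-up numbers):
# one score from the forced alignment against the transcription, one from
# the best unconstrained phone-loop path (an approximation of log P(O)).
forced_ll = [-61.0, -58.5, -60.2, -59.1]
loop_ll   = [-60.0, -58.0, -59.5, -58.8]

# Per-frame log-likelihood ratio; near 0 means the transcription explains
# the audio about as well as anything can, very negative means a mismatch.
confidence = sum(f - l for f, l in zip(forced_ll, loop_ll)) / len(forced_ll)
print(round(confidence, 3))  # -0.625
```

Since the phone-loop score can never be lower than the forced-alignment score on the same frames, the ratio is bounded above by zero and is comparable across utterances of different lengths.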
> Second, for our system, which is oriented toward geeks, male voice
> recognition is probably more important :)
I guess training separate male/female models might help, especially with the ratio that is in the corpus right now.
> Third, for practical applications we probably need to drop Ralf's
> recordings from training. It would be interesting to see the
> performance of the model trained with and without them. In
> themselves they are an invaluable contribution, but I think they
> seriously affect the quality.
What my data shows is that they are recognized better by a phoneme-only recognizer (I did not try this with words; the result may be different). This may be due to the fact that there is a lot of data from this particular speaker, but it may also be due to audio quality/pronunciation. It may affect the performance of the whole system, or maybe not; I guess you would have to test it to find out.
> Another thing I've been thinking about recently: in the future
> VoxForge will be helpful for continuous dictation, but we currently
> don't cover the dialog transcription task at all. We need to find a
> way to encourage contributions of real dialogs that would help test
> and train such a model.
We have tried to build a corpus from lecture recordings. Besides the usual artifacts of spontaneous speech (false starts etc.), the problem is that they need to be transcribed. Our observation is that a skilled transcriber can transcribe about 3 minutes of speech in one hour :-(
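For scale, a back-of-the-envelope calculation of what that rate would mean for a corpus the size of VoxForge:

```python
corpus_hours = 58
minutes_transcribed_per_hour_of_work = 3

# 3 minutes per hour of work is a 20x real-time factor.
real_time_factor = 60 / minutes_transcribed_per_hour_of_work
transcriber_hours = corpus_hours * real_time_factor
print(transcriber_hours)  # 1160.0
```

That is well over half a person-year of full-time transcription work.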
--- (Edited on 3/9/2009 7:40 am [GMT-0500] by tpavelka) ---
> Would it not be possible to train a speech recognition engine to
> recognize the gender, accent, and quality of the speech, rather
> than what it was saying?
For this to work you would first need annotated training data, which we do not have here. I do not think anyone is up to listening to the 58 hours of the corpus and classifying each recording into categories.
> It should be possible to cluster the recordings around different
> accents, qualities etc. and then generate different trained
> engines for each one.
I guess the first step would be to train a recognizer with clean, well-pronounced recordings; the problem is how to find them. Such a system should (in theory) give better results on data of similar quality but may perform worse on noisy data. There is always a tradeoff, but if we could separate the data into categories this tradeoff could be quantified.
> Another thought I had was inspired by the RANSAC algorithm
Never heard of it, but it sounds interesting. Given the size of the corpus, this would take ages, but it might be a way to avoid manual classification.
--- (Edited on 3/9/2009 7:54 am [GMT-0500] by tpavelka) ---