VoxForge
Hi All,
I need help getting the VoxForge Acoustic Models working with Sphinx 4. I'm only asking on this forum after spending an entire day looking through user manuals and wikis.
I understand the differences between acoustic models, language models, etc. I'm also a Java developer, so that helps since Sphinx 4 is written in Java.
I originally guessed that the VoxForge download is in Sphinx 3 format, so I looked at Sphinx 4's build.xml to see what it does to convert 3 to 4, but those steps don't match up with the contents of the VoxForge download.
I also looked around the Sphinx 4 source tree for similarities between the existing CMU models and the VoxForge model, and I can't find any. For example, the CMU models don't have *.align files, while VoxForge does.
Is there any instruction manual on integrating VoxForge into Sphinx 4? Call me stupid, but I couldn't find anything in the manuals and wikis.
In case anyone is interested, what I basically want to do is to take wav files of voice mails, and transcribe them. Basically similar to what Google Voice does.
Thanks in advance for any help,
William
--- (Edited on 1/8/2010 8:51 am [GMT-0600] by ) ---
EDIT: Second follow-up question: how come there are no daily builds for Sphinx, only for HTK/Julius? It would be nice if we could get the most up-to-date models for Sphinx as well, wouldn't it?
--- (Edited on 1/8/2010 8:55 am [GMT-0600] by ) ---
>Is there any instruction manual on integrating VoxForge into Sphinx
>4?
See nsh's post here: Re: Sphinx3 model to Sphinx4 model
>It would be nice if we could get the most up-to-date models for Sphinx
>as well, wouldn't it?
On my to do list...
Are you volunteering to help out?
Ken
--- (Edited on 1/8/2010 11:31 am [GMT-0500] by kmaclean) ---
>On my to do list...
>Are you volunteering to help out?
Hey Ken, yeah sure, I'd be willing to help set this up. I have no idea how to build these bundles at this point, but I'm sure it's not rocket science and if you send me a build guide, I can set it up.
--- (Edited on 1/8/2010 11:06 am [GMT-0600] by ) ---
>I have no idea how to build these bundles at this point,
>but I'm sure it's not rocket science and if you send me a build guide, I
>can set it up.
See this post: Creating Sphinx Acoustic Model.
Downloading the corpus will take a while (we have about 70 hours' worth of speech).
Ken
--- (Edited on 1/8/2010 1:40 pm [GMT-0500] by kmaclean) ---
Hello Ken
> See nsh's post here: Re: Sphinx3 model to Sphinx4 model
I wouldn't link that rather sketchy post when official documentation is available.
--- (Edited on 1/9/2010 03:26 [GMT+0300] by nsh) ---
> Is there any instruction manual on integrating VoxForge into Sphinx 4? Call me stupid, but I couldn't find anything in the manuals and wikis.
There is official documentation on this, which you can easily find on Google and in the Sphinx 4 source:
http://cmusphinx.sourceforge.net/sphinx4/doc/UsingSphinxTrainModels.html
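In short, the decoding side with a trained model looks roughly like the sketch below. Note this uses the high-level API from later Sphinx 4 releases rather than the XML configuration that document describes, and the model/dictionary/LM paths are only placeholders for wherever you unpack the VoxForge bundle:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;

    public class Transcribe {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            // Placeholder paths -- point these at the unpacked VoxForge bundle.
            config.setAcousticModelPath("file:voxforge/model_parameters/voxforge_en");
            config.setDictionaryPath("file:voxforge/etc/voxforge_en.dic");
            config.setLanguageModelPath("file:voxforge/etc/voxforge_en.lm.DMP");

            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(config);
            // The stream must match what the model expects: 16kHz, 16-bit, mono.
            InputStream audio = new FileInputStream(new File("voicemail.wav"));
            recognizer.startRecognition(audio);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }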
> In case anyone is interested, what I basically want to do is to take wav files of voice mails, and transcribe them. Basically similar to what Google Voice does.
In case you are interested, it's not as simple as you might think. Voicemails are usually 8kHz audio, while the VoxForge model is trained on 16kHz audio, so it will not work for telephone recordings. There are other issues as well.
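A quick way to check what you actually have, using the standard javax.sound.sampled API (the file name here is just an example):

    import java.io.File;
    import javax.sound.sampled.AudioFileFormat;
    import javax.sound.sampled.AudioSystem;

    public class CheckSampleRate {
        public static void main(String[] args) throws Exception {
            // Example file name; any RIFF/WAV voicemail export will do.
            AudioFileFormat fmt = AudioSystem.getAudioFileFormat(new File("voicemail.wav"));
            float rate = fmt.getFormat().getSampleRate();
            System.out.println("Sample rate: " + rate + " Hz");
            if (rate < 16000f) {
                System.out.println("Telephone-bandwidth audio: a 16kHz acoustic model will not match.");
            }
        }
    }

Upsampling 8kHz audio to 16kHz does not help, by the way: the frequencies above 4kHz that the model's features rely on were never recorded.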
--- (Edited on 1/9/2010 03:25 [GMT+0300] by nsh) ---
> In case anyone is interested, what I basically want to do is to take wav files of voice mails, and transcribe them. Basically similar to what Google Voice does.
Don't you need a dictation system for that? Maybe I'm wrong, but I think you need much more than 70 hours of speech for that.
--- (Edited on 1/8/2010 7:20 pm [GMT-0600] by Visitor) ---
> Don't you need a dictation system for that? Maybe I'm wrong, but I think you need much more than 70 hours of speech for that.
It's a common myth spread on VoxForge for some reason. Database size is only one of the factors that affect accuracy; there are many others. There is no direct dependency between size and accuracy. It's possible to have good accuracy with 70 hours, and it's possible that with 10 thousand hours you'll have a higher error rate.
Here, on page 13, you can find a comparison of accuracy and database size:
http://mi.eng.cam.ac.uk/research/projects/EARS/pubs/evermann_sttmay04.pdf
Basically the difference in accuracy between 400 hours and 2200 hours is 2%.
--- (Edited on 1/9/2010 04:49 [GMT+0300] by nsh) ---
>It's a common myth spread on VoxForge for some reason.
Yes, it is all part of a grand conspiracy to collect thousands of hours of recorded speech to create "The One Acoustic Model"... :)
I don't have the time or the inclination to collect more speech than we need. If we have enough English speech now, please let me know, so I can cut a release, and move on to other things.
>Database size is only one of the factors that affect accuracy;
>there are many others. There is no direct dependency between
>size and accuracy.
Good to know...
>It's possible to have good accuracy with 70 hours, and it's possible
>that with 10 thousand hours you'll have a higher error rate.
Wouldn't more data reduce the impact of outliers/errors in transcriptions or pronunciations, or non-speech noise? I.e., the theory being that rather than spending lots of time manually transcribing/reviewing a small database of speech, you just collect lots of it and hope that the statistical analysis performed during acoustic model training drops the outliers.
>Here, on page 13, you can find a comparison of accuracy and database size:
>http://mi.eng.cam.ac.uk/research/projects/EARS/pubs/evermann_sttmay04.pdf
From the "Experiments with Fisher Data" PowerPoint slide:
2.2% abs. WER reduction from using 800h Fisher instead of 360h h5train03
Could that not also be interpreted as saying that by more than doubling the speech from 360 hours to 800 hours, you only get a 2.2% improvement in Word Error Rate (which, as far as I could tell, was pretty bad to start with...), and that therefore, everything else being equal, you need lots more speech to get a decent improvement because of diminishing returns?
So how much speech do we need for command-and-control acoustic models (70 hours? less than 400 hours?), and how much for dictation (no more than 400 hours?), if the data were perfectly transcribed (speech and non-speech), the audio perfectly clean, and the target North American users with minimal accent variance?
Ok that last question was rhetorical... my real question is:
Is the only way to find out how much speech we really need for a given domain to create a test set (as you have suggested many times...) for the target domain (e.g. North American English) and keep collecting speech until we see no further improvement in recognition on that test set? Once that is accomplished, should we then focus our attention on the AM and LM adaptation frameworks described in your post: How to create a speech recognition application for your needs?
thanks,
Ken
--- (Edited on 1/19/2010 3:38 pm [GMT-0500] by kmaclean) ---
> I don't have the time or the inclination to collect more speech than we need. If we have enough English speech now, please let me know, so I can cut a release, and move on to other things.
Hi Ken
Well, I certainly didn't want to say we should stop this. It's definitely a very important project that should go on forever. Even if it doesn't bring a lot in accuracy terms, think about its social role: by encouraging people to get involved in the open data domain with little effort required, VoxForge does very important things.
> Wouldn't more data reduce the impact of outliers/errors in
> transcriptions or pronunciations, or non-speech noise? I.e., the theory
> being that rather than spending lots of time manually
> transcribing/reviewing a small database of speech, you just collect
> lots of it and hope that the statistical analysis performed during
> acoustic model training drops the outliers.
There is an issue here: the mathematical model of speech being trained is not very consistent with speech itself. It means that with a large amount of data, training converges to an optimal solution which is not optimal for the user in terms of accuracy. That's the reason discriminative methods are becoming popular. This article is not very related, but it attracted my attention recently:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.4227
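To make that concrete, here is the standard textbook contrast (my notation, not taken from the article above). Maximum likelihood training optimizes

    \mathcal{F}_{\mathrm{MLE}}(\lambda) = \sum_r \log p_\lambda(O_r \mid w_r)

while discriminative (MMI) training optimizes

    \mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_r \log \frac{p_\lambda(O_r \mid w_r)\, P(w_r)}{\sum_w p_\lambda(O_r \mid w)\, P(w)}

where O_r is the audio of utterance r, w_r its reference transcript, and the denominator sums over competing hypotheses w. MLE only raises the likelihood of the reference under the (imperfect) model; MMI also lowers the likelihood of the competitors, which is much closer to what word error rate actually measures. That's why adding more MLE data converges you to the model's optimum, not necessarily the user's.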
> Is the only way to find out how much speech we really need for a given domain to create a test set (as you have suggested many times...) for the target domain (e.g. North American English) and keep collecting speech until we see no further improvement in recognition on that test set? Once that is accomplished, should we then focus our attention on the AM and LM adaptation frameworks described in your post: How to create a speech recognition application for your needs?
It will never be fully accomplished; there will always be room for improvement. But I like your idea. Once we have "We recognize with 90% accuracy" on the top page, it will be way more encouraging for our visitors :)
--- (Edited on 1/19/2010 6:41 pm [GMT-0600] by Visitor) ---