VoxForge
--- (Edited on 2008-03-03 9:57 am [GMT-0600] by ralfherzog) ---
Hi Ralf,
thank you for your answer. I am not technically up to date with speech recognition, though I do find it interesting. The only thing I can contribute to this project is to get many people to donate. I'll leave it to you folks to make sure they donate the right variety of speech.
But since I am interested: it doesn't make sense to me why the phrase should be the building block. I can imagine words being pronounced differently in different phrases, but isn't it overdoing it to model every possible phrase? That's impossible! (and would require a hard disk the size of a mammoth tanker to store it)
Johan
--- (Edited on 3/3/2008 10:18 am [GMT-0600] by JohanLingen) ---
--- (Edited on 2008-03-03 12:00 pm [GMT-0600] by ralfherzog) ---
Hi Johan/Ralf,
I think we need to look at how a *dictation* system works to get to the bottom of this discussion.
All Speech Recognition Engines ("SRE"s) are made up of the following components:
- Language Model or Grammar - Language Models contain a very large list of words and their probability of occurrence in a given sequence. They are used in dictation applications. Grammars are a much smaller file containing sets of predefined combinations of words. Grammars are used in IVR or desktop Command and Control applications. Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up a word).
- Acoustic Model - Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. Each distinct sound corresponds to a phoneme.
- Decoder - Software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds. When a match is made, the Decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech. It then searches the Language Model or Grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program (a toy sketch of this last lookup step follows below).
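To make that last lookup step concrete, here is a minimal sketch in Python. The lexicon and phoneme labels are made up for the example, and real decoders (Sphinx, Julius) use HMMs and beam search over acoustic scores rather than exact phoneme matching - this only illustrates the idea of mapping a recognized phoneme sequence back to words:

    # Toy lookup only: real decoders use HMMs and beam search over
    # acoustic scores, not exact phoneme matching.

    # Hypothetical pronunciation dictionary: word -> phoneme list.
    LEXICON = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
        "word":  ["W", "ER", "D"],
    }

    def decode(phonemes):
        """Greedily match a phoneme sequence against the lexicon,
        taking the longest matching pronunciation at each position."""
        words, i = [], 0
        while i < len(phonemes):
            best = None
            for word, pron in LEXICON.items():
                if phonemes[i:i + len(pron)] == pron:
                    if best is None or len(pron) > len(LEXICON[best]):
                        best = word
            if best is None:
                return None  # no word fits; a real decoder would backtrack
            words.append(best)
            i += len(LEXICON[best])
        return words

    print(decode(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
    # -> ['hello', 'world']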
So the Language Model and Acoustic Model work together. I've always been under the assumption that, from an acoustic model creation perspective, we need as many samples of phones in as many different contexts as possible. Sphinx and HTK/Julius use triphones to provide this context (i.e. one phone and its associated left and right phones). Nsh, in a previous post (where we had a similar discussion to this one, and which would be good for you to review too), corrected me and essentially said that the key is to get recordings of words that contain the most common triphones, and to use "tied-state triphone" models to cover the rare triphones. Tied-state triphones are a shortcut that groups similar triphones together, reducing the need to have samples for every possible triphone in a language.
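To show what a triphone actually is, here is a small Python sketch (the pronunciations are made up for the example) that expands a word's phone sequence into its triphones and tallies how often each one occurs - the kind of count you would use to check coverage across a prompt list:

    from collections import Counter

    # Made-up pronunciations; real systems read these from the
    # pronunciation dictionary that ships with the acoustic model.
    PRONUNCIATIONS = {
        "speech": ["S", "P", "IY", "CH"],
        "press":  ["P", "R", "EH", "S"],
    }

    def triphones(phones):
        """Each phone in the context of its left and right neighbours;
        'SIL' (silence) pads the word boundaries."""
        padded = ["SIL"] + phones + ["SIL"]
        return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

    coverage = Counter()
    for word, phones in PRONUNCIATIONS.items():
        coverage.update(triphones(phones))

    for tri, count in coverage.most_common():
        print(tri, count)  # e.g. ('S', 'P', 'IY') appears once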
So for the acoustic recognition of the elemental sounds that make up a word (i.e. phones) using an acoustic model, good coverage of the most common triphones is what we should be shooting for.
However, from a language model perspective, this is where we should be focusing on the probabilities of occurrence of 400,000 different words and millions of different sentence combinations - which is what Ralf was referring to.
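As a rough illustration of what a language model stores, here is a toy bigram model in Python - real dictation systems train much larger n-gram models on millions of sentences and apply smoothing, but the idea of "probability of occurrence in a given sequence" is the same:

    from collections import Counter

    # Toy corpus; a real language model is trained on millions of sentences.
    corpus = [
        "the quick brown fox",
        "the quick red fox",
        "the lazy dog",
    ]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    def bigram_prob(w1, w2):
        """P(w2 | w1) by maximum likelihood; real models add smoothing
        so unseen pairs do not get probability zero."""
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    print(bigram_prob("the", "quick"))  # 2/3: "quick" follows "the" twice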
Ken
P.S. Ralf - thanks for really challenging my thinking on this ... it will help make our objective clearer in the future.
--- (Edited on 3/4/2008 2:02 pm [GMT-0500] by kmaclean) ---
--- (Edited on 2008-03-04 5:35 pm [GMT-0600] by ralfherzog) ---
Well, I don't know whether the phrases contain the right triphones, but I do know they contain words I have never used in my life and hopefully will never use again. And I am almost sure that most phrases are impossible for children to pronounce.
If the words are too difficult, there will be great diversity in the pronunciation!
--- (Edited on 3/5/2008 5:24 pm [GMT-0600] by JohanLingen) ---