Acoustic Model Discussions

Sphinx Language Model --> Acoustic Model question
User: Eric
Date: 5/14/2010 8:15 pm

Hello,

I have a fairly simple question that I cannot find the answer to.  Does your Acoustic model HAVE to contain the exact spoken utterances that your language model dictionary defines?  In other words, if I need the word:

   commodity

in my language model (so that it can be recognized), does the associated acoustic model need to be recorded with that word somewhere in it, or will the speech recognition system recognize 'commodity' based on its similarity to other words that are in the acoustic model?

 

Other questions I have are:

- what matters most in terms of speech recognition speed -- acoustic model size, language model size, or both?  In other words, if I want the fastest recognition time possible, should my language model just contain the words I need, or should I use a much larger pre-defined (open source) language model?

 

- one can use lmtool to generate a smaller language model dict/grammar; however, what tools would I use to develop my own acoustic models (based on the limited vocabulary I need)? Does it make sense to create my own acoustic model and train it myself in order to get higher accuracy for the limited vocabulary I need (i.e. commodity trading terms)?

 

Thank You,

Eric

 

--- (Edited on 5/14/2010 8:15 pm [GMT-0500] by Visitor) ---

Re: Sphinx Language Model --> Acoustic Model question
User: Robin
Date: 5/19/2010 3:43 pm

Hi Eric,

It helps a lot if an acoustic model is trained with speech from the same domain as the one it will be used in. So if you would like to dictate blog posts on investing, for instance, it would be good to train the model with corresponding texts (speech). However, it is not necessary that each and every word that you want your application to recognise appears somewhere in your training texts.

Of course, your application will not be able to recognise words that are not present in your phonetic dictionary. Also, if a word was not present in the texts used to create the language model (a statistical representation of the language), then according to the language model the chance of that word being used is zero, so if the user says that word, the application will assume it is another word. If the application is a command-and-control application, then the word should be incorporated in the list of commands (the so-called grammar).
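For illustration (a toy sketch in plain Python, not Sphinx code): 'commodity' would need a phonetic dictionary entry such as the CMUdict line COMMODITY K AH M AA D AH T IY, and it must also occur in the texts the language model was estimated from, because a maximum-likelihood n-gram model gives unseen words zero probability:

```python
# Toy unigram model: words never seen in the training text get
# probability zero, so the recogniser can never hypothesise them.
from collections import Counter

training_text = "buy the commodity sell the commodity hold the position"
counts = Counter(training_text.split())
total = sum(counts.values())

def unigram_prob(word):
    """Maximum-likelihood unigram estimate; unseen words score 0.0."""
    return counts[word] / total

print(unigram_prob("commodity"))  # 2/9 -- seen in training
print(unigram_prob("futures"))    # 0.0 -- never seen, so never recognised
```

(Real toolkits smooth these estimates, but a word absent from both the dictionary and the language model is still out of reach.)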

I think other people are better able to answer your remaining questions.

Robin

--- (Edited on 5/19/2010 3:43 pm [GMT-0500] by Robin) ---

Re: Sphinx Language Model --> Acoustic Model question
User: kmaclean
Date: 6/9/2010 8:01 pm

> what matters most in terms of speech recognition speed -- acoustic model size, language model size, or both?

I don't have first-hand knowledge on this, but since the speech decoder must search through both of these to come up with recognition results, keeping the size of both down would be your best bet.

 

> one can use lmtool to generate a smaller language model dict/grammar, however, what tools would I use to develop my own acoustic models (based on the limited vocabulary I need)?

For Sphinx, use SphinxTrain; for HTK or Julius, use the HTK toolkit.
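A rough sketch of the SphinxTrain workflow (command names as in the CMUSphinx training tutorial; the model name and directory contents here are placeholders):

```shell
# Create a training setup for a new model called "commodity"
# (assumes sphinxtrain is installed on the PATH)
sphinxtrain -t commodity setup

# After filling in etc/sphinx_train.cfg, the dictionary, the phone
# list, the transcripts and the wav/ audio, run the whole pipeline:
sphinxtrain run
```

For HTK, the corresponding steps use HCopy (feature extraction), HERest (Baum-Welch re-estimation) and HHEd (state tying).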

 

> does it make sense to create my own acoustic model and train it myself in order to get higher accuracy for the limited vocabulary I need (i.e. commodity trading terms)?

Yes, but you might be able to get away with a small generic acoustic model and adapt it with speech from your domain.

 

--- (Edited on 6/9/2010 9:01 pm [GMT-0400] by kmaclean) ---

Re: Sphinx Language Model --> Acoustic Model question
User: TonyR
Date: 6/10/2010 1:48 am

Hi Eric (Hi Voxforge)

 

I hope you have found most of your answers now.

 

> I have a fairly simple question that I cannot find the answer to.  Does your Acoustic model HAVE to contain the exact spoken utterances that your language model dictionary defines?

 

Definitely not. This is the whole point of building state-clustered triphone systems: you can use the pronunciation to get a set of states that were trained on the same phonemes, just in other words.
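As a toy illustration (plain Python, not decoder code) of how an unseen word can still be covered: the CMUdict pronunciation of 'commodity' expands into context-dependent triphones, and each triphone's states can be shared with the same triphone occurring in other training words (notation here is the HTK-style L-X+R, with SIL standing in for the word boundary):

```python
def triphones(phones):
    """Expand a phone sequence into left-context/phone/right-context
    triples, padding the word boundaries with silence (SIL)."""
    padded = ["SIL"] + phones + ["SIL"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

# CMUdict pronunciation of "commodity"
print(triphones("K AH M AA D AH T IY".split()))
# ['SIL-K+AH', 'K-AH+M', 'AH-M+AA', 'M-AA+D',
#  'AA-D+AH', 'D-AH+T', 'AH-T+IY', 'T-IY+SIL']
```

The triphone K-AH+M, for instance, also occurs in "company", so states trained there can serve "commodity" too.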

 

> what matters most in terms of speech recognition speed -- acoustic model size, language model size, or both?  In other words, if I want the fastest recognition time possible, should my language model just contain the words I need, or should I use a much larger pre-defined (open source) language model?

 

The language model should contain only the words you are going to see at recognition time (since you won't access the others, there's no point in having them). However, the question is much more complex than that. There is always a tradeoff between run-time speed and accuracy. Smaller models may run faster at the same pruning settings, but very small models make more errors, so it's not worth it. Similarly, very big models may only be a little better than models half their size, but run slower, and so again not be worth it. You've just got to experiment and find out what works best for you.
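To make the tradeoff concrete: in pocketsphinx, for example, it is controlled by pruning settings like the following (the parameter names are pocketsphinx's; the values are only illustrative starting points to experiment from):

```
-beam 1e-48       # keep hypotheses within this factor of the best one;
                  # a smaller value widens the beam: slower, more accurate
-wbeam 7e-29      # the same kind of beam, applied at word exits
-maxhmmpf 30000   # hard cap on the number of HMMs evaluated per frame
```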

 

>  does it make sense to create my own acoustic model and train it myself in order to get higher accuracy for the limited vocabulary I need (i.e. commodity trading terms)?

 

As Robin said, you probably want to do this in order to match the acoustic domains (e.g. if you were dealing with speech over a trading floor or from telephone conversations). As Ken said, you can adapt general models to your domain. I'd like to add that you should collect some in-domain data anyway: one set for tuning your system and another for evaluating performance.

 

Tony

--- (Edited on 10-June-2010 7:48 am [GMT+0100] by TonyR) ---
