VoxForge
From the article "On Speech Recognition: Web App Integration, Pointers for Newbies, & Lessons Learned from a failed startup":
For all of those thinking of integrating speech recognition into their apps I have a word of advice for you: Don’t.
[...]
[The] speech rec discussed in this article is the kind that understands short phrases and/or commands with no training required. It’s not free flowing dictation like that found in Dragon software. [...]
He reviews some of the ways to integrate speech recognition into a web application, and then describes the main stumbling block for open source speech recognition:
[...] The only real differences between the open source and commercially available solutions lie in what’s called their Acoustic Models. AMs for speech rec are like gold. A good AM is produced from several thousand hours of good audio samples.
I found an interesting reply to the original article. From the post:
[...] I use sphinx4 and
pocketsphinx and am very pleased. They are state of the art decoders.
The acoustic model, or lack thereof is the reason why commercial
engines are perceived as superior. [...] But after making my own model, and still
in the neverending process of making my own, tweaking it, etc.., I
appreciate the misery that is collecting and organizing transcribed
data and appreciate the work they do even if I can't use it. Notice
something wrong with your model, need to retrain it. Takes days with a
quad core. [...] And
spotting errors is hard. I cant emphasize enough how boring it is to
listen to hours on end of audio and see that it matches up with the
text perfectly. In some cases, listening a bunch of times to make sure.
Noticing issues with your model, having to go figure out why. [...] You
dont need thousands of hours unless your doing dictation and if that
extra few percent is worth it. You can get good results with low
hundreds. There are other equally important factors like language
models that he should of mentioned that could be equally as important
as the acoustic model. How its important to have relevant, and lots of
data to train them. The acoustic model is only one of many factors(as
is the decoder for the matter). [...]