VoxForge
Interesting article on the Discovery News website (Why are Speech Recognition and Natural Language Neither of Those?), which says that part of the author's frustration with a telephony speech IVR application was due to his expectations:
[...] The more human the electronic operator sounds, the more I expect from her. When she doesn’t perform, my eye-rolling, jaw-jutting, and nose-exhaling ensues. Japanese roboticist Masahiro Mori (b. 1927) devised a theory around this phenomenon and calls it the “Uncanny Valley.”
It says, basically, that humans will tolerate and even show empathy for artificially intelligent life forms (robots, electronic operators) as long as the machines don’t get too big for their britches and start looking and acting all Homo sapiens sapiens. [...]
So what to do? For starters, stop trying to simulate humans, said Bilmes. Keep speech recognition technology on a short leash and use it for applications where expectations are not so high.
He then goes on to give examples where speech recognition makes sense:
Ken
Bing 411 is Microsoft's answer to Goog411 (originally discussed in this post: Google Voice Local Search). It allows you to connect directly to businesses for free, and can also give directions (from this article):
The system, which is powered by Tellme, uses speech recognition technology to retrieve results. The free service helps users find a business, or receive a text message with a link to a map. It also includes star ratings of businesses based on reviews from others. [...]
How does it work? Users dial 1-800-Bing 411 (1-800-246-4411) from any phone and give the system the information they’re seeking. Bing 411 will give you directions over the phone (you can stop and repeat the directions several times, if needed). Or users can request them via text message.
Dasher is a mouse-driven interface that allows you to enter text without a keyboard:
Dasher is a text-entry system in which a language model plays an integral role, and it's driven by continuous gestures. Users can achieve single-finger writing speeds of 35 words per minute and hands-free writing speeds of 25 words per minute.
Demo video located here: Single-finger text input
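To make the language-model idea concrete, here is a minimal sketch (not Dasher's actual code) of how display space could be allotted to each possible next character in proportion to its probability under a toy bigram character model; the corpus and probabilities are purely illustrative:

    from collections import Counter, defaultdict

    # Toy sketch of the idea behind Dasher: the space offered to each possible
    # next character is proportional to its probability under a language model.
    # A tiny bigram character model is built from an illustrative corpus.

    corpus = "the quick brown fox jumps over the lazy dog the end"

    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def next_char_space(prev_char):
        # Return each candidate next character and the fraction of the display
        # it would be allotted, proportional to its bigram probability.
        counts = bigrams[prev_char]
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.most_common()}

    # After typing 't', 'h' gets all of the space in this tiny corpus.
    print(next_char_space('t'))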
Google has added voice search to the Android platform via the new software update.
Because it was trained on Goog411 (a service currently available only in the US and Canada), Google voice search has difficulty with other accents. Here's a statement from Google explaining why:
The acoustic model for Voice Search was trained, in part, by using data from GOOG-411 which has only launched broadly in the US. Since the acoustic model was trained using mostly American accents, the tool currently works best when receiving queries with American accents. While you can still download the Google Mobile App and turn on the Voice Search here, we've turned off the voice functionality by default when the app is downloaded from anywhere outside of the US. We don't have any specific launches to announce at this time, but we think this is exciting new technology and the speech recognition and understanding will only get better for other accents and jargon as we keep working on it
The New York Times is reporting that Google has added a voice interface to their iPhone search software:
Users of the free application, which Apple is expected to make available as soon as Friday through its iTunes store, can place the phone to their ear and ask virtually any question, like “Where’s the nearest Starbucks?” or “How tall is Mount Everest?” The sound is converted to a digital file and sent to Google’s servers, which try to determine the words spoken and pass them along to the Google search engine.
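A rough sketch of the round trip described in that quote, assuming a hypothetical recognition endpoint (Google's actual servers and protocol are not public in this form):

    import urllib.request
    import urllib.parse

    # Sketch of the pipeline described above: capture audio, send it to a
    # recognition server, then feed the returned transcript to the search engine.
    # The recognizer URL below is hypothetical.

    RECOGNIZER_URL = "http://speech.example.com/recognize"   # hypothetical endpoint
    SEARCH_URL = "http://www.google.com/search?q="

    def recognize(audio_bytes):
        # POST raw audio to the (hypothetical) recognition service and return
        # the transcript it sends back as plain text.
        req = urllib.request.Request(RECOGNIZER_URL, data=audio_bytes,
                                     headers={"Content-Type": "audio/wav"})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    def voice_search_url(audio_bytes):
        transcript = recognize(audio_bytes)      # e.g. "where's the nearest starbucks"
        return SEARCH_URL + urllib.parse.quote(transcript)

    # with open("query.wav", "rb") as f:
    #     print(voice_search_url(f.read()))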
Other similar services:
Google Android has its own speech recognition engine, which is used by its VoiceDialer application.
You can check out the code at http://source.android.com/download
It is licensed under the Apache License, which is compatible with the GPL.
Google is now using speech recognition to make the spoken content of political videos searchable, from the Google blog:
Today [July 14, 2008], the Google speech team (part of Google Research) is launching the Google Elections Video Search gadget [view as a standalone page], our modest contribution to the electoral process. With the help of our speech recognition technologies, videos from YouTube's Politicians channels are automatically transcribed from speech to text and indexed. Using the gadget you can search not only the titles and descriptions of the videos, but also their spoken content. Additionally, since speech recognition tells us exactly when words are spoken in the video, you can jump right to the most relevant parts of the videos you find.
You basically enter your search term, and the search results section gives you a list of videos to select from (most recent first) with the number of mentions of the search term in each. You then select a video and the service highlights approximately where in the video your search term occurred.
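Conceptually, this amounts to an inverted index in which every recognized word carries a timestamp. A toy sketch of the idea (the videos, words, and timings below are made up for illustration):

    from collections import defaultdict

    # Toy sketch of time-indexed transcript search: every recognized word is
    # stored with the video it came from and the time (in seconds) at which it
    # was spoken, so a query can jump straight to the relevant moments.

    transcripts = {
        "video_a": [("healthcare", 12.4), ("reform", 13.0), ("economy", 95.2)],
        "video_b": [("economy", 8.7), ("jobs", 9.1), ("healthcare", 240.5)],
    }

    index = defaultdict(list)            # word -> list of (video, timestamp) hits
    for video, words in transcripts.items():
        for word, ts in words:
            index[word].append((video, ts))

    def search(term):
        # Return all hits so the caller can show a mention count per video and
        # jump to each timestamp.
        return sorted(index.get(term.lower(), []))

    print(search("healthcare"))          # [('video_a', 12.4), ('video_b', 240.5)]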
This is similar to other services such as:
Jott will have transcriptions of all those jotts (notes) and the sound recordings to go with them.
It seems that Jott will have quite an acoustic model built up for later on, when they do make the switch from human to machine transcription.
Fred
Just a correction to the last poster.
Jott is just a front end to a bunch of people in India who then transcribe the voice message. There is some machine speech recognition, I'm sure, to handle the routing of the 'jott', but the jott itself is transcribed by people.
It's the best kind of speech recognition available today, the kind done by a human being.
Jott is a service that "converts your voice into emails, text messages, reminders, lists and appointments". You just call a number with your phone (landline or mobile), speak your message, and it converts it to an email and sends it where you want it to go. They offer this service for free (they are still in beta) - no invite required.
What is amazing from a speech recognition perspective is that they can recognize an unconstrained vocabulary over the phone. Up until a short time ago, most phone-based speech recognition systems could only recognize predefined combinations of certain key words (described in a grammar file) - think of speech-based IVR systems where you say "tech support" to be routed to a support representative. Speech recognition using unconstrained grammars was limited to desktop-based SR systems because of the size of the language model required and the amount of speech audio needed to create a viable acoustic model.
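To make the contrast concrete, here is a toy sketch of the constrained case; the phrases and routing targets are purely illustrative, not an actual IVR grammar:

    # Toy sketch of a grammar-constrained IVR: the recognizer only has to pick
    # one of a handful of predefined phrases, so routing is a simple lookup.
    # An unconstrained service like Jott instead has to turn arbitrary speech
    # into text, which needs a large-vocabulary language model and far more
    # acoustic training data.

    IVR_GRAMMAR = {
        "tech support": "route_to_support",
        "billing": "route_to_billing",
        "sales": "route_to_sales",
    }

    def route_call(recognized_phrase):
        return IVR_GRAMMAR.get(recognized_phrase.lower(), "route_to_operator")

    print(route_call("Tech Support"))    # route_to_support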
Very impressive.
Other similar services include: