VoxForge
You may have seen some posts on this site by Dr. Tony Robinson. I'd like to congratulate Tony on his recent appointment as Director of SpinVox's Advanced Speech Group. From this press release on the SpinVox site:
Robinson's remit will be to further build a team from the ASR expertise that is concentrated in the Cambridge area. Under his leadership, the SpinVox ASG will further develop the Voice Message Conversion System that is at the heart of SpinVox services.
We'd like to wish him luck, and thank him for all his help to the VoxForge community in the past year and a half.
Ken
This is an interesting article on building speech recognition features from zero crossings, rather than from the feature vectors traditionally used in speech recognition (such as the MFCCs we use with HTK/Julius). Shubhendu Trivedi was looking to create a speaker-dependent, isolated-word speech recognizer for an 8051 microcontroller, but traditional HMM approaches using MFCC-based feature vectors were too computationally intensive to run on this controller.
He found a paper that provided the solution. In it, the authors describe a way of computing the feature vector using only the zero crossings of the speech signal. Shubhendu says in his article:
This feature vector is basically the histogram of the time interval between successive zero-crossings of the utterance in a short time window. These feature vectors for each window are then combined together to form a feature matrix. Since we are dealing with only small time series (isolated words), we can employ Dynamic Time Warping to compare the input matrix with the stored reference matrices.
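The feature extraction and matching described in the quote can be sketched roughly as follows. This is a minimal desktop illustration in Python, not Shubhendu's 8051 code or the paper's exact method: the window size, hop, and log-spaced histogram bin edges are assumptions, and the function names are hypothetical. An 8051 implementation would use integer arithmetic and much smaller buffers.

```python
import math


def zero_crossing_features(signal, window_size=256, hop=128,
                           edges=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Return a feature matrix: one normalized histogram of the intervals
    (in samples) between successive zero crossings, per analysis window.
    Window size, hop and bin edges are illustrative assumptions."""
    matrix = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size]
        # Indices where the sign flips between consecutive samples.
        crossings = [i for i in range(1, len(frame))
                     if (frame[i - 1] < 0) != (frame[i] < 0)]
        # Time intervals between successive zero crossings.
        intervals = [b - a for a, b in zip(crossings, crossings[1:])]
        hist = [0] * (len(edges) - 1)
        for gap in intervals:
            for k in range(len(edges) - 1):
                if edges[k] <= gap < edges[k + 1]:
                    hist[k] += 1
                    break
        total = sum(hist)
        matrix.append([h / total if total else 0.0 for h in hist])
    return matrix


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature matrices."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


def recognize(signal, templates):
    """Return the word whose stored reference matrix is closest under DTW."""
    features = zero_crossing_features(signal)
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


# Usage with synthetic tones standing in for recorded words (an assumption
# for illustration; real templates would come from training utterances).
def tone(freq, n=2048, rate=8000):
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


templates = {"low": zero_crossing_features(tone(200)),
             "high": zero_crossing_features(tone(1500))}
print(recognize(tone(220), templates))  # low
```

Because each window is reduced to a short histogram, DTW only has to compare a handful of numbers per frame, which is what makes this kind of approach plausible on a small controller.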
Microsoft is working on a system to allow contextual ads to be served alongside video. The system uses speech recognition to identify the topic of a video. This would allow advertisers to display ads for sports gear alongside a video about soccer, or ads for furniture alongside a video about home improvement.
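As a rough illustration of matching a transcript to an ad category, here is a keyword-scoring sketch in Python. It assumes the recognizer has already produced a plain-text transcript; the categories, keyword sets, and the `pick_ad_category` name are hypothetical and say nothing about how Microsoft's system actually works.

```python
import re
from collections import Counter

# Hypothetical ad categories and trigger words, for illustration only.
AD_CATEGORIES = {
    "sports gear": {"soccer", "goal", "ball", "team", "match", "player"},
    "furniture": {"home", "kitchen", "renovation", "sofa", "table", "improvement"},
}


def pick_ad_category(transcript, categories=AD_CATEGORIES):
    """Score each category by how often its keywords occur in the transcript."""
    counts = Counter(re.findall(r"[a-z']+", transcript.lower()))
    scores = {name: sum(counts[word] for word in keywords)
              for name, keywords in categories.items()}
    best = max(scores, key=scores.get)
    # No keyword hits at all means no clear topic, so return no category.
    return best if scores[best] > 0 else None


print(pick_ad_category("the player kicked the ball into the soccer goal"))  # sports gear
```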
See the video at this link: One To Watch: Microsoft's Video Advertising Systems
From Slashdot article Open Source Speech Recognition:
"The first version of the open source speech recognition suite simon was released. It uses the Julius large vocabulary continuous speech recognition engine to do the actual recognition and the HTK toolkit to maintain the language model. These components are united under an easy-to-use graphical user interface. Simon can import dictionaries directly from wiktionary (a subproject of wikipedia) or from files formatted in the HADIFIX or HTK format, and grammar structures directly from personal texts. It also provides means to train the language model with new samples and add new words."
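For readers unfamiliar with the HTK dictionary format mentioned above, here is a minimal Python parser sketch. It assumes the basic HTK line layout `WORD [OUTSYM] PRONPROB PHONE1 PHONE2 ...`, where the bracketed output symbol and the pronunciation probability are optional; it ignores HTK's quoting and escape rules, and `parse_htk_dictionary` is a hypothetical name, not part of simon or HTK.

```python
def parse_htk_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phones).

    Assumes the simple HTK layout: WORD [OUTSYM] PRONPROB PHONE1 PHONE2 ...
    with the output symbol and probability both optional. Quoting and
    escapes are not handled.
    """
    lexicon = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        word, rest = tokens[0], tokens[1:]
        if rest and rest[0].startswith("[") and rest[0].endswith("]"):
            rest = rest[1:]  # drop the output symbol, e.g. "[]" or "[tomato]"
        if rest:
            try:
                float(rest[0])
                rest = rest[1:]  # drop the pronunciation probability
            except ValueError:
                pass  # first token is already a phone
        # Words may have several pronunciations, one per line.
        lexicon.setdefault(word, []).append(rest)
    return lexicon


sample = """HELLO hh ah l ow
HELLO hh eh l ow
SENT-END [] sil
TOMATO [tomato] 0.6 t ah m ey t ow
"""
print(parse_htk_dictionary(sample)["HELLO"])
```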
Here is an interesting list of services that let you convert your speech to an email or text message over your telephone.
SpinVox is a service that can be accessed from any phone. Speak Freely and SpinVox will convert your words into text and send them wherever you decide: mobile, inbox, TV screen, blog. They use a Voice Message Conversion System called ‘D2’ (the Brain), that takes spoken words and converts them to text.
MIT's Address Browser provides a speech recognition interface to Google Maps. From the site:
The AddressBrowser is a prototype speech-based interface that allows users to speak any city or address in the United States. You can say any valid address that follows a simple pattern, e.g., 32 Vassar Street in Cambridge, Massachusetts, the intersection of Main Street and Vassar Street in Cambridge, Massachusetts, or just Cambridge, Massachusetts. In order to be heard you will need a microphone connected to your computer. Press and hold the big green button and say your address. The button will turn red when it is recording your voice. Just speak your address naturally. Release the button after you have finished talking....
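The "simple pattern" the AddressBrowser accepts could be approximated with a small grammar. Below is a Python sketch that parses a recognized transcript into address fields, using regular expressions for the three example forms in the quote. It assumes the recognizer outputs house numbers as digits; the pattern names and `parse_address` are hypothetical, and a real system like MIT's would more likely use its grammar to constrain recognition itself rather than parse text afterwards.

```python
import re

# Hypothetical patterns mirroring the three example forms quoted above.
# Order matters: the bare "City, State" pattern is tried last.
PATTERNS = [
    ("street_address", re.compile(
        r"^(?P<number>\d+) (?P<street>.+?) in (?P<city>[^,]+), (?P<state>.+)$")),
    ("intersection", re.compile(
        r"^the intersection of (?P<street1>.+?) and (?P<street2>.+?) "
        r"in (?P<city>[^,]+), (?P<state>.+)$")),
    ("city", re.compile(r"^(?P<city>[^,]+), (?P<state>.+)$")),
]


def parse_address(utterance):
    """Return (kind, fields) for the first matching pattern, or None."""
    text = utterance.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.groupdict()
    return None


print(parse_address("32 Vassar Street in Cambridge, Massachusetts"))
```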
It appears to use a Java applet as the client front end, with a back-end speech recognition server.
Android is a new operating system for cell phones, designed by Google engineers. Unlike most existing cell phone operating systems, it'll be friendly to applications created by outside software developers.
Basically, Android phones (scheduled for 2008) will run on a Linux kernel, with Java-based apps.
Setup of the "early look" SDK is quite easy (if you know Eclipse). I was able to create the HelloAndroid app without much problem. When you press Run, the Android Emulator starts up, and you can see the results on the screen.
Here is what I have gleaned from comments from Dan Morrill on the Android Developer Google Group list:
The Nabaztag robotic rabbit is a wireless Internet contraption that can speak, move its ears and flash its lights in response to user inputs, and includes speech recognition.
From the site's "Voice Recognition FAQ" (sic):
Services available with voice recognition:
Weather
Air Quality
Paris Traffic
Stock ticker (free and full)
Radio
The commands it recognizes are pretty simplistic:
Weather
Air or Smog
Traffic
Market
Radio
but it seems to be an interesting harbinger of what consumer speech recognition appliances might look like in the not too distant future.
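With a vocabulary this small, mapping a recognized word to a service is essentially a table lookup. The Python sketch below assumes the recognizer returns plain text; the pairing of spoken commands to the FAQ's services (for example, "Market" to the stock ticker) is inferred from the names and is not taken from Nabaztag documentation, and `dispatch` is a hypothetical name.

```python
# Assumed mapping from the spoken commands to the FAQ's services;
# inferred from the names, not from Nabaztag documentation.
COMMANDS = {
    "weather": "Weather",
    "air": "Air Quality",
    "smog": "Air Quality",
    "traffic": "Paris Traffic",
    "market": "Stock ticker",
    "radio": "Radio",
}


def dispatch(recognized_text):
    """Return the service for the first known command word, or None."""
    for word in recognized_text.lower().split():
        service = COMMANDS.get(word.strip(".,!?"))
        if service:
            return service
    return None


print(dispatch("Smog"))  # Air Quality
```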
In a PC World interview with Android co-founder Rich Miner, Miner says:
When we looked at the other [mobile] Linux activities out there, oftentimes they're initiatives that are based on Linux but their resulting platforms aren't completely open. Or they're completely open and they're Linux, but they're missing most of the things that [Android has]. They probably don't have video codecs, MIDI sequencer, speech recognition. So they're not a complete phone stack. The goal with Android was to build into it everything you needed to release a phone: an entire stack to build a competitive smartphone or high-end feature phone.
Although Android is to be released under the Apache License, the speech recognition component likely will *not* be, since Nuance is also an Open Handset Alliance partner. The Android™ SDK is set to be released on November 12, 2007.
From a Globe and Mail article:
A reputed leader of Colombia's biggest drug cartel radically altered his facial appearance with repeated plastic surgeries. But his own words gave him away, thanks to advanced voice recognition technology that has become a key tool in the war against drugs and terrorism. U.S. agents confirmed the identity of Juan Carlos Ramirez Abadia using the equivalent of a vocal fingerprint, his attorney said Friday.
Background on voice recognition (from Wikipedia):
Speaker recognition, or voice recognition is the task of recognizing people from their voices. Such systems extract features from speech, model them and use them to recognize the person from his/her voice. Note that strictly speaking there is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said). Generally these two terms are frequently confused and voice recognition is used as a synonym for speech recognition instead.
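The "extract features, model them, recognize" pipeline for speaker identification can be sketched minimally. This Python example assumes feature vectors (for instance MFCC frames) have already been extracted, and models each speaker with a single diagonal-covariance Gaussian; practical systems typically use richer models such as Gaussian mixture models. The function names and the toy two-dimensional data are hypothetical.

```python
import math


def enroll(frames):
    """Model a speaker as a diagonal Gaussian over feature vectors."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # A small floor keeps variances positive for tiny enrollment sets.
    variances = [max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-6)
                 for d in range(dims)]
    return means, variances


def log_likelihood(frames, model):
    """Average per-frame log-likelihood of the frames under the model."""
    means, variances = model
    total = 0.0
    for f in frames:
        for x, mu, var in zip(f, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total / len(frames)


def identify(frames, models):
    """Return the enrolled speaker whose model best explains the frames."""
    return max(models, key=lambda name: log_likelihood(frames, models[name]))


# Toy enrollment data standing in for real feature vectors.
models = {
    "alice": enroll([[1.0, 2.0], [1.2, 2.1], [0.9, 1.9]]),
    "bob": enroll([[5.0, -1.0], [5.3, -0.8], [4.8, -1.2]]),
}
print(identify([[1.1, 2.0]], models))  # alice
```

Speaker verification (confirming a claimed identity, as in the case above) would instead compare the claimed speaker's score against a threshold rather than picking the best of several models.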
--- (Edited on 8/10/2007 10:47 pm [GMT-0400] by kmaclean) ---