VoxForge
Hi all!
I have just started with speech recognition (SR). I was thinking of programming a small demo app (with Sphinx) for a PDA or similar device.
At this point I wonder: why are there no acoustic samples for such situations? Is it true that the less noise there is in the samples, the better the recognition results, or is it advisable to include audio with the 'normal' noise for the target situation?
Thanks!!
--- (Edited on 11/27/2007 4:47 am [GMT-0600] by coriscow) ---
>The less the noise present in the samples the better the recognition results or
>it's advisable to include audio with 'normal' (for the target situation) ???
The short answer to this question is: it depends.
The general rule (see David Gelbart's entry in the Creating a cheap "recording studio" post) is:
The more closely the training data matches the data provided by users, the better speech recognition systems tend to work.
But doing this limits the potential uses of your corpus. Take a corpus recorded over the telephone, for example. Current telephony audio is limited to 8kHz sampling rate, at 8 bits per sample. It would not work very well in a command and control application on a desktop with very different audio capabilities/characteristics. It is also very costly to create corpora targeted to specific applications.
To create a more general purpose corpus, you try to minimize the noise, and record at a high sampling rate and a high number of bits per sample. Then you can convert your corpus to the target by down-sampling the audio and running it through the target codec you are planning to use, or even pushing the recordings through the type of connection you are planning to use.
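The conversion step described above can be sketched roughly as follows. This is only an illustration, not anything from the VoxForge tools: it assumes a 16 kHz source and an 8 kHz, 8-bit target, the function names (lowpass_fir, downsample_by_2, mulaw_encode, mulaw_decode) are made up for the example, and the mu-law step uses the textbook companding formula rather than a bit-exact telephony codec. For real work you would normally use a tested resampler and the actual target codec.

```python
import math

MU = 255  # companding constant used by 8-bit mu-law


def lowpass_fir(num_taps=31, cutoff=0.23):
    """Windowed-sinc low-pass filter.

    cutoff is a fraction of the INPUT sampling rate; 0.23 keeps the
    passband a little below the new Nyquist frequency (0.25) when halving.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * hamming)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at 0 Hz


def downsample_by_2(samples, taps):
    """Low-pass filter, then keep every second sample (e.g. 16 kHz -> 8 kHz)."""
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), 2):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(samples):  # treat samples outside the signal as zero
                acc += t * samples[j]
        out.append(acc)
    return out


def mulaw_encode(x):
    """Map a sample in [-1, 1] to an 8-bit code (0..255) via mu-law companding."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))


def mulaw_decode(code):
    """Approximate inverse of mulaw_encode."""
    y = code / 255 * 2 - 1
    return math.copysign((math.pow(1 + MU, abs(y)) - 1) / MU, y)


# Usage: a synthetic 16 kHz test tone stands in for a clean recording.
taps = lowpass_fir()
wideband = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
narrowband = downsample_by_2(wideband, taps)          # now 8 kHz
codes = [mulaw_encode(s) for s in narrowband]         # now 8 bits per sample
degraded = [mulaw_decode(c) for c in codes]           # what the recognizer would hear
```

The low-pass filter comes before the decimation so that content above the new Nyquist frequency does not alias into the 8 kHz signal.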
See these discussion threads for more (and varying opinions) on this topic:
Looking for Best Practices for Collecting Speech for a Free GPL Speech Corpus www.isca-students.org website, Ebru Arisoy's post
Hope that helps!
Ken
--- (Edited on 11/27/2007 10:46 am [GMT-0500] by kmaclean) ---
Hi coriscow,
>I have just started with that of SR. I was thinking of programing (sphinx) a
>small demo app for a pda or so.
>At this point I wonder why are there no acoustic samples for such situations??
You might check the CMU PDA Database. And the AudioSources page on the VoxForgeDevWiki.
Ken
--- (Edited on 11/27/2007 10:50 am [GMT-0500] by kmaclean) ---
Hi Ken!
Thanks a lot for your useful comments and links! The problem is that I was looking for a Spanish (Castilian) corpus, and it is quite difficult to find an open source one (the hub4 at itesm.mx is no longer available, is it?). I think I'll do my best to record my own, so as to start a Spanish corpus thread.
Best regards!
--- (Edited on 11/28/2007 3:33 am [GMT-0600] by Visitor) ---
"Current telephony audio is limited to 8kHz sampling rate, at 8 bits per sample. It would not work very well in a command and control application on a desktop with ..."
I think it might not be that bad, especially if a channel normalization technique such as cepstral mean subtraction (also known as cepstral mean normalization) is used to minimize the differences between the two types of audio channel.
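A minimal sketch of cepstral mean subtraction, to show why it helps with channel mismatch. The assumption here is that a fixed (linear, time-invariant) channel shows up approximately as a constant added to every frame's cepstral vector, so subtracting the per-utterance mean removes it. The function name and the toy numbers are made up for the example; real front ends compute the cepstra from audio first.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over one utterance.

    frames: list of cepstral vectors (one list of floats per frame).
    """
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - mean[d] for d in range(dims)] for f in frames]


# Toy utterance: three frames, two cepstral coefficients each.
desktop = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# Same utterance through a different channel, modelled as a constant offset.
channel_offset = [0.5, -1.0]
telephone = [[c + o for c, o in zip(f, channel_offset)] for f in desktop]

norm_desktop = cepstral_mean_subtraction(desktop)
norm_telephone = cepstral_mean_subtraction(telephone)
# After normalization the two versions agree, because the offset is in the mean.
```

This only removes the stationary part of the channel difference. Bandwidth loss, codec distortion and additive noise are not a constant cepstral offset, so they are not undone by it.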
A 16 kHz sampling rate can be expected to work better than an 8 kHz sampling rate, all else being equal, but the change may not be that big (see, e.g., http://www.cs.cmu.edu/~robust/Papers/ica94p.pdf).
"Then you can convert your corpus to the target by down-sampling the audio and..."
It can also be useful to add noise or reverberation to the training data to make it match the target better.
A Google Scholar search for: stahl speech recognition
will turn up some papers on the subject co-authored by Volker Stahl. Other authors have also published on this subject but I can't remember any names offhand.
This approach may not be quite as good as collecting targeted data. For example, adding noise to clean recordings does not take into account the Lombard effect (i.e., people speaking differently in a noisy environment in an effort to overcome the noise). And reverberation is usually simulated using a single, fixed room impulse response for a particular room, while in reality the reverberation can vary depending on the speaker's position, the speaker's orientation, and so on (and using a room impulse response to model the room also carries some assumptions with it). But as Ken pointed out, collecting targeted data can be expensive.
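As a rough sketch of the two simulation steps discussed above: noise can be scaled to a chosen signal-to-noise ratio and added to the clean recording, and reverberation can be simulated by convolving the recording with a room impulse response. The function names, the random stand-in signals and the toy impulse response are all assumptions made for the example, and it has exactly the limitations described above (no Lombard effect, one fixed impulse response).

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so speech power / noise power = snr_db.

    Assumes noise is at least as long as speech and is not all zeros.
    """
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def convolve(signal, impulse_response):
    """Direct-form convolution; output has len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Stand-ins for real data: a tone as "clean speech", white noise as "noise".
random.seed(0)
clean = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
noise = [random.uniform(-1.0, 1.0) for _ in range(800)]

# Toy impulse response: direct path plus two decaying echoes (hypothetical room).
room_ir = [1.0] + [0.0] * 39 + [0.5] + [0.0] * 39 + [0.25]

reverberant = convolve(clean, room_ir)
noisy_reverberant = mix_at_snr(reverberant, noise + noise, snr_db=10)
```

For long recordings and measured impulse responses, an FFT-based convolution from a numerical library would be much faster than the direct loop shown here.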
Regards,
David
--- (Edited on 11/29/2007 5:52 pm [GMT-0600] by DavidGelbart) ---
I just came across a paper that may be of interest to you called: Embedded Julius: Continuous Speech Recognition Software for Microprocessor.
They basically developed an embedded version of Julius for use on microprocessors. From the paper:
Julius is open source CSR software, and has been used by many researchers and developers in Japan as a standard decoder on PCs. Julius works as a real time decoder on a PC. However further computational reduction is necessary to use Julius on a microprocessor. Further cost reduction is needed. For reducing cost of calculating pdfs (probability density function), Julius adopts a GMS (Gaussian Mixture Selection) method. In this paper, we modify the GMS method to realize a continuous speech recognizer on microprocessors. This approach does not change the structure of acoustic models in consistency with that used by conventional Julius, and enables developers to use acoustic models developed by popular modeling tools [i.e HTK ...]
[...] Finally, the embedded version of Julius was tested on a developmental hardware platform named “T-engine”. The proposed method showed 2.23 of RTF (Real Time Factor) [...]
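To put the quoted figure in perspective: real time factor is usually defined as processing time divided by the duration of the audio, so a value above 1 means the decoder runs slower than real time. The small sketch below just does that arithmetic; the function names and the 10-second utterance are assumptions for illustration, not numbers from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded."""
    return processing_seconds / audio_seconds


def processing_time(rtf, audio_seconds):
    """Decoding time implied by a given RTF for a given audio length."""
    return rtf * audio_seconds


# With the RTF of 2.23 quoted above, a hypothetical 10-second utterance
# would take about 22.3 seconds to decode on that platform.
seconds_needed = processing_time(2.23, 10.0)
```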
Ken
--- (Edited on 11/30/2007 2:05 pm [GMT-0500] by kmaclean) ---
"for use on microprocessors"
I think the usual American English term is "embedded processor" or in some cases "microcontroller". I guess the authors made a mistake in their English when they used the term "microprocessor".
--- (Edited on 11/30/2007 3:50 pm [GMT-0600] by DavidGelbart) ---
"A 16 kHz sampling rate can be expected to work better than an 8 kHz sampling rate, all else being equal, but the change may not be that big (see, e.g., http://www.cs.cmu.edu/~robust/Papers/ica94p.pdf)."
Just to make sure I was clear, I was not trying to say that the table in that paper fully shows the performance cost of training a desktop application on telephone data. There are more differences than just the sampling rate.
--- (Edited on 11/30/2007 3:53 pm [GMT-0600] by DavidGelbart) ---