Speech Recognition Engine Files
Speech Recognition engines require two types of files to recognize speech. They require an Acoustic Model, which is created by taking audio recordings of speech and their transcriptions and 'compiling' them into statistical representations of the sounds that make up each word. They also require a Language Model or Grammar file. A Language Model is a file containing the probabilities of sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words. Language Models are used for dictation applications, whereas Grammars are used in Desktop Command and Control or Telephony IVR type applications.
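To make the difference between a Language Model and a Grammar concrete, here is a minimal sketch in Python. It is a toy illustration only, not the file format of any particular engine: the corpus, the phrase list, and the function names bigram_probabilities and grammar_accepts are all hypothetical, and real Language Models use far larger corpora and smoothing.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Estimate P(next word | current word) by counting adjacent word pairs."""
    pair_counts = Counter()
    first_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    return {pair: count / first_counts[pair[0]]
            for pair, count in pair_counts.items()}

# Hypothetical toy corpus standing in for dictation training text.
corpus = ["open the file", "open the window", "close the file"]
language_model = bigram_probabilities(corpus)

# Hypothetical grammar: a small, fixed set of allowed word combinations,
# as might be used for command and control.
grammar = {"open file", "close file", "call home"}

def grammar_accepts(phrase):
    """A grammar only accepts phrases it lists; anything else is rejected."""
    return phrase.lower() in grammar

print(language_model[("the", "file")])
print(grammar_accepts("open file"), grammar_accepts("open the window"))
```

The Language Model assigns a probability to any word sequence it has statistics for, which suits open-ended dictation, while the Grammar simply accepts or rejects a phrase, which suits a small set of commands.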
Acoustic Models
Audio can be encoded at different Sampling Rates (i.e. samples per second - the most common being: 8kHz, 16kHz, 32kHz, 44.1kHz, 48kHz and 96kHz), and different Bits per Sample (the most common being: 8-bits, 16-bits or 32-bits). Speech Recognition engines work best if the Acoustic Model they use was trained with speech audio which was recorded at the same Sampling Rate/Bits per Sample as the speech being recognized.
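As a rough sketch of how you might check whether a recording matches the format an Acoustic Model was trained on, the following uses Python's standard wave module to read the Sampling Rate and Bits per Sample from an uncompressed WAV file. The helper names wav_format and matches_model are made up for this illustration, and the approach assumes plain PCM WAV input.

```python
import wave

def wav_format(path):
    """Return (sampling rate in Hz, bits per sample) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() reports bytes per sample, so convert to bits.
        return wav.getframerate(), wav.getsampwidth() * 8

def matches_model(path, model_rate_hz, model_bits):
    """True if the file's format equals the format the model was trained on."""
    return wav_format(path) == (model_rate_hz, model_bits)
```

For example, matches_model("utterance.wav", 16000, 16) would tell you whether a file (the name here is a placeholder) is suitable for a model trained on 16kHz/16-bit audio.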
Telephony
For Telephony, the limiting factor is the bandwidth at which speech can be transmitted. For example, your standard land-line telephone only has a bandwidth of 64kbps at a sampling rate of 8kHz and 8-bits per sample (8000 samples per second * 8-bits per sample = 64000bps = 64kbps). Therefore, for Telephony-based speech recognition, you need Acoustic Models trained with 8kHz/8-bit speech audio files.
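The bandwidth arithmetic above can be written as a short Python sketch. The function name pcm_bitrate_bps is hypothetical, and the calculation assumes uncompressed single-channel audio unless a channel count is given.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bits per second needed to carry uncompressed audio."""
    return sample_rate_hz * bits_per_sample * channels

telephone = pcm_bitrate_bps(8000, 8)    # 8kHz, 8-bit land-line audio
desktop = pcm_bitrate_bps(16000, 16)    # 16kHz, 16-bit desktop audio

print(telephone)  # 64000 bps, i.e. 64kbps
print(desktop)    # 256000 bps, i.e. 256kbps
```

The same formula shows why higher sampling rates and bit depths cost more to transmit and process: 16kHz/16-bit audio carries four times the data of 8kHz/8-bit audio.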
For Voice over IP ("VoIP"), the codec used usually determines the sampling rate/bits per sample of speech transmission. If you use a codec with a higher sampling rate/bits per sample for speech transmission (to improve the sound quality), then your Acoustic Model must be trained with audio data that matches that sampling rate/bits per sample. In the specific case of the Asterisk PBX system, audio is converted internally to 8kHz/16-bits regardless of the codec's sampling rate/bits per sample. Therefore, Asterisk needs an Acoustic Model trained with 8kHz/16-bit audio data.
Desktop
For speech recognition on your PC, the limiting factor is your sound card. Most sound cards today can record at sampling rates of between 16kHz and 48kHz, with bit depths of 8 to 16 bits per sample, and play back at up to 96kHz.
As a general rule, a Speech Recognition Engine works better with Acoustic Models trained with speech audio data recorded at higher sampling rates/bits per sample. But using audio with too high a sampling rate/bits per sample can slow your recognition engine down. You need a balance. Thus, for desktop speech recognition, the current standard is Acoustic Models trained with speech audio data recorded at 16kHz/16-bits per sample.
You can still use Acoustic Models trained at 8kHz for desktop applications, but you generally need at least twice (and usually more) the audio data to get recognition results comparable to those of Acoustic Models trained at 16kHz.
Additional information can be found at the following link: