VoxForge
Sorry, I was talking about a pretrained acoustic model.
--- (Edited on 1/20/2010 02:22 [GMT+0300] by nsh) ---
> I don't have the time or the inclination to collect more speech than we need. If we have enough English speech now, please let me know, so I can cut a release, and move on to other things.
Hi Ken
Well, I certainly didn't mean to say we should stop this. It's definitely a very important project that should go on forever. Even if it doesn't bring much in terms of accuracy, think about its social role. By encouraging people to get involved in the open data domain in a way that requires little effort, VoxForge does very important work.
> Wouldn't more data reduce the impact of outliers/errors in transcriptions or pronunciation, or non-speech noise? i.e. the theory being that rather than spending lots of time manually transcribing/reviewing a small database of speech, you just collect lots of it, and hope that the statistical analysis performed during the acoustic model training process drops the outliers.
The issue here is that the mathematical model of speech being trained is not very consistent with speech itself. This means that with a large amount of data, training converges to a solution that is optimal under the model but not optimal for the user in terms of accuracy. That's the reason discriminative methods are becoming popular. This article is not very related, but attracted my attention recently
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.4227
> Is the only way to find out how much speech we really need for a given domain to create a test set (as you have suggested many times...) of the target domain (e.g. North American English) and keep collecting speech until we gain no more improvement in recognition using the test set? Once that is accomplished, then we should focus our attention on AM and LM adaptation frameworks as described in your post: How to create a speech recognition application for your needs?
It will never be fully accomplished, since there will always be room for improvement, but I like your idea. Once we have "We recognize with 90% accuracy" on the front page, it will be far more encouraging for our visitors :)
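To make the "keep collecting until the test set stops improving" idea concrete, here is a minimal sketch of a stopping rule. Everything in it is an assumption for illustration: the function name `first_plateau`, the `min_gain` and `patience` thresholds, and above all the accuracy figures, which are made up and are not VoxForge measurements.

```python
def first_plateau(points, min_gain=0.5, patience=2):
    """Return the corpus size after which test accuracy stopped improving.

    points   -- list of (hours_of_speech, accuracy_percent) pairs measured on
                one fixed test set, sorted by hours (hypothetical input format)
    min_gain -- smallest absolute accuracy gain that still counts as progress
    patience -- consecutive steps without progress before we call it a plateau
    Returns None if the curve is still rising.
    """
    stalled = 0
    last_improved = points[0][0]
    for (_, prev_acc), (hours, acc) in zip(points, points[1:]):
        if acc - prev_acc < min_gain:
            stalled += 1
            if stalled >= patience:
                return last_improved
        else:
            stalled = 0
            last_improved = hours
    return None


# Made-up learning curve, for illustration only.
curve = [(10, 60.0), (20, 68.0), (40, 73.0), (80, 75.0), (160, 75.3), (320, 75.4)]
plateau_hours = first_plateau(curve)       # plateau detected after the 80-hour point
still_rising = first_plateau(curve[:3])    # None: every step still gains accuracy
```

The test set has to stay fixed across all measurements, otherwise the gains between steps are not comparable.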
--- (Edited on 1/19/2010 6:41 pm [GMT-0600] by Visitor) ---
>This article is not very related, but attracted my attention recently
>http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.4227
Found some background (from this paper: Acoustic Model Clustering Based on Syllable Structure; by Izhak Shafran & Mari Ostendorf) that helped me understand your link:
In many ASR systems, the acoustic variation of words are modeled at two levels - the pronunciation model which maps word sequences to phonemes, and the acoustic model which maps phoneme sequences to multivariate acoustic models. Work with simulated data which was produced using the acoustic models of speech, have pointed to pronunciation variability as a key problem in recognizing conversational speech [...] However, the work on pronunciation modeling in terms of phoneme-level substitutions, deletions and insertions has so far only yielded small performance gains [...]
Conventionally, phone-level acoustic variation has been captured by conditioning the acoustic models for a phoneme on the context of neighboring phonemes in the hypothesized sequence. Typically, in large vocabulary ASR, phonemes with immediate neighbors (triphones) [...] are used. Conditioning only on phonemic context does not capture the acoustic variation of conversational speech fully [...]
Our hypothesis is that, in English, syllable structure is also useful in modeling the variation not accounted for by phoneme context. Consider the phoneme "t" (in the context "iy t er") in "beater", "beat Ernest" and "return". Even though it is the same triphone, the articulation of phone "t" in the three contexts is distinctly different - in the first it is flapped, in the second it is an unreleased closure and in the third it is a closure plus a release. These differences are closely related to syllable structure [...] The use of syllable structure is motivated in part by results from psychoacoustic studies, which argue for the syllable as a unit of perception [...]
Whereas a phoneme, as defined in Wikipedia:
[...] is a group of slightly different sounds which are all perceived to have the same function by speakers of the language or dialect in question. An example of a phoneme is the /k/ sound in the words kit and skill. [...] Even though most native speakers don't notice this, in most dialects, the k sounds in each of these words are actually pronounced differently: they are different speech sounds, or phones [...] . In our example, the /k/ in kit is aspirated, [kʰ], while the /k/ in skill is not, [k]. The reason why these different sounds are nonetheless considered to belong to the same phoneme in English is that if an English-speaker used one instead of the other, the meaning of the word would not change: saying [kʰ] in skill might sound odd, but the word would still be recognized.
The paper you refer to (Moving Beyond the `Beads-On-A-String' Model of Speech) goes even further:
[...] several researchers have recently argued for the syllable as an alternative to the phoneme for representing speech. In this paper, we take a different tack and argue for finer-grained low level representation, incorporating dependence on syllable (and higher level) structure via context conditioning.
They then cite two very different approaches:
1. data-driven:
Acoustically derived sub-word units (ASWUs) represent a data driven approach to defining the sub-word units of speech. Recognition system design involves a combination of automatic segmentation into stationary regions or ‘segments’, clustering the segments based on acoustic similarity, and dictionary design.
2. linguistically based:
In linguistics, it is [linguistic] features and not phonemes that are viewed as the fundamental units of speech, where phones are specified (or coded) in terms of distinctive features. [...] For the most part, distinctive features are related to the manner in which a speech sound is produced (the degree of constriction in the vocal tract), the particular articulator that is used (glottis, soft palate, lips and tongue blade, body and root) and/or place of constriction, and how an articulator is used to produce the sound.
Although I'm not sure I understand exactly how the data-driven or linguistically-based approaches might be implemented in Sphinx or HTK/Julius, it seems from your blog post (Moving Beyond the `Beads-On-A-String') that you're leaning toward the syllabic approach, which seems like it could be implemented with a pronunciation dictionary that uses syllables rather than phonemes.
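As a rough sketch of what such a syllable-based dictionary could look like, the snippet below merges the phonemes of each syllable into a single unit. It assumes the syllable boundaries are already marked by hand with "." in the pronunciation string; the boundary placement, the underscore unit names, and the helper `to_syllable_units` are all hypothetical, and whether Sphinx or HTK/Julius would train well on such units is a separate question.

```python
def to_syllable_units(pronunciation, boundary="."):
    """Merge the phonemes between boundary markers into one unit per syllable."""
    units, current = [], []
    for symbol in pronunciation.split():
        if symbol == boundary:
            if current:
                units.append("_".join(current))
                current = []
        else:
            current.append(symbol)
    if current:
        units.append("_".join(current))
    return units


# Phoneme-level entries with hand-marked (assumed) syllable boundaries.
phoneme_dict = {
    "BEATER": "B IY . T ER",
    "RETURN": "R IY . T ER N",
}

# Syllable-level dictionary: each word maps to a list of syllable units.
syllable_dict = {word: to_syllable_units(pron) for word, pron in phoneme_dict.items()}
# BEATER -> ['B_IY', 'T_ER'], RETURN -> ['R_IY', 'T_ER_N']
```

With these boundaries, the "t" of the two words lands in different units (T_ER vs. T_ER_N), even though its phoneme context "iy t er" is the same in both.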
Ken
--- (Edited on 1/22/2010 10:57 pm [GMT-0500] by kmaclean) ---
Interesting article supporting your argument: DEPLOYING GOOG-411: EARLY LESSONS IN DATA, MEASUREMENT, AND TESTING. From the article:
Interestingly, recognition performance does not increase dramatically with the amount of training data (8% absolute CA [Correct Accept] increase at 10% FA [False Accept] for a factor 64 increase in training size). Part of the reason may be that the training data is well matched to the test set, both phonetically and acoustically (the same users may even appear in both training and testing, in different calls of course, but probably on the same device, and sometimes speaking the same query). Another reason may simply be that we haven’t explored that space much yet
Another interesting factoid is that they use MapReduce for acoustic model training.
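To put the quoted figures in perspective, here is a back-of-the-envelope calculation of the gain per doubling of training data. It assumes the gain is spread evenly over the doublings (i.e. linear in the log of corpus size), which is my simplification and not something the article states; the function name is hypothetical.

```python
import math


def gain_per_doubling(total_gain, size_factor):
    """Average absolute gain per doubling, assuming a log-linear curve."""
    doublings = math.log2(size_factor)  # a factor of 64 is 6 doublings
    return total_gain / doublings


# Figures from the quote: 8% absolute CA increase for a 64x larger training set.
per_doubling = gain_per_doubling(8.0, 64)  # about 1.33 points per doubling
```

So under that assumption, each doubling of the training data buys only a little over one point of absolute Correct Accept rate.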
--- (Edited on 3/11/2010 11:07 pm [GMT-0500] by kmaclean) ---
Newer technologies have since emerged from this. During my uni days, we used a Raspberry Pi for home automation with speech recognition.
--- (Edited on 10/14/2021 1:27 am [GMT-0500] by sfssZSDSER32) ---