Acoustic Model Discussions

Re: Acoustic model testing
User: tpavelka
Date: 4/7/2009 5:00 am

Here is the test with the testing set excluded from the training:

SENT: %Correct=75.00 [H=75, S=25, N=100]
WORD: %Corr=93.97, Acc=93.09 [H=857, D=9, S=46, I=8, N=912]

Slightly lower than the previous one, but still surprisingly high. I guess I have to check my training process, because my acoustic model's results on the same test are much lower (see my previous posts).

@nsh: interesting results from DNS, so it seems that most of the power of commercial dictation comes from speaker adaptation.

 

--- (Edited on 4/7/2009 5:00 am [GMT-0500] by tpavelka) ---

Re: Acoustic model testing
User: tpavelka
Date: 4/14/2009 5:54 am

I have finally fully automated my training process and figured out why my accuracy was lower than the one achieved with the official VoxForge acoustic model. Originally I did not use the sp models between words, and I omitted the forced Viterbi alignment that fixes problems with multiple pronunciations. Adding both has somewhat increased the accuracy, but the main improvement comes from the choice of features. Here are the results from experiments with various types of features:

MFCC_0_D_N_Z
SENT: %Correct=77.48 [H=86, S=25, N=111]
WORD: %Corr=93.94, Acc=92.75 [H=946, D=5, S=56, I=12, N=1007]

MFCC_0_D_A_Z
SENT: %Correct=43.00 [H=43, S=57, N=100]
WORD: %Corr=88.16, Acc=85.96 [H=804, D=12, S=96, I=20, N=912]

MFCC_0_D_A_N_Z
SENT: %Correct=50.00 [H=50, S=50, N=100]
WORD: %Corr=88.05, Acc=85.31 [H=803, D=13, S=96, I=25, N=912]

MFCC_0_D_Z
SENT: %Correct=68.00 [H=68, S=32, N=100]
WORD: %Corr=93.20, Acc=92.76 [H=850, D=8, S=54, I=4, N=912]

MFCC_0_D
SENT: %Correct=57.00 [H=57, S=43, N=100]
WORD: %Corr=80.59, Acc=79.50 [H=735, D=67, S=110, I=10, N=912]
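For anyone reading the WORD lines above: as far as I know, HResults derives both percentages from the counts in the brackets, with %Corr = H/N and Acc = (H - I)/N, where N = H + D + S. Here is a minimal sketch that recomputes them; the function name `hresults_scores` is made up for illustration and is not part of HTK.

```python
def hresults_scores(H, D, S, I, N):
    """Return (%Corr, Acc) in percent from HResults-style word counts.

    H = hits, D = deletions, S = substitutions, I = insertions,
    N = number of words in the reference transcriptions.
    """
    if H + D + S != N:
        raise ValueError("expected H + D + S == N")
    corr = 100.0 * H / N        # insertions are ignored here
    acc = 100.0 * (H - I) / N   # insertions are penalised here
    return corr, acc


# Counts taken from the MFCC_0_D_Z result above.
corr, acc = hresults_scores(H=850, D=8, S=54, I=4, N=912)
print(f"%Corr={corr:.2f}, Acc={acc:.2f}")  # %Corr=93.20, Acc=92.76
```

The gap between %Corr and Acc is just I/N, so a model that inserts many spurious words can look fine on %Corr and still do badly on Acc.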

First, here is my understanding of the various parameters:

_0 use zeroth cepstral coefficient - this can be interpreted as energy (i.e. sound volume) and can be used instead of _E which is computed differently. I don't know whether _0 or _E is more precise.

_D so-called delta coefficients. These can be seen as the first difference with respect to time, although the actual computation is more complicated (eq. 5.16 in the HTKBook). Delta coefficients provide information about the surrounding frames and thus should improve recognition accuracy. It is not possible to simply glue several consecutive frames together, because the elements of the feature vector should not be correlated; this requirement comes from the use of diagonal covariance matrices in the Gaussian mixtures.

_A so-called acceleration coefficients, computed as the delta of the delta coefficients.

_N absolute energy suppression - the way this works is that it throws away the first coefficient (added by either _E or _0) but keeps its delta and acceleration. I don't know the reasons for doing this.

_Z zero mean normalization - in my opinion this is not a good thing, because it is done as a so-called local normalization: the mean of all feature vectors in a file is computed and then subtracted from every feature vector. This means that a file can only be processed further once it has been read in full. In live recognition you have to wait until the whole utterance is recorded before you can start recognition. There are techniques to avoid this by using some kind of running estimate of the mean, but if the mean is estimated incorrectly it has a huge effect on accuracy.
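To make the difference between the two kinds of mean normalization concrete, here is a rough NumPy sketch. It is not HTK code: the function names, the exponential update and the `alpha` value are my own assumptions, chosen only to illustrate the idea of a per-file mean versus a running estimate.

```python
import numpy as np


def cmn_per_file(features):
    """Local normalization: subtract the mean over the whole file.

    `features` has shape (n_frames, n_coeffs). The mean needs every
    frame, so nothing can be emitted before the file has been read.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)


def cmn_running(features, alpha=0.995, initial_mean=None):
    """Causal variant: subtract an exponentially decaying mean estimate.

    `alpha` and `initial_mean` are illustrative parameters. A poor
    initial mean distorts the first frames until the estimate settles.
    """
    features = np.asarray(features, dtype=float)
    if initial_mean is None:
        mean = np.zeros(features.shape[1])
    else:
        mean = np.asarray(initial_mean, dtype=float).copy()
    out = np.empty_like(features)
    for t, frame in enumerate(features):
        # Update the estimate from the current frame only, then subtract.
        mean = alpha * mean + (1.0 - alpha) * frame
        out[t] = frame - mean
    return out
```

The per-file version is exact but needs the whole utterance; the running version can start immediately, and its output depends heavily on how good `initial_mean` is, which is the sensitivity described above.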

The most surprising thing for me is that the use of acceleration coefficients severely decreases accuracy. I always thought that MFCC_D_A was a sort of standard that is used in all ASR tutorials.

All the experiments were done with a single mixture only. When I start adding mixtures, this might change.
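For reference, the delta and acceleration coefficients discussed above can be sketched in NumPy roughly as follows. This follows the regression formula the HTKBook gives for deltas, d_t = sum over theta of theta * (c_{t+theta} - c_{t-theta}), divided by 2 * sum of theta^2. The window of 2 and the repetition of the first and last frame at the edges are what I believe HTK does by default, but treat both as assumptions; the function name `delta` is made up for this sketch.

```python
import numpy as np


def delta(features, window=2):
    """Regression-based delta over +/- `window` frames.

    `features` has shape (n_frames, n_coeffs). Edges are handled by
    repeating the first and last frame (an assumption of this sketch).
    """
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    out = np.zeros_like(features)
    for th in range(1, window + 1):
        ahead = padded[window + th : window + th + n_frames]    # c_{t+th}
        behind = padded[window - th : window - th + n_frames]   # c_{t-th}
        out += th * (ahead - behind)
    return out / denom


# Usage: static + delta + acceleration, i.e. a _D_A style vector.
static = np.random.default_rng(0).normal(size=(50, 13))  # dummy features
d = delta(static)
a = delta(d)  # acceleration = delta of the delta coefficients
full = np.hstack([static, d, a])
print(full.shape)  # (50, 39)
```

Adding _A increases the vector size by half again compared with _D alone, so each Gaussian has more parameters to estimate from the same amount of training data.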

--- (Edited on 4/14/2009 5:54 am [GMT-0500] by tpavelka) ---
