Acoustic Model Discussions

HVite log
User: void
Date: 12/11/2007 2:22 pm
Views: 8956
Rating: 18

How should I interpret the numbers in the HVite_log file? What do Ac, LM, and Act mean? For example, in my log file they are: [1176 frames] -48.0691 [Ac=-56529.3 LM=0.0] (Act=4.9). Is this a good or a bad result?

What is a good score1 in the Julian recognition output? Is a lower score better?

--- (Edited on 12/11/2007 2:22 pm [GMT-0600] by void) ---

Re: HVite log
User: kmaclean
Date: 12/11/2007 7:44 pm
Views: 4165
Rating: 28

Hi void,

>How should I interpret the numbers in the HVite_log file? What do Ac, LM, and Act mean?
>For example, in my log file they are: [1176 frames] -48.0691 [Ac=-56529.3 LM=0.0] (Act=4.9). Is this a good or a bad result?

I am assuming you are talking about the HVITE_log file included with a submission in the VoxForge corpus. 

All audio submitted to VoxForge is validated by training an acoustic model using only the speech in that submission. This is just a quick way to check that the transcriptions of the audio are reasonably correct (it does not catch all errors). The HVITE_log file is the output of running a realignment of the submitted audio data (similar to Step 8 in the VoxForge and HTK tutorials). If there are problems with the transcriptions, errors will show up at this step.

With respect to your specific question, from the HTK book (page 43):

     READY[2]>
     DIAL ZERO EIGHT SIX TWO
          == [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)
     READY[3]>
     etc

[...] 

After each utterance, the numerical information gives the total number of frames, the average log likelihood per frame, the total acoustic score, the total language model score and the average number of models active.

Therefore, for your example:

  • [1176 frames] - total number of frames
  • -48.0691 - the average log likelihood per frame
  • Ac=-56529.3 - the total acoustic score
  • LM=0.0 - the total language model score (zero because you are using Julian)
  • (Act=4.9) - the average number of models active
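
As a rough check on that breakdown, the sketch below parses the two summary lines quoted in this thread and recomputes the per-frame average. It assumes the average is simply (Ac + LM) / frames, which is my reading of the HTK book passage and not something it states as a formula. The regular expression is modelled only on these two sample lines, and `parse_summary` and `recomputed_average` are hypothetical helper names, not part of HTK.

```python
import re

# Pattern for the per-utterance summary line, modelled only on the two
# sample lines quoted in this thread (other HVite output may differ).
SUMMARY = re.compile(
    r"\[(?P<frames>\d+) frames\]\s+(?P<avg>-?\d+(?:\.\d+)?)\s+"
    r"\[Ac=(?P<ac>-?\d+(?:\.\d+)?)\s+LM=(?P<lm>-?\d+(?:\.\d+)?)\]\s+"
    r"\(Act=(?P<act>-?\d+(?:\.\d+)?)\)"
)


def parse_summary(line):
    """Return the five numbers of a summary line as a dict (hypothetical helper)."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("no HVite summary found in: " + line)
    fields = {name: float(value) for name, value in match.groupdict().items()}
    fields["frames"] = int(fields["frames"])
    return fields


def recomputed_average(fields):
    # Assumption: average per frame = (acoustic score + LM score) / frames.
    return (fields["ac"] + fields["lm"]) / fields["frames"]


mine = parse_summary("[1176 frames] -48.0691 [Ac=-56529.3 LM=0.0] (Act=4.9)")
book = parse_summary("== [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)")

# Print the logged average next to the recomputed one for each line.
for fields in (mine, book):
    print(fields["avg"], round(recomputed_average(fields), 4))
```

For both lines the recomputed value agrees with the logged average to about three decimal places, which is consistent with the assumption (the small difference is plausibly due to Ac and LM being printed with only one decimal).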

With respect to interpreting the average log likelihood per frame, this passage from Jurafsky and Martin's Speech and Language Processing textbook is helpful (Chapter 9, Automatic Speech Recognition, section 9.4.3 Probabilities, log probabilities and distance functions):

Up to now, all the equations we have given for acoustic modeling have used probabilities. It turns out, however, that a log probability (or logprob) is much easier to work with than a probability. Thus in practice throughout speech recognition (and related fields) we compute log-probabilities rather than probabilities.

One major reason that we can’t use probabilities is numeric underflow. To compute a likelihood for a whole sentence, say, we are multiplying many small probability values, one for each 10ms frame. Multiplying many probabilities results in smaller and smaller numbers, leading to underflow. The log of a small number like .00000001 = 10^-8, on the other hand, is a nice easy-to-work-with number like −8. A second reason to use log probabilities is computational speed. Instead of multiplying probabilities, we add log-probabilities, and adding is faster than multiplying.
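
To make the underflow point concrete with the numbers from this thread, here is a small sketch. It makes two simplifying assumptions that are mine, not the textbook's: every one of the 1176 frames has exactly the average log likelihood of -48.0691, and the logs are natural logs.

```python
import math

frames = 1176
avg_log_likelihood = -48.0691  # per-frame value from the log line; natural log assumed

# Likelihood of a single frame under the "every frame is average" assumption.
per_frame = math.exp(avg_log_likelihood)

product = 1.0   # running product of per-frame likelihoods
log_sum = 0.0   # running sum of per-frame log likelihoods
for _ in range(frames):
    product *= per_frame
    log_sum += avg_log_likelihood

print(product)  # the product is too small for a 64-bit float to represent
print(log_sum)  # the sum stays an ordinary, easy-to-handle number
```

The product collapses to zero after only a handful of frames, while the sum of logs lands close to the Ac value reported in the log line.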

>What is a good score1 in the Julian recognition output? Is a lower score better?

From the Julius 3.2 manual:

The recognition process takes place in two passes. First 2-gram frame synchronous recognition is performed on the input. An example of the output of this first pass is shown below.
[...]
pass1_best_score: The hypothesis score (Log-likelihood) 
After the first pass finishes, the second pass is performed and a final recognition result is displayed.  The 2nd pass uses the interim results from the first pass and searches these results using a 3-gram stack decoding technique.
[...]
score1: -14819.208008

Though there is nothing mentioned in the document to define what score1 is, it is likely the log-likelihood hypothesis score for the second pass.

Hope that helps,

Ken 

 

 

--- (Edited on 12/11/2007 8:44 pm [GMT-0500] by kmaclean) ---
