Speech Recognition Engines

Acoustic model HTK
User: julien marin
Date: 5/10/2007 2:01 pm
Views: 11536
Rating: 10

Hi guys,

I would like to use HTK to build a large-vocabulary recognizer.

I have already got this working with Sphinx4 using the HUB4 or WSJ acoustic models, which are freely available. I was wondering whether free acoustic models exist for HTK as well.

I didn't find any. I guess we are forced to train our own, but that just takes too much time.

I hope you'll be able to enlighten me.

Thank you all for reading ;)

regards

--- (Edited on 5/10/2007 2:01 pm [GMT-0500] by Visitor) ---

Re: Acoustic model HTK
User: kmaclean
Date: 5/11/2007 8:38 am
Views: 721
Rating: 17

Hi,

Try Keith Vertanen's HTK Wall Street Journal Acoustic Models.

Ken 

--- (Edited on 5/11/2007 9:38 am [GMT-0400] by kmaclean) ---

Re: Acoustic model HTK
User: Marion
Date: 4/15/2009 11:52 am
Views: 91
Rating: 1

Hi,

I'm trying to compare the recognition results of the VoxForge acoustic model and the WSJ ones with Julius, using the same wav files and the same grammars. It works well with the VoxForge AM, but most of the time it crashes with the WSJ AM at recognition, with a message like:

### Recognition: 2nd pass (RL heuristic best-first)

scan_word: word too long (>40)

where 40 is the value of maxwn and represents the "maximum number of HMM states per word(statistic)".

Or else the search just fails.

Has anyone tried running these models with Julius and hit the same kind of trouble?

Thank you

Marion

--- (Edited on 15-April-2009 6:52 pm [GMT+0200] by Marion) ---

Re: Acoustic model HTK
User: tpavelka
Date: 4/16/2009 2:59 am
Views: 94
Rating: 2

Hi,

I don't have much experience with Julius, but here's what I found by looking at the sources:

The error is thrown by function scan_word which can be found in file search_bestfirst_v1.c. This line is important:

whmm = new_make_word_hmm(hmminfo, winfo->wseq[word], winfo->wlen[word], (enable_iwsp && hmminfo->multipath) ? dwrk->has_sp : NULL);

My understanding is that this creates a word HMM from the phoneme HMMs (in hmminfo) by concatenating them according to the phonetic transcription of the word (winfo->wseq[word]). The structure winfo holds the vocabulary, and the variable winfo->maxwn holds the maximum number of states of any word in the vocabulary. Since all phoneme models in both the WSJ and VoxForge acoustic models have three emitting states, I would expect the total number of states of a word to be 3 x the number of phonemes in its transcription.
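
As a rough check of that expectation, the state count per word and the vocabulary maximum can be sketched as follows. This is a minimal Python sketch, not Julius code: the function names and the three-word lexicon are made up for illustration (the HANDS phones follow the alignment quoted later in this thread, the other two pronunciations are assumed), and it ignores anything Julius may add on top of the plain concatenation, such as the optional short-pause handling visible in the call above.

```python
# Assumption from the post: every phoneme model has three emitting states.
EMITTING_STATES_PER_PHONE = 3

def word_state_count(phones):
    """Expected number of emitting states for one word's phone sequence."""
    return EMITTING_STATES_PER_PHONE * len(phones)

def max_word_states(lexicon):
    """Largest expected state count over all words (a stand-in for maxwn)."""
    return max(word_state_count(phones) for phones in lexicon.values())

# Hypothetical toy lexicon, for illustration only.
lexicon = {
    "SLIM": ["s", "l", "ih", "m"],
    "HANDS": ["hh", "ae", "n", "z"],
    "GRIPPED": ["g", "r", "ih", "p", "t"],
}

for word, phones in lexicon.items():
    print(word, word_state_count(phones))
print("max:", max_word_states(lexicon))
```

If a word HMM ends up longer than this simple product, something beyond the plain phoneme concatenation must be contributing states.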

But somehow (and I am out of my depth regarding the how part) it can happen that a word is created with more states than the maximum number of states of a word in the vocabulary. (Note that this word is actually created from the same vocabulary from which the maximum is computed.)

The authors of Julius were aware of this fact and decided to solve this by adding a magic constant (10) to the maximum number of states used to allocate some memory, examples:

maxwn = r->lm->winfo->maxwn + 10;

dwrk->wordtrellis[0] = (LOGPROB *)mymalloc(sizeof(LOGPROB) * maxwn);

But in some cases (such as yours) 10 is not enough, so rather than letting a segmentation fault happen, this error is thrown:

wordhmmnum = whmm->len;
if (wordhmmnum >= winfo->maxwn + 10) {
  j_internal_error("scan_word: word too long (>%d)\n", winfo->maxwn + 10);
}
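
The relationship between the padded allocation and the check can be restated in a few lines. This is only a Python paraphrase of the two C excerpts above, assuming whmm->len is the state count of the word HMM being built; the function name and the sample numbers are hypothetical.

```python
MARGIN = 10  # the constant added to winfo->maxwn in the excerpts above

def fits_in_buffers(word_states, vocab_maxwn, margin=MARGIN):
    """Mirror of the guard: buffers hold vocab_maxwn + margin entries,
    and a word is rejected once its state count reaches that size."""
    return word_states < vocab_maxwn + margin

# Hypothetical numbers: with a vocabulary maximum of 30, the limit
# reported in the error message would be 40.
print(fits_in_buffers(36, 30))  # still inside the padded size
print(fits_in_buffers(40, 30))  # rejected, because the C check uses >=
```

Note that the comparison is >=, so a word whose state count exactly equals the padded size is rejected as well.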

I could say something about good programming practices, but let him who is without sin cast the first stone...

I am doing some acoustic model experiments of my own. They take an awful lot of time to finish, so meanwhile I have a lot of free time on my hands. If you want some help with the debugging, upload your experiments somewhere and I will have a look.

Tomas

--- (Edited on 4/16/2009 2:59 am [GMT-0500] by tpavelka) ---

--- (Edited on 4/16/2009 3:03 am [GMT-0500] by tpavelka) ---

Re: Acoustic model HTK
User: Marion
Date: 4/16/2009 4:25 am
Views: 158
Rating: 2

Thank you for your answer Tomas,

I had noticed this constant too, and since I don't really understand why the normal number of states is exceeded, I just increased the constant so that I no longer get the 'word too long' error. But now I get the <search failed> message, so the problem must come from something more serious. I'll keep looking.

Another difference that I noticed in the hmmdefs is that VoxForge uses MFCC_D_N_Z_0 and WSJ uses MFCC_D_A_Z_0. Could that be the cause?
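
For reference, a rough way to compare the two parameter kinds is to count the vector size each one implies. The sketch below is plain Python, not an HTK or Julius API. It assumes 12 cepstral coefficients (the NUMCEPS value in the HTK recipe config quoted later in this thread) and reads the qualifiers the way the HTK Book describes them: _0 or _E adds one energy-like term, _D and _A append delta and acceleration coefficients, _N drops the absolute energy term, and _Z does not change the size.

```python
def parse_target_kind(kind):
    """Split e.g. 'MFCC_D_N_Z_0' into ('MFCC', {'D', 'N', 'Z', '0'})."""
    base, *qualifiers = kind.split("_")
    return base, set(qualifiers)

def feature_dim(kind, num_ceps=12):
    """Vector size implied by a parameter kind (num_ceps is an assumption)."""
    _, quals = parse_target_kind(kind)
    # Static part: cepstra plus one term each for _0 (C0) and _E (log energy).
    static = num_ceps + ("0" in quals) + ("E" in quals)
    # One copy for the statics, one more each for deltas and accelerations.
    copies = 1 + ("D" in quals) + ("A" in quals)
    dim = static * copies
    if "N" in quals:
        dim -= 1  # absolute energy suppressed, its delta is kept
    return dim

for kind in ("MFCC_D_N_Z_0", "MFCC_D_A_Z_0"):
    print(kind, feature_dim(kind))
```

Under these assumptions the two kinds imply vectors of different sizes, so features computed for one model would not match what the other expects.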

--- (Edited on 16-April-2009 11:25 am [GMT+0200] by Marion) ---

Re: Acoustic model HTK
User: tpavelka
Date: 4/16/2009 4:50 am
Views: 46
Rating: 1

MFCC_D_N_Z_0 was the only one that Julius 3.2 supported, but the new version should be able to read HTK config files which would let you specify other types of features.

I have found the option in the new JuliusBook -

B.5.5 Misc. AM options

-htkconf file

--- (Edited on 4/16/2009 4:50 am [GMT-0500] by tpavelka) ---

Re: Acoustic model HTK
User: Marion
Date: 4/16/2009 12:03 pm
Views: 64
Rating: 2

Thanks for your help Tomas, but the -htkconf didn't change anything. I found out that it works fine using mfcfile input instead of rawfile input, so maybe Julius doesn't process the MFCCs properly...

The weird thing is, when I run HVite on the same data I don't have the same time alignment!


my jconf file:

-h cmu/wsj_all_8000_1_txt/hmmdefs

-hlist cmu/wsj_all_8000_1_txt/tiedlist

-penalty1 5.0       
-penalty2 20.0       

-iwcd1 max

-gprune safe

-b2 200                
-sb 200.0

-spmodel "sp"

-iwsp           
-iwsppenalty -70.0

-input mfcfile

-gram gram

-htkconf cmu/config

-walign

and my HVite command:

HVite -A -D -T 1 -H cmu/wsj_all_8000_1_txt/hmmdefs -C cmu/config -l '*' -i test.mlf -w wdnet -m cmu6 cmu/wsj_all_8000_1_txt/tiedlist test.mfc

--- (Edited on 16-April-2009 7:03 pm [GMT+0200] by Marion) ---

Re: Acoustic model HTK
User: tpavelka
Date: 4/16/2009 12:40 pm
Views: 81
Rating: 2

The difference in time alignment may be due to the word transition penalty. Julius is configured with two values (-penalty1 and -penalty2). I do not know exactly what they mean, but they should be similar to HTK's word transition penalty, which is set by the -p switch of HVite (default value 0). If you want exactly the same time alignment, you need the same word transition penalties.

As for the HTK config file, that can be a bit tricky. The way you set up the config file determines what is done with the input. So if you want to read wav files, for example, you need to say so in the config file and specify how the wav should be converted to MFCCs (number of filter banks etc.).

What exactly is in your HTK config file?

--- (Edited on 4/16/2009 12:40 pm [GMT-0500] by tpavelka) ---


Re: Acoustic model HTK
User: Visitor
Date: 4/17/2009 8:59 am
Views: 95
Rating: 2

OK, my fault: I was using the 'config' file from the HTK TIMIT + WSJ0 recipe in the 'common' folder:

TARGETKIND = MFCC_0_D_A_Z
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
ZMEANSOURCE = T
USEPOWER = T
BYTEORDER = VAX

instead of the 'configcross':

FORCECXTEXP = T
ALLOWXWRDEXP = T

so I had only word-internal triphones, with sil inserted between words, like:

11600000 12400000 sil-hh+ae -108.174774 HANDS
12400000 13600000 hh-ae+n -94.700050
13600000 14700000 ae-n+z -105.264160
14700000 15400000 n-z+sil -136.026886

instead of:

11500000 12500000 m-hh+ae -101.873779 HANDS
12500000 13600000 hh-ae+n -94.045685
13600000 14700000 ae-n+z -105.264160
14700000 15300000 n-z+g -131.401932

which uses the end of the previous word (SLIM) and the beginning of the next one (GRIPPED) as context.
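
The two labelings above can be reproduced with a small sketch. This is plain Python written for illustration, not HTK's actual context expansion: the function name is made up, and the SLIM and GRIPPED pronunciations are assumed apart from the boundary phones m and g that appear in the alignment.

```python
def expand_triphones(words, cross_word, boundary="sil"):
    """Turn a list of per-word phone lists into per-word triphone labels.

    cross_word=False: contexts stop at word edges and use `boundary`.
    cross_word=True:  edge contexts come from the neighbouring words.
    """
    result = []
    for i, phones in enumerate(words):
        labels = []
        for j, phone in enumerate(phones):
            if j > 0:
                left = phones[j - 1]
            elif cross_word and i > 0:
                left = words[i - 1][-1]   # last phone of previous word
            else:
                left = boundary
            if j < len(phones) - 1:
                right = phones[j + 1]
            elif cross_word and i < len(words) - 1:
                right = words[i + 1][0]   # first phone of next word
            else:
                right = boundary
            labels.append(f"{left}-{phone}+{right}")
        result.append(labels)
    return result

# SLIM HANDS GRIPPED (pronunciations partly assumed, see the note above)
words = [["s", "l", "ih", "m"], ["hh", "ae", "n", "z"], ["g", "r", "ih", "p", "t"]]

print(expand_triphones(words, cross_word=False)[1])
print(expand_triphones(words, cross_word=True)[1])
```

Only the first and last triphone of each word change between the two modes, which matches the HANDS example.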

This isn't such a big difference at phone level, but it is a bigger one at word level.
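
To put numbers on that, the word boundaries can be read off the two alignments above. This is a minimal Python sketch, assuming the first two columns are HTK label times in units of 100 ns; the helper name is made up for illustration.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are in 100 ns units

def word_span_seconds(label_lines):
    """Return (start, end) in seconds for a word, given its phone label lines."""
    fields = [line.split() for line in label_lines]
    start = int(fields[0][0])   # start time of the first phone
    end = int(fields[-1][1])    # end time of the last phone
    return start / HTK_UNITS_PER_SECOND, end / HTK_UNITS_PER_SECOND

word_internal = [
    "11600000 12400000 sil-hh+ae -108.174774 HANDS",
    "12400000 13600000 hh-ae+n -94.700050",
    "13600000 14700000 ae-n+z -105.264160",
    "14700000 15400000 n-z+sil -136.026886",
]
cross_word = [
    "11500000 12500000 m-hh+ae -101.873779 HANDS",
    "12500000 13600000 hh-ae+n -94.045685",
    "13600000 14700000 ae-n+z -105.264160",
    "14700000 15300000 n-z+g -131.401932",
]

print(word_span_seconds(word_internal))
print(word_span_seconds(cross_word))
```

For HANDS, both word boundaries move by 10 ms between the two runs, which is one frame at the TARGETRATE of 100000.0 in the config above.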

I think it should be OK now. Thanks a lot for your help and good luck with your experiments!

--- (Edited on 4/17/2009 8:59 am [GMT-0500] by Visitor) ---

Re: Acoustic model HTK
User: Marion
Date: 4/17/2009 9:01 am
Views: 45
Rating: 1

Oops, that was Marion speaking.

--- (Edited on 17-April-2009 4:01 pm [GMT+0200] by Marion) ---
