Speech Recognition Engines

HTK and Julius : different recognition score
User: brunal2496
Date: 9/19/2008 5:34 am
Views: 6642
Rating: 1

Hi all,


I have a problem using HTK and Julius: I don't get the same recognition score with one as with the other.


I use HTK to train a word model and then use Julius to decode with that model. Because I use a very specific microphone and the language is French, I wasn't able to use a prerecorded corpus. I chose to train a word-based model because my corpus isn't very big (a total of 45 occurrences of each word, from 15 speakers) and I want to use my system for real-time spoken commands.


I use HInit to initialize the models under HTK and then HRest to train them. I get a mean of 100% correct recognition under HTK, using cross-validation on my corpus.

Here is one of the results from HResults for one iteration of the cross-validation:

HResults -A -D -T 1 -p -u 0.01 -e ??? sil -I iter7/test/testref.mlf listemot_sil.txt iter7/test/recog.mlf 
No HTK Configuration Parameters Set
====================== HTK Results Analysis =======================
  Date: Mon Sep 15 17:28:53 2008
  Ref : iter7/test/testref.mlf
  Rec : iter7/test/recog.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=100.00 [H=25, S=0, N=25]
WORD: %Corr=100.00, Acc=100.00 [H=25, D=0, S=0, I=0, N=25]
------------------------ Confusion Matrix -------------------------
       c   g   d   z   p 
       a   a   r   e   i 
       m   u   o   r   l 
       e   c   i   o   o 
       r   h   t       t  Del [ %c / %e]
came   5   0   0   0   0    0
gauc   0   5   0   0   0    0
droi   0   0   5   0   0    0
zero   0   0   0   5   0    0
pilo   0   0   0   0   5    0
Ins    0   0   0   0   0
===================================================================

No HTK Configuration Parameters Set
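
For anyone comparing these figures with another decoder, the WORD line above can be reproduced by hand. As documented in the HTK Book, H = N - D - S, %Corr = 100 * H / N and Acc = 100 * (H - I) / N. The short sketch below just applies that arithmetic; the function name is made up for illustration and is not part of HTK.

```python
def hresults_word_scores(n, d, s, i):
    """Return (%Corr, Acc) from the N, D, S, I counts on the HResults WORD line.

    n: number of reference labels, d: deletions,
    s: substitutions, i: insertions.
    """
    h = n - d - s                     # correctly recognised labels (H)
    corr = 100.0 * h / n              # %Corr ignores insertions
    acc = 100.0 * (h - i) / n         # Acc also penalises insertions
    return corr, acc


if __name__ == "__main__":
    # Counts taken from the WORD line above: H=25, D=0, S=0, I=0, N=25
    print(hresults_word_scores(n=25, d=0, s=0, i=0))
```

Applying the same two formulas to another decoder's output gives figures that can be compared directly with the HResults line.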

But when I use Julius for real-time decoding, I don't get a similar score; it is closer to 40%.

To make testing easier, I decided to run the same cross-validation with Julius, using sound files as input. I get a mean of around 30-40% correct recognition.

Here is the command line:

julius -input rawfile -realtime -filelist $sndfile -h $mmf -gramlist julius/gramlist.txt -multipath -lv 2500 -rejectshort 70 -headmargin 50 -tailmargin 50 -progout -sp sil -b 0

And here is part of the output from Julius for the same MMF file as above:

### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_camera.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_GAUCHE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAMERA
pass1_best: CAMERA
pass1_best: CAMERA
pass1_best_wordseq: 0
pass1_best_phonemeseq: camera
pass1_best_score: -5912.543457
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 198
sentence1: CAMERA
wseq1: 0
phseq1: camera
cmscore1: 1.000
score1: -5912.554688
grammar1: 2


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_zero.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best_wordseq: 0
pass1_best_phonemeseq: pilotage
pass1_best_score: -5937.828613
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 205
sentence1: CAM_PILOTAGE
wseq1: 0
phseq1: pilotage
cmscore1: 1.000
score1: -5937.822754
grammar1: 1


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_pilotage.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best_wordseq: 0
pass1_best_phonemeseq: pilotage
pass1_best_score: -5945.857422
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 212
sentence1: CAM_PILOTAGE
wseq1: 0
phseq1: pilotage
cmscore1: 1.000
score1: -5945.850098
grammar1: 1


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_droite.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -5936.231934
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 184
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 0.999
score1: -5936.239258
grammar1: 0


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_gauche.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best_wordseq: 0
pass1_best_phonemeseq: zero
pass1_best_score: -5817.412109
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 189
sentence1: CAM_PIL_ZERO
wseq1: 0
phseq1: zero
cmscore1: 1.000
score1: -5834.108887
grammar1: 0


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_gauche.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -4845.695801
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 194
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -4865.897461
grammar1: 0


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_pilotage.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -5589.427246
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 208
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -5598.542969
grammar1: 0


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_droite.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -4763.124023
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 198
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -4772.822266
grammar1: 0


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_camera.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_GAUCHE
pass1_best: CAMERA
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best_wordseq: 0
pass1_best_phonemeseq: pilotage
pass1_best_score: -5862.962891
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 206
sentence1: CAM_PILOTAGE
wseq1: 0
phseq1: pilotage
cmscore1: 0.658
score1: -5862.988281
grammar1: 1


------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_zero.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels

pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -5343.823730
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 206
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -5358.856445
grammar1: 0
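
A log like the one above can be scored automatically rather than by eye. The sketch below is only an illustration and rests on two assumptions taken from this particular log: each file name ends in `_<word>.wav`, and `phseq1` prints the name of the recognised word model. The function name is hypothetical. Counted this way, 4 of the 10 files shown above are recognised correctly.

```python
import re
from pathlib import PurePosixPath

SPEECHFILE = re.compile(r"input speechfile:\s*(\S+)")
PHSEQ = re.compile(r"\s*phseq1:\s*(\S+)")


def score_julius_log(log_text):
    """Return (correct, total) for a Julius log in the format shown above.

    Assumes file names like ".../AF_s2_camera.wav" (expected word = "camera")
    and that phseq1 holds the recognised word model name.
    """
    correct = total = 0
    expected = None
    for line in log_text.splitlines():
        m = SPEECHFILE.search(line)
        if m:
            # Every input file counts, even if no result line follows it.
            expected = PurePosixPath(m.group(1)).stem.rsplit("_", 1)[-1]
            total += 1
            continue
        m = PHSEQ.match(line)
        if m and expected is not None:
            if m.group(1) == expected:
                correct += 1
            expected = None  # only score the first result per file
    return correct, total


if __name__ == "__main__":
    sample = """\
Stat: adin_sndfile: input speechfile: wav16k/AF_s2_camera.wav
phseq1: camera
Stat: adin_sndfile: input speechfile: wav16k/AF_s2_zero.wav
phseq1: pilotage
"""
    correct, total = score_julius_log(sample)
    print(f"{correct}/{total} correct")
```

Files that produce no `phseq1` line (for example, rejected input) are counted as errors, which keeps the total comparable to N in the HResults output.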


I'm using <MFCC_D_A>. I don't use C0 because my corpus was recorded at too low a level compared to the output of my soundcard.


Does anyone have an idea why I don't get the same result between HTK and Julius?

Thanks a lot for your help.

Best regards,

Bruno.

--- (Edited on 9/19/2008 5:34 am [GMT-0500] by brunal2496) ---

Re: HTK and Julius : different recognition score
User: nsh
Date: 9/19/2008 9:59 am
Views: 218
Rating: 1

> I'm using <MFCC_D_A>. I don't use C0 because my corpus was recorded at too low a level compared to the output of my soundcard.

According to the Julius manual:

Note that Julius itself can only extract MFCC_E_D_N_Z features from speech data. If you use an acoustic HMM trained by other feature type, only the HTK parameter file of the same feature type can be used.

--- (Edited on 9/19/2008 9:59 am [GMT-0500] by nsh) ---

Re: HTK and Julius : different recognition score
User: brunal2496
Date: 9/22/2008 5:27 am
Views: 75
Rating: 1

Can you give me the page reference in the manual?

I've checked directly in the Julius source files, and I found this comment in wav2mfcc.c:

The supported parameter is MFCC, with any combination of all the qualifiers in HTK: _0, _E, _D, _A, _Z, _N

I'm checking with Lee Akinobu whether the same applies when using Julius in real time with a microphone.

Anyway, does anybody have an idea why I don't get the same score? Is there something I should pay particular attention to?



--- (Edited on 9/22/2008 5:27 am [GMT-0500] by brunal2496) ---

Re: HTK and Julius : different recognition score
User: kmaclean
Date: 9/23/2008 7:05 pm
Views: 62
Rating: 2

Hi brunal2496,

The changelog for Julius 3.5.2 on the Julius front page says:

  o  Wider MFCC types support:
     - Added extraction of acceleration coefficients (_A).  Now you
       can recognize waveform or microphone input with AM trained with _A.
     - Support all MFCC qualifiers (_0, _E, _N, _D, _A, _N, _Z) and their
       combination 
     - Support for any vector lenth (will be guessed from AM header)
     - New option: "-accwin"
     - New option "-zmeanframe": frame-wise DC offset removal, like HTK
     - New options to specify detailed analysis parameters (see manual):
          -preemph, -fbank, -ceplif, -rawe / -norawe, 
          -enormal / -noenormal, -escale, -silfloor

The Julius book was written for an older release, Julius r3.2. On page 15 of the Julius book, in the Microphone Input section, the author says: "At present the only possible feature extraction method that can take place within Julius/Julian is MFCC_E_D_NZ feature extraction".

>Anyway, does anybody have an idea why I don't have the same score?

I don't know.... I always seemed to get better recognition results with Julius... each nightly build has a very rudimentary "sanity" test included with it, and Julius seems to recognize better than HTK (at least for the feature set that I use). 

Maybe you need a larger training set?

Ken

 

--- (Edited on 9/23/2008 8:05 pm [GMT-0400] by kmaclean) ---

Re: HTK and Julius : different recognition score
User: nsh
Date: 9/24/2008 5:29 pm
Views: 2559
Rating: 1

Anyhow, a difference this large means you are using the wrong parameters, probably the wrong feature set. Try decoding the mfc files with Julius, so that both decoders see the same features.
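
Before doing that, it may help to confirm which parameter kind the mfc files actually contain. The sketch below reads the 12-byte header of an HTK parameter file (nSamples, sampPeriod, sampSize, parmKind) and prints the kind as a name such as MFCC_D_A. It assumes the default big-endian byte order and uncompressed 4-byte floats; the function names are hypothetical and not part of HTK or Julius.

```python
import struct

BASE_KINDS = ["WAVEFORM", "LPC", "LPREFC", "LPCEPSTRA", "LPDELCEP", "IREFC",
              "MFCC", "FBANK", "MELSPEC", "USER", "DISCRETE", "PLP"]

# Qualifier bits (octal) as listed in the HTK Book file-format section.
QUALIFIERS = [(0o000100, "_E"), (0o000200, "_N"), (0o000400, "_D"),
              (0o001000, "_A"), (0o002000, "_C"), (0o004000, "_Z"),
              (0o010000, "_K"), (0o020000, "_0"), (0o040000, "_V"),
              (0o100000, "_T")]


def describe_htk_header(header_bytes):
    """Decode the first 12 bytes of an HTK parameter file (big-endian assumed)."""
    n_samples, samp_period, samp_size, parm_kind = struct.unpack(
        ">iihH", header_bytes[:12])
    base = parm_kind & 0o77           # low 6 bits hold the base kind
    name = BASE_KINDS[base] if base < len(BASE_KINDS) else f"UNKNOWN({base})"
    name += "".join(q for bit, q in QUALIFIERS if parm_kind & bit)
    return {
        "kind": name,
        "frames": n_samples,
        "frame_period_ms": samp_period / 10000,  # sampPeriod is in 100 ns units
        "floats_per_frame": samp_size // 4,      # assumes uncompressed floats
    }


def describe_htk_file(path):
    """Read and decode the header of the HTK parameter file at `path`."""
    with open(path, "rb") as f:
        return describe_htk_header(f.read(12))


if __name__ == "__main__":
    # Synthetic header: 100 frames, 10 ms shift, 36 floats, MFCC_D_A
    demo = struct.pack(">iihH", 100, 100000, 144, 6 | 0o400 | 0o1000)
    print(describe_htk_header(demo))
```

If the kind or vector size reported here differs from what the acoustic model header declares, that mismatch is a likely place to start looking.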

 

--- (Edited on 9/24/2008 5:29 pm [GMT-0500] by nsh) ---
