Speech Recognition Engines

unsupervised speaker adaptation
User: dsubbu
Date: 7/6/2009 7:47 am
Views: 5505
Rating: 2

Hi,

I have been using HTK for about a month now. In my experiments on supervised adaptation, I have managed to increase the accuracy from around 45% to 85% for a medium-sized vocabulary.

I attempted unsupervised adaptation and saw that the accuracy actually went down in most cases and increased in only one case, to around 79% (this was a digit recognition task). Most of the literature I have seen on unsupervised methods does not show high accuracies.

Can anyone direct me to some study that shows high accuracies (like 90%+) being achieved through unsupervised methods?

Or, if anyone has performed experiments on unsupervised adaptation, please share any methods or results.

Thanks in advance!

Cheers

--- (Edited on 7/6/2009 7:47 am [GMT-0500] by Visitor) ---

Re: unsupervised speaker adaptation
User: nsh
Date: 7/6/2009 11:09 am
Views: 79
Rating: 3

High accuracy (95%) is achieved by starting from about 90% accuracy without adaptation. That is a reasonable accuracy for a medium-vocabulary task with good-quality recordings. The fact that you start from 45% makes me think there is something wrong with your original model.

--- (Edited on 7/6/2009 11:09 am [GMT-0500] by nsh) ---

Re: unsupervised speaker adaptation
User: dsubbu
Date: 7/7/2009 1:20 am
Views: 53
Rating: 3

hi,

Thanks for that. I used the VoxForge SI model as my base. The low initial accuracy is probably due to the large difference in accents. Has the model been trained with any Indian-accented speech (my accent)?

Since I had a low initial accuracy, I first adapted in a supervised fashion, which increased the accuracy to 80%, and then tried unsupervised adaptation. But that did not work.

Is 80% accuracy low as well? I found that the accuracy saturates at about this value even when I use 10 minutes of my voice.

Thanks a lot!

Cheers

--- (Edited on 7/7/2009 1:20 am [GMT-0500] by Visitor) ---

Re: unsupervised speaker adaptation
User: nsh
Date: 7/7/2009 8:43 am
Views: 109
Rating: 3

It's all a senseless discussion without data. 80% is still not enough to start unsupervised adaptation, or the adaptation should probably proceed more slowly. There are also things like dictionary adaptation. I suggest you provide all the data you are using: input audio, test audio, models, test results and so on.

--- (Edited on 7/7/2009 8:43 am [GMT-0500] by nsh) ---

Re: unsupervised speaker adaptation
User: dsubbu
Date: 7/8/2009 2:09 am
Views: 105
Rating: 2

hi nsh,

I have uploaded them to this link:

http://rapidshare.com/files/253299894/uploadvf.tar.gz.html

I have included a readme file (which I hope gives all the necessary info). I am using Ubuntu and HTK 3.2.1.

Thanks a lot for the help!

Cheers!

--- (Edited on 7/8/2009 2:09 am [GMT-0500] by Visitor) ---

Re: unsupervised speaker adaptation
User: nsh
Date: 7/10/2009 5:32 am
Views: 82
Rating: 3

Hey, I've checked this. The issue here is proper handling of sp. I was wrong when I suggested removing it from the dictionary, because there are indeed short pauses: when you remove sp the accuracy drops to 50%, and thus the adapted accuracy drops as well (you can check this by dumping the result with -i during HVite adaptation).


To solve this properly, I suggest modifying the grammar this way:

$word = QUIT | HELP | CHAT | WEB | WORDS | INTRODUCE | AND | TRY | ALSO |
LET | SPEECH | FASTER | DECODE | TRYING | TEXT | RANDOM | SOME | COME | NOW
| DELETE | MESSAGE | MENU | ENTER | ADDRESS | SEARCH | WINDOW | NEW |
FRESH | STOP | FORWARD | BACK;

(SIL <$word> SIL)


and add a SIL entry to the dictionary, keeping sp at the end of the word entries:

ZUE             [ZUE]           z uw sp
ZURICH          [ZURICH]        z uh r ix k sp
SIL             [SIL]           sil
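
As a rough sketch of the two changes above, the grammar text and the extra dictionary line could be generated as follows. The function names and the shortened word list are made up for illustration and are not part of HTK; the output is plain text that you would still save and pass to your usual HTK tools yourself.

```python
# Sketch only: build the grammar text and a dictionary line as plain strings.
# build_grammar and dict_entry are hypothetical helper names, not HTK functions.

WORDS = ["QUIT", "HELP", "CHAT"]  # shortened subset of the full word list


def build_grammar(words):
    """Return grammar text: a $word alternation wrapped in SIL ... SIL."""
    alternation = " | ".join(words)
    return f"$word = {alternation};\n\n(SIL <$word> SIL)\n"


def dict_entry(word, phones):
    """Format one dictionary line: WORD, [OUTPUT SYMBOL], then the phones."""
    return f"{word:<16}{'[' + word + ']':<16}{' '.join(phones)}"


if __name__ == "__main__":
    print(build_grammar(WORDS))
    # Word entries keep their trailing sp; SIL maps to the sil model.
    print(dict_entry("ZUE", ["z", "uw", "sp"]))
    print(dict_entry("SIL", ["sil"]))
```

The point of the layout is that silence at the utterance boundaries is matched by the explicit SIL word, while sp stays available between words.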


Then there will be no "Tee state" error and adaptation results will be acceptable:

Adapt set:

====================== HTK Results Analysis =======================
  Date: Fri Jul 10 02:29:22 2009
  Ref : badaptprompts.mlf
  Rec : outonline.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=70, N=70]
WORD: %Corr=89.96, Acc=69.28 [H=609, D=2, S=66, I=140, N=677]
===================================================================

Test set:


====================== HTK Results Analysis =======================
  Date: Fri Jul 10 02:29:27 2009
  Ref : btestprompts.mlf
  Rec : outonline.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=30, N=30]
WORD: %Corr=86.69, Acc=66.19 [H=241, D=5, S=32, I=57, N=278]
===================================================================
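
For reading these numbers: %Corr counts hits only, while Acc also subtracts insertions, which is why Acc is so much lower than %Corr here (I=140 and I=57). A small sketch that recomputes both figures from the counts printed above; the function name is made up for illustration.

```python
# Recompute the WORD line figures from the raw counts in the results above.
# htk_word_scores is a hypothetical helper name, not part of HTK.

def htk_word_scores(H, D, S, I, N):
    """Return (%Corr, Acc) rounded to two decimals.

    H = hits, D = deletions, S = substitutions, I = insertions,
    N = number of reference words, with H = N - D - S.
    """
    assert H == N - D - S, "counts are inconsistent"
    corr = 100.0 * H / N        # insertions are ignored
    acc = 100.0 * (H - I) / N   # insertions are penalised
    return round(corr, 2), round(acc, 2)


adapt_set = htk_word_scores(H=609, D=2, S=66, I=140, N=677)
test_set = htk_word_scores(H=241, D=5, S=32, I=57, N=278)
print(adapt_set)  # (89.96, 69.28)
print(test_set)   # (86.69, 66.19)
```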

--- (Edited on 7/10/2009 5:32 am [GMT-0500] by nsh) ---

Re: unsupervised speaker adaptation
User: dsubbu
Date: 7/10/2009 6:45 am
Views: 120
Rating: 2

Hi nsh,

Thanks a ton for your time! I have a few questions and observations:

1. Which models are these results for? I am asking because the accuracy does not improve with the speaker-independent model after the changes you suggested.

2. I had the "tee" model problem and found a naive "solution" that removed the error: I made the sp model a non-tee model by removing the 1-3 state transition. I admit it was ad hoc and probably against the philosophy of a short pause, but it seemed to work. What do you think?

3. Also, I had trouble with the j parameter in HVite while creating the TMF. When j was 1, I got a significant performance reduction!

When I changed j to the number of adaptation files, the performance either improved or at least did not degrade. I am reading up on MLLR to understand why, but what is the reason incremental adaptation gives bad TMFs?

Pasting the best results I got here.

With hmm50 (i.e. 50 supervised adaptation sentences):

====================== HTK Results Analysis =======================
  Date: Wed Jul  8 15:19:34 2009
  Ref : btestprompts.mlf
  Rec : supres.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=43.33 [H=13, S=17, N=30]
WORD: %Corr=91.01, Acc=89.57 [H=253, D=14, S=11, I=4, N=278]
===================================================================

With hmm50 and -j 50 for the unsupervised pass (unsupervised on top of supervised):

====================== HTK Results Analysis =======================
  Date: Wed Jul  8 17:02:41 2009
  Ref : btestprompts.mlf
  Rec : unsupres.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=56.67 [H=17, S=13, N=30]
WORD: %Corr=94.60, Acc=93.53 [H=263, D=0, S=15, I=3, N=278]
===================================================================

When I used -j 1 for the same setup, there was a drastic reduction in accuracy.

Thanks a lot again.

Cheers!

--- (Edited on 7/10/2009 6:45 am [GMT-0500] by Visitor) ---

Re: unsupervised speaker adaptation
User: nsh
Date: 7/13/2009 7:41 am
Views: 126
Rating: 2

> 1. Which models are these results for? I am asking because the accuracy does not improve with the speaker-independent model after the changes you suggested.

On the same models you shared.

> 2. I had the "tee" model problem and found a naive "solution" that removed the error: I made the sp model a non-tee model by removing the 1-3 state transition. I admit it was ad hoc and probably against the philosophy of a short pause, but it seemed to work. What do you think?

I think the solution with SIL at the boundaries is better. sp should keep its skip transition.

> 3. Also, I had trouble with the j parameter in HVite while creating the TMF. When j was 1, I got a significant performance reduction!


This shows that j should be greater than 1, of course. You need to collect enough material to do the adaptation properly. A value of j around 20-50 is probably reasonable.
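
To give a feel for why more material helps, here is a toy simulation. It is not MLLR and does not use HTK: it assumes a single scalar offset, one noisy observation per utterance, and Gaussian noise, and it simply compares how far the estimated offset lands from the true one for different j. All names and numbers are made up for illustration.

```python
# Toy illustration only: estimate one scalar offset from j noisy observations.
# This is NOT MLLR; it only shows that an estimate from little data is noisy.
import random


def mean_abs_error(j, trials=2000, true_shift=2.0, noise_sd=1.0, seed=0):
    """Average |estimate - true_shift| when the estimate uses j observations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        samples = [true_shift + rng.gauss(0.0, noise_sd) for _ in range(j)]
        estimate = sum(samples) / j  # sample mean as the offset estimate
        total += abs(estimate - true_shift)
    return total / trials


if __name__ == "__main__":
    for j in (1, 20, 50):
        print(j, round(mean_abs_error(j), 3))
```

The error shrinks as j grows. In real unsupervised adaptation, recognition errors in the transcriptions add to this, so a transform estimated from a single utterance can easily make things worse.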

 

--- (Edited on 7/13/2009 7:41 am [GMT-0500] by nsh) ---

Re: unsupervised speaker adaptation
User: dsubbu
Date: 7/15/2009 12:41 am
Views: 2115
Rating: 2

thanks a lot for that!

--- (Edited on 7/15/2009 12:41 am [GMT-0500] by Visitor) ---
