VoxForge
Hi,
I am trying to train up some telephony acoustic models from the Fisher corpus using SphinxTrain and I have been running into a couple of problems during decoding that I can't seem to suss, which I think are related to my training environment.
In the past I have successfully trained acoustic models with SphinxTrain for several languages but seem to be having trouble with the Fisher corpus for some reason. I suspect that the issue has something to do with either,
I am currently using just a very small subset of this rather enormous corpus, consisting of only 1.5 hours of data. I've set the final number of densities to 8, with 3-state HMMs, 1000 senones, and continuous models. I'd like to work out these kinks before moving forward with a larger chunk of training data.
Using these parameters and the conversion setup above I am able to train my models without incident, and the only errors that appear in the logs during my training steps:
are typical errors related to a very small number of utterances where no final state was reached, e.g.,
utt: 706 fe_03_00085-58 137 0 28 23 ERROR: "backward.c", line 431: final state not reached
This gave me the initial impression that my models were OK, but when I try to run sphinx3_decode on a couple of the training utterances as a sanity check, using all the same decoding parameters as were used during training, I invariably get many of the following errors,
ERROR: "srch_time_switch_tree.c", line 848: ***ERROR*** Fr 9, best HMM score > 0 (2146874566); int32 wraparound?
I checked these out in the sphinxtrain faq,
http://www.speech.cs.cmu.edu/sphinxman/logfiles.html#201
but altering the filler model dictionary didn't seem to have any effect, and the transitions matrices all looked OK.
The decoding does complete, and the hypotheses being generated are reasonable, but they are invariably truncated, i.e., the hypotheses are much shorter than the input utterances. This strikes me as possibly a silence modeling issue, but I can't figure out how to fix it, or why it isn't causing problems during training.
One other thing I noticed was, when looking at the variances,
$ printp -gaufn variances
there are quite a few zero entries,
mgau 141
feat 0
density 0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+0\
0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00\
which seem likely to be wrong.
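A quick way to list every affected density, rather than eyeballing the dump, is to scan the printp text output. The following is only a sketch: it assumes the dump follows the `mgau` / `feat` / `density` layout shown above, and the function name `zero_variance_densities` is made up for illustration.

```python
def zero_variance_densities(dump_text, eps=0.0):
    """Return (mgau, feat, density) for each density with a variance <= eps.

    Assumes the text layout seen in the post: a "mgau N" line, a "feat N"
    line, then "density N v1 v2 ..." lines. Other lines are ignored.
    """
    hits = []
    mgau = feat = None
    for line in dump_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        if tokens[0] == "mgau":
            mgau = int(tokens[1])
        elif tokens[0] == "feat":
            feat = int(tokens[1])
        elif tokens[0] == "density":
            values = []
            for tok in tokens[2:]:
                try:
                    values.append(float(tok))
                except ValueError:
                    pass  # e.g. a trailing "\" left by terminal wrapping
            if any(v <= eps for v in values):
                hits.append((mgau, feat, int(tokens[1])))
    return hits


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): printp -gaufn variances > variances.txt
    #                                 python find_zero_var.py variances.txt
    with open(sys.argv[1]) as handle:
        for mgau, feat, density in zero_variance_densities(handle.read()):
            print("mgau %d feat %d density %d" % (mgau, feat, density))
```

Raising `eps` slightly above zero would also catch variances that are suspiciously small rather than exactly zero.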
I also tried including the force alignment steps in the training procedure but this did not have any effect.
Any advice would be greatly appreciated and I can provide further information if it will be helpful.
--- (Edited on 2/9/2010 5:52 am [GMT-0600] by joebob) ---
--- (Edited on 2/9/2010 6:03 am [GMT-0600] by joebob) ---
It's easy to find in the mdef file which phone, in which context, gau 141 belongs to. Then you need to find out why this context is not well represented in the training data. Fisher by default has too many fillers. Another possible mistake could be in trimming the silence around each utterance; there must be about 0.2 s of silence.
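For reference, the mdef lookup can be scripted. This is a rough sketch, not part of SphinxTrain: it assumes the text-format mdef, where each phone line reads `base lft rt p attrib tmat <state ids> N`, and that with continuous models the mgau index matches the senone ID. The name `phones_using_senone` is hypothetical.

```python
def phones_using_senone(mdef_text, senone_id):
    """Return (base, left, right, position, state_index) for each phone
    or triphone whose state list contains senone_id.

    Assumes text-format mdef lines of the form
        base lft rt p attrib tmat id0 id1 id2 N
    Comment lines (starting with '#') and short header lines are skipped.
    """
    matches = []
    for line in mdef_text.splitlines():
        tokens = line.split()
        if len(tokens) < 7 or tokens[0].startswith("#"):
            continue
        base, left, right, position = tokens[:4]
        # State IDs follow the tmat column; the trailing "N" marks the
        # non-emitting final state and is dropped by the isdigit() filter.
        state_ids = [int(t) for t in tokens[6:] if t.isdigit()]
        if senone_id in state_ids:
            matches.append(
                (base, left, right, position, state_ids.index(senone_id))
            )
    return matches


if __name__ == "__main__":
    import sys
    # Usage: python find_senone.py path/to/mdef 141
    with open(sys.argv[1]) as handle:
        for entry in phones_using_senone(handle.read(), int(sys.argv[2])):
            print("%s %s %s %s state %d" % entry)
```

Once the phone and context are known, counting how often that context occurs in the training transcripts shows whether it is simply under-represented.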
--- (Edited on 2/9/2010 15:21 [GMT+0300] by nsh) ---
Hi,
Thanks for the quick reply. I've already removed the superfluous fillers that aren't represented in my subset, so that is probably OK.
It sounds like the issue is most likely the silence padding on either end of the utterance. I have just used the transcript timing information exactly as-is, so this is almost certainly truncating things in some inappropriate places. Also, some of the segments are extremely short and occasionally contain just an 'um' or an 'oh yeah'. I had thought about merging these based on conversational turns rather than strict adherence to the transcripts, but strictly adhering made for simpler scripts. Looks like I may have to revise that approach.
Is it sufficient to use sox to pad each of the utterance segments with 0.2s silence? I've never tried that and wonder if it is acceptable?
Is it normal for this not to raise error issues during the training process?
--- (Edited on 2/9/2010 6:57 am [GMT-0600] by joebob) ---
> Is it sufficient to use sox to pad each of the utterance segments with 0.2s silence? I've never tried that and wonder if it is acceptable?
It's easier to try that than to guess.
> Is it normal for this not to raise error issues during the training process?
It's a minor bug that could be fixed, but first it needs to be reported, of course. When utterances that were not aligned properly are dropped, the training material for some senones can disappear, and that can cause zero variance.
> I've already removed the superfluous fillers that aren't represented in my subset, so that is probably OK.
Then it's still interesting which phone senone 141 belongs to.
--- (Edited on 2/10/2010 10:57 [GMT+0300] by nsh) ---
Hi,
Thanks again for the reply. With respect to the padding I was a bit worried that maybe inadvertently adding zeros to the ends of the file would screw things up. Applying a bit of dithering seems to make that issue moot.
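For anyone wanting to try the same thing without sox, here is a minimal sketch of padding a segment with 0.2 s on each side using very low-level random noise instead of exact zeros. It is an illustration only: it is not the front end's own dithering, it handles 16-bit PCM WAV files only, and the function name and the `amplitude` default are assumptions.

```python
import random
import struct
import wave


def pad_with_dither(src_path, dst_path, pad_seconds=0.2, amplitude=2, seed=0):
    """Copy a 16-bit PCM WAV file, adding pad_seconds of near-silent
    random noise (samples in [-amplitude, amplitude]) at both ends."""
    rng = random.Random(seed)
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        frames = src.readframes(params.nframes)

    # Number of individual samples to add per side (frames * channels).
    n_pad = int(round(pad_seconds * params.framerate)) * params.nchannels

    def noise():
        samples = [rng.randint(-amplitude, amplitude) for _ in range(n_pad)]
        return struct.pack("<%dh" % n_pad, *samples)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # frame count in the header is fixed on close
        dst.writeframes(noise() + frames + noise())
```

Calling `pad_with_dither("seg.wav", "seg_padded.wav")` on an 8 kHz segment adds 1600 frames to each end. Note that padding only helps if the segment boundaries themselves are correct.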
I still cannot decode everything, and although I looked up the zero variance senones, I was unable to find anything particularly suspicious. I'm thinking now that the issue is more basic and has to do with the way I've configured the trainer. I'll provide a response if I suss out the issue.
--- (Edited on 2/10/2010 6:34 am [GMT-0600] by joebob) ---
Hi again,
I sorted out the problem by eliminating the sox commands from my segmentation approach and relying entirely on sph2pipe for the whole process. So instead of running the conversion to wav and then using sox to perform the segmentation according to the transcript, I'm now just running one sph2pipe command for each transcription segment,
$ sph2pipe -p -f rif -t starttime:endtime largefile.sph shortseg.wav
This eliminated the problem, so my guess is that the real issue was a mismatch between the timing sox was using and the timing in the transcript. Anyway, I've now segmented the massive Fisher corpus and am on my way to models comprising a couple of hundred hours of the data.
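A per-segment loop along these lines could be sketched as follows. This is an illustration rather than the script actually used: it assumes transcript lines of the form `start end speaker: text` with `#` comment lines, reuses only the sph2pipe flags shown above, and leaves out anything the post does not show (such as channel selection). The function name and output naming scheme are made up.

```python
def sph2pipe_commands(transcript_text, sph_path, out_prefix):
    """Build one sph2pipe argument list per transcript segment.

    Assumes each segment line looks like "start end speaker: text";
    blank lines, '#' comments, and malformed lines are skipped.
    """
    commands = []
    for line in transcript_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        try:
            start, end = float(fields[0]), float(fields[1])
        except ValueError:
            continue
        if end <= start:
            continue
        out_path = "%s-%03d.wav" % (out_prefix, len(commands) + 1)
        commands.append([
            "sph2pipe", "-p", "-f", "rif",
            # Pass the times exactly as written in the transcript.
            "-t", "%s:%s" % (fields[0], fields[1]),
            sph_path, out_path,
        ])
    return commands


if __name__ == "__main__":
    import subprocess
    import sys
    # Usage: python segment.py transcript.txt largefile.sph out_prefix
    with open(sys.argv[1]) as handle:
        for cmd in sph2pipe_commands(handle.read(), sys.argv[2], sys.argv[3]):
            subprocess.check_call(cmd)
```

Passing the start and end times through as strings keeps them identical to the transcript, which avoids introducing a second source of timing mismatch.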
Thanks again for the responses, they got me going in the right direction.
--- (Edited on 2/11/2010 5:13 pm [GMT-0600] by joebob) ---