Speech Recognition Engines

Problem with silence/fillers using Julius4 forced alignment
User: azeem
Date: 4/23/2013 6:08 am
Views: 5645
Rating: 3

Basically, I am not able to spot silences/fillers with forced alignment in Julius.

 

 

1 Forced Alignment with Julius 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 

1.1 word / phoneme segmentation kit 

====================================

   I started from the word / phoneme segmentation kit that is available
   from the home page of the Julius project:

 

   + [http://sourceforge.jp/projects/julius/downloads/32570/julius4-segmentation-kit-v1.0.tar.gz]

 

   This package contains a Perl script that automatically generates
   the speech grammar files 'tmp.dfa', 'tmp.phndict' and 'tmp.dict' from
   a transcription.  Recognition is then run in Julius with the
   -walign and/or -palign options.

 

 

1.2 What I have done successfully 

==================================

   I used Julius for forced alignment and prepared a script that
   builds a grammar (as the Perl script mentioned above does) for each
   input file from its text transcription.  I obtained very good
   results in both word and phone alignment.

 

1.3 What I am not able to do 

=============================

   Unfortunately, I am not able to implement the following feature:
   I would like Julius to spot silence (and, beyond that, non-verbal
   sounds) that may occur between words (or even between phones).
   I would like to do this without explicitly designing a grammar
   that also contains optional states for silence or filler words.

 

1.4 -iwsp parameter 

====================

   The feature I am looking for seems to be related to the -iwsp option:

 

   $ julius -help

   ...

   "[-iwsp]             insert sp for all word end (multipath)(off)"

   ...

   

   I have tried it, but without the expected results: with input
   audio files containing silences between words, no "sp" was
   detected in the output.

 

2 My question 

~~~~~~~~~~~~~~

  Does anybody know how to spot silences or non-verbal sounds in a
  forced alignment procedure with Julius4, without explicitly
  designing a grammar that includes states associated with
  silence/non-verbal sounds?

 

 

 

3 Test with voxforge acoustic model for English 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  If someone is interested, I can share a test example I'm struggling

  with.  It's about forced alignment for the English language.

 

3.1 Acoustic model 

===================

   I downloaded a pre-built AM, contained in this tarball:

 

   + [http://www.repository.voxforge1.org/downloads/software/julius-3.5.2-quickstart-linux.tgz]

 

   The "tiedlist" file, namely
   julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist
   lacks several triphones, and this can cause Julius to report an
   error (see below).

 

 

3.2 A sample audio and grammar 

===============================

   Here follows a link to a tarball with files to test this issue:

   [https://dl.dropboxusercontent.com/u/10183668/julius_FA/test_julius_fa.tar.gz]

  

3.2.1 Here follows a description of each file contained in that tarball: 

-------------------------------------------------------------------------

    - sample_eng/c31c030s_plus_sil.wav
      This is the sample audio file we used to test Julius forced
      alignment.  It is a sentence taken from the WSJ corpus.  A chunk
      of silence has been added manually.

 

    - sample_eng/c31c030s_plus_sil.txt

      This is the corresponding text file:

      + "he updates his list of things to do today before going home each

        evening"

      The silence has been inserted between the words "today" and "before".

 

    - fa_files/file.dfa, fa_files/file.phndict and fa_files/file.dict

      Files that specify the grammar and dictionary for forced alignment

 

    - output_files/c31c030s_plus_sil.phn and output_files/c31c030s_plus_sil.wrd
      These two files are the output I got from Julius.  There is no
      silence (nor "sp", short pause) inserted between "today" and
      "before".  Instead, the phoneme "ey" in the alignment lasts
      more than a second and a half, spanning the long silence.
      These files have been obtained by converting the actual output
      of Julius from timings in frames to timings in seconds.
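
    The frame-to-seconds conversion mentioned above can be sketched as
    follows.  This is only a sketch under assumptions: a 10 ms frame
    shift (adjust it if the acoustic model uses a different one),
    inclusive begin/end frame indices, and hypothetical names
    (FRAME_SHIFT_MS, frames_to_seconds).  The frame numbers in the
    example are assumed values chosen to be consistent with the first
    two rows of the .wrd file, not copied from the actual Julius output.

```python
FRAME_SHIFT_MS = 10  # assumption: 10 ms frame shift


def frames_to_seconds(segments, frame_shift_ms=FRAME_SHIFT_MS):
    """Convert (begin_frame, end_frame, label) to (start_s, end_s, label).

    Frame indices are assumed to be inclusive, so a segment ends where
    the frame following end_frame begins.
    """
    converted = []
    for begin, end, label in segments:
        start_s = begin * frame_shift_ms / 1000
        end_s = (end + 1) * frame_shift_ms / 1000
        converted.append((start_s, end_s, label))
    return converted


if __name__ == "__main__":
    # Assumed frame indices, for illustration only.
    example = [(0, 31, "sil"), (32, 44, "he")]
    for start_s, end_s, label in frames_to_seconds(example):
        print(start_s, end_s, label)
```

    Working in integer milliseconds before dividing keeps the
    converted times free of accumulated rounding error.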

 

3.3 Missing triphones 

======================

    In the sample sentence I provided there are triphones that are

    missing in the voxforge acoustic model.  One can map some of them

    into other triphones and put them into the voxforge tiedlist file,

    like this:

 

    $ echo "ah-p+d ah-p+ch
      ax-p+d ax-p+t
      b-iy+f b-iy+s
      hh-ix+z hh-ix+s
      iy-f+ao iy-f+ow
      ow-ix+n ow-ix+ng
      p-d+ey n-d+ey
      uw-d+ey ax-d+ey" \
    >> julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist
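
    To find which triphones need such a mapping, one could expand the
    phone sequence of the sentence into triphone names and look them up
    in the tiedlist.  A minimal sketch, assuming the left-center+right
    naming used above and a tiedlist with one "logical [physical]"
    entry per line.  The helper names are hypothetical, and Julius may
    treat context at sentence boundaries and around "sp" differently,
    so the result is only a first approximation.

```python
def to_triphones(phones):
    """Expand a phone sequence into left-center+right triphone names."""
    names = []
    for i, center in enumerate(phones):
        name = center
        if i > 0:
            name = phones[i - 1] + "-" + name
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]
        names.append(name)
    return names


def missing_triphones(phones, tiedlist_lines):
    """Return the triphones of `phones` that have no logical entry in the tiedlist."""
    # The first column of each tiedlist line is the logical model name.
    known = {line.split()[0] for line in tiedlist_lines if line.strip()}
    missing = []
    for name in to_triphones(phones):
        if name not in known and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Toy tiedlist for illustration only, not the real VoxForge file.
    tiedlist = ["n-d+ey", "ax-d+ey", "t+ax", "d-ey"]
    print(missing_triphones(["t", "ax", "d", "ey"], tiedlist))
```

    With the real model, tiedlist_lines would be the lines read from
    the tiedlist file, and phones the concatenated pronunciations
    taken from the dictionary.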

 

 

3.4 The julius command line I used is the following: 

=====================================================

   $ echo sample_eng/c31c030s_plus_sil.wav | julius \
    -h julius-3.5.2-quickstart-linux/acoustic_model_files_build726/hmmdefs \
    -hlist julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist \
    -dfa fa_files/file.dfa \
    -v   fa_files/file.phndict \
    -walign \
    -palign \
    -multipath \
    -spmodel "sp" \
    -iwsp \
    -b 200 \
    -b2 200 \
    -bs 200 \
    -sb 200.0 \
    -gprune safe \
    -iwcd1 max \
    -iwsppenalty -30.0 \
    -input file

 

 

 

 

--- (Edited on 4/23/2013 6:08 am [GMT-0500] by azeem) ---

Re: Problem with silence/fillers using Julius4 forced alignment
User: kmaclean
Date: 4/24/2013 10:28 am
Views: 356
Rating: 3

> I would like Julius to be able to spot silence (and furthermore even non
> verbal sounds) that may occur between words (or even between phones).

First, for better results, use a current nightly build of the VoxForge acoustic models.

Second, if you look at the output of the forced alignment, you should have word and phoneme timestamps.  You should be able to create a script to collect the timestamps from the end of one word to the beginning of the next to give you an idea of where there might be silence.

I used HTK's HVite for forced alignment in this tutorial (on how to segment a speech file).  The word and phoneme timestamps from that tutorial look like this... the sp "short pause" entries correspond to the silence you are looking for.

--- (Edited on 4/24/2013 11:28 am [GMT-0400] by kmaclean) ---

Re: Problem with silence/fillers using Julius4 forced alignment
User: azeem
Date: 4/28/2013 10:18 am
Views: 2439
Rating: 2

@kmaclean: First of all, thank you very much for your advice.

 

> First, for better results, use a current nightly build of the

> VoxForge acoustic models

 

Thanks, I will surely try those!

 

 

> Second, if you look at the output of the forced alignment, you

> should have word and phoneme timestamps.  You should be able to

> create a script to collect the time stamps from the end of one

> word to the beginning of the next to give you an idea of where

> there might be silence.

 

Yes, I have already prepared such a script.  In fact, the output
that I linked in my post includes an example with timestamps (I
realize that my post may be too long and difficult to read, which
hides this info).  I have two versions, .wrd and .phn.  Here follows
the word version.  Note that the word "today" is very long (and
that's the error: a silence should have been spotted right after
that word):

 

0 0.32 sil
0.32 0.45 he
0.45 0.88 updates
0.88 1.04 his
1.04 1.31 list
1.31 1.37 of
1.37 1.75 things
1.75 1.87 to
1.87 2.06 do
2.06 3.98 today
3.98 4.31 before
4.31 4.6 going
4.6 4.88 home
4.88 5.08 each
5.08 5.63 evening
5.63 5.91 sil
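
A check of this kind can be sketched as follows.  It assumes the
"start end word" format shown above, with times in seconds; the 1.0 s
threshold and the function names are arbitrary choices for
illustration.  It reports both the gaps between consecutive words and
the words that last suspiciously long.

```python
def parse_wrd(text):
    """Parse lines of the form 'start end word' into (start, end, word) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            rows.append((float(parts[0]), float(parts[1]), parts[2]))
    return rows


def gaps(rows):
    """Time between the end of each word and the start of the next one."""
    return [(a[2], b[2], round(b[0] - a[1], 2)) for a, b in zip(rows, rows[1:])]


def long_words(rows, max_duration=1.0, skip=("sil", "sp")):
    """Words lasting more than max_duration seconds (the threshold is arbitrary)."""
    return [(word, round(end - start, 2)) for start, end, word in rows
            if word not in skip and end - start > max_duration]


if __name__ == "__main__":
    # Three rows taken from the word alignment above.
    wrd = """\
1.87 2.06 do
2.06 3.98 today
3.98 4.31 before
"""
    rows = parse_wrd(wrd)
    print(gaps(rows))        # every gap is 0.0: the words are contiguous
    print(long_words(rows))  # only "today" exceeds the threshold
```

On this alignment the gap check finds nothing, because each word
starts exactly where the previous one ends; only the duration check
points at "today".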

 

 

> I used HTK's HVite for forced alignment in this tutorial (on how

> to segment a speech file).  The word and phoneme timestamps from

> that tutorial look like this... the sp "short pause" entries

> correspond to the silence you are looking for.

 

Sure, HVite is another option that I want to try.  But in this
thread I would like to troubleshoot Julius, i.e. to investigate
whether it is capable of automatically spotting silence/filler
events in a forced alignment task, without designing a specific
grammar containing those nodes.

 

In particular, the -iwsp option seems to deliver what I am

looking for, and I would like to understand if I am using it

correctly.

 

 

--- (Edited on 4/28/2013 10:18 am [GMT-0500] by azeem) ---
