Vietnamese support
User: daFunkyUnit
Date: 11/18/2009 11:02 am
Views: 16479
Rating: 20

Hello,

I would like to try to create a model that supports the Vietnamese language.  The part I'm currently stuck on is that I don't know where to find a dictionary of all the phonemes used in Vietnamese, or how to create the monophones file.

In the Greek language thread, I see that Ralf has created a variety of dictionaries, but some are "untrainable."  Could Ralf or anyone else elaborate on what that means?

Also, is there any particular MFCC setting that would improve performance for a tonal language?

Thanks for all the help!
-Bao

Re: Vietnamese support
User: nsh
Date: 11/18/2009 4:19 pm
Views: 205
Rating: 19

I'd recommend using


http://home.uchicago.edu/~jkirby/vphon.html

I'd also recommend reading

http://cslu.cse.ogi.edu/people/hosom/pubs/vu-Eurospeech05-VNlvcsr_2005.pdf


As for tonal features (that is, f0 added to the MFCC files), I'd suggest starting without them for now. They would require both custom feature extraction (not everything is adjustable by a parameter) and changes in the recognizer.

 

Convert Ralf's Vietnamese dictionary into Sphinx format
User: ralfherzog
Date: 11/19/2009 1:03 pm
Views: 654
Rating: 22

Hello daFunkyUnit! Here is what you can do:

1. Download/extract Ralf's Vietnamese dictionary (PLS format). Your goal is to transform the dictionary from PLS format into Sphinx format. Take a look at cmudict to see what a dictionary in Sphinx format looks like.
2. Remove the XML tags with Notepad++ (Search/Replace).
3. Convert the eSpeak phonemes into Arpabet phonemes (Search/Replace with Notepad++).

Then you have a Vietnamese pronouncing dictionary (Sphinx format). You can import it into simon as SPHINX lexicon.
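The three steps above could also be scripted instead of done by hand in Notepad++. The following is only a minimal sketch: it assumes the PLS file uses the standard `lexeme`/`grapheme`/`phoneme` elements, and the `ESPEAK_TO_ARPABET` table is a hypothetical placeholder, not the real mapping for Ralf's dictionary. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical eSpeak -> pseudo-Arpabet table (an assumption for illustration).
# Replace it with the mapping you actually decide on.
ESPEAK_TO_ARPABET = {"m": "M", "a": "AE", "n": "N", "N": "NG", "t": "T"}


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}lexeme'."""
    return tag.rsplit("}", 1)[-1]


def split_phonemes(text, table):
    """Greedy longest-match conversion of an eSpeak string into mapped symbols."""
    symbols = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for sym in symbols:
            if text.startswith(sym, i):
                out.append(table[sym])
                i += len(sym)
                break
        else:
            # Fail loudly so that no phoneme is silently omitted.
            raise ValueError(f"unmapped phoneme at {text[i:]!r}")
    return out


def pls_to_sphinx(pls_xml, table=ESPEAK_TO_ARPABET):
    """Return Sphinx-style lines ('word PH1 PH2 ...') from PLS XML text."""
    root = ET.fromstring(pls_xml)
    lines = []
    for lexeme in root.iter():
        if local_name(lexeme.tag) != "lexeme":
            continue
        word = pron = None
        for child in lexeme:
            name = local_name(child.tag)
            if name == "grapheme" and word is None:
                word = (child.text or "").strip()
            elif name == "phoneme" and pron is None:
                pron = (child.text or "").strip()
        if word and pron:
            lines.append(word + " " + " ".join(split_phonemes(pron, table)))
    return lines
```

To use it, read the PLS file as bytes, pass the content to `pls_to_sphinx`, and write the returned lines to a text file. The sketch raises an error on any phoneme missing from the table, which makes gaps in the mapping visible early.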

Once you have converted the dictionary into Sphinx format, you can try to use it for training (that means: you can record a word with simon; for the further steps, read the simon handbook). Maybe it will work out for you, and simon will be able to recognize a few words in Vietnamese. You have to try.

So you need a Sphinx dictionary for training (not a PLS dictionary). Ralf's Vietnamese dictionary can be imported as PLS dictionary. But during import, lots of phonemes are being omitted.

Greetings,
Ralf

Re: Convert Ralf's Vietnamese dictionary into Sphinx format
User: daFunkyUnit
Date: 11/23/2009 3:01 am
Views: 218
Rating: 18

I'm in the process of converting the XML PLS dictionary into SPHINX/HTK format, and going from eSpeak to ARPABet.  However, does the ARPABet phoneme system support different tonal pronunciations?  Otherwise words that only differ in tone will get mapped to the same phonemes. 

For example, the following words all have different meanings:

ma, mà, má

All three will end up in the phonetic dictionary as "M AE".

(pseudo-)Arpabet should have 21 consonants for Hanoi pronunciation
User: ralfherzog
Date: 11/23/2009 11:33 am
Views: 378
Rating: 19

Hello daFunkyUnit!

Do the words ma, mà, má have different pronunciations?
If these words had the same pronunciation, the result would be "M AE" for each word.
If they have different pronunciations (= different phonemes), the result should be e.g. "M AE", "M AE1", "M AE2". I just listened to the examples on Wikipedia. Obviously, in your language the different a-vowels are different phonemes.
You have to be creative: it seems that you have to invent your own pseudo-Arpabet phonemes.
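One way to invent such tone-specific phonemes automatically is to read the tone from the diacritic on the written word and append a digit to the vowel phoneme. This is only a sketch: the digits for grave and acute follow the "M AE1" / "M AE2" example above, while the digits for the other three tone marks, and the function names, are assumptions.

```python
import unicodedata

# Combining tone marks of Vietnamese orthography mapped to a tone digit.
# "1" and "2" follow the example above; "3" to "5" are an assumed numbering.
TONE_MARKS = {
    "\u0300": "1",  # grave accent
    "\u0301": "2",  # acute accent
    "\u0309": "3",  # hook above
    "\u0303": "4",  # tilde
    "\u0323": "5",  # dot below
}


def tone_suffix(word):
    """Return the tone digit of a syllable, or '' for the unmarked tone."""
    # NFD separates base letters from combining marks. Vowel-quality marks
    # (circumflex, breve, horn) are not in TONE_MARKS and are skipped.
    for ch in unicodedata.normalize("NFD", word):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return ""


def tag_vowels(word, phonemes, vowels):
    """Append the word's tone digit to every vowel phoneme."""
    suffix = tone_suffix(word)
    return " ".join(p + suffix if p in vowels else p for p in phonemes)
```

For example, `tag_vowels("má", ["M", "AE"], {"AE"})` gives `"M AE2"`, while the unmarked `ma` stays `"M AE"`. The sketch handles one syllable at a time and tags every vowel in it, which multiplies the number of vowel phonemes by up to six, so each of them needs enough training data.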

The Vietnamese eSpeak phonemes are just a guess. If you find major mistakes, you should hand-correct them.

Can you please post a link to your Vietnamese Sphinx dictionary? I would like to import it into simon (like I imported a Polish Sphinx dictionary).

If you want Hanoi pronunciation, your (pseudo-)Arpabet should have 21 consonants.
If you want Ho Chi Minh City (Saigon) pronunciation, your (pseudo-)Arpabet should have 22 consonants.

You have to catch the major characteristics of your language (= phonemes). It is not necessary that you catch the details (= phones). The Vietnamese language seems to be totally different from the English language. But that shouldn't be a problem if you invent your own pseudo-Arpabet phonemes.
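To check which phonemes a draft dictionary actually uses (for instance, whether the consonant count matches the 21 or 22 mentioned above), the phoneme inventory can be collected from the dictionary itself. A minimal sketch, assuming one entry per line in the form `word PH1 PH2 ...`; the function name is made up for illustration.

```python
def collect_phonemes(dict_lines):
    """Return the sorted set of phoneme symbols used in a Sphinx-style lexicon."""
    phonemes = set()
    for line in dict_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and entries without a pronunciation
        # The first field is the word (possibly 'word(2)'); the rest are phonemes.
        phonemes.update(fields[1:])
    return sorted(phonemes)
```

The resulting list can also serve as a starting point for a monophones list, since it contains each phoneme symbol exactly once.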

Greetings,
Ralf

Re: (pseudo-)Arpabet should have 21 consonants for Hanoi pronunciation
User: daFunkyUnit
Date: 12/4/2009 3:54 am
Views: 1958
Rating: 420

Hi Ralf!


How were you able to sort your dictionary?  The order of the dictionary works with all the HTK utilities.  However, there are some Perl scripts (like the one that converts your prompts to a word list) that do not sort according to the HTK order.  I was wondering if you had some sort of script that sorts in the order that works with HTK.

I've created a rough initial draft of the dictionary, although it currently does not differentiate the tones, i.e. does not have unique phonemes for each tonal pronunciation.

http://www.raymondkwan.com/uploader2/files/1412/viet_lexicon.zip

Re: (pseudo-)Arpabet should have 21 consonants for Hanoi pronunciation
User: daFunkyUnit
Date: 12/8/2009 6:31 pm
Views: 208
Rating: 20

Updated lexicon w/ tonal information.  Based on Northern (Hanoi) dialect.

 

http://www.raymondkwan.com/uploader2/files/1414/viet_lexicon_hanoi.zip

$ sort viet_lexicon > viet_lexicon_sorted
User: ralfherzog
Date: 12/11/2009 3:11 am
Views: 707
Rating: 20

Hello daFunkyUnit!

Fine. I was able to import the Hanoi dictionary into simon.

You can sort the dictionary with the sort command (e.g. Ubuntu terminal): $ sort viet_lexicon > viet_lexicon_sorted

I don't know whether this solves your problem. It is great to see that you are on the right path. Your dictionary seems to be fine (at least for the use with simon).
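One caveat: the order produced by `sort` depends on the locale, so different tools can disagree on where accented words go. Running `LC_ALL=C sort` gives plain byte order. Whether byte order is exactly what your HTK setup expects is an assumption you should verify. The same byte-order sort can be sketched in Python (the function name is made up for illustration):

```python
def sort_lexicon_lines(lines):
    """Sort lexicon lines by their UTF-8 bytes, dropping blank lines."""
    entries = [line.rstrip("\n") for line in lines if line.strip()]
    # Comparing the encoded bytes matches the order of 'LC_ALL=C sort'.
    return sorted(entries, key=lambda s: s.encode("utf-8"))
```

In byte order, uppercase ASCII letters come before lowercase ones, and accented letters come after all unaccented ASCII letters.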

Re: (pseudo-)Arpabet should have 21 consonants for Hanoi pronunciation
User: leomessi
Date: 3/9/2011 11:31 am
Views: 1163
Rating: 21

I am interested in your work. However, the link above is dead. Can anyone upload the hanoi_lexicon?

Thanks!!

Re: (pseudo-)Arpabet should have 21 consonants for Hanoi pronunciation
User: bela374
Date: 6/22/2013 7:48 am
Views: 769
Rating: 11

The phonemes are just a guess. If you find major mistakes, you should hand-correct them. 
