Dutch

Dutch - Nederlands
User: Robin
Date: 8/12/2007 9:01 am
Views: 24621
Rating: 44

Hey everyone,

I want to start the Dutch section on VoxForge. As far as I know, there isn't an open source phonetic dictionary available. There is an open source word list available at www.opentaal.nl, which is also used to create spell checker files for OpenOffice.org and Firefox.

My plan A is to use this list to create a phonetic dictionary using text replacements (Dutch is fairly consistent in its pronunciation).
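A minimal sketch of what such a replacement-based approach could look like in Python. The rule table and the phoneme symbols are made-up placeholders for illustration, not a vetted Dutch rule set, and the function names are hypothetical:

```python
# Hypothetical, incomplete rule table: (spelling, phoneme symbols).
# The symbols are illustrative placeholders, not a real Dutch phone set.
RULES = [
    ("sch", "s x"),
    ("ch", "x"),
    ("oe", "u"),
    ("ij", "EI"),
    ("oo", "o:"),
    ("aa", "a:"),
]


def to_phonemes(word, rules=RULES):
    """Rewrite a spelling to phonemes, left to right, longest rule first."""
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for spelling, phone in ordered:
            if word.startswith(spelling, i):
                phones.append(phone)
                i += len(spelling)
                break
        else:
            # No rule matched: pass the letter through so gaps stay visible.
            phones.append(word[i])
            i += 1
    return " ".join(phones)


def build_dictionary(words):
    """Map every word in a word list to its rule-based transcription."""
    return {word: to_phonemes(word) for word in words}
```

Trying longer spellings first keeps "sch" from being split into "s" plus "ch". Letters without a rule are copied as-is, which makes missing rules easy to spot in the output.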

Should anyone know where to get an open source Dutch phonetic dictionary, stop me before it's too late.

Robin 

Re: Dutch - Nederlands
User: nsh
Date: 8/12/2007 9:15 am
Views: 420
Rating: 38

There is a Dutch Festival voice:

http://nextens.uvt.nl/

You can extract a dictionary from it.

Re: Dutch - Nederlands
User: Robin
Date: 8/12/2007 9:57 am
Views: 348
Rating: 38

Thanks for the link, I'll give it a try.

Robin 

Re: Dutch - Nederlands
User: Robin
Date: 8/13/2007 8:28 am
Views: 410
Rating: 40

This is indeed a good starting point. For testing it will be enough, but the licence is probably not suitable for releasing models under the GPL. At the NeXTeNS project they're not completely sure which licence applies, but it's safe to assume that it's not 'open' enough: many people and institutions put work into it, so it's unlikely that they would all agree to an open licensing scheme.

So for the longer term I'll work on a new dictionary.

Re: Dutch - Nederlands
User: nsh
Date: 8/15/2007 4:02 pm
Views: 369
Rating: 35

About rules for Dutch: they are available from the espeak synthesizer (now under the GPL). But remember that even for simple languages, rules work _very_ badly. It's a much better idea to bootstrap a decision tree from an existing dictionary or from hand-made data and then use it as an initial markup. (No idea about Dutch though; it may be a simple language, but the only simple language I know is Spanish.)

 

Re: Dutch - Nederlands
User: Robin
Date: 8/15/2007 6:05 pm
Views: 355
Rating: 35

And here I was thinking that Russian was even better than Spanish when it comes to pronunciation rules!

Thanks for the tip about espeak and the warning! I will check out the quality of espeak's Dutch and the rules for sure!

Re: Dutch - Nederlands
User: kmaclean
Date: 8/15/2007 10:15 pm
Views: 366
Rating: 35

Hi Robin,

I think this file might be what you are looking for (from the IFA Spoken Language Corpora; GPL license):

TwenteCorpusContextDist.txt.bz2 (28M, dated 29-May-2006 16:08)
 

You need to scroll down a bit in the document, but it looks like a Dutch pronunciation dictionary (I can't say for certain because I can't read Dutch ...).  You'll need to clean it up a bit for use with a speech recognition engine (Perl is good for something like that - let me know if you need help).
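As a rough idea of what such a clean-up pass could look like, here is a sketch in Python rather than Perl. The input layout (one "word<TAB>phones" entry per line) is an assumption for illustration only and has not been checked against the actual file; the function name is hypothetical too:

```python
def clean_entries(lines):
    """Yield normalized 'word ph1 ph2 ...' lines.

    Assumes (unverified) raw lines of the form 'word<TAB>phones'.
    """
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip anything that doesn't match the assumed layout
        word = parts[0].strip().lower()
        phones = " ".join(parts[1].split())  # collapse runs of whitespace
        if not word or not phones or (word, phones) in seen:
            continue  # skip empty fields and exact duplicates
        seen.add((word, phones))
        yield f"{word} {phones}"
```

Different pronunciations of the same word are kept, since only exact duplicates are dropped. How alternates have to be marked depends on the recognition engine.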

They also have audio:

Ken 

 

Re: Dutch - Nederlands
User: Robin
Date: 8/16/2007 7:24 am
Views: 340
Rating: 27

It's amazing, because I inquired at IFA, but they've apparently done more good work than they themselves can keep track of.

I did know about the corpus (and told you about it too, but that was a while ago), but I wasn't focusing on it because the dictionary seemed more pressing. Seems like we're all set for a good start!

I'll read a bit about Perl and will get in touch when I need pointers.

Re: Dutch - Nederlands
User: kmaclean
Date: 8/16/2007 7:43 am
Views: 352
Rating: 34

With respect to scripting languages, Perl seems to be the main language used in the speech recognition domain, which is why I started using it and which explains my current bias towards Perl.

I am told that Python is easier to learn, and Ruby has been gaining quite a bit of traction in the open source development community. Any of Perl, Python, or Ruby should be able to do what you are looking to do.

Ken 

Re: Dutch - Nederlands
User: nsh
Date: 11/5/2007 10:52 am
Views: 346
Rating: 38

Hi, I've just made a Dutch model for sphinx3 from the IFA corpus. A sphinx2 or pocketsphinx model can be made too, but I've had no time yet. The helper files and the model itself can be downloaded from:
 
http://www.mediafire.com/download.php?b2juwvounye 
 
A few issues still exist:
 
1. We need testing data, in particular a language model. To create one I need a lot of Dutch text.
 
2. I stripped around 80% of the database because of 5000 OOV words; CELEX seems to be missing a lot of important data. This has to be fixed.
 
3. There are still some bad transcriptions; Sphinx reports them as ERRORS.
 
4. It would be nice to use the hand-made segmentation as well; that would greatly improve the WER.
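
For issue 2, a small Python sketch of how the OOV words could be listed by frequency, so the most common ones get added to the dictionary first. This is not the script used for the model above; the tokenization and the function names are assumptions for illustration:

```python
import re
from collections import Counter

# Assumed tokenization: runs of letters, optionally joined by ' or -.
TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def oov_counts(transcripts, dictionary_words):
    """Count transcript words that are missing from the dictionary."""
    known = {word.lower() for word in dictionary_words}
    counts = Counter()
    for line in transcripts:
        for token in TOKEN.findall(line.lower()):
            if token not in known:
                counts[token] += 1
    return counts


def usable_utterances(transcripts, dictionary_words):
    """Return only the transcript lines without any OOV word."""
    return [
        line for line in transcripts
        if not oov_counts([line], dictionary_words)
    ]
```

Calling `oov_counts(...).most_common()` gives the missing words sorted by how often they occur, which shows how much data each new dictionary entry would win back.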

It would also be nice to commit this to the Dutch part of VoxForge.
