Dutch

Re: Dutch - Nederlands
User: kmaclean
Date: 11/5/2007 2:06 pm
Views: 448
Rating: 58

Hi nsh,

>Hi, I've just made a Dutch model for sphinx3 from IFA corpus.

awesome, thanks!

>Also it would be nice to commit this to Dutch part of voxforge. 

I've added the acoustic model to the Dutch svn repository, and given you access to it too (same password as your Russian svn access).

Ken 

Re: Dutch - Nederlands
User: nsh
Date: 11/5/2007 3:24 pm
Views: 434
Rating: 42

Thanks. Actually, it would be nice to get Robin's word on this. A lot of work still remains, and the language model is the most important part.

Re: Dutch - Nederlands
User: Robin
Date: 11/6/2007 9:06 am
Views: 395
Rating: 35

Hi nsh,

>I've just made a Dutch model for sphinx3 from IFA corpus.

That's great! I definitely wasn't expecting such good news when I read this.


>A few issues still exist:
>1. We need testing data, in particular a language model. To create one I need a lot of Dutch texts.

I have never come across a copyright-free collection of texts, but under Dutch law, legislation and court decisions are free of copyright.  I am not sure whether there are any restrictions on your side regarding format etc. for the creation of the model.

In theory we could crawl www.rechtspraak.nl and use all rulings without any legal issues.  Let me know what you think of this possibility.

>2. I stripped around 80% of the database due to 5000 OOV words; celex seems to miss a lot of important data. This has to be fixed.

What is celex?  And what dictionary did you use?  I am assuming that choice of dictionary has a large influence on OOV words, right?  There is a dictionary (GPL) in svn and it's quite large, so perhaps that one is better than the one you used?

Regarding your other two points I will send you my opinion by e-mail.

Robin

Re: Dutch - Nederlands
User: nsh
Date: 11/6/2007 2:44 pm
Views: 387
Rating: 35

>In theory we could crawl www.rechtspraak.nl and use all rulings without any legal issues.  Let me know what you think of this possibility.

Hm, old books are usually free, so a classical library could be a good source. Although it's not a primary goal, you can test the acoustic model with a finite state grammar, for example.

>What is celex?  And what dictionary did you use?  I am assuming that choice of dictionary has a large influence on OOV words, right?  There is a dictionary (GPL) in svn and it's quite large, so perhaps that one is better than the one you used?

Celex is here:

http://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFAcorpus/SLcorpus/scripts/CelexLexicon.txt.gz

Well, I'll look at the other dictionary too; it probably makes sense to use it instead. OOV words are usually meaningless fillers like uuhm, uuh, zoo. Unfortunately they occur rather often.

Here is the full list of OOV words in transcriptions:

http://www.dev.voxforge.org/projects/Dutch/browser/Trunk/AcousticModels/etc/celex.missing.words 
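
For anyone who wants to reproduce such a list, here is a minimal sketch of counting OOV words by frequency. It is not the script used for this model. It assumes a dictionary with one entry per line and the word as the first whitespace-separated token, and plain-text transcriptions with one utterance per line; the function names `load_dictionary_words` and `count_oov` are made up for illustration, and the sample entries are toy data.

```python
from collections import Counter


def load_dictionary_words(lines):
    """Collect the first whitespace-separated token of each non-empty line.

    Assumption: one entry per line, word first. Alternate-pronunciation
    markers such as "word(2)" are NOT handled here.
    """
    words = set()
    for line in lines:
        parts = line.split()
        if parts:
            words.add(parts[0].lower())
    return words


def count_oov(transcript_lines, dictionary_words):
    """Count how often each out-of-vocabulary token occurs in the transcripts."""
    counts = Counter()
    for line in transcript_lines:
        for token in line.lower().split():
            if token not in dictionary_words:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    # Toy data; the phone strings are placeholders, not real pronunciations.
    dictionary = load_dictionary_words(["huis h ui s", "boom b o m"])
    oov = count_oov(["het huis uuhm", "uuhm de boom"], dictionary)
    for word, n in oov.most_common():
        print(word, n)
```

Sorting by count shows whether the misses are mostly frequent fillers or ordinary words that a larger dictionary would cover.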

Re: Dutch - Nederlands
User: Robin
Date: 11/7/2007 10:37 am
Views: 407
Rating: 34

> Hm, old books are usually free, so a classical library could be a good source. Although it's not a primary goal, you can test the acoustic model with a finite state grammar, for example.

In that sense we have similar copyright rules as other countries (70 years after the death of the author the copyright expires).  However, Dutch is a very dynamic language and it has changed a lot (both in spelling and in word order).  In my opinion you will get better prediction using current formal language (rulings) than with old prose.

I am not familiar with the process of making a language model for Sphinx.  Do you require lots of 'relatively small' text files (I mean unformatted .txt files with hundreds or several thousands of words)?  Does this matter at all?  If I know the constraints (if any) I can think about the best option.

> Well, I'll look at the other dictionary too; it probably makes sense to use it instead. OOV words are usually meaningless fillers like uuhm, uuh, zoo. Unfortunately they occur rather often.

I saw a lot of normal words in there too (say 30-40%); if those are also used frequently, then a bigger dictionary will help a lot, I think.  Let me know if you get better results with our dictionary.

Of the remainder, many seem to be typos, so I guess those could be corrected as well, though that will take a lot more time.

Robin

Re: Dutch - Nederlands
User: nsh
Date: 11/7/2007 11:21 am
Views: 404
Rating: 39

> Let me know if you get better results with our dictionary.

I've just moved to the Twente corpus. It's much bigger, so the problem with the dictionary seems to be solved. But now we have much more training data, so training will take a lot of time :(

> I am not familiar with the process of making a language model for Sphinx.  Do you require lots of 'relatively small' text files (I mean unformatted .txt files with hundreds or several thousands of words)?  Does this matter at all?  If I know the constraints (if any) I can think about the best option.

It doesn't matter at all. Actually, since Dutch is a language with rather active morphology, it won't even help. So one big text of over 10 MB will probably be enough. Of course it can be concatenated from smaller chunks. Significant investigation will be required in the future.
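
Concatenating the chunks could look roughly like the sketch below. It assumes the collected texts are UTF-8 `.txt` files in a single directory; the helper name `build_corpus` and the paths in the usage note are hypothetical, and no text normalization (lowercasing, punctuation stripping, sentence markers) is attempted.

```python
from pathlib import Path


def build_corpus(source_dir, output_file, pattern="*.txt"):
    """Concatenate all matching text files into one file; return its size in bytes."""
    output_path = Path(output_file).resolve()
    with open(output_path, "w", encoding="utf-8", newline="\n") as out:
        for path in sorted(Path(source_dir).glob(pattern)):
            if path.resolve() == output_path:
                continue  # never read back the file being written
            text = path.read_text(encoding="utf-8").strip()
            if text:  # skip empty files
                out.write(text + "\n")
    return output_path.stat().st_size
```

A call such as `build_corpus("rulings", "corpus.txt")` returns the size of the combined file, which can then be compared against the rough 10 MB target mentioned above.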

Re: Dutch - Nederlands
User: nsh
Date: 11/10/2007 3:03 pm
Views: 348
Rating: 44

Ken: I've updated the Dutch model, and it's now ready for testing. Could you please add a packed AcousticModel from the Dutch trunk to http://www.voxforge.org/home/downloads?

Re: Dutch - Nederlands
User: kmaclean
Date: 11/11/2007 8:34 pm
Views: 422
Rating: 34

Hi nsh,

Done. The link is now set up on VoxForge Downloads and points to this file:

DutchAcousticModel.zip  11-Nov-2007 21:20  22.2M

Note that it is not set up to update automatically from the Dutch svn repository if there are changes.  Let me know if this is what you were looking for.

Ken 

Re: Dutch - Nederlands
User: nsh
Date: 11/12/2007 3:25 am
Views: 2739
Rating: 44

Great, thanks a lot.

Re: Dutch - Nederlands
User: kkoenen
Date: 3/12/2016 3:38 pm
Views: 329
Rating: 0

Just finished setting up an Android app with Dutch speech recognition. Although not perfect, it works well enough for my purpose. I use pocketsphinx with these files (dictionary and language model): http://osdn.jp/projects/sfnet_cmusphinx/downloads/Acoustic%20and%20Language%20Models/Dutch%20Voxforge/voxforge-nl-0.1.tar.gz/
