VoxForge
Ken has created a Subversion repository instance for Spanish in the following directory.
http://www.dev.voxforge.org/svn/es/
I have finally managed to upload all the files from my training directory.
The directory contains the HTK/Julian training setup with all the voices that VoxForge has for Spanish.
I have described a few things about speech recognition software and the training process on my blog (it is written in Spanish) (http://ubanov.wordpress.com/2008/11/28/reconocimiento-de-voz-en-castellano/).
Hello ubanov! Hello everyone!
It is great that you write about speech recognition software in your blog.
I just took a look at the Spanish pronunciation dictionary. On my computer (Win XP, Firefox) there are some unreadable characters. We have similar problems in German with characters like "ä, ö, ü, ß".
The special characters of the Spanish language may be displayed correctly on your own personal computer. But please keep in mind that other people like me may experience problems with the special characters. Probably you know about this problem. If not, please read this article.
Greetings, Ralf
Hi Ralf,
>On my computer (Win XP, Firefox) there are some unreadable characters.
I think this might be a problem with Subversion's web front-end, because Trac displays the exact same file correctly: voxforge_lexicon_spanish
There must be a default setting in Subversion somewhere that I need to change...
Ken
Hi Ralf,
I guess I spoke too soon. As you indicated, there is a problem with displaying the characters in voxforge_lexicon_spanish (the prompts file has the same problem; see ticket #441 for details). I think it is because the file uses ISO-8859-1 rather than UTF-8 encoding.
Workaround: if you view the file as ISO-8859-1, it displays correctly.
In Firefox, go to:
View > Character Encoding
then select Western (ISO-8859-1) so that the page displays properly.
Ubanov: if you can figure out how, it is best to use UTF-8 encoding for the files you upload to VoxForge. It saves a lot of headaches down the road.
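One way to find out which files still need attention is to test whether their bytes decode as UTF-8. The following is only a minimal Python sketch; the function name guess_encoding is hypothetical and not part of any VoxForge tooling. It cannot prove that a file is ISO-8859-1, only that it is not valid UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'not-utf-8'."""
    try:
        data.decode("ascii")
        return "ascii"  # pure ASCII is identical in ISO-8859-1 and UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # A lone ISO-8859-1 byte such as 0xF1 (ñ) is not a valid UTF-8 sequence.
        return "not-utf-8"


if __name__ == "__main__":
    import sys

    # Usage: python check_encoding.py file1 file2 ...
    for name in sys.argv[1:]:
        with open(name, "rb") as handle:
            print(name, guess_encoding(handle.read()))
```

Files reported as not-utf-8 are the candidates for conversion; files reported as ascii need no change.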
thanks,
Ken
Hello Ken,
Thanks for your answer. I just inserted a direct hyperlink to the Spanish VoxForge dictionary in the "how to install Simon under Ubuntu" guide. It would be great if someone would try Simon with this Spanish pronunciation dictionary (and report the results)! It would be interesting to know whether the Spanish VoxForge dictionary is Simon-compatible or not.
This is exactly what I have in mind: saving a lot of headaches! It took me a very long time to find out that there is a problem with the character encoding. How is it possible to train a speech model when the character encoding is wrong? ISO-8859-1 is still a very common standard, but we should try to switch completely to UTF-8. When I change the display mode to ISO-8859-1, the Spanish special characters are displayed correctly. Thanks for the tip, Ken.
UTF-8 is the way to go: Wikipedia, WordPress.com and eBay.com, for example, all use UTF-8. I encourage everyone to use UTF-8 instead of ISO-8859-1.
Greetings, Ralf
Hi,
As you asked, I have updated the dictionary to UTF-8 format. To do the conversion I used a simple program that I wrote (it is the filtroiso88591toutf8.c program, stored in the svn programas directory).
I have updated some other files too, but I may have missed some. If anyone finds a file that has not been converted, send me a mail and I will change it.
I used ISO-8859-1 because it is the default in a Spanish-language Debian 4.0 installation. When I connect to my Debian machine with PuTTY, the default is ISO-8859-1 too. If UTF-8 is preferable, that is all right with me :-)
Ralf: I will test Simon in Spanish one of these days and let you know how it goes.
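For anyone who wants to convert a leftover file themselves, the conversion can be sketched in a few lines of Python. This is not the C program mentioned above, only an illustration of the same idea; the function names latin1_to_utf8 and convert_file are made up for this example.

```python
from pathlib import Path


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    # Every byte value is a valid ISO-8859-1 character, so decoding never fails.
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Read src as ISO-8859-1 and write the UTF-8 result to dst."""
    Path(dst).write_bytes(latin1_to_utf8(Path(src).read_bytes()))
```

Because the decode step never fails, running this on a file that is already UTF-8 silently double-encodes the accented characters, so it should only be applied to files known to be ISO-8859-1.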
Regards.
Hello!
Which standard should we use? ISO-8859-1 or UTF-8? Well, it is not easy to find an answer. The Spanish Debian distribution may use ISO-8859-1. And even the CMU Sphinx website is encoded in ISO-8859-1. Obviously, they don't care about UTF-8. But from my point of view, they should.
I just downloaded the German prompts (658k). And when I look at the unpacked prompts with my text editor Notepad++, I see a lot of garbage when it comes to the German special characters (ä,ö,ü,ß). The reason for this crap is the mixed use of different character encodings. In my opinion, it doesn't make sense to train the acoustic/language/speech model when the German special characters are garbage.
So we have to take care of that problem. eBay migrated from Latin 1 (I think that this is the same as ISO-8859-1) to UTF-8. In Spanish as well as in German they used Latin 1, and migrated to UTF-8.
A reason why eBay migrated was to solve cross-border trade impediments (PDF). I think that we have an analogous problem in speech recognition development. For example, I created and uploaded lots of German prompts (unfortunately using mixed character encodings). And nsh compiled my prompts (and the corresponding audio files) into a speech model. And the result is not yet usable because of the character encoding issues (the special German characters are crap). It won't be easy to fix that. Maybe we should try and use your script filtroiso88591toutf8.c.
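A prompts file with mixed encodings cannot simply be converted in one pass, because the lines that are already UTF-8 would be damaged. One possible approach is to decide line by line. The Python sketch below assumes that every line is either valid UTF-8 or ISO-8859-1; that assumption, and the function names, are mine and have not been checked against the actual German prompts.

```python
def decode_line(raw: bytes) -> str:
    """Decode one line, preferring UTF-8 and falling back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        return raw.decode("iso-8859-1")


def normalize_to_utf8(data: bytes) -> bytes:
    """Return the content with every line re-encoded as UTF-8."""
    lines = data.split(b"\n")
    return "\n".join(decode_line(line) for line in lines).encode("utf-8")
```

A line that mixes both encodings within itself would be treated entirely as ISO-8859-1, so the result is still worth checking by eye.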
Sorry for writing so much.
I hope that you won't lose too much time because of encoding issues.
Thanks in advance for testing Simon with the Spanish dictionary.
Greetings, Ralf
Hi,
I would like to help you. The only problem is that I don't have a conversion table from ISO-8859-1 to UTF-8.
What I did this morning was to look up the characters áéíóúñ in ISO-8859-1 and in UTF-8. To help you, I would need to know which characters you use in ISO-8859-1 and their UTF-8 equivalents (if you give me two files with all the characters, I will do the rest). Do you use ISO-8859-1 or another ISO-8859-x?
Maybe you could write the characters you need to convert in a reply to this message (that gives me the ISO characters), and then I will try to create the UTF-8 file.
I now realize that I missed some characters (ï and ü) in my conversion of the lexicon dictionary...
I hope I can help you, but I need a little help first.
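For what it is worth, ISO-8859-1 specifically should not need a lookup table: each byte value is also the Unicode code point of the character, so bytes below 0x80 stay as they are and every other byte becomes two UTF-8 bytes by a fixed rule. The Python sketch below shows that rule; the function name is hypothetical, and the rule does not hold for the other ISO-8859-x variants.

```python
def latin1_byte_to_utf8(b: int) -> bytes:
    """Convert one ISO-8859-1 byte value (0..255) to its UTF-8 bytes."""
    if b < 0x80:
        return bytes([b])  # ASCII range is unchanged
    # Two-byte form: 110000xx 10xxxxxx
    return bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)])


if __name__ == "__main__":
    # Print the byte values for the Spanish and German characters in question.
    for ch in "áéíóúñïüäöß":
        iso = ord(ch)
        print(ch, format(iso, "02X"), latin1_byte_to_utf8(iso).hex(" ").upper())
```

With this rule, characters such as ï and ü are covered automatically, since nothing is listed by hand.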
Regards.
Hi Ken,
In the train/wav subdirectories I have uploaded the sounds again, converting the stereo WAV files to mono (using sox -c 1 fichero.wav ficherosal.wav) and changing the prompts files to UTF-8 characters. I have uploaded the ubanov*, buhochileno4 and txita1 directories.
Ken, maybe you could upload the files to the Spanish voice repository, so that they can be downloaded from the Listen option of VoxForge.
Another thing: I am going to add a note about the encoding to the Spanish Read or Listen page, asking people to use the UTF-8 charset.
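Converting a whole directory tree to mono can be scripted. The Python sketch below simply calls sox once per file; the function names and the directory layout are assumptions for illustration. It puts -c 1 before the output file name, which is where the SoX manual describes the option as requesting a channel change; whether the placement in the command quoted above behaves the same may depend on the SoX version.

```python
import subprocess
from pathlib import Path


def sox_mono_command(src: Path, dst: Path) -> list:
    """Build the sox command line that writes a mono copy of src to dst."""
    return ["sox", str(src), "-c", "1", str(dst)]


def convert_tree(src_root: Path, dst_root: Path) -> None:
    """Convert every .wav file under src_root, mirroring the layout in dst_root."""
    for src in sorted(src_root.rglob("*.wav")):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # check=True stops at the first file that sox fails to convert.
        subprocess.run(sox_mono_command(src, dst), check=True)
```

Calling convert_tree(Path("train/wav"), Path("train/wav_mono")) would leave the originals untouched; the output directory name here is only an example.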
Regards.
Hi Ralf,
>How is it possible to train a speech model when the character encoding is
>wrong?
The use of UTF-8 is really more to get rid of headaches that occur when trying to display international character sets on a web site.
It does not really have much to do with acoustic model training, since Sphinx, Julius/HTK, ... use ASCII internally (which I assume is the reason why the SAMPA computer readable phonetic alphabet was created).
Ken