VoxForge
This might be useful for the creation of a German pronunciation dictionary:
TXT2PHO - a TTS front end for the German inventories of the MBROLA project.
However, the software has some restrictive licensing provisions:
Permission is granted to use this software for non-commercial, non-military purposes, with and only with the lexicon and prosody files made available by the author from the HADIFIX for MBROLA project ...
Not sure if that would apply to pronunciations generated with the toolkit.
Ken
I don't think we can use it.
Using TXT2PHO in order to create a dictionary is close to reading the dictionary it uses (BOMP) directly. And both the dictionary and TXT2PHO itself clearly state they are non-military, which the GPL -- unfortunately -- is not.
Anyway, if we could use it, then we could just as well use BOMP directly.
I've had a first look at Sequitur G2P (which is a trainable g2p-tool) and it's likely that I will be allowed to use another trainable g2p-tool (without name, published in [1]). Thus, I will be able to compare the two and see which performs better.
So, we need some data to bootstrap these trainable systems. I just checked in some tools that extract pronunciations from the German Wiktionary.
The resulting data has to be post-processed before we can use it for bootstrapping. To prioritize that work, we could use the word frequency information from the Wortschatz project, for which a Perl module (EDIT: newer version with fixed frequency extraction) is available.
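A minimal sketch of that prioritization, assuming the extracted pronunciations are available as (word, transcription) pairs and the frequencies as a word-to-count mapping. The function name, the data shapes, and the counts are hypothetical; this is not the interface of the checked-in tools or of the Perl module.

```python
def prioritize(entries, frequencies):
    """Return (word, transcription) pairs, most frequent words first."""
    # Words without a known frequency count as 0 and therefore sort last.
    # sorted() is stable, so equally frequent words keep their input order.
    return sorted(entries, key=lambda entry: -frequencies.get(entry[0], 0))


# Made-up example data, for illustration only.
entries = [("Haus", "haʊ̯s"), ("und", "ʊnt"), ("ist", "ɪst")]
frequencies = {"und": 1000, "ist": 800}

for word, transcription in prioritize(entries, frequencies):
    print(word, transcription)
```

Working through the list in this order means the words that matter most for recognition get checked first, even if the tail of the list is never finished.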
I hope to be able to set up a webtool that helps to post-process the Wiktionary output. Would anyone volunteer to actually use that webtool and help create the dictionary? Ralf, would you be willing (and able) to help?
Cheers!
Timo
[1]: Phonological Constraints and Morphological Preprocessing for Grapheme-to-phoneme Conversion. Vera Demberg, Helmut Schmid and Gregor Möhler, 2007. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07), Prague, Czech Republic, June 2007.
Hi Ralf,
sorry for not getting back to you earlier.
I've set up a dictionary tool on http://www.ling.uni-potsdam.de/~timo/projekte/voxforge.html . The main task is to paste the entries in the first row on the right (Aussprachen) to the corresponding field on the left.
Now, if it was just that, it would be too easy and too boring...
Often, there are far more variants of the word on the left than there are transcriptions. In these cases it would be nice if you could add the missing transcriptions (often it is just a matter of appending -? or -? or whatever).
Sometimes the list on the left contains ridiculous word forms -- just leave the corresponding field empty (or press "Wort entfernen", i.e. "remove word", but the result will be the same). It may also happen that you are asked for the same word more than once (there are different entries for "bin", "ist", "sind" in the Wiktionary, and each entry will ask about all the different sein-forms). If you are sure you've entered a transcription already, then just ignore it the second time.
Sometimes there are actually more transcribed word forms than words on the left. (Or they are different.) Then you can add a word form on the left with "Wort hinzufügen". Note: Often there are different transcriptions for the same word form (?v?ltn?, ?v?lt?n). Usually you would want to pick the form that would be used most in colloquial speech (here: v?ltn?).
Also, there may just be erroneous transcriptions (quite often), where people just guessed how IPA works. It's important that we catch most of these errors. So you might actually want to start out with the Wiktionary Transcription Guideline, which shows how the transcription *should* be.
To enter IPA symbols into the text fields directly, just type the keys listed on the right (for ? type N) and they will automagically be transformed to IPA. (This works in Firefox; I don't have Windows, so I can't check Internet Explorer.)
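The key replacement can be pictured roughly as below. This is only a sketch: it assumes the keys follow X-SAMPA-style conventions (where N stands for the velar nasal), and the table is a small illustrative subset, not the webtool's actual mapping. The names `KEY_TO_IPA` and `keys_to_ipa` are made up, and Python is used here purely for illustration.

```python
# Illustrative subset of an X-SAMPA-style key table (an assumption,
# not the mapping the webtool really uses).
KEY_TO_IPA = {
    "N": "ŋ",
    "S": "ʃ",
    "C": "ç",
    "E": "ɛ",
    "I": "ɪ",
    "O": "ɔ",
    "U": "ʊ",
    "@": "ə",
    "6": "ɐ",
    ":": "ː",
}


def keys_to_ipa(text):
    """Replace each typed key with its IPA symbol; leave other characters alone."""
    return "".join(KEY_TO_IPA.get(ch, ch) for ch in text)


print(keys_to_ipa("zIN@n"))
```

Characters that are not in the table (plain letters such as z or n) pass through unchanged, so ordinary typing keeps working.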
Please enter your e-mail address or another kind of ID into the first text field. This way we can later compare who's the hardest-working transcriber!
Cheers, Timo
clickable link: http://www.ling.uni-potsdam.de/~timo/projekte/voxforge.html
UPDATE: It's important that you transcribe how something would be spoken in colloquial standard German. By the way, what region of Germany are you from? ;-)
Hi Ralph,
>3. Do you plan to release the results coming from the dictionary acquisition
>project under the Pronunciation Lexicon Specification?
You certainly are persistent with respect to PLS :)
Ken