VoxForge
...but that still will not address the underlying problem completely - there are many other "words" which have more than one correct pronunciation (let alone regional dialects...). E.g. English words whose German counterparts have the exact same spelling:
"Personal Computer" vs "Personal Sekretaerin"
abbreviations: some people will read "z.B." as "zetbe", others as "zum Beispiel"
regional dialects: people from the south are more likely to read "richtig" as "richtik", while others will read it as "richtich"
not sure how big the impact of all that is - but still, if we want to have really high-quality training material, we should have correct phoneme transcriptions of what people actually read, not of what they were supposed to read if they worked as TV news anchors... ;)
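One way to record this in a lexicon is to allow several pronunciation variants per word. The following is only a rough sketch: it writes variants in the `word(2)` style used by CMU Sphinx dictionaries, and the phoneme symbols are SAMPA-like placeholders chosen for illustration, not entries from an existing dictionary.

```python
def format_lexicon(variants):
    """Render {word: [phoneme list, ...]} as dictionary lines.

    The first variant keeps the plain word; later variants get a
    (2), (3), ... suffix.
    """
    lines = []
    for word in sorted(variants):
        for index, phones in enumerate(variants[word], start=1):
            entry = word if index == 1 else f"{word}({index})"
            lines.append(f"{entry} {' '.join(phones)}")
    return lines


# Illustrative SAMPA-like symbols (an assumption, not real dictionary data):
# "richtich" first, then the southern "richtik".
lexicon = {
    "richtig": [
        ["R", "I", "C", "t", "I", "C"],
        ["R", "I", "C", "t", "I", "k"],
    ],
}

for line in format_lexicon(lexicon):
    print(line)
```

With both variants in the dictionary, an aligner could in principle pick whichever one matches the recording, instead of forcing the "news anchor" pronunciation.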
Hello all,
sorry for my late response:
Since we do not have much infrastructure right now, maybe agreeing on a simple common format for those efforts would help already? e.g. we could decide we want to use simple CSV, something along the lines of
submission,prompt,rating,comment
and simply post those here?
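A minimal sketch of what reading and writing such a file could look like, assuming exactly the four columns named above. The allowed values for `rating` and the way a submission and prompt are identified are not defined in this thread, so the example row is hypothetical.

```python
import csv
import io

# Column names as proposed in the post; everything else is an assumption.
FIELDS = ["submission", "prompt", "rating", "comment"]


def write_ratings(rows):
    """Serialise a list of dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_ratings(text):
    """Parse CSV text back into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical row; the "rating" value is a made-up label.
example = [{
    "submission": "anonymous-20100302-huu",
    "prompt": "de8-089",
    "rating": "bad",
    "comment": "said 7 KOMMA 5, prompt has 75",
}]
print(write_ratings(example))
```

Using the `csv` module rather than joining strings by hand means that comments containing commas are quoted properly.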
I think a central "place" like Google Docs/Drive would be the best solution. This kind of shared storage should avoid the CSV synchronisation that would otherwise be necessary.
I'm working on a webcrawler right now but it has a long way to go.
As a Java/web developer I have already written a few smallish web crawlers. Maybe I can help you?
I also have a "bored" virtual server instance with 3 cores, 2 GB RAM and genuinely unlimited traffic, which is waiting for a challenging task. Maybe it could serve as an additional piece of infrastructure where the crawler can run 24x7? I could also set up an SVN repository for data synchronisation (CSV files, ... ?)
I am also experienced in digital audio processing, so slicing audio files should not be a problem.
Well, I would really like to help. Where can I start?
(1) Submission rating? If someone explains to me how a "good" submission can be recognised, that might be a task for me :)
(2) Audio slicing? All I need is a short introduction. BTW, last week I also found a possible media source... maybe worth a look.
(3) ...
I don't know, but I think we should agree on an efficient way to coordinate (email, instant messaging, ...)?
Regards,
Andreas
(1) Submission rating? If someone explains to me how a "good" submission can be recognised, that might be a task for me :)
That's a really good question, and the answer depends on what you do. But I can tell you what "bad" submissions are.
www.messe2media.com/files/Aligning_Fehler.odt
Take a look at this file, which lists the files I checked.
We mostly look for mismatches between transcription and audio. For example:
anonymous-20100302-huu/mfc/de8-089
PROMPT: DIE PROVINZ HAT 75 MILLIONEN EINWOHNER
SAID: DIE PROVINZ HAT 7 KOMMA 5 MILLIONEN EINWOHNER
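Once a PROMPT/SAID pair like this has been written down, the difference can be located automatically. A small sketch, assuming the "said" text comes from a human reviewer (or some recogniser output); `word_diff` is a hypothetical helper name.

```python
import difflib


def word_diff(prompt, said):
    """Return (operation, prompt_words, said_words) for each differing span."""
    a, b = prompt.split(), said.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


prompt = "DIE PROVINZ HAT 75 MILLIONEN EINWOHNER"
said = "DIE PROVINZ HAT 7 KOMMA 5 MILLIONEN EINWOHNER"
print(word_diff(prompt, said))
# [('replace', ['75'], ['7', 'KOMMA', '5'])]
```

An empty result means the two texts agree word for word.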
In another instance the speaker stopped the recording too soon and the end of the last word was missing.
A heavy accent or a lot of static isn't necessarily a bad thing if you want to make your model more robust, but such files should maybe be marked as noisy or accent.
But we need some sort of checklist before we start checking again. I just checked files that were rejected by forced alignment to find out what happened, and ended up checking a lot of files twice.
P.S.: In case you wonder about the files in my list without a PROMPT-SAID pair or any other comment: those files were rejected without any apparent reason.
I like the idea of having a checklist for common errors instead of a numeric rating a lot!
Maybe we can collect properties we would like to check for?
- noisy (would also include background noise/music)
- accent
- transcription mismatch, if possible also note correct transcription
- truncated
- audio distorted (e.g. clipping, much too quiet, etc)
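The checklist above could be represented as a fixed set of tags per recording. This is only a sketch: the short tag names and the `Review` class are assumptions made for illustration, not an agreed format.

```python
from dataclasses import dataclass, field

# Short names for the checklist items above (hypothetical spelling).
ALLOWED_TAGS = {"noisy", "accent", "mismatch", "truncated", "distorted"}


@dataclass
class Review:
    submission: str
    tags: set = field(default_factory=set)
    # Only meaningful together with the "mismatch" tag.
    correct_transcription: str = ""

    def __post_init__(self):
        self.tags = set(self.tags)
        unknown = self.tags - ALLOWED_TAGS
        if unknown:
            raise ValueError(f"unknown tags: {sorted(unknown)}")

    @property
    def is_clean(self):
        """A recording counts as clean when no checklist item applies."""
        return not self.tags


review = Review(
    "anonymous-20100302-huu/mfc/de8-089",
    {"mismatch"},
    "DIE PROVINZ HAT 7 KOMMA 5 MILLIONEN EINWOHNER",
)
print(review.is_clean)
```

Rejecting unknown tags early keeps typos such as "noisey" from silently creating a sixth category.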
About aligning errors: at least for HTK my experience so far is that the order in which you process training data matters a lot. For now, I am using the cleanest files first, then slowly work my way towards noisier ones - but I am no expert in this area, maybe people more knowledgeable in the field of speech recognition can comment here?
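The "cleanest first" ordering could be sketched as follows. Counting tags is a crude stand-in for cleanliness and purely an assumption; whether this ordering really helps training is exactly the open question above. The submission names are made up.

```python
def training_order(reviews):
    """Sort (submission, tags) pairs so that untagged recordings come first.

    Ties are broken by submission name to keep the order reproducible.
    """
    ranked = sorted(reviews, key=lambda item: (len(item[1]), item[0]))
    return [name for name, _tags in ranked]


# Hypothetical submissions with checklist tags.
reviews = [
    ("sub-c", {"noisy", "accent"}),
    ("sub-a", set()),
    ("sub-b", {"noisy"}),
]
print(training_order(reviews))
# ['sub-a', 'sub-b', 'sub-c']
```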
As a Java/web developer I have already written a few smallish web crawlers. Maybe I can help you?
You sure can. Is there any way I can contact you directly? My script so far is Perl-based. It hooks into Google for a given search term and downloads the links it gets there.
Having a small construct that I only need to modify would certainly help. Any way I can contact you through email?
If yes write me at [email protected].
Binh
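The link-collecting step described in the post above might look roughly like this in Python. The Perl script itself is not shown in this thread, so this is not a port of it; the HTML snippet and URLs are made up, and fetching the page and downloading the results are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Made-up page content for illustration.
page = '<a href="/a.html">first</a> <a href="http://example.org/b">second</a>'
print(extract_links(page, "http://example.com/search"))
```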
Addressing this pronunciation issue: There is a standard book for the training of actors, radio presenters etc. called "Der kleine Hey, Die Kunst des Sprechens". Anyone interested in training their ear to distinguish pronunciations and dialects should read it.
>I have uploaded new files to the FTP server, could you move them to the german model, once again?
done,
for any new submissions, please update the date in your license file
thanks,
Ken
ah, good point, thanks, will do :)
Two more questions:
- I have created a new, updated German audio model for CMU Sphinx (http://goofy.zamia.org/voxforge/de/) - this is based on our submission rating/tagging effort. Currently I am still polishing the model, but once it is ready for use I will open a new thread about it in the German forum. Would you be willing to host the new model on the official VoxForge servers somewhere?
- what is the current policy on using LibriVox audiobooks to generate submissions - if I create submissions based on LibriVox audiobooks, should I name and upload them using my personal account, or should I create/use a different account?
> I have created a new, updated German audio model for CMU Sphinx (http://goofy.zamia.org/voxforge/de/) - this is based on our submission rating/tagging effort. Currently I am still polishing the model, but once it is ready for use I will open a new thread about it in the German forum. Would you be willing to host the new model on the official VoxForge servers somewhere?