Speech Submission Feedback

User: kmaclean
Date: 7/22/2007 9:41 pm

Views: 7312
Rating: 19

Here is a thread containing another user's (very valid) opinion on the state of the VoxForge submission system:

On 7/18/07, mahasamoot wrote:

Good god. Do you think you could make it any more of a hassle to contribute to your project? No wonder you don't have enough data yet.

On 7/19/07, Ken MacLean wrote:

thanks for the feedback!

On 7/19/07, mahasamoot wrote:

My feedback wasn't very helpful. So here's some more constructive criticism:

Your steps are listed in the wrong order. They should be:

Get an account on voxforge.org
Download and install Audacity and 7Zip.
Set up Audacity with the desired audio settings
Make a top-level directory
Create your README and LICENSE in the top-level dir.
Create a directory for today's data --> USERNAME-[YYYY][MM][DD]
Pick out the prompts
record
copy README and LICENSE into today's directory
make the prompts file
zip
upload

You'll notice that there are way too many steps, and I didn't even list all of them explicitly.
Roll it all into a Python program, and get things such as the sound card make and type from the OS. Remember, the average Windows user is dumbfounded when you hit Ctrl-Z. They're not going to figure out how to edit a text file, especially given that the line breaks are Unix-style, so the file comes up as one line in Notepad. Even if they know enough to turn on word wrap, it's still an ugly, confusing blob.
Audacity seems like a great program, but it wasn't meant to be used on 1/4 of the screen, and I found it awkward in this context. If it can be scripted to show a prompt and record one line at a time, it would be much easier.
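
A minimal sketch of what a scripted version of the packaging steps might look like, assuming Python 3 and only the standard library. The function name `package_submission`, the prompts-file layout (file stem followed by the prompt text), and the use of `zipfile` in place of 7Zip are illustrative assumptions, not part of the VoxForge instructions. Recording and upload are left out.

```python
import shutil
import zipfile
from datetime import date
from pathlib import Path


def package_submission(top_dir, username, prompts, day=None):
    """Build USERNAME-YYYYMMDD under top_dir and zip it.

    prompts maps a recording's file stem to the text that was read
    (an assumed layout for the prompts file).
    """
    top = Path(top_dir)
    day = day or date.today()

    # Directory for today's data, named USERNAME-[YYYY][MM][DD].
    session = top / f"{username}-{day:%Y%m%d}"
    session.mkdir(parents=True, exist_ok=True)

    # Copy the shared README and LICENSE from the top-level directory.
    for name in ("README", "LICENSE"):
        shutil.copy(top / name, session / name)

    # Write the prompts file, one recording per line.
    lines = [f"{stem} {text}" for stem, text in prompts.items()]
    (session / "prompts").write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Zip the session directory, keeping paths relative to top_dir.
    archive = top / (session.name + ".zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(session.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(top).as_posix())
    return archive
```

A call such as `package_submission("voxforge", "myname", {"rec001": "some prompt text"})` would expect `README` and `LICENSE` to exist in `voxforge/` already, and any recorded audio files placed in the session directory beforehand would be picked up by the zip step.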

Anyway, I've gone through all of this and I've got data to upload. But your account setup isn't working, so the data sits on my hard drive, no use to anyone. Can you activate my account?

--- (Edited on 7/22/2007 10:41 pm [GMT-0400] by kmaclean) ---

Re: Speech Submission Feedback

User: kmaclean
Date: 7/22/2007 9:42 pm

Views: 290
Rating: 16

Hi Thomas,

My apologies for the obvious UI problems.

I am currently working on adapting MoodleSpeex (Java Applet based audio recording) to work as the new speech submission front-end for VoxForge. Basically you will be able to record directly from your browser, and click an upload button to submit everything to the site. So I will be revamping the entire submission process in the next month or two.

Your account should be activated now, but if it is not, please let me know.

Once again, thanks for the feedback!

All the best,

Ken

--- (Edited on 7/22/2007 10:42 pm [GMT-0400] by kmaclean) ---

Re: Speech Submission Feedback

User: kmaclean
Date: 7/22/2007 9:42 pm

Views: 320
Rating: 10

Hi Thomas,

I've also got an FTP upload site set up for the folks at LibriVox (they sometimes donate their audio books in wav format to us before they convert them to mp3) which you might want to use.

Go to the FTP Submissions link, and follow the instructions (you need to be signed in).

Ken

--- (Edited on 7/22/2007 10:42 pm [GMT-0400] by kmaclean) ---

Re: Speech Submission Feedback

User: kmaclean
Date: 7/22/2007 9:43 pm

Views: 281
Rating: 18

Hi Ken,

How much transcribed speech is needed? How much English do you have thus far?

I live in Thailand. Are you interested in creating a Thai database? I live next to Burapha University. I could ask some of my friends there about finding some text that would be appropriate, and translating the necessary parts of the website. If I could find a professor who's interested, perhaps she'd even scare up some students to squawk into the box.

--- (Edited on 7/22/2007 10:43 pm [GMT-0400] by kmaclean) ---

Re: Speech Submission Feedback

User: kmaclean
Date: 7/22/2007 9:44 pm

Views: 314
Rating: 24

Hi Thomas,

We need hundreds of hours of English speech ... this is a *long-term* project. The Sphinx group of recognizers use about 140 hours of speech (and this is a very good quality speech corpus) and they don't approach commercial quality speech recognition.

Please see the metrics page for the details on the current corpus size. We have about 22 hours of English speech so far. But this is misleading, because I'm basically counting all submissions, whether it would make sense to include them or not (i.e. some submissions have sound quality issues, or the speech is not the right dialect for the first release of the corpus).

The main focus of VoxForge has been English, but some people have convinced me to open it up to Russian and Dutch. Note that each of these has had access to a small speech database to 'seed' a corpus in its language.

If you can get a few hours of transcribed audio, I can include it on the site.

Please note that there would be lots of work left to actually create acoustic models and workable Thai speech recognition (this post provides some details). Basically you need all the phonemes for the Thai language and a pronunciation dictionary for all the words you use; this will let you create a monophone acoustic model (Steps 1 through 8 in the VoxForge Dev Tutorial guide you through this process ... for English). More work would need to be done to create a tied-state triphone acoustic model.

A good first step might be to look at (or contact) http://thaispeech.longdo.org/. They use the HTK toolkit to create acoustic models in Thai (Julius uses HTK acoustic models). I was never able to translate the site. They would probably already have a pronunciation dictionary created, and likely some transcribed speech - this might be a good 'seed' for a Thai section on VoxForge.

thanks for your interest!

Ken

--- (Edited on 7/22/2007 10:44 pm [GMT-0400] by kmaclean) ---

Re: Speech Submission Feedback

User: kmaclean
Date: 7/22/2007 9:44 pm

Views: 277
Rating: 10

Hi Ken,

We need hundreds of hours of English speech ... this is a *long-term* project. The Sphinx group of recognizers use about 140 hours of speech ... [but] they don't approach commercial quality speech recognition.

So is the main difference between Julius/Sphinx & ViaVoice/DNS the size or quality of the corpus? Do you think one of these systems has better underlying technology? Or some combination of the above?

... We have about 22 hours of English speech so far. But this is misleading, because I'm basically counting all submissions, whether it would make sense to include them or not (i.e. ... or the speech is not the right dialect for the first release of the corpus).

What dialect(s) are you most interested in? I assume General American? Do you want only one dialect to start with? What's the right mix between speakers and sample size for, say, 200 hrs: 2 pl x 100 hrs, 4 pl x 50 hrs, 8 pl x 25 hrs, 16 pl x 12.5 hrs, 32 pl x 6.25 hrs, 64 pl x 3.125 hrs, 128 pl x 1.5 hrs, 256 pl x 40 min, 512 pl x 20 min, 1024 pl x 10 min, 2048 pl x 5 min, 4096 pl x 2.5 min, or 8192 pl x 1.25 min?
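
The speaker/duration trade-off in the question above is a simple division, sketched here in Python under the assumption of a fixed 200-hour target split evenly among speakers. The function name is illustrative. The exact values differ a little from the rounded figures in the question (256 speakers works out to roughly 47 minutes each rather than 40).

```python
TOTAL_HOURS = 200  # corpus size used in the question


def minutes_per_speaker(total_hours, speakers):
    """Recording time each speaker contributes, in minutes, for an even split."""
    return total_hours * 60 / speakers


# Doubling the number of speakers from 2 up to 8192 halves each share.
for speakers in (2 ** k for k in range(1, 14)):
    minutes = minutes_per_speaker(TOTAL_HOURS, speakers)
    if minutes >= 60:
        print(f"{speakers:>5} speakers x {minutes / 60:.2f} hrs")
    else:
        print(f"{speakers:>5} speakers x {minutes:.2f} min")
```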

If you can get a few hours of transcribed audio, I can include it on the site.

I'll see if there's interest.

A good first step might be to look at (or contact) http://thaispeech.longdo.org/.

Thanks for the link. It looks like their project is stalled, though.

--- (Edited on 7/22/2007 10:44 pm [GMT-0400] by kmaclean) ---

Re: Speech Submission Feedback

User: kmaclean
Date: 7/22/2007 9:45 pm

Views: 2746
Rating: 16

Hi Thomas

So is the main difference between Julius/Sphinx & ViaVoice/DNS the size or quality of the corpus? Do you think one of these systems has better underlying technology? Or some combination of the above?

See Arthur Chan's article on "Why there is no Open Source Dictation" in this post. He basically says that the acoustic model is more important than the speech decoder (i.e. the actual speech recognition engine). And see this post on the basic differences between Sphinx and Julius. Julius is more targeted towards dictation, and Sphinx to telephony IVR and command and control applications. You need much more audio data for a dictation application.

What dialect(s) are you most interested in? I assume General American? Do you want only one dialect to start with? What's the right mix between speakers and sample size for, say, 200 hrs: 2 pl x 100 hrs, 4 pl x 50 hrs, 8 pl x 25 hrs, 16 pl x 12.5 hrs, 32 pl x 6.25 hrs, 64 pl x 3.125 hrs, 128 pl x 1.5 hrs, 256 pl x 40 min, 512 pl x 20 min, 1024 pl x 10 min, 2048 pl x 5 min, 4096 pl x 2.5 min, or 8192 pl x 1.25 min?

Right now we are collecting any English audio we can get. The first release will be General American, because that is where we are getting most of our speech. We are using Subversion to store the corpus, so once we have enough audio, we can generate sub-corpora relatively easily (which is why we are trying to collect information such as gender, age range, dialect, ...).

I don't know what the right mix is. We need a good coverage of triphones, which requires a wide variety of text, and we need samples from as many people as possible. The best, in my view (not substantiated in any way), would be 200 people reading 1 hour of completely different text. In our case, we take what we can get ... LibriVox audio books can provide us with good triphone variation, and an updated speech submission system will (hopefully) provide us with good variance in speakers and recording environments.
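
To make "coverage of triphones" concrete, here is a rough sketch of how one might count the distinct triphones in a set of phoneme-transcribed prompts. It assumes each utterance is already available as a list of phoneme symbols (which in practice would come from a pronunciation dictionary), writes each triphone in the HTK-style `left-centre+right` form, and ignores the phones at the utterance edges. The function names and the sample phoneme symbols are illustrative only.

```python
from collections import Counter


def triphones(phones):
    """Return 'left-centre+right' labels for every interior phone."""
    return [
        f"{left}-{centre}+{right}"
        for left, centre, right in zip(phones, phones[1:], phones[2:])
    ]


def triphone_counts(utterances):
    """Count how often each triphone occurs across all utterances."""
    counts = Counter()
    for phones in utterances:
        counts.update(triphones(phones))
    return counts


# Illustrative phoneme sequences, not taken from any real dictionary.
sample = [
    ["k", "ae", "t"],
    ["b", "ae", "t"],
    ["k", "ae", "t", "s"],
]
counts = triphone_counts(sample)
print(len(counts), "distinct triphones")
```

A wider variety of text adds new keys to the counter, while more speakers reading the same text only raises the counts of keys that are already there, which is one way to see why both matter.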

Thanks for the link. It looks like their project is stalled, though.

Can you tell if they have a pronunciation dictionary?

I'd like to post this thread on the site, please let me know if that would be OK,

thanks,

Ken

--- (Edited on 7/22/2007 10:45 pm [GMT-0400] by kmaclean) ---
