VoxForge
The reason for my recent interest in speech recognition is to enhance the content produced as a side effect of PyCon (the Python conference). I am the AV guy for http://us.pycon.org/2008 and '09 (and probably forever). I have the original .dv (uncompressed) recordings of all the sessions from '08 – you can see some of them here: http://youtube.com/pycon08
I also record presentations at the local Python user group, so I have that .dv footage too. And for PyCon '09 we may encourage talk proposals to include a video.
There is value in having these presentations transcribed. So far no one seems interested in doing it manually. But if speech recognition could at least give us a head start, there would probably be interest in editing the resulting text, producing both transcribed talks and speech/text that could be submitted to VoxForge.
Any suggestions on what I should do next?
--- (Edited on 6/18/2008 9:06 am [GMT-0500] by Visitor) ---
--- (Edited on 6/18/2008 9:08 am [GMT-0500] by CarlFK) ---
Hi CarlFK,
>But, if some speech recognition could at least give a head start, then there
>would probably be interest in editing the resulting text, producing both
>transcribed talks and speech/text that could be submitted to Voxforge.
We've received a few requests for something similar in the past. Unfortunately, we are not yet at the point where Free or Open Source speech recognition can transcribe presentations with less-than-optimal audio and background noise (echoes, sounds from the crowd, ...). FOSS speech recognition is not even at the point where it can work in a dictation application for a single person using a headset microphone in a quiet room.
The main stumbling block is that we need better acoustic models (AMs), and to get that, we need much more (good quality) transcribed speech audio (1000+ hours).
VoxForge is currently focusing on getting a decent command-and-control acoustic model, which requires much less speech audio because it only needs to recognize a smaller set of words or phrases.
At some point, when we have reasonably good dictation AMs, then we might have something that could help you with presentation transcriptions, and then maybe use this same audio to help improve the VoxForge acoustic models even further. But as I said, we are not there yet.
thanks for your interest in VoxForge,
Ken
--- (Edited on 6/18/2008 12:03 pm [GMT-0400] by kmaclean) ---
Well, I thought we already discussed this on #cmusphinx: you just need a better language model. WSJ will be quite acceptable for you. The rest is about correcting the output.
Of course, already-transcribed video would be invaluable to us, both as video and as plain audio.
--- (Edited on 6/18/2008 3:09 pm [GMT-0500] by nsh) ---
>you just need a better language model
If that's all you need, then you might just use one of the language models Keith Vertanen trained using the Gigaword corpus:
or get a copy of the Google Web 1T 5-gram Corpus and create your own language model (it used to go for $50, but is now $150 on the LDC site). This corpus was generated "from approximately 1 trillion word tokens of text from publicly accessible Web pages".
or, using a similar approach to Google's, use Google's search engine to create your text corpus and create your own language model using one of these approaches:
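As a rough sketch of what "creating your own language model" from a text corpus involves, here is a minimal bigram-counting example in plain Python. This is only an illustration: real toolkits (SRILM, the CMU-Cambridge SLM toolkit) handle proper smoothing and ARPA-format output, and the toy corpus below is invented.

```python
from collections import Counter

def train_bigram_lm(corpus, k=1.0):
    """Count unigrams/bigrams in a text and return an add-k-smoothed
    conditional probability function P(w2 | w1)."""
    tokens = corpus.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)

    def prob(w1, w2):
        # Add-k smoothing so unseen bigrams still get non-zero probability.
        return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab)

    return prob

# Toy "domain" corpus; a real model would be trained on millions of words.
corpus = "the talk covers python generators the talk covers decorators"
p = train_bigram_lm(corpus)
assert p("talk", "covers") > p("talk", "python")
```

The decoder then uses these probabilities to prefer word sequences that actually occur in the training text, which is exactly why a model trained on domain text outperforms a generic one.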
Ken
--- (Edited on 6/19/2008 5:20 pm [GMT-0400] by kmaclean) ---
Sorry, by "better language model" I meant a specialized language model built on texts from your domain. A generic language model trained on Gigaword is available here:
http://www.inference.phy.cam.ac.uk/kv227/lm_giga/
but it will not perform well, simply because it's too generic. Unfortunately, current free decoders require a specialized model to give acceptable performance.
--- (Edited on 6/20/2008 2:29 am [GMT-0500] by nsh) ---
Hi nsh,
>Sorry, by better language model I meant the specialized language model built
>on texts of your domain.
I am a bit confused here... how does a specialized language model differ from a grammar file (i.e. a words.mlf file) used in a forced alignment context?
For example, in the AudioBook.pm script that I am currently working on, I can segment audiobooks from LibriVox using forced alignment, which takes the actual text of the audiobook and finds time alignments for the words contained in the text (to find pauses where the script can segment the audio). Problems occur when the text does not closely (98-99%) match the contents of the audio; then the text and audio must be reviewed manually to figure out where the problem might be.
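The pause-finding step described above can be sketched roughly like this, assuming word-level (word, start, end) timings have already been parsed from the aligner's output. The timing values below are invented for illustration; a real script would read them from HTK's aligned label output.

```python
def segment_at_pauses(word_times, min_pause=0.5):
    """Given (word, start, end) tuples from forced alignment,
    split into segments wherever the silence gap between
    consecutive words exceeds min_pause seconds."""
    segments, current = [], []
    for i, (word, start, end) in enumerate(word_times):
        if current and start - word_times[i - 1][2] > min_pause:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments

# Invented alignment: a 0.8 s pause separates two phrases.
times = [("hello", 0.0, 0.4), ("world", 0.45, 0.9),
         ("next", 1.7, 2.0), ("phrase", 2.05, 2.5)]
assert segment_at_pauses(times) == [["hello", "world"], ["next", "phrase"]]
```

Tuning min_pause trades off segment length against the risk of cutting mid-phrase.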
Would a 'specialized language model' in this context essentially be a language model containing only the probabilities of sequences of words in the audiobook text? And might this approach be used to highlight where there might be a problem in the transcription?
And to generalize this to CarlFK's original question, how can you create a specialized language model without actually transcribing the text of the presentation beforehand?
thanks,
Ken
--- (Edited on 6/20/2008 11:55 am [GMT-0400] by kmaclean) ---
> And to generalize this to CarlFK's original question, how can you create a specialized language model without actually transcribing the text of the presentation beforehand?
To build a specialized model, you can take transcriptions of previous conferences, mailing list archives, related documentation, technical papers on Python, and so on. This language model will be more suitable for decoding the talks.
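One common way to combine such domain texts with a generic model is linear interpolation of the two models' probabilities. A minimal sketch, where both component models below are invented stand-ins rather than real trained LMs:

```python
def interpolate(p_domain, p_generic, lam=0.7):
    """Linearly interpolate a domain LM with a generic LM.
    lam weights the domain model and would normally be tuned
    on held-out domain text."""
    def prob(w1, w2):
        return lam * p_domain(w1, w2) + (1 - lam) * p_generic(w1, w2)
    return prob

# Stand-in models: the "domain" LM strongly favors "python decorators";
# the "generic" LM assigns everything a small uniform probability.
p_dom = lambda w1, w2: 0.2 if (w1, w2) == ("python", "decorators") else 0.001
p_gen = lambda w1, w2: 0.0001
p = interpolate(p_dom, p_gen)
assert p("python", "decorators") > p("python", "zebra")
```

The generic component keeps out-of-domain words recognizable while the domain component sharpens predictions for conference vocabulary.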
> Would a 'specialized language model' in this context essentially be a language model containing only the probabilities of sequences of words in the audiobook text?
In the perfect case, yes, but of course there will be some differences.
> And might this approach be used to highlight where there might be a problem in the transcription?
I don't think so.
>For example, in the AudioBook.pm script that I am currently working on, I can segment audio books from Librivox using forced alignment
Ken, have you seen this article:
http://www.cs.cmu.edu/~awb/papers/is2007/IS071088.PDF
It suggests that such a task requires modifications to the forced alignment procedure itself. I think that's reasonable.
--- (Edited on 6/20/2008 3:15 pm [GMT-0500] by nsh) ---
Hi nsh,
>Ken, have you seen this article: [...]
>There are some ideas that such task requires modification in force alignment
>procedure itself. I think it's reasonable.
I didn't think I had... until I found a link to this article on the VoxForgeDev wiki, so I guess I just don't remember reading it... :)
I do remember a thread where you were talking to some FestVox people about the FestVox Interslice tool, the post said:
The basic idea of interslice is to automatically build synthetic voices from large speech databases typically available from public domain such as librivox.org and loudlit.org. I'm hoping that it gets released soon...
Interslice comes with a segmentation tool capable of handling infinitely large corpora and chunking them into utterances and *.lab files.
From the paper that you mentioned (Automatic Building of Synthetic Voices from Large Multi-Paragraph Speech Databases), this part really interests me:
... we believe that given proper tuning of these values it might produce a real-time segmentation of the large files.
My approach, using HTK's forced alignment, takes about 9 minutes to force-align a 50-minute text (assuming no major errors in the transcription). Real-time segmentation would be awesome!
Other interesting information in the paper you mentioned is:
[...] it is known that the prosody and acoustics of a word spoken in a sentence significantly differs from that of spoken in isolation. The work done in [citations omitted] suggests that a similar analogy of prosodic and acoustic difference exists for sentences spoken in isolation versus sentences spoken in paragraphs and similarly for paragraphs too. Some of the characteristics associated with a large multi-paragraph speech are pronunciation variations with word mentions, word prominence, speaking style, change of voice quality and emotion with the semantics of the text and with the roles in a story.
Even though this is targeted towards the text-to-speech domain, should we be looking at collecting longer passages from our users too (i.e. use paragraph prompts in the Speech Submission app, as opposed to sentence prompts)?
thanks,
Ken
--- (Edited on 6/22/2008 7:57 pm [GMT-0400] by kmaclean) ---
>... we believe that given proper tuning of these values it might produce a real-time segmentation of the large files.
>My approach, using HTK's forced alignment, takes about 9 minutes to force
>align a 50 minute text (assuming no major errors in the transcription).
>Real-time segmentation would be awesome!
It has been pointed out to me that this last statement makes *no* sense... because:
"real-time" processing of a 50 minute speech file would mean only that the processing takes no longer than 50 minutes.
my mistake... :)
I guess what I was (incorrectly) thinking was that "real-time" processing would still take 9 minutes for the modified forced alignment process, but it would spit out segments incrementally using the process described in the article (rather than just generating time stamps that need to be further processed by a script to detect pauses).
The advantage would be that there would be no need for the pause-detection script (which should speed things up), and it might allow for some parallel processing of the segments (like verifying that an audio segment actually matches its prompt text before the entire segmentation process has completed).
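For what it's worth, the convention being corrected here is usually expressed as a real-time factor (RTF): processing time divided by audio duration. A trivial sketch using the 9-minute / 50-minute numbers quoted above:

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Real-time factor: processing time divided by audio duration.
    RTF < 1 means faster than real time; RTF == 1 is the
    break-even point described above."""
    return processing_minutes / audio_minutes

# 9 minutes of forced alignment for a 50-minute recording.
rtf = real_time_factor(9, 50)
assert rtf < 1.0  # already well under real time
```

So the existing HTK pipeline is already about 0.18x real time; the gain from the modified procedure would be incremental output, not raw speed.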
I guess we'll have to wait and see...
Ken
--- (Edited on 6/30/2008 1:39 pm [GMT-0400] by kmaclean) ---