VoxForge
Hi.
I have more than 1,000 video files with speech, and I'd like to extract the voice information from them.
Which engine is the most suitable for this?
Thank you very much.
--- (Edited on 6/13/2007 8:43 am [GMT-0500] by Visitor) ---
Hi guarriman,
The issue is not so much which open source speech recognition engine would be best for such a task, but that the task you are describing is essentially a speaker-independent dictation task, although it does not have to run in real time. Current Free and Open Source speech recognition systems are *not* there yet. See this post for more information: comparison of the different recognition device and Arthur Chan's article on this post: Speech Recognition Engine comparison.
If the videos are from a single (or small number of) speaker(s), you might try to transcribe one hour's worth of audio, create an acoustic model from this. Then create a language model, and then try recognizing another hour of audio. Next use the transcriptions generated from the recognition results to re-train an acoustic model with the additional transcribed audio (i.e. now you are training an acoustic model with 2 hours of audio). Your acoustic models will get better as they are trained with more audio. Keep iterating this process until all the videos are completed.
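The iteration described above can be sketched roughly as follows. This is only an illustration of the loop, not a working recognizer: `bootstrap_transcribe`, `toy_train` and `toy_recognize` are hypothetical names, the language model is assumed to be built once and used inside the recognition step, and a real setup would replace the toy functions with the training and decoding tools of whichever engine is chosen.

```python
from typing import Callable, Dict, List

# All names here are hypothetical placeholders, not part of any real toolkit.
Transcripts = Dict[str, str]  # audio file name -> transcription text
Trainer = Callable[[Transcripts], object]  # builds an acoustic model
Recognizer = Callable[[object, List[str]], Transcripts]  # decodes audio files


def bootstrap_transcribe(
    seed: Transcripts,
    batches: List[List[str]],
    train: Trainer,
    recognize: Recognizer,
) -> Transcripts:
    """Grow the training set one batch at a time using recognition output."""
    corpus = dict(seed)  # start from the hand-transcribed audio
    for batch in batches:  # e.g. one hour of audio per batch
        model = train(corpus)  # retrain on everything transcribed so far
        corpus.update(recognize(model, batch))  # add the new transcriptions
    return corpus


# Toy stand-ins so the sketch runs without any speech engine installed.
def toy_train(corpus: Transcripts) -> int:
    return len(corpus)  # the "model" is just the number of training files


def toy_recognize(model: int, files: List[str]) -> Transcripts:
    return {f: f"hypothesis (model trained on {model} files)" for f in files}


if __name__ == "__main__":
    result = bootstrap_transcribe(
        {"hour1.wav": "manual transcript"},
        [["hour2.wav"], ["hour3.wav"]],
        toy_train,
        toy_recognize,
    )
    for name in sorted(result):
        print(name, "->", result[name])
```

In practice you would probably want to review the recognition output of each batch before adding it to the training set, since errors in the transcriptions would otherwise be trained back into the acoustic model.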
Good luck!
Ken
--- (Edited on 6/14/2007 10:52 am [GMT-0400] by kmaclean) ---
There is a post on Slashdot that discusses 'Closed Captioning of Web Videos' that has some information that might be useful to you.
They talk about a couple of sites that provide collaborative subtitling of videos:
It might be an option to post some videos on one of these sites and get help with transcribing them.
Ken
--- (Edited on 6/17/2007 11:36 pm [GMT-0400] by kmaclean) ---
Hi Ken, thank you very much for your answer.
> If the videos are from a single (or small number of) speaker(s), you might
> try to transcribe one hour's worth of audio, create an acoustic model from
> this.
Unfortunately, the videos show hundreds of people talking in Spanish and English and, most of the time, with background noise :(
A hard job, isn't it?
--- (Edited on 6/19/2007 2:58 am [GMT-0500] by Visitor) ---
--- (Edited on 6/19/2007 2:59 am [GMT-0500] by Visitor) ---
Here is a ranking of speech recognition systems by how difficult they are to build (#1 being the most difficult to build and #7 being the easiest), taken from The Hieroglyphs, a book about Sphinx by "The Grand Janitor" (Arthur Chan), with emphasis on Sphinx3:
1. Recognize everything that could be said by anyone in the world;
2. Recognize everything that could be said by everyone who speaks the same language;
3. Recognize what two persons are saying in a conversation, same language;
4. Recognize what a person wants to write in a letter (i.e. dictation);
5. Build a 3000 word dialog system (e.g. CMU Communicator system is a good example) - doable with Sphinx;
6. Build a 1000-person phone directory system with English names (i.e. a speech directory system) - doable with Sphinx;
7. Build a 100-command command and control system in English - doable with Sphinx.
The first 4 items cannot be done with Sphinx (or any other free or open source speech recognition system) in its current configuration.
Your goal of automatically transcribing videos that show hundreds of people talking in Spanish and English (with background noise) is near the top of that difficulty scale. The only thing that makes it easier is that it does not have to be done in real time, but it is still a very difficult task with current open source or free speech recognition technology.
Ken
--- (Edited on 6/20/2007 2:36 pm [GMT-0400] by kmaclean) ---