VoxForge
I have a group of seminar videos, and I am looking at setting up a website like metavid ( http://metavid.org/ ), which is based on the Wikipedia software.
Is there a way to improve the speech recognition by using the improved transcripts?
--- (Edited on 2/25/2010 6:21 am [GMT-0600] by Visitor) ---
There are ways, but they depend on the decoder you are using.
--- (Edited on 2/25/2010 22:02 [GMT+0300] by nsh) ---
I have read http://www.voxforge.org/home/dev/autoaudioseg
What decoder do you recommend?
--- (Edited on 2/25/2010 11:54 pm [GMT-0600] by tom_a_sparks) ---
> What decoder do you recommend?
Regardless of which decoder you decide to use, you will need a specialized language model. From this post (PyCon transcription):
To build a specialized model you can take transcription of the previous conferences, mailing list archives, related documentation, technical papers [...] and so on. This language model will be more suitable for decoding reports.
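As a rough sketch, pulling those sources together and building such a language model might look something like this (Python; the file paths are placeholders, and the tool names are assumed from the CMU-Cambridge LM toolkit (cmuclmtk) - check them against whatever LM toolkit you actually use):

import glob
import re
import subprocess

# Gather the domain text: seminar transcripts, mailing-list archives,
# related papers, and so on (these paths are only placeholders).
sources = glob.glob("transcripts/*.txt") + glob.glob("mail-archive/*.txt")

with open("corpus.txt", "w") as corpus:
    for path in sources:
        with open(path) as f:
            for line in f:
                # Crude normalisation: strip punctuation, uppercase,
                # one utterance per line with <s> ... </s> markers.
                text = re.sub(r"[^A-Za-z' ]+", " ", line).upper().strip()
                if text:
                    corpus.write("<s> %s </s>\n" % text)

# Build an ARPA-format language model.  Any toolkit that writes ARPA files
# will do; treat these exact commands as an assumption to verify locally.
subprocess.check_call("text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab",
                      shell=True)
subprocess.check_call("text2idngram -vocab corpus.vocab -idngram corpus.idngram"
                      " < corpus.txt", shell=True)
subprocess.check_call("idngram2lm -vocab_type 0 -idngram corpus.idngram"
                      " -vocab corpus.vocab -arpa seminar.lm", shell=True)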
--- (Edited on 2/26/2010 1:20 pm [GMT-0500] by kmaclean) ---
http://cmusphinx.sourceforge.net/sphinx4/
--- (Edited on 2/26/2010 21:54 [GMT+0300] by nsh) ---
That sounds like a catch-22.
So can I use a speaker-independent acoustic model and adapt it to include what I need?
--- (Edited on 2/28/2010 5:21 am [GMT-0600] by tom_a_sparks) ---
>So can I use a speaker-independent acoustic model and adapt it to
>include what I need?
Note: you need both a language model and an acoustic model.
You can adapt a speaker independent acoustic model using some of the (manually transcribed) speech from the video you want to automatically transcribe. You may get better recognition rates by converting the audio used to create your speaker independent acoustic model into the same compressed format used in the video, and then training a new acoustic model using this 'compressed' audio (see David Gelbart's post in this thread).
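If you want to try that matched-compression idea, a minimal sketch of the conversion step, assuming ffmpeg is available and that the videos carry MP3 audio at 64 kbit/s (both assumptions - substitute the codec and bitrate the videos actually use):

import glob
import os
import subprocess

os.makedirs("compressed", exist_ok=True)
os.makedirs("recoded", exist_ok=True)

# Round-trip the acoustic model training audio through the same lossy codec
# used in the seminar videos so the model sees matching compression artefacts.
for wav in glob.glob("am-training-audio/*.wav"):
    base = os.path.splitext(os.path.basename(wav))[0]
    compressed = "compressed/%s.mp3" % base
    recoded = "recoded/%s.wav" % base
    subprocess.check_call(["ffmpeg", "-y", "-i", wav,
                           "-codec:a", "libmp3lame", "-b:a", "64k", compressed])
    # Decode back to 16 kHz, 16-bit mono WAV, the usual training format.
    subprocess.check_call(["ffmpeg", "-y", "-i", compressed,
                           "-ar", "16000", "-ac", "1", recoded])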
Then use all the video transcriptions you currently have (and as nsh stated: transcription of the previous conferences, mailing list archives, related documentation, technical papers [...] and so on) to create a language model that is specific to the type of speech being uttered in the video.
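Once you have both pieces, decoding a talk might look roughly like this (the pocketsphinx_continuous flags and all file names here are assumptions - check your decoder's documentation):

import subprocess

# Decode one talk with the adapted acoustic model and the domain language
# model.  All file and directory names are placeholders.
with open("talk01.hyp", "w") as hyp:
    subprocess.check_call([
        "pocketsphinx_continuous",
        "-hmm", "adapted-hmm",        # adapted speaker-independent acoustic model
        "-lm", "seminar.lm",          # domain-specific language model
        "-dict", "seminar.dic",       # pronunciation dictionary covering the LM vocabulary
        "-infile", "talk01-16k.wav",  # 16 kHz mono audio extracted from the video
    ], stdout=hyp)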
--- (Edited on 3/9/2010 12:30 am [GMT-0500] by kmaclean) ---
I was looking at something like this [1], but using the speech recognizer to do the transcript, adding the misrecognized words to the speech recognition database, and repeating until all the words are recognized.
[1]
you might try to transcribe one hour's worth of audio, create an acoustic model from this. Then create a language model, and then try recognizing another hour of audio. Next use the transcriptions generated from the recognition results to re-train an acoustic model with the additional transcribed audio (i.e. now you are training an acoustic model with 2 hours of audio). Your acoustic models will get better as they are trained with more audio. Keep iterating this process until all the videos are completed. - http://www.voxforge.org/home/forums/message-boards/speech-recognition-engines/looking-for-an-engine-to-extract-voice-from-video
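As a sketch, the loop described above might be structured like this; train_acoustic_model and recognize are hypothetical placeholders for the SphinxTrain/decoder runs, not real library calls, and the confidence filter is an assumption added so misrecognitions don't get trained on:

def train_acoustic_model(transcribed_audio):
    """Hypothetical placeholder: wrap your SphinxTrain run here."""
    raise NotImplementedError

def recognize(model, audio_file):
    """Hypothetical placeholder: decode audio_file with the current models,
    returning (utterance_text, confidence) pairs."""
    raise NotImplementedError

def bootstrap(seed_transcripts, remaining_hours, min_confidence=0.9):
    """Grow the training set roughly one hour of audio per iteration."""
    training_set = list(seed_transcripts)           # the manually transcribed hour
    model = train_acoustic_model(training_set)      # initial acoustic model
    for audio in remaining_hours:
        hypotheses = recognize(model, audio)
        # Only feed back utterances the decoder is reasonably sure about,
        # so misrecognitions do not poison the next training round.
        training_set += [(audio, text) for text, conf in hypotheses
                         if conf >= min_confidence]
        model = train_acoustic_model(training_set)  # retrain with more data
    return model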
--- (Edited on 3/9/2010 12:24 am [GMT-0600] by tom_a_sparks) ---