Audio and Prompts Discussions

European Parliament
User: TonyR
Date: 11/22/2015 10:12 am
Views: 6673
Rating: 0

Has anyone looked at the European Parliament as a source of transcribed data in very many languages?

I see pages like http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20150519+ITEMS+DOC+XML+V0//EN which have transcripts but broken video, and pages like http://www.europarl.europa.eu/ep-live/en/plenary/video?debate=1432019122890 which have downloadable video, but the audio is all translated into one language. The raw video must exist somewhere, and if so then this would be a fantastic ASR resource.

 

Tony

--- (Edited on 22-November-2015 4:12 pm [GMT+0000] by TonyR) ---

Re: European Parliament
User: colbec
Date: 12/22/2015 12:25 pm
Views: 68
Rating: 0

Well, Tony, you may well be right; there may be a resource there, but how to get to it? The European Parliament seems bound and determined to use Apple QuickTime; maybe someone could explain to them that they will reach a wider audience with an mp4. My Linux won't recognize QT in the format presented (even though VLC might well be able to handle it), and there's no point in trying to download QT for my VirtualBox Windows XP since Apple QT does not support that platform any more.

It raises an interesting issue about quality of recordings. Parliament can sometimes be a rowdy place with extraneous audio launched into the stream, including perhaps the sounds of chairs being broken over members' heads.

I wonder what you feel about controlled recordings of say poetry, available in various languages on Librivox. Poetry has the advantage of being naturally broken into stanzas and lines in both text and audio, although that depends on the phrasing used by the speaker. Personally I think sonnets (maybe Limericks as well) are neat and convenient packages.

The natural silences between lines are convenient break points. While it may be easy to manage as data input, perhaps it would be considered insufficiently realistic speech?
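
To make the "silences as break points" idea concrete, here is a minimal sketch of an energy-based silence detector. It assumes the audio has already been decoded to a sequence of float samples in the range -1.0 to 1.0; the function name `find_silences` and the threshold values are illustrative assumptions, not something taken from an existing tool, and they would need tuning for real recordings.

```python
from typing import List, Sequence, Tuple


def find_silences(samples: Sequence[float], rate: int,
                  threshold: float = 0.02,
                  min_silence_s: float = 0.3,
                  frame_s: float = 0.01) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of quiet stretches.

    A frame counts as quiet when its RMS level is below `threshold`
    (an assumed value). Only runs of quiet frames lasting at least
    `min_silence_s` seconds are reported.
    """
    frame_len = max(1, int(round(rate * frame_s)))
    n_frames = len(samples) // frame_len
    silences: List[Tuple[float, float]] = []
    run_start = None  # index of the first frame in the current quiet run

    def close_run(end_frame: int) -> None:
        # Record the run if it is long enough to be a useful break point.
        duration = (end_frame - run_start) * frame_len / rate
        if duration >= min_silence_s:
            silences.append((run_start * frame_len / rate,
                             end_frame * frame_len / rate))

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            close_run(i)
            run_start = None

    if run_start is not None:
        close_run(n_frames)  # recording ends in silence
    return silences
```

The midpoint of each returned interval is a candidate cut point between lines. Whether the cuts line up with the printed lines still depends on the reader's phrasing, as noted above.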

--- (Edited on 2015-12-22 1:25 pm [GMT-0500] by colbec) ---

Re: European Parliament
User: TonyR
Date: 12/23/2015 1:50 am
Views: 32
Rating: 0

MP4 - see the second link in my first post - you can download MP4 from there.

quality - listen to a few - from those I've listened to it's one person speaking and very orderly.

poetry - it has different intonation from normal speech - that doesn't make me comfortable.

librivox - this is interesting but it's read speech and there's the not insignificant problem of removing the headers and footers in both the audio and the text so that there is something to align.

alignment in general - if the audio matches the text then 30 minutes of audio is fine to train on, as it can be split into smaller chunks.
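
As a rough illustration of the splitting step, the sketch below groups word-level timings into chunks of bounded length. It assumes a forced aligner has already produced `(word, start, end)` tuples in seconds for the long recording; the function name `chunk_alignment` and the 10-second limit are hypothetical choices for the example.

```python
from typing import List, Sequence, Tuple

Word = Tuple[str, float, float]  # (word, start_s, end_s)


def chunk_alignment(words: Sequence[Word],
                    max_chunk_s: float = 10.0) -> List[Tuple[float, float, str]]:
    """Group aligned words into chunks no longer than `max_chunk_s`.

    Returns (start_s, end_s, text) for each chunk. A single word longer
    than the limit still becomes its own chunk.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for word, start, end in words:
        # Start a new chunk if adding this word would exceed the limit.
        if current and end - current[0][1] > max_chunk_s:
            chunks.append(current)
            current = []
        current.append((word, start, end))
    if current:
        chunks.append(current)
    return [(c[0][1], c[-1][2], " ".join(w for w, _, _ in c)) for c in chunks]
```

Each returned tuple gives the time range to cut from the audio and the matching transcript text, which is the pairing a training set needs.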

--- (Edited on 23-December-2015 7:50 am [GMT+0000] by TonyR) ---

Re: European Parliament
User: colbec
Date: 12/23/2015 2:42 am
Views: 311
Rating: 0

You are right, I did find some links to mp4, and then discovered as you did that the audio was in one language, even though VLC thought there might be tracks for many other languages. In my sample the audio was all German.

With regard to poetry, for my purposes of speaker dependent models where the variety comes in devices, time of day, age and mood rather than the individual, my poetry samples don't seem to have other than a beneficial effect. One day I will try to make that comment a bit more scientifically concrete.

Comparing librivox to EU debate sources, I guess they both have a downside: accessing video to extract just the minor audio component and the tedium of trimming headers and footers. I guess that's where scripting comes in handy.
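
A minimal sketch of that kind of script, assuming `ffmpeg` is installed and that the total duration and the header and footer lengths are already known (for example from listening to the file). The helper name `build_extract_cmd`, the file names, and the 16 kHz mono output are assumptions for illustration only.

```python
from typing import List


def build_extract_cmd(video_path: str, wav_path: str, total_s: float,
                      head_s: float = 0.0, tail_s: float = 0.0,
                      rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that drops the video stream and trims
    `head_s` seconds from the start and `tail_s` seconds from the end."""
    keep_s = total_s - head_s - tail_s
    if keep_s <= 0:
        raise ValueError("header and footer are longer than the recording")
    return [
        "ffmpeg",
        "-i", video_path,
        "-vn",                    # discard the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(rate),         # resample (16 kHz is an assumed target)
        "-ss", f"{head_s:.3f}",   # skip the header
        "-t", f"{keep_s:.3f}",    # keep only the body
        wav_path,
    ]


# To actually run it (requires ffmpeg on the PATH):
#   import subprocess
#   subprocess.run(build_extract_cmd("debate.mp4", "debate.wav", 1800, 30, 20),
#                  check=True)
```

Building the argument list separately from running it makes it easy to loop over many files and to check the command before any audio is touched.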

--- (Edited on 2015-12-23 3:42 am [GMT-0500] by colbec) ---

Re: European Parliament
User: nsh
Date: 1/31/2016 5:40 pm
Views: 64
Rating: 0

Hi Tony

I think that, from a many-languages point of view, this website has great potential for crawling:

http://www.amara.org/en/videos/UhmavM4GFMwA/ko/897233/

Most TED videos are subtitled; it really could be an amazing dataset if we get them in a more organized form.

 

--- (Edited on 2/1/2016 02:40 [GMT+0300] by nsh) ---

Re: European Parliament
User: TonyR
Date: 2/1/2016 3:05 am
Views: 83
Rating: 0

Wow - that's cool. I knew that Amara was the platform that TED used and that TED was English and TEDx wasn't limited to English. And now I know that 박수 means applause in Korean - Google Translate can help me understand any language so we can build ASR in it - great!

--- (Edited on 1-February-2016 9:05 am [GMT+0000] by TonyR) ---

Re: European Parliament
User: nsh
Date: 2/1/2016 8:08 am
Views: 34
Rating: 1

I just spoke with Sylvain Chevalier from Amara. Another very good thing is that they have an API to access all the data:

http://amara.readthedocs.org/en/latest/api.html

This is really cool.

--- (Edited on 2/1/2016 17:08 [GMT+0300] by nsh) ---

Re: European Parliament
User: colbec
Date: 2/1/2016 8:57 am
Views: 2735
Rating: 0

In the same vein, CBC in Canada is starting a project to make episodes of certain radio programmes available to the hearing impaired community with transcripts of current affairs interviews - see http://www.cbc.ca/radio/thecurrent/episode-transcripts-asl-videos-make-the-current-available-to-hearing-impaired-1.3425933.

The advantage of this type of material is that the audio is voice only (with rare exceptions), normal speech intonation, and few individual voices participating.

I happened to hear the announcement at the end of an interview with Nick Bostrom of Oxford U. about super AI a few minutes ago. The problem might be getting permission. CBC keeps its material tightly controlled; this is partly to protect its guests such as Nick B. - he might like the idea of his voice being absorbed into a grand Voxforge model, but others might not. It might be easy to get access to the text transcript both here and from Amara, but the audio would be another issue entirely.

--- (Edited on 2016-02-01 9:57 am [GMT-0500] by colbec) ---

--- (Edited on 2016-02-01 10:06 am [GMT-0500] by colbec) ---
