VoxForge
The LT and the Telecooperation group have open-sourced their German spoken language corpus, recorded during 2014 and 2015 with several speakers from their department.
The corpus has about 35 hours of speech. About 180 speakers have read
aloud sentences from the German Wikipedia, protocols from the European
Parliament and some individual commands.
The speakers have confirmed that the recorded speech can be distributed under a CC-BY license.
For each sentence, the speaker metadata (age class, region, corpus,
transcript sentence etc.) was generated, along with an individual
WAV file for each microphone. The recordings were collected with the
software KiSrecord (which supports concurrent recordings via multithreading).
The distance between the speaker and each microphone is 1 meter. More
details here (pdf). The target is distant speech recognition and a
speaker-independent acoustic model. In addition to the open speech data corpus, they have also
developed acoustic models for Sphinx and Kaldi.
Their motivation was to support open source speech recognition. When their research in speech recognition began, they were faced with
the general problem of obtaining speech data and decided to support
open source speech recognition projects (instead of buying commercial
software).
They would like what they have developed to be used further
by other research institutions or companies. The current size of
the open speech data corpus is meant only as a first start.
VoxForge mirror of the Open Speech Data Corpus for German:
german-speechdata-TUDa-2015.tar.gz 27-Mar-2015 17:21 16G
The LT and the Telecooperation group now have a final update to the speech data corpus. Many words have been corrected in the second version (an issue with text normalisation, e.g. thousands separators as in 1.000.000).
We received feedback from the community, which let us improve the speech data corpus :)
You can find the new corpus here:
german-speechdata-package-v2.tar.gz 01-Jul-2015 14:07 16G
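The thousands-separator issue mentioned above can be illustrated with a small sketch. This is not the normaliser used by the corpus authors; the function name `strip_thousands_separators` and the regular expression are assumptions made for illustration, and the sketch only shows how dot-separated digit groups such as 1.000.000 could be joined before further normalisation, while leaving dates like 27.03.2015 alone.

```python
import re

# Assumed pattern: 1-3 leading digits followed by one or more groups of
# exactly three digits, each introduced by a dot (German thousands style).
# The lookbehind/lookahead keep it from firing inside dates or longer numbers.
_THOUSANDS = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{3})+(?!\d)")


def strip_thousands_separators(text: str) -> str:
    """Join dot-separated digit groups, e.g. '1.000.000' -> '1000000'."""
    return _THOUSANDS.sub(lambda m: m.group(0).replace(".", ""), text)


if __name__ == "__main__":
    print(strip_thousands_separators("Die Stadt hat 1.000.000 Einwohner."))
```

A real normaliser for speech transcripts would probably go on to spell the number out in words; that step is left out here.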
Problems with this corpus:
A lot of audio files are assigned to wrong transcriptions (~15%):
Example 1:
File: german-speechdata-package-v2/train/2014-08-04-13-15-38.xml (sentence id: 113)
Text should read: "In den wenigen beobachteten Fällen wurden diese großen Beutetiere innerhalb von Sekunden getötet."
But the corresponding audio files (e.g. 2014-08-04-13-15-38_Yamaha.wav) contain only "Okay"
(according to "SentencesAndIDs.raw.txt" sentence id 851)
Example 2:
File: german-speechdata-package-v2/train/2014-08-04-13-13-22.xml (sentence id: 158)
Text should read: "Das Land der offenen Fernen wie die Rhön auch genannt wird ..."
The associated audio files (e.g. 2014-08-04-13-13-22_Yamaha.wav) contain the sentence "Ich weiß nicht" (sentence-id 815)
---
It seems that WAV files from the "command" corpus have overwritten those from the "wiki" corpus. This may be easy for the creator of the corpus to fix, but I cannot find a way to fix it with the given files.
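For anyone who wants to spot-check recordings by ear, here is a rough sketch that groups the files of one directory by recording. It assumes the naming seen in the examples above, i.e. `<timestamp>.xml` for the metadata and `<timestamp>_<microphone>.wav` for the audio; the function name `group_recordings` is hypothetical. It does not detect wrong transcriptions by itself, it only pairs each XML file with its WAV files.

```python
from collections import defaultdict
from pathlib import Path


def group_recordings(filenames):
    """Map each recording id (timestamp stem) to its XML file and microphone WAVs."""
    groups = defaultdict(lambda: {"xml": None, "wavs": {}})
    for name in filenames:
        path = Path(name)
        if path.suffix == ".xml":
            groups[path.stem]["xml"] = name
        elif path.suffix == ".wav":
            # Assumed pattern: <recording-id>_<microphone>.wav
            rec_id, _, mic = path.stem.partition("_")
            groups[rec_id]["wavs"][mic] = name
    return dict(groups)


if __name__ == "__main__":
    example = [
        "2014-08-04-13-15-38.xml",
        "2014-08-04-13-15-38_Yamaha.wav",
    ]
    for rec_id, files in group_recordings(example).items():
        print(rec_id, files["xml"], sorted(files["wavs"]))
```

To run it on the extracted corpus, pass in the file names of the train directory, for example `[p.name for p in Path("german-speechdata-package-v2/train").iterdir()]`.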
>A lot of audio files are assigned to wrong transcriptions (~15%):
This is not a VoxForge corpus.
We are mirroring it for the LT and the Telecooperation group.
It is best to contact them.
If any updates are made, I can update the mirrored copy here.
thanks,
Ken