Re: Ralph's TTS voice
User: Robin
Date: 2/22/2008 3:54 pm
Views: 292
Rating: 28
Just a thought: what happens if you use the speech of more than one person? Did anyone ever try that?
Re: Ralph's TTS voice
User: kmaclean
Date: 2/29/2008 12:42 pm
Views: 502
Rating: 81

Hi Robin,

My understanding is that it depends on the type of Text-To-Speech ("TTS") engine you are using.

If you use a TTS engine that is based on the concatenation of diphones (like the MBROLA speech synthesizer - though this is not a true 'text'-to-speech engine, since it doesn't accept raw text as input), then using speech from more than one person will not sound natural, since different parts of a word may contain speech from different people.
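To make that concrete, here is a toy sketch (plain Python/NumPy, with hypothetical per-diphone WAV files - nothing like MBROLA's real code) of what diphone concatenation boils down to: each diphone is a short recorded waveform, and synthesis just glues the units together with a short crossfade, so any voice mismatch between units is audible at every join.

    # Toy diphone concatenation (illustration only, not MBROLA's actual code).
    # Assumes hypothetical mono 16 kHz WAV files, one per diphone, e.g. "h-e.wav".
    import wave
    import numpy as np

    def read_wav(path):
        with wave.open(path, "rb") as w:
            data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        return data.astype(np.float32)

    def concatenate(diphone_paths, overlap=80):
        """Join diphone waveforms with a short linear crossfade at each boundary."""
        out = read_wav(diphone_paths[0])
        fade = np.linspace(0.0, 1.0, overlap)
        for path in diphone_paths[1:]:
            nxt = read_wav(path)
            # If one unit came from speaker A and the next from speaker B,
            # this boundary is exactly where the mismatch becomes audible.
            out[-overlap:] = out[-overlap:] * (1 - fade) + nxt[:overlap] * fade
            out = np.concatenate([out, nxt[overlap:]])
        return out

    # "hello" as a (hypothetical) diphone sequence: _-h h-e e-l l-ou ou-_
    signal = concatenate(["_-h.wav", "h-e.wav", "e-l.wav", "l-ou.wav", "ou-_.wav"])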

HTS uses Hidden Markov Models (HMMs) to generate speech.  HMMs are also used for speech recognition (Sphinx/HTK/Julius all use them); in this case, rather than using the statistical models to recognize speech, they are used to generate it.  Thus the use of many speakers is possible (I'm not sure what the maximum would be), because the differences between speakers should be smoothed out by the statistical modelling process.
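As a rough numerical illustration of that smoothing (a toy sketch, not HTS itself): if you pool one acoustic feature, say the pitch of a phone, from several speakers and fit a single Gaussian for an HMM state, generation works from the pooled statistics, so individual speaker quirks average out into one blended voice.

    # Toy sketch: pool one feature (F0 in Hz) from several speakers and fit a
    # single Gaussian, as a stand-in for what one HMM state would store.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical F0 samples for the same phone from three different speakers.
    speakers = {
        "low_voice":  rng.normal(110, 8, 200),
        "mid_voice":  rng.normal(160, 10, 200),
        "high_voice": rng.normal(210, 12, 200),
    }

    pooled = np.concatenate(list(speakers.values()))
    mean, std = pooled.mean(), pooled.std()

    # A parametric synthesizer generates from the pooled statistics, so the
    # output pitch sits between the speakers rather than matching any one of them.
    generated_f0 = rng.normal(mean, std, 100)
    print(f"pooled mean F0: {mean:.1f} Hz, std: {std:.1f} Hz")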

I think that Cepstral's VoiceForge uses hidden Markov modelling to generate speech, which allows users to quickly create new voices by "adapting" a base HMM TTS model to their voice.

The Festival Speech Synthesis System can use different types of text-to-speech engines (it can use MBROLA, HTS, ...).  Its default engine is Multisyn, a general-purpose unit selection synthesis engine.  This excerpt(1) describes the Festival multisyn implementation:

The multisyn implementation in Festival uses a conventional unit selection algorithm. A target utterance structure is predicted for the input text, and suitable diphone candidates from the inventory are proposed for each target diphone. The best candidate sequence is found that minimises target and join costs.

So it seems that Festival's multisyn implementation uses diphones as its units.  Because of this, it would likely have the same problems as MBROLA if speech from more than one person were used to create the audio base for the TTS model.
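To make "minimises target and join costs" a bit more concrete, here is a toy dynamic-programming sketch (my own illustration, not the Multisyn code): each target diphone has several candidate units, a target cost measures how well a candidate fits the predicted context, and a join cost penalises mismatched boundaries between consecutive candidates - a speaker change at a join would be exactly such a mismatch.

    # Toy unit selection: pick one candidate per target diphone so that the sum
    # of target costs plus join costs between consecutive picks is minimal
    # (Viterbi-style dynamic programming).
    def select_units(candidates, target_cost, join_cost):
        # candidates: list (one entry per target diphone) of lists of units
        n = len(candidates)
        best = [[(target_cost(0, u), None) for u in candidates[0]]]
        for t in range(1, n):
            row = []
            for u in candidates[t]:
                scores = [
                    best[t - 1][i][0] + join_cost(prev, u) + target_cost(t, u)
                    for i, prev in enumerate(candidates[t - 1])
                ]
                i_best = min(range(len(scores)), key=scores.__getitem__)
                row.append((scores[i_best], i_best))
            best.append(row)
        # Backtrack the cheapest path.
        j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
        path = []
        for t in range(n - 1, -1, -1):
            path.append(candidates[t][j])
            if t > 0:
                j = best[t][j][1]
        return list(reversed(path))

    # Units could be (speaker, pitch, duration) tuples, with a join cost that
    # heavily penalises a speaker change at the boundary.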

Ken 

(1) Multisyn Voices from ARCTIC Data for the Blizzard Challenge, Robert Clark, Korin Richmond and Simon King, CSTR, The University of Edinburgh, Edinburgh, UK.

Re: Zipf's law; German acoustic model
User: kmaclean
Date: 2/29/2008 12:49 pm
Views: 459
Rating: 86

Hi Ralf,

Thanks for bringing up Zipf's law.  For others (like me) who have never heard of it, here is an excerpt from Wikipedia:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. For example, in the Brown Corpus "the" is the most frequently occurring word, and all by itself accounts for nearly 7% of all word occurrences (69971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36411 occurrences), followed by "and" (28852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
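If you want to check this on your own text, a minimal sketch (plain Python; "corpus.txt" is just a placeholder for any large text file) that prints the rank-frequency table Zipf's law talks about:

    # Rank-frequency count; under Zipf's law, freq * rank should stay roughly
    # constant across the top-ranked words.
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:   # any large text file
        words = f.read().lower().split()

    counts = Counter(words)
    total = sum(counts.values())
    for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
        print(f"{rank:2d}  {word:12s}  {freq:8d}  {100 * freq / total:5.2f}%  "
              f"freq*rank = {freq * rank}")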

Amazing stuff,

thanks,

Ken 

P.S.  re: "I just hope that you have enough webspace to store those prompts."

No worries, disk space gets cheaper every year - not sure if it follows Moore's Law, but it must be pretty close :)

Re: covering all nodes of a language
User: kmaclean
Date: 2/29/2008 1:27 pm
Views: 369
Rating: 31

I just came across a tool on the Festvox site that might be helpful:

find_nice_prompts: tools for building balanced prompt lists

Scripts for finding a nice set of prompts.  Given a large set of text data find a balanced subset of "nice" prompts that can be recorded.

I haven't had a chance to look at it in detail.
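The general idea behind this kind of tool (a rough sketch of a greedy coverage heuristic, not the Festvox script itself) is to keep picking whichever sentence adds the most units, e.g. diphones, that aren't covered yet:

    # Greedy prompt selection sketch: repeatedly pick the sentence that covers
    # the most diphones not yet covered.  Illustration only: phones are
    # approximated by letters here; a real script would use a lexicon/phonetizer.
    def diphones(sentence):
        letters = [c for c in sentence.lower() if c.isalpha()]
        return {a + b for a, b in zip(letters, letters[1:])}

    def select_prompts(sentences, max_prompts=500):
        covered, chosen = set(), []
        remaining = {s: diphones(s) for s in sentences}
        while remaining and len(chosen) < max_prompts:
            best = max(remaining, key=lambda s: len(remaining[s] - covered))
            gain = remaining[best] - covered
            if not gain:              # nothing new left to cover
                break
            covered |= gain
            chosen.append(best)
            del remaining[best]
        return chosen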

Ken 

Re: Zipf's law; German acoustic model
User: nsh
Date: 3/30/2008 6:59 pm
Views: 1048
Rating: 33

Hi Ralf.

So far I can only offer numbers.  Thanks to Ken, who uploaded almost all the audio we have, I was able to train a model on 21 hours of speech.  The results are very good for your speech:

  SENTENCE ERROR: 9.6% (23/249)   WORD ERROR RATE: 2.6% (50/1928)

 but very bad for other speakers:

  SENTENCE ERROR: 86.4% (190/221)   WORD ERROR RATE: 52.8% (884/1675)

The model is clearly overtrained.  Basically, other speakers even get dropped from training because of alignment issues.  The 21 hours we already have is enough even for TTS; moreover, the recordings aren't designed to be a good TTS database (they don't cover the intonation aspects they should).  So don't submit more, it's already fine.  I doubt we'll ever get a balanced database with 10 hours from 10 speakers.
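For reference, the word error rate above is just the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and the reference transcript, divided by the number of reference words; a minimal sketch:

    # Minimal word error rate: Levenshtein distance over words, divided by the
    # number of reference words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[-1][-1] / len(ref)

    print(wer("guten morgen wie geht es dir", "guten morgen wie gut es dir"))  # ~0.17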

But again, I uploaded the model, so you can now have your own very precise recognizer.

Re: Ralph's TTS voice
User: dh
Date: 11/12/2009 8:18 pm
Views: 91
Rating: 6

What has become of this?  Was a German TTS voice for Festival created, and if so, is it available somewhere?

Re: Ralph's TTS voice
User: nsh
Date: 11/13/2009 5:22 pm
Views: 5697
Rating: 7

Nothing was built.

Well, it's not that necessary nowadays, since openmary released their German voices under a BSD license.  But it would still be interesting to do for someone who wants to become familiar with Festival voice building.  Also, it would be GPL, which is important for some people.

 
