VoxForge
I picked up one of these for a couple of reasons:
1. Noise cancelling - sometimes extraneous sounds interfere with my recordings
2. Quality of incoming sound - sometimes it is hard to hear what Festival voices are saying and better quality headphones might improve that situation
3. Wired, so instant-on - wireless headsets take a small time to wake up
Experience (same computer setup and prompt set as in my previous submissions in this subforum):
Noise cancelling works well.
Expected improvement in decipherability of Festival voices did not materialize. The issue must be at the Festival source. Another approach is required.
Instant-on is perfect, making sample recording a more definite experience.
In addition:
There is a bit of DC offset with this mike, but it does not seem to interfere with the quality of the recording or with deciphering by HTK.
I was expecting that the higher quality of the headset would help me get to a working, accurate recognition model more quickly and with fewer audio samples than the wireless and Bluetooth headsets, but this was not so.
Sennheiser support was excellent (I thought I had a problem, but it was PulseAudio, not the hardware).
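For anyone who wants to check the DC offset mentioned above, here is a minimal sketch. It assumes the recording has already been decoded into a list of signed 16-bit sample values; the function names and the sample data are hypothetical, not part of HTK or any VoxForge tool.

```python
from statistics import fmean


def dc_offset(samples):
    """Return the mean sample value; a value near 0 means no DC offset."""
    return fmean(samples) if samples else 0.0


def remove_dc_offset(samples, lo=-32768, hi=32767):
    """Subtract the rounded mean from every sample, clamped to the 16-bit range."""
    shift = round(dc_offset(samples))
    return [max(lo, min(hi, s - shift)) for s in samples]


# Hypothetical samples: a small signal riding on a constant +200 offset.
samples = [200 + v for v in (0, 1000, 0, -1000, 0, 1000, 0, -1000)]
print(dc_offset(samples))                    # offset before correction
print(dc_offset(remove_dc_offset(samples)))  # offset after correction
```

If the first number is small compared with the signal amplitude, the offset is probably harmless, which would match the experience described above.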
Hi colbec,
I am using a Sennheiser PC131 headset (price was about 35 Euros). I used exactly this headset to recognize 200 German words under Ubuntu (100% perfect recognition).
I think that it is not necessary to invest more money into a headset for speech recognition.
The quality of the recording might be negatively influenced by electromagnetic interference inside your computer chassis if you are using on-board sound on an older mainboard. As a workaround, you can try an external USB sound card.
Greetings,
Ralf
100% accurate is a wonderful result, Ralf, particularly after training each word only three times. In my own case, 99% of words need only 3 trains; others occasionally need a bit (or a lot!) more. True, I need a more modern computer; this one has seen a lot of adventures (IBM NetVista, 1.6 GHz).
However, having errors in a way is good. It forces me to think of ways around problems and extends my dialog manager to be more thoughtful in the way it handles errors, and in the way it requires me to test my models.
Each mike I use seems to turn up a different set of occasionally problematic words. Probably, as you say, this is due to the weakest point in the chain, which, in the case of the Sennheiser, is not the headset itself. I would not hesitate to get another of these headsets.
"weakest point in the chain" - probably, the weakest part is not the headset.
My tip: While speaking (for training and for recognition), read directly from the SAMPA phonemes in the pronunciation dictionary. Try to speak every phoneme that is part of the specific word. Don't omit any phoneme while speaking. While speaking, concentrate on the phonemes. This way, you can get very good results.
A new computer? Focus on the weakest part of the computer. I am using
- power supply: Seasonic S12II-380 380W ATX 2.2
- CPU cooler: Arctic Cooling Freezer 64 Pro PWM
Both components are pretty quiet (I am sure that there are better solutions, but this solution is perfect for me). You could replace the power supply in your existing computer with a new one. Maybe that will make it quieter.
The weak parts for speech recognition are:
- on-board-sound on older mainboards;
- cheap and loud power supplies. A good power supply costs at least 45 Euros;
- cheap and loud CPU cooler. A good CPU cooler costs at least 13 Euros;
- your own speech has to match the phonemes that are in the pronunciation dictionary. By the way, I am looking for native speakers who are willing to improve my PLS dictionaries. The PLS dictionaries are currently a very weak part for almost all languages. PLS is good because we can integrate the terminal information (verb, noun, adjective).
You can see that I know about the weak parts. The weak part is not the speed of your computer (1.6 GHz). Until a few days ago, my best computer had 1.8 GHz, and that speed was sufficient for speech recognition. Maybe you should take a closer look at the on-board sound, the noise of your power supply, the noise of your CPU cooler, and the phonemes in your pronunciation dictionary.
Try to find and fix the weakest part of your system.
Ralf, some interesting points here.
On the topic of reading from phonemes, I can see the idea; however, it requires quite a bit of concentration from the user. Particularly with respect to the NATO alphabets and other lists of words designed to be recognizable (I raised this earlier), if there is a need to resort to phonemes then clearly the word has been badly chosen (LYMA-LEEMA, KWEEBEC-KABEC).
Also, at least in English, I suspect that the occasional joining of words in two-word sentences is causing me a problem. Combinations such as PROGRAM STOP have a natural pause between the words; however, LIMA ROMEO sometimes has a discernible silence between the words and sometimes not. If I supply samples of both kinds in about equal proportions, my recognition rate goes up.
With regard to power supplies, I have tried a number of different headsets and each one comes up with its own set of problem words. I was hoping that I would see some common elements which would indicate an issue at a common weak point, but I don't see it in practice. The Sennheiser gives me an issue with the word GOLF. Bizarre. If it were a phoneme issue, it should show up with the other headsets as well, but none of them has it. Sufficient practice smooths things out.
I'm sorry not to be able to help with the German, my vocabulary extends to about 2 words.
"the word has been badly chosen" - tip: when you choose long words, you have more phonemes. More phonemes means better recognition (because there is less room for misunderstanding).
If you watch my video with the 200 German words, you can see that I am not using short words, only long words. Colbec, you used the words LIMA, ROMEO, GOLF, and QUEBEC. These are short words. I would recommend that you don't use such short words. Try to use long words (because each word has more phonemes).
Try long words like NEIGHBOURHOOD, BACKWARDS. Don't use STOP, use instead STOPPING (the -ing form is longer). Train the plural form (ending with an -s) instead of the singular form.
The goal is to get an accuracy that is as high as possible. With long words, it is much easier. Focus on long words!
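The "more phonemes" argument can be made concrete by counting phonemes per word. The following is only a sketch: the transcriptions are hand-written for illustration and are not copied from VoxForgeDict or any other real dictionary, and the function names are hypothetical.

```python
# Illustrative, hand-written transcriptions (an assumption, not real dictionary entries).
PRONUNCIATIONS = {
    "GOLF": "g aa l f",
    "LIMA": "l iy m ax",
    "STOP": "s t aa p",
    "STOPPING": "s t aa p ix ng",
    "BACKWARDS": "b ae k w er d z",
    "NEIGHBOURHOOD": "n ey b er hh uh d",
}


def phoneme_count(word, pronunciations=PRONUNCIATIONS):
    """Number of space-separated phonemes in the word's transcription."""
    return len(pronunciations[word].split())


def rank_by_length(pronunciations=PRONUNCIATIONS):
    """Return (word, count) pairs, longest first; ties are sorted alphabetically."""
    counts = [(w, len(p.split())) for w, p in pronunciations.items()]
    return sorted(counts, key=lambda wc: (-wc[1], wc[0]))


for word, count in rank_by_length():
    print(f"{word:15s} {count}")
```

With these transcriptions, STOPPING has two phonemes more than STOP, which is the kind of difference the advice above relies on.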
"natural pause" - compare this with the glottal stop. I am not sure how this will be implemented
- if there is a pause inside the word (I think that there should be a special phoneme for the glottal stop);
- if there is a pause between the words (I don't know how this problem can be handled; my current focus is not the recognition of utterances, only the recognition of single spoken words).
"Sufficient practice smooths things out" - my plan is to find bad training wav files with sam. The idea is to use confidence scores (I think that this value is delivered by Julius, but I don't know for sure) for the evaluation of the wav files that are used for the training of the speech model. If a wav file has a bad confidence score, then I can delete the specific file, and generate the speech model again. I hope that I find a way to make it work.
"not to be able to help with the German" - that is no problem. What is your native language? If it is English, then you could first install simon, and then import VoxForgeDict. I would like to know: how do you feel when you use simon for training/recognition?
For example, I would like to have an English pronunciation dictionary that doesn't use only capitalized letters (like VoxForgeDict).
So you are using Sennheiser PC131. That should be totally sufficient for good recognition results. Why don't you use sam to test the wav files that you have recorded with your Sennheiser headset?
Ralf, I understand there is no shortage of long words in the German language!
Using long words is one approach, but since with short words the errors are minimal, I have found that having my dialog manager isolate and remember words which give problems and then ask me to practice them with more prompts makes great strides towards reliable recognition.
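A minimal sketch of the "isolate and remember" idea follows. It is not the actual dialog manager; the class name, the thresholds, and the recorded outcomes are assumptions chosen for illustration.

```python
from collections import defaultdict


class ProblemWordTracker:
    """Count recognition attempts and errors per word."""

    def __init__(self, min_trials=5, max_error_rate=0.2):
        self.min_trials = min_trials          # ignore words with too little data
        self.max_error_rate = max_error_rate  # above this, the word needs practice
        self.trials = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, word, correct):
        """Store the outcome of one recognition attempt."""
        self.trials[word] += 1
        if not correct:
            self.errors[word] += 1

    def error_rate(self, word):
        n = self.trials.get(word, 0)
        return self.errors.get(word, 0) / n if n else 0.0

    def words_to_practice(self):
        """Words with enough trials and an error rate above the limit, sorted."""
        return sorted(
            w for w, n in self.trials.items()
            if n >= self.min_trials and self.error_rate(w) > self.max_error_rate
        )


tracker = ProblemWordTracker()
for ok in (True, False, True, False, True):
    tracker.record("GOLF", ok)
for _ in range(5):
    tracker.record("LIMA", True)
print(tracker.words_to_practice())
```

The words returned by `words_to_practice()` are the ones for which extra prompts would be requested.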
One technique I have used is to identify duplicate lists of words that mean the same thing. For example ONE, TWO, ZERO as one set, with ONE-ONLY, TWO-ONES, ZERO-NONE as another set where each of the hyphenated words is declared as a special word in the lexicon. Since both sets translate into the same concept as a final product, {1,2,0}, if my dialog manager detects unusually high errors for one word then I can switch that word to the more time-consuming but more recognizable alternative with no change to code or grammar by just saying it. But if the short word offers no problems then I use it.
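The duplicate-list technique can be sketched as a small lookup table. This is an illustration only: the table layout, the function names, and the error-rate threshold are assumptions, and the real lexicon and grammar entries are not shown.

```python
# Each concept has a short word and a longer, more recognizable alternative.
ALTERNATIVES = {
    "1": ("ONE", "ONE-ONLY"),
    "2": ("TWO", "TWO-ONES"),
    "0": ("ZERO", "ZERO-NONE"),
}


def to_concept(word):
    """Map either spoken form to the shared concept, or None if unknown."""
    for concept, forms in ALTERNATIVES.items():
        if word in forms:
            return concept
    return None


def preferred_word(concept, error_rates, max_error_rate=0.2):
    """Suggest the short word unless its observed error rate is too high."""
    short, long_form = ALTERNATIVES[concept]
    if error_rates.get(short, 0.0) > max_error_rate:
        return long_form
    return short


# Hypothetical observed error rates: ONE is misrecognized often, TWO is not.
error_rates = {"ONE": 0.35, "TWO": 0.05}
print(preferred_word("1", error_rates))
print(preferred_word("2", error_rates))
print(to_concept("ZERO-NONE"))
```

Because both forms resolve to the same concept, the code that consumes the result does not need to know which form was spoken.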
I wish you luck using the Julius scores. My own experience has been very confusing. I have frequently seen better scores associated with incorrectly identified words than with correctly identified ones. I have yet to sort out a pattern I can fit into my DM reliably.
I will try to follow up on your other suggestions when I can.