VoxForge
Is there any trick to getting the Linux quickstart to work on a 64-bit laptop running Ubuntu 10.04? I confirmed my laptop's mic works by recording speech with Audacity, but when I run the README's suggested "./julian -input mic -C julian.jconf", it seems to initialize, and I get the prompt:
### read waveform input
<<< please speak >>>
but no matter what I say, nothing seems to happen.
How are you supposed to use this? Does it print out text to the console as you speak, or does it save text to a file somewhere?
I have a few simple Wav files that I've been trying to process with Julian, using:
./julian -input rawfile -filelist filelist.txt -C julian.jconf
Julian seems to run through some startup procedure, but doesn't seem to output any converted text, and then exits with the message:
------------- System Info end -------------
------
### read waveform input
adin_file: channel num != 1 (2)
Error: failed to read /tmp/audio/hello-2.wav as a wav file
*** glibc detected *** ./julian: corrupted double-linked list: 0x0954cff0 ***
Why couldn't it read my wav file?
--- (Edited on 9/29/2011 9:20 pm [GMT-0500] by Cerin) ---
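The "channel num != 1 (2)" message above suggests the file is stereo while Julius wants a single channel. A minimal sketch for checking a WAV header before handing the file to Julius, using only Python's standard wave module. The function names are hypothetical, and the 16-bit check is an assumption about what Julius accepts rather than something the log above states:

```python
import wave


def describe_wav(path):
    """Read the header fields of a WAV file with the stdlib wave module."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
        }


def mismatches(info, expected_rate=16000):
    """List the ways a file differs from mono, 16-bit audio at expected_rate.

    expected_rate should match -smpFreq in the jconf file; the mono
    requirement comes from the "channel num != 1" error, the 16-bit
    requirement is an assumption.
    """
    problems = []
    if info["channels"] != 1:
        problems.append("channels: expected 1, found %d" % info["channels"])
    if info["sample_rate"] != expected_rate:
        problems.append(
            "sample rate: expected %d, found %d" % (expected_rate, info["sample_rate"])
        )
    if info["bits_per_sample"] != 16:
        problems.append(
            "bits per sample: expected 16, found %d" % info["bits_per_sample"]
        )
    return problems
```

Calling mismatches(describe_wav("/tmp/audio/hello-2.wav")) should then report the channel count as one of the problems, if the file really is stereo.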
I've also tried running the "controlapp" example included in the Ubuntu package. I ran through the README's instructions, and it seems to run without error, but all it does is show the text "Taking control of Rhythmbox media player" and nothing happens when I speak into the mic.
Running julius without piping it into command.py shows the following. Am I doing anything wrong?
STAT: include config: julian.jconf
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: rdhmmdef: ascii format HMM definition
Stat: rdhmmdef: limit check passed
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: rdhmmdef: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 8002
Stat: init_phmm: loading ascii hmmlist
Stat: init_phmm: logical names: 9406 in HMMList
Stat: init_phmm: base phones: 44 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: making pseudo bi/mono-phone for IW-triphone
Stat: hmm_lookup: 1085 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [mediaplayer.dfa] and [mediaplayer.dict]...
Stat: init_voca: read 9 words
STAT: done
STAT: Gram #0 mediaplayer registered
STAT: Gram #0 mediaplayer: new grammar loaded, now mash it up for recognition
STAT: Gram #0 mediaplayer: extracting category-pair constraint for the 1st pass
STAT: Gram #0 mediaplayer: installed
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 126 nodes
STAT: coordination check passed
STAT: multi-gram: beam width set to 126 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: [5] prepare for real-time decoding
STAT: All init successfully done
STAT: ###### initialize input device
Stat: adin_oss: device name = /dev/dsp
Stat: adin_oss: sampling rate = 16000Hz
Stat: adin_oss: going to set latency to 50 msec
Stat: adin_oss: audio I/O Latency = 32 msec (fragment size = 512 samples)
----------------------- System Information begin ---------------------
JuliusLib rev.4.1.2 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension :
- Compiled by : cc -g -O2 -g -Wall -O2
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=/usr/share/julius-voxforge/acoustic/hmmdefs
hmmmapfilename=/usr/share/julius-voxforge/acoustic/tiedlist
Language Model:
- LM00 "_default"
grammar #1:
dfa = mediaplayer.dfa
dict = mediaplayer.dict
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_0_D_N_Z (25 dim. from 12 cepstrum + c0, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = OFF
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cepstral normalization = real-time MAP-CMN
base setup from = Julius defaults
MAP-CMN:
initial cep. data = none
beginning data weight = 100.00
beginning data update = yes, from last inputs at each input
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
8002 models, 5950 states, 5950 mpdfs, 5950 Gaussians are defined
model type = context dependency handling ON
training parameter = MFCC_N_D_Z_0
vector length = 25
number of stream = 1
stream info = [0-24]
cov. matrix type = DIAGC
duration type = NULLD
max mixture size = 1 Gaussians
max length of model = 5 states
logical base phones = 44
model skip trans. = exist, require multi-path handling
skippable models = sp (1 model(s))
AM Parameters:
Gaussian pruning = safe (-gprune)
top N mixtures to calc = 2 / 0 (-tmix)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use max. prob. of same LC)
sp transition penalty = -70.0
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=grammar
DFA grammar info:
5 nodes, 4 arcs, 4 terminal(category) symbols
category-pair matrix: 20 bytes (544 bytes allocated)
Vocabulary Info:
vocabulary size = 9 words, 33 models
average word len = 3.7 models, 11.0 states
maximum state num = 24 nodes per word
transparent words = not exist
words under class = not exist
Parameters:
found sp category IDs =
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 126
root node num = 9
leaf node num = 9
(-penalty1) IW penalty1 = +5.0
(-penalty2) IW penalty2 = +20.0
(-cmalpha)CM alpha coef = 0.050000
inter-word short pause = on (append "sp" for each word tail)
sp transition penalty = -70.0
Search parameters:
multi-path handling = yes, multi-path mode enabled
(-b) trellis beam width = 126 (-1 or not specified - guessed)
(-n)search candidate num= 1
(-s) search stack size = 500
(-m) search overflow = after 2000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 200
(-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
(-sb)2nd scan beamthres = 200.0 (in logscore)
(-n) search till = 1 candidates found
(-output) and output = 1 candidates out of above
IWCD handling:
1st pass: approximation (use max. prob. of same LC)
2nd pass: loose (apply when hypo. is popped and scanned)
all possible words will be expanded in 2nd pass
build_wchmm2() used
lcdset limited by word-pair constraint
short pause segmentation = off
fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = real time, on-the-fly
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = microphone
device API = default
sampling freq. = 16000 Hz
threaded A/D-in = supported, on
zero frames stripping = on
silence cutting = on
level thres = 2000 / 32767
zerocross thres = 60 / sec.
head margin = 300 msec.
tail margin = 400 msec.
long-term DC removal = off
reject short input = off
----------------------- System Information end -----------------------
*************************************************************
* NOTICE: The first input may not be recognized, since *
* no initial CMN parameter is available on startup. *
* for MFCC01*
*************************************************************
STAT: AD-in thread created
<<< please speak >>>^C
--- (Edited on 9/29/2011 9:45 pm [GMT-0500] by Cerin) ---
While your sound device may be working, Julius evidently cannot hear because it is not listening on the right sound system or the right device.
I can't speak for Ubuntu but - in the above output note the
STAT: ###### initialize input device
Stat: adin_oss: device name = /dev/dsp
Your Julius is trying to use OSS. This might be right, it might not. Check the packages you have installed for ALSA, which is perhaps a bit more prevalent. If you are using ALSA, Julius should be compiled with the ALSA libraries present. If these are found, Julius would output something like 'adin_alsa: device name ...'. At that point you can use the ALSADEV environment variable according to Julius info.
The second issue is whether /dev/dsp is right or not. What physical sound hardware are you using? Are you using a sound server like pulseaudio?
--- (Edited on 9/30/2011 12:13 pm [GMT-0500] by colbec) ---
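The check described above (look for the "adin_..." line to see which sound system Julius was built for) can be sketched in Python. This assumes the startup log has been captured as lines of text. The pattern is modelled on the "adin_oss: device name = /dev/dsp" line quoted above, and the assumption is that an ALSA build prints the same shape of line with "adin_alsa". The function name is hypothetical:

```python
import re

# Matches lines such as "Stat: adin_oss: device name = /dev/dsp".
ADIN_LINE = re.compile(r"adin_(\w+): device name = (\S+)")


def find_input_backend(log_lines):
    """Return (backend, device) from the first matching log line, or None."""
    for line in log_lines:
        match = ADIN_LINE.search(line)
        if match:
            return match.group(1), match.group(2)
    return None
```

On the log posted earlier this would return ("oss", "/dev/dsp").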
How would I determine what audio system I'm using? It's a fairly vanilla Ubuntu 10.04 install, so I'm using whatever the default is. Looking at my packages, I have pulseaudio installed, but I also have gnome-alsamixer, alsa-base and alsa-oss.
The Hardware tab in Gnome's Sound Preferences dialog says I'm using "Digital Stereo (IEC958) Output + Analog Stereo Input".
Also, aplay gives me:
~$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: NVidia [HDA NVidia], device 0: Cirrus Analog [Cirrus Analog]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: NVidia [HDA NVidia], device 1: Cirrus Digital [Cirrus Digital]
Subdevices: 1/1
Subdevice #0: subdevice #0
And yes, I believe I'm using pulseaudio, because that's the default setup in Ubuntu.
Regards,
Chri
--- (Edited on 9/30/2011 1:40 pm [GMT-0500] by Cerin) ---
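One rough way to answer "what audio system am I using" is to look for the usual pieces on the machine. The sketch below only checks whether the OSS device node exists and whether a few well-known tools are on the PATH. It does not prove which system an application actually ends up using, and the function name is hypothetical. Note also that "aplay -l" lists playback devices only; "arecord -l" is the counterpart for capture devices such as the mic.

```python
import os
import shutil


def probe_audio_setup():
    """Report which common Linux audio pieces appear to be present."""
    return {
        # OSS-style device node that Julius tried to open in the log above.
        "oss_device_present": os.path.exists("/dev/dsp"),
        # PulseAudio daemon binary.
        "pulseaudio_installed": shutil.which("pulseaudio") is not None,
        # PulseAudio's OSS wrapper utility.
        "padsp_available": shutil.which("padsp") is not None,
        # ALSA command-line recorder.
        "arecord_available": shutil.which("arecord") is not None,
    }
```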
OK we're making good progress here.
Take the device first. It looks like you only have the one device, card 0 with 2 subdevices, let's take it that is your speakers and the mike, which is most probably a regular wired headset plugged into your main card on the motherboard. This perhaps eliminates the possibility of a /dev/dsp1 or 2 or 3 as might be the case with a USB or bluetooth device. /dev/dsp is likely not the issue.
This allows us to focus on the sound library. The ideal solution is to use an installation of Julius that is aware of your ALSA. Your Julius version is 4.1.2; I think there is a later precompiled version, 4.2, which has ALSA as a default and handles pulseaudio a lot better.
http://julius.sourceforge.jp/en_index.php
One thing to try before you go this route is to prefix your call to julius with padsp which is a pulseaudio utility.
$ padsp julius ...
If that fails, take a look at Julius 4.2. If you compile your own, you will be able to check that Julius can see your ALSA libraries.
--- (Edited on 9/30/2011 3:04 pm [GMT-0500] by colbec) ---
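For anyone driving Julius from a script rather than the shell, the padsp prefix can be applied the same way. A minimal sketch follows. The function names are hypothetical, and the "sentence1:" prefix is an assumption about how Julius marks its final recognition result on standard output, since no result line appears in the logs in this thread:

```python
import subprocess


def padsp_command(julius_args):
    """Build the argument list for running julius under the padsp wrapper."""
    return ["padsp", "julius"] + list(julius_args)


def recognized_sentences(lines):
    """Yield the text after 'sentence1:' for each result line.

    The 'sentence1:' prefix is an assumption, see the note above.
    """
    prefix = "sentence1:"
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            yield line[len(prefix):].strip()


def listen(julius_args):
    """Run julius under padsp and print each recognized sentence."""
    proc = subprocess.Popen(
        padsp_command(julius_args),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    for sentence in recognized_sentences(proc.stdout):
        print(sentence)
```

Something like listen(["-input", "mic", "-C", "julian.jconf"]) would then print results as they arrive, provided padsp and julius are both installed.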
Thanks, using padsp managed to get the example controlapp working for me.
The remaining problem I have is with transcribing files. I think the main issue is that Julius is unable to process arbitrary wav files, and requires that a wav file have a specific sampling rate. For example, I ran:
julius -input file -filelist filelist.txt -C julian.jconf
and got the error:
Error: adin_file: sampling rate != 16000 (32000)
Error: adin_file: error in parsing wav header at hello.wav
Error: adin_file: failed to read speech data: "hello.wav"
So I used sox to resample my hello.wav to 16000.
sox hello.wav -r 16000 hello.16.wav resample
Then, I updated my filelist.txt to only contain "hello.16.wav".
However, now when I run julius, I get:
Error: adin_file: bytes per second != 32000 (16000)
Error: adin_file: error in parsing wav header at hello.16.wav
Error: adin_file: failed to read speech data: "hello.16.wav"
I'm really confused by this. Aren't "sample rate" and "bytes per second" the same thing? Googling doesn't really clarify, as some sites use them interchangeably, while others use slightly different terms (e.g. bits per second, or bit depth).
How can I set sample rate to 32000 but also set bytes per second to 16000?
Regards,
Chri
--- (Edited on 9/30/2011 9:32 pm [GMT-0500] by Cerin) ---
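On the "sample rate" versus "bytes per second" question: they are not the same thing. The sample rate counts samples per second per channel, while the byte rate stored in a WAV header is the sample rate multiplied by the number of channels and by the bytes per sample. A small sketch of that arithmetic (the function name is hypothetical):

```python
def bytes_per_second(sample_rate, channels, bits_per_sample):
    """Byte rate of uncompressed PCM audio, as stored in a WAV header."""
    return sample_rate * channels * (bits_per_sample // 8)


# 16 kHz, mono, 16-bit samples
print(bytes_per_second(16000, 1, 16))  # 32000
# 16 kHz, mono, 8-bit samples
print(bytes_per_second(16000, 1, 8))   # 16000
```

Read this way, "bytes per second != 32000 (16000)" would be consistent with Julius expecting 16-bit mono audio at 16000 Hz and finding a file with 8-bit samples. That is an inference from the numbers, not something the error message itself states.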
The julian.jconf file sets up a number of parameters which are important for files, have you checked these for compatibility with your sound files?
--- (Edited on 10/1/2011 7:19 am [GMT-0500] by colbec) ---
Yes, I checked my jconf file, but there's no mention of a "bytes-per-second" parameter. I see the smpFreq parameter, and it's 16000, and I've resampled to 16000. I don't know where it's getting 32000.
These are the only uncommented options.
-dfa mediaplayer.dfa
-v mediaplayer.dict
-h /usr/share/julius-voxforge/acoustic/hmmdefs
-hlist /usr/share/julius-voxforge/acoustic/tiedlist
-penalty1 5.0 # first pass
-penalty2 20.0 # second pass
-iwcd1 max # assign maximum likelihood of the same context
-gprune safe # safe pruning, accurate but slow
-b2 200 # beam width on 2nd pass (#words)
-sb 200.0 # score beam envelope threshold
-spmodel "sp" # HMM model name
-iwsp # append a skippable sp model at all word ends
-iwsppenalty -70.0 # transition penalty for the appended sp models
-smpFreq 16000 # sampling rate (Hz)
--- (Edited on 10/1/2011 9:18 am [GMT-0500] by Cerin) ---
I think I found the sox call to do the bit rate and format conversion.
sox hello.wav -r 16000 -b 32 -c 1 hello.s32
Julius reports <search failed>, but I'm guessing that's just because it can't recognize this specific pronunciation of "hello".
--- (Edited on 10/1/2011 9:40 am [GMT-0500] by Cerin) ---
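Note that "-b 32" asks sox for 32-bit samples, whereas the earlier "bytes per second != 32000" error points towards 16-bit samples at 16000 Hz mono (16000 x 2 bytes x 1 channel). Whether that explains the <search failed> result is not certain. Below is a sketch of building a conversion command with 16-bit output, assuming a sox version where -b takes a bit count, as the command above implies. The function names and the output filename are hypothetical:

```python
import subprocess


def sox_to_julius_wav(src, dst, rate=16000):
    """Build a sox command for mono, 16-bit output at the given sample rate.

    Options placed before the output filename apply to the output file.
    The 16-bit target is an assumption inferred from the Julius errors.
    """
    return ["sox", src, "-r", str(rate), "-b", "16", "-c", "1", dst]


def convert(src, dst, rate=16000):
    """Run the conversion; raises CalledProcessError if sox fails."""
    subprocess.check_call(sox_to_julius_wav(src, dst, rate))
```

For example, convert("hello.wav", "hello.16k.wav") would write a WAV file that can then be listed in filelist.txt.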
I don't have much experience feeding wav files to julius, but I did try with one of my prompt recordings and it processed fine. The only changes were to comment out my input mic in julian.jconf, specify -input rawfile -filelist filelist (where filelist is a text file containing the path and name of a file to be processed) in the julius command and it worked correctly. My smpFreq in this file and in jconf is 48000 but other frequencies will work.
The kind of error messages you are getting seem to indicate perhaps that your list of files is pointing at different files than you have been editing?
--- (Edited on 10/1/2011 9:54 am [GMT-0500] by colbec) ---