VoxForge
I built a.arpa with CMUSLM2.0, using <s> and </s> as context cues.
Why are <s> and </s> predicted in my a.arpa file? If a.text contains no <s> and </s> markers, the result is b.arpa, and if the vocabulary (b.vocab) also omits <s> and </s>, the result is c.arpa. What should I do?
These are my files:
a.text
<s> OPEN BROWSER </s>
<s> NEW E-MAIL </s>
<s> FORWARD </s>
<s> BACKWARD </s>
<s> NEXT WINDOW </s>
<s> LAST WINDOW </s>
<s> OPEN MUSIC PLAYER </s>
a.vocab
## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 13 words ##
</s>
<s>
BACKWARD
BROWSER
E-MAIL
FORWARD
LAST
MUSIC
NEW
NEXT
OPEN
PLAYER
WINDOW
a.arpa
#############################################################################
## Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
## Ronald Rosenfeld and Philip Clarkson
## Version 3, Copyright (c) 2006, Carnegie Mellon University
## Contributors includes Wen Xu, Ananlada Chotimongkol,
## David Huggins-Daines, Arthur Chan and Alan Black
#############################################################################
=============================================================================
=============== This file was produced by the CMU-Cambridge ===============
=============== Statistical Language Modeling Toolkit ===============
=============================================================================
This is a 3-gram language model, based on a vocabulary of 13 words,
which begins "</s>", "<s>", "BACKWARD"...
This is an OPEN-vocabulary model (type 1)
(OOVs were mapped to UNK, which is treated as any other vocabulary word)
Witten Bell discounting was applied.
This file is in the ARPA-standard format introduced by Doug Paul.
p(wd3|wd1,wd2)= if(trigram exists) p_3(wd1,wd2,wd3)
else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
else p(wd3|w2)
p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
else bo_wt_1(wd1)*p_1(wd2)
All probs and back-off weights (bo_wt) are given in log10 form.
Data formats:
Beginning of data mark: \data\
ngram 1=nr # number of 1-grams
ngram 2=nr # number of 2-grams
ngram 3=nr # number of 3-grams
\1-grams:
p_1 wd_1 bo_wt_1
\2-grams:
p_2 wd_1 wd_2 bo_wt_2
\3-grams:
p_3 wd_1 wd_2 wd_3
end of data mark: \end\
\data\
ngram 1=14
ngram 2=18
ngram 3=24
\1-grams:
-1.1461 <UNK> 0.0000
-98.8037 </s> -0.8451
-98.8037 <s> -0.0348
-1.1461 BACKWARD -0.3010
-1.1461 BROWSER -0.3010
-1.1461 E-MAIL -0.3010
-1.1461 FORWARD -0.3010
-1.1461 LAST -0.2341
-1.1461 MUSIC -0.2688
-1.1461 NEW -0.2688
-1.1461 NEXT -0.2341
-0.8451 OPEN -0.2341
-1.1461 PLAYER 0.0000
-0.8451 WINDOW -0.4771
\2-grams:
-0.0669 </s> <s> 0.0348
-1.1139 <s> BACKWARD 0.0000
-1.1139 <s> FORWARD 0.0000
-1.1139 <s> LAST 0.0000
-1.1139 <s> NEW 0.0000
-1.1139 <s> NEXT 0.0000
-0.8129 <s> OPEN 0.0000
-0.3010 BACKWARD </s> 0.5441
-0.3010 BROWSER </s> 0.5441
-0.3010 E-MAIL </s> 0.5441
-0.3010 FORWARD </s> 0.5441
-0.3010 LAST WINDOW 0.1761
-0.3010 MUSIC PLAYER -0.3010
-0.3010 NEW E-MAIL 0.0000
-0.3010 NEXT WINDOW 0.1761
-0.6021 OPEN BROWSER 0.0000
-0.6021 OPEN MUSIC 0.0000
-0.1761 WINDOW </s> 0.3680
\3-grams:
-1.0792 </s> <s> BACKWARD
-1.0792 </s> <s> FORWARD
-1.0792 </s> <s> LAST
-1.0792 </s> <s> NEW
-1.0792 </s> <s> NEXT
-1.0792 </s> <s> OPEN
-0.3010 <s> BACKWARD </s>
-0.3010 <s> FORWARD </s>
-0.3010 <s> LAST WINDOW
-0.3010 <s> NEW E-MAIL
-0.3010 <s> NEXT WINDOW
-0.6021 <s> OPEN BROWSER
-0.6021 <s> OPEN MUSIC
-0.3010 BACKWARD </s> <s>
-0.3010 BROWSER </s> <s>
-0.3010 E-MAIL </s> <s>
-0.3010 FORWARD </s> <s>
-0.3010 LAST WINDOW </s>
-0.3010 MUSIC PLAYER </s>
-0.3010 NEW E-MAIL </s>
-0.3010 NEXT WINDOW </s>
-0.3010 OPEN BROWSER </s>
-0.3010 OPEN MUSIC PLAYER
-0.1761 WINDOW </s> <s>
\end\
b.arpa
....
\data\
ngram 1=14
ngram 2=11
ngram 3=11
\1-grams:
-1.1461 <UNK> 0.0000
-98.8451 </s> 0.0000
-98.8451 <s> 0.0000
-1.1461 BACKWARD -0.2688
-1.1461 BROWSER -0.2688
-1.1461 E-MAIL -0.2688
-1.1461 FORWARD -0.2688
-1.1461 LAST -0.2341
-1.1461 MUSIC 0.0000
-1.1461 NEW -0.2688
-1.1461 NEXT -0.2341
-0.8451 OPEN -0.2341
-1.1461 PLAYER 0.0000
-0.8451 WINDOW -0.1963
\2-grams:
-0.3010 BACKWARD NEXT 0.0000
-0.3010 BROWSER NEW 0.0000
-0.3010 E-MAIL FORWARD 0.0000
-0.3010 FORWARD BACKWARD 0.0000
-0.3010 LAST WINDOW -0.1761
-0.3010 NEW E-MAIL 0.0000
-0.3010 NEXT WINDOW -0.1761
-0.6021 OPEN BROWSER 0.0000
-0.6021 OPEN MUSIC -0.2688
-0.6021 WINDOW LAST 0.0000
-0.6021 WINDOW OPEN -0.1761
\3-grams:
-0.3010 BACKWARD NEXT WINDOW
-0.3010 BROWSER NEW E-MAIL
-0.3010 E-MAIL FORWARD BACKWARD
-0.3010 FORWARD BACKWARD NEXT
-0.3010 LAST WINDOW OPEN
-0.3010 NEW E-MAIL FORWARD
-0.3010 NEXT WINDOW LAST
-0.3010 OPEN BROWSER NEW
-0.3010 OPEN MUSIC PLAYER
-0.3010 WINDOW LAST WINDOW
-0.3010 WINDOW OPEN MUSIC
\end\
c.arpa
...
\1-grams:
-0.2840 <UNK> 0.2382
-1.4491 BACKWARD 0.3188
-1.4491 BROWSER 0.3188
-1.4491 E-MAIL 0.3188
-1.4491 FORWARD 0.3188
-1.4491 LAST 0.0362
-1.4491 MUSIC 0.0157
-1.4491 NEW 0.0157
-1.4491 NEXT 0.0362
-1.0969 OPEN 0.0320
-1.4491 PLAYER 0.0000
-1.0969 WINDOW -0.2833
\2-grams:
-0.3358 <UNK> <UNK> 0.0726
-99.9990 <UNK> BACKWARD 0.0000
-99.9990 <UNK> FORWARD 0.0000
-99.9990 <UNK> LAST 0.0000
-99.9990 <UNK> NEW 0.0000
-99.9990 <UNK> NEXT 0.0000
-0.8129 <UNK> OPEN 0.0000
-99.9990 BACKWARD <UNK> 0.2688
-99.9990 BROWSER <UNK> 0.2688
-99.9990 E-MAIL <UNK> 0.2688
-99.9990 FORWARD <UNK> 0.2688
-99.9990 LAST WINDOW 0.6021
-99.9990 MUSIC PLAYER 0.3188
-99.9990 NEW E-MAIL 0.0000
-99.9990 NEXT WINDOW 0.6021
-99.9990 OPEN BROWSER 0.0000
-99.9990 OPEN MUSIC 0.0000
-0.1249 WINDOW <UNK> -0.2083
\3-grams:
-99.9990 <UNK> <UNK> BACKWARD
-99.9990 <UNK> <UNK> FORWARD
-99.9990 <UNK> <UNK> LAST
-99.9990 <UNK> <UNK> NEW
-99.9990 <UNK> <UNK> NEXT
-99.9990 <UNK> <UNK> OPEN
-99.9990 <UNK> BACKWARD <UNK>
-99.9990 <UNK> FORWARD <UNK>
-99.9990 <UNK> LAST WINDOW
-99.9990 <UNK> NEW E-MAIL
-99.9990 <UNK> NEXT WINDOW
-99.9990 <UNK> OPEN BROWSER
-99.9990 <UNK> OPEN MUSIC
-99.9990 BACKWARD <UNK> <UNK>
-99.9990 BROWSER <UNK> <UNK>
-99.9990 E-MAIL <UNK> <UNK>
-99.9990 FORWARD <UNK> <UNK>
-99.9990 LAST WINDOW <UNK>
-99.9990 MUSIC PLAYER <UNK>
-99.9990 NEW E-MAIL <UNK>
-99.9990 NEXT WINDOW <UNK>
-99.9990 OPEN BROWSER <UNK>
-99.9990 OPEN MUSIC PLAYER
-0.1761 WINDOW <UNK> <UNK>
\end\
--- (Edited on 8/19/2008 7:21 am [GMT-0500] by chn) ---
--- (Edited on 8/19/2008 7:31 am [GMT-0500] by chn) ---
--- (Edited on 8/19/2008 7:50 am [GMT-0500] by chn) ---
Your models are not correct, because the software you are using is too old and broken. I suggest you try lmtool:
http://www.speech.cs.cmu.edu/tools/lmtool.html
You can just upload your text there and get a proper model and a dictionary.
--- (Edited on 8/19/2008 4:09 pm [GMT-0500] by nsh) ---
I used cmucmslmtk V3, which is a new version, and I need to use this tool to build my LM.
Do you think this tool is broken?
--- (Edited on 8/19/2008 9:02 pm [GMT-0500] by chn) ---
> I used cmucmslmtk V3, which is a new version, and I need to use this tool to build my LM.
It's not the latest version. The latest one is in trunk, or available as a nightly build on the cmusphinx SourceForge site.
> Do you think this tool is broken?
The tool itself is not broken, but you are using it incorrectly: the n-grams in your result contain unknowns and silences. Compare them with the proper result from the online lmtool.
It's possible to get the same result with the offline cmuclmtk, but since you haven't provided details on what you are doing, it's hard to help you.
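One quick way to see the problem described here is to count the n-gram lines in an ARPA file that cross a sentence boundary ("</s> <s>") or contain <UNK>. The following is only a rough sketch: count_suspect_ngrams is a hypothetical helper name, not part of cmuclmtk, and the sample lines are copied from the a.arpa posted above.

```shell
# Count lines that contain "</s> <s>" or "<UNK>" in an ARPA file.
# count_suspect_ngrams is a hypothetical helper, not a cmuclmtk tool.
count_suspect_ngrams() {
  # -F: fixed strings, -c: print the number of matching lines
  grep -c -F -e '</s> <s>' -e '<UNK>' "$1"
}

# A few lines copied from the a.arpa above, written to a temporary file.
arpa=$(mktemp)
cat > "$arpa" <<'EOF'
\2-grams:
-0.0669 </s> <s> 0.0348
-1.1139 <s> BACKWARD 0.0000
-0.3010 BACKWARD </s> 0.5441
\3-grams:
-1.0792 </s> <s> BACKWARD
-0.3010 <s> BACKWARD </s>
-0.3010 BACKWARD </s> <s>
EOF

count_suspect_ngrams "$arpa"   # prints 3
rm -f "$arpa"
```

A non-zero count suggests that the training text was treated as one continuous stream, so sentence-end and sentence-start markers ended up as neighbours.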
--- (Edited on 8/20/2008 1:22 am [GMT-0500] by nsh) ---
Sorry, I just want to use this tool to build the LM, and I found that what I got is not the same as the output of the online lmtool.
This is what I have done:
1. I prepared a.text:
OPEN BROWSER
NEW EMAIL
FORWARD
BACKWARD
NEXT WINDOW
LAST WINDOW
OPEN MUSIC PLAYER
and a.ccs: <s>
2. I generated a.sent.text with `sed`. a.sent.text:
<s> OPEN BROWSER </s>
<s> NEW EMAIL </s>
<s> FORWARD </s>
<s> BACKWARD </s>
<s> NEXT WINDOW </s>
<s> LAST WINDOW </s>
<s> OPEN MUSIC PLAYER </s>
3. I generated a.wfreq and a.vocab with text2wfreq and wfreq2vocab.
a.vocab:
## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 13 words ##
</s>
<s>
BACKWARD
BROWSER
EMAIL
FORWARD
LAST
MUSIC
NEW
NEXT
OPEN
PLAYER
WINDOW
4. I generated a.idngram with text2idngram.
5. I generated a.arpa with idngram2lm:
#############################################################################
## Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
## Ronald Rosenfeld and Philip Clarkson
## Version 3, Copyright (c) 2006, Carnegie Mellon University
## Contributors includes Wen Xu, Ananlada Chotimongkol,
## David Huggins-Daines, Arthur Chan and Alan Black
#############################################################################
=============================================================================
=============== This file was produced by the CMU-Cambridge ===============
=============== Statistical Language Modeling Toolkit ===============
=============================================================================
This is a 3-gram language model, based on a vocabulary of 13 words,
which begins "</s>", "<s>", "BACKWARD"...
This is an OPEN-vocabulary model (type 1)
(OOVs were mapped to UNK, which is treated as any other vocabulary word)
Witten Bell discounting was applied.
This file is in the ARPA-standard format introduced by Doug Paul.
p(wd3|wd1,wd2)= if(trigram exists) p_3(wd1,wd2,wd3)
else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
else p(wd3|w2)
p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
else bo_wt_1(wd1)*p_1(wd2)
All probs and back-off weights (bo_wt) are given in log10 form.
Data formats:
Beginning of data mark: \data\
ngram 1=nr # number of 1-grams
ngram 2=nr # number of 2-grams
ngram 3=nr # number of 3-grams
\1-grams:
p_1 wd_1 bo_wt_1
\2-grams:
p_2 wd_1 wd_2 bo_wt_2
\3-grams:
p_3 wd_1 wd_2 wd_3
end of data mark: \end\
\data\
ngram 1=14
ngram 2=18
ngram 3=24
\1-grams:
-1.3010 <UNK> 0.0000
-0.5229 </s> -0.8451
-98.8386 <s> -0.1487
-1.3010 BACKWARD -0.1461
-1.3010 BROWSER -0.1461
-1.3010 EMAIL -0.1461
-1.3010 FORWARD -0.1461
-1.3010 LAST -0.2553
-1.3010 MUSIC -0.2788
-1.3010 NEW -0.2788
-1.3010 NEXT -0.2553
-1.0000 OPEN -0.2553
-1.3010 PLAYER 0.0000
-1.0000 WINDOW -0.3222
\2-grams:
-0.0669 </s> <s> 0.0348
-1.1139 <s> BACKWARD 0.0000
-1.1139 <s> FORWARD 0.0000
-1.1139 <s> LAST 0.0000
-1.1139 <s> NEW 0.0000
-1.1139 <s> NEXT 0.0000
-0.8129 <s> OPEN 0.0000
-0.3010 BACKWARD </s> 0.5441
-0.3010 BROWSER </s> 0.5441
-0.3010 EMAIL </s> 0.5441
-0.3010 FORWARD </s> 0.5441
-0.3010 LAST WINDOW 0.1761
-0.3010 MUSIC PLAYER -0.1461
-0.3010 NEW EMAIL 0.0000
-0.3010 NEXT WINDOW 0.1761
-0.6021 OPEN BROWSER 0.0000
-0.6021 OPEN MUSIC 0.0000
-0.1761 WINDOW </s> 0.3680
\3-grams:
-1.0792 </s> <s> BACKWARD
-1.0792 </s> <s> FORWARD
-1.0792 </s> <s> LAST
-1.0792 </s> <s> NEW
-1.0792 </s> <s> NEXT
-1.0792 </s> <s> OPEN
-0.3010 <s> BACKWARD </s>
-0.3010 <s> FORWARD </s>
-0.3010 <s> LAST WINDOW
-0.3010 <s> NEW EMAIL
-0.3010 <s> NEXT WINDOW
-0.6021 <s> OPEN BROWSER
-0.6021 <s> OPEN MUSIC
-0.3010 BACKWARD </s> <s>
-0.3010 BROWSER </s> <s>
-0.3010 EMAIL </s> <s>
-0.3010 FORWARD </s> <s>
-0.3010 LAST WINDOW </s>
-0.3010 MUSIC PLAYER </s>
-0.3010 NEW EMAIL </s>
-0.3010 NEXT WINDOW </s>
-0.3010 OPEN BROWSER </s>
-0.3010 OPEN MUSIC PLAYER
-0.1761 WINDOW </s> <s>
\end\
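The exact sed command used in step 2 is not given in the post. A minimal sketch of how the sentence markers could be added is shown below; add_markers is a hypothetical helper name, and the input lines are taken from a.text.

```shell
# Wrap every input line in sentence markers: "<s> ... </s>".
# add_markers is a hypothetical helper; the poster's actual sed command is unknown.
add_markers() {
  # First expression prepends "<s> ", second appends " </s>".
  sed -e 's|^|<s> |' -e 's|$| </s>|'
}

printf '%s\n' 'OPEN BROWSER' 'NEW EMAIL' 'FORWARD' | add_markers
# prints:
# <s> OPEN BROWSER </s>
# <s> NEW EMAIL </s>
# <s> FORWARD </s>
```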
--- (Edited on 8/21/2008 3:40 am [GMT-0500] by chn) ---
Excuse me for being so irritable. It's mostly OK. A few things are needed to correct the LM:
1. The online lmtool uses QuickLM, which counts 3-grams differently. It is available here:
http://www.speech.cs.cmu.edu/tools/download/quick_lm.pl
You can also use it. Alternatively, you can just filter the wngrams: don't generate idngrams directly, but generate wngrams first and then strip them with grep:
./text2wngram < input.txt | grep -v "</s> <s>" > input.wngram
The rest works as usual:
./wngram2idngram -vocab input.vocab < input.wngram > input.idngram
./idngram2lm -vocab_type 0 -idngram input.idngram -vocab input.vocab -arpa input.arpa
Please note that you need the latest trunk to make it work more or less properly.
2. Don't generate an open-vocabulary model. Sphinx ignores unknowns, so pass -vocab_type 0 to idngram2lm.
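To illustrate what the grep filter in point 1 does, here is a small sketch that runs the same pattern over a few word n-gram lines. The sample lines are made up in the "words followed by a count" layout shown elsewhere in this thread, and drop_cross_sentence is a hypothetical wrapper name, not a cmuclmtk command.

```shell
# Drop word n-grams that span a sentence boundary ("</s> <s>").
# drop_cross_sentence is a hypothetical wrapper around the grep from point 1.
drop_cross_sentence() {
  grep -v '</s> <s>'
}

# Sample 3-gram lines: three words, then a count (illustrative input only).
printf '%s\n' \
  '<s> FORWARD </s> 1' \
  'FORWARD </s> <s> 1' \
  '</s> <s> BACKWARD 1' \
  '<s> BACKWARD </s> 1' | drop_cross_sentence
# prints:
# <s> FORWARD </s> 1
# <s> BACKWARD </s> 1
```

Only n-grams that stay inside a single sentence survive, which is why the filtered model no longer predicts <s> after </s>.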
--- (Edited on 8/21/2008 5:05 pm [GMT-0500] by nsh) ---
Thank you!
I generated a.wngram as you suggested:
<s> NEW EMAIL 1
<s> NEXT WINDOW 1
<s> OPEN BROWSER 1
<s> OPEN MUSIC 1
LAST WINDOW </s> 1
MUSIC PLAYER </s> 1
NEW EMAIL </s> 1
NEXT WINDOW </s> 1
OPEN BROWSER </s> 1
OPEN MUSIC PLAYER 1
When I generated a.idngram (with -write_ascii), I think something went wrong in the output:
40000000 40000000 40000000 1141473616
Why?
--- (Edited on 8/22/2008 7:59 pm [GMT-0500] by chn) ---
--- (Edited on 8/22/2008 8:02 pm [GMT-0500] by chn) ---
Yes, there was a bug that I fixed just yesterday. Please check out cmuclmtk from trunk and rebuild it.
--- (Edited on 8/22/2008 10:09 pm [GMT-0500] by nsh) ---