cmusphinx / cmudict
CMU US English Dictionary
License: Other
Here is line 33 of cmudict.vp:
...ellipsis IH0 L IH1 P S IH0 S
All other lines begin with a single punctuation character. Having 3 periods here presents a parsing problem.
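The parsing problem can be sidestepped by treating the leading punctuation as a run of one or more characters rather than exactly one. The following is only a minimal sketch: `parse_vp_line` is a hypothetical helper, and it assumes each line is a punctuation-prefixed name followed by whitespace-separated phonemes, as in the line quoted above.

```python
import re

# One or more leading punctuation characters, then the rest of the first token.
LEADING_PUNCT = re.compile(r"^([^\w\s]+)(.*)$")

def parse_vp_line(line):
    """Split a cmudict.vp-style line into (symbol, name, phonemes).

    Returns None for blank lines or lines whose first token does not
    start with punctuation. The line layout is an assumption.
    """
    tokens = line.split()
    if not tokens:
        return None
    match = LEADING_PUNCT.match(tokens[0])
    if match is None:
        return None
    return match.group(1), match.group(2), tokens[1:]

print(parse_vp_line("...ellipsis IH0 L IH1 P S IH0 S"))
```

With this approach the three periods come back as a single symbol, so a one-character prefix and a three-character prefix go through the same code path.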
As a non-native English speaker, I am not sure whether the first pronunciation of the word "natural" in the CMU dictionary is valid, because the /r/ sound is completely missing.
natural N AE1 CH ER0 AH0 L
natural(2) N AE1 CH R AH0 L
Could anyone tell me whether it's a bug or it's valid?
Thanks.
As of July 2022, is this repository still being actively (or at least somewhat regularly) maintained? If so, what is the preferred avenue for contributing corrections/additions? Should we follow the usual git workflow of fork->push->pull request? Or should changes be emailed directly to someone as in the past? Thank you.
I can't be sure of anything when it comes to phonetics, as I'm no expert, but we've found it a bit peculiar that some words starting with engineer have an inconsistent first value:
engineer EH1 N JH AH0 N IH1 R
engineer's EH2 N JH AH0 N IY1 R Z
engineered EH2 N JH AH0 N IY1 R D
engineering EH1 N JH AH0 N IH1 R IH0 NG
engineers EH1 N JH AH0 N IH1 R Z
engineers' EH1 N JH AH0 N IH1 R Z
Could it be that EH2 is the more appropriate choice for all of these first phonemes? Even with the secondary stress on the first phoneme, does having multiple 1-phonemes mean we sometimes have to treat words as having more than one primary stress?
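To see which of the entries above actually carry more than one primary stress, one can count the phonemes ending in 1. This is a small sketch over four of the entries quoted in this issue; `primary_stress_count` is a hypothetical helper name, not part of any cmudict tooling.

```python
def primary_stress_count(phonemes):
    """Count phonemes marked with primary stress (a trailing '1')."""
    return sum(1 for p in phonemes.split() if p.endswith("1"))

# Entries copied from the listing above.
entries = {
    "engineer": "EH1 N JH AH0 N IH1 R",
    "engineer's": "EH2 N JH AH0 N IY1 R Z",
    "engineered": "EH2 N JH AH0 N IY1 R D",
    "engineering": "EH1 N JH AH0 N IH1 R IH0 NG",
}

multi = [word for word, phones in entries.items() if primary_stress_count(phones) > 1]
print(multi)
```

On this sample the entries starting with EH1 are the ones flagged, while the EH2 entries have a single primary stress.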
Thanks for your work. English is not my mother tongue, but I am learning to do English text-to-speech training. I want to use cmudict.syllable as a symbol table to encode English phonemes. As far as I know, the digits at the end of cmudict symbols are stress labels, so AA0 means no stress. Do I then need both the symbols AA and AA0 in the symbol table when encoding English phonemes?
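One way to look at the relationship between AA and AA0 is to split each symbol into a base phoneme and an optional stress digit. The sketch below is an illustration only: `split_stress` is a hypothetical helper, and whether a given training pipeline wants the stressed symbols, the bare ones, or both is a design decision this snippet does not settle.

```python
def split_stress(symbol):
    """Split a symbol such as 'AA0' into ('AA', 0); consonants give (symbol, None)."""
    if symbol and symbol[-1].isdigit():
        return symbol[:-1], int(symbol[-1])
    return symbol, None

# Pronunciation of "natural" quoted elsewhere on this page.
pron = "N AE1 CH ER0 AH0 L".split()

with_stress = sorted(set(pron))                              # symbols as written in the entry
without_stress = sorted({split_stress(p)[0] for p in pron})  # stress digits removed

print(with_stress)
print(without_stress)
```

Keeping the digits yields a larger symbol table that preserves stress; dropping them yields a smaller one that discards it.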
I am looking for a simple procedure to generate an FST (finite state transducer) from cmudict-0.7b or cmudict-0.7b.dict, for use with phonetisaurus.
I tried the following set of commands (phonetisaurus aligner, Google NGramLibrary, and phonetisaurus arpa2wfst) and was able to generate an FST, but it didn't work. I am not sure where I made a mistake or missed a step. I suspect the very first command, i.e. phonetisaurus-align, is not correct.
phonetisaurus-align --input=cmudict.dict --ofile=cmudict/cmudict.corpus --seq1_del=false
ngramsymbols < cmudict/cmudict.corpus > cmudict/cmudict.syms
/usr/local/bin/farcompilestrings --symbols=cmudict/cmudict.syms --keep_symbols=1 cmudict/cmudict.corpus > cmudict/cmudict.far
ngramcount --order=8 cmudict/cmudict.far > cmudict/cmudict.cnts
ngrammake --v=2 --bins=3 --method=kneser_ney cmudict/cmudict.cnts > cmudict/cmudict.mod
ngramprint --ARPA cmudict/cmudict.mod > cmudict/cmudict.arpa
phonetisaurus-arpa2wfst-omega --lm=cmudict/cmudict.arpa > cmudict/cmudict.fst
I tried the FST with phonetisaurus-g2p as follows:
phonetisaurus-g2p --model=cmudict/cmudict.fst --nbest=3 --input=HELLO --words
But it didn't return anything. I would appreciate any help on this matter.
Following the discovery by @benkasminbullock of an inconsistency with borrowers' (#17) and my cleanup of duplicates (#18), here are the resulting non-duplicate entries that show inconsistencies. Each of these entries should be investigated individually to work out why the phonemes differ between the two spellings. Each pair in this list may need a stress correction.
administrators AE0 D M IH1 N IH0 S T R EY2 T ER0 Z
administrators' AE0 D M IH1 N AH0 S T R EY2 T ER0 Z
advertisers AE1 D V ER0 T AY2 Z ER0 Z
advertisers' AE1 D V ER2 T AY2 Z ER0 Z
advisers AE0 D V AY1 Z ER0 Z
advisers' AE2 D V AY1 Z ER0 Z
affiliates AH0 F IH1 L IY0 AH0 T S
affiliates(2) AH0 F IH1 L IY0 EY2 T S
affiliates' AH0 F IH1 L IY0 IH0 T S
americas AH0 M EH1 R AH0 K AH0 Z
americas(2) AH0 M EH1 R IH0 K AH0 Z
americas' AH0 M EH1 R IH0 K AH2 Z
applicants AE1 P L IH0 K AH0 N T S
applicants' AE1 P L AH0 K AH0 N T S
arbitrators AA1 R B IH0 T R EY2 T ER0 Z
arbitrators' AA1 R B AH0 T R EY2 T ER0 Z
assets AE1 S EH2 T S
assets' AE1 S EH0 T S
auditors AA1 D AH0 T ER0 Z
auditors' AO1 D IH0 T ER0 Z
authorities AH0 TH AO1 R AH0 T IY0 Z
authorities' AH0 TH AO1 R IH0 T IY0 Z
bancshares B AE1 NG K SH EH0 R Z
bancshares' B AE0 NG K SH EH1 R Z
bein B IY1 N
bein' B IY1 IH0 N
belzbergs B EH1 L T S B ER0 G Z
belzbergs' B EH1 L Z B ER0 G Z
bishops B IH1 SH AH0 P S
bishops' B IH1 SH AA0 P S
businesses B IH1 Z N AH0 S AH0 Z
businesses(2) B IH1 Z N IH0 S IH0 Z
businesses' B IH1 Z N EH2 S IH0 Z
cardinals K AA1 R D AH0 N AH0 L Z
cardinals' K AA1 R D IH0 N AH0 L Z
characters K AE1 R AH0 K T ER0 Z
characters(2) K EH1 R AH0 K T ER0 Z
characters' CH EH1 R AH0 K T ER0 Z
chemicals K EH1 M IH0 K AH0 L Z
chemicals' CH EH1 M AH0 K AH0 L Z
communists K AA1 M Y AH0 N AH0 S T S
communists' K AA1 M Y UW0 N IH0 S T S
consultants K AH0 N S AH1 L T AH0 N T S
consultants' K AH0 N S AH1 L T AH2 N T S
contractors K AA1 N T R AE2 K T ER0 Z
contractors' K AH0 N T R AE1 K T ER0 Z
controllers K AH0 N T R OW1 L ER0 Z
controllers' K AH0 N T R AA1 L ER0 Z
controls K AH0 N T R OW1 L Z
controls' K AA1 N T R AA0 L Z
corestates K AO1 R S T EY2 T S
corestates' K AO1 R AH0 S T EY2 T S
currencies K ER1 AH0 N S IY0 Z
currencies' K ER0 EH1 N S IY0 Z
dataproducts D EY1 T AH0 P R AA2 D AH0 K T S
dataproducts' D EY1 T AH0 P R AO2 D AH0 K T S
dataproducts'(2) D AE1 T AH0 P R AO2 D AH0 K T S
debentures D AH0 B EH1 N CH ER0 Z
debentures' D IH0 B EH1 N CH ER0 Z
delegates D EH1 L AH0 G EY2 T S
delegates(2) D EH1 L AH0 G AH0 T S
delegates' D EH2 L AH0 G EY1 T S
depositors D AH0 P AA1 Z IH0 T ER0 Z
depositors' D IH0 P AA1 Z IH0 T ER0 Z
endotronics EH2 N D OW0 T R AA1 N IH0 K S
endotronics' EH2 N D AH0 T R AA1 N IH0 K S
engines EH1 N JH AH0 N Z
engines' EH1 NG G IY2 N Z
environmentalists EH0 N V AY1 R AH0 N M EH2 N T AH0 L IH0 S T S
environmentalists(2) EH0 N V AY1 R AH0 N M EH2 N AH0 L IH0 S T S
environmentalists(3) EH0 N V AY1 R AH0 N M EH2 N T AH0 L IH0 S
environmentalists(4) EH0 N V AY1 R AH0 N M EH2 N AH0 L IH0 S
environmentalists' IH0 N V AY2 R AH0 N M EH1 N T AH0 L IH0 S T S
environmentalists'(2) EH0 N V AY2 R AH0 N M EH1 N AH0 L IH0 S T S
exchanges IH0 K S CH EY1 N JH AH0 Z
exchanges(2) IH0 K S CH EY1 N JH IH0 Z
exchanges' EH0 K S CH EY1 N JH IH0 Z
executives IH0 G Z EH1 K Y AH0 T IH0 V Z
executives' EH0 G Z EH1 K Y AH0 T IH0 V Z
exporters IH0 K S P AO1 R T ER0 Z
exporters' EH2 K S P AO1 R T ER0 Z
fathers F AA1 DH ER0 Z
fathers' F AE1 TH ER0 Z
framers F R EY1 M ER0 Z
framers' F R AE1 M ER0 Z
goin G OY1 N
goin' G OW1 AH0 N
hospitals HH AA1 S P IH2 T AH0 L Z
hospitals' HH AO1 S P IH0 T AH0 L Z
hostages HH AA1 S T AH0 JH AH0 Z
hostages' HH AO1 S T IH0 JH IH0 Z
husbands HH AH1 Z B AH0 N D Z
husbands' HH AH1 S B AH0 N D Z
ickes IH1 K IY0 Z
ickes(2) AY1 K IY0 Z
ickes(3) AY1 K S
ickes' IH1 K AH0 S
immigrants IH1 M AH0 G R AH0 N T S
immigrants' IH1 M IH0 G R AH0 N T S
imports IH2 M P AO1 R T S
imports(2) IH1 M P AO2 R T S
imports' IH1 M P AO0 R T S
individuals IH2 N D AH0 V IH1 JH AH0 W AH0 L Z
individuals' IH2 N D IH0 V IH1 JH AH0 W AH0 L Z
institutes IH1 N S T AH0 T UW2 T S
institutes' IH1 N S T IH0 T UW2 T S
islands AY1 L AH0 N D Z
islands' AY1 S L AH0 N D Z
issuers IH1 SH UW0 ER0 Z
issuers' IH1 S UW0 R Z
issues IH1 SH UW0 Z
issues' IH1 S UW0 Z
jacobs JH EY1 K AH0 B Z
jacobs' JH EY1 K AH2 B Z
jefferies JH EH1 F R IY0 Z
jefferies' JH EH1 F ER0 IY0 Z
ladies L EY1 D IY0 Z
ladies' L EY1 D IY2 Z
lawmakers L AO1 M EY2 K ER0 Z
lawmakers' L AO1 M EY1 K ER0 Z
legislators L EH1 JH AH0 S L EY2 T ER0 Z
legislators' L EH1 JH IH0 S L EY2 T ER0 Z
losers L UW1 Z ER0 Z
losers' L OW1 Z ER0 Z
machines M AH0 SH IY1 N Z
machines' M AH0 CH IY1 N Z
makin M AE1 K IH0 N
makin' M EY1 K IH0 N
manufacturers M AE2 N Y AH0 F AE1 K CH ER0 ER0 Z
manufacturers' M AE2 N AH0 F AE1 K CH ER0 ER0 Z
marketers M AA2 R K AH0 T ER0 Z
marketers' M AA1 R K AH0 T ER0 Z
memories M EH1 M ER0 IY0 Z
memories' M EH1 M ER0 IY2 Z
microsystems M AY1 K R OW2 S IH1 S T AH0 M Z
microsystems' M AY1 K R OW0 S IH2 S T AH0 M Z
months M AH1 N TH S
months' M AA1 N TH S
mothers M AH1 DH ER0 Z
mothers' M AH1 TH ER0 Z
multifoods M AH1 L T IY0 F UW1 D Z
multifoods' M AH1 L T IY0 F UW2 D Z
negotiators N IH0 G OW1 SH IY0 EY2 T ER0 Z
negotiators' N AH0 G OW1 SH IY0 EY2 T ER0 Z
netherlands N EH1 DH ER0 L AH0 N D Z
netherlands' N EH1 TH ER0 L AE0 N D Z
non-smokers N AA0 N S M OW1 K ER0 Z
non-smokers' N AA1 N S M OW1 K ER0 Z
nonsmokers N AA0 N S M OW1 K ER0 Z
nonsmokers' N AA1 N S M OW1 K ER0 Z
nothin N AA1 TH IH0 N
nothin' N AH1 TH IH0 N
palestinians P AE2 L IH0 S T IH1 N IY0 AH0 N Z
palestinians' P AE2 L AH0 S T IH1 N IY0 AH0 N Z
partnerships P AA1 R T N ER0 SH IH2 P S
partnerships' P AA1 R T N ER0 SH IH0 P S
physics F IH1 Z IH0 K S
physics' F IH1 S IH0 K S
predecessors P R EH1 D AH0 S EH2 S ER0 Z
predecessors' P R EH2 D AH0 S EH1 S ER0 Z
products P R AA1 D AH0 K T S
products(2) P R AA1 D AH0 K S
products' P R AO1 D AH0 K T S
products'(2) P R AO1 D AH0 K S
projects P R AA1 JH EH0 K T S
projects(2) P R AH0 JH EH1 K T S
projects(3) P R AA1 JH EH0 K S
projects(4) P R AH0 JH EH1 K S
projects' P R AO1 JH EH0 K T S
projects'(2) P R AO1 JH EH0 K S
properties P R AA1 P ER0 T IY0 Z
properties' P R OW1 P ER0 T IY0 Z
prosecutors P R AA1 S IH0 K Y UW2 T ER0 Z
prosecutors' P R AA1 S AH0 K Y UW0 T ER0 Z
representatives R EH2 P R AH0 Z EH1 N T AH0 T IH0 V Z
representatives(2) R EH2 P R IH0 Z EH1 N T AH0 T IH0 V Z
representatives(3) R EH2 P R AH0 Z EH1 N AH0 T IH0 V Z
representatives(4) R EH2 P R IH0 Z EH1 N AH0 T IH0 V Z
representatives' R EH2 P R AH0 S EH1 N T AH0 T IH0 V Z
representatives'(2) R EH2 P R AH0 S EH1 N AH0 T IH0 V Z
retirees R IY0 T AY1 R IY1 Z
retirees' R IH0 T AY2 R IY1 Z
returns R IH0 T ER1 N Z
returns(2) R IY0 T ER1 N Z
returns' R AH0 T ER1 N Z
rollin R AA1 L IH0 N
rollin' R OW1 L IH0 N
secretaries S EH1 K R AH0 T EH2 R IY0 Z
secretaries' S EH1 K R IH0 T EH2 R IY0 Z
sons S AH1 N Z
sons' S AA1 N Z
speculators S P EH1 K Y AH0 L EY2 T ER0 Z
speculators' S P EH1 K Y AH0 L ER0 T EY2 Z
starin S T AE1 R IH0 N
starin' S T EH1 R IH0 N
steelmakers S T IY1 L M EY2 K ER0 Z
steelmakers' S T IY1 L M AH0 K ER0 Z
steelworkers S T IY1 L W ER2 K ER0 Z
steelworkers' S T IY1 L W ER0 K ER0 Z
subjects S AH1 B JH IH0 K T S
subjects(2) S AH0 B JH EH1 K T S
subjects(3) S AH0 B JH EH1 K S
subjects' S AH1 B JH EH0 K T S
subjects'(2) S AH1 B JH EH0 K S
superpowers S UW2 P ER0 P AW1 ER0 Z
superpowers' S UW1 P ER0 P AW2 R Z
superregionals S UW2 P ER0 R IY1 JH AH0 N AH0 L Z
superregionals' S UW0 P ER0 R IY1 JH AH0 N AH0 L Z
supervisors S UW2 P ER0 V AY1 Z ER0 Z
supervisors' S UW1 P ER0 V AY2 Z ER0 Z
surgeons S ER1 JH AH0 N Z
surgeons' S ER1 JH IH0 N Z
talkin T AA1 K AH0 N
talkin' T AO1 K IH0 N
technologies T EH0 K N AA1 L AH0 JH IY0 Z
technologies' T EH2 K N AA1 L AH0 JH IY0 Z
telesis T EH1 L AH0 S IH0 S
telesis' T EH1 L AH0 S IH2 S
universities Y UW2 N AH0 V ER1 S AH0 T IY0 Z
universities' Y UW2 N IH0 V ER1 S IH0 T IY0 Z
vehicles V IY1 HH IH0 K AH0 L Z
vehicles(2) V IY1 IH0 K AH0 L Z
vehicles' V EH1 HH IH0 K AH0 L Z
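A list like the one above could be regenerated with a short script. The following is a rough sketch, not the method used to produce the list: it assumes one entry per line in the "word phonemes" layout shown here, folds variants such as products(2) into their base word, and flags a word when it shares no pronunciation with its apostrophe form. The names `load` and `mismatched_possessives` are hypothetical.

```python
import re
from collections import defaultdict

VARIANT = re.compile(r"\(\d+\)$")  # trailing "(2)", "(3)", ... on a headword

def load(lines):
    """Map each headword to the set of its pronunciations."""
    prons = defaultdict(set)
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        word = VARIANT.sub("", tokens[0])
        prons[word].add(" ".join(tokens[1:]))
    return prons

def mismatched_possessives(prons):
    """Return (word, word') pairs that have no pronunciation in common."""
    pairs = []
    for word in sorted(prons):
        poss = word + "'"
        if poss in prons and prons[word].isdisjoint(prons[poss]):
            pairs.append((word, poss))
    return pairs

sample = [
    "fathers F AA1 DH ER0 Z",
    "fathers' F AE1 TH ER0 Z",
    "products P R AA1 D AH0 K T S",
    "products(2) P R AA1 D AH0 K S",
    "products' P R AO1 D AH0 K T S",
]
print(mismatched_possessives(load(sample)))
```

Using a "no shared pronunciation" test rather than strict set equality avoids flagging words that merely list an extra variant on one side.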
Thanks so much for giving CMUdict to the world! I've developed software for exploring it (along with AmEPD). It also supports searching by pronunciation and very simple wildcard matching. I apologize for posting this as an issue, but I hope it can be useful for some people. Here's the link:
https://sourceforge.net/projects/pronundict/
For your reference: in my file of errata for cmudict, I have SUTTER, BENTEN, HARRIS, HARRIS', HARRIS'S, and NABIL marked as erroneous. The final vowel of SUTTER seems to be wrong, the second vowel of BENTEN is marked as IH0 but a word with that pronunciation (written benten and pronounced ben-teen) doesn't seem to exist, and all of the HARRIS entries have EH1 as the first vowel, as if "hay-ris". NABIL is missing the final L sound, but this name seems to have an L sound:
https://translate.google.com/#ar/en/%D9%86%D8%A8%D9%8A%D9%84%E2%80%8E
Hi Nickolay, I tried to find a description of the criteria by which the dictionary is transcribed (from other sources). One feature that strikes me as particularly odd is the entries marked with the 1 for primary stress on multiple vowels. Many of these are compounds, and, without understanding even the basic principles, I am not trying to get into the topic of compound treatment. But there are non-initialisms, non-compounds which show multiple "primary stress" (if I should take the digit markers at face value). For example, the stressed final -ee more or less consistently yields IY1 (see the entries for inductee, markee, pawnee), and many of these have another primary stress elsewhere in addition to the final IY1; according to AmHer, inductee has a secondary stress on /in/-, and the other two, bisyllabic examples do not carry a secondary stress at all. So the example of inductee is off the mark, with its double primary stress:
CMUDict IH2 N D AH1 K T IY1
AmHer IH2 N D AH0 K T IY1 (assuming ə -> AH0)
On the other hand, the phonemic transcription of a sample of a few words in -ee carrying the stress 1 elsewhere (e.g. manatee M AE1 N AH0 T IY2) does match that of AmHer. It looks like the common theme here is the final stressed IY1.
Is this just an error, or is there more to it? If that should be fixed, I have a list, not split into categories of initialisms, compounds and simple words, but I can manually select the latter category, as it's not that large. Most of the rejects are compounds even in the weakest sense of the word (e.g. remake, where re is a morpheme that would not stand on its own).
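For anyone wanting to reproduce such a list, the pattern described above (a final IY1 plus another primary stress earlier in the word) can be checked mechanically. This is a small sketch with a hypothetical helper name; separating compounds and initialisms from simple words would still need manual review.

```python
def has_extra_primary_before_final_iy1(phonemes):
    """True if the entry ends in IY1 and also has an earlier primary stress."""
    phones = phonemes.split()
    return (
        bool(phones)
        and phones[-1] == "IY1"
        and any(p.endswith("1") for p in phones[:-1])
    )

# Transcriptions quoted in this issue.
print(has_extra_primary_before_final_iy1("IH2 N D AH1 K T IY1"))  # CMUDict inductee
print(has_extra_primary_before_final_iy1("IH2 N D AH0 K T IY1"))  # AmHer-style inductee
print(has_extra_primary_before_final_iy1("M AE1 N AH0 T IY2"))    # manatee
```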
The cmusphinx-devel list on SF has had almost no traffic for the last 2 years, so I do not think it makes more sense to bring the question there than here. Or does it? Or has the list moved elsewhere?
I am using the dictionary in a research project, and simply discarding data with multiple primary stress (this is how science is supposed to work, isn't it? If the data does not fit the theory, too bad for the data :D ). But this is still confusing.
I'm assuming this repo is the best source for obtaining the cmudict files, but can you confirm? Or is the source found at The CMU Pronouncing Dictionary site kept up-to-date?
And, if this repo is the best source, are there any plans to start tagging releases here?
Thanks!
(I'm re-posting this since someone closed it without addressing any of the issues here.)
For your reference: in my file of errata for cmudict, I have SUTTER, BENTEN, HARRIS, HARRIS', HARRIS'S, and NABIL marked as erroneous. The final vowel of SUTTER seems to be wrong, the second vowel of BENTEN is marked as IH0 but a word with that pronunciation (written benten and pronounced ben-teen) doesn't seem to exist, and all of the HARRIS entries have EH1 as the first vowel, as if "hay-ris". NABIL is missing the final L sound, but this name seems to have an L sound:
https://translate.google.com/#ar/en/%D9%86%D8%A8%D9%8A%D9%84%E2%80%8E
Hi. Your website article mentions a Russian dict, but I could not find it. Would you please point me in the right direction?