cmudict's People

Contributors

coeur, dhgutteridge, dylanhand, jimregan, lenzo-ka, nshmyrev, razzius, scripttiger, tjmahr

cmudict's Issues

Not sure if it's a valid pronunciation

As a non-native English speaker, I am not sure whether the first pronunciation of the word "natural" in the CMU dictionary is valid, because the /r/ sound appears to be completely missing.

natural N AE1 CH ER0 AH0 L
natural(2) N AE1 CH R AH0 L

Could anyone tell me whether this is a bug or a valid pronunciation?

Thanks.

Current state of updates, contribution procedure?

As of July 2022, is this repository still being actively (or at least somewhat regularly) maintained? If so, what is the preferred avenue for contributing corrections/additions? Should we follow the usual git workflow of fork->push->pull request? Or should changes be emailed directly to someone as in the past? Thank you.

Values for engineer-* are inconsistent

I can't be sure of anything when it comes to phonetics as I'm no expert, but we've found it a bit peculiar that some words starting with "engineer" have an inconsistent stress value on the first vowel:

engineer EH1 N JH AH0 N IH1 R
engineer's EH2 N JH AH0 N IY1 R Z
engineered EH2 N JH AH0 N IY1 R D
engineering EH1 N JH AH0 N IH1 R IH0 NG
engineers EH1 N JH AH0 N IH1 R Z
engineers' EH1 N JH AH0 N IH1 R Z

Could it be that EH2 is the more appropriate choice for the first vowel in all of these entries? Even if the first vowel should carry secondary stress, does having multiple vowels marked with 1 mean that we sometimes have to treat words as having more than one primary stress?
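To make the inconsistency easier to see, here is a minimal sketch that reduces each pronunciation to its sequence of stress digits. It assumes the usual CMUdict convention that a vowel symbol ends in 0, 1 or 2; the helper name `stress_pattern` is made up for this illustration and is not part of any cmudict tooling.

```python
def stress_pattern(pronunciation):
    """Return the stress digits of an ARPAbet pronunciation, in order."""
    return "".join(p[-1] for p in pronunciation.split() if p[-1].isdigit())

# Entries copied from the issue above.
ENTRIES = {
    "engineer": "EH1 N JH AH0 N IH1 R",
    "engineer's": "EH2 N JH AH0 N IY1 R Z",
    "engineered": "EH2 N JH AH0 N IY1 R D",
    "engineering": "EH1 N JH AH0 N IH1 R IH0 NG",
    "engineers": "EH1 N JH AH0 N IH1 R Z",
    "engineers'": "EH1 N JH AH0 N IH1 R Z",
}

patterns = {word: stress_pattern(pron) for word, pron in ENTRIES.items()}
for word, pattern in patterns.items():
    print(f"{word:12} {pattern}  primary stresses: {pattern.count('1')}")
```

With these six entries, four come out with two vowels marked 1 and two with the pattern 2-0-1, which is the mismatch described above.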

Is it a duplication that AA0 and AA are both in the cmudict.symbols file?

Thanks for your work. English is not my mother tongue, but I am learning to train English text-to-speech models. I want to use cmudict.symbols as a symbol table to encode English phonemes. As far as I know, the digits at the end of cmudict symbols are stress labels, and AA0 means no stress. So do I need both the symbol AA and the symbol AA0 when encoding English phonemes?
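One way to look at the question is as a minimal sketch: if stress is not needed for a given model, the stress-marked symbols collapse onto the bare ones, so only the bare symbol is required; if stress is kept, only the digit-marked vowels occur in dictionary entries. The short `symbols` list below is a hypothetical sample, not the full contents of cmudict.symbols, and `base_phoneme` is a name invented for this example.

```python
import re

def base_phoneme(symbol):
    """Drop a trailing stress digit (0, 1 or 2) from an ARPAbet symbol."""
    return re.sub(r"[012]$", "", symbol)

# Hypothetical sample of symbols; consonants such as CH carry no digit.
symbols = ["AA", "AA0", "AA1", "AA2", "CH", "ER0"]

# Stress-free inventory: every stress variant maps to its bare symbol.
inventory = sorted({base_phoneme(s) for s in symbols})
print(inventory)
```

Whether to train on the stress-free inventory or on the digit-marked symbols is a modelling choice that this sketch does not settle.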

How to convert cmudict-0.7b or cmudict-0.7b.dict into FST format to use it with phonetisaurus?

I am looking for a simple procedure to generate an FST (finite state transducer) from cmudict-0.7b or cmudict-0.7b.dict, to be used with phonetisaurus.

I tried the following set of commands (the phonetisaurus aligner, the Google NGramLibrary and phonetisaurus arpa2wfst) and was able to generate an FST, but it didn't work. I am not sure where I made a mistake or missed a step. My guess is that the very first command, phonetisaurus-align, is not correct.

phonetisaurus-align --input=cmudict.dict --ofile=cmudict/cmudict.corpus --seq1_del=false
ngramsymbols < cmudict/cmudict.corpus > cmudict/cmudict.syms
/usr/local/bin/farcompilestrings --symbols=cmudict/cmudict.syms --keep_symbols=1 cmudict/cmudict.corpus > cmudict/cmudict.far
ngramcount --order=8 cmudict/cmudict.far > cmudict/cmudict.cnts
ngrammake --v=2 --bins=3 --method=kneser_ney cmudict/cmudict.cnts > cmudict/cmudict.mod
ngramprint --ARPA cmudict/cmudict.mod > cmudict/cmudict.arpa
phonetisaurus-arpa2wfst-omega --lm=cmudict/cmudict.arpa > cmudict/cmudict.fst

I tried the FST with phonetisaurus-g2p as follows:

phonetisaurus-g2p --model=cmudict/cmudict.fst --nbest=3 --input=HELLO --words

But it didn't return anything. I would appreciate any help on this matter.

List of words with apostrophe and inconsistent pronunciations

Following the discovery by @benkasminbullock of an inconsistency with borrowers' (#17) and my cleanup of duplicates (#18), here are some remaining non-duplicate entries with inconsistencies. Each of these entries should be investigated individually to find out why the phonemes differ between the two spellings. Any pair in this list may also need a stress correction.

administrators AE0 D M IH1 N IH0 S T R EY2 T ER0 Z
administrators' AE0 D M IH1 N AH0 S T R EY2 T ER0 Z

advertisers AE1 D V ER0 T AY2 Z ER0 Z
advertisers' AE1 D V ER2 T AY2 Z ER0 Z

advisers AE0 D V AY1 Z ER0 Z
advisers' AE2 D V AY1 Z ER0 Z

affiliates AH0 F IH1 L IY0 AH0 T S
affiliates(2) AH0 F IH1 L IY0 EY2 T S
affiliates' AH0 F IH1 L IY0 IH0 T S

americas AH0 M EH1 R AH0 K AH0 Z
americas(2) AH0 M EH1 R IH0 K AH0 Z
americas' AH0 M EH1 R IH0 K AH2 Z

applicants AE1 P L IH0 K AH0 N T S
applicants' AE1 P L AH0 K AH0 N T S

arbitrators AA1 R B IH0 T R EY2 T ER0 Z
arbitrators' AA1 R B AH0 T R EY2 T ER0 Z

assets AE1 S EH2 T S
assets' AE1 S EH0 T S

auditors AA1 D AH0 T ER0 Z
auditors' AO1 D IH0 T ER0 Z

authorities AH0 TH AO1 R AH0 T IY0 Z
authorities' AH0 TH AO1 R IH0 T IY0 Z

bancshares B AE1 NG K SH EH0 R Z
bancshares' B AE0 NG K SH EH1 R Z

bein B IY1 N
bein' B IY1 IH0 N

belzbergs B EH1 L T S B ER0 G Z
belzbergs' B EH1 L Z B ER0 G Z

bishops B IH1 SH AH0 P S
bishops' B IH1 SH AA0 P S

businesses B IH1 Z N AH0 S AH0 Z
businesses(2) B IH1 Z N IH0 S IH0 Z
businesses' B IH1 Z N EH2 S IH0 Z

cardinals K AA1 R D AH0 N AH0 L Z
cardinals' K AA1 R D IH0 N AH0 L Z

characters K AE1 R AH0 K T ER0 Z
characters(2) K EH1 R AH0 K T ER0 Z
characters' CH EH1 R AH0 K T ER0 Z

chemicals K EH1 M IH0 K AH0 L Z
chemicals' CH EH1 M AH0 K AH0 L Z

communists K AA1 M Y AH0 N AH0 S T S
communists' K AA1 M Y UW0 N IH0 S T S

consultants K AH0 N S AH1 L T AH0 N T S
consultants' K AH0 N S AH1 L T AH2 N T S

contractors K AA1 N T R AE2 K T ER0 Z
contractors' K AH0 N T R AE1 K T ER0 Z

controllers K AH0 N T R OW1 L ER0 Z
controllers' K AH0 N T R AA1 L ER0 Z

controls K AH0 N T R OW1 L Z
controls' K AA1 N T R AA0 L Z

corestates K AO1 R S T EY2 T S
corestates' K AO1 R AH0 S T EY2 T S

currencies K ER1 AH0 N S IY0 Z
currencies' K ER0 EH1 N S IY0 Z

dataproducts D EY1 T AH0 P R AA2 D AH0 K T S
dataproducts' D EY1 T AH0 P R AO2 D AH0 K T S
dataproducts'(2) D AE1 T AH0 P R AO2 D AH0 K T S

debentures D AH0 B EH1 N CH ER0 Z
debentures' D IH0 B EH1 N CH ER0 Z

delegates D EH1 L AH0 G EY2 T S
delegates(2) D EH1 L AH0 G AH0 T S
delegates' D EH2 L AH0 G EY1 T S

depositors D AH0 P AA1 Z IH0 T ER0 Z
depositors' D IH0 P AA1 Z IH0 T ER0 Z

endotronics EH2 N D OW0 T R AA1 N IH0 K S
endotronics' EH2 N D AH0 T R AA1 N IH0 K S

engines EH1 N JH AH0 N Z
engines' EH1 NG G IY2 N Z

environmentalists EH0 N V AY1 R AH0 N M EH2 N T AH0 L IH0 S T S
environmentalists(2) EH0 N V AY1 R AH0 N M EH2 N AH0 L IH0 S T S
environmentalists(3) EH0 N V AY1 R AH0 N M EH2 N T AH0 L IH0 S
environmentalists(4) EH0 N V AY1 R AH0 N M EH2 N AH0 L IH0 S
environmentalists' IH0 N V AY2 R AH0 N M EH1 N T AH0 L IH0 S T S
environmentalists'(2) EH0 N V AY2 R AH0 N M EH1 N AH0 L IH0 S T S

exchanges IH0 K S CH EY1 N JH AH0 Z
exchanges(2) IH0 K S CH EY1 N JH IH0 Z
exchanges' EH0 K S CH EY1 N JH IH0 Z

executives IH0 G Z EH1 K Y AH0 T IH0 V Z
executives' EH0 G Z EH1 K Y AH0 T IH0 V Z

exporters IH0 K S P AO1 R T ER0 Z
exporters' EH2 K S P AO1 R T ER0 Z

fathers F AA1 DH ER0 Z
fathers' F AE1 TH ER0 Z

framers F R EY1 M ER0 Z
framers' F R AE1 M ER0 Z

goin G OY1 N
goin' G OW1 AH0 N

hospitals HH AA1 S P IH2 T AH0 L Z
hospitals' HH AO1 S P IH0 T AH0 L Z

hostages HH AA1 S T AH0 JH AH0 Z
hostages' HH AO1 S T IH0 JH IH0 Z

husbands HH AH1 Z B AH0 N D Z
husbands' HH AH1 S B AH0 N D Z

ickes IH1 K IY0 Z
ickes(2) AY1 K IY0 Z
ickes(3) AY1 K S
ickes' IH1 K AH0 S

immigrants IH1 M AH0 G R AH0 N T S
immigrants' IH1 M IH0 G R AH0 N T S

imports IH2 M P AO1 R T S
imports(2) IH1 M P AO2 R T S
imports' IH1 M P AO0 R T S

individuals IH2 N D AH0 V IH1 JH AH0 W AH0 L Z
individuals' IH2 N D IH0 V IH1 JH AH0 W AH0 L Z

institutes IH1 N S T AH0 T UW2 T S
institutes' IH1 N S T IH0 T UW2 T S

islands AY1 L AH0 N D Z
islands' AY1 S L AH0 N D Z

issuers IH1 SH UW0 ER0 Z
issuers' IH1 S UW0 R Z

issues IH1 SH UW0 Z
issues' IH1 S UW0 Z

jacobs JH EY1 K AH0 B Z
jacobs' JH EY1 K AH2 B Z

jefferies JH EH1 F R IY0 Z
jefferies' JH EH1 F ER0 IY0 Z

ladies L EY1 D IY0 Z
ladies' L EY1 D IY2 Z

lawmakers L AO1 M EY2 K ER0 Z
lawmakers' L AO1 M EY1 K ER0 Z

legislators L EH1 JH AH0 S L EY2 T ER0 Z
legislators' L EH1 JH IH0 S L EY2 T ER0 Z

losers L UW1 Z ER0 Z
losers' L OW1 Z ER0 Z

machines M AH0 SH IY1 N Z
machines' M AH0 CH IY1 N Z

makin M AE1 K IH0 N
makin' M EY1 K IH0 N

manufacturers M AE2 N Y AH0 F AE1 K CH ER0 ER0 Z
manufacturers' M AE2 N AH0 F AE1 K CH ER0 ER0 Z

marketers M AA2 R K AH0 T ER0 Z
marketers' M AA1 R K AH0 T ER0 Z

memories M EH1 M ER0 IY0 Z
memories' M EH1 M ER0 IY2 Z

microsystems M AY1 K R OW2 S IH1 S T AH0 M Z
microsystems' M AY1 K R OW0 S IH2 S T AH0 M Z

months M AH1 N TH S
months' M AA1 N TH S

mothers M AH1 DH ER0 Z
mothers' M AH1 TH ER0 Z

multifoods M AH1 L T IY0 F UW1 D Z
multifoods' M AH1 L T IY0 F UW2 D Z

negotiators N IH0 G OW1 SH IY0 EY2 T ER0 Z
negotiators' N AH0 G OW1 SH IY0 EY2 T ER0 Z

netherlands N EH1 DH ER0 L AH0 N D Z
netherlands' N EH1 TH ER0 L AE0 N D Z

non-smokers N AA0 N S M OW1 K ER0 Z
non-smokers' N AA1 N S M OW1 K ER0 Z

nonsmokers N AA0 N S M OW1 K ER0 Z
nonsmokers' N AA1 N S M OW1 K ER0 Z

nothin N AA1 TH IH0 N
nothin' N AH1 TH IH0 N

palestinians P AE2 L IH0 S T IH1 N IY0 AH0 N Z
palestinians' P AE2 L AH0 S T IH1 N IY0 AH0 N Z

partnerships P AA1 R T N ER0 SH IH2 P S
partnerships' P AA1 R T N ER0 SH IH0 P S

physics F IH1 Z IH0 K S
physics' F IH1 S IH0 K S

predecessors P R EH1 D AH0 S EH2 S ER0 Z
predecessors' P R EH2 D AH0 S EH1 S ER0 Z

products P R AA1 D AH0 K T S
products(2) P R AA1 D AH0 K S
products' P R AO1 D AH0 K T S
products'(2) P R AO1 D AH0 K S

projects P R AA1 JH EH0 K T S
projects(2) P R AH0 JH EH1 K T S
projects(3) P R AA1 JH EH0 K S
projects(4) P R AH0 JH EH1 K S
projects' P R AO1 JH EH0 K T S
projects'(2) P R AO1 JH EH0 K S

properties P R AA1 P ER0 T IY0 Z
properties' P R OW1 P ER0 T IY0 Z

prosecutors P R AA1 S IH0 K Y UW2 T ER0 Z
prosecutors' P R AA1 S AH0 K Y UW0 T ER0 Z

representatives R EH2 P R AH0 Z EH1 N T AH0 T IH0 V Z
representatives(2) R EH2 P R IH0 Z EH1 N T AH0 T IH0 V Z
representatives(3) R EH2 P R AH0 Z EH1 N AH0 T IH0 V Z
representatives(4) R EH2 P R IH0 Z EH1 N AH0 T IH0 V Z
representatives' R EH2 P R AH0 S EH1 N T AH0 T IH0 V Z
representatives'(2) R EH2 P R AH0 S EH1 N AH0 T IH0 V Z

retirees R IY0 T AY1 R IY1 Z
retirees' R IH0 T AY2 R IY1 Z

returns R IH0 T ER1 N Z
returns(2) R IY0 T ER1 N Z
returns' R AH0 T ER1 N Z

rollin R AA1 L IH0 N
rollin' R OW1 L IH0 N

secretaries S EH1 K R AH0 T EH2 R IY0 Z
secretaries' S EH1 K R IH0 T EH2 R IY0 Z

sons S AH1 N Z
sons' S AA1 N Z

speculators S P EH1 K Y AH0 L EY2 T ER0 Z
speculators' S P EH1 K Y AH0 L ER0 T EY2 Z

starin S T AE1 R IH0 N
starin' S T EH1 R IH0 N

steelmakers S T IY1 L M EY2 K ER0 Z
steelmakers' S T IY1 L M AH0 K ER0 Z

steelworkers S T IY1 L W ER2 K ER0 Z
steelworkers' S T IY1 L W ER0 K ER0 Z

subjects S AH1 B JH IH0 K T S
subjects(2) S AH0 B JH EH1 K T S
subjects(3) S AH0 B JH EH1 K S
subjects' S AH1 B JH EH0 K T S
subjects'(2) S AH1 B JH EH0 K S

superpowers S UW2 P ER0 P AW1 ER0 Z
superpowers' S UW1 P ER0 P AW2 R Z

superregionals S UW2 P ER0 R IY1 JH AH0 N AH0 L Z
superregionals' S UW0 P ER0 R IY1 JH AH0 N AH0 L Z

supervisors S UW2 P ER0 V AY1 Z ER0 Z
supervisors' S UW1 P ER0 V AY2 Z ER0 Z

surgeons S ER1 JH AH0 N Z
surgeons' S ER1 JH IH0 N Z

talkin T AA1 K AH0 N
talkin' T AO1 K IH0 N

technologies T EH0 K N AA1 L AH0 JH IY0 Z
technologies' T EH2 K N AA1 L AH0 JH IY0 Z

telesis T EH1 L AH0 S IH0 S
telesis' T EH1 L AH0 S IH2 S

universities Y UW2 N AH0 V ER1 S AH0 T IY0 Z
universities' Y UW2 N IH0 V ER1 S IH0 T IY0 Z

vehicles V IY1 HH IH0 K AH0 L Z
vehicles(2) V IY1 IH0 K AH0 L Z
vehicles' V EH1 HH IH0 K AH0 L Z
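As a rough aid for triaging the list above, here is a minimal sketch that sorts a pair into "stress only" (same phonemes, different stress digits) or "phonemes differ". The function names are made up for this illustration, and the two sample pairs are copied from the list.

```python
def strip_stress(symbol):
    """Remove a trailing stress digit from an ARPAbet symbol."""
    return symbol.rstrip("012")

def classify(plain, possessive):
    """Describe how two pronunciations of the same word form differ."""
    a, b = plain.split(), possessive.split()
    if a == b:
        return "identical"
    if [strip_stress(x) for x in a] == [strip_stress(x) for x in b]:
        return "stress only"
    return "phonemes differ"

def differing_positions(plain, possessive):
    """List (index, plain_symbol, possessive_symbol) where the two disagree.

    Returns None when the pronunciations have different lengths, since a
    position-by-position comparison is not meaningful in that case.
    """
    a, b = plain.split(), possessive.split()
    if len(a) != len(b):
        return None
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

print(classify("AE0 D V AY1 Z ER0 Z", "AE2 D V AY1 Z ER0 Z"))   # advisers
print(classify("F AA1 DH ER0 Z", "F AE1 TH ER0 Z"))             # fathers
print(differing_positions("F AA1 DH ER0 Z", "F AE1 TH ER0 Z"))
```

Pairs that come out as "stress only" are candidates for a simple stress correction, while "phonemes differ" pairs need a closer look at each differing symbol.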

My corrections

For your reference: in my file of errata for cmudict, I have SUTTER, BENTEN, HARRIS, HARRIS', HARRIS'S, and NABIL marked as erroneous. The final vowel of SUTTER seems to be wrong; the second vowel of BENTEN is marked as IH0, but a word with that pronunciation (written benten and pronounced ben-teen) doesn't seem to exist; and all of the HARRIS entries have EH1 as the first vowel, as if "hay-ris". NABIL is missing the final L sound, but this name seems to have an L sound:

https://translate.google.com/#ar/en/%D9%86%D8%A8%D9%8A%D9%84%E2%80%8E

Documentation/principles of the dictionary maintenance

Hi Nickolay, I tried to find a description of the criteria by which the dictionary is transcribed (from other sources). One feature that strikes me as particularly odd is the entries marked with 1 for primary stress on multiple vowels. Many of these are compounds, and since I do not understand even the basic principles, I am not trying to get into the topic of compound treatment. But there are words that are neither initialisms nor compounds which show multiple "primary stress" (if I should take the digit markers at face value). For example, the stressed final -ee more or less consistently yields IY1 (see the entries for inductee, markee, pawnee), and many of these words have another primary stress elsewhere in addition to the final IY1. According to AmHer, inductee has a secondary stress on /in/-, and the other two, bisyllabic examples do not carry a secondary stress at all. So the entry for inductee is off the mark, with its double primary stress:

CMUDict IH2 N D AH1 K T IY1
AmHer   IH2 N D AH0 K T IY1 (assuming ə -> AH0)

On the other hand, the phonemic transcription of a sample of a few words in -ee that carry stress 1 elsewhere (e.g. manatee M AE1 N AH0 T IY2) does match that of AmHer. It looks like the common theme here is the final stressed IY1.

Is this just an error, or is there more to it? If it should be fixed, I have a list. It is not split into the categories of initialisms, compounds and simple words, but I can manually select the last category, as it's not that large. Most of the rejects are compounds even in the weakest sense of the word (e.g. remake, where re is a morpheme that would not stand on its own).

The cmusphinx-devel list on SF has had almost no traffic for the last 2 years, so I do not think it makes more sense to bring the question there than here. Or does it? Or has the list moved elsewhere?

I am using the dictionary in a research project, and simply discarding data with multiple primary stress (this is how science is supposed to work, isn't it? If the data does not fit the theory, too bad for the data :D ). But this is still confusing.
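The discarding step mentioned above could look roughly like the following minimal sketch. It assumes entries are available as (word, pronunciation) pairs and that primary stress is marked by a trailing 1; `primary_count` is a hypothetical helper, and the two entries are the ones quoted in this issue.

```python
def primary_count(pronunciation):
    """Count the symbols marked with primary stress (trailing 1)."""
    return sum(1 for p in pronunciation.split() if p.endswith("1"))

# Entries quoted in the issue text.
entries = [
    ("inductee", "IH2 N D AH1 K T IY1"),
    ("manatee", "M AE1 N AH0 T IY2"),
]

# Keep entries with at most one primary stress; set the rest aside for review.
kept = [word for word, pron in entries if primary_count(pron) <= 1]
dropped = [word for word, pron in entries if primary_count(pron) > 1]
print("kept:", kept)
print("dropped:", dropped)
```

Collecting the dropped words separately, rather than silently skipping them, gives a ready-made list of candidates to report.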

Authoritative Source?

I'm assuming this repo is the best source for obtaining the cmudict files, but can you confirm? Or is the source found at The CMU Pronouncing Dictionary site kept up-to-date?

And, if this repo is the best source, are there any plans to start tagging releases here?

Thanks!

Corrections (re-post)

(I'm re-posting this since someone closed it without addressing any of the issues here.)

For your reference: in my file of errata for cmudict, I have SUTTER, BENTEN, HARRIS, HARRIS', HARRIS'S, and NABIL marked as erroneous. The final vowel of SUTTER seems to be wrong; the second vowel of BENTEN is marked as IH0, but a word with that pronunciation (written benten and pronounced ben-teen) doesn't seem to exist; and all of the HARRIS entries have EH1 as the first vowel, as if "hay-ris". NABIL is missing the final L sound, but this name seems to have an L sound:

https://translate.google.com/#ar/en/%D9%86%D8%A8%D9%8A%D9%84%E2%80%8E

Russian Dictionary?

Hi. Your website article mentions a Russian dict, but I could not find it. Would you please point me in the right direction?
