dmort27 / epitran
A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)
License: MIT License
It seems that Yoruba, which uses ⟨y⟩ for the approximant /j/, is being incorrectly transcribed such that ⟨y⟩ becomes the vowel /y/.
>>> import epitran
>>> translator = epitran.Epitran('yor-Latn')
>>> translator.transliterate('Yorùbá') # expected output: 'jōrùbá'
'yorùbá'
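A DatafileError elsewhere on this page shows that epitran's mapping files are two-column CSVs with an ["Orth", "Phon"] header, so the fix would presumably be a corrected row in the yor-Latn mapping file. A hypothetical fragment (not the actual file contents):

```
Orth,Phon
y,j
```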
Hi!
I used this library for some work that I am writing a paper on. Is there something that I can cite? I should note that I also used Panphon and cited appropriately from the paper linked in that README.
Just testing this out for fun... one thing that I notice is that the backoff feature seems to hang if it gets some input that isn't in its alphabets, e.g.
epi = epitran.Epitran('tur-Latn')
epi.transliterate('merhaba: nasilsin?!')
'meɾhaba: nasilsin?!'
works pretty quickly (and would work even better if I had used an actual Turkish keyboard).
Whereas
backoff = Backoff(['tur-Latn', 'hin-Deva'])
backoff.transliterate('merhaba: nasilsin?!')
seems to hang indefinitely...
Maybe backoff needs an extra something for pass-throughs? :)
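If the hang really is triggered by tokens outside both scripts, one interim workaround (a sketch, not part of the epitran API) is to split out the non-letter chunks yourself and only hand alphabetic runs to Backoff. The helper below demonstrates the splitting logic with a stand-in transliterator:

```python
import re

def transliterate_with_passthrough(transliterate, text):
    # Apply `transliterate` only to alphabetic runs; pass everything
    # else (punctuation, digits, whitespace) through unchanged.
    out = []
    for chunk in re.findall(r'[^\W\d_]+|[\W\d_]+', text):
        if chunk[0].isalpha():
            out.append(transliterate(chunk))
        else:
            out.append(chunk)
    return ''.join(out)

# str.upper stands in for Backoff(['tur-Latn', 'hin-Deva']).transliterate
print(transliterate_with_passthrough(str.upper, 'merhaba: nasilsin?!'))
```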
Traceback (most recent call last):
File "d:/tcritp/tcript.py", line 3, in <module>
tr.transliterate(u'spark')
File "D:\Python3\lib\site-packages\epitran\_epitran.py", line 62, in transliterate
return self.epi.transliterate(word, normpunc, ligatures)
File "D:\Python3\lib\site-packages\epitran\flite.py", line 92, in transliterate
acc.append(self.english_g2p(chunk))
File "D:\Python3\lib\site-packages\epitran\flite.py", line 211, in english_g2p
return self.arpa_to_ipa(arpa_text)
File "D:\Python3\lib\site-packages\epitran\flite.py", line 76, in arpa_to_ipa
text = ''.join(ipa_list)
File "D:\Python3\lib\site-packages\epitran\flite.py", line 75, in <lambda>
ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
KeyError: ''
Source:
from epitran import Epitran
tr = Epitran('eng-Latn', cedict_file='cedict_1_0_ts_utf-8_mdbg.txt')
tr.transliterate('test')
Note: changing 'test' to u'test' does not help.
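The KeyError: '' above comes from looking up an empty ARPAbet token in arpa_map. A defensive version of that lookup (a sketch of the idea, not epitran's actual code) would drop empty or unknown tokens instead of raising:

```python
import re

def arpa_list_to_ipa(arpa_list, arpa_map):
    # Strip stress digits, then skip tokens with no IPA mapping
    # instead of raising KeyError on '' as in the traceback above.
    cleaned = (re.sub(r'\d', '', a) for a in arpa_list)
    return ''.join(arpa_map.get(a, '') for a in cleaned if a)

# hypothetical ARPAbet-to-IPA table fragment, for illustration only
arpa_map = {'T': 't', 'EH': 'ɛ', 'S': 's'}
print(arpa_list_to_ipa(['T1', 'EH1', 'S', 'T', ''], arpa_map))  # → tɛst
```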
I'm trying to use epitran to obtain correct phonetic pronunciations of French words. I did eventually get it working with the fra-Latn preprocessor, but its performance is lackluster: it gives very literal transliterations, ones that never use the uvular "ʁ" or the "." syllable separator.
So after having mixed performance with that, I looked at the documentation and noticed there was a more phonetic translator "fra-Latn-np". Upon attempting to use this to translate any given word, I get the following error:
Traceback (most recent call last):
File "main.py", line 6, in <module>
epi = epitran.Epitran('fra-Latn-np')
File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/_epitran.py", line 46, in __init__
self.epi = SimpleEpitran(code, preproc, postproc, ligatures, rev, rev_preproc, rev_postproc, tones=tones)
File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/simple.py", line 43, in __init__
self.g2p = self._load_g2p_map(code, False)
File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/simple.py", line 100, in _load_g2p_map
raise DatafileError('Header is ["{}", "{}"] instead of ["Orth", "Phon"].'.format(orth, phon))
epitran.exceptions.DatafileError: Header is ["Prth", "Phon"] instead of ["Orth", "Phon"].
I'm not sure what causes it, but looking in that directory there is also an undocumented "fra-Latn-p" preprocessor, which sometimes does better and sometimes worse. Could you please explain what is going on here?
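For what it's worth, the DatafileError points directly at the cause: the first cell of the fra-Latn-np mapping file's header row apparently reads "Prth" where the loader expects "Orth". If that diagnosis is right, the header row of that CSV simply needs to be:

```
Orth,Phon
```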
Here is my code:
import sys
from google_trans_new import google_translator
import epitran
translator = google_translator()
epi = epitran.Epitran('fra-Latn-np')
# Translate the first system argument
#translated_text = translator.translate(sys.argv[1], lang_src='en', lang_tgt='fr')
# Get the IPA pronunciation
#ipa_symbols = epi.transliterate(translated_text)
#print(translated_text)
#print(ipa_symbols)
print(epi.transliterate(sys.argv[1]))
Hello, I just discovered this awesome module, and I found two issues with the French language (both with fra-Latn and fra-Latn-np).
When a word ends with an 's', the 's' is silent, so "il" ("he") is pronounced the same way as "ils" ("they"). However, when I try epi.transliterate("il") and epi.transliterate("ils"), it returns il and ils.
The final 'es' is pronounced when it comes after a consonant. For example, "faites" ("do") is pronounced "fɛt" and "fait" ("done") is pronounced "fɛ". But transliterate() returns "fe" and "fe".
In the same way, it returns "ɡaraʒ" for "garage" (which is correct, Wikitionary gives "ɡa.ʁaʒ") but "ɡara" for "garages".
Are these languages that should also be approached with caution? Not sure what this section in the README means.
In https://github.com/dmort27/epitran/blob/master/epitran/epihan.py#L103, there is no tones argument, but the construction of the object in https://github.com/dmort27/epitran/blob/master/epitran/_epitran.py#L44 passes that argument, causing an error.
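A minimal sketch of the kind of fix implied (the real Epihan constructor signature may well differ; treat the parameter list here as an assumption): make the constructor accept the tones keyword that _epitran.py passes, even if it only stores it.

```python
class Epihan:
    # Accepting tones= (even if unused) keeps the call site in
    # _epitran.py, which passes tones=..., from raising a TypeError.
    def __init__(self, ligatures=False, cedict_file=None, tones=False):
        self.tones = tones
```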
Hello
I am trying to install lex_lookup, since I wish to convert an English text to IPA.
I am running Cygwin on Windows 10. I followed the instructions, including changing "cp -pd" to "cp -pR" in the relevant flite-2.0.5-current\main\Makefile. However, I cannot manage to run "make lex_lookup".
Thank you very much for your help.
So I use this code:
from epitran.backoff import Backoff
backoff = Backoff(['fas-Arab', 'rus-Cyrl'])
backoff.transliterate('Привет дорогой друг пидор')
and it gives
'prʲivʲet doroɡoй druɡ pʲidor'
As you can see, the Russian 'й' remains in the result; it should (maybe) be 'j'. Or am I wrong?
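Until the Cyrillic mapping is corrected, a trivial post-processing workaround (assuming 'j' really is the intended output for 'й') is a plain string replacement:

```python
result = 'prʲivʲet doroɡoй druɡ pʲidor'
# Replace the leftover Cyrillic short i with the IPA palatal approximant.
fixed = result.replace('й', 'j')
print(fixed)  # → prʲivʲet doroɡoj druɡ pʲidor
```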
Thanks for your wonderful work. Do you have any plans to support Norwegian, Danish, and Finnish?
I got a strange transliteration for Italian:
abiud d͡ʒenerɔ eliat͡ʃim eliat͡ʃim ɡenerɔ asor
ɡenerɔ should be d͡ʒenerɔ. This happens when the string is part of a much larger string, but not when it is transliterated in isolation (i.e., only that string).
It happens here because the function panphon.FeatureTable.segs isn't defined:
https://github.com/dmort27/epitran/blob/master/epitran/flite.py#L155
Using Python 3.7.4. This is the output of pip freeze:
editdistance==0.5.3
epitran==1.1
marisa-trie==0.7.5
munkres==1.1.2
numpy==1.17.0
panphon==0.15
PyYAML==5.1.2
regex==2019.8.19
unicodecsv==0.14.1
First of all, thanks for making this great software. It works perfectly for me. Adding rules is also explained very clearly, and I could implement that with ease.
I am parsing and converting a Dutch wordlist to IPA and X-SAMPA, trying to generate a dict for building voices. I saw there's an ARPAbet mapping too, which would be handy for training Sphinx. Should I create a class and an ipa2arpa.csv like you did for the X-SAMPA conversion?
I am now using X-SAMPA like this:
import epitran
from epitran.xsampa import XSampa

# set to Dutch
epi = epitran.Epitran('nld-Latn')
# X-SAMPA converter
xs = XSampa()
s = epi.transliterate(word)
s_a = xs.ipa2xs(s)
So could I also make a class like XSampa for ipa2arpa, or is there a simpler way?
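In case it helps, here is a rough sketch of what such a class could look like, mirroring the XSampa pattern of loading a two-column CSV and replacing longest IPA sequences first. The CSV format, class name, and the tiny table below are assumptions for illustration, not epitran data:

```python
import csv
import io

class Ipa2Arpa:
    """Hypothetical IPA-to-ARPAbet converter: load a two-column CSV
    mapping and greedily replace longest IPA sequences first."""

    def __init__(self, csv_text):
        reader = csv.reader(io.StringIO(csv_text))
        self.table = {ipa: arpa for ipa, arpa in reader}
        # longest-match-first so multi-character symbols like t͡ʃ win
        self.keys = sorted(self.table, key=len, reverse=True)

    def ipa2arpa(self, text):
        out = []
        i = 0
        while i < len(text):
            for k in self.keys:
                if text.startswith(k, i):
                    out.append(self.table[k])
                    i += len(k)
                    break
            else:
                i += 1  # skip symbols with no mapping
        return ' '.join(out)

# tiny illustrative table (not the real ipa2arpa.csv)
conv = Ipa2Arpa("t͡ʃ,CH\nɑ,AA\nt,T\n")
print(conv.ipa2arpa('t͡ʃɑt'))  # → CH AA T
```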
When will this problem be fixed? Thank you very much!
Hi David,
Is there a way to directly get the list of all ipa symbols that you use for English and Polish?
Thanks,
Adnane
First of all, thanks for your great work! It's a really awesome project. I have a question: how would one convert IPA tokens back into language (orthographic) tokens? Or is that possible at all?
I am facing an issue running the model for English. I have installed Flite and am able to run c = os.system(command) from my python script as well.
I get the following warning:
WARNING:root:lex_lookup (from flite) is not installed.
Did anyone else face this issue? Could you let me know how you have solved it? Thanks!
On Debian stretch:
$ pip3 install epitran
Collecting epitran
Using cached epitran-0.23-py2.py3-none-any.whl
Collecting marisa-trie (from epitran)
Using cached marisa_trie-0.7.4-cp35-cp35m-manylinux1_x86_64.whl
Collecting panphon>=0.12 (from epitran)
Using cached panphon-0.12-py2.py3-none-any.whl
Collecting unicodecsv (from epitran)
Collecting subprocess32 (from epitran)
Using cached subprocess32-3.2.7.tar.gz
Complete output from command python setup.py egg_info:
This backport is for Python 2.x only.
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-7hr0xfki/subprocess32/
Here is the traceback for using Epitran("amh-Ethi"); for other languages it works fine.
import epitran
epi = epitran.Epitran("amh-Ethi")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "epitran/_epitran.py", line 42, in __init__
self.epi = SimpleEpitran(code, preproc, postproc, ligatures)
File "epitran/simple.py", line 52, in __init__
self.postprocessor = PrePostProcessor(code, 'post')
File "epitran/ppprocessor.py", line 28, in __init__
self.rules = self._read_rules(code, fix)
File "epitran/ppprocessor.py", line 38, in _read_rules
return Rules([abs_fn])
File "epitran/rules.py", line 28, in __init__
rules = self._read_rule_file(rule_file)
File "epitran/rules.py", line 36, in _read_rule_file
rules.append(self._read_rule(line))
File "epitran/rules.py", line 65, in _read_rule
return self._fields_to_function(a, b, X, Y)
File "epitran/rules.py", line 81, in _fields_to_function
regexp = re.compile(left)
File "/home/anaconda2/lib/python2.7/site-packages/regex.py", line 345, in compile
return _compile(pattern, flags, kwargs)
File "/home/anaconda2/lib/python2.7/site-packages/regex.py", line 490, in _compile
caught_exception.pos)
_regex_core.error: missing ) at position 53
When running the Backoff class with "cmn-Hant" (which uses EpihanTraditional), it complains with the following error:
File "/usr2/home/amuis/anaconda3/envs/py36/lib/python3.6/site-packages/epitran/backoff.py", line 46, in transliterate
m = lang.epi.regexp.match(dia.process(token))
AttributeError: 'EpihanTraditional' object has no attribute 'regexp'
This can be easily fixed by adding the following line at the end of https://github.com/dmort27/epitran/blob/master/epitran/epihan.py
self.regexp = re.compile(r'\p{Han}')
I assume the character class "\p{Han}" captures both traditional and simplified Chinese.
I've been trying to use the english transliteration, without success.
I did follow installation instructions for flite (and also copied the relevant binaries in the /usr/local/bin
), and the process seems to have worked since I do not get anymore the "lex_lookup not installed" kind of error.
However, I'm still stuck at a rather cryptic KeyError. When I do (in python3):
import epitran
epi = epitran.Epitran('eng-Latn')
epi.transliterate('Berkeley')
this is what I get:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-8-e30894fd177f> in <module>
----> 1 epi.transliterate('Berkeley')
~/.local/lib/python3.7/site-packages/epitran/_epitran.py in transliterate(self, word, normpunc, ligatures)
60 unicode: IPA string
61 """
---> 62 return self.epi.transliterate(word, normpunc, ligatures)
63
64 def reverse_transliterate(self, ipa):
~/.local/lib/python3.7/site-packages/epitran/flite.py in transliterate(self, text, normpunc, ligatures)
89 for chunk in self.chunk_re.findall(text):
90 if self.letter_re.match(chunk):
---> 91 acc.append(self.english_g2p(chunk))
92 else:
93 acc.append(chunk)
~/.local/lib/python3.7/site-packages/epitran/flite.py in english_g2p(self, text)
205 logging.warning('Non-zero exit status from lex_lookup.')
206 arpa_text = ''
--> 207 return self.arpa_to_ipa(arpa_text)
~/.local/lib/python3.7/site-packages/epitran/flite.py in arpa_to_ipa(self, arpa_text, ligatures)
73 arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
74 ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
---> 75 text = ''.join(ipa_list)
76 return text
77
~/.local/lib/python3.7/site-packages/epitran/flite.py in <lambda>(d)
72 arpa_list = self.arpa_text_to_list(arpa_text)
73 arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
---> 74 ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
75 text = ''.join(ipa_list)
76 return text
KeyError: 'iy)\n(b'
No matter my query, it seems self.arpa_map does not have it. What am I doing wrong?
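The key 'iy)\n(b' suggests lex_lookup emitted two parenthesized lines and the fragment spanning the newline survived tokenization. As a diagnostic sketch (the raw string below is reconstructed from the KeyError and is only a guess at lex_lookup's output shape), a tokenizer that discards parentheses and newlines would avoid the lookup failure:

```python
import re

# Hypothetical multi-line lex_lookup output, guessed from the KeyError.
raw = '(b er k l iy)\n(b er k l iy)'
# Keep only alphabetic ARPAbet symbols; parentheses/newlines are dropped.
tokens = re.findall(r'[a-z]+', raw)
print(tokens)
```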
In the fra-Latn.txt preprocessor there are some matches that use [] and others that use ():
::vowel:: = a|á|â|æ|e|é|è|ê|ë|i|î|ï|o|ô|œ|u|ù|û|ü|A|Á|Â|Æ|E|É|È|Ê|Ë|I|Î|Ï|O|Ô|Œ|U|Ù|Û|Ü|ɛ
::front_vowel:: = e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ
::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ
% Treatment of <c> and <s>
sc -> s / _ [::front_vowel::]
c -> s / _ [::front_vowel::]
% High vowels become glides before vowels
ou -> w / _ (::vowel::)
u -> ɥ / _ (::vowel::)
Is there a difference in behaviour between the two? Am I right in thinking that:
[::front_vowel::] is [e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ] in regex, and
(::front_vowel::) is (e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ) in regex?
From what I understand of regex they look as though they'd do the same thing, except [::front_vowel::] would also match the | char.
I also don't think [] would work if there are two or more chars in a group, for example ch in:
::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ
I'd guess that () also creates capturing groups, but I'm not sure whether that's being utilised.
Any guidance would be greatly appreciated.
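That reading can be checked directly with Python's re module; the patterns below are illustrative, not the actual expansion epitran performs on its rule files:

```python
import re

# Inside a character class, '|' and multi-character sequences lose their
# meaning: [a|ch] matches the single characters 'a', '|', 'c', or 'h'.
# A group (a|ch) matches the alternatives 'a' or 'ch' as units.
char_class = re.compile(r'^[a|ch]$')
group = re.compile(r'^(a|ch)$')

print(bool(char_class.match('|')))   # True: '|' is a literal in a class
print(bool(char_class.match('ch')))  # False: a class matches one char
print(bool(group.match('ch')))       # True: 'ch' is a full alternative
```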
You can install flite with sudo apt install flite; t2p is included.
https://packages.ubuntu.com/search?keywords=flite&searchon=names&exact=1&suite=all&section=all
$ dpkg -S /usr/bin/t2p
flite: /usr/bin/t2p
$ flite --version
Carnegie Mellon University, Copyright (c) 1999-2016, all rights reserved
version: flite-2.1-release Dec 2017 (http://cmuflite.org)
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
lex_lookup must still be built from source.
Hello,
I came across what I believe to be a bug in German transliteration of the grapheme 's'. This occurs when using the 'deu-Latn' and the 'deu-Latn-nar' dictionaries. Take for example the word 'sehr':
In [14]: epi1.transliterate('sehr')
Out[14]: 't͡seːə'
In [16]: epi3.transliterate('sehr')
Out[16]: 't͡seːɐ'
Here epi1 was initialized with the 'deu-Latn' dictionary and epi3 with the 'deu-Latn-nar' dictionary.
In both cases I would expect the 's' in 'sehr' to be transliterated with [z]. I know that [s] is also possible in this case when dealing with southern German dialects, and I see this transliteration when using the 'deu-Latn-np' dictionary. However, after consulting all my sources, I don't see a case where this can be transliterated as [t͡s].
Another example would be the word 'Stock':
In [20]: epi1.transliterate('Stock')
Out[20]: 'stok'
In [21]: epi3.transliterate('Stock')
Out[21]: 'stok'
In the case of the 'deu-Latn' example, I can understand why this may be transliterated as [s], but at least with the narrow transliteration I would expect [ʃ]. As far as I know, [s] only occurs in this environment in northern German dialects.
Would you mind investigating this with me? What I've done so far is look at tens of examples (I'm transliterating a large corpus), and it seems to happen across the board, no exceptions. I also made sure that I pip-installed the latest version of Epitran.
I am curious to get a sense of what other researchers feel about the use of BCP 47 tags for speech recognition models, and at what level (data ID, on training data, on the model itself, or on output from the model). Read more about BCP 47 here: https://www.w3.org/International/articles/language-tags/
https://tools.ietf.org/html/bcp47
Some months ago I was on the IETF mailing list for sub-tags, and suggested that speech-to-text and text-to-speech models should have tags identifying them. But there didn't seem to be any great "a-ha's" from that crowd.
Hello, can someone help me with this error? I've already updated my Microsoft Visual Studio because that was the first error; now I am getting this. Why is this happening? Thank you!
C:\Users\LENOVO>pip install epitran
Collecting epitran
Using cached epitran-1.8-py2.py3-none-any.whl (132 kB)
Requirement already satisfied: regex in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (2020.7.14)
Requirement already satisfied: unicodecsv in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (0.14.1)
Requirement already satisfied: setuptools in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (50.3.1)
Collecting marisa-trie
Using cached marisa-trie-0.7.5.tar.gz (270 kB)
Collecting panphon>=0.16
Using cached panphon-0.17-py2.py3-none-any.whl (71 kB)
Requirement already satisfied: PyYAML in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (5.3)
Collecting editdistance
Using cached editdistance-0.5.3.tar.gz (27 kB)
Requirement already satisfied: numpy in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (1.18.0)
Requirement already satisfied: munkres in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (1.1.4)
Building wheels for collected packages: marisa-trie, editdistance
Building wheel for marisa-trie (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\LENOVO\AppData\Local\Temp\pip-wheel-51uiu7_u'
cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\marisa-trie\
Complete output (23 lines):
running bdist_wheel
running build
running build_clib
building 'libmarisa-trie' library
creating build
creating build\temp.win-amd64-3.8
creating build\temp.win-amd64-3.8\marisa-trie
creating build\temp.win-amd64-3.8\marisa-trie\lib
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\io
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\trie
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\vector
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\agent.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\agent.obj
agent.cc
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\keyset.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\keyset.obj
keyset.cc
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\trie.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\trie.obj
trie.cc
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa/grimoire/io\mapper.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa/grimoire/io\mapper.obj
mapper.cc
marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
----------------------------------------
ERROR: Failed building wheel for marisa-trie
Running setup.py clean for marisa-trie
Building wheel for editdistance (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\editdistance\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\editdistance\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\LENOVO\AppData\Local\Temp\pip-wheel-cibnnpm3'
cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\editdistance\
Complete output (30 lines):
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.8
creating build\lib.win-amd64-3.8\editdistance
copying editdistance\__init__.py -> build\lib.win-amd64-3.8\editdistance
copying editdistance\_editdistance.h -> build\lib.win-amd64-3.8\editdistance
copying editdistance\def.h -> build\lib.win-amd64-3.8\editdistance
running build_ext
building 'editdistance.bycython' extension
creating build\temp.win-amd64-3.8
creating build\temp.win-amd64-3.8\Release
creating build\temp.win-amd64-3.8\Release\editdistance
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I./editdistance -Ic:\users\lenovo\appdata\local\programs\python\python38\include -Ic:\users\lenovo\appdata\local\programs\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpeditdistance/_editdistance.cpp /Fobuild\temp.win-amd64-3.8\Release\editdistance/_editdistance.obj
_editdistance.cpp
editdistance/_editdistance.cpp(91): warning C4267: 'initializing': conversion from 'size_t' to 'unsigned int', possible loss of data
editdistance/_editdistance.cpp(119): note: see reference to function template instantiation 'unsigned int edit_distance_map_<1>(const int64_t *,const size_t,const int64_t *,const size_t)' being compiled
editdistance/_editdistance.cpp(92): warning C4267: 'initializing': conversion from 'size_t' to 'unsigned int', possible loss of data
editdistance/_editdistance.cpp(44): warning C4018: '<=': signed/unsigned mismatch
editdistance/_editdistance.cpp(97): note: see reference to function template instantiation 'unsigned int edit_distance_bpv<cmap_v,varr<1>>(T &,const int64_t *,const size_t &,const unsigned int &,const unsigned int &)' being compiled
with
[
T=cmap_v
]
editdistance/_editdistance.cpp(119): note: see reference to function template instantiation 'unsigned int edit_distance_map_<1>(const int64_t *,const size_t,const int64_t *,const size_t)' being compiled
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I./editdistance -Ic:\users\lenovo\appdata\local\programs\python\python38\include -Ic:\users\lenovo\appdata\local\programs\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpeditdistance/bycython.cpp /Fobuild\temp.win-amd64-3.8\Release\editdistance/bycython.obj
bycython.cpp
c:\users\lenovo\appdata\local\programs\python\python38\include\pyconfig.h(206): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
----------------------------------------
ERROR: Failed building wheel for editdistance
Running setup.py clean for editdistance
Failed to build marisa-trie editdistance
Installing collected packages: marisa-trie, editdistance, panphon, epitran
Running setup.py install for marisa-trie ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\LENOVO\AppData\Local\Temp\pip-record-teibnt8o\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\lenovo\appdata\local\programs\python\python38\Include\marisa-trie'
cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\marisa-trie\
Complete output (23 lines):
running install
running build
running build_clib
building 'libmarisa-trie' library
creating build
creating build\temp.win-amd64-3.8
creating build\temp.win-amd64-3.8\marisa-trie
creating build\temp.win-amd64-3.8\marisa-trie\lib
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\io
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\trie
creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\vector
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\agent.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\agent.obj
agent.cc
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\keyset.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\keyset.obj
keyset.cc
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\trie.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\trie.obj
trie.cc
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa/grimoire/io\mapper.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa/grimoire/io\mapper.obj
mapper.cc
marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\LENOVO\AppData\Local\Temp\pip-record-teibnt8o\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\lenovo\appdata\local\programs\python\python38\Include\marisa-trie' Check the logs for full command output.
C:\Users\LENOVO>
Hello,
If I understand correctly, if you use Flite as the backend for English G2P, you get transcriptions in US English. How would one go about getting transcriptions for other varieties of English, e.g. Received Pronunciation or Australian English? I know that Festvox supports British and Scottish English, so could it in theory be used as the backend for English G2P?
For my use case, it's not super important that the vowels are precise, but the rhoticity distinction would be extremely useful.
Thanks!
Should I feed individual words to epitran, or whole sentences? Are there any rules that use context to transliterate words?
For example, I have epi.transliterate('янъ') -> jan. With word_to_tuples I can find the skipped 'ъ', but how can I know that 'я' occupies the first two indexes?
IPA transliterations of Bengali characters with Chandrabindus in them leave the Chandrabindu there, when it should be replaced with a combining tilde, the corresponding IPA character. With epitran 0.56 installed:
>>> import epitran
>>> translator = epitran.Epitran('ben-Beng')
>>> translator.transliterate('হাঁ')
ɦaঁ
I haven't checked extensively, but it is possible this also occurs with other languages and diacritics.
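Until the mapping is fixed, a post-processing workaround (assuming the combining tilde U+0303 is the desired target, as the report suggests) is a direct replacement of the Bengali Chandrabindu (U+0981):

```python
out = 'ɦaঁ'
# Swap the leftover Bengali Chandrabindu for the IPA combining tilde.
fixed = out.replace('\u0981', '\u0303')
print(fixed)  # → ɦã
```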
I downloaded ipa-xsampa.csv and found some errors in the data, e.g.: R\ in X-SAMPA maps to both "vd uvular fricative" and "vl uvular trill", and "glottal plosive" has two identical rows. I modified them based on Wikipedia. I think you may like to check the modified file: ipa-xsampa-modified.csv.txt. Note that I modified the file according to my requirements, so it might not suit your needs.
Thanks for the data!
Is it correct to understand from the paper that no machine learning is involved or directly integrated into Epitran? Or could you point me in the right direction?
Hi David,
Related to the issue I raised on 13 May 2019 (list of all IPA characters used for English and Polish): what do the words 'pau' and 'null' at the start of the file arpabet.csv mean?
Cheers,
Adnane
I wanted to give this a whirl but hit a speed bump from the get-go:
In [2]: epi = epitran.Epitran('eng-Latn')
In [3]: epi.transliterate('iceland')
WARNING:root:lex_lookup (from flite) is not installed.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-3-54aaf7e8072d> in <module>()
----> 1 epi.transliterate('iceland')
/home/jeremy/.local/lib/python3.6/site-packages/epitran/_epitran.py in transliterate(self, word, normpunc, ligatures)
60 unicode: IPA string
61 """
---> 62 return self.epi.transliterate(word, normpunc, ligatures)
63
64 def reverse_transliterate(self, ipa):
/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in transliterate(self, text, normpunc, ligatures)
89 for chunk in self.chunk_re.findall(text):
90 if self.letter_re.match(chunk):
---> 91 acc.append(self.english_g2p(chunk))
92 else:
93 acc.append(chunk)
/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in english_g2p(self, text)
205 logging.warning('Non-zero exit status from lex_lookup.')
206 arpa_text = ''
--> 207 return self.arpa_to_ipa(arpa_text)
/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in arpa_to_ipa(self, arpa_text, ligatures)
73 arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
74 ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
---> 75 text = ''.join(ipa_list)
76 return text
77
/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in <lambda>(d)
72 arpa_list = self.arpa_text_to_list(arpa_text)
73 arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
---> 74 ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
75 text = ''.join(ipa_list)
76 return text
KeyError: ''
In [4]: epi2 = epitran.Epitran('rus-Cyrl')
In [7]: epi.transliterate('')
Out[7]: ''
In [9]: epi2.transliterate('Приве́т')
Out[9]: 'prʲivʲét'
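The KeyError: '' in the traceback above comes from an empty ARPABET token reaching the arpa_map dictionary lookup. A defensive rewrite of that mapping step would skip empty or unknown tokens before lookup. The sketch below uses a toy mapping table, not Epitran's actual arpa_map:

```python
import re

# Toy ARPABET -> IPA table for illustration; Epitran loads the full table from arpabet.csv.
ARPA_MAP = {'hh': 'h', 'ah': 'ʌ', 'l': 'l', 'ow': 'oʊ'}

def arpa_to_ipa(arpa_text: str) -> str:
    """Convert a space-separated ARPABET string to IPA, skipping empty/unknown tokens."""
    tokens = arpa_text.strip().split()
    tokens = (re.sub(r'\d', '', t) for t in tokens)  # strip stress digits, e.g. 'ow1' -> 'ow'
    return ''.join(ARPA_MAP[t] for t in tokens if t in ARPA_MAP)

print(arpa_to_ipa('hh ah l ow1'))  # 'hʌloʊ'
print(arpa_to_ipa(''))             # '' instead of raising KeyError
```

Filtering with `if t in ARPA_MAP` is what the failing `map(lambda d: self.arpa_map[d], arpa_list)` line lacks: an empty string produced by splitting empty Flite output falls straight into the dictionary lookup.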
I am trying to use Epitran to create IPA conversions for English sentences, and it doesn't produce results I expect for some common words.
import epitran
epi = epitran.Epitran('eng-Latn')
epi.transliterate("was does buzz")
'wɑz dowz bʌz'
Note that the IPA for 'does' contains a 'w'. Looking through dictionaries, I find ˈdəz and dɪz. When all three are put into a simple IPA reader, the dictionary versions sound correct and Epitran's translation sounds wrong.
Which is wrong: the IPA reader at https://itinerarium.github.io/phoneme-synthesis/ or this library's output?
When I am running this code:
import epitran
epi = epitran.Epitran('eng-Latn')
print (epi.transliterate(u'Berkeley'))
I am using Python 3. Would you please help me fix this error?
File "/home/hamada/.local/lib/python3.6/site-packages/epitran/flite.py", line 212, in english_g2p
arpa_text = arpa_text.splitlines()[0]
IndexError: list index out of range
For example, За́мок or замо́к.
@dmort27
For instance, in Cebuano:
felix --> [e, l, i]
x --> []
In Swedish:
och --> []
I fixed this (I think) by simply replacing the commented lines below with the uncommented one. Maybe this is horribly wrong, but it seems to work now.
#ipa_segs = self.ft.ipa_segs(self.epi.strict_trans(word, normpunc,
# ligaturize))
ipa_segs = self.ft.segs_safe(self.epi.transliterate(word, normpunc, ligaturize))
How does Epitran transliterate contractions? It seems that the package has difficulties with them. For example:
Simply concatenating the contractions seems to give better results in some cases. Why is that the case?
When I use Epitran with eng-Latn, it tells me that lex_lookup (from flite) is not installed. But I have installed lex_lookup.
(base) [root@host-10-29-0-161 testsuite]# make lex_lookup
Makefile:83: warning: overriding recipe for target `multi_thread'
Makefile:80: warning: ignoring old recipe for target `multi_thread'
make: `lex_lookup' is up to date.
Error:
import epitran
epi = epitran.Epitran('eng-Latn')
print(epi.transliterate(u'Berkeley'))
WARNING:root:lex_lookup (from flite) is not installed.
Traceback (most recent call last):
File "", line 1, in
File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/_epitran.py", line 62, in transliterate
return self.epi.transliterate(word, normpunc, ligatures)
File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/flite.py", line 96, in transliterate
acc.append(self.english_g2p(chunk))
File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/flite.py", line 214, in english_g2p
arpa_text = arpa_text.splitlines()[0]
IndexError: list index out of range
epi = epitran.Epitran("ita-Latn")
epi.transliterate("motorizzazione")
returns 'motorit͡sasione', but it should be at least 'motorit͡sat͡sione' or, better, 'motorit͡st͡sat͡st͡sione' (the semivowel "i" should be "j", but I do not know how fine-grained the transliteration is supposed to be)
Hello all,
First off I wanted to say well done on Epitran! It is a tool that has proven useful for many projects of mine.
I stumbled across something today and I wanted to know if Epitran was designed to do this, or if it's a bug. I noticed that words in German have different IPA transliterations when punctuation is added. As far as I can tell, this doesn't happen in any other language (I tried to reproduce the error in Polish, Russian, and English).
Examples:
For the last two examples, I could accept the transliterations that are produced when punctuation is added to the string, when phonetic environment and dialect are taken into consideration. However, to my knowledge, I don't know of any case where 'heute' should have an 'h' after the diphthong in its transliteration.
When using the transliterate function, I normally use normpunc=True and ligatures=True, but even disabling those flags produces the same results. I also used pip to check that I was using the latest version of Epitran.
I would really appreciate some info on this matter, as it will guide my future projects. Thanks a lot for your time!
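In the meantime, one workaround is to apply the transliterator only to word chunks and pass punctuation through untouched, so that punctuation can never influence the result. This is a minimal sketch; the chunking regex and helper name are my own, not Epitran API:

```python
import re

def transliterate_words_only(translit, text):
    """Apply a transliteration function to alphabetic chunks only,
    passing punctuation and whitespace through unchanged."""
    return ''.join(translit(chunk) if chunk.isalpha() else chunk
                   for chunk in re.findall(r'\w+|\W+', text))

# Demo with a toy transliterator (upper-casing stands in for IPA conversion);
# in practice you would pass e.g. epi.transliterate here.
print(transliterate_words_only(str.upper, 'heute, oder?'))  # 'HEUTE, ODER?'
```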
How to solve this problem?
import epitran
epi = epitran.Epitran('eng-Latn')
epi.transliterate('Hello')
WARNING:root:lex_lookup (from flite) is not installed.
Traceback (most recent call last):
File "<ipython-input-3-9e6f98d7c4c9>", line 1, in <module>
epi.transliterate('Hello')
File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\_epitran.py", line 62, in transliterate
return self.epi.transliterate(word, normpunc, ligatures)
File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\flite.py", line 94, in transliterate
acc.append(self.english_g2p(chunk))
File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\flite.py", line 212, in english_g2p
arpa_text = arpa_text.splitlines()[0]
IndexError: list index out of range
It works for other languages, but not for English.
Thank you for this great tool!
I was hoping to use Epitran to extract frequencies of grapheme-phoneme alignments in different languages, but I am running into issues when using the word_to_tuples and word_to_segs features.
Here is the output of epi.word_to_tuples for the word 'tough' in English:
('L', 0, 't', 't', [('t', <map object at 0x113817c50>)])
('L', 0, 'o', 'ʌ', [('ʌ', <map object at 0x113817250>)])
('L', 0, 'u', 'f', [('f', <map object at 0x1120a06d0>)])
('L', 0, 'g', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'h', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
Here is the output for 'choice':
('L', 0, 'c', 't͡ʃ', [('t͡ʃ', <map object at 0x11380cad0>)])
('L', 0, 'h', 'o', [('o', <map object at 0x11380c5d0>)])
('L', 0, 'o', 'j', [('j', <map object at 0x11380cb10>)])
('L', 0, 'i', 's', [('s', <map object at 0x1120a0fd0>)])
('L', 0, 'c', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'e', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
I'd expect the phonetic form /f/ in 'tough' to correspond to either 'g' or 'h', and the phonetic form /s/ in 'choice' to correspond to 'c'. However, that's not the case. I am wondering if this is expected behavior or a bug?
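The misalignment happens because multi-letter graphemes like ⟨gh⟩ in 'tough' or ⟨ce⟩ in 'choice' consume two letters, while the tuple view pairs letters with phonemes one by one, pushing the remainder to empty entries. A greedy longest-match alignment over a toy grapheme table shows the grouping you would expect; this is a sketch of the general technique, not Epitran's internal algorithm (English actually goes through Flite/ARPABET):

```python
# Toy grapheme -> IPA table for illustration only.
G2P = {'t': 't', 'ou': 'ʌ', 'gh': 'f', 'ch': 't͡ʃ', 'oi': 'ɔj', 'ce': 's'}

def align(word):
    """Greedily match the longest known grapheme at each position."""
    out, i = [], 0
    while i < len(word):
        for size in (2, 1):  # try digraphs before single letters
            chunk = word[i:i + size]
            if chunk in G2P:
                out.append((chunk, G2P[chunk]))
                i += size
                break
        else:
            out.append((word[i], ''))  # unknown grapheme maps to nothing
            i += 1
    return out

print(align('tough'))   # [('t', 't'), ('ou', 'ʌ'), ('gh', 'f')]
print(align('choice'))  # [('ch', 't͡ʃ'), ('oi', 'ɔj'), ('ce', 's')]
```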