Comments (14)
Profiling with cProfile shows that the issue is indeed caused by a regression in Python >= 3.7, more precisely by the sre_parse._uniq function, which did not exist in Python <= 3.6.
I've created a PR on the Python repo that fixes the issue we have here. See: python/cpython#15030
I came up with a very dirty quick fix (to be run before importing sacremoses, only on Python >= 3.7):
import sre_parse
sre_parse._uniq = lambda x: list(dict.fromkeys(x))
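A slightly more defensive variant of the same workaround might look like the sketch below. The helper names `ordered_unique` and `patch_sre_uniq` are made up for illustration, and the guards are assumptions: the patch is skipped when `sre_parse` or its `_uniq` attribute is missing, and on interpreters that already ship a fixed `_uniq` it is simply redundant.

```python
import sys
import warnings


def ordered_unique(items):
    """Deduplicate while keeping first-seen order, in linear time."""
    return list(dict.fromkeys(items))


def patch_sre_uniq():
    """Replace sre_parse._uniq if this interpreter exposes it.

    Returns True when the replacement was installed, False otherwise.
    """
    if sys.version_info < (3, 7):
        return False  # _uniq did not exist before Python 3.7
    try:
        with warnings.catch_warnings():
            # Newer interpreters may warn that sre_parse is deprecated.
            warnings.simplefilter("ignore", DeprecationWarning)
            import sre_parse
    except ImportError:
        return False
    if not hasattr(sre_parse, "_uniq"):
        return False
    sre_parse._uniq = ordered_unique
    return True


patched = patch_sre_uniq()
# Import sacremoses only after this point, so its regexes compile
# with the replacement in place.
```

Like the original two-liner, this relies on a private attribute of the standard library, so it is best treated as a temporary measure until the interpreter itself is upgraded.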
from sacremoses.
I'm on macOS 10.14.1, Python 3.7.1 (via Anaconda). I'll run some more tests on my end with English later today on different OSes and Python installs to see if I can isolate the problem.
With English, it actually seems to hang forever (I Ctrl-C'd the process after waiting for 15 minutes).
I think it's getting hung compiling regular expressions somewhere.
In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang='en')
In [3]: mt.tokenize("Hello, Mr. Smith!")
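One way to check the suspicion that regex compilation is the bottleneck is to profile a single `re.compile` call on a large character class. The sketch below is only an illustration: `profile_compile` is a hypothetical helper, and the 2000-character CJK class is a small stand-in for the much larger perluniprops lists, not the pattern sacremoses actually builds.

```python
import cProfile
import io
import pstats
import re


def profile_compile(pattern):
    """Profile one re.compile call and return the pstats report as text."""
    re.purge()  # clear re's internal cache so the pattern is really compiled
    profiler = cProfile.Profile()
    profiler.enable()
    re.compile(pattern)
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries
    return buf.getvalue()


# Stand-in for a huge character class: 2000 consecutive CJK ideographs.
chars = "".join(chr(cp) for cp in range(0x4E00, 0x4E00 + 2000))
pattern = "[" + chars + "]"

report = profile_compile(pattern)
print(report)
```

If a parser-internal function dominates the cumulative time in the report, the slowdown is in pattern compilation rather than in matching.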
The first behavior, slowness the first time the tokenizer is used, seems reasonable. The regexes are compiled and cached, and with the new expanded perluniprops files they're huge, so it makes sense.
The second behavior, English hanging, shouldn't happen:
[in]:
%time
mt.tokenize("Hello, Mr. Smith!")
[out]:
CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.15 µs
It might be some cache in the perluniprops files or a system problem.
Which OS are you using? Which Python version?
But it does look like the new version with the full perluniprops feels slower =(
I'll run some benchmarks.
I'm having the same issue, Ubuntu 18.04 and Python 3.7.3
It works with python 3.6.8
Same issue here. Seems it's something wrong with re on Python 3.7
@myleott @johnfarina could you try upgrading Sacremoses? The current version should be 0.0.24.
The slowness doesn't seem to come from the Python distribution. If anything, upgrading Python should speed up regexes; see https://docs.python.org/3/whatsnew/3.7.html (esp. regex compilation with flags).
It's probably because of the unichars -au inclusion of unnamed characters from Perluniprops, added to resolve the CJK tokenization issues from https://github.com/alvations/sacremoses/issues/42. That caused the list of characters in IsAlpha to grow from 21674 to 476052 bytes, and IsAlnum from 22414 to 478372.
That was too high a performance cost for perfect accuracy on all possible characters, so the new version falls back to unichars without -au and statically adds the CJK characters as needed, instead of adding the universe of alphanumeric characters all the time.
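To illustrate the idea of a small base class plus statically added CJK characters, here is a rough sketch. The names `BASE_ALNUM`, `CJK_RANGES` and `NON_ALNUM` and the specific ranges are assumptions chosen for illustration; the actual Sacremoses character lists and patterns are different and much larger.

```python
import re

# Hypothetical small base class (ASCII only), standing in for the
# perluniprops-derived IsAlnum list.
BASE_ALNUM = "0-9A-Za-z"

# Statically added CJK blocks, written as ranges rather than as an
# enumeration of every single character.
CJK_RANGES = (
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\uac00-\ud7a3"  # Hangul syllables
)

# Matches any character that is neither alphanumeric (under the
# assumptions above) nor whitespace.
NON_ALNUM = re.compile("[^" + BASE_ALNUM + CJK_RANGES + r"\s" + "]")

symbols = NON_ALNUM.findall("记者, Hello!")
print(symbols)
```

Three ranges add only three items to the character class, whereas enumerating every code point individually is what makes the compiled set so large.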
P/S: Weird that the PR auto-closes the issue....
Substantial improvement for Korean with version 0.0.24!
In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="ko")
In [3]: %time mt.tokenize("세계 에서 가장 강력한")
CPU times: user 5.84 s, sys: 41.7 ms, total: 5.88 s
Wall time: 6.04 s
Out[3]: ['세계', '에서', '가장', '강력한']
English is slower, weirdly:
In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="en")
In [3]: %time mt.tokenize("Hello, World!")
CPU times: user 11.6 s, sys: 89.9 ms, total: 11.7 s
Wall time: 11.9 s
Out[3]: ['Hello', ',', 'World', '!']
and Chinese takes almost 2 minutes on my machine, which is still a bit painful:
In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="zh")
In [3]: %time mt.tokenize("记者 应谦 美国")
CPU times: user 1min 54s, sys: 878 ms, total: 1min 55s
Wall time: 1min 56s
Out[3]: ['记者', '应谦', '美国']
@johnfarina Which Python version are you using for the above benchmark?
I have the same issue. It looks like it is indeed related to Python 3.7 🤔:
With Python 3.6.1 (Amazon Linux), sacremoses 0.0.24:
In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="en")
In [3]: %time mt.tokenize("Hello, World!")
CPU times: user 220 ms, sys: 0 ns, total: 220 ms
Wall time: 220 ms
Out[3]: ['Hello', ',', 'World', '!']
With Python 3.7.3 (Amazon Linux), sacremoses 0.0.24:
In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="en")
In [3]: %time mt.tokenize("Hello, World!")
CPU times: user 21.1 s, sys: 10 ms, total: 21.1 s
Wall time: 21.1 s
Out[3]: ['Hello', ',', 'World', '!']
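For anyone who wants to repeat this comparison outside IPython, a small script along these lines records the interpreter version next to the timing. `time_call` is a hypothetical helper, and the sacremoses part only runs if the package is installed; everything it uses from sacremoses (`MosesTokenizer`, `tokenize`) appears in the sessions above.

```python
import platform
import time


def time_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


try:
    from sacremoses import MosesTokenizer
except ImportError:
    MosesTokenizer = None  # sacremoses not installed; skip the benchmark

if MosesTokenizer is not None:
    mt = MosesTokenizer(lang="en")
    # The first call includes regex compilation, which is the slow part here.
    tokens, seconds = time_call(mt.tokenize, "Hello, World!")
    print(platform.python_version(), tokens, round(seconds, 3))
```

Running the same script under each interpreter in a fresh process keeps the regex caches from skewing the comparison.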
@johnfarina Which Python version are you using for the above benchmark?
This was python 3.7.3 (via anaconda) on Mac OS 10.14.1. I tried the same with 3.7.1 on Mac and Ubuntu 16.04 too with similar results.
Thanks @yannvgn!! Great to see this resolved!