Comments (13)
The default Ubuntu Docker image doesn't have en_US.UTF-8. That's the warning you're getting when exporting. Try:
RUN apt-get update --fix-missing && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
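To confirm from inside the container that the locale was actually generated, a small check like the following may help. This is only a sketch: locale_is_available is a hypothetical helper name, and it relies solely on the standard locale module.

```python
import locale


def locale_is_available(name):
    """Return True if the C library can switch to the named locale."""
    # Calling setlocale without a value only queries the current setting.
    saved = locale.setlocale(locale.LC_ALL)
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        # Raised when the locale has not been generated on this system.
        return False
    finally:
        # Put the process back the way we found it.
        locale.setlocale(locale.LC_ALL, saved)


print("en_US.UTF-8 available:", locale_is_available("en_US.UTF-8"))
```

If this prints False, the export of LC_ALL will produce the same "cannot change locale" warning from bash.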
from text.
Yes, the sys.getdefaultencoding() value looks unexpected. Python 3 changed the default text encoding to UTF-8, but only when LC_CTYPE is Unicode-aware.
I'm betting that echo $LANG and echo $LC_CTYPE will print C or something similar on your machine. Try setting these environment variables beforehand and let me know how that goes:
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
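To see what Python itself picked up from the environment, something like the sketch below can be run before and after the exports. In Python 3, open() without an encoding argument uses locale.getpreferredencoding(False), so that is the value to watch. The function name encoding_report is made up for illustration.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that influence how text files are decoded."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # This is what open() falls back to when no encoding is given.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
    }


for key, value in sorted(encoding_report().items()):
    print("{}: {}".format(key, value))
```

Under a C/POSIX locale the preferred encoding is typically an ASCII alias, which would match the 'ascii' codec error in the traceback.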
from text.
I'm confused. You're on Python 3.5, but it looks like the line.decode('utf-8') branch (in line 106) ran, which is behind a six.PY2 condition that should be False. Any idea what's going on there? Maybe insert some print statements?
from text.
Yeah, I'd expect the fix you mentioned (the encoding argument to open) to be a Python 2 fix. What are the values of the $LC_ALL environment variable and sys.getdefaultencoding() on your system?
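For reference, here is a rough sketch of what passing the encoding explicitly looks like. It is not torchtext's code: read_lines and the temporary demo file are assumptions for illustration. io.open accepts an encoding argument on both Python 2 and 3, so the result no longer depends on the locale.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding, ignoring the locale."""
    with io.open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Build a small demo file containing a non-ASCII character.
# "é" is encoded as the bytes 0xc3 0xa9 in UTF-8.
fd, demo_path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "wb") as f:
    f.write(u"caf\u00e9\tanswer\n".encode("utf-8"))

try:
    lines = read_lines(demo_path)  # decodes correctly as UTF-8
    try:
        read_lines(demo_path, encoding="ascii")  # mimics a C/POSIX locale
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True
finally:
    os.remove(demo_path)

print(lines, ascii_failed)
```

The ASCII read fails on byte 0xc3, the same byte reported in the traceback, while the explicit UTF-8 read succeeds regardless of LANG or LC_ALL.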
from text.
If I had to guess, I'd say maybe you're still running an older version of torchtext (e.g. in a Python session you've had open for a while) while the code in the dist-packages folder has been updated, so the traceback pulls source lines from there rather than from what's actually running.
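One way to check which copy of a package a session has imported is to look at the module's __file__ from a fresh interpreter. The sketch below uses a hypothetical helper, describe_module, and demonstrates it on the standard json module; substituting "torchtext" is an assumption about how it would be used here, and not every package defines __version__.

```python
import importlib


def describe_module(name):
    """Report where a module was imported from and its version, if any."""
    module = importlib.import_module(name)
    return {
        "file": getattr(module, "__file__", None),
        # Not every package sets __version__, so fall back to None.
        "version": getattr(module, "__version__", None),
    }


info = describe_module("json")
print(info["file"], info["version"])
```

Restarting the Python session before checking avoids comparing against a module that was loaded before the reinstall.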
from text.
Version of torchtext (most recent):
$ git log
commit df7b391d3c02471a2095170ee83c9de4586930e7
Author: Nelson Liu <[email protected]>
Date: Fri Jul 14 15:48:45 2017 -0700
    Fix lint
commit f411d83ecf63936d7f4062b9bbc1a667a07f2caf
Author: Nelson Liu <[email protected]>
Date: Fri Jul 14 15:48:17 2017 -0700
    Add non-regression test
@jekbradbury
Reinstall of torchtext:
Installed /usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg
Processing dependencies for torchtext==0.1.1
Finished processing dependencies for torchtext==0.1.1
@nelson-liu
System:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
Got the same error:
Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 56, in splits
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 107, in __init__
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 106, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)
Replicated the error in terminal:
>>> f = open('/root/qa/data/simple_questions_wikidata/train.tsv', 'r')
>>> [line for line in f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
from text.
Running this in Docker on a GPU machine.
Tried echo $LANG:
# echo $LANG
en_US.UTF-8
# echo $LC_CTYPE
Tried exporting:
# export LANG=en_US.UTF-8
# export LC_ALL=en_US.UTF-8
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
Python3 CLI:
>>> f = open('/root/qa/data/simple_questions_wikidata/train.tsv', 'r')
>>> [line.decode('utf-8') for line in f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)
Torchtext:
Traceback (most recent call last):
  File "example/tune.py", line 268, in <module>
    main()
  File "example/tune.py", line 208, in main
    dev_examples, train_examples = load_examples(options)
  File "/root/pytorch-seq2seq/example/lib/utils.py", line 222, in load_examples
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
from text.
bump; did this end up working / should we close this?
On another note, I've never been sure how to write code that works with Unicode on both py2 and py3. Much of it hinges on the assumption that py3 uses Unicode by default, which isn't always true. Should we refactor things to check the default encoding instead, or (probably more sane) add something to the README about setting locales properly so that py3 uses Unicode by default?
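As a rough idea of what "checking the default encoding" could look like, here is a sketch using only the standard library. The helper names preferred_encoding_is_utf8 and warn_if_not_utf8 are hypothetical and nothing like this is claimed to exist in torchtext.

```python
import codecs
import locale
import warnings


def preferred_encoding_is_utf8():
    """Return True if files opened without an encoding decode as UTF-8."""
    name = locale.getpreferredencoding(False)
    try:
        # codecs.lookup normalizes aliases such as "UTF8" to "utf-8".
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def warn_if_not_utf8():
    """Emit a warning when the locale would make text reads non-UTF-8."""
    if preferred_encoding_is_utf8():
        return False
    warnings.warn(
        "Preferred encoding is %r, not UTF-8; set LANG/LC_ALL to a UTF-8 "
        "locale or pass an explicit encoding when opening files."
        % locale.getpreferredencoding(False)
    )
    return True


print("UTF-8 by default:", preferred_encoding_is_utf8())
```

A library could call warn_if_not_utf8() once at import time, though passing an explicit encoding at every open call sidesteps the locale question entirely.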
from text.
This ended up working!
from text.
Sorry to bump this, but I've run into the same problem even though on my machine (Red Hat 6.9) I have the LANG and LC_ALL variables set to en_US.UTF-8. I think part of it might be that I'm trying to use Python 3 to load models that were saved with Python 2.
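If the saved models are pickle-based (an assumption, since the comment doesn't say how they were saved), the Python 2 to 3 mismatch can be reproduced with the standard pickle module. The sketch below is Python 3 only and uses a hand-written byte string that mimics a protocol 0 pickle of a Python 2 str; pickle.loads accepts an encoding argument that controls how such strings are decoded.

```python
import pickle

# Hand-written to mimic what Python 2 emits for a str holding the
# UTF-8 bytes of "café" at protocol 0 (illustrative data only).
PY2_PICKLE = b"S'caf\\xc3\\xa9'\np0\n."

# Python 3 decodes Python 2 str objects as ASCII by default, which fails
# on non-ASCII bytes.
try:
    pickle.loads(PY2_PICKLE)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# encoding="bytes" keeps the raw bytes so they can be decoded explicitly.
raw = pickle.loads(PY2_PICKLE, encoding="bytes")
text = raw.decode("utf-8")

print(default_failed, raw, text)
```

In that case the locale settings would not matter, because the decoding happens inside the unpickler rather than in open().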
from text.
The default Ubuntu Docker image doesn't have en_US.UTF-8. That's the warning you're getting when exporting. Try:
RUN apt-get update --fix-missing && apt-get locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
@nelson-liu
Minor, but it could be helpful: can you edit apt-get locales to apt-get install locales?
from text.
thanks @rishabh1212
from text.
apt-get update --fix-missing && apt-get install locales
@nelson-liu Thanks for your solution. I encountered a similar issue and solved it with your suggestion.
from text.