Comments (13)

nelson-liu commented on May 20, 2024 16

The default Ubuntu Docker image doesn't have the en_US.UTF-8 locale generated. That's what the setlocale warning you get when exporting is about. Try:

RUN apt-get update --fix-missing && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
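
After rebuilding the image, it may help to confirm from inside the container that Python actually picked up the locale. The snippet below is only a sketch: the function name encoding_report is made up for illustration, and it relies on nothing beyond the standard library.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that influence how text files are decoded."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        # On Python 3.5, open() without an encoding argument uses this value.
        "preferred_encoding": locale.getpreferredencoding(False),
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("{}: {}".format(name, value))
```

If preferred_encoding still reports an ASCII variant after the Dockerfile change, the locale was probably not generated or the ENV lines did not reach the running process.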

from text.

nelson-liu commented on May 20, 2024 11

Yes, the sys.getdefaultencoding() value looks unexpected. Python 3 defaults to UTF-8 for decoding text, but only when the locale (LC_CTYPE) is Unicode-aware.

I'm betting that echo $LANG and echo $LC_CTYPE will print C or something on your machine -- try setting these environment variables beforehand and let me know how that goes:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
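
To see how much these variables matter on a given machine, one option is to start a child interpreter under different locale settings and ask it which encoding it would use for text files. This is a rough sketch; preferred_encoding_under is a hypothetical helper, and the result depends on the Python version and on which locales are installed, so no particular output is promised.

```python
import os
import subprocess
import sys


def preferred_encoding_under(lang):
    """Run a child interpreter with LANG and LC_ALL set to `lang` and
    return the preferred text encoding it reports."""
    env = dict(os.environ, LANG=lang, LC_ALL=lang)
    output = subprocess.check_output(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
    )
    return output.decode("ascii").strip()


if __name__ == "__main__":
    for lang in ("C", "en_US.UTF-8"):
        print(lang, "->", preferred_encoding_under(lang))
```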

jekbradbury commented on May 20, 2024

I'm confused. You're on Python 3.5, but it looks like the line.decode('utf-8') branch (on line 106) ran, which is behind a six.PY2 condition that should be False. Any idea what's going on there? Maybe insert some print statements?

nelson-liu commented on May 20, 2024

Yeah, I'd expect the fix you mentioned (an encoding argument to open) to be a Python 2 fix. What do the $LC_ALL environment variable and sys.getdefaultencoding() report on your system?
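
For reference, passing the encoding explicitly sidesteps the locale entirely. The sketch below uses io.open, which accepts an encoding argument on both Python 2 and Python 3; read_lines and the sample file are illustrative only and are not torchtext code.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read text lines with an explicit encoding instead of the locale default."""
    # io.open takes `encoding` on both Python 2 and Python 3.
    with io.open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    # Hypothetical sample file; 0xc3 0xa9 is the UTF-8 encoding of "é".
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(fd, "wb") as f:
        f.write(b"caf\xc3\xa9\tcoffee\n")
    try:
        print(read_lines(path))
    finally:
        os.remove(path)
```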

jekbradbury commented on May 20, 2024

If I had to guess, I'd say maybe you're still running an older version of torchtext (e.g. in a Python session you've had open for a while) but the code in the dist-packages folder has been updated (and the traceback pulls from there rather than what's actually running).

PetrochukM commented on May 20, 2024

Version of torchtext (most recent):

$ git log
commit df7b391d3c02471a2095170ee83c9de4586930e7
Author: Nelson Liu <[email protected]>
Date:   Fri Jul 14 15:48:45 2017 -0700

    Fix lint

commit f411d83ecf63936d7f4062b9bbc1a667a07f2caf
Author: Nelson Liu <[email protected]>
Date:   Fri Jul 14 15:48:17 2017 -0700

    Add non-regression test

@jekbradbury
Reinstall of torchtext:

Installed /usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg
Processing dependencies for torchtext==0.1.1
Finished processing dependencies for torchtext==0.1.1

@nelson-liu
System:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Got the same error:

Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 56, in splits
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 107, in __init__
  File "/usr/local/lib/python3.5/dist-packages/torchtext-0.1.1-py3.5.egg/torchtext/data/dataset.py", line 106, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)

Replicated the error in terminal:

>>> f = open('/root/qa/data/simple_questions_wikidata/train.tsv', 'r')
>>> [line for line in f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
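
The same failure can be reproduced without the data file. Byte 0xc3 is the first byte of a two-byte UTF-8 sequence, so the ASCII codec rejects it while the UTF-8 codec decodes it. A minimal sketch (try_decode is an illustrative helper and the sample bytes are made up):

```python
# The first three bytes are ASCII; 0xc3 0xa9 is the UTF-8 encoding of "é".
raw = b"caf\xc3\xa9"


def try_decode(data, encoding):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


if __name__ == "__main__":
    print(repr(try_decode(raw, "ascii")))   # a UnicodeDecodeError instance
    print(repr(try_decode(raw, "utf-8")))   # the decoded text
```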

PetrochukM commented on May 20, 2024

Running this in Docker on a GPU machine.

Tried echo $LANG:

# echo $LANG
en_US.UTF-8
# echo $LC_CTYPE

Tried exporting:

# export LANG=en_US.UTF-8
# export LC_ALL=en_US.UTF-8
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

Python3 CLI:

>>> f = open('/root/qa/data/simple_questions_wikidata/train.tsv', 'r')
>>> [line.decode('utf-8') for line in f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 573: ordinal not in range(128)

Torchtext:

Traceback (most recent call last):
  File "example/tune.py", line 268, in <module>
    main()
  File "example/tune.py", line 208, in main
    dev_examples, train_examples = load_examples(options)
  File "/root/pytorch-seq2seq/example/lib/utils.py", line 222, in load_examples
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]

nelson-liu commented on May 20, 2024

Bump: did this end up working, or should we close this?

On another note, I've never been sure how to write Unicode-handling code that works on both py2 and py3. Much of it hinges on the assumption that py3 uses Unicode by default, which isn't always true. Should we refactor things to check the default encoding instead, or (probably more sane) add something to the README about setting locales properly so that py3 uses Unicode by default?
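
As a rough illustration of the "check the default encoding" option, a library could warn at load time when the locale would make open() decode with something other than UTF-8. This is only a sketch under that assumption; is_utf8 and check_text_encoding are hypothetical names and nothing like this is claimed to exist in torchtext.

```python
import codecs
import locale
import warnings


def is_utf8(name):
    """True if `name` is an alias of UTF-8, such as 'UTF-8' or 'utf8'."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def check_text_encoding():
    """Warn when the locale's preferred encoding is not UTF-8.

    Returns True if the preferred encoding is UTF-8, False otherwise.
    """
    encoding = locale.getpreferredencoding(False)
    if is_utf8(encoding):
        return True
    warnings.warn(
        "Preferred encoding is {!r}, not UTF-8; set LANG/LC_ALL to a UTF-8 "
        "locale or pass an explicit encoding when opening files.".format(encoding)
    )
    return False


if __name__ == "__main__":
    print(check_text_encoding())
```

A check like this only reports the problem; passing an explicit encoding wherever files are opened removes the dependency on the locale altogether.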

PetrochukM commented on May 20, 2024

This ended up working!

ianbstewart commented on May 20, 2024

Sorry to bump this, but I've run into the same problem even though on my machine (Red Hat 6.9) I have the LANG and LC_ALL variables set to en_US.UTF-8. I think part of it might be that I'm trying to use Python 3 to load models that were saved with Python 2.
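
If the Python 2 to Python 3 guess is right, the locale would not be the cause: Python 3's pickle decodes Python 2 byte strings as ASCII unless told otherwise. The sketch below (Python 3 only) uses a hand-written protocol-2 pickle of the kind Python 2 produces for a byte string; whether this matches the actual model files is an assumption, and load_py2_pickle is a hypothetical helper.

```python
import pickle

# Hand-written protocol-2 pickle of the Python 2 byte string 'caf\xc3\xa9'
# (UTF-8 bytes for "café"): PROTO 2, SHORT_BINSTRING of length 5, BINPUT, STOP.
PY2_PICKLE = b"\x80\x02U\x05caf\xc3\xa9q\x00."


def load_py2_pickle(data, encoding):
    """Unpickle Python 2 data, decoding its byte strings with `encoding`."""
    return pickle.loads(data, encoding=encoding)


if __name__ == "__main__":
    try:
        pickle.loads(PY2_PICKLE)  # the default encoding is ASCII
    except UnicodeDecodeError as exc:
        print("default failed:", exc)
    print(load_py2_pickle(PY2_PICKLE, "utf-8"))
    # encoding="bytes" leaves the byte string undecoded.
    print(load_py2_pickle(PY2_PICKLE, "bytes"))
```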

rishabh1212 commented on May 20, 2024

The default ubuntu docker image doesn't have en-US.UTF-8. That's the warning you're getting when exporting. Try:

RUN apt-get update --fix-missing && apt-get locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8

@nelson-liu
A minor point, but it may help others: can you edit apt-get locales to apt-get install locales?

nelson-liu commented on May 20, 2024

thanks @rishabh1212

rongduo commented on May 20, 2024

apt-get update --fix-missing && apt-get install locales

@nelson-liu Thanks for your solution. I encountered a similar issue and solved it with your suggestion.
