
UnicodeEncodeError about pyquery (closed)

gawel commented on August 20, 2024
UnicodeEncodeError

from pyquery.

Comments (21)

flisky commented on August 20, 2024

more interesting:

from pyquery import PyQuery as pq

doc = pq(url="http://cn.bing.com/")
doc('div:contains("必应")')
# raises UnicodeDecodeError
doc(u'div:contains("必应")')
# raises UnicodeEncodeError
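
A note on why the two calls fail differently: on Python 2, a byte string with non-ASCII bytes triggers an implicit ASCII decode, while a unicode string passed through str() triggers an implicit ASCII encode. The sketch below reproduces both directions explicitly using only the standard library, so it does not exercise pyquery itself; conversion_error is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
def conversion_error(action):
    """Run action() and return the name of the Unicode error it raises, if any."""
    try:
        action()
    except UnicodeError as exc:
        return type(exc).__name__
    return None


text = u"必应"                 # text (unicode)
raw = text.encode("utf-8")     # the same text as UTF-8 bytes

# text -> ASCII bytes: what str(unicode_value) does implicitly on Python 2
encode_error = conversion_error(lambda: text.encode("ascii"))

# bytes -> text via ASCII: what Python 2 does when mixing str and unicode
decode_error = conversion_error(lambda: raw.decode("ascii"))

print(encode_error, decode_error)
```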

flisky commented on August 20, 2024

I didn't dig deeply, but I see that XPathExpr#__str__ in cssselect returns unicode under py2, which I don't think is proper behavior.

gawel commented on August 20, 2024

Looks like a cssselect problem (or not). Ping @SimonSapin

SimonSapin commented on August 20, 2024

@flisky, are you really using pyquery master? Your traceback (path = str(path) and return str(self).decode('utf-8') in cssselectpatch.py) does not match what I see in pyquery/cssselectpatch.py@master or in the PyPI release.

I couldn’t reproduce a similar traceback, but I got either this:

  File "/home/simon/.virtualenvs/cssselect2/lib/python2.7/site-packages/pyquery/cssselectpatch.py", line 195, in xpath_contains_function
    value = str(function.arguments[0].value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u6211' in position 0: ordinal not in range(128)

(pyquery should use unicode() on Python 2 rather than str().) or this:

  File "/home/simon/.virtualenvs/cssselect2/lib/python2.7/site-packages/pyquery/cssselectpatch.py", line 196, in xpath_contains_function
    xpath.add_post_condition(
AttributeError: 'XPathExpr' object has no attribute 'add_post_condition'

… which is a bit more mysterious to me, as nothing seems wrong with the xpathexpr_cls mechanism.

Returning Unicode from __str__ on Python 2 is not "proper", but it kinda works if you use unicode() and not str(). I’d rather just use a .to_xpath() method that returns Unicode and not bother with the semantics of __str__ on Python 2 vs. 3, but I’m waiting on your reply re. pyquery version to do that in a way that won’t break pyquery.
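
A minimal sketch of the .to_xpath() idea, assuming a simplified stand-in class. XPathExprSketch and its attributes are illustrative only and are not cssselect's actual XPathExpr.

```python
# -*- coding: utf-8 -*-
class XPathExprSketch(object):
    """Hypothetical stand-in for an XPath expression builder."""

    def __init__(self, path=u"", element=u"*", condition=u""):
        self.path = path
        self.element = element
        self.condition = condition

    def to_xpath(self):
        """Always return text, regardless of the Python version."""
        xpath = self.path + self.element
        if self.condition:
            xpath += u"[%s]" % self.condition
        return xpath

    def __str__(self):
        # Fine on Python 3, where str is text. On Python 2, callers would
        # use to_xpath() and never rely on str().
        return self.to_xpath()


expr = XPathExprSketch(u"descendant-or-self::", u"a", u'contains(text(), "我")')
xpath = expr.to_xpath()
```

The point of the explicit method is that callers get text without depending on what __str__ is allowed to return.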

flisky commented on August 20, 2024

The traceback was wrong because I had manually edited the source file to debug. Sorry about that.

And yes, I'm in the master branch of pyquery.

SimonSapin commented on August 20, 2024

Could you copy a traceback you get with unedited master?

flisky commented on August 20, 2024
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    doc._css_to_xpath(u'a:contains("我")') 
  File "/home/jeff/.virtualenvs/scrapy/local/lib/python2.7/site-packages/pyquery-1.2.5-py2.7.egg/pyquery/pyquery.py", line 227, in _css_to_xpath
    return self._translator.css_to_xpath(selector, prefix)
  File "/home/jeff/.virtualenvs/scrapy/local/lib/python2.7/site-packages/cssselect-0.8-py2.7.egg/cssselect/xpath.py", line 188, in css_to_xpath
    for selector in selectors)
  File "/home/jeff/.virtualenvs/scrapy/local/lib/python2.7/site-packages/cssselect-0.8-py2.7.egg/cssselect/xpath.py", line 188, in <genexpr>
    for selector in selectors)
  File "/home/jeff/.virtualenvs/scrapy/local/lib/python2.7/site-packages/cssselect-0.8-py2.7.egg/cssselect/xpath.py", line 208, in selector_to_xpath
    xpath = self.xpath(tree)
  File "/home/jeff/.virtualenvs/scrapy/local/lib/python2.7/site-packages/cssselect-0.8-py2.7.egg/cssselect/xpath.py", line 230, in xpath
    return method(parsed_selector)
  File "/home/jeff/.virtualenvs/scrapy/local/lib/python2.7/site-packages/cssselect-0.8-py2.7.egg/cssselect/xpath.py", line 260, in xpath_function
    return method(self.xpath(function.selector), function)
  File "/home/jeff/.virtualenvs/scrapy/local/lib/python2.7/site-packages/pyquery-1.2.5-py2.7.egg/pyquery/cssselectpatch.py", line 226, in xpath_contains_function
    value = str(function.arguments[0].value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u6211' in position 0: ordinal not in range(128)

SimonSapin commented on August 20, 2024

Try changing the last line to value = unicode(function.arguments[0].value)?

(Of course, a real patch will need something like this for Python 3 support.)

try:
    unicode
except NameError:
    unicode = str
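
For illustration, the shim might be used like this. This is a sketch only: contains_argument is a hypothetical helper, not a pyquery function, and it assumes the value it receives is already text (on Python 3, calling str() on bytes would produce a "b'...'" representation rather than decoding them).

```python
# -*- coding: utf-8 -*-
try:
    text_type = unicode  # Python 2: the text type is `unicode`
except NameError:
    text_type = str      # Python 3: `str` is already text


def contains_argument(value):
    """Coerce a :contains() argument to text (hypothetical helper)."""
    return text_type(value)


argument = contains_argument(u"我")
```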

flisky commented on August 20, 2024

It works for unicode strings, but fails for byte strings that contain non-Latin characters under py2 (not tried on py3). I think we need to support both str & unicode, right?

SimonSapin commented on August 20, 2024

No. Selectors are text, and text is represented as Unicode in Python. If you want to transmit text as bytes the receiver needs additionally to know the character encoding that was used. Python 2 will try to be "helpful" and automatically convert between ASCII-encoded byte strings and Unicode strings, but that is often more problematic (hiding underlying bugs) than it helps. One of the major changes with Python 3 is a much stronger distinction between Unicode and bytes.

So, as an API designer: if you expect text (as opposed to binary data), expect Unicode. On Python 2 only, ASCII-encoded byte strings should be acceptable too, but it’s best to raise an exception on non-ASCII data (don’t try to guess the encoding.) This usually just happens with Python 2’s implicit conversion.
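
The API-side rule above could be sketched as follows. ensure_text is a hypothetical name; the sketch makes the ASCII-only acceptance explicit instead of relying on Python 2's implicit conversion, so it behaves the same way on Python 3.

```python
# -*- coding: utf-8 -*-
def ensure_text(value):
    """Return value as text, accepting ASCII-only bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        # Strict ASCII decode: non-ASCII bytes raise UnicodeDecodeError
        # rather than having their encoding guessed.
        return value.decode("ascii")
    if isinstance(value, type(u"")):
        return value
    raise TypeError("expected text, got %s" % type(value).__name__)


ascii_selector = ensure_text(b"div.header")
text_selector = ensure_text(u'div:contains("必应")')
```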

As a user: if it’s text, use Unicode. If there is any chance that it contains non-ASCII characters, use Unicode. For a literal string that only contains ASCII, it is acceptable to omit the u prefix: it will usually just work because of the above. (And it’s more convenient on Python 2 because bytes are the default string type.)
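
On the user side, this means decoding bytes yourself, with an encoding you actually know, before building the selector. A small sketch, assuming the bytes are UTF-8 (the variable names are illustrative):

```python
# -*- coding: utf-8 -*-
# Bytes as they might arrive from a file or socket; UTF-8 is assumed here.
raw_label = u"必应".encode("utf-8")

# Decode explicitly, then build the selector as text.
label = raw_label.decode("utf-8")
selector = u'div:contains("%s")' % label
```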

More on Unicode and character encoding in Python: http://nedbatchelder.com/text/unipain.html

flisky commented on August 20, 2024

Ok, I see, and thank you for your answer. Many libraries & frameworks (like the Django ORM) support both unicode & string, so I used to think it was a best practice to follow.

Maybe we could put this in the docs to clarify? It helps (at least to me).

SimonSapin commented on August 20, 2024

Do you have an example of an API that accepts non-ASCII bytes for text? Do you know if they guess the encoding, just assume UTF-8, or do something else?

I usually just write "Unicode string" in docs and omit the "ASCII bytes are acceptable on Python 2". See eg. http://pythonhosted.org/cssselect/#cssselect.parse

SimonSapin commented on August 20, 2024

I’m gonna consider this resolved with no change in cssselect. (The fix is using unicode() instead of str() in pyquery’s xpath_contains_function.) Does that sound fine to everyone?

flisky commented on August 20, 2024

def force_text(s, encoding='utf-8', strings_only=False, errors='strict') from django.utils.encoding.
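
Going by the signature quoted above, the encoding defaults to 'utf-8', so a helper of this kind assumes UTF-8 rather than guessing. A simplified sketch of that behavior follows; force_text_sketch is a hypothetical name, it omits strings_only, and it is not Django's implementation.

```python
# -*- coding: utf-8 -*-
def force_text_sketch(s, encoding="utf-8", errors="strict"):
    """Return s as text, decoding bytes with the given encoding (illustrative)."""
    if isinstance(s, bytes):
        return s.decode(encoding, errors)
    # Anything else is converted with the text type (str on Python 3).
    return type(u"")(s)


decoded = force_text_sketch(u"必应".encode("utf-8"))
number = force_text_sketch(42)
```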

LGTM.

SimonSapin commented on August 20, 2024

Well ok. But force_text is about nothing but handling bytes vs. text. It’s not an unrelated API that will implicitly do the conversion. Or is this function called a lot by other Django APIs?

flisky commented on August 20, 2024

Yes. It's called a lot.

ack -l "django.utils.encoding" |wc -l # -> 140 (source files)

flisky commented on August 20, 2024

And total source files: ack -g *.py|wc -l # -> 195 (source files) [Django master without tests]

gawel commented on August 20, 2024

Looks like I've fixed the problem just by using from __future__ import unicode_literals in cssselectpatch

drzraf commented on August 20, 2024

I'm experiencing this UnicodeError, but clearly on the pyquery side:
$ python2.7 -c 'from pyquery import PyQuery as p; print(p("é").html());'

é

$ python2.7 -c 'from pyquery import PyQuery as p; print(p("é").html());'|cat

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

(the only difference being piping to cat)

This incorrect encoding in the first output may be avoided (by passing the result of lxml.html.fromstring to PyQuery).
But the fact that there is a (fatal) UnicodeEncodeError when stdout is piped is really weird.
[the setup is full utf-8]

flisky commented on August 20, 2024

@drzraf, it's clearly not pyquery's bug.

It happens on the python side, because python cannot detect the encoding for the tty.
PYTHONIOENCODING may help.

[YES, It's really weird before you understand this...]
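
Some background on the above: when stdout is a pipe rather than a terminal, Python 2 has nothing to detect an encoding from and falls back to ASCII, and PYTHONIOENCODING overrides that choice. The sketch below shows the override by starting a child interpreter with piped stdout and asking which encoding it picked; child_stdout_encoding is a hypothetical helper name.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Report the stdout encoding of a child Python whose stdout is a pipe."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalize the reported name (e.g. "UTF-8" vs "utf-8").
    return codecs.lookup(out.decode("ascii").strip()).name


print(child_stdout_encoding("utf-8"))
```

From a shell, the equivalent is to set PYTHONIOENCODING=utf-8 in the environment of the python command that is being piped.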

drzraf commented on August 20, 2024

On Wed, Apr 15, 2015 at 11:30:03PM -0700, 尹吉峰 wrote:

> @drzraf, it's clearly not pyquery's bug.
>
> It happens on the python side, because python cannot detect the encoding for the tty.
> PYTHONIOENCODING may help.

excellent! it did the trick

Then using

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

in the script did it too.

[the fact that without any option the utf-8 character is double-encoded
in the pyquery output is another issue [python2, lxml or pyquery], but
there are workarounds]
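
Note that sys.setdefaultencoding exists only on Python 2 and changes implicit conversions for the whole process. A narrower alternative, which is a different technique from the one above, is to wrap the output stream in an explicit UTF-8 writer. A sketch, writing to an in-memory buffer instead of the real stdout; utf8_writer is a hypothetical name.

```python
# -*- coding: utf-8 -*-
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    # newline="" disables newline translation, keeping the output identical
    # across platforms.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")


buffer = io.BytesIO()      # stands in for a piped stdout
out = utf8_writer(buffer)
out.write(u"é\n")
out.flush()
written = buffer.getvalue()
```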

many thanks!
