Text encoding issues about pyzotero (closed)

urschrei commented on June 7, 2024
Text encoding issues

from pyzotero.

Comments (10)

urschrei commented on June 7, 2024

I'm not actually decoding or encoding the returned content anywhere, so I wonder whether requests is double-encoding the returned UTF-8. Anyway, I'm digging into it now.

urschrei commented on June 7, 2024

OK, I'm now explicitly encoding bibliography strings as UTF-8. Can you pull the dev branch and see if this works for you?

fractaledmind commented on June 7, 2024

Just confirming that the new build fixes the issue. Thank you.

smathot commented on June 7, 2024

This issue does not appear to be fixed yet. It doesn't always happen, which makes me think it's not a simple text-encoding issue. But sometimes it does.

For example, here you see two items with special characters, only one of which is correctly encoded:

# zot = zotero instance
s = zot.item('FRHUMMC6', content='citation', style='chicago-author-date')
print s
print s[0]
s = zot.item('SBRWRSCQ', content='citation', style='chicago-author-date')
print s
print s[0]

output:

[u'<span>(Tka\u010dik et al. 2011)</span>']
<span>(Tkačik et al. 2011)</span>
[u'<span>(Aln\u0102\u015as et al. 2014)</span>']
<span>(AlnĂŚs et al. 2014)</span>

Here, 'Tkačik' is correct, but 'AlnĂŚs' should have been 'Alnæs'. I've tested this with the latest snapshot (5dc5bdd).
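
For reference, the garbled form matches what you get when the UTF-8 bytes of 'Alnæs' are decoded with a Latin-2 codec. A minimal Python 3 sketch, assuming ISO-8859-2 is the codec involved (that choice is inferred from the escapes in the output above, not confirmed anywhere in this thread):

```python
# Python 3 sketch: reproduce the garbled author name from the output above.
correct = "Alnæs"
raw = correct.encode("utf-8")      # b'Aln\xc3\xa6s'

# Assumption: the UTF-8 bytes were decoded as ISO-8859-2 (Latin-2).
garbled = raw.decode("iso-8859-2")
print(ascii(garbled))              # 'Aln\u0102\u015as'

# The damage is reversible here because no bytes were lost:
# re-encode with the wrong codec, then decode as UTF-8.
repaired = garbled.encode("iso-8859-2").decode("utf-8")
print(repaired == correct)         # True
```

The two bytes of 'æ' (0xC3 0xA6) each map to a separate Latin-2 character, which is why one letter turns into two.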

And thanks for your great work on pyzotero!

urschrei commented on June 7, 2024

I can't reproduce this, and have added a test (d56cbba) to check for double-encoding issues. Could you give me the items in Zotero RDF export format, and I'll try to see what's going on?

smathot commented on June 7, 2024

Sure, here they are:

Both items appear correct in the Zotero desktop app, and the Zotero web interface. It's a weird issue, because it affects only a small number of items with special characters.

fractaledmind commented on June 7, 2024

The fact that Alnæs becomes AlnĂŚs suggests to me an issue with Unicode normalization. There are four Unicode normalization forms: NFC, NFKC, NFD, and NFKD. Here's what the C and K options do:

  • C: Combine characters and diacritics that are written using separate code points, such as converting “e” plus an acute accent modifier into “é”.
  • K: Replace characters that are functionally equivalent with the most common form. For example, full-width Roman characters will be replaced with ASCII characters, ellipsis characters will be replaced with three periods, and the ligature ‘ﬂ’ will be replaced with ‘fl’.
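
The C and K behaviours in the list above can be checked with Python's standard unicodedata module. A minimal Python 3 sketch; the sample strings are chosen for illustration and do not come from the Zotero items in this thread:

```python
import unicodedata

# C: compose a base letter and a combining mark into a single code point.
decomposed = "e\u0301"                       # "e" + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))        # 2 1

# K: replace compatibility characters with their plain equivalents.
ligature = unicodedata.normalize("NFKC", "\ufb02")   # LATIN SMALL LIGATURE FL
ellipsis = unicodedata.normalize("NFKC", "\u2026")   # HORIZONTAL ELLIPSIS
fullwidth = unicodedata.normalize("NFKC", "\uff21")  # FULLWIDTH LATIN CAPITAL A
print(ligature, ellipsis, fullwidth)         # fl ... A
```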

Perhaps you could try normalizing the text as soon as you receive it from Zotero. I personally use the following function to decode and normalize all incoming text in my Python:

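import unicodedata

def decode(text, encoding='utf-8', normalization='NFC'):
    """Convert `text` to unicode and normalize it (Python 2)."""
    # Byte strings are decoded first; unicode strings pass through unchanged.
    if isinstance(text, basestring):
        if not isinstance(text, unicode):
            text = unicode(text, encoding)
    return unicodedata.normalize(normalization, text)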

Now, this could just as easily not be the issue, but I've been bitten enough by the vagaries of Unicode in Python to know that it easily could be as well.

urschrei commented on June 7, 2024

Thanks, I'll try that in an experimental branch. I still can't recreate the problem in the latest snapshot, even with @smathot's imported examples:

In [4]: zc = zot.top(limit=2, content='citation', style='chicago-author-date')
In [5]: print(zc[0].encode('utf8'))
<span>(Tkačik et al. 2011)</span>
In [6]: print(zc[1].encode('utf8'))
<span>(Alnæs et al. 2014)</span>

I wonder if it's a locale issue. Mine is:

LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

smathot commented on June 7, 2024

I made a curious observation: If I add another special character to the name, then the æ is also decoded correctly. For example, Alnæsé. This only works if the added character has an accent or something; it doesn't work with plain ASCII characters. And when I remove the special character, the encoding error comes back: AlnĂŚs

Is there some kind of encoding auto-detection somewhere?

I use Kubuntu 14.10 with en_US.UTF-8 locale, and Python 2.7.8.

sraimund commented on June 7, 2024

Yes, when the response does not declare an encoding, the requests library guesses it with chardet (https://github.com/kennethreitz/requests/blob/master/requests/models.py#L748-749). Since the Zotero API is stated to always return UTF-8 (https://groups.google.com/forum/#!topic/zotero-dev/q8ZxilZobo4), you could add the line self.request.encoding = "utf-8" in the _retrieve_data() method of zotero.py after the request is made (e.g. before line https://github.com/urschrei/pyzotero/blob/master/pyzotero/zotero.py#L269). I had a problem with umlauts in my citations, and this additional line seems to have solved it.
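
The idea behind that one-line fix can be sketched without any network access: use a fixed UTF-8 default when no charset is declared, instead of guessing from the bytes. The helper decode_body below is hypothetical and written in Python 3 for illustration; it is not part of pyzotero or requests. With requests itself, the equivalent step is assigning "utf-8" to the response's encoding attribute before reading its text.

```python
def decode_body(raw, declared_encoding=None, default="utf-8"):
    """Decode response bytes without guessing from their content.

    Hypothetical helper: a declared charset wins; otherwise fall back to
    `default` (UTF-8, on the assumption that the server always sends UTF-8).
    """
    return raw.decode(declared_encoding or default)


body = "<span>(Alnæs et al. 2014)</span>".encode("utf-8")

# No declared charset: the UTF-8 default keeps the name intact.
text = decode_body(body)
print(text == "<span>(Alnæs et al. 2014)</span>")   # True

# A wrong codec (here ISO-8859-2, standing in for a bad guess) garbles it.
wrong = decode_body(body, declared_encoding="iso-8859-2")
print(ascii(wrong))   # '<span>(Aln\u0102\u015as et al. 2014)</span>'
```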
