I have a number of items in my Zotero library that have non-ASCII characters. I do not

Text encoding issues about pyzotero HOT 10 CLOSED

urschrei commented on June 7, 2024

Text encoding issues

from pyzotero.

Comments (10)

urschrei commented on June 7, 2024

I'm not actually decoding or encoding the returned content anywhere, so I wonder whether requests is double-encoding the returned UTF-8. Anyway, I'm digging into it now.
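For readers unfamiliar with the term, here is a minimal sketch of what double-encoded UTF-8 typically looks like. It assumes the intermediate misreading is Latin-1, which is only one common case and is not confirmed anywhere in this thread; the variable names are illustrative.

```python
original = "Alnæs"

# Correct first step: the text is encoded to UTF-8 once.
utf8_bytes = original.encode("utf-8")

# Double encoding: the UTF-8 bytes are mistaken for Latin-1 text
# (an assumption for this sketch) and encoded to UTF-8 a second time.
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")

# A reader that decodes only once sees two characters where "æ" should be.
seen = double_encoded.decode("utf-8")
print(seen)

# Undoing the mistaken step recovers the original text.
repaired = seen.encode("latin-1").decode("utf-8")
print(repaired)
```

A check for double encoding can compare the text a client sees against a known-good string in this way.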

urschrei commented on June 7, 2024

OK, I'm now explicitly encoding bibliography strings as UTF-8. Can you pull the dev branch and see if this works for you?

fractaledmind commented on June 7, 2024

Just to confirm that the new build fixes the issue. Thank you.

smathot commented on June 7, 2024

This issue does not appear to be fixed yet. It doesn't always happen, which makes me think it's not a simple text-encoding issue. But sometimes it does.

For example, here you see two items with special characters, only one of which is correctly encoded:

# zot = zotero instance
s = zot.item('FRHUMMC6', content='citation', style='chicago-author-date')
print s
print s[0]
s = zot.item('SBRWRSCQ', content='citation', style='chicago-author-date')
print s
print s[0]

output:

[u'<span>(Tka\u010dik et al. 2011)</span>']
<span>(Tkačik et al. 2011)</span>
[u'<span>(Aln\u0102\u015as et al. 2014)</span>']
<span>(AlnĂŚs et al. 2014)</span>

Here, 'Tkačik' is correct, but 'AlnĂŚs' should have been 'Alnæs'. I've tested this with the latest snapshot (5dc5bdd).

And thanks for your great work on pyzotero!

urschrei commented on June 7, 2024

I can't reproduce this, and have added a test (d56cbba) to check for double-encoding issues. Could you give me the items in Zotero RDF export format, and I'll try to see what's going on?

smathot commented on June 7, 2024

Sure, here they are:

http://files.cogsci.nl/tmp/aln%c3%a6s.rdf (incorrect)
http://files.cogsci.nl/tmp/tka%c4%8dik.rdf (correct)

Both items appear correct in the Zotero desktop app, and the Zotero web interface. It's a weird issue, because it affects only a small number of items with special characters.

fractaledmind commented on June 7, 2024

The fact that Alnæs becomes AlnĂŚs suggests to me an issue with Unicode normalization. Unicode normalization has four forms: NFC, NFKC, NFD, and NFKD. Here's what the C and K options do:

C: Combine characters and diacritics that are written using separate code points, such as converting “e” plus an acute accent modifier into “é”.
K: Replace characters that are functionally equivalent with the most common form. For example, full-width Roman characters will be replaced with ASCII characters, ellipsis characters will be replaced with three periods, and the ligature ‘ﬂ’ will be replaced with ‘fl’.
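A small sketch of the C and K behaviour just described, using the standard-library `unicodedata` module under Python 3. The sample strings and the `samples`/`normalized` names are chosen for illustration only.

```python
import unicodedata

# C forms compose: "e" followed by a combining acute accent (U+0301)
# becomes the single code point "é" (U+00E9).
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# D forms do the opposite and split "é" back into two code points.
redecomposed = unicodedata.normalize("NFD", composed)

# K forms additionally replace compatibility characters.
samples = {
    "ligature": "\ufb02",          # the "fl" ligature
    "ellipsis": "\u2026",          # single-character ellipsis
    "full-width": "\uff21\uff22",  # full-width Roman letters A and B
}
normalized = {name: unicodedata.normalize("NFKC", text)
              for name, text in samples.items()}
for name, text in normalized.items():
    print(name, repr(text))
```

Note that normalization only rearranges code points that are already correct; it does not change which characters a byte sequence was decoded into.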

Perhaps you could try normalizing the text as soon as you receive it from Zotero. I personally use the following function to decode and normalize all incoming text in my Python:

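import unicodedata

def decode(text, encoding='utf-8', normalization='NFC'):
    """Convert `text` to unicode and normalize it (Python 2)."""
    # Byte strings are decoded first; unicode objects pass through unchanged.
    if isinstance(text, basestring):
        if not isinstance(text, unicode):
            text = unicode(text, encoding)
    return unicodedata.normalize(normalization, text)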

Now, this could just as easily not be the issue, but I've been bitten enough by the vagaries of Unicode in Python to know that it easily could be as well.

urschrei commented on June 7, 2024

Thanks, I'll try that in an experimental branch. I still can't recreate the problem in the latest snapshot, even with @smathot's imported examples:

In [4]: zc = zot.top(limit=2, content='citation', style='chicago-author-date')
In [5]: print(zc[0].encode('utf8'))
<span>(Tkačik et al. 2011)</span>
In [6]: print(zc[1].encode('utf8'))
<span>(Alnæs et al. 2014)</span>

I wonder if it's a locale issue. Mine is:

LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

smathot commented on June 7, 2024

I made a curious observation: If I add another special character to the name, then the æ is also decoded correctly. For example, Alnæsé. This only works if the added character has an accent or something; it doesn't work with plain ASCII characters. And when I remove the special character, the encoding error comes back: AlnĂŚs

Is there some kind of encoding auto-detection somewhere?

I use Kubuntu 14.10 with en_US.UTF-8 locale, and Python 2.7.8.

sraimund commented on June 7, 2024

Yes, if the response does not declare an encoding, the requests library guesses it with chardet (https://github.com/kennethreitz/requests/blob/master/requests/models.py#L748-749). Since the Zotero API is stated to always return UTF-8 (https://groups.google.com/forum/#!topic/zotero-dev/q8ZxilZobo4), you could add the line self.request.encoding = "utf-8" in the _retrieve_data() method of zotero.py after the request is made (e.g. before line https://github.com/urschrei/pyzotero/blob/master/pyzotero/zotero.py#L269). I had a problem with umlauts in my citations, and this additional line seems to have solved it.
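To illustrate how a wrong guess produces the garbling reported earlier in the thread, here is a minimal Python 3 sketch using only the standard library. Decoding the UTF-8 bytes as ISO-8859-2 reproduces the exact code points \u0102\u015a from the reported output; whether chardet actually picked that codec is an assumption, since the thread does not say.

```python
# UTF-8 bytes for the citation, as the Zotero API is said to return them.
raw = "<span>(Alnæs et al. 2014)</span>".encode("utf-8")

# Decoding with a wrongly guessed single-byte codec.
# ISO-8859-2 is an assumption made for this sketch.
wrong = raw.decode("iso-8859-2")

# Decoding with the encoding the server actually used.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

With requests, setting the `encoding` attribute on the response before reading `.text` makes the library skip the guess, which is what the suggested extra line does.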
