Comments (10)
I'm not actually decoding or encoding the returned content anywhere, so I wonder whether requests
is double-encoding the returned utf-8. Anyway, I'm digging into it now.
from pyzotero.
OK, I'm now explicitly encoding bibliography strings as UTF-8. Can you pull the dev branch and see if this works for you?
from pyzotero.
Just confirming that the new build fixes the issue. Thank you.
from pyzotero.
This issue does not appear to be fixed yet. It doesn't always happen, which makes me think it's not a simple text-encoding issue. But sometimes it does.
For example, here you see two items with special characters, only one of which is correctly encoded:
# zot = zotero instance
s = zot.item('FRHUMMC6', content='citation', style='chicago-author-date')
print s
print s[0]
s = zot.item('SBRWRSCQ', content='citation', style='chicago-author-date')
print s
print s[0]
output:
[u'<span>(Tka\u010dik et al. 2011)</span>']
<span>(Tkačik et al. 2011)</span>
[u'<span>(Aln\u0102\u015as et al. 2014)</span>']
<span>(AlnĂŚs et al. 2014)</span>
Here, 'Tkačik' is correct, but 'AlnĂŚs' should have been 'Alnæs'. I've tested this with the latest snapshot (5dc5bdd).
And thanks for your great work on pyzotero!
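The garbled output above can be reproduced from the correct name. The code points in the bad result (\u0102 and \u015a) are what you get when the UTF-8 bytes of "æ" are decoded as ISO-8859-2. That codec is a guess based on the code points, not something confirmed in this thread. A minimal Python 3 sketch:

```python
# Python 3 sketch: reproduce the garbled citation from the correct name.
# Assumption: the bytes were decoded as ISO-8859-2; the thread does not
# confirm which codec was actually used.
correct = "Alnæs"
raw = correct.encode("utf-8")        # the "æ" becomes the two bytes C3 A6
garbled = raw.decode("iso-8859-2")   # each byte is read as its own character

print(repr(raw))
print(ascii(garbled))

# Reversing the wrong decode recovers the original text.
repaired = garbled.encode("iso-8859-2").decode("utf-8")
print(repaired)
```

If that guess is right, the bytes coming back from the server are fine and only the choice of codec on the client side differs between the two items.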
from pyzotero.
I can't reproduce this, and have added a test (d56cbba) to check for double-encoding issues. Could you give me the items in Zotero RDF export format, and I'll try to see what's going on?
from pyzotero.
Sure, here they are:
- http://files.cogsci.nl/tmp/aln%c3%a6s.rdf (incorrect)
- http://files.cogsci.nl/tmp/tka%c4%8dik.rdf (correct)
Both items appear correct in the Zotero desktop app, and the Zotero web interface. It's a weird issue, because it affects only a small number of items with special characters.
from pyzotero.
The fact that Alnæs becomes AlnĂŚs suggests to me an issue with Unicode normalization. There are four Unicode normalization forms: NFC, NFKC, NFD, and NFKD. Here's what the C and K options do:
- C: Combine characters and diacritics that are written using separate code points, such as converting “e” plus an acute accent modifier into “é”.
- K: Replace characters that are functionally equivalent with the most common form. For example, full-width Roman characters will be replaced with ASCII characters, ellipsis characters will be replaced with three periods, and the ligature ‘ﬂ’ will be replaced with ‘fl’.
Perhaps you could try normalizing the text as soon as you receive it from Zotero. I personally use the following function to decode and normalize all incoming text in my Python:
import unicodedata

def decode(text, encoding='utf-8', normalization='NFC'):
    """Convert `text` to unicode."""
    # Python 2: byte strings are decoded first, unicode objects pass through.
    if isinstance(text, basestring):
        if not isinstance(text, unicode):
            text = unicode(text, encoding)
    return unicodedata.normalize(normalization, text)
Now, this could just as easily not be the issue, but I've been bitten enough by the vagaries of Unicode in Python to know that it easily could be as well.
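To make the C and K behaviour described above concrete, here is a small Python 3 sketch using only the standard unicodedata module. The sample strings are my own illustrations; the last check applies NFC to the garbled name from this issue.

```python
import unicodedata

# "C": compose a base letter plus a combining accent into one code point.
decomposed = "e\u0301"                      # "e" + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# "K": replace compatibility characters with their common equivalents.
ligature = unicodedata.normalize("NFKC", "\ufb02")          # fl ligature
dots = unicodedata.normalize("NFKC", "\u2026")              # ellipsis
fullwidth = unicodedata.normalize("NFKC", "\uff21\uff22")   # full-width AB

# The garbled name consists of precomposed letters, so NFC does not alter it.
garbled = "Aln\u0102\u015as"
unchanged = unicodedata.normalize("NFC", garbled) == garbled

print(len(decomposed), len(composed))
print(ligature, dots, fullwidth)
print(unchanged)
```

The last check shows that NFC leaves the garbled string as it is, which hints that normalization alone may not repair this particular output.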
from pyzotero.
Thanks, I'll try that in an experimental branch. I still can't recreate the problem in the latest snapshot, even with @smathot's imported examples:
In [4]: zc = zot.top(limit=2, content='citation', style='chicago-author-date')
In [5]: print(zc[0].encode('utf8'))
<span>(Tkačik et al. 2011)</span>
In [6]: print(zc[1].encode('utf8'))
<span>(Alnæs et al. 2014)</span>
I wonder if it's a locale issue. Mine is:
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
from pyzotero.
I made a curious observation: If I add another special character to the name, then the æ is also decoded correctly. For example, Alnæsé. This only works if the added character has an accent or something; it doesn't work with plain ASCII characters. And when I remove the special character, the encoding error comes back: AlnĂŚs
Is there some kind of encoding auto-detection somewhere?
I use Kubuntu 14.10 with en_US.UTF-8 locale, and Python 2.7.8.
from pyzotero.
Yes, the encoding in the requests library is guessed by chardet when it is not included in the response (https://github.com/kennethreitz/requests/blob/master/requests/models.py#L748-749). Since it is stated that the Zotero API always returns UTF-8 (https://groups.google.com/forum/#!topic/zotero-dev/q8ZxilZobo4), you could add the line self.request.encoding = "utf-8" in the _retrieve_data() method of zotero.py after the request is made (e.g. before line https://github.com/urschrei/pyzotero/blob/master/pyzotero/zotero.py#L269). I had a problem with umlauts in my citations, and this additional line seems to solve it.
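The same idea, assuming UTF-8 when the response declares no charset instead of letting a detector guess, can be sketched in Python 3 without requests. The helpers charset_from_content_type and decode_body are hypothetical names for illustration; they are not part of pyzotero or requests, and the Content-Type values are made-up examples.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def decode_body(body, content_type=None, default="utf-8"):
    """Decode response bytes, falling back to `default` instead of guessing.

    Hypothetical helper: a declared charset wins; otherwise UTF-8 is assumed,
    on the grounds that the server is documented to return UTF-8.
    """
    charset = charset_from_content_type(content_type) if content_type else None
    return body.decode(charset or default)


body = "<span>(Alnæs et al. 2014)</span>".encode("utf-8")

# No charset declared: the fallback applies and the text decodes correctly.
print(decode_body(body, "application/atom+xml"))

# Charset declared explicitly: it is honoured as given.
print(decode_body(body, "text/html; charset=UTF-8"))
```

A fixed fallback gives the same result for every item, whereas a statistical guess can change with the mix of bytes in a short response.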
from pyzotero.