Comments (4)
Seems to be working with lxml_parser = XMLParser(huge_tree = True, recover = True)
from smsxml2html.
Thanks! It sounds like your input XML contains https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%98%84 (the 😄 character), and I'm not handling that properly. Can you please view the XML in a Unicode-aware text editor, check line 49, column 101, and confirm that that's what it contains?
Had a pile of poo here, literally: 💩 (which, apparently like anything including surrogate pairs, causes trouble). It was encoded as &#55357;&#56489;. Other than that, some more smileys.
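For reference, a minimal sketch of how a code point above U+FFFF splits into two UTF-16 surrogates, which is where a pair of decimal entities for a single emoji comes from. The helper name `to_surrogates` is made up for illustration; the arithmetic is the standard UTF-16 rule.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = to_surrogates(0x1F4A9)   # U+1F4A9 is the pile of poo
print("&#%d;&#%d;" % (high, low))    # prints &#55357;&#56489;
```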
The following piece of code made it work for me:
import re
from io import BytesIO
from lxml import etree
lxml_parser = etree.XMLParser (huge_tree = True)
# `input` holds the path of the XML file
with open (input, 'rb') as f:
    payload = f.read().decode ('utf-8')
# expand decimal entities, but don't convert `\n\r#&;<>`
chrifnotspecial = lambda dec: '&#%d;' % dec if dec in [ 10, 13, 35, 38, 59, 60, 62 ] else chr (dec)
payload = re.sub (r'&#(\d+);', lambda x: chrifnotspecial (int (x.group(1))), payload)
# No idea why 'INFORMATION SEPARATOR's ended up in some messages,
# but I decide that I don't need them, and they make the parser barf out...
for dec in [ 28, 29, 30, 31 ]:
    payload = payload.replace (chr (dec), '')
# combine surrogate pairs
payload = payload.encode ('utf-16', 'surrogatepass').decode ('utf-16')
payload = payload.encode ('utf-8')
tree = etree.parse (BytesIO (payload), parser = lxml_parser)
root = tree.getroot()
Obviously, this should replace the existing code in main() for reading the input file.
This assumes that you are using Python 3 (required for chr() with inputs > 255; unichr() of Python 2 should do the job as well, but I didn't test). smsxml2html is almost Python 3-compatible except for two minor parts: you have to replace the two occurrences of iteritems with items, and msg.text.encode('utf8') by msg.text (or msg.text.strip(), possibly, if you preserve whitespace but want to drop superfluous spaces at the beginning and end of a message). If encoding='utf-8' is given as an additional argument for open(output_path, 'w'), I guess that this should even be fully backwards compatible.
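A small sketch of what a version-neutral write could look like, assuming backwards compatibility with Python 2 is actually wanted. It uses io.open() because the built-in open() of Python 2 does not accept an encoding argument. The function name `write_messages` and the `messages` dict are hypothetical and not taken from smsxml2html.

```python
import io


def write_messages(output_path, messages):
    """Write a {sender: text} dict as UTF-8 text on Python 2 and Python 3."""
    # io.open() takes `encoding` on both major versions
    with io.open(output_path, 'w', encoding='utf-8') as out:
        # items() exists on both versions; iteritems() is Python 2 only
        for sender, text in sorted(messages.items()):
            # strip() drops superfluous whitespace at both ends of a message
            out.write(u'%s: %s\n' % (sender, text.strip()))
```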
I'm not an expert on Unicode representation by any means. IIUC, the encoding used in those XML files is more UTF-16-ish than UTF-8-ish, and "normal" (more clever) means to convert (like passing the XML to BeautifulSoup) fail because an entity-wise conversion yields invalid results in some intermediate stage (since &#55357; does not translate to a character, but to a "surrogate code point", which is only one half of a pair and needs the following one to form a character). I might well be totally wrong here; almost my entire wisdom is based on a StackOverflow post concerning &#55357; and some pile of poo reference.
Anyway, my understanding of what I do is: convert each entity independently and blindly, then convert to UTF-16 while leaving those surrogate pairs alone, then read and interpret them, and then encode again to the more well-received (at least to me) UTF-8.
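These steps can be tried on a small string without lxml. The following is only a sketch using the standard library; `sample` is made-up input and `expand_entities` is a hypothetical helper name, not part of smsxml2html.

```python
import re

# Made-up sample: U+1F4A9 written as two surrogate entities, plus an escaped '&'
sample = 'Hi &#55357;&#56489; &#38; bye'

KEEP_ESCAPED = {10, 13, 35, 38, 59, 60, 62}  # \n \r # & ; < >


def expand_entities(text):
    """Replace decimal entities by their characters, except those in KEEP_ESCAPED."""
    def repl(match):
        dec = int(match.group(1))
        return match.group(0) if dec in KEEP_ESCAPED else chr(dec)
    return re.sub(r'&#(\d+);', repl, text)


step1 = expand_entities(sample)  # now holds two separate surrogate code points
# 'surrogatepass' lets the lone surrogates through; decoding merges the pair
step2 = step1.encode('utf-16', 'surrogatepass').decode('utf-16')

print(len(step1), len(step2))    # prints 15 14
print(step2.encode('utf-8'))     # b'Hi \xf0\x9f\x92\xa9 &#38; bye'
```

Calling step1.encode('utf-8') directly raises UnicodeEncodeError, because UTF-8 cannot represent unpaired surrogates; that is the invalid intermediate stage mentioned above.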
By the way, no clue why apparently I need to go via the BytesIO, but this works. Using etree.fromstring() instead of etree.parse() did not (although, AFAICS, this should do the same after removing the encoding tag in line 1 of the XML?)...
Caveat: this is a pretty brute-force-ish hands-on approach. Worked for me; YMMV.
I found some evidence that, apparently, funny characters in XML attributes are not really covered by the XML standard, although it seems that XML 1.1 relaxed it somewhat. In any case, the files produced by SMS Backup & Restore seem not to strictly obey the standard in all cases.
This is pretty much "best-effort recovery", with lowest effort for me.
Note that I preserve, among others, the &#10; encoding for linebreaks, which would be silently converted to normal spaces by the LXML parser. I like to have white-space: pre-wrap; in the CSS for .month_convos td; I appreciate it if my conversation partners spend the effort to type line breaks, so who am I to drop them in the archives?
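To illustrate the linebreak point: XML parsers normalize a literal newline inside an attribute value to a space, while a character reference for the newline survives. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that both follow the same attribute-value normalization rule; the `sms` and `body` names are made up.

```python
import xml.etree.ElementTree as ET

# Literal newline inside the attribute value
literal = ET.fromstring('<sms body="line one\nline two"/>')
# Newline written as a decimal character reference
escaped = ET.fromstring('<sms body="line one&#10;line two"/>')

print(repr(literal.get('body')))  # 'line one line two'
print(repr(escaped.get('body')))  # 'line one\nline two'
```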
By the way, @T2Fr: recover = True "seems to be working", but, unfortunately, at the expense of ignoring the character. These days, that can mean "discarding the message", which would be 💩☹... 😄