Comments (4)

T2Fr commented on June 25, 2024

Seems to be working with lxml_parser = XMLParser(huge_tree = True, recover = True)

from smsxml2html.

KermMartian commented on June 25, 2024

Thanks! It sounds like your input XML contains U+1F604 (the 😄 character, see https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%98%84), and I'm not handling that properly. Can you please open the XML in a Unicode-aware text editor, check line 49, column 101, and confirm that this is what it contains?

akobel commented on June 25, 2024

Had a pile of poo here, literally: 💩 (which, apparently like anything including surrogate pairs, causes trouble). It was encoded as &#55357;&#56489;. Other than that, some more smileys.

The following piece of code made it work for me:

import re
from io import BytesIO
from lxml import etree

lxml_parser = etree.XMLParser(huge_tree=True)

# `input` is the input file path, as in the existing main()
with open(input, 'rb') as f:
    payload = f.read().decode('utf-8')

# Expand numeric entities, but don't convert `\n\r#&;<>`
chrifnotspecial = lambda dec: '&#%d;' % dec if dec in [10, 13, 35, 38, 59, 60, 62] else chr(dec)
payload = re.sub(r'&#(\d+);', lambda x: chrifnotspecial(int(x.group(1))), payload)

# No idea why 'INFORMATION SEPARATOR's ended up in some messages,
# but I decided that I don't need them, and they make the parser barf out...
for dec in [28, 29, 30, 31]:
    payload = payload.replace(chr(dec), '')

# Combine surrogate pairs via a UTF-16 round trip, then re-encode as UTF-8
payload = payload.encode('utf-16', 'surrogatepass').decode('utf-16')
payload = payload.encode('utf-8')

tree = etree.parse(BytesIO(payload), parser=lxml_parser)
root = tree.getroot()

Obviously, this should replace the existing code in main() for reading the input file.

This assumes that you are using Python 3 (required for chr() with inputs > 255; unichr() of Python 2 should do the job as well, but I didn't test). smsxml2html is almost Python 3-compatible except for two minor parts: you have to replace the two occurrences of iteritems with items, and msg.text.encode('utf8') with msg.text (or possibly msg.text.strip(), if you preserve whitespace but want to drop superfluous spaces at the beginning and end of a message). If encoding='utf-8' is given as an additional argument for open(output_path, 'w'), I guess that this should even be fully backwards compatible.
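
A small sketch of those two compatibility points, in case it helps. One caveat on the backwards-compatibility guess: the built-in open() of Python 2 does not accept an encoding argument, whereas io.open() does on both major versions, so the sketch uses that instead. The helper name write_output and the sample data are made up for illustration; they are not taken from smsxml2html.

```python
import io
import os
import tempfile

def write_output(output_path, html_text):
    """Write text as UTF-8; io.open accepts encoding= on Python 2.6+ and 3."""
    with io.open(output_path, 'w', encoding='utf-8') as f:
        f.write(html_text)

# dict.items() exists on both major versions (iteritems() is Python 2 only)
counts = {u'Alice': 2, u'Bob': 1}  # made-up sample data
rows = [u'%s: %d' % (name, n) for name, n in sorted(counts.items())]

# Round trip through a temporary file, including a non-BMP character
out_path = os.path.join(tempfile.mkdtemp(), 'demo.html')
write_output(out_path, u'\n'.join(rows) + u' \U0001F4A9')
with io.open(out_path, encoding='utf-8') as f:
    round_trip = f.read()
```

With the file opened this way, the text is passed as a (unicode) string and no explicit .encode('utf8') call is needed.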

I'm not an expert on Unicode representation by any means. IIUC, the encoding used in those XML files is more UTF-16-ish than UTF-8-ish, and "normal" (more clever) means to convert (like passing the XML to BeautifulSoup) fail because an entity-wise conversion yields invalid results in some intermediate stage (since &#55357; does not translate to a character, but a "surrogate code point", which is more like a modifier for the next character). I might well be totally wrong here; almost my entire wisdom is based on a StackOverflow post concerning &#55357; and some pile of poo reference.
Anyway, my understanding of what I do is: convert each entity independently and blindly, then convert to UTF-16 while leaving those surrogate pairs alone, then read and interpret them, and then encode again to the more well-received (at least to me) UTF-8.
By the way, no clue why apparently I need to go via the BytesIO, but this works. Using etree.fromstring() instead of etree.parse() did not (although, AFAICS, this should do the same after removing the encoding tag in line 1 of the XML?)...
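
To make the steps described above concrete, here is a minimal standard-library sketch of the entity expansion and the UTF-16 round trip on a single made-up attribute. The names expand_entities, combine_surrogates and the sample string are hypothetical, chosen for illustration only; Python 3 is assumed.

```python
import re

# Code points that stay encoded as numeric entities: \n \r # & ; < >
KEEP_AS_ENTITY = {10, 13, 35, 38, 59, 60, 62}

def expand_entities(text):
    """Blindly replace each &#N; with chr(N), except the protected code points."""
    def repl(match):
        code = int(match.group(1))
        return match.group(0) if code in KEEP_AS_ENTITY else chr(code)
    return re.sub(r'&#(\d+);', repl, text)

def combine_surrogates(text):
    """Merge adjacent high/low surrogates into one code point via UTF-16."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

raw = 'body="Oops &#55357;&#56489; &amp; line1&#10;line2"'  # made-up sample
step1 = expand_entities(raw)       # contains two lone surrogates, not yet valid text
step2 = combine_surrogates(step1)  # the pair is now the single character U+1F4A9
utf8_bytes = step2.encode('utf-8') # succeeds only once the surrogates are merged
```

After the first step the string cannot be encoded as UTF-8, because lone surrogates are not valid there; the 'surrogatepass' error handler lets them through to UTF-16, where the decoder reads the pair as one character. An unpaired surrogate left in the input would make the decode step raise an error.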

Caveat: this is a pretty brute-force-ish hands-on approach. Worked for me; YMMV.
I found some evidence that, apparently, funny characters in XML attributes are not really covered by the XML standard, although it seems that XML 1.1 relaxed it somewhat. In any case, the files produced by SMS Backup & Restore seem not to strictly obey the standard in all cases.
This is pretty much "best-effort recovery", with lowest effort for me.

Note that I preserve, among others, the &#10; encoding for line breaks; as literal newlines inside an attribute, they would be silently converted to normal spaces by the lxml parser. I like to have white-space: pre-wrap; in the CSS for .month_convos td; I appreciate it if my conversation partners spend the effort to type line breaks, so who am I to drop them in the archives?
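
The difference between the two forms can be checked with a short sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, so that it runs without extra packages; the normalization of literal whitespace in attribute values is required by the XML specification, so lxml is expected to behave the same way, but that is an assumption not tested here. The element and attribute names are made up.

```python
import xml.etree.ElementTree as ET

# Newline written as a character reference: survives parsing as a real newline
as_entity = ET.fromstring('<sms body="line1&#10;line2"/>').get('body')

# Newline written literally inside the attribute: normalized to a space
as_literal = ET.fromstring('<sms body="line1\nline2"/>').get('body')
```

This is why code point 10 is in the list of entities that are left encoded.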

akobel commented on June 25, 2024

By the way, @T2Fr: recover = True "Seems to be working" - but, unfortunately, at the expense of ignoring the character. These days, that can mean "discarding the message", which would be 💩☹... 😄
