Invalid xmlChar value 55357, line 49, column 101 about smsxml2html HOT 4 OPEN

kermmartian commented on June 25, 2024

Invalid xmlChar value 55357, line 49, column 101

from smsxml2html.

Comments (4)

T2Fr commented on June 25, 2024

Seems to be working with lxml_parser = XMLParser(huge_tree = True, recover = True)

KermMartian commented on June 25, 2024

Thanks! It sounds like your input XML contains the 😄 character (https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%98%84), and I'm not handling that properly. Can you please view the XML in a Unicode-aware text editor, check line 49, column 101, and confirm that that's what it contains?

akobel commented on June 25, 2024

Had a pile of poo here, literally: 💩 (which, apparently like anything involving surrogate pairs, causes trouble). It was encoded as &#55357;&#56489;. Other than that, some more smileys.
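
To make the surrogate-pair arithmetic concrete, here is a minimal sketch of how the two entity values combine into a single code point. The helper name combine_surrogates is made up for illustration and is not part of smsxml2html or lxml.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the code point, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two entities from the backup file: &#55357;&#56489;
code_point = combine_surrogates(55357, 56489)
character = chr(code_point)  # U+1F4A9, the pile of poo
```

On its own, 55357 (0xD83D) is only the first half of such a pair, which is presumably why the parser rejects it as an xmlChar value.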

The following piece of code made it work for me:

import re
from io import BytesIO
from lxml import etree

lxml_parser = etree.XMLParser(huge_tree=True)

# `input` is the path of the input file
with open(input, 'rb') as f:
    payload = f.read().decode('utf-8')

# don't convert `\n\r#&;<>`; quotes (34, 39) also stay escaped so that they cannot end an attribute value
chrifnotspecial = lambda dec: '&#%d;' % dec if dec in [10, 13, 34, 35, 38, 39, 59, 60, 62] else chr(dec)
payload = re.sub(r'&#(\d+);', lambda x: chrifnotspecial(int(x.group(1))), payload)

# No idea why 'INFORMATION SEPARATOR's ended up in some messages,
# but I decide that I don't need them, and they make the parser barf out...
for dec in [28, 29, 30, 31]:
    payload = payload.replace(chr(dec), '')

# combine surrogate pairs
payload = payload.encode('utf-16', 'surrogatepass').decode('utf-16')
payload = payload.encode('utf-8')

tree = etree.parse(BytesIO(payload), parser=lxml_parser)
root = tree.getroot()

Obviously, this should replace the existing code in main() for reading the input file.

This assumes that you are using Python 3 (required for chr() with inputs > 255; unichr() in Python 2 should do the job as well, but I didn't test). smsxml2html is almost Python 3 compatible except for two minor parts: you have to replace the two occurrences of iteritems with items, and msg.text.encode('utf8') with msg.text (or possibly msg.text.strip(), if you preserve whitespace but want to drop superfluous spaces at the beginning and end of a message). If encoding='utf-8' is given as an additional argument when opening output_path for writing, I guess that this should even be fully backwards compatible (on Python 2 that means io.open(), since the built-in open() there does not accept an encoding argument).
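
As a rough illustration of the compatibility changes described above, the following sketch writes messages with items() and io.open(), both of which exist on Python 2.7 and Python 3. The function write_messages and its {sender: text} mapping are hypothetical stand-ins, not the actual smsxml2html code.

```python
import io


def write_messages(messages, output_path):
    """Write a {sender: text} mapping as UTF-8 text (hypothetical helper)."""
    # io.open() accepts encoding= on Python 2.7 and is the built-in open() on Python 3.
    with io.open(output_path, 'w', encoding='utf-8') as out:
        # items() works on both versions; iteritems() exists only on Python 2.
        for sender, text in messages.items():
            # strip() drops superfluous whitespace at both ends of a message.
            out.write(u'%s: %s\n' % (sender, text.strip()))
```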

I'm not an expert on Unicode representation by any means. IIUC, the encoding used in those XML files is more UTF-16-ish than UTF-8-ish, and "normal" (more clever) means of conversion (like passing the XML to BeautifulSoup) fail because an entity-wise conversion yields invalid results in some intermediate stage (since &#55357; does not translate to a character, but to a "surrogate code point", which is only meaningful in combination with the one that follows it). I might well be totally wrong here; almost my entire wisdom is based on a StackOverflow post concerning &#55357; and some pile of poo reference.
Anyway, my understanding of what I do is: convert each entity independently and blindly, then encode to UTF-16 while leaving those surrogates untouched, then decode again so that each pair is read as one character, and finally encode to the more well-received (at least to me) UTF-8.
By the way, no clue why I apparently need to go via BytesIO, but this works. Using etree.fromstring() instead of etree.parse() did not (although, AFAICS, this should do the same after removing the encoding tag in line 1 of the XML?)...
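
The stages described above can be traced on a one-line sample using only the standard library. This is a simplified sketch: the sample string is made up, and the special-character handling from the full snippet is left out.

```python
import re

sample = 'body="&#55357;&#56489; ok"'

# Stage 1: expand each numeric entity on its own; this leaves two lone surrogates.
stage1 = re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), sample)

# The intermediate result cannot be encoded as strict UTF-8.
try:
    stage1.encode('utf-8')
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False

# Stage 2: round-trip through UTF-16; 'surrogatepass' lets the lone surrogates
# through on encoding, and decoding reads the adjacent pair as one character.
stage2 = stage1.encode('utf-16', 'surrogatepass').decode('utf-16')

# Stage 3: the merged text now encodes to UTF-8 without errors.
utf8_bytes = stage2.encode('utf-8')
```

Note that the decode step would raise an error for a surrogate that has no partner, so this only helps when the entities really come in pairs.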

Caveat: this is a pretty brute-force-ish hands-on approach. Worked for me; YMMV.
I found some evidence that, apparently, funny characters in XML attributes are not really covered by the XML standard, although it seems that XML 1.1 relaxed it somewhat. In any case, the files produced by SMS Backup & Restore do not seem to strictly obey the standard in all cases.
This is pretty much "best-effort recovery", with lowest effort for me.
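
For reference, a small sketch of the XML 1.0 Char production shows why both the surrogate value and the INFORMATION SEPARATOR characters are rejected. The function name is_xml10_char is made up for illustration; the ranges are the ones listed in the XML 1.0 specification.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point `cp` matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # after the surrogate block
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes (emoji live here)
    )


surrogate_ok = is_xml10_char(55357)                          # the value from the error message
separators_ok = [is_xml10_char(cp) for cp in (28, 29, 30, 31)]
emoji_ok = is_xml10_char(0x1F4A9)                            # the combined character is fine
```

As far as I understand it, XML 1.1 additionally permits most of the low control characters when written as character references, but the surrogate range stays excluded there too.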

Note that I preserve, among others, the &#10; encoding for line breaks; literal line breaks would be silently converted to normal spaces by the lxml parser. I like to have white-space: pre-wrap; in the CSS for .month_convos td; I appreciate it if my conversation partners spend the effort to type line breaks, so who am I to drop them in the archives?
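
The difference between a literal line break and the escaped one can be seen with a tiny example. This sketch uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; the whitespace handling comes from the attribute-value normalization rule in the XML specification, so lxml should behave the same way. The sms element and body attribute are only illustrative.

```python
import xml.etree.ElementTree as ET

# A literal line break inside an attribute value is normalized to a space.
literal = ET.fromstring('<sms body="line one\nline two"/>')
literal_body = literal.get('body')

# The same line break written as a character reference survives parsing.
escaped = ET.fromstring('<sms body="line one&#10;line two"/>')
escaped_body = escaped.get('body')
```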

akobel commented on June 25, 2024

By the way, @T2Fr: recover = True "Seems to be working" - but, unfortunately, at the expense of ignoring the character. These days, that can mean "discarding the message", which would be 💩☹... 😄
