Comments (11)

JoshCheek commented on July 20, 2024

Synopsis

Rough time figuring this out. It boils down to this: a UTF-8 string whose byte representation includes bytes outside the ASCII range was incorrectly labeled as ASCII-8BIT (a byte string), and then something tried to transcode it to UTF-8. Because those bytes are not valid ASCII, the transcoding blows up.

# encoding: utf-8
"ç".force_encoding(Encoding::ASCII_8BIT)  # => "\xC3\xA7"
   .encode(Encoding::UTF_8)               # ~> Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
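
For contrast, here is a minimal sketch of the two operations involved: `encode` converts the bytes, while `force_encoding` only changes the label on them. The variable names are made up for illustration, and the character is written as an escape so the snippet does not depend on the source file's encoding.

```ruby
# "\u00E7" is c-cedilla; relabel its UTF-8 bytes as a byte string.
bytes = "\u00E7".dup.force_encoding(Encoding::ASCII_8BIT)

# Transcoding fails: byte 0xC3 has no defined mapping from ASCII-8BIT to UTF-8.
transcode_error =
  begin
    bytes.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# Relabeling succeeds, because the bytes were valid UTF-8 all along.
relabeled = bytes.dup.force_encoding(Encoding::UTF_8)
relabeled_is_valid = relabeled.valid_encoding?
```

Relabeling is only safe when you already know the bytes are UTF-8, which is why checking `valid_encoding?` afterwards is worthwhile.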

How it got encoded incorrectly

How does that force_encoding happen? Well, the JSON stdlib's error message includes the string it was attempting to parse, but the error message itself is in ASCII-8BIT (I don't know why, I'll probably open a bug report). In other words, it thinks the error message is a byte string (notice how it inspects).

require 'json'  # => true

str = "√"                                           # => "√"
msg = (JSON.parse JSON.dump str rescue $!.message)  # => "757: unexpected token at '\"\xE2\x88\x9A\"'"

str.encoding  # => #<Encoding:UTF-8>
msg.encoding  # => #<Encoding:ASCII-8BIT>

How strings try to transcode to fix the issue

Okay, so how does it try to transcode itself? Apparently, when two strings are concatenated (as we are doing when we append the errors into comments in the original text), Ruby tries transcoding one to the other: first the RHS into the LHS's encoding and, if that fails, the LHS into the RHS's encoding. If both fail, it blows up as we see in this issue.

# encoding: UTF-8

def a8b(string)
  string.force_encoding(Encoding::ASCII_8BIT)
end

# ENCODINGS MATCH: no transcoding
  ("a" + "å")           # => "aå"
  ("a" + "å").encoding  # => #<Encoding:UTF-8>


# ENCODINGS ARE COMPATIBLE: the RHS is transcoded to the LHS
  ("a" + a8b("a"))           # => "aa"
  ("a" + a8b("a")).encoding  # => #<Encoding:UTF-8>

  (a8b("a") + "a")           # => "aa"
  (a8b("a") + "a").encoding  # => #<Encoding:ASCII-8BIT>

# RHS IS NOT COMPATIBLE WITH LHS: the LHS is transcoded to the RHS
  # example1: not compatible b/c å is multibyte
  (a8b("a") + "å")           # => "aå"
  (a8b("a") + "å").encoding  # => #<Encoding:UTF-8>

  # example2: not compatible b/c "\xC3\xA5" is a string of two bytes with values 195 and 165
  # since ASCII only has values 0-127, these are not valid ASCII values.
  # So the RHS has no idea what these are supposed to be and can't change encodings.
  # So the LHS changes its encoding to ASCII-8BIT, because it has a valid ascii representation (97)
  ("a" + a8b("å"))           # => "a\xC3\xA5"
  ("a" + a8b("å")).encoding  # => #<Encoding:ASCII-8BIT>

# NEITHER RHS NOR LHS ARE COMPATIBLE: explosions
  # this is where we find ourselves
  ("å" + a8b("å"))  # ~> Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
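
The same cases can be checked without concatenating anything, using `Encoding.compatible?`, which returns the encoding the result would have, or `nil` when the two strings cannot be combined. This is a sketch, not something the original examples used; the variable names are made up, and `a8b` here works on a copy so the argument is not mutated.

```ruby
def a8b(string)
  string.dup.force_encoding(Encoding::ASCII_8BIT)
end

utf8_ascii   = "a"            # UTF-8, ASCII-only
utf8_wide    = "\u00E5"       # UTF-8, multibyte (a with ring)
binary_ascii = a8b("a")       # ASCII-8BIT, ASCII-only
binary_wide  = a8b("\u00E5")  # ASCII-8BIT, bytes 0xC3 0xA5

both_ascii_lhs_utf8   = Encoding.compatible?(utf8_ascii, binary_ascii)    # UTF-8
both_ascii_lhs_binary = Encoding.compatible?(binary_ascii, utf8_ascii)    # ASCII-8BIT
ascii_lhs_wide_utf8   = Encoding.compatible?(binary_ascii, utf8_wide)     # UTF-8
ascii_lhs_wide_binary = Encoding.compatible?(utf8_ascii, binary_wide)     # ASCII-8BIT
neither_is_ascii      = Encoding.compatible?(utf8_wide, binary_wide)      # nil

# A nil result means + would raise Encoding::CompatibilityError.
concat_error =
  begin
    utf8_wide + binary_wide
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
```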

Okay, but how did we get the error message?

The JSON lib will emit an invalid object while dumping (apparently, toplevel JSON value can only be an object or an array ...for some reason). However, rather than blowing up when asked to dump invalid JSON, it blows up when asked to parse it. Well... unless you tell it that the lib which generated this JSON is ...quirky.

require 'json'  # => false

json = JSON.dump("√")                # => "\"√\""
JSON.parse(json) rescue $!.message   # => "757: unexpected token at '\"\xE2\x88\x9A\"'"
JSON.parse(json, quirks_mode: true)  # => "√"

Summary

How to fix it

Any string that exists at this level is SiB data and should be encoded accordingly. I think that in addition to calling to_s on it, we should first try transcoding it, and if that fails, force its encoding to whatever the other side's encoding is (presumably UTF-8, but I'm not sure that's legit).
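
A rough sketch of what that could look like, assuming the target is UTF-8. `coerce_to_utf8` is a hypothetical helper written for this comment, not something that exists in seeing_is_believing, and replacing undecodable bytes with "?" is an extra assumption on top of the idea above.

```ruby
# Hypothetical helper: turn any value into a valid UTF-8 string.
def coerce_to_utf8(value)
  string = value.to_s.dup
  string =
    begin
      # First choice: a real conversion from the string's current encoding.
      string.encode(Encoding::UTF_8)
    rescue EncodingError
      # Fallback: assume the bytes already are UTF-8 and just relabel them.
      string.force_encoding(Encoding::UTF_8)
    end
  # Last resort (assumption): replace bytes that are not valid UTF-8.
  string.valid_encoding? ? string : string.scrub("?")
end

mislabeled = "\u00E5".b                            # UTF-8 bytes labeled ASCII-8BIT
joined     = "\u00E5" + coerce_to_utf8(mislabeled) # no CompatibilityError
```

The design choice here is to prefer a real conversion and only relabel when conversion is impossible, so strings in other encodings (for example ISO-8859-1) are converted rather than misread.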

from seeing_is_believing.

JoshCheek commented on July 20, 2024

Bug report opened here

stephan-nordnes-eriksen commented on July 20, 2024

I could not get it to successfully parse JSON.parse(JSON.dump("√")). Maybe I don't fully understand the issue, but I ended up converting to the Oj gem.

It does seem to handle anything I have been able to throw at it thus far, but it behaves a bit differently, so you need to set Oj.default_options = {:mode => :compat} to make it work similarly to the normal JSON. A plus is that it is apparently faster as well.

JoshCheek commented on July 20, 2024

This is an encoding issue. The inspected string is sometimes coming back as #<Encoding:UTF-8>, and sometimes as #<Encoding:US-ASCII>. When I then go to add the inspections as annotations, it blows up inside parser. Looks like I can call force_encoding on the inspected value, here, to get it to not blow up. For some reason, though, this causes the stack overflow test to blow up. Not sure why, but it would be good for these failures to show up at lib level tests. Also, I might be able to force the encoding on the consumer side instead of the producer side.

J-Swift commented on July 20, 2024

Thanks for posting this detailed report Josh. You saved me a bunch of time trying to figure this out myself for a project I'm working on.

JoshCheek commented on July 20, 2024

lol, np. When it takes me a lot of effort to figure something out, I try to document it in that moment, so I don't have to re-experience it later ^_^ Glad I was able to save someone else this confusion!

stephan-nordnes-eriksen commented on July 20, 2024

I just came across this issue myself. Is there any solution that will allow me to parse UTF-8 strings using the ruby json parser?

I think the solution is here: https://github.com/ohler55/oj

JoshCheek commented on July 20, 2024

It does parse UTF-8 strings. The problem here was that the string wasn't valid UTF-8, so it tried to fix the encoding and couldn't.

JoshCheek commented on July 20, 2024

That's a different issue: it's blowing up because "√" is not valid as a toplevel JSON value.

JSON (as originally specified in RFC 4627) only considers arrays and objects (ruby hashes) to be valid toplevel values. The spec does a bad job of conveying this, but at the top of http://json.org/ it says:

JSON is built on two structures:

  • A collection of name/value pairs. In various languages, this is realized as an object,
    record, struct, dictionary, hash table, keyed list, or associative array.
  • An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

You can see it in Ruby's parsing code here.

The default JSON parser can deal with this, it just doesn't by default, because that's apparently nonstandard. To get it to parse this, pass the option quirks_mode: true when you parse it.

$ ruby -rjson -e 'p JSON.parse(JSON.dump("√"), quirks_mode: true)'
"√"
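
If you would rather not rely on a parser option, one workaround is to wrap the scalar in an array before dumping and unwrap it after parsing, so the toplevel value is always an array. A minimal sketch; `dump_scalar` and `parse_scalar` are hypothetical names, not part of the json library.

```ruby
require 'json'

# Hypothetical helpers: keep the toplevel JSON value an array.
def dump_scalar(value)
  JSON.dump([value])
end

def parse_scalar(json)
  JSON.parse(json).first
end

json       = dump_scalar("\u221A")  # "\u221A" is the square root sign
round_trip = parse_scalar(json)
```

This works the same way regardless of how a given version of the json library treats toplevel scalars.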

stephan-nordnes-eriksen commented on July 20, 2024

Interesting. Strange that basic values are invalid JSON. It does make sense, though, given the name "JavaScript Object Notation". If I were designing this, I would have made those valid syntax: it feels so wrong for "some string" to be invalid. The same goes for 123 and true. Why are arrays valid then? Shouldn't those be values inside an object, similar to strings? Weird.

Anyways, thanks a lot for clarifying!

JoshCheek commented on July 20, 2024

Just found this wonderful resource about encodings, adding it here b/c this is where I go when I get confused about them.
