Comments (6)

stuartpb commented on June 13, 2024

The presence of a BOM anywhere in UTF-8 is a bug.

from big-list-of-naughty-strings.

ssokolow commented on June 13, 2024

@stuartpb To quote the link you just cited (emphasis mine):

The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.[3] Byte order has no meaning in UTF-8,[4] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[5][6]

stuartpb commented on June 13, 2024

@ssokolow Sure, I'm not saying it shouldn't be included in the corpus - just that it shouldn't be treated as a bug that occurs "by naively concatenating files together" (or that it shouldn't be stripped by anything that isn't expected to round-trip its exact input between encodings). Its root cause is software that adds a BOM to UTF-8, at any point. See the text immediately following your excerpt:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[7]

ssokolow commented on June 13, 2024

@stuartpb However, that does not apply to many web applications because they accept "plaintext" generated by Microsoft applications or Google Docs.

Even so, Microsoft compilers[9] and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.
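
For an application that has to accept such files, one option (a minimal sketch, not something proposed in this thread) is Python's standard `utf-8-sig` codec, which drops a single leading BOM on decode. The `read_upload` helper name is hypothetical. The codec only looks at the start of the input, so a BOM left in the middle by concatenating two BOM-prefixed inputs survives as U+FEFF.

```python
import codecs

def read_upload(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, ignoring one leading BOM if present."""
    return data.decode("utf-8-sig")

with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")

# Both forms decode to the same text.
print(read_upload(with_bom) == read_upload(without_bom))

# Concatenating two BOM-prefixed inputs leaves an interior U+FEFF,
# because only the leading signature is removed.
joined = read_upload(with_bom + with_bom)
print("\ufeff" in joined)
```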

stuartpb commented on June 13, 2024

@ssokolow Sure - we're talking about two different parts of the RFC, which apply to two different scenarios. If your data round-trips encodings without metadata, stripping the BOM is a bug; if it doesn't, keeping it is a bug.
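
The two scenarios could be sketched roughly as follows. This is only an illustration under stated assumptions: `decode_utf8` and its `round_trip` flag are hypothetical names, and the caller is assumed to know which scenario applies.

```python
import codecs

def decode_utf8(data: bytes, round_trip: bool) -> str:
    """Decode UTF-8 bytes under one of the two policies (hypothetical helper).

    round_trip=True: keep a leading BOM as U+FEFF so re-encoding reproduces the input.
    round_trip=False: the encoding is known by other means, so drop the signature.
    """
    if not round_trip and data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

raw = codecs.BOM_UTF8 + b"abc"

# Round-trip scenario: the exact input bytes come back out.
assert decode_utf8(raw, round_trip=True).encode("utf-8") == raw

# Always-UTF-8 scenario: the signature is gone from the decoded text.
assert decode_utf8(raw, round_trip=False) == "abc"
```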

ssokolow commented on June 13, 2024

@stuartpb My point was that "The presence of a BOM anywhere in UTF-8 is a bug." sounded like you might have been misinterpreting the way this case should be tested. When you clarified, it still felt like you might have been misinterpreting something: not what the standard said, but how it applied to the applications in question. I no longer think that.
