Comments (6)
The presence of a BOM anywhere in UTF-8 is a bug.
from big-list-of-naughty-strings.
@stuartpb To quote the link you just cited (emphasis mine):
The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.[3] Byte order has no meaning in UTF-8,[4] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[5][6]
@ssokolow Sure, I'm not saying it shouldn't be included in the corpus - just that it shouldn't be treated as a bug that occurs "by naively concatenating files together" (or that it shouldn't be stripped by anything that isn't expected to round-trip its exact input between encodings). Its root cause is software that adds a BOM to UTF-8, at any point. See the text immediately following your excerpt:
The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[7]
@stuartpb However, that does not apply to many web applications because they accept "plaintext" generated by Microsoft applications or Google Docs.
Even so, Microsoft compilers[9] and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.
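For input like the Notepad or Google Docs exports described above, a minimal Python sketch of tolerant decoding might look like the following. The function name `decode_upload` is made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library. It assumes the bytes are already known to be UTF-8.

```python
import codecs
from typing import Tuple


def decode_upload(raw: bytes) -> Tuple[str, bool]:
    """Decode UTF-8 bytes and report whether a leading BOM was present.

    Hypothetical helper: assumes the caller already knows the data is UTF-8.
    """
    had_bom = raw.startswith(codecs.BOM_UTF8)
    # "utf-8-sig" drops a single leading BOM if there is one and otherwise
    # behaves like plain "utf-8". A U+FEFF later in the data is left alone.
    text = raw.decode("utf-8-sig")
    return text, had_bom
```

Returning the flag alongside the text lets the caller decide later whether to write the BOM back out.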
@ssokolow Sure - we're talking about two different parts of the RFC, which apply to two different scenarios. If your data round-trips encodings without metadata, stripping the BOM is a bug; if it doesn't, keeping it is a bug.
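As a rough sketch of that distinction in Python (the function name `normalize_bom` and the `round_trips` flag are invented here for illustration, not taken from the RFC or from this repo):

```python
import codecs


def normalize_bom(raw: bytes, round_trips: bool) -> bytes:
    """Apply the two-scenario rule to UTF-8 encoded bytes.

    round_trips=True:  the data must survive conversion between encodings
                       unchanged, so a leading BOM is kept.
    round_trips=False: the encoding is fixed or signalled some other way,
                       so a leading BOM is stripped.
    """
    if round_trips:
        return raw
    if raw.startswith(codecs.BOM_UTF8):
        # Remove exactly one leading BOM; anything after it is untouched.
        return raw[len(codecs.BOM_UTF8):]
    return raw
```

Which branch applies is a property of the application, which is why the flag is passed in by the caller instead of being guessed from the data.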
@stuartpb My point was that "The presence of a BOM anywhere in UTF-8 is a bug." sounded like you might have been misinterpreting how this case should be tested. When you clarified, it still felt like you might be misinterpreting, not what the standard says, but how it applies to the applications in question. I no longer think that.