Samlists

You'll never find what you're not looking for, so these wordlists are:

  1. Comprehensive. These wordlists are constructed by analyzing terabytes of data from the biggest data sources around.

  2. Based on RECENT data. Tech evolves fast, so why shouldn't wordlists? The wordlists are based exclusively on data that is at most a year old, and will keep changing as tech changes.

  3. Created with SCIENCE. Using some data science to remove outliers, generally crappy results and much else takes out much of the human element and its biases, giving you far more relevant and language-agnostic results.

  4. Magical. The construction of these wordlists is automagic, meaning that a year from now this GitHub repo will still have up-to-date, high-quality wordlists.

  5. Sorted. Rows are sorted from most likely to least likely to occur, so your chances of finding juicy stuff as fast as possible are much better. This makes the wordlists uniquely suitable when speed AND comprehensiveness are required.

  6. Explainable. Many items in popular wordlists have no basis in real life except for what the author thinks will work. Every row in these wordlists is derived from real data.

Figure: the likelihood of the top parameters in the wordlists makes a beautiful exponential curve, demonstrating that they cleanly follow a distribution.

As you can see, the top 3000 rows in the parameters wordlist map quite beautifully to their likelihood of being found in websites.

Wordlists

Use the mixed-case wordlist unless you are sure your target is case-insensitive. Or don't. I'm a README, not a cop.

| Wordlist name | Size(s) | Description |
| --- | --- | --- |
| sam-cc-parameters-(mixedcase\|lowercase)-all.txt | ~50,000 | HTTP parameter names. Use this to find hidden functionality! Basically what would go in {here} for the URL http://example.com?{here}=value. |
| sam-gh-directories-(mixedcase\|lowercase)-top(size).txt | 1,000; 10,000; 100,000 | Directory names as found in all open-source GitHub repos. Useful for brute-forcing host directories. |
| sam-gh-files-(mixedcase\|lowercase)-top(size).txt | 1,000; 10,000; 100,000; 1,000,000 | File names as found in all open-source GitHub repos. Useful for brute-forcing files, especially blind. |
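
To make the {here} placeholder concrete, here is a minimal Python sketch that turns rows of a parameters wordlist into candidate URLs, most likely names first. The function name `candidate_urls`, the `limit` argument and the sample names are made up for illustration; they are not part of this repo.

```python
from itertools import islice
from urllib.parse import urlencode


def candidate_urls(base_url, wordlist_lines, limit=None, value="value"):
    """Yield one URL per parameter name, keeping the wordlist order.

    The wordlists are sorted most-likely-first, so `limit` simply keeps
    the top rows. `limit=None` means "use every row".
    """
    names = (line.strip() for line in wordlist_lines)
    names = (name for name in names if name)  # skip blank lines
    for name in islice(names, limit):
        # urlencode escapes characters that are not safe in a query string
        yield f"{base_url}?{urlencode({name: value})}"


# Hypothetical sample rows; a real run would iterate over an open wordlist file.
sample_rows = ["id", "page", "", "q"]
for url in candidate_urls("http://example.com", sample_rows, limit=2):
    print(url)
```

Because the function accepts any iterable of lines, passing an open file handle streams the wordlist instead of loading all of it into memory.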

Methodology

The wordlists are created by trawling through huge public datasets. The methods differ a bit depending on how noisy the data source is, but in general they involve:

  1. Deleting duplicate items from the same source (e.g. repo or domain) to allow the final frequency to represent their global frequency as opposed to letting small but repetitive sources dominate.
  2. Pruning items that are too rare to be of general interest based on their rate of occurrence (generally at least 10-100 occurrences).
  3. Using Shannon entropy to remove random values, tokens and UUIDs.
  4. Removing items that are broken due to incorrect encoding and/or decoding.
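
The four steps above can be sketched roughly as follows. This is not the actual pipeline behind these wordlists, just one plausible reading of it. The function names, the `max_entropy` threshold of 3.5 bits per character and the "broken encoding" heuristic are all assumptions, and per-character entropy is a crude signal on short strings, so treat the numbers as placeholders to tune.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(text).values()
    )


def looks_broken(item):
    """Assumed heuristic: U+FFFD or control characters hint at bad decoding."""
    return "\ufffd" in item or any(ord(ch) < 32 for ch in item)


def build_wordlist(observations, min_count=10, max_entropy=3.5):
    """Turn (source, item) pairs into a list sorted by global frequency.

    `source` is e.g. a repo or domain; `item` is e.g. a file name.
    """
    # Step 1: a set keeps each item at most once per source.
    counts = Counter(item for _source, item in set(observations))
    kept = [
        (item, n)
        for item, n in counts.items()
        if n >= min_count                          # step 2: prune rare items
        and shannon_entropy(item) <= max_entropy   # step 3: drop random-looking items
        and not looks_broken(item)                 # step 4: drop encoding damage
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _count in kept]


# Hypothetical toy data, with min_count lowered so the tiny sample survives.
observations = [
    ("repo-a", "index.html"), ("repo-a", "index.html"),
    ("repo-b", "index.html"), ("repo-c", "index.html"),
    ("repo-a", "README.md"), ("repo-b", "README.md"),
    ("repo-a", "x9Kq2LmZ7pTfR4vB"), ("repo-b", "x9Kq2LmZ7pTfR4vB"),
    ("repo-a", "only-once.txt"),
]
print(build_wordlist(observations, min_count=2))
```

Deduplicating on the (source, item) pair is what stops one repetitive repo or domain from inflating an item's count.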

Data sources

The data source is given in the name of each file, to make the files easy to tell apart.

cc = CommonCrawl

gh = GitHub BigQuery Public Dataset
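
If you handle the files in a script, the source code can be read straight from the file name. A small sketch, assuming the `sam-<source>-...` naming shown in the Wordlists table; the function name `data_source` is made up for illustration.

```python
# Mapping taken from the two source codes listed above.
DATA_SOURCES = {
    "cc": "CommonCrawl",
    "gh": "GitHub BigQuery Public Dataset",
}


def data_source(filename):
    """Return the data source encoded in a wordlist file name."""
    parts = filename.split("-")
    if len(parts) < 3 or parts[0] != "sam":
        raise ValueError(f"not a samlists file name: {filename!r}")
    try:
        return DATA_SOURCES[parts[1]]
    except KeyError:
        raise ValueError(f"unknown data source code: {parts[1]!r}") from None


print(data_source("sam-cc-parameters-lowercase-all.txt"))
```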

Contributors

the-xentropy
