Giter Site home page Giter Site logo

allen-oneill / double-agent Goto Github PK

View Code? Open in Web Editor NEW

This project forked from unblocked-web/double-agent

0.0 1.0 1.0 77.75 MB

A test suite of common scraper detection techniques. See how detectable your scraper stack is.

License: MIT License

Shell 0.46% TypeScript 98.18% CSS 0.06% JavaScript 1.30%

double-agent's Introduction

Double Agent is a suite of tools written to allow a scraper engine to test if it is detectable when trying to blend into the most common web traffic.

Structure:

DoubleAgent has been organized into two main layers:

  • /collect: scripts/plugins for collecting browser profiles. Each plugin generates a series of pages to test how a browser behaves.
  • /analyze: scripts/plugins for analyzing browser profiles against verified profiles. Scraper results from collect are compared to legit "profiles" to find discrepancies. These checks are given a Looks Human"โ„ข score, which indicates the likelihood that a scraper would be flagged as bot or human.

The easiest way to use collect is with the collect-controller:

  • /collect-controller: a server that can generate step-by-step assignments for a scraper to run all tests

Plugins

The bulk of the collect and analyze logic has been organized into what we call plugins.

Collect Plugins

Name Description
browser-codecs Collects the audio, video and WebRTC codecs of the browser
browser-dom-environment Collects the browser's DOM environment such as object structure, class inheritance amd key order
browser-fingerprints Collects various browser attributes that can be used to fingerprint a given session
browser-fonts Collects the fonts of the current browser/os.
http-assets Collects the headers used when loading assets such as css, js, and images in a browser
http-basic-headers Collects the headers sent by browser when requesting documents in various contexts
http-websockets Collects the headers used when initializing and facilitating web sockets
http-xhr Collects the headers used by browsers when facilitating XHR requests
http2-session Collects the settings, pings and frames sent across by a browser http2 client
tcp Collects tcp packet values such as window-size and time-to-live
tls-clienthello Collects the TLS clienthello handshake when initiating a secure connection
http-basic-cookies Collects a wide range of cookies configuration options and whether they're settable/gettable

Analyze Plugins

Name Description
browser-codecs Analyzes that the audio, video and WebRTC codecs match the given user agent
browser-dom-environment Analyzes the DOM environment, such as functionality and object structure, match the given user-agent
browser-fingerprints Analyzes whether the browser's fingerprints leak across sessions
http-assets Analyzes http header order, capitalization and default values for common document assets (images, fonts, media, scripts, stylesheet, etc)
http-basic-cookies Analyzes whether cookies are enabled correctly, including same-site and secure
http-basic-headers Analyzes header order, capitalization and default values
http-websockets Analyzes websocket upgrade request header order, capitalization and default values
http-xhr Analyzes header order, capitalization and default values of Xhr requests
http2-session Analyzes http2 session settings and frames
tcp Analyzes tcp packet values, including window-size and time-to-live
tls-clienthello Analyzes clienthello handshake signatures, including ciphers, extensions and version

Scraper Results:

For a dynamic approach to exploring results, visit ScraperReport.

Testing your Scraper:

This project leverages yarn workspaces. To get started, run yarn from the root directory.

If you'd like to test out your scraper stack:

  1. Navigate to the /collect-controller directory and run yarn start. Follow setup directions print onto the console from this command.

  2. The API will return assignments one at a time until all tests have been run. Include a scraper engine you're testing with a query string or header called "scraper". Assignment format can be found at /collect-controller/interfaces/IAssignment.ts.

  3. Once all tests are run, results will be output to the same directory as your scraper engine.

Popular scraper examples can be found in the scraper-report repo.

double-agent's People

Contributors

calebjclark avatar blakebyrnes avatar dependabot[bot] avatar

Watchers

James Cloos avatar

Forkers

allenrnd

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.