Giter Site home page Giter Site logo

harbor_on_apache_hadoop's Introduction

HARBOR_on_Apache_Hadoop

CS61C Spring 2016 Project 5

As part of the larger CHAOS project, the mysterious No Such Agency has effectively wiretapped a huge fraction of the internet, recording every page request and, for web traffic that identifies the user, also recording the username in the reply. Each record includes both a timestamp and the network 4-tuple (source IP, source port, destination IP, destination port). For requests, it includes any tracking cookie seen in the HTTP request, while web replies include any extracted username. Requests also include the full HTTP headers (which include other features), but for this purpose they are simply treated as an opaque blob of data and excluded from the analysis.

Of course, simply recording this data is of little use; it needs to be both analyzed and searched. Your job is to prototype the HARBOR internal database structure using Hadoop. HARBOR performs three primary tasks:

1: It matches requests and replies: When the data includes both a request and reply with the same network 4-tuple with a timestamp that is within 10 seconds, it creates a stream of matched records which include both the request and reply.

2: It writes the output stream of matched request/reply pairs into a set of independent "Query Focused Datasets" (QFDs). In a QFD, a particular key (such as the source IP, destination IP, cookie, or username) is hashed using a cryptographic hash function. Then the lowest 2 bytes are used to select which file to write into, with the resulting file containing both the hash and the data.

This enables efficient searching for All records associated with a username All records associated with a tracking cookie All records associated with a source IP or a destination IP The particular QFD keys we will be using are: cookie: Cookies included in user requests srcIP: Source IP addresses destIP: Destination IP addresses 3: It provides a second tool to create the TOTALFAIL database of Tor users (the torusers QFD). This tool first accesses the QFD which indexes matched users by source IP to get the users who are seen using a set of Tor exit nodes. Using any tracking cookies present in the resulting query, it accesses the tracking cookie QFD to get a list of all users, and then does a query for the usernames associated with those tracking cookies to create the TOTALFAIL QFD.

This is a structure optimized for the particular problem faced by the No Such Agency. They are obtaining a massive amount of data, recording the activity of a huge fraction of people on the planet. Most of this data is completely irrelevant, but it is not known in advance which is relevant or not.

Instead, the No Such Agency desires the capability to easily query "At this IP, who was connected to the network" and "For this person, which IPs and when did he connect to the network."

Query Focused Datasets solve this problem. In a parallel structure like Hadoop, access time is dominated by disk latency and network latency. By splitting the data into buckets, this can be efficiently parallelized (with each bucket potentially written by a separate process) and easily communicated in parallel. Then when any particular entry is needed, only that particular bucket needs to be searched for the resulting value.

Within a bucket we don't bother sorting: since most buckets will never be read, and since accessing a bucket is dominated by the time it takes to load a bucket from disk (rather than to search for an entry within a bucket) it is better from a total cost perspective to not bother sorting the data.

Then the TOTALFAIL analysis is simply taking advantage of this in a parallel structure. With the initial QFDs in place, it's easy to do a query for "all Tor users" based on the IPs used by Tor, and then do a query for "each of these users, what is their activity." Now when a No Such Agency analyst wants to find out more about a particular Tor user, they can easily discover all their activity.

harbor_on_apache_hadoop's People

Contributors

marcuzge avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.