# stream-processor

#### Concept

This started as a project for a course in distributed systems during my master's program. The basic idea is based in part on Apache Storm and Heron.

The design has some core classes that need to be discussed before we get into how it is implemented and works in practice.

###### Coordinator

The Coordinator class is much like a Topology Master in Apache Storm, just a bit watered down. The Coordinator assigns available workers to utilities (think: bolts) and sends an initialization RPC letting each host know who the Coordinator is and which utilities have been assigned to it.

###### AppBase / UtilBase

These two base classes make up our "topology". Every application is derived from AppBase. Applications are straightforward: they simply define a set of utilities to be applied to the dataset. Utilities are derived from UtilBase and are executed in the order they are added. Utilities are Callable objects, which override a thread-safe call() method.
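To make the utility contract concrete, here is a minimal sketch of a filter utility. The real UtilBase API is not shown in this README, so the stand-in base class and the names below (UtilBase, setInput, HashtagFilter) are assumptions based only on the description above (Callable utilities overriding a thread-safe call()):

```java
import java.util.concurrent.Callable;

// Stand-in for the framework's base class (hypothetical shape).
abstract class UtilBase implements Callable<String> {
    protected volatile String input;
    public void setInput(String input) { this.input = input; }
}

// A utility that passes a unit of work through unchanged if it contains
// a keyword, and drops it (returns null) otherwise -- a simple filter "bolt".
class HashtagFilter extends UtilBase {
    private final String tag;
    HashtagFilter(String tag) { this.tag = tag; }

    @Override
    public synchronized String call() { // synchronized: thread-safe as described
        return (input != null && input.contains(tag)) ? input : null;
    }
}
```

A chain of such utilities, applied in the order they were added, is what an application defines.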

###### WorkerController

Every worker in the system has a WorkerController, which acts as a local coordination class for the worker host. This class tracks which application is active, which utility is assigned to this worker, who the Coordinator is, who the next hop is, and where our GateKeeper is.

###### GateKeeper

The GateKeeper at each worker is the entry and exit point for all streams of data. Units of work are enqueued into the GateKeeper, which applies the assigned utilities to the stream. Once a utility has been applied, the GateKeeper forwards the result to the next worker for further processing.
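The enqueue / apply / forward cycle can be sketched as below. This is illustrative only: the real GateKeeper forwards over Kryonet connections, while here the next hop is modeled as a Consumer so the flow is runnable in isolation, and all names are assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;
import java.util.function.Function;

class GateKeeperSketch {
    private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
    private final Function<String, String> utility; // the assigned utility
    private final Consumer<String> nextHop;         // forward to next worker

    GateKeeperSketch(Function<String, String> utility, Consumer<String> nextHop) {
        this.utility = utility;
        this.nextHop = nextHop;
    }

    // Stream units of work are enqueued as they arrive.
    void enqueue(String unit) { inbox.add(unit); }

    // Drain the queue: apply the utility, forward non-null results downstream.
    void drain() {
        String unit;
        while ((unit = inbox.poll()) != null) {
            String result = utility.apply(unit);
            if (result != null) nextHop.accept(result);
        }
    }
}
```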

*(diagram)*

###### General Flow / Idea

Suppose you have a massive set of data, perhaps a 100GB store of tweets from around the world. Your goal is to find all tweets that satisfy all of the following conditions: (1) the tweet contains #PresidentialDebate, (2) the tweet is from someone with more than 5,000 followers, and (3) the tweet has been re-tweeted more than 500 times.

Once the proper utilities are defined (see UtilBase and its derivations), here's a high-level view of what this system does:

  • The Coordinator initializes the topology, and assigns utilities to each available Worker
  • The Coordinator starts streaming in the dataset, and connects the stream to the endpoint of the next Worker in the chain
  • Supposing the utilities were assigned in order (optional), the first Worker will analyze the stream as it flows through it, and remove all tweets that do not contain #PresidentialDebate. It will then connect the stream output to the next Worker in the chain.
  • The next Worker will analyze the stream as it flows through it, and remove all tweets that were not from users with more than 5,000 followers. It will then connect the stream output to the next Worker in the chain.
  • The final Worker will analyze the stream as it flows through it, and remove all tweets that haven't been re-tweeted more than 500 times.
  • Finally, the last Worker in the chain will return the results to the Coordinator for summarization.
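The steps above amount to a three-stage filter chain. As a local, single-process illustration (in the real system each predicate would be a utility running on its own Worker; the record type and field names here are assumptions for the example):

```java
import java.util.List;
import java.util.stream.Collectors;

class PipelineSketch {
    // Hypothetical record standing in for a unit of work in the stream.
    record Tweet(String text, int followers, int retweets) {}

    static List<Tweet> run(List<Tweet> stream) {
        return stream.stream()
                .filter(t -> t.text().contains("#PresidentialDebate")) // Worker 1
                .filter(t -> t.followers() > 5_000)                    // Worker 2
                .filter(t -> t.retweets() > 500)                       // Worker 3
                .collect(Collectors.toList());                         // Coordinator summarizes
    }
}
```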

Much of the above is fully customizable through configuration settings, utility definitions and assignments, and how you want your summarization of the data to appear. Additionally, Workers do not have to run strictly in serial: with fairly small code changes, a single stream source can feed two Workers simultaneously, with a consolidation node that merges the results. This isn't fully implemented, but the framework doesn't need much modification to support it.

Note also that these aren't distinct operations; they happen in parallel. To put it another way, the Coordinator could be accumulating results from the last Worker in the chain while it's still streaming in data from the source. Indeed, the stream flows through the topology.

##### Files needed to run the project:

  • UX.jar -- main jar containing project compiled code
  • config.properties -- properties file containing the configuration parameters of the system
  • kryonet-2.21-all.jar -- network lib for managing connections and serializations

##### Configuration properties:
[see the config.properties for details]
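For orientation, a properties file for a system like this typically names the Coordinator endpoint and the worker pool. The keys below are purely hypothetical; the actual key names and values are documented in the shipped config.properties:

```properties
# Hypothetical keys for illustration only -- consult the real config.properties.
coordinator.host=127.0.0.1
coordinator.port=54555
worker.count=3
```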

##### Running the application:

  • From the command prompt, run: `java -jar UX.jar`
