Giter Site home page Giter Site logo

sessionization-demo's Introduction

This project gives 2 solutions to solve sessionization problem with Apache Spark.

Sessionization

Definition

A sessionization problem tries to group events related to one specific entity (user, device, ...). It can be considered as a window accumulating all activity for given entity, followed either by a period of inactivity or an explicit event closing the window. For this sample application to keep the overall code simple enough, we only deal with the inactivity period. Feel free to add the logic closing the session with an explicit action.

For example a user made the following actions: ("click", "2019-01-31 20:00"), ("link hover", "2019-01-31 20:03"), ("click", "2019-01-31 20:06"), ("click", "2019-01-31 20:10"). We generate the session after "browser_close" action or 10 minutes of inactivity. In that case, we compute the session at 20:20 and the session itself has the following format: {"user_id": ..., "actions": ["click", "link hover", "click", "click"], "duration_sec": 600, "end_reason": "inactivity"}.

Example

The example is composed of the data generated by my data generator which simulates users visiting a website. An example of the generated event looks like:

{
 "user": {"latitude": -27.0782, "longitude": -65.9743, "ip": "71.106.102.164"}, 
 "source": {"site": "mysite.com", "api_version": "v3"}, 
 "user_id": 342290, 
 "page": {"previous": "article category 2-10", "current": "category 18"}, 
 "visit_id": "2f8340bc3f3247298162305bd1979ff5", 
 "event_time": "2019-10-10T15:06:45+00:00", 
 "technical": {"network": "adsl", "lang": "en", 
  "device": {"type": "tablet", "version": "Samsung Galaxy Tab S3"}, 
  "os": "Android 8.1", "browser": "Mozilla Firefox 53",
  }
}

Solutions

Streaming

To solve a sessionization problem with streaming, we can use different patterns but the most common one is stateful processing. Depending on our final session, we'll accumulate the events in some state, detect the termination action and generate the output. The state can have different format. It can be stored in an external data store with a fast access (in-memory cache, K/V store with milliseconds latency). The state can also be persisted in the main memory of the application and be backed up in a checkpoint.

Batch

If higher latency is allowed, batch approach can also be used to solve sessionization problems. The idea is to store all user actions on a distributed and scalable file system (S3, GCS, ...) and generate sessions at regular interval, ideally from an orchestrator. The key point here is to guarantee sequentiality of the execution. It means that the generation for H can only start if the generation for H-1 terminated correctly. To increase the performance, files should be partitioned by event time. Thanks to that, we won't need to filter out the events happened during last 24 hours for the last generation of the day.

Tests

To test the application you have to generate some test data. Both streaming and batch versions use my data generator simulating user visits on a website.

Streaming

  1. Go to your working directory
  2. git clone [email protected]:bartosz25/data-generator.git
  3. Follow the README from https://github.com/bartosz25/data-generator/tree/master/examples/kafka
  4. Start streaming application com.waitingforcode.streaming.Application "/tmp/sessions-demo" "/home/bartosz/workspace/sessionization-demo/configuration/kafka_configuration.json", an IntelliJ's example: IntelliJ configuration for streaming

Batch

  1. Go to your working directory
  2. git clone [email protected]:bartosz25/data-generator.git
  3. Follow the README from https://github.com/bartosz25/data-generator/tree/master/examples/local_filesystem
  4. Start batch application com.waitingforcode.batch.Application /home/bartosz/tmp/test_generator/2019/08/11/09 "" "2019-08-11 09" "/tmp/output/2019/08/11/09" where "/home/bartosz/tmp/test_generator/2019/08/11/09" is the input directory, "" previous session's directory and "2019-08-11 09" processed time IntelliJ configuration for batch

Why Maven?

I experienced that sbt doesn't work perfect in every configuration, like the one behind a corporate proxy. To simplify your tests, I decided to use more classical Maven builder tool - even though I agree that XML is a little bit heavy for that.

Further reading

sessionization-demo's People

Contributors

bartosz25 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.