guavusproject
=============

Problem description

We have TSV files with the following structure:

service_type (String) subscriber_id (Long) service_name (String) timestamp (yyyyMMddhhmm)

Using Hadoop, we first want to calculate the number of unique timestamps recorded by every subscriber, per service_type, service_name, subscriber_id, and day (yyyyMMdd).

Using this result, the next steps are to sort the rows by count and then rank them. For the ranking, a column must be added that contains a fraction with the rank as the numerator and the total number of subscribers in the segment as the denominator. See the example below.

Example

service_type (String) subscriber_id (Long) service_name (String) timestamp (yyyyMMddhhmm)
gds1 1236 browsing 201307051051
gds1 1236 browsing 201307051051
gds1 1235 browsing 201307051050
gds1 1236 browsing 201307051055
gds1 1235 browsing 201307061051
gds1 1233 browsing 201307061050
gds1 1233 browsing 201307061051
gds1 1233 browsing 201307061052
gds1 1234 browsing 201307061051
gds1 1234 browsing 201307061052
gds1 1237 browsing 201307061052
gds1 1237 browsing 201307061053

Implementation description

  • First of all, I removed all duplicate log lines using DistinctInfoJob. (The example contains duplicate lines; if you are sure that the input files contain no duplicates, you can skip this step.) DistinctInfoJob has a very simple mapper that emits every line as a key and 1 as a value. During grouping, all duplicate lines are then grouped together and passed to the reducer, which emits one key per group along with the size of the group. (A sketch of the mapper and reducer follows the result table below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) timestamp (yyyyMMddhhmm) duplicate count
gds1 1233 browsing 201307061050 1
gds1 1233 browsing 201307061051 1
gds1 1233 browsing 201307061052 1
gds1 1234 browsing 201307061051 1
gds1 1234 browsing 201307061052 1
gds1 1235 browsing 201307051050 1
gds1 1235 browsing 201307061051 1
gds1 1236 browsing 201307051051 2
gds1 1236 browsing 201307051055 1
gds1 1237 browsing 201307061052 1
gds1 1237 browsing 201307061053 1
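
A minimal sketch of what DistinctInfoJob might look like with the Hadoop mapreduce API. Only the job name comes from this write-up; the nested class layout, names, and other details are my assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctInfoJob {

    // Emits the whole log line as the key and 1 as the value, so that
    // identical lines meet in the same reduce() call.
    public static class DistinctMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, ONE);
        }
    }

    // Emits each distinct line exactly once, with its duplicate count.
    public static class DistinctReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text line, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : ones) {
                count += one.get();
            }
            context.write(line, new IntWritable(count));
        }
    }
}
```
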
  • The second step is counting the number of unique timestamps recorded by every subscriber per service_type, service_name, subscriber_id, day (yyyyMMdd). This is done by CountJob. The map function emits service_type, service_name, subscriber_id, and the first 8 characters of the timestamp as a key, and 1 as a value. The reduce function receives groups of lines with identical service_type, service_name, subscriber_id, day (yyyyMMdd) and emits the key with the corresponding group size. (A sketch follows the note below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) day (yyyyMMdd) count
gds1 1233 browsing 20130706 3
gds1 1234 browsing 20130706 2
gds1 1235 browsing 20130705 1
gds1 1235 browsing 20130706 1
gds1 1236 browsing 20130705 2
gds1 1237 browsing 20130706 2

Note that these two steps could have been combined by eliminating duplicates in the reduce function of CountJob. That might be faster; however, it would require storing whole groups in memory, and since the amount of data may be very large, this might not be possible.
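
A sketch of CountJob under the same caveats as above. It assumes the TSV column order from the problem statement; since duplicates were already removed, every input line stands for one distinct timestamp, and the duplicate-count column appended by DistinctInfoJob is simply ignored:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CountJob {

    // Key: service_type, subscriber_id, service_name, day (first 8 chars
    // of the timestamp). Value: 1 per distinct timestamp line.
    public static class CountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            String day = f[3].substring(0, 8);   // yyyyMMddhhmm -> yyyyMMdd
            outKey.set(f[0] + "\t" + f[1] + "\t" + f[2] + "\t" + day);
            context.write(outKey, ONE);
        }
    }

    // The group size equals the number of unique timestamps for this
    // subscriber, service and day.
    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : ones) {
                count += one.get();
            }
            context.write(key, new IntWritable(count));
        }
    }
}
```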

  • The next part of the problem is to sort the data within the groups by count. For that we need a secondary sort. I created a new combined key type, Pair, where the key is the group_key and the value is the count. Additionally, a PairGrouping comparator was implemented so that Hadoop still groups the data by group_key, and the partitioner partitions the data by group_key as well. Finally, two comparators, PairIncComparator and PairDescComparator, sort the data within groups by count in increasing and decreasing order, respectively. (A sketch of these classes follows.)
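
The class names Pair, PairGrouping and PairIncComparator come from this write-up; their bodies below are a sketch of how such a secondary sort is typically wired up in Hadoop, and PairPartitioner is a name I have assumed for the partitioner. Each class would normally live in its own source file:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key for the secondary sort. groupKey holds
// <service_type, service_name, day>; count is what we sort on.
public class Pair implements WritableComparable<Pair> {
    private final Text groupKey = new Text();
    private long count;

    public void set(String key, long count) {
        this.groupKey.set(key);
        this.count = count;
    }

    public Text getGroupKey() { return groupKey; }
    public long getCount()    { return count; }

    @Override
    public void write(DataOutput out) throws IOException {
        groupKey.write(out);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        groupKey.readFields(in);
        count = in.readLong();
    }

    // Natural order: group key first, then count ascending.
    @Override
    public int compareTo(Pair other) {
        int cmp = groupKey.compareTo(other.groupKey);
        return cmp != 0 ? cmp : Long.compare(count, other.count);
    }
}

// Groups records by group key only, so all counts of one segment
// arrive in a single reduce() call despite the composite key.
public class PairGrouping extends WritableComparator {
    public PairGrouping() { super(Pair.class, true); }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Pair) a).getGroupKey().compareTo(((Pair) b).getGroupKey());
    }
}

// Sorts within a group by count in increasing order; PairDescComparator
// would be identical with the count comparison reversed.
public class PairIncComparator extends WritableComparator {
    public PairIncComparator() { super(Pair.class, true); }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Pair p1 = (Pair) a;
        Pair p2 = (Pair) b;
        int cmp = p1.getGroupKey().compareTo(p2.getGroupKey());
        return cmp != 0 ? cmp : Long.compare(p1.getCount(), p2.getCount());
    }
}

// Partitions by group key only, so one segment never spans reducers.
public class PairPartitioner extends Partitioner<Pair, Text> {
    @Override
    public int getPartition(Pair key, Text value, int numPartitions) {
        return (key.getGroupKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```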

  • Besides sorting, the data within the groups should be ranked. In addition to the ranking, the total number of subscribers in the segment must be calculated to form the fraction in the last column. Here, too, we have a choice between an additional MapReduce job and storing whole groups in memory to compute the total. Once again I chose an additional MapReduce job, IndexJob. Its mapper emits Pair combined keys of <service_type, service_name, day> and count, with the original line as the value. PairIncComparator is then used to sort by count after grouping by <service_type, service_name, day> with PairGrouping. The reduce function receives the sorted groups and emits the original lines with an increasing index. Note that the last element of each group carries the highest index, which is, in fact, the size of the group. (A reducer sketch follows the result table below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) day (yyyyMMdd) count index
gds1 1235 browsing 20130705 1 1
gds1 1236 browsing 20130705 2 2
gds1 1235 browsing 20130706 1 1
gds1 1234 browsing 20130706 2 2
gds1 1237 browsing 20130706 2 3
gds1 1233 browsing 20130706 3 4
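
Only the reduce step of IndexJob is sketched here. It assumes the mapper has packed <service_type, service_name, day> and the count into the Pair key and passed the full CountJob output line as the value; the class name IndexReducer is an assumption:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values arrive sorted by count (ascending) thanks to PairIncComparator;
// each line gets a 1-based index within its group appended.
public class IndexReducer extends Reducer<Pair, Text, Text, NullWritable> {
    @Override
    protected void reduce(Pair key, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        long index = 0;
        for (Text line : lines) {
            index++;
            context.write(new Text(line + "\t" + index), NullWritable.get());
        }
    }
}
```
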
  • The last step is ranking the data and forming the fraction column. This is done by RankJob. Its mapper emits Pair combined keys of <service_type, service_name, day> and the previously assigned index, with the original line as the value. PairDescComparator is then used to reverse-sort by index. Since the indexes were assigned after sorting the data in increasing order, the data is now also reverse-sorted by count. Note that the index of the first element of each group now indicates the size of the group. The reducer takes that first index as the denominator and computes the rank of each element as the numerator of the fraction. Finally, the reducer emits the original line together with the formed fraction. (A reducer sketch follows the result table below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) day (yyyyMMdd) count rank
gds1 1236 browsing 20130705 2 1/2
gds1 1235 browsing 20130705 1 2/2
gds1 1233 browsing 20130706 3 1/4
gds1 1237 browsing 20130706 2 2/4
gds1 1234 browsing 20130706 2 2/4
gds1 1235 browsing 20130706 1 4/4
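
Again, only the reduce step is sketched. The tie handling (equal counts share a rank, as subscribers 1234 and 1237 do above) is inferred from the example output; the class name RankReducer and the column positions are assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values arrive sorted by index (descending) via PairDescComparator,
// so the first value's index is the group size (the denominator).
public class RankReducer extends Reducer<Pair, Text, Text, NullWritable> {
    @Override
    protected void reduce(Pair key, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        long total = -1;      // denominator: index of the first (largest) element
        long position = 0;    // 1-based position in descending order
        long rank = 0;        // competition rank: ties keep the previous rank
        long prevCount = -1;

        for (Text value : lines) {
            String[] f = value.toString().split("\t");
            long count = Long.parseLong(f[4]);   // count column
            long index = Long.parseLong(f[5]);   // index column from IndexJob
            if (total < 0) {
                total = index;                   // group size
            }
            position++;
            if (count != prevCount) {
                rank = position;
                prevCount = count;
            }
            // Drop the helper index column; keep the original fields + fraction.
            String original = String.join("\t", f[0], f[1], f[2], f[3], f[4]);
            context.write(new Text(original + "\t" + rank + "/" + total),
                    NullWritable.get());
        }
    }
}
```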

Running the jobs

The commands for running the jobs are in the script file.
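
The script itself is not reproduced here; the invocations presumably chain the four jobs, roughly like this (the jar name and HDFS paths are placeholders, not the project's actual values):

```sh
# Placeholder jar name and paths; the real commands are in the script file.
hadoop jar guavusproject.jar DistinctInfoJob /logs/input    /logs/distinct
hadoop jar guavusproject.jar CountJob        /logs/distinct /logs/counts
hadoop jar guavusproject.jar IndexJob        /logs/counts   /logs/indexed
hadoop jar guavusproject.jar RankJob         /logs/indexed  /logs/ranked
```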
