guavusproject
=============

Problem description

We have TSV files with the following structure:

service_type (String) subscriber_id (Long) service_name (String) timestamp (yyyyMMddhhmm)

Using Hadoop, we first want to calculate the number of unique timestamps recorded by every subscriber, per service_type, service_name, subscriber_id, and day (yyyyMMdd).

Using this result, the next steps are to sort the rows by count and then rank them. For the ranking, a column must be added that contains a fraction with the rank as the numerator and the total number of subscribers in the segment as the denominator. See the example below.

Example

service_type (String) subscriber_id (Long) service_name (String) timestamp (yyyyMMddhhmm)
gds1 1236 browsing 201307051051
gds1 1236 browsing 201307051051
gds1 1235 browsing 201307051050
gds1 1236 browsing 201307051055
gds1 1235 browsing 201307061051
gds1 1233 browsing 201307061050
gds1 1233 browsing 201307061051
gds1 1233 browsing 201307061052
gds1 1234 browsing 201307061051
gds1 1234 browsing 201307061052
gds1 1237 browsing 201307061052
gds1 1237 browsing 201307061053

Implementation description

  • First of all, I removed all duplicate log lines using DistinctInfoJob. (The example contains duplicate lines; if you are sure that the input files contain no duplicates, you can skip this step.) DistinctInfoJob has a very simple mapper that emits every line as a key and 1 as a value. During grouping, all duplicate lines are then grouped together and passed to the reducer, which emits one key per group along with the size of the group. (A sketch of the mapper and reducer follows the result table below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) timestamp (yyyyMMddhhmm) duplicate count
gds1 1233 browsing 201307061050 1
gds1 1233 browsing 201307061051 1
gds1 1233 browsing 201307061052 1
gds1 1234 browsing 201307061051 1
gds1 1234 browsing 201307061052 1
gds1 1235 browsing 201307051050 1
gds1 1235 browsing 201307061051 1
gds1 1236 browsing 201307051051 2
gds1 1236 browsing 201307051055 1
gds1 1237 browsing 201307061052 1
gds1 1237 browsing 201307061053 1
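
A minimal sketch of what DistinctInfoJob might look like with the Hadoop mapreduce API. Only the job name comes from this write-up; the nested class layout, names, and other details are my assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctInfoJob {

    // Emits the whole log line as the key and 1 as the value, so that
    // identical lines meet in the same reduce() call.
    public static class DistinctMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, ONE);
        }
    }

    // Emits each distinct line exactly once, with its duplicate count.
    public static class DistinctReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text line, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : ones) {
                count += one.get();
            }
            context.write(line, new IntWritable(count));
        }
    }
}
```
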
  • The second step is counting the number of unique timestamps recorded by every subscriber per service_type, service_name, subscriber_id, day (yyyyMMdd). This is done by CountJob. The map function emits service_type, service_name, subscriber_id, and the first 8 characters of the timestamp as a key, and 1 as a value. The reduce function receives groups of lines with identical service_type, service_name, subscriber_id, day (yyyyMMdd) and emits the key with the corresponding group size. (A sketch follows the note below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) day (yyyyMMdd) count
gds1 1233 browsing 20130706 3
gds1 1234 browsing 20130706 2
gds1 1235 browsing 20130705 1
gds1 1235 browsing 20130706 1
gds1 1236 browsing 20130705 2
gds1 1237 browsing 20130706 2

Note that these two steps could have been combined by eliminating duplicates in the reduce function of CountJob. That might be faster; however, it would require storing whole groups in memory, and since the amount of data may be very large, this might not be possible.
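
A sketch of CountJob under the same caveats as above. It assumes the TSV column order from the problem statement; since duplicates were already removed, every input line stands for one distinct timestamp, and the duplicate-count column appended by DistinctInfoJob is simply ignored:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CountJob {

    // Key: service_type, subscriber_id, service_name, day (first 8 chars
    // of the timestamp). Value: 1 per distinct timestamp line.
    public static class CountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            String day = f[3].substring(0, 8);   // yyyyMMddhhmm -> yyyyMMdd
            outKey.set(f[0] + "\t" + f[1] + "\t" + f[2] + "\t" + day);
            context.write(outKey, ONE);
        }
    }

    // The group size equals the number of unique timestamps for this
    // subscriber, service and day.
    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : ones) {
                count += one.get();
            }
            context.write(key, new IntWritable(count));
        }
    }
}
```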

  • The next part of the problem is to sort the data within the groups by count. For that we need a secondary sort. I created a new combined key type, Pair, where the key is the group_key and the value is the count. Additionally, a PairGrouping comparator was implemented so that Hadoop still groups the data by group_key, and the partitioner partitions the data by group_key as well. Finally, two comparators, PairIncComparator and PairDescComparator, sort the data within groups by count in increasing and decreasing order, respectively. (A sketch of these classes follows.)
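
The class names Pair, PairGrouping and PairIncComparator come from this write-up; their bodies below are a sketch of how such a secondary sort is typically wired up in Hadoop, and PairPartitioner is a name I have assumed for the partitioner. Each class would normally live in its own source file:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key for the secondary sort. groupKey holds
// <service_type, service_name, day>; count is what we sort on.
public class Pair implements WritableComparable<Pair> {
    private final Text groupKey = new Text();
    private long count;

    public void set(String key, long count) {
        this.groupKey.set(key);
        this.count = count;
    }

    public Text getGroupKey() { return groupKey; }
    public long getCount()    { return count; }

    @Override
    public void write(DataOutput out) throws IOException {
        groupKey.write(out);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        groupKey.readFields(in);
        count = in.readLong();
    }

    // Natural order: group key first, then count ascending.
    @Override
    public int compareTo(Pair other) {
        int cmp = groupKey.compareTo(other.groupKey);
        return cmp != 0 ? cmp : Long.compare(count, other.count);
    }
}

// Groups records by group key only, so all counts of one segment
// arrive in a single reduce() call despite the composite key.
public class PairGrouping extends WritableComparator {
    public PairGrouping() { super(Pair.class, true); }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Pair) a).getGroupKey().compareTo(((Pair) b).getGroupKey());
    }
}

// Sorts within a group by count in increasing order; PairDescComparator
// would be identical with the count comparison reversed.
public class PairIncComparator extends WritableComparator {
    public PairIncComparator() { super(Pair.class, true); }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Pair p1 = (Pair) a;
        Pair p2 = (Pair) b;
        int cmp = p1.getGroupKey().compareTo(p2.getGroupKey());
        return cmp != 0 ? cmp : Long.compare(p1.getCount(), p2.getCount());
    }
}

// Partitions by group key only, so one segment never spans reducers.
public class PairPartitioner extends Partitioner<Pair, Text> {
    @Override
    public int getPartition(Pair key, Text value, int numPartitions) {
        return (key.getGroupKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```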

  • Besides sorting, the data within the groups should be ranked. In addition to the ranking, the total number of subscribers in the segment must be calculated to form the fraction in the last column. Here, too, we have a choice between an additional MapReduce job and storing whole groups in memory to compute the total. Once again I chose an additional MapReduce job, IndexJob. Its mapper emits Pair combined keys of <service_type, service_name, day> and count, with the original line as the value. PairIncComparator is then used to sort by count after grouping by <service_type, service_name, day> with PairGrouping. The reduce function receives the sorted groups and emits the original lines with an increasing index. Note that the last element of each group carries the highest index, which is, in fact, the size of the group. (A reducer sketch follows the result table below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) day (yyyyMMdd) count index
gds1 1235 browsing 20130705 1 1
gds1 1236 browsing 20130705 2 2
gds1 1235 browsing 20130706 1 1
gds1 1234 browsing 20130706 2 2
gds1 1237 browsing 20130706 2 3
gds1 1233 browsing 20130706 3 4
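
Only the reduce step of IndexJob is sketched here. It assumes the mapper has packed <service_type, service_name, day> and the count into the Pair key and passed the full CountJob output line as the value; the class name IndexReducer is an assumption:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values arrive sorted by count (ascending) thanks to PairIncComparator;
// each line gets a 1-based index within its group appended.
public class IndexReducer extends Reducer<Pair, Text, Text, NullWritable> {
    @Override
    protected void reduce(Pair key, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        long index = 0;
        for (Text line : lines) {
            index++;
            context.write(new Text(line + "\t" + index), NullWritable.get());
        }
    }
}
```
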
  • The last step is ranking the data and forming the fraction column. This is done by RankJob. Its mapper emits Pair combined keys of <service_type, service_name, day> and the previously assigned index, with the original line as the value. PairDescComparator is then used to reverse-sort by index. Since the indexes were assigned after sorting the data in increasing order, the data is now also reverse-sorted by count. Note that the index of the first element of each group now indicates the size of the group. The reducer takes that first index as the denominator and computes the rank of each element as the numerator of the fraction. Finally, the reducer emits the original line together with the formed fraction. (A reducer sketch follows the result table below.)

The result for the above example would be:

service_type (String) subscriber_id (Long) service_name (String) day (yyyyMMdd) count rank
gds1 1236 browsing 20130705 2 1/2
gds1 1235 browsing 20130705 1 2/2
gds1 1233 browsing 20130706 3 1/4
gds1 1237 browsing 20130706 2 2/4
gds1 1234 browsing 20130706 2 2/4
gds1 1235 browsing 20130706 1 4/4
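
Again, only the reduce step is sketched. The tie handling (equal counts share a rank, as subscribers 1234 and 1237 do above) is inferred from the example output; the class name RankReducer and the column positions are assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values arrive sorted by index (descending) via PairDescComparator,
// so the first value's index is the group size (the denominator).
public class RankReducer extends Reducer<Pair, Text, Text, NullWritable> {
    @Override
    protected void reduce(Pair key, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        long total = -1;      // denominator: index of the first (largest) element
        long position = 0;    // 1-based position in descending order
        long rank = 0;        // competition rank: ties keep the previous rank
        long prevCount = -1;

        for (Text value : lines) {
            String[] f = value.toString().split("\t");
            long count = Long.parseLong(f[4]);   // count column
            long index = Long.parseLong(f[5]);   // index column from IndexJob
            if (total < 0) {
                total = index;                   // group size
            }
            position++;
            if (count != prevCount) {
                rank = position;
                prevCount = count;
            }
            // Drop the helper index column; keep the original fields + fraction.
            String original = String.join("\t", f[0], f[1], f[2], f[3], f[4]);
            context.write(new Text(original + "\t" + rank + "/" + total),
                    NullWritable.get());
        }
    }
}
```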

Running the jobs

The commands for running the jobs are in the script file.
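
The script itself is not reproduced here; the invocations presumably chain the four jobs, roughly like this (the jar name and HDFS paths are placeholders, not the project's actual values):

```sh
# Placeholder jar name and paths; the real commands are in the script file.
hadoop jar guavusproject.jar DistinctInfoJob /logs/input    /logs/distinct
hadoop jar guavusproject.jar CountJob        /logs/distinct /logs/counts
hadoop jar guavusproject.jar IndexJob        /logs/counts   /logs/indexed
hadoop jar guavusproject.jar RankJob         /logs/indexed  /logs/ranked
```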
