Data Engineering challenge
Input/Output files
Input:
data/clicks/*.json
data/impressions/*.json
Output:
data/output/metrics.json
data/output/recommendations.json
data/output/recommendations.json/_SUCCESS
data/output/recommendations.json/part-00000-b70f97a5-94b4-4dde-8460-8b55ab6ab8a9-c000.json
- Scala's standard JSON library is deprecated, so I experimented with a few of the most popular alternatives. I looked for two things: that the library be lightweight, and that it let me easily parse a JSON file into a Scala object and vice versa. After some experimentation I opted to use jackson.
- Created JSONETL to handle the reading and parsing of the JSON files.
  - readJSON(inputDirectory: String)
    - Returns (List[String], List[String]), a list of the click and impression JSON files respectively. inputDirectory should be the folder containing clicks/ and impressions/. Using the above folder structure as an example, inputDirectory should be "data/". Note: This could be improved by receiving inputDirectory: File instead, which would avoid errors in the String path.
  - parseJSONintoList(stringJSON: String)
    - Parses a JSON file in string format into a Scala object, in this case List[Map[String, Object]].
  - convertToJSON(result: Any)
    - Converts Any into Jackson's ArrayNode. Note: It would have been better to specify the type (Iterable[Map[String, Object]]) to ensure type safety.
  - write(result: ArrayNode, path: String = "data/output/metrics.json")
    - Takes a Jackson ArrayNode and writes it in JSON format.
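The note on readJSON suggests taking a File instead of a String path. A minimal sketch of that variant follows, using only the JDK. It assumes each returned String holds the contents of one file and that the JSON files sit directly under clicks/ and impressions/. The object name ReadJsonSketch and the helper readJsonFiles are hypothetical and not part of the project.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object ReadJsonSketch {
  // Hypothetical helper: reads every *.json file directly under `dir`, sorted by file name.
  private def readJsonFiles(dir: File): List[String] = {
    // listFiles returns null when `dir` is missing or is not a directory.
    val files = Option(dir.listFiles()).map(_.toList).getOrElse(Nil)
    files
      .filter(f => f.isFile && f.getName.endsWith(".json"))
      .sortBy(_.getName)
      .map(f => new String(Files.readAllBytes(f.toPath), StandardCharsets.UTF_8))
  }

  // Assumed variant of readJSON: a File parameter removes the need to
  // concatenate path strings such as "data/" + "clicks/".
  def readJSON(inputDirectory: File): (List[String], List[String]) =
    (
      readJsonFiles(new File(inputDirectory, "clicks")),
      readJsonFiles(new File(inputDirectory, "impressions"))
    )
}
```

With the folder structure above, this would be called as ReadJsonSketch.readJSON(new File("data")). A missing subfolder yields an empty list rather than an exception, which is a design choice of this sketch and not necessarily how JSONETL behaves.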
- Created ClicksImpressionsTransformer to handle the metric calculations.
  - calculateMetrics receives the list of clicks and impressions and calculates the corresponding metrics.
  - convertClicksAndImpToScala converts the lists of clicks and impressions to a List per element.
  - reduceAndRenameClicksAndImp reduces the Lists into a single list.
  - groupAndJoinClicksAndImpressions joins clicks and impressions and calculates the metrics.
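The group-and-join step can be illustrated with plain Scala collections. This is only a sketch under stated assumptions: the record shapes (Impression, Click, Metrics), their field names, and the choice of metrics (impression count, click count, summed revenue per app and country) are hypothetical, since this README does not list them, and impression ids are assumed to be unique.

```scala
object MetricsSketch {
  // Hypothetical record shapes; the real field names are not given in this README.
  final case class Impression(id: String, appId: Int, countryCode: String)
  final case class Click(impressionId: String, revenue: Double)
  final case class Metrics(appId: Int, countryCode: String, impressions: Int, clicks: Int, revenue: Double)

  def calculateMetrics(impressions: List[Impression], clicks: List[Click]): List[Metrics] = {
    // Index clicks by the impression they refer to (the join key).
    val clicksByImpression: Map[String, List[Click]] = clicks.groupBy(_.impressionId)

    impressions
      .groupBy(i => (i.appId, i.countryCode))
      .map { case ((appId, countryCode), group) =>
        // Left join: impressions without clicks contribute an empty list.
        val joined = group.flatMap(i => clicksByImpression.getOrElse(i.id, Nil))
        Metrics(appId, countryCode, group.size, joined.size, joined.map(_.revenue).sum)
      }
      .toList
      .sortBy(m => (m.appId, m.countryCode)) // deterministic output order
  }
}
```

Groups with no clicks still appear in the result with a click count of 0 and a revenue of 0.0, because the grouping starts from the impressions rather than from the clicks.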
- Trait Mapper contains an ObjectMapper object used to map Scala objects into JSON and vice versa.
- SparkInstance contains the SparkSession object used for objective #3.
  - generateRecommendations generates the recommendations based on the metrics from the previous objective.
  - write writes the DataFrame in Spark's JSON format.