This repository contains the skeleton for solving basic Spark exercises.
- Load the data found in data/web_events.log into an RDD. Note that each line contains one entry and that ; is used as the field separator. Each line contains:
sourceHost, timestamp, method, URL, HTTPCode
- Obtain the number of events per host.
- Obtain the number of events per HTTPCode.
- Determine the number of different hosts.
Notes
- Use a case class WebEvent to represent each line.
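The steps above can be sketched as follows. This is a minimal, hedged example: the field names of WebEvent and the column order follow the description above, and local-mode execution is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Case class matching the columns listed above:
// sourceHost, timestamp, method, URL, HTTPCode
case class WebEvent(sourceHost: String, timestamp: String,
                    method: String, url: String, httpCode: Int)

object WebEventsExercise {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("web-events").setMaster("local[*]"))

    // Parse each ';'-separated line into a WebEvent
    val events = sc.textFile("data/web_events.log").map { line =>
      val f = line.split(";")
      WebEvent(f(0), f(1), f(2), f(3), f(4).trim.toInt)
    }
    events.cache()  // reused by several actions below

    // Number of events per host
    val eventsPerHost = events.map(e => (e.sourceHost, 1)).reduceByKey(_ + _)

    // Number of events per HTTP code
    val eventsPerCode = events.map(e => (e.httpCode, 1)).reduceByKey(_ + _)

    // Number of different hosts
    val distinctHosts = events.map(_.sourceHost).distinct().count()

    eventsPerHost.collect().foreach(println)
    eventsPerCode.collect().foreach(println)
    println(s"Distinct hosts: $distinctHosts")

    sc.stop()
  }
}
```

The cache() call avoids re-reading and re-parsing the log file for each of the three actions.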
- Load the data found in data/auth_events.log into an RDD. Each line contains the following elements.
timestamp, sourceHost, Process, Message
- Obtain the number of events per host.
- Obtain the number of events per process.
- Filter those hosts that have at least one failed authentication and at least one failed web request.
- Obtain the percentage of successful web requests per host.
- Obtain the percentage of successful authentications per host.
Notes
- Use a case class AuthEvent to represent each line.
- Step 4 requires a join operation.
- Consider computing steps 5 and 6 using common transformations.
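A sketch of these steps is shown below, assuming the web-events RDD from the first exercise is available as `webEvents`. What counts as a "failed" request or authentication is not specified in the data description, so the failure criteria here (HTTP code >= 400, message containing "fail") are assumptions to adapt to the actual log contents.

```scala
// Case class matching the columns listed above:
// timestamp, sourceHost, Process, Message
case class AuthEvent(timestamp: String, sourceHost: String,
                     process: String, message: String)

// Parse each ';'-separated line into an AuthEvent
val authEvents = sc.textFile("data/auth_events.log").map { line =>
  val f = line.split(";")
  AuthEvent(f(0), f(1), f(2), f(3))
}
authEvents.cache()

// Number of events per host and per process
val authPerHost    = authEvents.map(a => (a.sourceHost, 1)).reduceByKey(_ + _)
val authPerProcess = authEvents.map(a => (a.process, 1)).reduceByKey(_ + _)

// Hosts with at least one failed auth AND one failed web request (step 4: join).
// Assumed failure criteria: HTTP code >= 400 / message containing "fail".
val failedWebHosts = webEvents
  .filter(_.httpCode >= 400).map(e => (e.sourceHost, ())).distinct()
val failedAuthHosts = authEvents
  .filter(_.message.toLowerCase.contains("fail"))
  .map(a => (a.sourceHost, ())).distinct()
val hostsWithBothFailures = failedWebHosts.join(failedAuthHosts).keys

// Percentage of successful web requests per host (steps 5 and 6 share this
// pattern: count successes and totals in one pass, then divide)
val webSuccessPct = webEvents
  .map(e => (e.sourceHost, (if (e.httpCode < 400) 1 else 0, 1)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (ok, total) => 100.0 * ok / total }

// Percentage of successful authentications per host, same transformation
val authSuccessPct = authEvents
  .map(a => (a.sourceHost, (if (a.message.toLowerCase.contains("fail")) 0 else 1, 1)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (ok, total) => 100.0 * ok / total }
```

Computing (successes, total) pairs with a single reduceByKey avoids a second pass over the data and a second shuffle.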
- Load the file data/web_events.csv using the Spark-CSV library provided by databricks.
- Solve the scenarios presented in the exercises above using SparkSQL when possible.
Notes
- The file contains a header.
- Columns are separated by ;
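A minimal sketch using the databricks spark-csv data source (pre-Spark-2.0 API, where it was a separate package registered as com.databricks.spark.csv). The column names in the SQL query are assumed to match the header of data/web_events.csv.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load the CSV, honoring the header row and the ';' separator
val webDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // the file contains a header
  .option("delimiter", ";")      // columns are separated by ';'
  .option("inferSchema", "true") // detect numeric columns such as HTTPCode
  .load("data/web_events.csv")

// Expose the DataFrame to SparkSQL
webDf.registerTempTable("web_events")

// Example: number of events per host, expressed in SQL
// (column name 'sourceHost' assumed from the header)
val eventsPerHost = sqlContext.sql(
  "SELECT sourceHost, COUNT(*) AS events FROM web_events GROUP BY sourceHost")
eventsPerHost.show()
```

On Spark 2.x the same load can be written with SparkSession and .format("csv"), since the spark-csv functionality was merged into Spark itself.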