This project analyzes crime data in Los Angeles using Apache Spark. The queries are implemented with the DataFrame, RDD, and SQL APIs.
You can check out this setup guide provided by our professors:
Before running the scripts, ensure that you have the following prerequisites:
Cluster Setup:
- Install and configure at least two machines running Ubuntu 22.04 to form the cluster.
- Set up Apache Spark, Java, the Hadoop Distributed File System (HDFS), and YARN on each node.
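The exact Spark configuration depends on your machines; as a minimal sketch, a `spark-defaults.conf` for running Spark on YARN might contain the following (all values are placeholders to adjust for your setup):

```properties
# Run Spark applications on the YARN resource manager.
spark.master              yarn
# Per-executor memory; tune to the RAM available on each node.
spark.executor.memory     2g
# Keep event logs so finished applications stay visible in the history server.
# The HDFS directory must exist before the first job runs.
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs:///spark-logs
```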
Access to UIs:
- Ensure proper configuration for accessing the Spark, HDFS, and YARN web UIs (by default, the HDFS NameNode UI listens on port 9870, the YARN ResourceManager UI on 8088, and a running Spark application's UI on 4040).
Download Datasets:
- Download the basic crime datasets:
- Download the 2015 income dataset and the reverse-geocoding dataset:
Store Datasets in HDFS:
- Upload the downloaded datasets to your cluster's HDFS (for example with hdfs dfs -put).
Follow these steps to run the scripts (for example, Query 1 with the DataFrame API):
- Clone this repository:
git clone https://github.com/ntua-el19613/CrimeInLA
cd CrimeInLA
- Navigate to the specific query folder:
cd query1
- Submit the Spark job for Query 1 with DataFrame API:
spark-submit Q1DF.py
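The repository's Q1DF.py is not reproduced here; as an illustrative sketch of what a DataFrame-API query over the crime data could look like, the snippet below counts crimes per year. The column name `DATE OCC` and its `MM/DD/YYYY hh:mm:ss AM/PM` format are assumptions based on the public LA crime dataset schema, and `crimes_per_year` is a hypothetical helper, not the script's actual code:

```python
def occ_year(date_occ: str) -> int:
    """Extract the year from a 'MM/DD/YYYY hh:mm:ss AM/PM' string,
    the assumed format of the dataset's DATE OCC column."""
    return int(date_occ.split(" ")[0].split("/")[2])


def crimes_per_year(spark, path):
    """Count crimes per year with the DataFrame API.

    `spark` is an active SparkSession and `path` an HDFS path to one of
    the crime CSVs; this is a sketch, not the repository's Q1DF.py.
    """
    # Imported lazily so occ_year stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    df = spark.read.csv(path, header=True)
    year = F.udf(occ_year, IntegerType())
    return (df.withColumn("year", year(F.col("DATE OCC")))
              .groupBy("year")
              .count()
              .orderBy("year"))
```

If `spark.master` is not already set in spark-defaults.conf, the job can be pointed at the cluster explicitly with spark-submit --master yarn Q1DF.py.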
This project was a team effort by the following contributors:
- Giannouchou Olga (03119613)
- Bellos Ioannis (03119067)