The jar has two main entrypoints: AppMain and BenchmarkMain. You can run the application on a Spark cluster using the spark-submit script on the master node. For more info refer to https://spark.apache.org/docs/latest/submitting-applications.html . AppMain executes the queries and stores the results in a file. Use it as follows:
./bin/spark-submit --class AppMain --master <master-url> \
--deploy-mode <deploy-mode> hdfs://master:54310/app.jar \
<csv file> <parquet file> <avro file> <deploymode> <cacheOrNot>
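For example, an AppMain run with the datasets already loaded on HDFS might look like the following sketch (the master URL, deploy mode and file paths are placeholders; the last two argument values are explained below):
./bin/spark-submit --class AppMain --master spark://master:7077 \
--deploy-mode client hdfs://master:54310/app.jar \
hdfs://master:54310/data.csv hdfs://master:54310/data.parquet \
hdfs://master:54310/data.avro cluster cache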
The argument 'deploymode' can be "local" or "cluster"; 'cacheOrNot' can be "cache" or "no_cache". BenchmarkMain runs the queries and computes their execution times, saving them to a file:
./bin/spark-submit --class BenchmarkMain --master <master-url> \
--deploy-mode <deploy-mode> hdfs://master:54310/app.jar \
<output> <csv file> <parquet file> <avro file> <deploymode> <cacheOrNot> <#run>
The 'output' parameter takes the path of the file where the execution times of the queries, performed on the different input file formats, are written. '#run' specifies the number of runs of each query over which the average execution time is computed.
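For example, a sketch of a BenchmarkMain run averaging over 5 runs per query (master URL and file paths are placeholders):
./bin/spark-submit --class BenchmarkMain --master spark://master:7077 \
--deploy-mode client hdfs://master:54310/app.jar \
hdfs://master:54310/results hdfs://master:54310/data.csv \
hdfs://master:54310/data.parquet hdfs://master:54310/data.avro \
cluster no_cache 5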
To deploy the layers of the application locally, you can use Docker. Several scripts are available in the docker/run directory.
The Apache NiFi framework is used to transform the dataset from CSV to the Avro and Parquet formats. The dataset is filtered by removing tuples whose value field equals zero, and it is then loaded into HDFS.
To create the NiFi container using Docker, you can run the nifi-start.sh script. It defines the environment variable HDFS_DEFAULT_FS, which holds the HDFS address. The default is hdfs://master:54310, but you can change it before running the script.
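As a sketch, pointing NiFi at a different namenode could look like this (this assumes nifi-start.sh picks up an HDFS_DEFAULT_FS value already present in the environment; otherwise, edit the default directly inside the script):
# placeholder namenode address; adjust it to your HDFS deployment
export HDFS_DEFAULT_FS=hdfs://my-namenode:54310
./docker/run/nifi-start.sh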
The image already contains the template, which is instantiated automatically on NiFi. To start it, access the NiFi UI at http://localhost:9999/nifi/. After the container startup, the UI can take several minutes to become available. To start the dataflow you must first enable all the controller services, as described at https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Enabling_Disabling_Controller_Services . For more info please refer to https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#starting-a-component
To build and run the Alluxio Docker image, please refer to: Build and run alluxio docker image
- kubectl - Kubernetes CLI
- gcloud - Google Cloud CLI
- sbt - SBT CLI
First of all, choose a region (e.g. europe-west4) and a zone (e.g. europe-west4-a) on Google Cloud on which all components will be deployed. Create a new Virtual Private Cloud (VPC) network or use the default one. If a new network is created, make sure the firewall rules are set accordingly. For development simplicity, all ports can be opened for ingress traffic.
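If you prefer the CLI, a sketch of creating a dedicated VPC and a permissive ingress rule for development could be (the network and rule names are placeholders; in production, open only the ports you need):
# create an auto-mode VPC and allow all ingress traffic for development
gcloud compute networks create benchmark-net --subnet-mode=auto
gcloud compute firewall-rules create benchmark-allow-all \
--network=benchmark-net --direction=INGRESS \
--allow=tcp,udp,icmp --source-ranges=0.0.0.0/0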
To run the application, you need a cluster with the Spark and Hadoop frameworks. To create one, use the DataProc service. In the cluster section, choose to create a new one. Set the region, zone and VPC network according to the *Region, Zone and VPC* section, set the cluster mode to Standard (1 master, N workers), and set the machine type and storage configuration (the defaults should do fine). Once the cluster deployment has finished, find the master's external IP (smIP). The cluster contains by default HDFS, Hadoop MapReduce, Spark, Hive and other Big Data frameworks. Some available default web interfaces:
- http://smIP:9870 - HDFS namenode
- http://smIP:8088 - YARN resource manager
The HDFS Web UI also shows the master hostname and port; by default it is cluster-name-m:8020.
SSH into the master (VM Instances subsection in the cluster info) and run the following:
hdfs dfs -mkdir /alluxio
to create a directory that Alluxio will use as its root directory.
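You can verify that the directory exists with:
hdfs dfs -ls /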
Note: the master IP is ephemeral by default. If needed, a static IP can be assigned.
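As an alternative to the console, the DataProc cluster can also be created from the gcloud CLI; a sketch (cluster name, machine types and worker count are placeholders):
gcloud dataproc clusters create benchmark-cluster \
--region europe-west4 --zone europe-west4-a \
--network benchmark-net \
--master-machine-type n1-standard-4 \
--worker-machine-type n1-standard-4 --num-workers 2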
Using the Kubernetes Engine, create a new cluster with the same region and network configuration as the DataProc one (otherwise the components won't be able to reach each other). Connect a local terminal to the cluster (using the gcloud CLI):
gcloud container clusters get-credentials <cluster-name> \
--zone <selected-zone>
This way the cluster's credentials will be available to kubectl. To verify that everything went well, run
kubectl get nodes
where all the cluster's nodes should be displayed.
In the project file kubernetes/alluxio/conf/alluxio.properties, set ALLUXIO_UNDERFS_ADDRESS to hdfs://smIP:8020/alluxio so that Alluxio will use HDFS as its underlying storage (a sketch of the relevant lines is shown after the next paragraph). You can use other underlying storage, such as Google Cloud Storage, by specifying gs://bucket-name/directory. After all settings are configured, run the script
kubernetes/alluxio/load_config.sh
to load a ConfigMap in Kubernetes where Alluxio's configuration is stored and later retrieved.
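As a rough sketch, assuming the configuration file uses simple KEY=VALUE entries, the relevant lines could be:
# under storage on the DataProc HDFS (smIP is the master's IP)
ALLUXIO_UNDERFS_ADDRESS=hdfs://smIP:8020/alluxio
# or, alternatively, a Google Cloud Storage bucket:
# ALLUXIO_UNDERFS_ADDRESS=gs://bucket-name/directory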
Afterwards, deploy the Alluxio master and worker with
kubernetes/alluxio/deploy_alluxio.sh
Mongo is used for storing query results. To deploy it, run
kubernetes/mongodb/deploy_mongo.sh
Go to the kubernetes/nifi directory and run
kubectl apply -f nifi_service.yaml
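At this point you can check that the Alluxio, MongoDB and NiFi pods and services are up (the exact names depend on the manifests in the repository):
kubectl get pods
kubectl get services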
Spark needs an application jar to distribute the code to all the nodes in the cluster. To build one, go to the root of the project (where the build.sbt file is located) and run
sbt assembly
which will output a "fat" jar in the target/scala-2.11 directory. Be sure not to include the Spark libraries, since they would conflict with the ones already present on the cluster. Then load the jar to a Google Cloud Storage bucket, from where Dataproc will retrieve it to run the Spark job.
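A common way to keep Spark out of the fat jar with sbt-assembly is to mark the Spark dependencies as "provided"; a sketch for build.sbt (the version shown is a placeholder, use the one the project actually targets for Scala 2.11):
// Spark is provided by the DataProc cluster at runtime, so it must not end up in the assembly
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided"
)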
First create a directory on HDFS for Alluxio, /alluxio (the datasets need to be put there for Alluxio to see them). Load the datasets into HDFS by any means you want:
- SSH into the namenode, download the files and manually put them on the file system (see the sketch after this list)
- use NiFi with the template in docker/build/NiFi/templace2.0.xml (note that NiFi also needs HDFS's configuration files)
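For example, from the namenode (the file names are placeholders for your datasets):
hdfs dfs -put data.csv data.parquet data.avro /alluxio/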
From the Jobs section in DataProc, create a new Spark job on the cluster by specifying the jar's URL on GCS, the main class (e.g. BenchmarkMain) and its arguments, e.g.
//args
alluxio://<all-ip>:<all-port>/results // output path
alluxio://<all-ip>:<all-port>/data.csv // csv dataset
alluxio://<all-ip>:<all-port>/data.parquet // parquet dataset
alluxio://<all-ip>:<all-port>/data.avro // avro dataset
cluster // deploy type
no_cache // cache or not initial rdds
5 // # runs for each query to compute mean time
If you want to use HDFS directly, use
hdfs://smIP:8020/<file>
in the previous example.
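The same job can also be submitted from the gcloud CLI; a sketch (bucket, cluster name, region and Alluxio address are placeholders):
gcloud dataproc jobs submit spark \
--cluster <cluster-name> --region <selected-region> \
--class BenchmarkMain --jars gs://<bucket>/app.jar \
-- alluxio://<all-ip>:<all-port>/results \
alluxio://<all-ip>:<all-port>/data.csv \
alluxio://<all-ip>:<all-port>/data.parquet \
alluxio://<all-ip>:<all-port>/data.avro \
cluster no_cache 5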