#PySpark - KMeans
The following example uses KMeans clustering to produce geographic Lat/Long centers based on purchases. The steps below set up a local environment to run the example. Check out the Spark docs for the latest functionality.
Note: this example uses the Spark 1.2.0 package with CDH 5.3.
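To show the core technique before the setup steps, here is a minimal, self-contained sketch of KMeans clustering on a few invented Lat/Long points using the MLlib API available in Spark 1.2. The coordinates and the app name are made up for illustration; they are not taken from the actual data.

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans
    from numpy import array

    sc = SparkContext(appName="kmeans_sketch")

    # A few invented lat/long purchase locations
    points = sc.parallelize([
        array([41.88, -87.63]),   # near Chicago
        array([41.85, -87.68]),
        array([34.05, -118.24]),  # near Los Angeles
        array([34.10, -118.33]),
    ])

    # Cluster the points into two centers
    model = KMeans.train(points, 2, maxIterations=10)

    for center in model.clusterCenters:
        print center  # each center is a [lat, long] array

    sc.stop()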
###Steps
The following directions assume you have a Spark environment up and running. If not, see the Spark VM guide first.
###Setup Spark Environment
- I grabbed a sample from the watson_bisum_deals table to import into my local Hive table. I logged onto i75 and pulled the sample data into my home directory. From there, use rsync or scp to pull it to your local machine.
A. Get the sample data
hive -e 'select * from watson_bisum_deals limit 10000' > deals_sample_data.tsv
B. Download the file. Note: this copies the file to the /home/hadmin directory inside the VM.
scp -3 {user name}@{server name}:{path to sample file} hadmin@hadoop-manager:/home/hadmin
- I set up a local Hive table to mimic the production table 'watson_bisum_deals'. Run the following from inside the VM.
A. Use the create_watson_bisum_deals.sql file to create a local Hive table (run as hadmin).
hive -f /home/hadmin/kmeans_example/create_watson_bisum_deals.sql
B. Import the sample data file into the table using the load_data.sql file.
hive -f /home/hadmin/kmeans_example/load_data.sql
C. Check Hive to make sure everything looks good.
Is the table there?
hive -e 'SHOW TABLES'
Check data:
hive -e 'select * from watson_bisum_deals limit 5'
###Running Spark
- Log into the VM
ssh hadmin@hadoop-manager
- Use the spark-submit utility to run the app (always run as the hadmin user).
spark-submit {path to file}/kmeans.py
####Notes about Running the Spark App
- Currently the app uses two centers to cluster the data points. The purchases are filtered by the 'Beauty/Health' category, so the results find the center Lat/Long based on 'Beauty/Health' purchases.
- Input data is pulled from your local Hive table 'watson_bisum_deals'.
- The results are the Lat/Long coordinates of the two centers around which the data points are clustered.
- The results are temporarily written to HDFS (as a tab-delimited file) at /tmp/{output_dir}/{unix_timestamp}. The {output_dir} is set by the user, while the {unix_timestamp} is generated dynamically to avoid collisions. **Note: the {output_dir} must exist prior to running to avoid errors.**
- The results are finally loaded into Hive. The table name is set by the user via {hive_table}. A rough sketch of this whole flow is shown below.
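The notes above map onto a flow roughly like the following. This is only a sketch, not the actual contents of kmeans.py: the column names (latitude, longitude, category), the output directory 'kmeans_output', and the results table 'kmeans_results' are placeholders.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.mllib.clustering import KMeans
    from numpy import array
    import time

    sc = SparkContext(appName="kmeans_example")
    hive = HiveContext(sc)

    # Pull lat/long pairs for 'Beauty/Health' purchases from the local Hive table.
    # 'latitude', 'longitude', and 'category' are assumed column names --
    # check create_watson_bisum_deals.sql for the real schema.
    rows = hive.sql("SELECT latitude, longitude FROM watson_bisum_deals "
                    "WHERE category = 'Beauty/Health'")
    points = rows.map(lambda row: array([float(row[0]), float(row[1])]))

    # Cluster into two centers, as the app does
    model = KMeans.train(points, 2, maxIterations=10)

    # Write the centers to a timestamped directory under /tmp/{output_dir};
    # 'kmeans_output' stands in for the user-chosen {output_dir}, which must exist.
    out_path = "/tmp/kmeans_output/%d" % int(time.time())
    centers = sc.parallelize(
        ["\t".join(str(c) for c in center) for center in model.clusterCenters])
    centers.saveAsTextFile(out_path)

    # Finally, load the results into the user-chosen Hive table ({hive_table});
    # 'kmeans_results' is a placeholder, and the table must already exist.
    hive.sql("LOAD DATA INPATH '%s' INTO TABLE kmeans_results" % out_path)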
###Using the PySpark REPL
The code contained in the 'kmeans.py' file can also be executed using the pyspark REPL. This utility acts just like a Python REPL and is handy for testing and experimenting.
Start the REPL:
pyspark
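Inside the REPL the SparkContext is already available as `sc`, so pieces of 'kmeans.py' can be pasted in directly. As a quick sanity check with throwaway points (values invented for illustration):

    >>> from pyspark.mllib.clustering import KMeans
    >>> from numpy import array
    >>> points = sc.parallelize([array([0.0, 0.0]), array([0.1, 0.1]),
    ...                          array([9.0, 9.0]), array([9.1, 9.1])])
    >>> model = KMeans.train(points, 2, maxIterations=10)
    >>> model.clusterCenters  # roughly [0.05, 0.05] and [9.05, 9.05]; order may vary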