#PySpark - KMeans

The following example uses KMeans clustering to produce geographic lat/long centers based on purchases. The steps below set up a local environment to run the example. Check out the Spark docs for the latest functionality.

Note: this example uses the Spark 1.2.0 package with CDH 5.3.

###Steps

The following directions assume you have a Spark environment up and running. If not, start with the Spark VM.

###Setup Spark Environment

  1. Grab a sample from the watson_bisum_deals table to import into your local Hive table. Log onto i75 and pull the sample data into your home directory, then use rsync or scp to copy it to your local machine.

A. Get the sample data

hive -e 'select * from watson_bisum_deals limit 10000' > deals_sample_data.tsv

B. Download the file

Note: copy the file to the /home/hadmin directory inside the VM.

scp -3 {user name}@{server name}:{path to sample file} hadmin@hadoop-manager:/home/hadmin

  2. Set up a local Hive table to mimic the production table 'watson_bisum_deals'. Run the following from inside the VM.

A. Use the create_watson_bisum_deals.sql file to create a local Hive table (run as hadmin).

hive -f /home/hadmin/kmeans_example/create_watson_bisum_deals.sql

B. Import the sample data file into the table. Use the load_data.sql file.

hive -f /home/hadmin/kmeans_example/load_data.sql

C. Check Hive to make sure everything looks good.

Is the table there?

hive -e 'SHOW TABLES'

Check the data:

hive -e 'select * from watson_bisum_deals limit 5' 

###Running Spark

  1. Log into the VM.

ssh hadmin@hadoop-manager

  2. Use the spark-submit utility to run the app (always run as the hadmin user).

spark-submit {path to file}/kmeans.py

####Notes about Running the Spark App

  • Currently the app uses two centers to cluster the data points. The purchases are filtered to the 'Beauty/Health' category, so the results give the center lat/long based on 'Beauty/Health' purchases.
  • Input data is pulled from your local Hive table 'watson_bisum_deals'.
  • The results are the lat/long values of the two centers around which the data points are clustered.
  • The results are written temporarily to HDFS as a tab-delimited file under /tmp/{output_dir}/{unix_timestamp}. The {output_dir} is set by the user, while the {unix_timestamp} is generated dynamically to avoid collisions. Note: the {output_dir} must exist prior to running to avoid errors.
  • The results are finally loaded into Hive; the {hive_table} name is set by the user. (A sketch of this workflow follows below.)
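
For reference, here is a minimal sketch of what this workflow can look like with the Spark 1.2 MLlib API. This is not the contents of kmeans.py; the column names (lat, lng, category) and the output directory name are assumptions for illustration only.

```python
# Minimal sketch of the clustering workflow (Spark 1.2 MLlib API).
# Column names and output_dir are assumptions, not taken from kmeans.py.
import time

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans_example")
hive = HiveContext(sc)

# Pull lat/long pairs for 'Beauty/Health' purchases from the local Hive table.
rows = hive.sql(
    "SELECT lat, lng FROM watson_bisum_deals WHERE category = 'Beauty/Health'"
)
# HiveContext.sql returns a SchemaRDD in Spark 1.2, so map() works directly;
# on newer versions use rows.rdd.map(...) instead.
points = rows.map(lambda r: [float(r[0]), float(r[1])])

# Cluster the purchase locations into two centers.
model = KMeans.train(points, 2, maxIterations=20)

# Write the tab-delimited centers to a timestamped directory under /tmp/{output_dir}.
# Reminder: {output_dir} must already exist in /tmp.
output_dir = "kmeans_results"
path = "/tmp/%s/%d" % (output_dir, int(time.time()))
centers = sc.parallelize(["\t".join(str(v) for v in c) for c in model.clusterCenters])
centers.saveAsTextFile(path)
```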

###Using the PySpark REPL

The code contained in the 'kmeans.py' file can also be executed using the pyspark REPL. This utility acts just like a Python REPL and is good for testing and experimenting.

Start the REPL:

pyspark
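
You can then try pieces of the workflow interactively. For example (sc is predefined inside the shell, and the query assumes the Hive table created above):

```python
# Inside the pyspark shell, `sc` is already defined.
from pyspark.sql import HiveContext

hive = HiveContext(sc)

# Quick sanity check against the local Hive table before running the full job.
hive.sql("SELECT count(*) FROM watson_bisum_deals").collect()
```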
