This project forked from deanwampler/justenoughscalaforspark

A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.

License: Apache License 2.0

Just Enough Scala for Spark

Strata NYC, September 27, 2016
Dean Wampler, Ph.D.
Lightbend, Inc.

This tutorial covers the most important features and idioms of Scala you need to use Apache Spark's Scala APIs. Because Spark is written in Scala, it is driving interest in Scala, especially among data engineers. Data scientists sometimes use Scala, but most use Python or R.

Prerequisites

I'll assume you have prior programming experience, in any language. Some familiarity with Java is assumed, but if you don't know Java, you should be able to search for explanations for anything unfamiliar.

This isn't an introduction to Spark itself. Some prior exposure to Spark is helpful, but I'll briefly explain most Spark concepts we'll encounter, too.

Throughout, you'll find links to more information on important topics.

Download the Tutorial

Begin by cloning or downloading the tutorial's GitHub project, github.com/deanwampler/JustEnoughScalaForSpark.
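
If you have Git installed, cloning can be sketched roughly as follows. The repository URL is the project address above with the usual https:// prefix and .git suffix added. The clone_tutorial function name and the default target directory are illustrative choices, not part of the tutorial.

```shell
#!/bin/sh
# Sketch: clone the tutorial repository (assumes git is installed and on the PATH).
REPO_URL="https://github.com/deanwampler/JustEnoughScalaForSpark.git"

# clone_tutorial is a hypothetical helper. The optional first argument
# names the target directory (default: JustEnoughScalaForSpark).
clone_tutorial() {
  target_dir="${1:-JustEnoughScalaForSpark}"
  git clone "$REPO_URL" "$target_dir"
}
```

Calling clone_tutorial with no arguments creates a JustEnoughScalaForSpark directory under the current one. The notebook file used later is then at notebooks/JustEnoughScalaForSpark.snb inside it.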

Using Spark Notebook

This tutorial uses a notebook format, which is popular with data scientists but also useful for data engineers. While most of the popular notebooks, like IPython/Jupyter, Zeppelin, and Databricks, support Scala, we'll use a Scala-centric notebook environment called Spark Notebook (http://spark-notebook.io).

You will need to install and run the Spark Notebook runtime. You can do this either by downloading it and running it "natively" on your computer, or by running it in Docker. (Using Docker may work better on Windows.)

Java 7 or 8

You'll need a Java 7 or Java 8 (preferred) JRE (Java Runtime Environment) installed. Go here for instructions, if necessary.

A separate Scala installation is not required.

Downloading Spark Notebook

If you want to run it "natively" (i.e., not use Docker), visit one of the following download pages and click the Download here link:

  • Zip file (for all platforms).
  • Tgz file (best for Mac OS X or Linux).

We're using notebook version 0.6.3 built for Spark 1.6.2, Hadoop 2.7.2, and Scala 2.11, with Hive and Parquet extensions. (See spark-notebook.io for other configurations.) We aren't using Spark 2.0.0, because support for it is still experimental, but the actual Spark version is less important for our purposes, since we're here to learn Scala.

Expand the archive somewhere convenient.

Start Spark Notebook as follows. Open a command window and change the working directory to the root directory where you expanded the Spark Notebook archive. Run the following command:

bin/spark-notebook

You'll see some log messages, and then it will wait...

If it fails to start with an error, make sure Java is installed and on your path. (Run java -version in the same command window.) If that's not the issue, try moving the Spark Notebook folder to a directory whose full path contains no whitespace (e.g., C:\Foo Bar\Baz has whitespace between Foo and Bar).
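
The two checks above can be scripted roughly as follows. This is a minimal sketch for a POSIX shell (Mac OS X, Linux, or a Unix-like shell on Windows). The function names has_whitespace and preflight are illustrative and not part of Spark Notebook.

```shell
#!/bin/sh
# has_whitespace succeeds (exit status 0) if its argument contains
# a space, tab, or other whitespace character.
has_whitespace() {
  case "$1" in
    *[[:space:]]*) return 0 ;;
    *) return 1 ;;
  esac
}

# preflight checks the directory given as its first argument
# (default: the current directory) and reports any problem it finds.
preflight() {
  dir="${1:-$PWD}"
  rc=0
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH" >&2
    rc=1
  fi
  if has_whitespace "$dir"; then
    echo "path contains whitespace: $dir" >&2
    rc=1
  fi
  return "$rc"
}
```

Run preflight from the directory where you expanded the archive, before running bin/spark-notebook. A non-zero exit status means at least one of the two conditions needs fixing.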

Now jump to Running the Tutorial.

Docker

If you wish to use Docker instead, first go to this docker.com page and follow the instructions to install Docker on your machine.

Once Docker is installed and running, open a command window and run these two commands to download and run the same Spark Notebook build as a Docker image.

docker pull andypetrella/spark-notebook:0.6.3-scala-2.11.7-spark-1.6.2-hadoop-2.7.2-with-hive-with-parquet
docker run -p 9000:9000 andypetrella/spark-notebook:0.6.3-scala-2.11.7-spark-1.6.2-hadoop-2.7.2-with-hive-with-parquet

Running the Tutorial

However you started Spark Notebook, open your browser to localhost:9000. The UI has a "SPARK NOTEBOOK" banner and shows several directories and notebooks for sample applications that come with Spark Notebook.

Now we need to load the tutorial in Spark Notebook.

Under the banner and under the tabs ("Files", "Running", ...), the first line of text says "To import a notebook, drag the file onto the listing below or click here."

The "click here" text is a link. Click it, then navigate to where you downloaded the tutorial GitHub repository. Find and select notebooks/JustEnoughScalaForSpark.snb.

A new line in the UI is added with "JustEnoughScalaForSpark.snb" and an "Upload" button on the right-hand side. Click that button.

Now the line is moved towards the bottom of the page and the buttons to the right are different. Click the JustEnoughScalaForSpark link and the tutorial notebook will open in another browser tab. (It might take a minute to load completely.)

What's Next?

Congratulations! You are now ready to go through the tutorial.

Please post any feedback, bugs, or even pull requests to the project's GitHub page. Thanks.

Dean Wampler, September 2016
