Apache Beam Example Code

An example Apache Beam project.

Description

This example can be used with conference talks and for self-study. The examples are based on those in Beam's examples directory. They have been modified to use Beam as a dependency in the pom.xml instead of being compiled together with Beam, and the example code has been changed to write its output to local directories.

How to clone and run

  1. Open a terminal window.
  2. Run git clone git@github.com:eljefe6a/beamexample.git
  3. Run cd beamexample/BeamTutorial
  4. Run mvn compile
  5. Create local output directory: mkdir output
  6. Run mvn exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1"
  7. Run cat output/user_score to verify the program ran correctly and the output file was created.
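The check in the last step can also be scripted, which is handy when an exercise is run repeatedly. A minimal sketch, assuming the pipeline writes a single file at output/user_score as described above; check_output is a hypothetical helper name, not part of the project:

```shell
# Hypothetical helper: succeed only if the given file exists and is non-empty.
check_output() {
  if [ -s "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or empty: $1" >&2
    return 1
  fi
}
```

Running check_output output/user_score after step 6 returns a non-zero status if the file was never written or is empty.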

Using a Java IDE

  1. Follow the IDE Setup instructions in the Apache Beam Contribution Guide.

Other Runners

Apache Flink

  1. Follow the first steps from Flink's Quickstart to download Flink.
  2. Create the output directory.
  3. To run on a JVM-local cluster: mvn exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=[local]'
  4. To run on an out-of-process local cluster (note that the steps below should also work on a real cluster if you have one running):
    1. Start a local Flink cluster.
    2. Navigate to the WebUI (typically http://localhost:8081), click JobManager, and note the value of jobmanager.rpc.port. The default is probably 6123.
    3. Run mvn package to generate a JAR file. Note the location of the generated JAR (probably target/Tutorial-0.0.1-SNAPSHOT.jar).
    4. Run mvn -X -e exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=localhost:6123 --filesToStage=target/Tutorial-0.0.1-SNAPSHOT.jar', replacing the defaults for port and JAR file if they differ.
    5. Check in the WebUI to see the job listed.
  5. Run cat output/user_score to verify the pipeline ran correctly and the output file was created.
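Because the port and JAR location may differ from the defaults, the out-of-process command can be wrapped so that both are parameters. This is only a sketch: run_on_flink and the MVN override variable are hypothetical names introduced here, the defaults are the ones quoted above, and the -X -e debug flags are left out for brevity.

```shell
# Hypothetical wrapper around the out-of-process Flink command.
# Usage: run_on_flink [MASTER] [JAR]
# MVN is an assumed override (e.g. MVN=echo for a dry run); it defaults to mvn.
run_on_flink() {
  local master=${1:-localhost:6123}
  local jar=${2:-target/Tutorial-0.0.1-SNAPSHOT.jar}
  ${MVN:-mvn} exec:java \
    -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 \
    "-Dexec.args=--runner=FlinkRunner --flinkMaster=$master --filesToStage=$jar"
}
```

Calling run_on_flink with no arguments uses the defaults; run_on_flink localhost:6200 passes a different jobmanager.rpc.port, and MVN=echo run_on_flink prints the arguments without invoking Maven.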

Apache Spark

  1. Create the output directory.
  2. Allow all users to write to the output directory, because Spark may run as a different user: chmod 1777 output
  3. Change the output file in the example source to a fully-qualified path. For example, change this("output/user_score"); to this("/home/vmuser/output/user_score");
  4. Run mvn package
  5. Run spark-submit --jars ~/.m2/repository/org/apache/beam/beam-runners-spark/0.3.0-incubating-SNAPSHOT/beam-runners-spark-0.3.0-incubating-SNAPSHOT.jar --class org.apache.beam.examples.tutorial.game.solution.Exercise2 --master yarn-client target/Tutorial-0.0.1-SNAPSHOT.jar --runner=SparkRunner
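Steps 1 and 2 can be combined with a lookup of the fully-qualified path needed in step 3. A minimal sketch: prepare_output is a hypothetical helper, the user_score file name is taken from the example above, and the printed path still has to be pasted into the pipeline source by hand.

```shell
# Hypothetical helper: create the output directory, open it to all users
# (sticky bit set, as in step 2), and print the fully-qualified output path.
# Usage: prepare_output [DIR]   (DIR defaults to "output")
prepare_output() {
  local dir=${1:-output}
  mkdir -p "$dir" || return 1
  chmod 1777 "$dir" || return 1
  # cd into the directory in a subshell so pwd reports its absolute path.
  printf '%s/user_score\n' "$(cd "$dir" && pwd)"
}
```

Run from the BeamTutorial directory, prepare_output prints the path to substitute for "output/user_score" in the source before running mvn package.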

Google Cloud Dataflow

  1. Follow the steps in either of the Java quickstarts for Cloud Dataflow to initialize your Google Cloud setup.
  2. Create a bucket on Google Cloud Storage for staging and output.
  3. Run mvn -X exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Dexec.args='--runner=DataflowRunner --project=<YOUR-GOOGLE-CLOUD-PROJECT> --gcpTempLocation=gs://<YOUR-BUCKET-NAME> --outputPrefix=gs://<YOUR-BUCKET-NAME>/output/', after replacing <YOUR-GOOGLE-CLOUD-PROJECT> and <YOUR-BUCKET-NAME> with the appropriate values.
  4. Check the Cloud Dataflow Console to see the job running.
  5. Check the output bucket to see the generated output: https://console.cloud.google.com/storage/browser/<YOUR-BUCKET-NAME>/
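The bucket name appears twice in the command, so substituting the placeholders by hand is easy to get wrong. One way to avoid that is to pass the project and bucket as parameters. This is a sketch only: run_on_dataflow and the MVN override variable are hypothetical names, and the -X debug flag is omitted.

```shell
# Hypothetical wrapper around the Dataflow command.
# Usage: run_on_dataflow PROJECT BUCKET
# MVN is an assumed override (e.g. MVN=echo for a dry run); it defaults to mvn.
run_on_dataflow() {
  local project=$1
  local bucket=$2
  if [ -z "$project" ] || [ -z "$bucket" ]; then
    echo "usage: run_on_dataflow PROJECT BUCKET" >&2
    return 2
  fi
  ${MVN:-mvn} exec:java \
    -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 \
    "-Dexec.args=--runner=DataflowRunner --project=$project --gcpTempLocation=gs://$bucket --outputPrefix=gs://$bucket/output/"
}
```

The function refuses to run when either value is missing, and MVN=echo run_on_dataflow my-project my-bucket prints the resulting arguments for inspection before anything is submitted.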
