Giter Site home page Giter Site logo

mkhan037 / remoteshuffleservice Goto Github PK

View Code? Open in Web Editor NEW

This project forked from uber/remoteshuffleservice

0.0 1.0 0.0 693 KB

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

License: Other

Java 85.72% Scala 14.26% Shell 0.02%

remoteshuffleservice's Introduction

Uber Remote Shuffle Service (RSS)

Uber Remote Shuffle Service provides the capability for Apache Spark applications to store shuffle data on remote servers. See more details on Spark community document: [SPARK-25299][DISCUSSION] Improving Spark Shuffle Reliability.

Please contact us ([email protected]) for any question or feedback.

Supported Spark Version

  • The master branch supports Spark 2.4.x. The spark30 branch supports Spark 3.0.x.

How to Build

Make sure JDK 8+ and maven is installed on your machine.

Build RSS Server

  • Run:
mvn clean package -Pserver -DskipTests

This command creates remote-shuffle-service-xxx-server.jar file for RSS server, e.g. target/remote-shuffle-service-0.0.9-server.jar.

Build RSS Client

  • Run:
mvn clean package -Pclient -DskipTests

This command creates remote-shuffle-service-xxx-client.jar file for RSS client, e.g. target/remote-shuffle-service-0.0.9-client.jar.

How to Run

Step 1: Run RSS Server

  • Pick up a server in your environment, e.g. server1. Run RSS server jar file (remote-shuffle-service-xxx-server.jar) as a Java application, for example,
java -Dlog4j.configuration=log4j-rss-prod.properties -cp target/remote-shuffle-service-0.0.9-server.jar com.uber.rss.StreamServer -port 12222 -serviceRegistry standalone -dataCenter dc1

Step 2: Run Spark application with RSS Client

  • Upload client jar file (remote-shuffle-service-xxx-client.jar) to your HDFS, e.g. hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar

  • Add configure to your Spark application like following (you need to adjust the values based on your environment):

spark.jars=hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.shuffle.rss.serviceRegistry.type=standalone
spark.shuffle.rss.serviceRegistry.server=server1:12222
spark.shuffle.rss.dataCenter=dc1
  • Run your Spark application

Run with High Availability

Remote Shuffle Service could use a Apache ZooKeeper cluster and register live service instances in ZooKeeper. Spark applications will look up ZooKeeper to find and use active Remote Shuffle Service instances.

In this configuration, ZooKeeper serves as a Service Registry for Remote Shuffle Service, and we need to add those parameters when starting RSS server and Spark application.

Step 1: Run RSS Server with ZooKeeper as service registry

  • Assume there is a ZooKeeper server zkServer1. Pick up a server in your environment, e.g. server1. Run RSS server jar file (remote-shuffle-service-xxx-server.jar) as a Java application on server1, for example,
java -Dlog4j.configuration=log4j-rss-prod.properties -cp target/remote-shuffle-service-0.0.9-server.jar com.uber.rss.StreamServer -port 12222 -serviceRegistry zookeeper -zooKeeperServers zkServer1:2181 -dataCenter dc1

Step 2: Run Spark application with RSS Client and ZooKeeper service registry

  • Upload client jar file (remote-shuffle-service-xxx-client.jar) to your HDFS, e.g. hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar

  • Add configure to your Spark application like following (you need to adjust the values based on your environment):

spark.jars=hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.shuffle.rss.serviceRegistry.type=zookeeper
spark.shuffle.rss.serviceRegistry.zookeeper.servers=zkServer1:2181
spark.shuffle.rss.dataCenter=dc1
  • Run your Spark application

remoteshuffleservice's People

Contributors

mabansal avatar vectorijk avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.