
rapids-shell

This repo started as a wrapper around the Spark REPLs for easier use with the Spark RAPIDS plugin. Lately I have been putting more effort into maintaining standalone Jupyter notebooks that can be started without the wrapper script; in particular, they are easy to open directly in VS Code with the Jupyter extension.

Original Utility

A utility to start a RAPIDS-enabled Spark shell with access to unit test resources from https://github.com/NVIDIA/spark-rapids. Before running the examples, make sure to at least execute mvn package in your local spark-rapids repo if you are not using binaries.

Command line options

See rapids.sh --help for up-to-date information.

Usage: rapids.sh [OPTION]
Options:
  --debug
    enable bash tracing
  -h, --help
    prints this message
  -l4j=LOG4J_CONF_FILE, --log4j-file=LOG4J_CONF_FILE
    LOG4J_CONF_FILE location of a custom log4j config for local mode
  -nsys, --nsys-profile
    run with Nsight Systems profiling
  -m=MASTER, --master=MASTER
    specify MASTER for the spark command; default is local[*] or local-cluster,
    depending on --num-local-execs
  -n, --dry-run
    generates and prints the spark submit command without executing
  -nle=N, --num-local-execs=N
    specify the number of local executors to use, default is 2. If > 1 use pseudo-distributed
    local-cluster, otherwise local[*]
  -uecp, --use-extra-classpath
    use extraClassPath instead of --jars to add RAPIDS jars to spark-submit (default)
  -uj, --use-jars
    use --jars instead of extraClassPath to add RAPIDS jars to spark-submit
  --ucx-shim=spark<3xy>
    Spark buildver to populate shim-dependent package name of RapidsShuffleManager.
    Will be replaced by a Boolean option
  -cmd=CMD, --spark-command=CMD
    specify one of spark-submit (default), spark-shell, pyspark, jupyter, jupyter-lab
  -dopts=EOPTS, --driver-opts=EOPTS
    pass EOPTS as --driver-java-options
  -eopts=EOPTS, --executor-opts=EOPTS
    pass EOPTS as spark.executor.extraJavaOptions
  --gpu-fraction=GPU_FRACTION
    GPU share per executor JVM unless local or local-cluster mode, see spark.rapids.memory.gpu.allocFraction
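The master-selection rule described for --num-local-execs can be sketched as follows. This is an illustrative Python sketch, not the script's actual bash implementation; the function name and the cores/memory defaults are assumptions (Spark's local-cluster master URL has the form local-cluster[N,cores,memoryMB]):

```python
def spark_master(num_local_execs: int, cores: int = 2, mem_mb: int = 1024) -> str:
    """Pick the Spark master the way rapids.sh is documented to:
    pseudo-distributed local-cluster when more than one executor is
    requested, plain local[*] otherwise. (Hypothetical sketch.)"""
    if num_local_execs > 1:
        return f"local-cluster[{num_local_execs},{cores},{mem_mb}]"
    return "local[*]"

# With the default of 2 executors, a pseudo-distributed cluster is used.
assert spark_master(2) == "local-cluster[2,2,1024]"
assert spark_master(1) == "local[*]"
```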

Environment variables

  • SPARK_RAPIDS_HOME - the path either to the local repo or to the location used for downloading the binaries

  • SPARK_HOME - the path either to the local Spark repo or to the root of the binary distro

  • SPARK_CMD - one of spark-shell, spark-submit (default), pyspark, jupyter, jupyter-lab

Examples

Use Spark RAPIDS in Jupyter notebook

SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 SPARK_CMD=jupyter[-lab] rapids.sh

Run in pseudo-distributed local-cluster mode

NUM_LOCAL_EXECS=2 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh

Allow attaching a java debugger to the driver JVM

JDBSTR=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh

Running Spark RAPIDS ScalaTests once spark-shell is started

Single test suite

scala> run(new com.nvidia.spark.rapids.InsertPartition311Suite)
InsertPartition311Suite:
...

Single test case

scala> run(new com.nvidia.spark.rapids.HashAggregatesSuite, "sum(floats) group by more_floats 2 partitions")
HashAggregatesSuite:
...

Using integration test datagens

In pyspark-based drivers one can use data generators from spark-rapids/integration-tests or run whole pytests.

Add rapids.py as an IPython startup file, e.g. on *NIX:

cp src/python/rapids.py ~/.ipython/profile_default/startup/

Datagen

key_data_gen = StructGen([
    ('a', IntegerGen(min_val=0, max_val=4)),
    ('b', IntegerGen(min_val=5, max_val=9)),
], nullable=False)
val_data_gen = IntegerGen()
df = two_col_df(spark, key_data_gen, val_data_gen)

...
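The generator classes above come from the spark-rapids integration-tests package and need a live SparkSession. As a rough illustration of what a bounded, nullable integer generator conceptually does, here is a standard-library-only sketch; SimpleIntegerGen is hypothetical and is not the real IntegerGen API:

```python
import random

class SimpleIntegerGen:
    """Toy stand-in for IntegerGen: yields ints in [min_val, max_val],
    with occasional None values unless nullable=False. Hypothetical sketch."""
    def __init__(self, min_val=0, max_val=100, nullable=True,
                 null_prob=0.1, seed=42):
        self.min_val, self.max_val = min_val, max_val
        self.nullable, self.null_prob = nullable, null_prob
        self.rng = random.Random(seed)  # seeded for reproducible data

    def gen(self, n):
        return [None if self.nullable and self.rng.random() < self.null_prob
                else self.rng.randint(self.min_val, self.max_val)
                for _ in range(n)]

# Mirrors IntegerGen(min_val=0, max_val=4) inside a non-nullable column.
rows = SimpleIntegerGen(min_val=0, max_val=4, nullable=False).gen(10)
assert all(v is not None and 0 <= v <= 4 for v in rows)
```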

Pytest

runpytest('test_struct_count_distinct')
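A plausible reading of runpytest (the real helper lives in rapids.py; this is an assumption, not its actual code) is that it forwards the argument to pytest as a keyword filter. The self-contained sketch below writes a throwaway test file so the keyword-filter behavior can be seen without the spark-rapids tree:

```python
import pathlib
import tempfile
import textwrap

import pytest

# Throwaway test file so the example is self-contained.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "test_example.py").write_text(textwrap.dedent("""
    def test_struct_count_distinct():
        assert 1 + 1 == 2

    def test_other():
        assert True
"""))

# runpytest('test_struct_count_distinct') plausibly boils down to a
# keyword filter like this (assumption about the helper's internals):
ret = pytest.main(["-q", "-k", "test_struct_count_distinct", str(tmp)])
assert ret == 0  # exit code 0: the selected test passed
```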

Contributors

gerashegalov, mythrocks
