
cryptopalxyz / spark-tpcds-datagen


This project forked from maropu/spark-tpcds-datagen


All the things about TPC-DS in Apache Spark

License: Apache License 2.0

Shell 28.09% Scala 56.09% Python 15.82%

spark-tpcds-datagen's Introduction


This is a TPCDS data generator for Apache Spark. It is split off from spark-sql-perf and includes a pre-built tpcds-kit for Mac/Linux x86_64 platforms. To check for TPCDS performance regressions, the benchmark results (sf=20) for the current Spark master are tracked daily in a Google Spreadsheet (performance charts).

Note that the current master branch targets Spark 3.1.1 on Scala 2.12.x. If you want to generate TPCDS data with Spark 3.0.x, please use branch-3.0.

How to generate TPCDS data

You can generate TPCDS data in /tmp/spark-tpcds-data:

# Before running the command below, set `SPARK_HOME` to the path of a Spark version matching your branch (v3.1.1 for master, v3.0.x for branch-3.0)
$ ./bin/dsdgen --output-location /tmp/spark-tpcds-data
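After generation finishes, a quick sanity check can confirm that the expected tables were written. The sketch below assumes the generator writes one subdirectory per table under the output location; that layout is an assumption, not something stated here. The `missing_tables` helper is hypothetical and not part of this project.

```shell
# Hypothetical helper: print every expected table name that has no
# directory under the output location given as $1.
# ASSUMPTION: the generator writes one subdirectory per table.
missing_tables() {
  dir="$1"; shift
  for t in "$@"; do
    [ -d "${dir}/${t}" ] || echo "${t}"
  done
}

# Check two of the TPCDS tables; nothing is printed when both exist.
missing_tables /tmp/spark-tpcds-data catalog_sales store_sales
```

Pass any other table names you expect as additional arguments.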

How to run TPCDS queries in Spark

To run TPCDS queries on the master branch of Spark, run the following sequence of commands:

$ git clone https://github.com/apache/spark.git

$ cd spark && ./build/mvn clean package -DskipTests

$ ./bin/spark-submit \
    --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \
    --jars ${SPARK_HOME}/core/target/spark-core_<scala.version>-<spark.version>-tests.jar,${SPARK_HOME}/sql/catalyst/target/spark-catalyst_<scala.version>-<spark.version>-tests.jar \
    ${SPARK_HOME}/sql/core/target/spark-sql_<scala.version>-<spark.version>-tests.jar \
    --data-location /tmp/spark-tpcds-data
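The `<scala.version>` and `<spark.version>` placeholders above must be replaced with the versions your Spark build actually produced. A minimal sketch of composing those jar paths in a shell session follows; the version values are assumed for illustration, and the `tests_jar` helper is hypothetical (it is not part of Spark or this project).

```shell
# ASSUMED values for illustration only; check the file names under
# the target/ directories of your own build.
SCALA_VERSION="2.12"
SPARK_VERSION="3.1.1"
SPARK_HOME="${SPARK_HOME:-$PWD}"

# Hypothetical helper: print the path of a module's "-tests" jar.
#   $1 = module directory (e.g. core, sql/catalyst)
#   $2 = artifact suffix  (e.g. core, catalyst, sql)
tests_jar() {
  echo "${SPARK_HOME}/$1/target/spark-$2_${SCALA_VERSION}-${SPARK_VERSION}-tests.jar"
}

JARS="$(tests_jar core core),$(tests_jar sql/catalyst catalyst)"
APP_JAR="$(tests_jar sql/core sql)"

# Print the pieces to paste into the spark-submit command.
echo "--jars ${JARS}"
echo "${APP_JAR}"
```

Building the paths from two variables keeps the three jar names consistent when you switch Spark or Scala versions.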

Options for the generator

$ ./bin/dsdgen --help
Usage: spark-submit --class <this class> --conf key=value <spark tpcds datagen jar> [Options]
Options:
  --output-location [STR]                Path to an output location
  --scale-factor [NUM]                   Scale factor (default: 1)
  --format [STR]                         Output format (default: parquet)
  --overwrite                            Whether it overwrites existing data (default: false)
  --partition-tables                     Whether it partitions output data (default: false)
  --use-double-for-decimal               Whether it prefers double types instead of decimal types (default: false)
  --use-string-for-char                  Whether it prefers string types instead of char/varchar types (default: false)
  --cluster-by-partition-columns         Whether it cluster output data by partition columns (default: false)
  --filter-out-null-partition-values     Whether it filters out NULL partitions (default: false)
  --table-filter [STR]                   Queries to filter, e.g., catalog_sales,store_sales
  --num-partitions [NUM]                 # of partitions (default: 100)
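The options above can be combined in a single invocation. The sketch below assembles one such command from flags listed in the help output and prints it for review; the scale factor and table list are example values, not recommendations.

```shell
# Example values (assumptions for illustration).
OUTPUT_DIR="/tmp/spark-tpcds-data"
SCALE_FACTOR=10
TABLES="catalog_sales,store_sales"

# Only flags that appear in the help output are used.
DSDGEN_CMD="./bin/dsdgen --output-location ${OUTPUT_DIR} --scale-factor ${SCALE_FACTOR} --format parquet --overwrite --table-filter ${TABLES}"

# Print the command; run it yourself once it looks right.
echo "${DSDGEN_CMD}"
```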

Run specific TPCDS queries only

To run only a subset of the TPCDS queries, type:

$ ./bin/run-tpcds-benchmark --data-location [TPCDS data] --query-filter "q2,q5"
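When the list of queries is longer, the filter string can be built from query numbers. This is a small sketch, assuming every query name follows the `q<N>` pattern seen in the example above; the `query_filter` helper is hypothetical.

```shell
# Hypothetical helper: join query numbers into a comma-separated
# filter string, e.g. "2 5" becomes "q2,q5".
# ASSUMPTION: query names follow the q<N> pattern.
query_filter() {
  out=""
  for n in "$@"; do
    out="${out:+${out},}q${n}"
  done
  echo "${out}"
}

# Print the resulting command for review before running it.
echo ./bin/run-tpcds-benchmark --data-location /tmp/spark-tpcds-data \
  --query-filter "$(query_filter 2 5)"
```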

Other helper scripts for benchmarks

To quickly generate the TPCDS data and run the queries, type:

$ ./bin/report-tpcds-benchmark [TPCDS data] [output file]

This script formats the performance results at the end and appends them to ./reports/tpcds-avg-results.csv. Note that if SPARK_HOME is defined, the script uses that Spark installation; otherwise, it automatically clones the latest master of the repository and uses it. To check performance differences against a pull request, you can pass a pull request ID from the repository as an option and run the queries against it.

$ ./bin/report-tpcds-benchmark [TPCDS data] [output file] [pull request ID (e.g., 12942)]

Bug reports

If you hit bugs or have requests, please leave a comment on Issues or Twitter (@maropu).

spark-tpcds-datagen's People

Contributors

maropu, dongjoon-hyun, xerial
