
PySpark easy start

This repo shows how to easily get started with a local Spark cluster (one master, one worker) and run PySpark jobs on it, provided you have Docker.

How to use

docker-compose up -d
docker-compose exec work-env python sql.py
or
docker-compose exec work-env spark-submit sql.py

Sample output:

Creating network "mg-spark_default" with the default driver
Creating mg-spark_spark_1          ... done
Creating mg-spark_spark-worker-1_1 ... done
Creating mg-spark_work-env_1       ... done
21/03/05 15:56:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/05 15:56:25 WARN SparkContext: Please ensure that the number of slots available on your executors is limited by the number of cores to task cpus and not another custom resource. If cores is not the limiting resource then dynamic allocation will not work properly!
[Row(col0=0, col1=1, col2=2), Row(col0=3, col1=1, col2=5), Row(col0=6, col1=2, col2=8)]
+----+---------+---------+---------+
|col1|sum(col0)|sum(col1)|sum(col2)|
+----+---------+---------+---------+
|   1|        3|        2|        7|
|   2|        6|        2|        8|
+----+---------+---------+---------+
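For reference, a job along these lines would produce the output above. This is a minimal sketch, not necessarily the repository's actual sql.py; the master URL spark://spark:7077 assumes the master service is named spark and listens on the default standalone port:

    from pyspark.sql import SparkSession

    # Connect to the standalone cluster started by docker-compose.
    # "spark" is assumed to be the master's service name; 7077 is the
    # default standalone master port.
    spark = (
        SparkSession.builder
        .master("spark://spark:7077")
        .appName("sql-example")
        .getOrCreate()
    )

    # Three rows of three integer columns, matching the sample output.
    df = spark.createDataFrame(
        [(0, 1, 2), (3, 1, 5), (6, 2, 8)],
        ["col0", "col1", "col2"],
    )

    print(df.collect())
    df.groupBy("col1").sum().show()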

There are also run_sql.sh, run_file.sh, and run_s3.sh scripts that work on macOS and Linux.

Feel free to edit any of the provided .py files or create new ones. However, make sure any new files are inside the same directory.

Requirements:

You do not need to have:

  • Python. It is provided in the bitnami/spark:3-debian-10 image.
  • pyspark. It is already installed inside the bitnami/spark:3-debian-10 image.

Note about local file access

read_file.py reads a file from the local filesystem. However, it is the worker that actually reads the file. Because of that, there is a volumes section defined for each Docker service:

    volumes:
      - .:/app

so that the "local" file path resolves the same way on every node; a sketch of such a read follows below.
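As an illustration (a sketch only; the file name /app/data.csv is hypothetical and the actual read_file.py may differ), a local read through the shared mount looks like this:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://spark:7077")  # assumed master URL, as above
        .appName("read-file-example")
        .getOrCreate()
    )

    # /app is the bind mount from docker-compose, so this path exists
    # identically on the driver and on the worker that performs the read.
    df = spark.read.csv("/app/data.csv", header=True)
    df.show()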

Note about s3 / google cloud storage

Please update run_s3.py with S3 credentials (and an endpoint, if running your own S3 service) to run the S3 example. Credentials are provided inside the Python code, which is not optimal; please do not do that for files going into any code repository.

To provide credentials and other info like the S3 endpoint properly, one idea, sketched below, is to use environment variables or an env file. Please remember to add .env and related files to .gitignore.
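For example (a sketch with illustrative environment variable names, not the repository's actual code), credentials could be read from the environment and passed to the S3A connector via standard Hadoop configuration keys:

    import os
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder
        .master("spark://spark:7077")  # assumed master URL, as above
        .appName("s3-example")
        # S3_ACCESS_KEY / S3_SECRET_KEY are hypothetical names; keep their
        # values in an .env file that is listed in .gitignore.
        .config("spark.hadoop.fs.s3a.access.key", os.environ["S3_ACCESS_KEY"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["S3_SECRET_KEY"])
    )

    # The endpoint is only needed when running your own S3-compatible service.
    if "S3_ENDPOINT" in os.environ:
        builder = builder.config(
            "spark.hadoop.fs.s3a.endpoint", os.environ["S3_ENDPOINT"]
        )

    spark = builder.getOrCreate()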

Credits:

  • Bitnami for their easy-to-use image
  • @dani8art for an excellent explanation of how to connect to a cluster from PySpark
