
Kapitan Spark - All-in-One Spark Installer for Kubernetes!

Overview

Welcome to our Helm chart installer for Spark. It enables you to easily deploy the Spark ecosystem components on a Kubernetes cluster.

The bundled components enable the following features:

  1. Running Spark Notebooks using Spark and Spark SQL
  2. Creating Spark Jobs using Python
  3. Tracking Spark Jobs using a UI

Components:

The full list of bundled components, together with their defaults, is described under Component Details and Defaults below.

We invite you to try this out and report any issues or feedback via GitHub Issues. Do let us know what adaptations you have made for your setup via GitHub Discussions.

Usage

Quick Start

Suitable for users with basic knowledge of Kubernetes and Helm. The chart can also be installed on MicroK8s.

Requirements:

  • Ingress
  • Storage that supports ReadWriteMany

Installing the Helm Chart

  1. Run the following install command, where spark-bundle is the name you prefer:

    helm install spark-bundle installer --namespace kapitanspark --create-namespace --atomic --timeout=15m
  2. Run the command kubectl get ingress --namespace kapitanspark to find the ingress IP address (referred to below as KUBERNETES_NODE_IP; a jsonpath one-liner is shown after this list). For the default passwords, refer to the Component Details and Defaults section in this document. After that you can access:

    • Jupyter Lab at http://KUBERNETES_NODE_IP/jupyterlab
    • Spark History Server at http://KUBERNETES_NODE_IP/spark-history-server
    • Lighter UI at http://KUBERNETES_NODE_IP/lighter
    • Spark Dashboard at http://KUBERNETES_NODE_IP/grafana
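
If you prefer a one-liner, the ingress IP can also be read from the resource's status field (a sketch: the item index and field path assume a single ingress with a populated load-balancer status):

    kubectl get ingress --namespace kapitanspark \
      -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}'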

Compatibility

Component      Supported Versions
Kubernetes     1.23.0 to 1.29.0
Helm           3

Resource Requirements

Resource   Requirement   Remarks
CPU        8 cores
Memory     12 GB
Disk       80 GB         Adjust this based on the size of your Spark Docker images

Component Details and Defaults

Remarks
  • Hive metastore

    • You may rebuild the image using the Dockerfile hive-metastore/Dockerfile (a rebuild sketch follows this list)
    • After rebuilding, update the image.repository and image.tag keys in values.yaml.
  • Spark Thrift Server

    • You may rebuild the image using the Dockerfile spark_docker_image/Dockerfile
    • After rebuilding, update the image.repository and image.tag keys in values.yaml.
    • Spark UI has been intentionally disabled at spark-thrift-server/templates/service.yaml.
    • Dependency: hive-metastore component.
  • Jupyter Lab

    • Modify jupyterlab/requirements.txt according to your project before installation.
    • Default password: spark ecosystem
  • Lighter

    • You may rebuild the image using the Dockerfile spark_docker_image/Dockerfile
    • After rebuilding, update the image.spark.repository and image.spark.tag keys in values.yaml.
    • If the Spark History Server stores event logs on Persistent Volumes instead of S3a blob storage, ensure Lighter is installed in the same Kubernetes namespace as the spark-history-server component.
    • Dependencies: hive-metastore, spark-dashboard and spark-history-server components. spark-history-server can be turned off in values.yaml.
    • Default user: dataOps, password: 5Wmi95w4
  • Spark History Server

    • By default, a Persistent Volume is used to read event logs. To change this, update the dir key in spark-history-server/values.yaml and the spark.history.eventLog.dir key in lighter/values.yaml.
    • If using a Persistent Volume instead of S3a blob storage, ensure it is installed in the same namespace as the other components.
    • Default user: dataOps, password: 5Wmi95w4
  • Spark Dashboard

    • Default user: grafana, password: 1K7rYwg655Zl
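
For the components above that mention rebuilding an image, the steps follow the usual Docker workflow. Below is a minimal sketch for the Hive metastore image; registry.example.com and the custom tag are placeholders, substitute your own:

    # build from the Dockerfile shipped in this repository
    docker build -t registry.example.com/hive-metastore:custom hive-metastore/
    # push to a registry reachable from your Kubernetes cluster
    docker push registry.example.com/hive-metastore:custom

After pushing, point the image.repository and image.tag keys in values.yaml at the new image.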

Advanced Installation and Customisation

This method is ideal for advanced users who have some expertise in Kubernetes and Helm. This approach enables you to extend existing configurations efficiently for your needs, without modifying the existing source code.

Customisation of the Helm Chart

This Helm chart supports several methods of customisation:

  1. Modifying values.yaml
  2. Providing a new values.yaml file
  3. Using Kustomize
Customising values.yaml

You may customise your installation of the above components by editing the file at installer/values.yaml.

Alternative Values File

Alternatively, you can create a copy of the values file and run the following modified command:

 helm install spark-bundle installer --values new_values.yaml --namespace kapitanspark --create-namespace --atomic --timeout=15m
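
For example, new_values.yaml could point the Hive metastore at a rebuilt image. The sketch below assumes the usual umbrella-chart convention of nesting sub-chart values under the component name; check installer/values.yaml for the actual structure:

    # new_values.yaml -- placeholder values, adjust to your registry
    hive-metastore:
      image:
        repository: registry.example.com/hive-metastore  # placeholder
        tag: custom                                      # placeholder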
Configuration Using Kustomize

This approach lets you customise the installation to your needs without modifying the original source code.

Refer to the section Using Kustomize to modify configuration below.

Installing Components Separately

If you want to install each component separately, you can also navigate to the individual chart folder and run helm install as needed.
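
For instance, to install just the Hive metastore (a sketch, assuming the component's chart folder sits at the repository root as the paths elsewhere in this document suggest):

    helm install hive-metastore ./hive-metastore --namespace kapitanspark --create-namespace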

Creating Multiple Instances

You may create multiple instances of this Helm chart by specifying a different Helm release name, for example for production, staging and testing environments.

You may need to adjust the Spark Thrift Server port number if you are installing two instances on the same cluster (see the sketch after the sample commands below).

    helm install spark-production installer --namespace kapitanspark-prod --create-namespace --atomic --timeout=15m
    helm install spark-testing installer --namespace kapitanspark-test --create-namespace --atomic --timeout=15m
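
If both instances enable the Spark Thrift Server, one of them needs a different port. The key below is hypothetical, shown only to illustrate the override mechanism; look up the real key in spark-thrift-server/values.yaml:

    # spark-thrift-server.service.port is a hypothetical key -- verify it first
    helm install spark-testing installer --namespace kapitanspark-test \
      --set spark-thrift-server.service.port=10001 \
      --create-namespace --atomic --timeout=15m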
Using Kustomize to modify configuration

Requirements:

  • Ingress (Nginx)
  • Storage that supports ReadWriteMany, e.g. NFS or Longhorn NFS
  1. Customize your components by enabling or disabling them in installer/values.yaml

  2. Navigate to the directory kcustomize/example/prod/, and modify google-secret.yaml and values.yaml files.

  3. Modify jupyterlab/requirements.txt according to your project before installation

  4. Execute the install command stated below from the folder kcustomize/example/prod/, replacing spark-bundle with your preferred name. You can add --dry-run=server to catch errors in the Helm templates before installation (a sketch of the post-renderer pattern follows this list):

    cd kcustomize/example/prod/
    helm install spark-bundle ../../../installer --namespace kapitanspark  --post-renderer ./kustomize.sh --values ./values.yaml --create-namespace --atomic --timeout=15m
  5. After successful installation, you should be able to access the Jupyter Lab, Spark History Server, Lighter UI and Dashboard based on your configuration of the Ingress section in values.yaml.
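
For reference, a Helm post-renderer script of this kind usually follows the pattern below. This is only an illustration of the mechanism; the repository ships its own kustomize.sh, which may differ:

    #!/bin/bash
    # Helm pipes the rendered manifests to stdin; save them so the
    # kustomization.yaml in this folder can list them as a resource
    cat <&0 > all.yaml
    # print the patched manifests back to Helm on stdout, then clean up
    kubectl kustomize . && rm all.yaml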

(Optional) Setup of Local Kubernetes Cluster

You may skip the local setup if you already have an existing Kubernetes cluster you would like to use.


At the moment, we have only tested this locally using MicroK8s. Refer to the installation steps in the MicroK8s docs.

If you are using MicroK8s, below are the steps to install Nginx ingress and a PV with RWX support:

    # the requirements stated below are the minimum; feel free to adjust upwards as needed
    microk8s install --cpu 8 --mem 12 --disk 40
    microk8s enable hostpath-storage
    microk8s enable ingress

    # output your kubeconfig using this command
    microk8s config

    # add the config above to ~/.kube/config to access this Kubernetes cluster via kubectl
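
Once the kubeconfig is merged, a quick check confirms that kubectl can reach the new cluster and, after installation, that the bundle's pods are running:

    kubectl get nodes
    kubectl get pods --namespace kapitanspark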

