I was inspired by this repository and continue to build on it.
However, I also hit the issue described here: #1
I was getting:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I banged my head against this for 2 nights, not being an expert in K8S.
At first I thought something was wrong with how I started it up.
Either way, this is how I reproduced the problem:
1)
I checked my resources, and I made the following config:
spark-defaults.conf
spark.master spark://sparkmaster:7077
spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark
2)
And I ran minikube with:
minikube start --memory 8192 --cpus 4 --vm=true
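To double-check that the node actually got those resources, a quick sanity check can be run (minikube is the default node name for a single-node cluster):
kubectl describe node minikube | grep -A 5 Allocatable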
3)
These were my spark-master and spark-worker scripts:
spark-worker.sh
#!/bin/bash
. /common.sh
getent hosts sparkmaster
if ! getent hosts sparkmaster; then
sleep 5
exit 0
fi
/usr/local/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://sparkmaster:7077 --webui-port 8081 --memory 2g
### Note: I put 2g here just to be 100% confident I was not using too many resources.
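To confirm the worker actually registered with the master, its log can be grepped; a rough example, assuming <spark-worker-pod> is the worker pod name:
kubectl logs <spark-worker-pod> | grep -i "registered with master"
# Expected output is something along the lines of:
# INFO Worker: Successfully registered with master spark://sparkmaster:7077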
spark-master.sh
#!/bin/bash
. /common.sh
echo "$(hostname -i) sparkmaster" >> /etc/hosts
/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080
4)
I then ran:
kubectl exec <master-pod-name> -it -- pyspark
>>>
sc.parallelize([1,2,3,4]).collect()
>>>
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
And the error occurred!
I made sure to get access to 8081 and 4040 to investigate the logs further:
kubectl port-forward <spark-worker-pod> 8081:8081
kubectl port-forward <spark-master-pod> 4040:4040
I then went in and:
http://localhost:8081/ --> Find my executor --> stderr (`http://localhost:8081/logPage/?appId=<APP-ID>&executorId=<EXECUTOR-ID>&logType=stderr`)
5)
I scratched my head; I knew I had enough resources, so why did this not work?
And I could see:
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: sparkmaster/10.101.97.213:41607
Caused by: java.net.ConnectException: Connection timed out
I then thought, well, I did this right:
spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster
The docs mention that it can be either a HOST or an IP, so I thought I was good. I saw the possible solution of:
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
Well, this was not a problem for me; I actually had no iptables alternatives to switch at all.
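A diagnostic that helps here is checking what sparkmaster resolves to from inside a worker pod versus the master pod's own IP (the pod names below are placeholders):
kubectl exec <spark-worker-pod> -- getent hosts sparkmaster
kubectl exec <spark-master-pod> -- hostname -i
# If the two addresses differ, the executors are dialing the driver on an address (and random driver port)
# they cannot actually reach, which would explain the connection timeout above.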
So I then verified the master pod's IP.
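One way to look it up, assuming <spark-master-pod> is the master pod's name:
kubectl get pod <spark-master-pod> -o wide
# The IP column shows the pod IP used below as <MASTER-POD-IP>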
I then took the MASTER-POD-IP and added it directly:
pyspark --conf spark.driver.bindAddress=<MASTER-POD-IP> --conf spark.driver.host=<MASTER-POD-IP>
>>> ....
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.0.0
/_/
Using Python version 3.7.9 (default, Sep 10 2020 17:42:58)
SparkSession available as 'spark'.
>>> sc.parallelize([1,2,3,4,5,6]).collect()
[1, 2, 3, 4, 5, 6] <---- BOOOOOOM!!!!!!!!!!!!!!!
6)
SOLUTION:
spark-defaults.conf
spark.master spark://sparkmaster:7077
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark
And add the IPs correctly. Note that spark.driver.host and spark.driver.bindAddress are no longer hard-coded in spark-defaults.conf; instead, the master start-up script appends the pod's own IP at runtime:
spark-master.sh
#!/bin/bash
. /common.sh
echo "$(hostname -i) sparkmaster" >> /etc/hosts
# We must set the driver address to the master pod's own IP for the executors, otherwise we will
# get the following error inside the worker when it tries to connect back to the master:
#
# 20/09/12 15:56:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
# cluster UI to ensure that workers are registered and have sufficient resources
#
# When investigating the worker we can see:
# Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: s
# parkmaster/10.101.97.213:34881
# Caused by: java.net.ConnectException: Connection timed out
#
# This means that when spark-class ran, the worker could connect to the master at the init stage,
# but when a job was actually submitted (spark-submit/pyspark), the executors failed to reach the driver.
echo "spark.driver.host $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf
echo "spark.driver.bindAddress $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf
/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080
In this case my SPARK_HOME is /usr/local/spark
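With this change in place, the original test from step 4 should go through without passing any extra --conf flags to pyspark:
kubectl exec <master-pod-name> -it -- pyspark
>>> sc.parallelize([1,2,3,4]).collect()
[1, 2, 3, 4]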
My Dockerfile
FROM python:3.7-slim-stretch
# PATH
ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Spark
ENV SPARK_VERSION 3.0.0
ENV SPARK_HOME /usr/local/spark
ENV SPARK_LOG_DIR /var/log/spark
ENV SPARK_PID_DIR /var/run/spark
ENV PYSPARK_PYTHON /usr/local/bin/python
ENV PYSPARK_DRIVER_PYTHON /usr/local/bin/python
ENV PYTHONUNBUFFERED 1
ENV HADOOP_COMMON org.apache.hadoop:hadoop-common:2.7.7
ENV HADOOP_AWS org.apache.hadoop:hadoop-aws:2.7.7
ENV SPARK_MASTER_HOST sparkmaster
# Java
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
# Install curl
RUN apt-get update && apt-get install -y curl
# Install procps
RUN apt-get install -y procps
# Install coreutils
RUN apt-get install -y coreutils
# https://github.com/geerlingguy/ansible-role-java/issues/64
RUN apt-get update && mkdir -p /usr/share/man/man1 && apt-get install -y openjdk-8-jdk && \
apt-get install -y ant && apt-get clean && rm -rf /var/lib/apt/lists/ && \
rm -rf /var/cache/oracle-jdk8-installer;
# Download Spark, enables full functionality for spark-submit against docker container
RUN curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
tar -zx -C /usr/local/ && \
ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 ${SPARK_HOME}
# add scripts and update spark default config
ADD tools/docker/spark/common.sh tools/docker/spark/spark-master.sh tools/docker/spark/spark-worker.sh /
ADD tools/docker/spark/example_spark.py /
RUN chmod +x /common.sh /spark-master.sh /spark-worker.sh
ADD tools/docker/spark/spark-defaults.conf ${SPARK_HOME}/conf/spark-defaults.conf
ENV PATH $PATH:${SPARK_HOME}/bin
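A sketch of how an image like this could be built directly against minikube's Docker daemon (the tag and the Dockerfile location are assumptions based on the ADD paths above):
eval $(minikube docker-env)
docker build -t deiteo-spark:3.0.0 -f tools/docker/spark/Dockerfile .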
I am currently building a streaming platform in this repo:
https://github.com/Thelin90/deiteo