I was inspired by this repository and continue to build on it.
However, I also hit the issue described here: #1
I was getting:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I banged my head against this for 2 nights, not being an expert in K8S.
At first I thought something was wrong with how I started it up.
Either way, this is how I reproduced the problem:
1)
I checked my resources, and I made the following config:
spark-defaults.conf
spark.master spark://sparkmaster:7077
spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark
2)
And I ran minikube with:
minikube start --memory 8192 --cpus 4 --vm=true
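To double-check that the node actually got those resources, a quick sanity check can be run (minikube is the default node name for a single-node cluster):
kubectl describe node minikube | grep -A 5 Allocatable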
3)
These were my spark-master and spark-worker scripts:
spark-worker.sh
#!/bin/bash
. /common.sh
getent hosts sparkmaster
if ! getent hosts sparkmaster; then
sleep 5
exit 0
fi
/usr/local/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://sparkmaster:7077 --webui-port 8081 --memory 2g
### Note: I put 2g here just to be 100% confident I was not using too many resources.
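To confirm the worker actually registered with the master, its log can be grepped; a rough example, assuming <spark-worker-pod> is the worker pod name:
kubectl logs <spark-worker-pod> | grep -i "registered with master"
# Expected output is something along the lines of:
# INFO Worker: Successfully registered with master spark://sparkmaster:7077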
spark-master.sh
#!/bin/bash
. /common.sh
echo "$(hostname -i) sparkmaster" >> /etc/hosts
/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080
4)
I then ran:
kubectl exec <master-pod-name> -it -- pyspark
>>>
sc.parallelize([1,2,3,4]).collect()
>>>
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
And the error occurred!
I made sure to get access to 8081 and 4040 to investigate the logs further:
kubectl port-forward <spark-worker-pod> 8081:8081
kubectl port-forward <spark-master-pod> 4040:4040
I then went in and:
http://localhost:8081/ --> Find my executor --> stderr (`http://localhost:8081/logPage/?appId=<APP-ID>&executorId=<EXECUTOR-ID>&logType=stderr`)
5)
I scratched my head; I knew I had enough resources, so why did this not work?
And I could see:
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: sparkmaster/10.101.97.213:41607
Caused by: java.net.ConnectException: Connection timed out
I then thought, well, I did this right:
spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster
The docs mention that it can be either a HOST or an IP, so I thought I was good. I saw the possible solution of:
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
Well, this was not a problem for me; I actually had no iptables alternatives to switch at all.
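A diagnostic that helps here is checking what sparkmaster resolves to from inside a worker pod versus the master pod's own IP (the pod names below are placeholders):
kubectl exec <spark-worker-pod> -- getent hosts sparkmaster
kubectl exec <spark-master-pod> -- hostname -i
# If the two addresses differ, the executors are dialing the driver on an address (and random driver port)
# they cannot actually reach, which would explain the connection timeout above.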
So I then verified the master pod's IP.
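One way to look it up, assuming <spark-master-pod> is the master pod's name:
kubectl get pod <spark-master-pod> -o wide
# The IP column shows the pod IP used below as <MASTER-POD-IP>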
I then took the MASTER-POD-IP and added it directly:
pyspark --conf spark.driver.bindAddress=<MASTER-POD-IP> --conf spark.driver.host=<MASTER-POD-IP>
>>> ....
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.0.0
/_/
Using Python version 3.7.9 (default, Sep 10 2020 17:42:58)
SparkSession available as 'spark'.
>>> sc.parallelize([1,2,3,4,5,6]).collect()
[1, 2, 3, 4, 5, 6] <---- BOOOOOOM!!!!!!!!!!!!!!!
6)
SOLUTION:
spark-defaults.conf
spark.master spark://sparkmaster:7077
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark
And add the IPs correctly. Note that spark.driver.host and spark.driver.bindAddress are no longer hard-coded in spark-defaults.conf; instead, the master start-up script appends the pod's own IP at runtime:
spark-master.sh
#!/bin/bash
. /common.sh
echo "$(hostname -i) sparkmaster" >> /etc/hosts
# We must set the driver address to the master pod's own IP for the executors, otherwise we will
# get the following error inside the worker when it tries to connect back to the master:
#
# 20/09/12 15:56:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
# cluster UI to ensure that workers are registered and have sufficient resources
#
# When investigating the worker we can see:
# Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: s
# parkmaster/10.101.97.213:34881
# Caused by: java.net.ConnectException: Connection timed out
#
# This means that when spark-class ran, the worker could connect to the master at the init stage,
# but when a job was actually submitted (spark-submit/pyspark), the executors failed to reach the driver.
echo "spark.driver.host $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf
echo "spark.driver.bindAddress $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf
/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080
In this case my SPARK_HOME is /usr/local/spark
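With this change in place, the original test from step 4 should go through without passing any extra --conf flags to pyspark:
kubectl exec <master-pod-name> -it -- pyspark
>>> sc.parallelize([1,2,3,4]).collect()
[1, 2, 3, 4]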
My Dockerfile
FROM python:3.7-slim-stretch
# PATH
ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Spark
ENV SPARK_VERSION 3.0.0
ENV SPARK_HOME /usr/local/spark
ENV SPARK_LOG_DIR /var/log/spark
ENV SPARK_PID_DIR /var/run/spark
ENV PYSPARK_PYTHON /usr/local/bin/python
ENV PYSPARK_DRIVER_PYTHON /usr/local/bin/python
ENV PYTHONUNBUFFERED 1
ENV HADOOP_COMMON org.apache.hadoop:hadoop-common:2.7.7
ENV HADOOP_AWS org.apache.hadoop:hadoop-aws:2.7.7
ENV SPARK_MASTER_HOST sparkmaster
# Java
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
# Install curl
RUN apt-get update && apt-get install -y curl
# Install procps
RUN apt-get install -y procps
# Install coreutils
RUN apt-get install -y coreutils
# https://github.com/geerlingguy/ansible-role-java/issues/64
RUN apt-get update && mkdir -p /usr/share/man/man1 && apt-get install -y openjdk-8-jdk && \
apt-get install -y ant && apt-get clean && rm -rf /var/lib/apt/lists/ && \
rm -rf /var/cache/oracle-jdk8-installer;
# Download Spark, enables full functionality for spark-submit against docker container
RUN curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
tar -zx -C /usr/local/ && \
ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 ${SPARK_HOME}
# add scripts and update spark default config
ADD tools/docker/spark/common.sh tools/docker/spark/spark-master.sh tools/docker/spark/spark-worker.sh /
ADD tools/docker/spark/example_spark.py /
RUN chmod +x /common.sh /spark-master.sh /spark-worker.sh
ADD tools/docker/spark/spark-defaults.conf ${SPARK_HOME}/conf/spark-defaults.conf
ENV PATH $PATH:${SPARK_HOME}/bin
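A sketch of how an image like this could be built directly against minikube's Docker daemon (the tag and the Dockerfile location are assumptions based on the ADD paths above):
eval $(minikube docker-env)
docker build -t deiteo-spark:3.0.0 -f tools/docker/spark/Dockerfile .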
I am currently building a streaming platform in this repo:
https://github.com/Thelin90/deiteo