Demonstrate the steps to run Jupyter notebooks as a service using Jupyter Enterprise Gateway (JEG) and JupyterHub (JHub), separating the Jupyter server (the backend of JupyterLab) from the (computation) kernels.
- Deploy JEG & JHub on a single-node MicroK8s cluster.
- Simulate users accessing JupyterHub via their browsers. JupyterHub launches a server (JupyterLab backend) pod for each individual user. When a user connects to a kernel, the server acts as a proxy and spawns a kernel pod in a namespace separate from the server pod's. All of these pods are managed by the Kubernetes cluster. When a user shuts down a kernel, the kernel pod is destroyed; when a user shuts down the server, the server pod is destroyed. However, the PVC (bound to a PV backed by an NFS share that stores all users' home directory data) persists even after the server pod is deleted, keeping the user's home directory data beyond the server pod's lifecycle. The next time the user logs in and starts another server, the newly created server pod grabs the existing PVC so that the user can continue to work with his/her data.
- In this demo, we showcase how Jupyter Enterprise Gateway works with the Python kernel and the Spark Python kernel (Spark on Kubernetes).
- Setup an NFS server

Assume that an NFS server (IP: 172.17.0.1) exports a share at the path /home/nfs_share.
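A minimal sketch of exporting such a share on the NFS host (assuming Ubuntu with the nfs-kernel-server package; the client subnet is an example and should match your environment):

```bash
# On the NFS host: install the server and export /home/nfs_share to the cluster node.
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /home/nfs_share
sudo chown nobody:nogroup /home/nfs_share
# Allow the MicroK8s node to mount the share; replace the subnet with your own.
echo "/home/nfs_share 172.17.0.0/16(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -ra
```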
- Setup a single-node MicroK8s cluster on an Ubuntu machine
  - Install MetalLB (load balancer)
  - Install a dynamic NFS provisioner
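A sketch of these setup steps, assuming snap-based MicroK8s and the community nfs-subdir-external-provisioner chart (the MetalLB address range is an example; pick a free range on your network):

```bash
# Install MicroK8s and enable DNS plus MetalLB (example address pool).
sudo snap install microk8s --classic
microk8s enable dns
microk8s enable metallb:10.64.140.43-10.64.140.49
# Install a dynamic NFS provisioner backed by the share created above, exposing
# it as the "nfs-client" storage class used later for kernelspecs-pvc.
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=172.17.0.1 \
  --set nfs.path=/home/nfs_share \
  --set storageClass.name=nfs-client
```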
- Create namespaces

```bash
kubectl create namespace enterprise-gateway
kubectl create namespace jupyterhub
```
- Create PV & PVCs

Use the yaml file `jhub_pvc.yaml` to create:
  - pv `nfs-pv`: a mount of the NFS share at 172.17.0.1:/home/nfs_share/claim
  - pvc `jhub-claim` in namespace `jupyterhub`: bound to pv `nfs-pv`

Use the yaml file `kernelspecs_pvc.yaml` to create pvc `kernelspecs-pvc` in namespace `enterprise-gateway`: a 20MB NFS-backed volume allocated from the `nfs-client` storage class to store kernelspecs.
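The definitions live in `jhub_pvc.yaml` and `kernelspecs_pvc.yaml`; a minimal sketch of what they might contain (access modes, the static PV's capacity, and the reclaim policy are assumptions — the names, namespaces, NFS path, and storage class come from the description above):

```bash
# Sketch only; the repo's jhub_pvc.yaml / kernelspecs_pvc.yaml are authoritative.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 10Gi              # assumption: size of the shared home-directory volume
  accessModes:
    - ReadWriteMany            # assumption: several server pods mount the same share
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 172.17.0.1
    path: /home/nfs_share/claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jhub-claim
  namespace: jupyterhub
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""         # bind to the statically created nfs-pv, not a provisioner
  volumeName: nfs-pv
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kernelspecs-pvc
  namespace: enterprise-gateway
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client # dynamically provisioned by the NFS provisioner
  resources:
    requests:
      storage: 20Mi
EOF
```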
- Deploy JEG to namespace `enterprise-gateway`

```bash
git clone https://github.com/jupyter-server/enterprise_gateway
mkdir eg
helm template --output-dir ./eg enterprise-gateway enterprise-gateway/etc/kubernetes/helm/enterprise-gateway \
  -n enterprise-gateway -f jeg_customized_values.yaml
kubectl apply -f ./eg/enterprise-gateway/templates/
```
We use `jeg_customized_values.yaml` to customize the JEG chart values.
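The exact contents depend on the chart version; a sketch of the kind of overrides used here, assuming the chart exposes `kernelspecsPvc` and `kernel` settings (verify the key names against the chart's values.yaml):

```bash
# Hypothetical jeg_customized_values.yaml; key names should be checked against
# the enterprise-gateway chart version actually in use.
cat > jeg_customized_values.yaml <<'EOF'
kernelspecsPvc:
  enabled: true              # mount kernelspecs from an existing PVC...
  name: kernelspecs-pvc      # ...the claim created in the previous step
kernel:
  launchTimeout: 60          # assumption: seconds to wait for a kernel pod to start
EOF
```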
Copy the kernelspecs, kernel-launcher scripts, and j2 templates to the NFS share that corresponds to pvc `kernelspecs-pvc`. After copying, the file/directory structure on the NFS share looks like:

```
.
├── python_kubernetes
│   ├── kernel.json
│   └── scripts
│       ├── kernel-pod.yaml.j2
│       └── launch_kubernetes.py
└── spark_python_kubernetes
    ├── bin
    │   └── run.sh
    ├── kernel.json
    └── scripts
        ├── kernel-pod.yaml.j2
        └── launch_kubernetes.py
```
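One way to stage these files (a sketch; the local mount point and the source location of the kernelspec directories are assumptions — the dynamic provisioner places kernelspecs-pvc in its own subdirectory under the export, whose name varies):

```bash
# Mount the NFS export locally, locate the subdirectory backing kernelspecs-pvc,
# and copy the two kernelspec directories into it. Paths are examples.
sudo mkdir -p /mnt/nfs_share
sudo mount -t nfs 172.17.0.1:/home/nfs_share /mnt/nfs_share
ls /mnt/nfs_share                      # find the kernelspecs-pvc subdirectory
cp -r python_kubernetes spark_python_kubernetes /mnt/nfs_share/<kernelspecs-pvc-subdir>/
```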
- Helm deploy JHub to namespace `jupyterhub`

```bash
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm install jhub jupyterhub/jupyterhub -f jhub_customized_values.yaml --version=2.0.0 -n jupyterhub
```
We use `jhub_customized_values.yaml` to customize the JHub chart values:
- Set `singleuser.storage.type` to `static` to use static storage allocation
- Set `singleuser.storage.static.pvcName` to `jhub-claim` to allocate static storage for the Jupyter server (e.g., each user's home dir is mapped to a subdirectory named after the username)
- Set `singleuser.extraEnv.JUPYTER_GATEWAY_URL` to point to JEG's endpoint (http://enterprise-gateway.enterprise-gateway:8888)
- Set `singleuser.cmd` so that a shell first expands the `KERNEL_PATH=` expression and then passes KERNEL_PATH as an environment variable to `jupyterhub-singleuser` via the `env` command. JUPYTERHUB_USER contains the username, and KERNEL_PATH stores the NFS share path to be mapped to the user's home dir in the kernel pod.
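A sketch of what `jhub_customized_values.yaml` might look like for the settings above (the `subPath` convention and the exact KERNEL_PATH value are assumptions; the PVC name, gateway URL, and the env/`jupyterhub-singleuser` pattern follow the bullets above):

```bash
# Hypothetical jhub_customized_values.yaml for the Z2JH chart (version 2.0.0, as above).
cat > jhub_customized_values.yaml <<'EOF'
singleuser:
  storage:
    type: static
    static:
      pvcName: jhub-claim
      subPath: "{username}"   # assumption: one subdirectory per user on the share
  extraEnv:
    JUPYTER_GATEWAY_URL: "http://enterprise-gateway.enterprise-gateway:8888"
  cmd:
    - /bin/sh
    - -c
    # Expand KERNEL_PATH in a shell first, then hand it to jupyterhub-singleuser
    # via env; the path is an assumption matching this demo's NFS layout.
    - env KERNEL_PATH=/home/nfs_share/claim/${JUPYTERHUB_USER} jupyterhub-singleuser
EOF
```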
- Build a custom Spark Python kernel image that supports S3A access to object storage

The default Spark Python kernel images available in the repo are built from Spark with Hadoop 2.7, so S3A access does not work with them. We therefore build a custom container image, using the Dockerfile from the repo as a template, that installs Spark 3.2.3 with Hadoop 3.2, and we add `hadoop-aws-3.2.3.jar` and `aws-java-sdk-bundle-1.11.901.jar`, which are required for S3A access, to this image. These two jar files are compatible with Spark 3.2.3 on Hadoop 3.2. We use the Dockerfile to build the custom kernel image:

```bash
cd build
docker build -t yangxh/kernel-spark-py:latest .
```
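Before running the build, the two jars can be fetched from Maven Central into the build context (a sketch; the `build/` directory layout is an assumption):

```bash
# Download the S3A-related jars (versions from the text above) into the build context.
curl -fL -o build/hadoop-aws-3.2.3.jar \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.3/hadoop-aws-3.2.3.jar
curl -fL -o build/aws-java-sdk-bundle-1.11.901.jar \
  https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar
```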
- Customize the kernel.json files in the kernelspecs NFS share

Customize the `python_kubernetes/kernel.json` and `spark_python_kubernetes/kernel.json` files in the kernelspecs NFS share to include the environment variables KERNEL_VOLUME_MOUNTS and KERNEL_VOLUMES. These variables are read by the `python_kubernetes/scripts/launch_kubernetes.py` script to render the kernel pod yaml file, which includes a mount of the NFS share at the path specified by the environment variable KERNEL_PATH. (When JHub launches a server for a user and that server connects to JEG, KERNEL_PATH is one of the environment variables passed to JEG. Because KERNEL_VOLUMES in `python_kubernetes/kernel.json` references KERNEL_PATH, the kernel pod yaml file prepared by JEG includes a mount entry for the NFS share at the path specified by KERNEL_PATH.) As a result, a user's Jupyter server pod and kernel pod share a common NFS share mapped to their home directories ('/home/jovyan').

In addition to the above customization, we add configuration for the S3 object storage endpoint URL, authentication provider, and credentials to `spark_python_kubernetes/kernel.json` so that S3 object storage can be accessed from the Spark Python kernel. For illustration, we specify `AWS_SECRET_ACCESS_KEY=foo` and `AWS_ACCESS_KEY_ID=bar` as environment variables and append `--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider --conf spark.hadoop.fs.s3a.endpoint=object.storage.com` to the `SPARK_OPTS` variable. We also configure the Spark driver and executor containers to use the custom container image `docker.io/yangxh/kernel-spark-py:latest` we built in the previous step.
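Putting those Spark-side pieces together, the additions might look like the following (shown as shell variables purely for readability; in `spark_python_kubernetes/kernel.json` they live inside the env stanza, and the credentials/endpoint are the placeholder values from above):

```bash
# Illustrative only: S3A-related additions for the Spark Python kernel.
export AWS_ACCESS_KEY_ID=bar
export AWS_SECRET_ACCESS_KEY=foo
# Appended to the existing SPARK_OPTS value; the container image conf points the
# Spark driver and executors at the custom image built in the previous step.
SPARK_OPTS="${SPARK_OPTS} \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.endpoint=object.storage.com \
  --conf spark.kubernetes.container.image=docker.io/yangxh/kernel-spark-py:latest"
```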
Cluster state observed at each stage:
- PVs
- PVCs
- Services
- Before any user logs in (a total of 16 pods in the cluster), and after Alice exits her login session
- After user Alice logs in and connects to a Python kernel (two more pods: one for the server and one for the kernel)
- After user Alice logs in and connects to a Spark Python kernel (one server pod plus three kernel pods: one Spark driver and two executors)
- After user Alice shuts down a kernel but stays logged in (one less pod: the kernel pod is gone and the server pod stays)
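These states can be checked with standard kubectl commands, for example (namespaces per this demo; the output depends on the stage):

```bash
# Persistent volumes, claims, and services created by the demo.
kubectl get pv
kubectl get pvc -n jupyterhub
kubectl get pvc,svc -n enterprise-gateway
kubectl get svc -n jupyterhub
# Server and kernel pods appear and disappear as Alice logs in, starts kernels,
# and shuts them down.
kubectl get pods -n jupyterhub
kubectl get pods -n enterprise-gateway
```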
References:
- Microk8s tutorial:
- Jupyter Enterprise Gateway: https://github.com/jupyter-server/enterprise_gateway
- JupyterHub:
- Kernel Spark Python Dockerfile: