canonical / data-science-stack Goto Github PK

View Code? Open in Web Editor NEW

9.0 9.0 3.0 298 KB

Stack with machine learning tools needed for local development.

License: Apache License 2.0

Python 91.34% Jinja 2.69% Shell 5.82% Dockerfile 0.14%

charm charmed-kubeflow single-charm

data-science-stack's Introduction

Data Science Stack ✨

Making it seamless to run GPU enabled containerized ML Environments

Overview

The Data Science Stack (DSS) makes it seamless for everyone to jump into an ML Environment within minutes and be able to utilise their GPUs.

DSS is a ready-made environment which allows everyone to run ML workloads on the laptop. It gives easy access to a solution for developing and optimising ML models, that leverages the GPUs of the laptop by enabling users to utilise different ML environment images based on their needs.

The DSS is a stack that includes

a container orchestration system (microK8s snap)
out-of-the box containerized ML Environments
an intuitive CLI which streamlines the management of those containers (data-science-stack snap)

The container orchestration system also handles the integration with the host's GPU and drivers, so the containerized environments can only focus on user-space libraries.

Features

Containerized environment management
Seamless GPU utilization
Out-of-the box ML Environments with JupyterLab
Easy data passing between local machine and containerized ML Environments
MLflow for lineage tracking

Requirements

Ubuntu 22.04
Snapcraft (included in Ubuntu)

Quick Start

🚧🚧

Resources

🚧🚧

Feedback

🚧🚧

data-science-stack's People

Contributors

Stargazers

Watchers

Forkers

kenvandine ca-scribner mckees

data-science-stack's Issues

Specification for data science stack

Why it needs to get done

Before starting the implementation we need a specification first.

What needs to get done

Write a spec answering this questions.

How we will implement the DSS?
How do we install DSS (do we use snap?)
How do we upgrade DSS?
How do we remove DSS?
How do we publish DSS?
What thigs can be achieved with DSS (e.g. notebook creation)?

When is the task considered done

Specification is approved

Improve the user experience around debugging deployment issues

Why it needs to get done

When dss initialize or other commands fail, we should provide the user with help to debug. For example:

if dss initialize times out it provides no explanation about what went wrong and deletes all deployed resources which prevents further debugging
if hostpath-storage is not enabled, the deployment hangs without any reason why

We should improve this.

Some possible ideas:

timeout could leave the resources in whatever state they're in and give the users instructions on 1) how to debug further if they want to, and 2) how to remove the resources (ex: run dss remove-components or whatever dss command is appropriate)
- Con to this: leaves the user responsible for cleaning up any partially deployed resources
timeout could take a snapshot of all relevant debugging information
- Con to this: we need to predict where debugging information might be and capture it, without knowing ahead of time what will go wrong

What needs to get done

investigate and decide how to improve the UX around debugging

When is the task considered done

see above

Explore: How to remove snap with charms

Why it needs to get done

We need a way to remove snap which deploys charms to microk8s.

What needs to get done

Talk with other teams (start with microk8s) about how to remove snap which installs charms.

When is the task considered done

We know how to remove snap which installs charms.

Create a CI for testing the DSS snap

Why it needs to get done

We need a way of testing the DSS snap on pull request and push.

What needs to get done

Have a Github workflow that runs testing (unit and/or integration) on the snap. The testing that has to be done is ensuring the snap builds and the CLI commands work and don't return a non-zero exit code.

Please refer to #23 for more details on what can be leveraged (actions, reusable code, etc.)

When is the task considered done

When there is a workflow that automatically runs the sanity tests described above.

Create rockcraft projects for pytorch full CPU image

Why it needs to get done

The DSS project allows users to select the image of the Notebook server to be used, some of which are built and distributed by this team.

What needs to get done

Create a rockcraft project for the pytorch full Notebook Server image. Use this Dockerfile for reference.
Add sanity tests to the rock.

When is the task considered done

When the pytroch full CPU image is published in the CKF team Dockerhub repository.

Implement initialise CLI command

Why it needs to get done

DSS needs a command to connect Juju (from within the snap) to Microk8s running on the local machine. For this task we may assume that both executables are available at local machine.

What needs to get done

Implement a CLI command dss initialise which connects juju to microk8s based on config speciffied in env variable KUBECONFIG or based on flag --kubeconfig. Initialisation will consist of:

Add microk8s cloud based on provided kubeconfig

cat $SNAP_DATA/microk8s/client.config | $your_juju add-k8s my-microk8s

Bootstrap juju controller
Create juju model
Deploy DSS bundle

bundle: kubernetes
name: dss
applications:
  admission-webhook:
    charm: admission-webhook
    channel: 1.8/stable
    trust: true
    scale: 1
    _github_repo_name: admission-webhook-operator
    _github_repo_branch: track/1.8
  mlflow-minio:
    charm: minio
    channel: ckf-1.7/stable
    scale: 1
    trust: true
    _github_repo_name: minio-operator
  mlflow-mysql:
    charm: mysql-k8s
    channel: 8.0/stable
    scale: 1
    trust: true
    _github_repo_name: mysql-k8s-operator
  mlflow-server:
    charm: mlflow-server
    channel: 2.1/stable
    scale: 1
    trust: true
    _github_repo_name: mlflow-operator
  jupyter-controller:
    charm: jupyter-controller
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: notebook-operators
    options:
      use-istio: false
relations:
  - [mlflow-server, mlflow-minio]
  - [mlflow-server, mlflow-mysql]

wait for bundle to be deployed

Example bash script from demo:

juju bootstrap my-k8s uk8s-controller
juju add-model kubeflow

# Deploy charms
juju deploy dss --trust

juju wait-for application mlflow-server --query='name=="mlflow-server" && (status=="active" || status=="idle")' --timeout=15m0s
juju wait-for application mlflow-minio --query='name=="mlflow-minio" && (status=="active" || status=="idle")' --timeout=15m0s
juju wait-for application jupyter-controller --query='name=="jupyter-controller" && (status=="active" || status=="idle")' --timeout=15m0s

When is the task considered done

DSS CLI has command option dss initialise setups juju cloud controller and model and deploys DSS bundle.
After execution of the command we can run juju status and we can see all the components of DSS bundle in active state.

Exploration: Use Singularity Containers for DSS

Why it needs to get done

For the DSS we want to execute containers. As an alternative to Docker Singularity containers come into play.

What needs to get done

Find out how to run GPU workloads with Singularity containers
Find out how to run Docker containers in Singularity
Research how run GPU workloads with IntelGPU devices

When is the task considered done

We have clear understanding if Singularity is our way to deploy GPU workloads.

Implement prepare-host-env CLI command

Why it needs to get done

DSS should be able to output bash script to deploy microk8s on the host machine. We cannot setup microk8s inside the snap because of snap isolation.

What needs to get done

Implement a CLI command dss prepare-host-env which outputs bash script to deploy microk8s on the host machine. This commands also sets the desired addons. Example script from Demo:

#!/bin/bash
sudo snap install microk8s --classic --channel=1.28/stable

sudo usermod -a -G microk8s ubuntu
sudo mkdir /home/ubuntu/.kube
sudo chown -f -R ubuntu /home/ubuntu/.kube

sudo microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
sleep 30
sudo microk8s.kubectl wait --for=condition=available -nkube-system deployment/coredns deployment/hostpath-provisioner
sudo microk8s.kubectl -n kube-system rollout status ds/calico-node
snap connect microk8s $dss

You can also try to use MIcrok8s content feature to specify the microk8s setup with addons in the yaml file.

When is the task considered done

DSS CLI has command option dss prepare-host-env which outputs bash script to install microk8s version 1.28 (we will always provide same version).
The script will also setup addons dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
After execution of the script on the host machine (e.g. piping to bash) the microk8s is available

Rewrite the DSS spec with UX team comments

Why it needs to get done

UX team has checked the DSS spec and gave ideas on how we should rewrite the dss cli to be more usrr friendly. Before we continue with implementation we should redesign the CLI in the spec.

What needs to get done

Rewrite the DSS spec with UX team comments.

When is the task considered done

The spec is approved.

Implement DSS `status` command

Why it needs to get done

The status command checks the status of key components within the DSS environment. It verifies if the MLflow deployment is ready and checks if GPU acceleration is enabled on the Kubernetes cluster by examining the labels of Kubernetes nodes for NVIDIA or Intel GPU devices.

dss status

Please use the UX spec documentation .

What needs to get done

Implement start command
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Implement DSS `remove` command

Why it needs to get done

The remove-notebook command removes a specified notebook using the lightkube library and prints "notebook removed" upon completion.

dss remove-notebook --name user-notebook

What needs to get done

Implement remove command which deploys aforementioned components.
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Explore: Options to install + reinstall Nvidia drivers in Kubernetes cluster

Why it needs to get done

As a DSS user I want to be able to pick any version of Nvidia driver I want for my machine. I also want to be able to change this driver anytime I need. The assumptions is that for different ML library, version combination I may need different nvidia driver.

What needs to get done

Known research resources:

gpu addon for microk8s https://microk8s.io/docs/addon-gpu
nvidia operator https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html

When is the task considered done

We have more insights into options on what tools to use to install + reinstall to specific version of nvidia driver in the kubernetes cluster.

Implement DSS `start` command

Why it needs to get done

The start command starts a specified notebook within the DSS environment. This command is expect to start a stopped notebook

dss start my-notebook

Please use the UX spec documentation .

What needs to get done

Implement start command
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Explore: estimate the effort needed for DSS cli commands

Why it needs to get done

As we progress with the DSS installation we need to estimate the effort needed for different DSS CLI command

What needs to get done

After finishing the spec for DSS we want to estimate the effort needed for each DSS CLI commands and all the other DSS efforts.

When is the task considered done

We have task for each effort needed to finalize DSS based on the technical spec.

Explore: Research existing alternatives of DSS

Why it needs to get done

We need more insights on how others are solving problems which we want to face with DSS.

What needs to get done

Research existing tools:

When is the task considered done

We know what are aforementioned tools doing and how they are doing that.

Implement DSS `create-notebook` command

Why it needs to get done

The create-notebook command allows users to create a Jupyter notebook within the DSS environment. Users specify the notebook's name and an image for the notebook. Behind the scenes, the DSS Python library creates a notebook object from a Kubernetes manifest template, using the provided name and image flags. The command waits until the notebook pod is ready and then outputs a link to access the notebook UI. While waiting for the pod, DSS will output the status of the Pod (Creating, Waiting …). This command will fail with an error message if the Pod won’t get created or if it ends with an error.

The command looks like:

dss create-notebook --name user-notebook --image kubeflownotebookswg/jupyter-scipy:v1.8.0

Whenever a notebook is created dss deploys these Kubernetes objects:

Notebook deployment. This notebook will mount the user-data PVC which is shared across all notebooks. Additionally, the notebook requires an environment variable MLFLOW_TRACKING_URI set with a URL pointing to the MLflow server to be able to connect.
Notebook ClusterIP service, providing access to the notebook.

Output:
Access the notebook at http://10.152.183.223/notebook/user-namespace/user-notebook/

This command has also a -h or –help flag which prints out the help description of the command with a list of recommended images for notebooks. For this list we will use the default list from jupyter-ui. This list can be subject to a change based on customer needs.
Valid image: is the image which is accessible by DSS and can create a valid jupyter notebook (you can read more about creating valid images in the Documentation section). If the image should support GPU enhanced workloads the image must contain CUDA and cuDNN drivers (or tools needed for Intel GPU based workloads). The image may not contain the MLflow client python library.

What needs to get done

Implement create-notebook command which deploys aforementioned components.
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Create rockcraft projects for tensorflow full CPU image

Why it needs to get done

The DSS project allows users to select the image of the Notebook server to be used, some of which are built and distributed by this team.

What needs to get done

Update the rock file in kubeflow-rocks. Use this Dockerfile for reference.
Make sure the CI is properly working in the repo and the sanity tests are testing the rocks functionality.

When is the task considered done

When the tensorflow full CPU image is published in the CKF team Dockerhub repository.

Create the `user-data` PVC during the `dss initialize` command

Why it needs to get done

The user-data PVC is required for all user notebooks, and should be created by dss initialize as described in the spec. This was missed during #31

Because we're using microk8s hostpath storage, the volume size is arbitrary.

What needs to get done

add the notebook pvc creation to dss initialize

When is the task considered done

pvc for notebooks is created by dss initialize

Update the suggested images in `dss create-notebook --help` to use rocks

Why it needs to get done

The initial dss create-notebook --help text uses Kubeflow's upstream notebook servers as suggested images. Once we have our own rocks, we should update this help text to link our rocks.

What needs to get done

Update main.py with rock image links once available

When is the task considered done

when the help text uses canonical-published rocks

Exploration: the publish and approval process of a snap

Why it needs to get done

We need to explore the publish and approval (if needed) process of a snap to the Snap store.

What needs to get done

Explore the following:

What is the process we must follow for releasing a snap?
How much time does it take?
Do we depend on other teams to publish/approve a snap?
What tools could be leveraged? Example https://snapcraft.io/build
Do we need to build a CI that handles this for us?
Are there any re-usable workflows for publishing/building?

When is the task considered done

When we have enough information to answer the questions presented above to get started with writing our first snap.

Explore: If there is a way to locally browse PVCs in filesystem

Why it needs to get done

We want give the user the ability to browse the files stored in PVCs.

What needs to get done

Research the options for browsing files in PVCs in local file system.

When is the task considered done

Find a tool for browsing files in PVCs.

Explore: How to implement DSS with classical snap

Why it needs to get done

We plan to use classi snap for DSS because it needs to install other snaps like Microk8s, juju and juju-wait.

What needs to get done

Research:

how are classic snaps written
how to install, upgrade, delete other snaps with classic snap
examples of other classic snaps
how to run python cli in classic snap

When is the task considered done

We will understand how to write proper classic snap.

Feedback/improvements for dss

Bug Description

There are a couple of user experience limitations that are preventing us from using dss correctly. There are also small errors in the code. Here are a couple I have found while reviewing some PRs:

Timeout - I am working on a small multipass machine with network limitations and the dss initialize --kubeconfig $KUBECONFIG command just times out with a message that is not very helpful message:

2024-03-08 09:17:36 [ERROR] [initialize] [initialize]: Timeout waiting for deployment 'mlflow-deployment' in namespace 'dss' to be ready. Deleting resources...

For enhancing the user experience it would be ideal to have:

Options to change the timeout
A better logging to understand what's wrong

manifest_templates are not installed in /usr/local/lib/python3.10/dist-packages/dss/ which causes some of the dss commands to fail while looking for those package data files. I have identified some missing items in the setup.cfg file:

[options.package_data] should point to the files in manifest_templates/ instead of just manifest.yaml
Potentially we'll need to include MANIFESTS.in

There seems to be a typo here. It should be 60(?).
The logs in the initialise command is confusing:

2024-03-08 09:33:49 [INFO] [utils] [wait_for_deployment_ready]: Waiting for deployment mlflow in namespace dss to be ready...
2024-03-08 09:33:49 [ERROR] [initialize] [initialize]: Timeout waiting for deployment 'mlflow-deployment' in namespace 'dss' to be ready. Deleting resources...

We are waiting for the mlflow deployment to be ready, but then we refer to the mlflow-deployment. I see two problems with this:

We are using two names to refer to the same thing
The initialise command is talking about mlflow, which seems out of context considering that we are running dss initialize .... I'd expect better messaging for this, like: initializing dss -> deploying mlflow -> waiting for mlflow -> etc.

Repository doesn't have a README.md, CONTRIBUTING.md.
We could include a message in dss that tells users when any of the dependencies are missing or prerequisites are not met

pip packages
microk8s addons(?) <--- this could be tricky because the k8s node could be anything

To reproduce

Install dss from source and run commands in the description

KUBECONFIG=~/.kube/config
pip install .

Follow each step in the bug description

Create a CI for publishing DSS snap

Why it needs to get done

To have an automated way of publishing DSS to the Snap store.

What needs to get done

An automated workflow that can be used for releasing the DSS to the Snap Store on an agreed event (on demand, on merge, etc.) Please NOTE this task depends on #24, if we are able to leverage other solutions (like https://snapcraft.io/build) we may not need to build a workflow for this at all.

When is the task considered done

Please NOTE this task depends on #23, the description of this task will change accordingly.

Rewrite the POC DSS with python scripts

Why it needs to get done

For the final deployment we want to use Python rather than bash.

What needs to get done

Rewrite deploy.sh in https://github.com/misohu/data-science-stack-poc as python script.

When is the task considered done

We have a python script which can deploy DSS stack and creates default user namespace with notebook.

Create and release a `snap` for the DSS v0.1

Why it needs to get done

We need to create a snap project for the DSS.

What needs to get done

NOTE: this task may not be required at all if we can leverage automated tools like https://snapcraft.io/build

Release DSS snap v0.1 to the Snap store. Estimation 1D + approval process (if required)

When is the task considered done

The DSS snap is available in the store and can be installed using sudo snap install dss <options>

Create versioning startegy for DSS snap

Why it needs to get done

By registering the snap on snapstore we have automatic snap builds from default branch into edge track for free. The question is how are we going to face different versions of snaps?

What needs to get done

Design and implement versioning strategy and CI for DSS snap.

When is the task considered done

Versioning strategy is specified and implemented.

Add option to `create-notebook` to use GPUs

Why it needs to get done

Based on the dss spec, we need notebooks that support GPUs. #31 implements the create-notebook command, but it does not include an option to enable a GPU.

What needs to get done

add an interface to create-notebook to enable gpus (--gpu flag? or something more detailed?)

When is the task considered done

create-notebook allows creation of notebooks with GPUs

Create UATs that test `dss` actually uses GPUs (NVIDIA)

Why it needs to get done

In order to have e2e testing, an automated and repeatable testing framework is required to ensure dss can actually spin up Jupyter servers where GPU workloads can run on. Testing also covers the ML frameworks (pytorch and tensorflow) that users have access to, so this should also be considered.

What needs to get done

Create notebooks that exercise Pytorch and Tensorflow in a CPU environment
Create notebooks that exercise Pytorch and Tensorflow in a GPU environment

When is the task considered done

When the notebooks are placed in the UATs repository with instructions on how to run them.

Implement DSS `logs` command

Why it needs to get done

The print-logs command gathers the logs from the key components of the DSS and outputs them to the terminal. Namely

Notebooks
MLflow

Users can provide part parameters to get specific logs (e.g. notebooks). Additionally the parameter --name is used to get logs from a specific notebook.

dss logs --parts=notebooks --name=user-notebook

What needs to get done

Implement logs command which deploys aforementioned components.
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Exploration: How we can run GPU workloads in LXC

Why it needs to get done

For the DSS we want to execute containers. As an alternative to Docker LXC comes into play.

What needs to get done

Find out how to run GPU workloads with LXC
Find out how to run Docker containers in LXC
Research how run GPU workloads with IntelGPU devices

When is the task considered done

We have clear understanding if LXC is our way to deploy GPU workloads.

Implement DSS `list` command

Why it needs to get done

The list command retrieves and displays a list of Jupyter notebooks within the DSS environment. This command provides users with an overview of existing notebooks.

dss list

Example output

Name           State    Image                                               URL
user-notebook  Running  kubeflownotebookswg/jupyter-scipy:v1.8.0            http://10.152.183.223/
experiment-1   Running  charmedkubeflow/jupyter-pytorch-full:1.8.0-3058193  [http://10.152.183.22](http://10.152.183.223/)5
experiment-2   Stopped  kubeflownotebookswg/jupyter-scipy:v1.8.0            (stopped)

Please use the UX spec documentation.

What needs to get done

Implement list command
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Create a snapcraft.yaml file for the DSS

Why it needs to get done

A snapcraft.yaml file must be created defining all what is needed for the DSS to be installed, started, removed, etc.

What needs to get done

To write a snapcraft.yaml. The actual content of the snap depends on the exploration task #23, this field will be updated once we have worked on it.

Reference:

When is the task considered done

TODO: depends on #23 to clearly define what are the contents of the snapcraft.yaml file and deliverables of this task.

Create a workflow for running integration GPU tests

Why it needs to get done

The DSS needs to be run in a GPU enabled environment to test the correct integration with GPUs.

What needs to get done

Create a Github workflow that

Provisions a VM with GPUs, it can be either in AWS or self hosted runners
Installs all the requirements for testing dss
Runs integration tests that require GPUs

When is the task considered done

When there is an automated workflow that is triggered on pull and push.

Implement DSS `stop` command

Why it needs to get done

The stop command stops a specified notebook within the DSS environment. This command is expect to start a stopped notebook

dss stop my-notebook

Please use the UX spec documentation .

What needs to get done

Implement stop command
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Exploration: how to write and test snaps?

Why it needs to get done

We need to explore how snaps are written and tested before we can start creating one on our own. We need to gather enough information to start writing a snap for DSS.

Estimate 1D

What needs to get done

Explore the following:

How is a snapcraft.yaml constructed?
How is a fully built snap tested?
What tools can be leveraged. For instance https://snapcraft.io/build.
What is the repository structure of a snap?

When is the task considered done

When we have enough information to answer the questions presented above to get started with writing our first snap.

Explore: How to upgrade snap with charms

Why it needs to get done

We need a way to upgrade a snap which deploys charms

What needs to get done

Find a way how to properly do the upgrade. Talk first with microk8s team.

When is the task considered done

No response

Explore: How to publish a snap

Why it needs to get done

We need to to publish our snap with DSS to snapstore.

What needs to get done

Talk with other teams on how to properly publish snaps to snapstore.

When is the task considered done

We know how to publish snap to snapstore.

Add convenience commands to display URLs for mlflow and notebooks

Why it needs to get done

To connect to mlflow or a dss notebook, we can follow the ClusterIP:Port of the service. You can get these from kubectl get svc -n dss, but we should make it easier.

The initial spec for this tool has dss create-notebook (no args) listing the available notebooks, but this feels awkward. An alternative would be a dedicated command (dss list-notebooks, and something similar for mlflow? Or maybe a general dss list-resources that shows both mlflow and notebooks?). Whatever is implemented, it should have clickable links for all the resources

What needs to get done

implement a convenience function to see all notebook/mlflow endpoints

When is the task considered done

see above

Create rockcraft projects for tensorflow full GPU image

Why it needs to get done

The DSS project allows users to select the image of the Notebook server to be used, some of which are built and distributed by this team.

What needs to get done

Update the rock file in kubeflow-rocks. Use this Dockerfile for reference.
Make sure the CI is properly working in the repo and the sanity tests are testing the rocks functionality.

When is the task considered done

When the tensorflow full GPU image is published in the CKF team Dockerhub repository.

Implement create-notebook CLI command

Why it needs to get done

DSS should be able to create a jupyter server.

What needs to get done

With DSS I can run dss create-notebook command which will create a jupyter notebook and will output its URL for access. The command has two arguments:

--notebook-name: A name related with the server.
--notebook-image: An image which will be main image of the server pod.

When is the task considered done

User can run e.g.:

dss create-notebook --name user-notebook --notebook-image kubeflownotebookswg/jupyter-scipy:v1.8.0

this command will create jupyter server with name user-notebook and will output URL for accessing the notebook e.g.:

Access the notebook at http://10.152.183.223/notebook/user-namespace/user-notebook/

Explore: How to handle snap dependencies

Why it needs to get done

To properly deploy dss we need some other snap dependencies (juju, microk8s, yq etc.).

What needs to get done

Talk with other teams how to properly handle snap dependencies.

When is the task considered done

We know how to handle snap dependencies.

Improve how dss manages its kubeconfig file

Why it needs to get done

#29 initializes dss on a given microk8s cluster, specifying the cluster via a kubeconfig file passed by its filepath. This works fine for an initial implementation, but extending this to other commands will be frustrating because users will have to provide kubeconfig for every call to the dss CLI. We need to implement a better way of remembering the kubeconfig file between command executions to maintain a positive user experience.

What needs to get done

Implement something to remember kubeconfig files between calls, for example maybe keeping a local copy of the most recent file provided. This method must work both for our testing in this repo as well as when implemented as a snap.

When is the task considered done

all dss commands have a way of remembering the kubeconfig between calls.

Explore: How to run shell script or python on snap install

Why it needs to get done

For shipping the the DSS we want to run shell/python script to setup DSS.

What needs to get done

shell script/python file needs to be executed on

sudo snap intall dss --classic

When is the task considered done

We know how to execute shell script/python script on snap install (if it is possible).

Implement DSS `purge` command

Why it needs to get done

The remove-components command removes all Kubernetes components connected to DSS. Internally, this involves the removal of the entire dss Kubernetes namespace. Importantly, this command does not uninstall the DSS snap or the Python library. It proves useful for reinitialization without the necessity to remove the snap. It's important to note that PVC for notebooks will be removed but data will stay in microk8’s hostpath folder.

dss remove-components

What needs to get done

Implement remove components command which deploys aforementioned components.
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Define the manifests path as a variable in `config.py`

Why it needs to get done

to define the manifests path in one place and reuse it for all commands
it was first discussed in this comment

What needs to get done

Move the definition of the manifests path to a variable in config.py and reuse this variable in all the commands .py files that interact with the manifests path.

in config.py do something like:

MANIFEST_TEMPLATES_LOCATION = Path("./manifest_templates")
ABSOLUTE_MANIFEST_TEMPLATES_LOCATION = Path(__file__).parent / MANIFEST_TEMPLATES_LOCATION

When is the task considered done

manifests path definition is refactored as described and PR is merged

Implement UX changes for `initialize`

Why it needs to get done

Implement the proposed changes by UX team for the initialize command. More information is in the spec.

What needs to get done

Implement the changes in error messages
Implement the changes for options
Implement integration and unit tests
Test on Microk8s.

When is the task considered done

Changes are implemented
Changes are tested

Create rockcraft projects for pytorch full GPU image

Why it needs to get done

The DSS project allows users to select the image of the Notebook server to be used, some of which are built and distributed by this team.

What needs to get done

Update the rock file in kubeflow-rocks. Use this Dockerfile for reference.
Make sure the CI is properly working in the repo and the sanity tests are testing the rocks functionality.

When is the task considered done

When the pytroch full GPU image is published in the CKF team Dockerhub repository.

Implement DSS initialize command

Why it needs to get done

Running dss initialize will create a Kubernetes dss namespace for all user notebooks and it will deploy MLflow . All Kubernetes components connected with DSS will be deployed under the dss Kubernetes namespace. The initialize command accepts a --kubeconfig parameter which will be used for accessing the underlying Kubernetes cluster. If the kubeconfig is not provided, the initialize command will search for it in the KUBECONFIG environment variable.

MLflow

To deploy MLflow properly in local mode using the initialize command, DSS will generate the following Kubernetes objects:

MLflow-data Persistent Volume Claim (PVC) backed by the hostPath storage class.
MLflow deployment responsible for the MLflow server. This deployment also mounts the aforementioned PVC on /mldata to ensure data persistence in the event of a Microk8s restart. For more information on how to run the MLflow server in local mode refer to this guide.
MLflow service of type ClusterIP, exposing the MLflow server.

What needs to get done

Implement initialize command which deploys aforementioned components.
Implement integration test with Microk8s

When is the task considered done

Command is implemented
Tests are passing

Feedback on implementing the dss commands with timeout

Bug Description

Implementing the dss initialize and dss create-notebook commands with timeout affects the dss user experience, where the timeout failure logs commands are sometimes not helpful. For example as mentioned here for dss initialize and here for dss create-notebook.

It's necessary to consider, with the help of the UX team, whether the commands should have timeout and how to make the status of dss visible to users.

To Reproduce

Install dss from source and run the commands

KUBECONFIG=~/.kube/config
pip install .
dss initialize --kubeconfig $KUBECONFIG