kubeflow / katib

Automated Machine Learning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/katib

License: Apache License 2.0

Languages: Makefile 0.40%, Shell 3.17%, Go 41.20%, Python 37.21%, JavaScript 0.08%, HTML 5.86%, Dockerfile 0.62%, TypeScript 11.21%, SCSS 0.22%, CSS 0.01%
Topics: ai, automl, huggingface, hyperparameter-tuning, jax, kubeflow, kubernetes, llm, machine-learning, mlops

Katib's Introduction


Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports Hyperparameter Tuning, Early Stopping and Neural Architecture Search.

Katib is agnostic to machine learning (ML) frameworks. It can tune hyperparameters of applications written in any language of the users’ choice and natively supports many ML frameworks, such as TensorFlow, Apache MXNet, PyTorch, XGBoost, and others.

Katib can run training jobs using any Kubernetes Custom Resource, with out-of-the-box support for the Kubeflow Training Operator, Argo Workflows, Tekton Pipelines, and many more.

Katib stands for secretary in Arabic.

Search Algorithms

Katib supports several search algorithms. Follow the Kubeflow documentation to learn more about each algorithm, and see the Suggestion service guide to implement your own algorithm.

Hyperparameter Tuning:

  • Random Search
  • Grid Search
  • Bayesian Optimization
  • TPE
  • Multivariate TPE
  • CMA-ES
  • Sobol's Quasirandom Sequence
  • HyperBand
  • Population Based Training

Neural Architecture Search:

  • ENAS
  • DARTS

Early Stopping:

  • Median Stop
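
To try one of these algorithms from the Python SDK, you can pass its name when creating an Experiment. The following is a minimal sketch, assuming the SDK's tune() API accepts an algorithm_name argument and that the algorithm names ("bayesianoptimization", "tpe", "cmaes", ...) follow the Kubeflow documentation:

import kubeflow.katib as katib

# Hedged example: tune a toy objective with Bayesian Optimization
# instead of the default random search.
def objective(parameters):
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    print(f"result={result}")

katib_client = katib.KatibClient()
katib_client.tune(
    name="bo-experiment",
    objective=objective,
    parameters={
        "a": katib.search.int(min=10, max=20),
        "b": katib.search.double(min=0.1, max=0.2),
    },
    objective_metric_name="result",
    algorithm_name="bayesianoptimization",  # assumption: see the Kubeflow docs for exact names
    max_trial_count=12,
)

The same pattern should apply to the other hyperparameter tuning algorithms listed above.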

To run the above algorithms, Katib supports the following frameworks:

Installation

For the various Katib installation options, check the Kubeflow guide. Follow the steps below to install Katib standalone.

Prerequisites

These are the minimal requirements to install Katib:

  • Kubernetes >= 1.27
  • kubectl >= 1.27

Latest Version

To install the latest Katib version, run this command:

kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"

Release Version

To install a specific Katib release (for example, v0.14.0), run this command:

kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.14.0"

Make sure that all Katib components are running:

$ kubectl get pods -n kubeflow

NAME                                READY   STATUS      RESTARTS   AGE
katib-controller-566595bdd8-hbxgf   1/1     Running     0          36s
katib-db-manager-57cd769cdb-4g99m   1/1     Running     0          36s
katib-mysql-7894994f88-5d4s5        1/1     Running     0          36s
katib-ui-5767cfccdc-pwg2x           1/1     Running     0          36s
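
You can also verify connectivity from the Katib Python SDK. Below is a minimal sketch, assuming the SDK is installed (pip install kubeflow-katib) and that KatibClient accepts a namespace argument and exposes a list_experiments() method:

import kubeflow.katib as katib

# Connect using the local kubeconfig and list Experiments in the kubeflow namespace.
# An empty list (and no exception) indicates the Katib control plane is reachable.
katib_client = katib.KatibClient(namespace="kubeflow")
print(katib_client.list_experiments())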

For Katib Experiment examples, check the complete examples list.

Quickstart

You can run your first hyperparameter tuning Experiment using the Katib Python SDK.

In the following example we are going to maximize a simple objective function: $F(a,b) = 4a - b^2$. The larger $a$ and the smaller $b$, the larger the value of $F$.
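With the search space used below ($a \in [10, 20]$, $b \in [0.1, 0.2]$), the optimum is at $a = 20$ and $b = 0.1$, giving $F = 4 \cdot 20 - 0.1^2 = 79.99$.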

import kubeflow.katib as katib

# Step 1. Create an objective function.
def objective(parameters):
    # Import required packages.
    import time
    time.sleep(5)
    # Calculate objective function.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    print(f"result={result}")

# Step 2. Create HyperParameter search space.
parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2)
}

# Step 3. Create Katib Experiment.
katib_client = katib.KatibClient()
name = "tune-experiment"
katib_client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    max_trial_count=12
)

# Step 4. Get the best HyperParameters.
print(katib_client.get_optimal_hyperparameters(name))
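
Note that get_optimal_hyperparameters() reflects the best Trial observed so far. To block until the Experiment finishes and then clean it up, a minimal follow-up sketch (assuming the SDK client provides wait_for_experiment_condition() and delete_experiment() helpers) looks like this:

# Step 5 (optional). Wait for the Experiment to succeed, re-read the best
# HyperParameters, then delete the Experiment from the cluster.
# These helper names are assumptions; check the SDK reference for your version.
katib_client.wait_for_experiment_condition(name=name)
print(katib_client.get_optimal_hyperparameters(name))
katib_client.delete_experiment(name=name)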

Documentation

Community

We are always growing our community and invite new users and AutoML enthusiasts to contribute to the Katib project. The following links provide information about getting involved in the community:

Contributing

Please feel free to test the system! The developer guide is a good starting point for our developers.

Blog posts

Events

Citation

If you use Katib in a scientific publication, we would appreciate citations to the following paper:

A Scalable and Cloud-Native Hyperparameter Tuning System, George et al., arXiv:2006.02085, 2020.

Bibtex entry:

@misc{george2020katib,
    title={A Scalable and Cloud-Native Hyperparameter Tuning System},
    author={Johnu George and Ce Gao and Richard Liu and Hou Gang Liu and Yuan Tang and Ramdoot Pydipaty and Amit Kumar Saha},
    year={2020},
    eprint={2006.02085},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}

Katib's People

Contributors

andreyvelich, anencore94, c-bata, deepermind, dependabot[bot], droctothorpe, elenzio9, eliaskoromilas, gaocegege, gyliu513, henrysecond1, hmtai, hougangliu, jlewi, johnugeorge, kimwnasptd, knkski, mrkm4ntr, nagar-ajay, richardsliu, sarahmaddox, seong7, sperlingxx, tenzen-y, terrytangyuan, toshiiw, vpavlin, yeya24, ytetra, yujioshima


Katib's Issues

draft suggestions for better user guide

I've walked through the getting started guide; here are my suggestions for a few improvements that would improve the overall user experience (fairly low-hanging fruit IMO):

  • Use minikube (or a local cluster). The getting started guide uses a custom cluster with GPUs, which is, in most cases, impractical for newcomers to set up when trying out Katib.
  • Do not use ingress. Similar to the above, we should lower the barrier to trying out Katib; setting up ingress, along with hostname resolution, takes time. Using a NodePort should suffice.
  • Use MNIST as the example to reduce waiting time (the existing guide creates a CIFAR-10 training job); a faster turnaround makes it easier to observe the outcome and experiment with Katib features.
  • Version the images and source code. Since the project is at an early stage, breaking changes are almost always possible. Errors in the getting started guide can scare off potential users and contributors, so let's stick to a particular version (and update it when new versions become available).
  • Finish the getting started guide by providing instructions on ModelDB visualization. After creating the Study using random-cpu.yml, Katib creates two jobs. However, after training, they both disappear and I'm not able to see the result either in Kubernetes or in ModelDB. This may be because I'm not familiar with the system, or because the guide is incomplete. In either case, I think we should provide guidance on what to do after creating a Study, to avoid confusion for people new to the system like me :)

/cc @gaocegege @YujiOshima
/area documentation

Release process for CLI

It looks like Katib includes a CLI. We'll need a release process for this and a way to distribute releases.

What are the initial platforms for which we want to build and release the CLI?

CreateStudy RPC error: Objective_Value_Name is required

When I try to run Createstudy as in the example, I get:

katib-cli -s 192.168.99.100:30678 -f examples/random.yml Createstudy
2018/04/24 06:43:11 connecting 192.168.99.100:30678
2018/04/24 06:43:11 study conf{  UNKNOWN_OPTIMIZATION 0 <nil> []    [] []  []  [] 0  <nil> }
2018/04/24 06:43:11 req Createstudy
2018/04/24 06:43:11 CreateStudy failed: rpc error: code = Unknown desc = Objective_Value_Name is required.

Is there a mismatch between the latest CLI release and the core image in source control? How does one build the CLI from source?

cli failed to connect

I was trying the getting started guide.
I can see the pod using kubectl.

But

katib-cli -s gke-test-katib-default-pool-b88188d2-jnvp:30678 Getstudies
2018/05/09 11:38:13 connecting gke-test-katib-default-pool-b88188d2-jnvp:30678
2018/05/09 11:38:14 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup gke-test-katib-default-pool-b88188d2-jnvp on 127.0.0.1:53: no such host"; Reconnecting to {gke-test-katib-default-pool-b88188d2-jnvp:30678 <nil>}
2018/05/09 11:38:14 GetStudy failed: rpc error: code = 14 desc = grpc: the connection is unavailable
11:38:14lunkai@None:katib$ katib-cli
2018/05/09 11:38:34 connecting 127.0.0.1:6789
2018/05/09 11:38:34 Method not found: 

Do we need to configure the CLI somehow?

[go] Use camel-case instead of underscores

don't use underscores in Go names; var study_id should be studyID

Some variables in the Go files use underscores, and golint complains about them. We should avoid underscores and use camelCase instead.

unknown method 'GetSuggestions' error while running MinikubeDemo

While following the instructions for the MinikubeDemo, I get the following error:

$ go run random/random-suggest-demo.go 
2018/06/01 15:37:07 Study ID m9510a9448938b00
2018/06/01 15:37:07 Study ID m9510a9448938b00 StudyConfname:"grid-demo" owner:"katib" optimization_type:MAXIMIZE optimization_goal:0.99 parameter_configs:<configs:<name:"--lr" parameter_type:DOUBLE feasible:<max:"0.07" min:"0.03" > > > default_suggestion_algorithm:"grid" default_early_stopping_algorithm:"medianstopping" objective_value_name:"Validation-accuracy" metrics:"accuracy" metrics:"Validation-accuracy" 
2018/06/01 15:37:07 GetSuggestion Error rpc error: code = Unimplemented desc = unknown method GetSuggestions
exit status 1

Any thoughts on what may be going wrong? Also, I saw a comment regarding drastic changes in the manager (#97); is this related?

[API][discussion] Create a CRD for study

We have had some discussions on Slack, and I have an idea: define a CRD and an operator to manage studies, so that we could eliminate the custom CLI (katib-cli) and reuse kubectl for the same tasks.

I opened this issue to keep track of the idea.

[Discussion] structural refactor to make the source more like 'idiomatic go'

Right now, we have the following components running (excluding ModelDB):

  • dlk-manager
  • vizier-core
  • vizier-db
  • vizier-suggestion-[algorithm]

These are the binaries I've found after briefly scanning the code structure:

.
├── cli
│   ├── Dockerfile
│   └── main.go
├── dlk
│   └── dlkmanager
│       └── dlkmanager.go
├── manager
│   └── main.go
├── suggestion
│   ├── grid
│   |   └── main.go
│   └── random
│       └── main.go
└── earlystopping
    └── medianstopping
        └── main.go

Personally, I think there are two issues we can fix to improve the code base:

  • component naming and source-code naming are inconsistent
  • the structure can be improved to make it consistent with Go projects in the wild :)

Here is a structure off the top of my head:

├── cmd
│   ├── cli
│   │   └── cli.go
│   └── dlkctl
│   │   └── dlkcli.go
│   └── dlkmanager
│   │   └── dlkmanager.go
│   └── viziercore
│   |   └── viziercore.go
│   └── earlystopping
│       └── medianstopping.go
├── pkg
│   └── apis
|       └── v1alpha1
|           └── api.proto
│   └── db
│   └── dlk
│   └── mock
│   └── vizier
│   └── suggestion
│   └── earlystopping
├── suggestion
|   └── random
|   └── grid
├── docs
├── manifests
│   └── conf
│   └── dlk
│   └── modeldb
│   └── vizier
├── hack
│   └── build.sh
│   └── deploy.sh
├── test
├── vendor

Note that since we mostly want to write the suggestion services in Python, they should have their own root directory. If we want to write the early-stopping services in other languages as well, we can also move them out to the top level.

@gaocegege @YujiOshima WDYT?

/improvement enhancement
/area suggestion

Error running katib on latest master (04/13)

After deploying Katib following the getting started guide, I saw the following errors:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY     STATUS             RESTARTS   AGE
katib         dlk-manager-698ccb5fdc-hb7xc              0/1       CrashLoopBackOff   6          13m
katib         modeldb-backend-6855d95fb4-2sxw9          1/1       Running            0          14m
katib         modeldb-db-6cf5bb764-5s65f                1/1       Running            0          14m
katib         modeldb-frontend-5868bffc64-rhrr7         1/1       Running            0          14m
katib         vizier-core-86c5566c88-kvsp9              0/1       CrashLoopBackOff   6          13m
katib         vizier-db-64557596dc-mpgh4                1/1       Running            0          13m
katib         vizier-suggestion-random-6b4d6db6-m8l94   0/1       CrashLoopBackOff   6          13m
kube-system   kube-dns-5c6c5b55b-qmd9l                  3/3       Running            0          16m

I managed to get it running; it turns out the command was not correct. For example, I had to change this:

    spec:
      serviceAccountName: vizier-core
      containers:
      - name: vizier-core
        image: katib/vizier-core
        args:
          - "-w"
          - "dlk"
        ports:
        - name: api
          containerPort: 6789

to

    spec:
      serviceAccountName: vizier-core
      containers:
      - name: vizier-core
        image: katib/vizier-core
        args:
          - ./vizier-manager    <-- add this line
          - "-w"
          - "dlk"
        ports:
        - name: api
          containerPort: 6789

However, based on the Dockerfile for vizier-core, vizier-manager is already set as the entrypoint:

FROM golang:alpine AS build-env
# The GOPATH in the image is /go.
ADD . /go/src/github.com/kubeflow/hp-tuning
WORKDIR /go/src/github.com/kubeflow/hp-tuning/manager
RUN go build -o vizier-manager

FROM alpine:3.7
WORKDIR /app
COPY --from=build-env /go/src/github.com/kubeflow/hp-tuning/manager/vizier-manager /app/
COPY --from=build-env /go/src/github.com/kubeflow/hp-tuning/manager/visualise /
ENTRYPOINT ["./vizier-manager"]
CMD ["-w", "dlk"]

Anything wrong with the above 👆 setup?

/cc @gaocegege @YujiOshima

[build-release] Reuse the vendor during the image building process

Currently we run go get xxx in the Dockerfile to fetch all dependencies for the build, which takes a long time and significantly increases the image size. We should instead build the binaries via a multi-stage build, or outside the image build process, to speed this up. This applies to:

  • Manager
  • Cli
  • Suggestion
  • Frontend
  • dlk

Study controller - Running HP Jobs without writing code

I opened PR #86 for the study controller.
Currently, users need to write some code to use Katib in every case.
I want users not to have to write any code for the typical use case.
The study controller implements the logic for how to call services, run workers, and save models.
This is a POC of the study controller: https://github.com/YujiOshima/hp-tuning/blob/51d456da58f2a77648290c175491e0692e0f3d4c/pkg/manager/studycontroller/defaultcontroller.go

In this PR, the default study controller requests all suggestions at the start.
That is not suitable for Bayesian Optimization or Hyperband, since they need to call GetSuggestions several times.
I implemented the study controller as a Go process, but it could be separated out as a service, like the suggestion and early-stopping services.

WDYT? @gaocegege @ddysher @libbyandhelen

Make low-level API for using katib flexibly

Discussed in #66.
These are the APIs I'm going to refactor and add.

Each API is listed as name (input) → process → output:

  • CreateStudy (StudyConfig) → save the Study config to the DB and create a StudyID → StudyID, error
  • GetSuggestions (StudyID, SuggestionAlgorithmName, RequestNum) → create Trials from the Suggestion → []TrialID, error
  • RunTrials (StudyID, []TrialID, Worker) → request the worker to run the Trials and set Trial status to running → error
  • StopTrials (StudyID, []TrialID, IsComplete) → stop the Trial workers and set Trial status to Complete → []TrialID, error
  • ShouldStopTrial (StudyID, EarlyStopAlgorithm) → get the Trials that should stop → []TrialID, error
  • SetSuggestionParameter (StudyID, SuggestionAlgorithmName, AlgorithmParam) → set parameters → error
  • SetEarlyStoppingParameter (StudyID, EarlyStoppingAlgorithmName, EarlyStoppingParam) → set parameters → error
  • GetMetrics ([]TrialID) → get metrics of the Trials → []Metrics, error
  • SaveStudy (StudyID) → save the Study info to ModelDB → error
  • SaveModels (StudyID, []TrialID, []Metrics) → save the Trial and metrics info to ModelDB → error

Typical usage looks like this:

	studyId, _ := grpc.CreateStudy(studyConfig)
	grpc.SetSuggestionParameter(studyId, "random", suggestParam)
	grpc.SetEarlyStoppingParameter(studyId, "medianstopping", earlystopParam)
	grpc.SaveStudy(studyId)
	for !IsStudyCompleted() {
		trials, _ := grpc.GetSuggestions(studyId, "random", 10)
		grpc.RunTrials(studyId, trials)
		for {
			metrics, workerState, _ := grpc.GetMetrics(studyId, trials)
			if AllWorkerCompleted(workerState) {
				grpc.CompleteTrial(studyId, trials, true)
				grpc.SaveModels(studyId, trials, metrics)
				break
			}
			shouldStops := grpc.ShouldStopTrial(studyId, trials)
			grpc.CompleteTrial(studyId, shouldStops, false)
			deleteShouldStopsFromTrialList(trials, shouldStops)
		}
	}

WDYT? @ddysher @gaocegege @libbyandhelen

Segfault when saving study after completion

I have not looked into why this might be occurring, but I came across this segfault error trying to run examples/random-cpu.yml and examples/mnist-cpu.yml.

katib checkout at: ecb27de
katib-cli: katib-cli-darwin-amd64 v0.1.1-alpha

Logs:

2018/05/25 18:40:03 Worker: dlk
2018/05/25 18:41:33 Study mnist is already exist (Project ID 1)
2018/05/25 18:41:33 Study w06b53328cefce8c start.
2018/05/25 18:41:33 Study conf name:"mnist" owner:"root" optimization_type:MAXIMIZE parameter_configs:<configs:<name:"--lr" parameter_type:DOUBLE feasible:<max:"0.07" min:"0.03" > > configs:<name:"--num-epochs" parameter_type:INT feasible:<max:"20" min:"10" > > > suggest_algorithm:"random" suggestion_parameters:<name:"SuggestionNum" value:"2" > suggestion_parameters:<name:"MaxParallel" value:"2" > objective_value_name:"Validation-accuracy" metrics:"accuracy" metrics:"Validation-accuracy" image:"mxnet/python" command:"python" command:"/mxnet/example/image-classification/train_mnist.py" scheduler:"default-scheduler"
2018/05/25 18:41:34 Created Lt s7a3421ec7f226d8.
2018/05/25 18:41:34 Created Lt r2ab0e99a5c1588c.
2018/05/25 18:42:10 Trial r2ab0e99a5c1588c is completed.
2018/05/25 18:42:10 Objective Value: 0.970442
panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x10621a9]

goroutine 70 [running]:
main.(*server).saveCompletedModels(0xc4201b28a0, 0xc420346870, 0x10, 0xc4202e06e0, 0xc42018c9e0, 0x0)
        /go/src/github.com/kubeflow/katib/cmd/manager/main.go:91 +0x699
main.(*server).trialIteration(0xc4201b28a0, 0xc4202e06e0, 0xc420346870, 0x10, 0xc4203b6000, 0xc4203b6060, 0x0, 0x0)
        /go/src/github.com/kubeflow/katib/cmd/manager/main.go:158 +0xf0a
created by main.(*server).CreateStudy
        /go/src/github.com/kubeflow/katib/cmd/manager/main.go:236 +0x5b9

[release] Ksonnet the katib

Please try to use ksonnet. We should try to be consistent about how we package things. We want to have a single registry for all our packages and deploy them in a consistent fashion.

[feature] Support NAS

We support some parameter search algorithms, such as random search and grid search. Personally, I think it would be better if we could also support neural architecture search on top of our system. I am not sure whether it is feasible.

@YujiOshima We had some discussion on Slack, and I am glad to move it here.

/cc @weiweijiuzaizhe

[manager & worker] Migrate dlk into worker interface

#46 (comment)

How about migrating it to other worker interfaces and refining the roles of the worker interfaces as below?

  • TensorFlow, PyTorch, etc. operator worker interfaces: support distributed training tasks.
  • Kubernetes worker interface: for frameworks not supported by a Kubeflow operator; manages single-machine tasks only.

Rename to hyperparameter-tuning ?

Would it make sense to rename this repo to hyperparameter-tuning instead of hp-tuning? hp-tuning is less informative as a name :)

distributed suggestion service

The suggestion service is inherently stateful; we want to make sure a study is processed by a single suggestion service instance. For example, in the current setup, if we run two replicas of the grid service, they will both receive requests and essentially suggest the same parameters twice.

The simplest solution would be to change the service session affinity to ClientIP in Kubernetes. However, in the long run, we need proper handling in the suggestion services themselves as well, to cover potential failure cases.
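
For illustration, the stopgap could be applied with the Kubernetes Python client. This is a hedged sketch that assumes the suggestion Service is named vizier-suggestion-grid in the katib namespace, as in the getting-started deployment:

from kubernetes import client, config

# Patch the suggestion Service so requests from the same client IP
# keep hitting the same backing pod (the stopgap described above).
config.load_kube_config()
core_v1 = client.CoreV1Api()
core_v1.patch_namespaced_service(
    name="vizier-suggestion-grid",
    namespace="katib",
    body={"spec": {"sessionAffinity": "ClientIP"}},
)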

/cc @YujiOshima @gaocegege

Release process for K8s components

We need a release process for the components that run on the K8s cluster and the associated ksonnet packages.

There is a separate issue (#78) for the CLI.

Release v0.1.2-alpha

The Katib API has changed drastically since v0.1.1-alpha.
Though we can't release v0.2 yet, we need to release v0.1.2-alpha, update the Docker images, and update the docs.
cc @gaocegege

[copyright] Add copyright headers in all files

// Copyright 2018 The Kubeflow Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Upload existing models to modelDB interface

ModelDB provides a nice UI.
It would be nice to be able to upload models that did not come from Katib-managed trials.
This would add a model-upload API to the Katib manager and corresponding commands to katib-cli.
My idea for the CLI is something like this:

katib-cli upload --parameter "{'paramA':20,'paramB':'cat'}" --metrics "{'accuracy':0.9,'recall':0.7}" --pvc nfs --path logs/model1 

--parameter and --metrics are optional, but you can sort by these values in the UI.
If the PVC and path are set correctly, Katib creates a TensorBoard link.

cc @jlewi

[suggestion] Implement more algorithms

Katib has an extensible architecture and three search algorithms, thanks to @YujiOshima:

  • vizier-suggestion-random
  • vizier-suggestion-grid
  • vizier-suggestion-hyperband

We could implement more algorithms on top of this architecture, which would help us support more scenarios.

ref https://github.com/tobegit3hub/advisor#algorithms

  • Random Search Algorithm
  • 2x Random Search Algorithm
  • Grid Search Algorithm
  • Bayesian Optimization
  • Gaussian Process Bandit
  • Batched Gaussian Process Bandits
  • SMAC Algorithm
  • CMA-ES Algorithm
  • No Early Stop Algorithm
  • Early Stop First Trial Algorithm
  • Early Stop Descending Algorithm
  • Performance Curve Stop Algorithm
  • Median Stop Algorithm
  • Latin hypercube sample (LHS)

/cc @ddutta

[go] Establish vendor dependencies for go

I am trying to build the manager and other binaries from source, but it seems that we do not vendor all dependencies. I think we should use dep or glide to establish vendored dependencies.

Reduce time it takes to build all images

It takes a long time to build all the images. There are several ways to speed up the E2E tests.

The easiest way, with the least benefit, would be to add a .gcloudignore file like in the other repos. This prevents uploading the entire vendor directory every time you submit a Docker build to gcloud.

The second would be to run the container builds in parallel using Argo rather than sequentially.

Any other ideas?

[central ui] Need a link to Katib UI

central-ui should provide a link to Katib.

This is blocked on a ksonnet component for Katib (#32).

Since Katib isn't always deployed, it would be nice if the central UI could convey whether or not Katib is running (e.g., greying out the link if Katib isn't available).

/priority p1
/cc @swiftdiaries
