experiments' Introduction

DISCONTINUATION OF PROJECT

This project will no longer be maintained by Intel. Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project. Intel no longer accepts patches to this project.

Experiments API

Training a deep neural network requires finding a good combination of model hyperparameters. The process of finding good values for each is called hyperparameter optimization. The number of jobs required for such an experiment typically ranges from a handful to hundreds.

Individual workflows for optimization vary, but this is typically an ad-hoc manual process involving custom job submission scripts or even pen and paper.

This project provides an API to support machine learning experiments on Kubernetes. It does this by moving the experiment context into a shared API and standardizing experiment job metadata, which promotes sharing results and developing tools. Decoupling parameter space search from job execution further promotes re-use. A Python client library eases job integration with the experiment tracking system.
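Putting the pieces together, the snippets in the issues further down this page suggest roughly the following flow. This is a minimal sketch based on those snippets only; the Client constructor, the import path for Experiment, and the example hyperparameters are assumptions, not a documented API.

from lib.exp import Client, Experiment  # Experiment's import path is an assumption

# A minimal Kubernetes batch/v1 Job spec; the image name is a placeholder.
job_spec = {
    'template': {
        'spec': {
            'containers': [{'name': 'train', 'image': 'example/train:latest'}],
            'restartPolicy': 'Never',
        }
    }
}

client = Client()  # constructor arguments, if any, are assumptions

# An experiment names a hyperparameter space and carries the job template
# used to launch one Kubernetes Job per sample of that space.
experiment = Experiment(name='mnist-tuning', job_template=job_spec)
experiment = client.create_experiment(experiment)  # keep the returned object

# One job per hyperparameter sample.
job = client.create_job(experiment, {'learning_rate': 0.01, 'batch_size': 64})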

[Overview figure]

Prerequisites

  • git
  • make
  • python (v3 is the only tested variant at the moment)
  • kubectl and a connected cluster (minikube or a full cluster)

Installation

To install the most recent release, run the following:

$ pip install experiments

Development

To check out and install the latest development release, run:

$ git clone https://github.com/IntelAI/experiments.git
$ cd experiments
$ pip install .

To test the Experiments API, run the following:

$ pip install -r requirements-dev.txt
$ make test

Appendix

Concepts

Experiment: Describes a hyperparameter space and how to launch a job for a sample in that space. Has a unique name.

Optimizer: A program that reads an experiment and creates jobs with different hyperparameter settings. This can be done all in one shot, or the optimizer could be a long-running coordinator that monitors the performance of various samples to direct the hyperparameter optimization process. This program is supplied by the user.

Result: Encodes metadata about a single job run for an experiment. For example, a handful of high-level metrics per training epoch and a pointer to an output directory on shared storage. There is one result resource per job. Each result has the same name as the job it represents.
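As an illustration of the Optimizer concept, a one-shot random-search loop might look like the following. This is a sketch only: it reuses the client and experiment from the sketch in the introduction, and the hyperparameter names and ranges are made up.

import random

# Illustrative one-shot optimizer: sample the hyperparameter space and launch
# one job per sample. `client` and `experiment` are as created in the earlier
# sketch; a long-running optimizer would also monitor Results to decide what
# to sample next.
for _ in range(20):
    parameters = {
        'learning_rate': 10 ** random.uniform(-4, -1),
        'batch_size': random.choice([32, 64, 128]),
    }
    client.create_job(experiment, parameters)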

experiments' People

Contributors

connordoyle, dmsuehir, hanlint, jose5918, nqn, scttl, sfblackl-intel


experiments' Issues

Allow nested parameters

For example:

client.create_job(experiment, {
    'master': {
        'noise_eps': 0.5,
    },
    'sub': {
        'noise_eps': 0.1,
    },
})

These could be serialized as MASTER__NOISE_EPS, with a double underscore (__) separating levels.
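A small helper along the lines of this proposal could flatten nested parameters into such names. This is a sketch of the suggestion above, not current behaviour; the separator and upper-casing convention are exactly the proposal, nothing more.

# Sketch of the proposed serialization:
# {'master': {'noise_eps': 0.5}} -> {'MASTER__NOISE_EPS': '0.5'}
def flatten_parameters(params, prefix=''):
    flat = {}
    for key, value in params.items():
        name = prefix + '__' + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_parameters(value, name))
        else:
            flat[name.upper()] = str(value)
    return flat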

client.get_or_create_experiment(experiment_name, ...)

This might not be the best interface/name, but it would be nice to have a simple method that checks whether an experiment already exists and adds to it if so, otherwise creating a new one. It looks like the only way to do this now is with a try/except, or by enumerating all experiments and checking the names.

This might also just be an option on create_experiment instead: create_experiment(..., return_existing=True).
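In the meantime, a small wrapper can emulate this. A sketch only: the exception type raised when the experiment already exists and the list_experiments() call are assumptions about the client, not a documented API.

def get_or_create_experiment(client, experiment):
    # Try to create first; on failure, fall back to scanning existing
    # experiments by name. The broad except is deliberate for the sketch;
    # a real helper should catch the specific conflict error.
    try:
        return client.create_experiment(experiment)
    except Exception:
        for existing in client.list_experiments():
            if existing.name == experiment.name:
                return existing
        raise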

ensure python2 support in addition to python3

To support the broadest potential user base, we should ensure that the API installs and runs under Python 2, not just Python 3.

This will require at least the following:

  • building universal wheels for distribution
  • regular testing under python2 in addition to python3
  • removing hard-coded python3 / pip3 references in Makefile

unintuitive create_experiment api

Doesn't work:

    experiment = exp.Experiment(name=name, job_template=job['spec'])
    client.create_experiment(experiment)
    job = client.create_job(experiment, parameters)

Does:

    experiment = exp.Experiment(name=name, job_template=job['spec'])
    experiment = client.create_experiment(experiment)
    job = client.create_job(experiment, parameters)

The error message received in the first case doesn't make it obvious what the problem is:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Job.batch \"hrl-4148c399\" is invalid: metadata.ownerReferences.uid: Invalid value: \"\": uid must not be empty","reason":"Invalid","details":{"name":"hrl-4148c399","group":"batch","kind":"Job","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"\": uid must not be empty","field":"metadata.ownerReferences.uid"}]},"code":422}

Parameter type mismatch with job.py and Experiment.result()

I'm trying to run job.py and I'm getting the error below. Debugging it, it looks like job.py is passing a string job_name as the parameter to Experiment.result(), but that method expects some other type of object.

Starting job dina-exp-e9787b1d-7894-490a-ae0e-2920c5a32518-00174be8 for experiment dina-exp-e9787b1d-7894-490a-ae0e-2920c5a32518
Traceback (most recent call last):
  File "./job.py", line 48, in <module>
    main()
  File "./job.py", line 23, in main
    result = c.create_result(exp.result(job_name))
  File "/experiments/lib/exp.py", line 289, in result
    if 'job_parameters' in job.metadata.annotations:
AttributeError: 'str' object has no attribute 'metadata'
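A possible workaround (a sketch only; the namespace handling and the idea that a Kubernetes Job object satisfies Experiment.result() are assumptions based on the traceback above) is to fetch the Job object by name and pass that instead of the bare string:

# Sketch: Experiment.result() appears to expect an object with
# .metadata.annotations, so fetch the Job object first. `job_name`, `c`, and
# `exp` are as in job.py above; the namespace value is a placeholder.
from kubernetes import client as k8s, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
job_obj = k8s.BatchV1Api().read_namespaced_job(name=job_name, namespace='default')
result = c.create_result(exp.result(job_obj))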

Parameters stored in the Experiments CR are not used

We currently have two places for parameters to be stored:

class Experiment(object):
    def __init__(self,
                 name,
                 job_template,
                 parameters=None,
                 status=None,
                 meta=None):

and

    def create_job(self, experiment, parameters):

This is probably a bug and should be fixed so that either:

  1. the parameters can only be provided via create_job(), or
  2. the parameters can only be provided in the Experiment constructor, and create_job() will read them from there.

not pip installable from pypi

This package should be pip installable from PyPI. Installing private GitHub repos in a Dockerfile is especially annoying.

Abstraction and help to interpret job failure (OOM, exit code, etc)

As an optimizer author, I would like some abstraction to help me interpret the status of jobs and, for failures, a higher-level indication of why the job failed (say, OOM, image failure, or exit code), as this determines the strategy for continuing to schedule jobs. In the case of an image failure, the image will most likely be wrong for all subsequent jobs and the optimizer should exit. For OOM, a certain parameter subspace may be infeasible to run and should be avoided. A non-zero exit code may indicate faulty code and should be reported to the user as well.
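As a rough illustration of what such an abstraction could look like, here is a sketch using the Kubernetes Python client. The categories and classification rules are examples for discussion, not an existing API in this project.

# Sketch of failure classification for an optimizer; categories are illustrative.
from kubernetes import client as k8s, config

def classify_pod_failure(pod):
    for status in (pod.status.container_statuses or []):
        waiting = status.state.waiting
        terminated = status.state.terminated
        if waiting and waiting.reason in ('ImagePullBackOff', 'ErrImagePull'):
            return 'image_failure'   # likely wrong for all jobs: stop optimizing
        if terminated and terminated.reason == 'OOMKilled':
            return 'oom'             # avoid this parameter subspace
        if terminated and terminated.exit_code not in (None, 0):
            return 'exit_code'       # probably faulty code: report to the user
    return 'unknown'

config.load_kube_config()
pods = k8s.CoreV1Api().list_namespaced_pod('default').items  # namespace is a placeholder
print([classify_pod_failure(p) for p in pods])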

Error running examples.

On running the sample code with make test, execution is successful and the following is the output:

mkdir -p test-reports
nosetests -v --debug=test --with-xunit --xunit-file=test-reports/nosetests.xml
tests.libexp_test.test_client_list_experiments ... ok
tests.libexp_test.test_client_list_results ... ok
tests.libexp_test.test_ux_flow ... ok


XML: /home/nidhi/work/experiments/test-reports/nosetests.xml

Ran 3 tests in 5.525s

OK

whereas on checking the status of the pods, I get the following error:

NAME READY STATUS RESTARTS AGE
test-1cb66098-68vd4 0/1 ImagePullBackOff 0 35s
test-b745089f-jnp4z 0/1 ImagePullBackOff 0 35s

kubectl -n nscc45c3c3-5233-4556-8587-c5bac9ade98c logs test-1cb66098-68vd4
Error from server (BadRequest): container "train" in pod "test-1cb66098-68vd4" is waiting to start: trying and failing to pull image

Make python builds deterministic

Version numbers are not specified for dependencies in requirements.txt, which prevents repeatable builds. At a bare minimum we should pin these, though it may instead be worth switching to pipenv, which is now the officially recommended practice.
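For example, a pinned requirements.txt would look like the following; the package names and versions here are purely illustrative, not the project's actual dependency set.

# requirements.txt with exact pins (illustrative only)
kubernetes==9.0.0
PyYAML==5.1
nose==1.3.7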

keep job_name and output log directory names consistent

I'm not sure of the best way to do this, but it would be nice to be able to easily keep the name of the job and the name of the directory it writes logs to consistent, ideally in some standard directory structure like .../experiment_name/job_name. I have a solution now, but it requires putting more logic in the script itself, and it would be nice not to replicate this in every script. Perhaps this issue belongs somewhere else.
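One low-tech convention is sketched below, assuming the experiment and job names are exposed to the training script; the EXPERIMENT_NAME and JOB_NAME environment variables are assumptions about what the job template could inject, not something the API sets today.

# Sketch: derive the log directory from the experiment and job names so logs
# land under <shared storage>/<experiment_name>/<job_name>.
import os

output_dir = os.path.join(
    '/mnt/shared',                      # shared storage root (example path)
    os.environ['EXPERIMENT_NAME'],      # assumed to be set by the job template
    os.environ['JOB_NAME'],             # assumed to be set by the job template
)
os.makedirs(output_dir, exist_ok=True)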

constraints on experiment names

metadata.name: Invalid value: \"hierarchical_goals\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')"

Is there a reason underscores aren't permitted? I can see not wanting spaces, to simplify certain space-delimited interfaces, but this still seems unnecessarily restrictive.
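The restriction comes from Kubernetes itself, since experiments are stored as Kubernetes objects and metadata.name must be a DNS-1123 subdomain, as the error above shows. Until friendlier names are supported, a small sanitizer can map them onto valid names; a sketch:

# Sketch: convert an arbitrary name into a DNS-1123 subdomain (lowercase
# alphanumerics, '-' and '.', starting and ending alphanumeric).
import re

def sanitize_name(name):
    name = name.lower().replace('_', '-')
    name = re.sub(r'[^a-z0-9.-]', '-', name)
    return name.strip('-.')

print(sanitize_name('hierarchical_goals'))  # -> 'hierarchical-goals'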

Results should include the hyperparameters

Currently, a user needs to keep the Kubernetes Job objects around to tell which runs (in terms of hyperparameters) produced which Results.

If we store the hyperparameters within the Result object, we can avoid keeping the Job objects around, and make it easier to clean up after a partially run optimizer (e.g. early stopping, or abandoning a path with repeated resource exhaustion or other failures).
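As an illustration of the proposal: the 'job_parameters' annotation name appears in the traceback earlier on this page, but how Result objects expose their metadata is an assumption, so this is only a sketch of the idea.

# Sketch: copy the job's parameters onto the result so the Job object no
# longer needs to be kept around. The result.meta shape is an assumption.
def result_with_parameters(job, result):
    result.meta = dict(result.meta or {},
                       job_parameters=job.metadata.annotations.get('job_parameters'))
    return result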

re-org code to experiments namespace

Typical Python package practice is to import relative to some top-level namespace and to put all the modules under a subdirectory of the same name. Right now this project has one module named lib that is installed directly; however, this name isn't exactly informative, and if we later add other modules to the package they will end up scattered across different directories.

Instead we should put all modules under a new experiments subdirectory. Note that this change will break existing imports, e.g. from lib.exp import Client needs to become from experiments.lib.exp import Client.
