dessa-oss / atlas

An Open Source, Self-Hosted Platform For Applied Deep Learning Development

Home Page: http://www.docs.atlas.dessa.com

License: Apache License 2.0

Shell 2.10% Python 70.41% Dockerfile 0.25% Batchfile 0.01% JavaScript 18.65% HTML 0.25% CSS 8.31% Makefile 0.03%
machine-learning data-science artificial-intelligence python deep-learning gpu ai ml model-management

atlas's Introduction

Build Statuses:

Master Build

User Acceptance Tests Pipeline

Build Artifacts Pipeline

Create Installer Pipeline

Platform Support Python Support Downloads


Atlas: Self-Hosted Machine Learning Platform

Atlas is a flexible machine learning platform consisting of a Python SDK, CLI, GUI, and scheduler that helps machine learning engineering teams dramatically reduce model development time and the effort of managing infrastructure.

Development Status

Atlas has evolved very rapidly and has gone through many iterations in Dessa's history.

The latest version is in BETA.

Features

Here are a few of the high-level features:

  1. Self-hosted: run Atlas on a single node (e.g. your laptop) or a multi-node cluster (e.g. on-premise servers or cloud clusters on AWS, GCP, etc.).
  2. Job scheduling: collaborate with your team by scheduling and running concurrent ML jobs remotely on your cluster, fully utilizing your system resources.
  3. Flexibility: multiple GPU jobs? CPU jobs? Need to use custom libraries or Docker images? No problem: Atlas tries to be unopinionated where possible, so you can run jobs how you like.
  4. Experiment management & tracking: tag experiments and easily track hyperparameters, metrics, and artifacts such as images, GIFs, and audio clips in a web-based GUI to track the performance of your models.
  5. Reproducibility: every job run is recorded and tracked with a job ID so you can reproduce and share any experiment.
  6. Easy-to-use SDK: Atlas's SDK lets you run jobs programmatically, making it easy to do things like multiple hyperparameter optimization runs.
  7. Built-in TensorBoard integration: we ❤️ TensorFlow; compare multiple TensorBoard-compatible job runs directly through the Atlas GUI.
  8. Works well with others: run any Python code with any framework.

Users guide

Installation

Documentation

Official documentation for Atlas can be found at https://www.docs.atlas.dessa.com/

All docs are hosted on Read the Docs and track the docs folder; please open a pull request here to make changes.

Community

If you have questions that are not addressed in the documentation, there are several ways to ask:

We will do our best to help!

Development Guide

Contributing

We ❤️ contributors and would love to work with you.

Atlas is currently open to external contributors.

Follow this guide:

  • Found a Bug?
    • Search through the issue list to make sure an issue doesn't exist already.
    • File an issue with the following:
      • Label as bug
      • Steps to reproduce
      • System used
    • Got a fix?
      • Tag the issue you are fixing
      • Open a Pull Request.
  • Requesting a feature?
    • Search through the issue list to make sure an issue doesn't exist already.
    • File an issue with the following:
      • Label as feature-request
      • Why is this important to you?
      • How will this impact a data scientists workflow?
      • Add any relevant mockups (if it is user facing)
    • Want to work on it?
      • Open up a Pull Request!
  • First-time contributor to OSS?
    • Look for issues with the good first issue label and get help from the community Slack if you need it.

Development Setup

When you are ready, just follow the steps below in order to set up a development environment.

  1. You will need to have docker, yarn, and the envsubst command-line tool on your machine in order to spin up a local development environment.
    macOS:
    brew install docker
    brew install yarn
    brew install gettext
    Ubuntu:
    apt install docker
    apt install docker-compose
    apt install yarn
    apt install gettext

For other Linux machines, replace apt install with the equivalent command for your distribution's package manager.

  2. Clone this repository and enter the new directory.

  3. Create and activate a brand-new virtual environment with Python 3.7, then install the requirements. Some examples below.

    • Using Conda:
      • conda create --name foundations python=3.7 && conda activate foundations
      • pip install -r requirements_dev.txt
    • Pipenv:
      • pipenv --python 3.7 && pipenv shell
      • pipenv install
      • pipenv install -r requirements_dev.txt --dev --pre --skip-lock
    • Venv:
      • python3 -m venv . && source bin/activate
      • pip install -r requirements_dev.txt
  4. Add the packages that make up Atlas to your Python path and set some environment variables by sourcing the activate_dev_env.sh file.
    . ./activate_dev_env.sh

  5. Launch Atlas in development mode. This may take a while to pull some required Docker images.

    • make devenv-start
  6. You can now create a sample project by running the following command.

    • python -m foundations init my-project
  7. Change into the newly created project directory and execute the following command to submit your first job. This can take a while the first time as one or more images may need to be pulled.

    • python -m foundations submit scheduler . main.py
  8. Navigate to localhost:3000 and verify that your newly created project exists on the frontend. Click on the project and verify that your job executed successfully.

  9. Congrats! You are ready to go.
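The logging half of this workflow lives in your project's main.py. The sketch below is a minimal illustration using the log_params and log_metric SDK calls that appear elsewhere in this README; the stub fallback exists only so the sketch runs outside an Atlas environment, and all values are illustrative.

```python
# Minimal sketch of an Atlas job script (main.py). log_params / log_metric
# are the SDK calls shown elsewhere in this README; the stub fallback lets
# the sketch run outside an Atlas environment, and all values are made up.
try:
    import foundations
except ImportError:
    class _Stub:
        def log_params(self, params):
            print("params:", params)
        def log_metric(self, name, value):
            print(name + ":", value)
    foundations = _Stub()

params = {"learning_rate": 0.01, "epochs": 5}  # illustrative hyperparameters
foundations.log_params(params)                 # tracked in the Atlas GUI

accuracy = 0.9                                 # stand-in for a real metric
foundations.log_metric("accuracy", accuracy)
```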

Running tests locally

In order to run tests, simply run:

  • make unit-tests
  • make integration-tests

Systems Overview

Last updated: March 2020

The following diagram shows a high-level overview of how the Atlas system works. Note that Atlas's codebase evolves faster than this diagram, which may not be kept up to date in real time; it is still a good source for a general understanding of the system.

“Atlas Server” is the term that we use to describe all of the services that allow Atlas to do its magic.

These services are as follows; note that some Atlas services live in other repos:

Let’s dive into each service with an explanation of their role within Atlas.

Scheduler

This is a custom-built Python scheduler that launches Docker-based workers. It uses APScheduler to keep track of and run jobs that are queued in the system. Users can interact with the scheduler through a Flask-based RESTful API to submit and interact with jobs. Jobs are submitted in the form of a “job spec”, which is simply a Python dictionary that describes the makeup of the job that will run.

The scheduler can run in both GPU and non-GPU mode. In GPU mode it keeps track of available devices and, given a job that requests a number of GPUs, allocates jobs according to the available resources.
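For illustration only, a job spec might look like the dictionary below; every field name here is an assumption made for the sake of the sketch, not the scheduler's actual schema.

```python
# Hypothetical job spec: a plain Python dictionary describing the job to run.
# Every field name below is an assumption for illustration; consult the
# local-docker-scheduler repository for the real schema.
job_spec = {
    "job_id": "a1b2c3",               # ID used for tracking and reproducing
    "image": "custom/image:latest",   # Docker image the worker should run
    "command": ["python", "main.py"], # what to execute inside the worker
    "resources": {"num_gpus": 1},     # allocated from the available devices
}
```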

Contribute to the Scheduler repository here.

Worker

This is what the scheduler uses to run any submitted jobs. The default Docker image used has a few libraries that are common in the machine learning/deep learning toolkit. However, users can specify a custom Docker image to use.

Atlas GUI

The web application for Atlas is a React based service that displays and interacts with information provided by the REST API.

Atlas REST API

This is a Flask-based RESTful API that allows for interaction with Atlas-specific information about jobs and projects. This includes information logged during the running of an Atlas job, notes on a project, and the markdown description of a project.

Archive Server

We use a basic HTTP server to host the files and directories that are archived during an Atlas job. The information and path of each file is stored in the Tracker to be served when needed.

Tracker

The tracker is the database where any saved information is stored. This includes information logged during the running of an Atlas job, notes on a project, and the markdown description of a project.

Authentication Proxy

This is a very simplistic service that we route all API calls through. If the proxy is set to “null” (via the “-n” flag), all calls will pass through without any verification, otherwise the proxy will check for the validity of the supplied token before rerouting the call. The token is generated and checked against the Authentication Server.
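The proxy's decision logic can be sketched as follows; handle_request and validate_token are illustrative stand-ins, not the service's real interface, with validate_token representing the round-trip to the authentication server.

```python
# Simplified sketch of the proxy's decision logic. handle_request and
# validate_token are illustrative stand-ins; validate_token represents the
# round-trip to the authentication server (Keycloak).
def handle_request(token, null_mode, validate_token):
    if null_mode:                          # "-n" flag: no verification at all
        return "forwarded"
    if token is not None and validate_token(token):
        return "forwarded"                 # valid token: reroute the call
    return "rejected"

always_invalid = lambda token: False
always_valid = lambda token: True
```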

Contribute to the authentication proxy repository here.

Authentication Server

For our authentication system, we use the open source tool Keycloak. This gives us an off-the-shelf setup for managing and validating accounts.

TensorBoard Server

This is the default Tensorflow image that provides the TensorBoard application. Most of the magic then happens in the TensorBoard REST API.

TensorBoard REST API

This system links files saved in the archive server that can be presented within TensorBoard and the directory that TensorBoard is using as a log directory. It makes these symlinks through a Flask-based RESTful API.
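The symlinking step can be sketched in a few lines of Python; link_into_logdir is a hypothetical helper name, and the real service exposes this through its Flask-based API rather than a local function.

```python
import os
import tempfile

def link_into_logdir(archive_path, logdir, job_id):
    """Symlink a job's archived TensorBoard files into the shared log
    directory. This is a hypothetical helper sketching what the service
    does; the real implementation sits behind a Flask-based RESTful API."""
    target = os.path.join(logdir, job_id)
    if not os.path.islink(target):
        os.symlink(archive_path, target)
    return target

# Demonstration with temp directories standing in for the archive server
archive = tempfile.mkdtemp()
logdir = tempfile.mkdtemp()
link = link_into_logdir(archive, logdir, "job-123")
```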

License

Copyright 2015-2020 Square, Inc.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

© 2020 Square, Inc. ATLAS, DESSA, the Dessa Logo, and others are trademarks of Square, Inc. All third party names and trademarks are properties of their respective owners and are used for identification purposes only.

atlas's People

Contributors

amackillop, choi-calvin, ekhl, kyledef, nicknagi, quettabit

atlas's Issues

[Incident] Docker containers IP addresses conflict with VPN

Description

I have a problem with the atlas-ce-* docker containers. These containers use an IP range (172.20.*) that conflicts with my work VPN addresses.

Error

With these containers running, I can't connect from home over VPN to the machine where Atlas is installed.

Steps to reproduce

Has this issue been seen before?

Cause of issue

The problem is due to a network interface created by atlas-ce: br-xxxxxx that uses the same IP range as my work VPN.

Link to BUG

Solution

Configure atlas-ce docker containers to use a different IP range.

Update log_params to use message_router

What's holding us back, and how?
Currently, log_params writes directly to redis.

Optional: what are some options for remediating it?
To standardize how foundations talks to redis, we want log_params to use a producer-consumer model.
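The producer-consumer idea can be sketched with an in-memory queue standing in for redis and the message router; all names below are illustrative, not the project's actual interfaces.

```python
from queue import Queue

# In-memory sketch of the producer-consumer idea: log_params publishes a
# message instead of writing to redis directly, and a consumer (standing in
# for the message router) applies it to the store. All names are illustrative.
message_queue = Queue()
store = {}                        # stand-in for redis

def log_params(job_id, params):
    """Producer: only publishes a message, never touches the store."""
    message_queue.put({"job_id": job_id, "params": params})

def consume_one():
    """Consumer: pops one message and applies it to the store."""
    msg = message_queue.get()
    store.setdefault(msg["job_id"], {}).update(msg["params"])

log_params("job-1", {"lr": 0.01})
consume_one()
```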

[Architecture] Remove the need for a plugin-based scheduler

We currently support a plugin-based architecture to be able to use both the local-docker-scheduler and the old K8s scheduler. We should remove this and migrate to the local-docker-scheduler going forward.

Kubernetes support can be built into the local-docker-scheduler.

CC: @shazraz

[BUG] Wrong behaviour when refreshing table while a column is sorted

Describe the bug
If you click Refresh Table while sorted on a column, the "sorted" indicator (up/down arrows) on the header still shows that the column is sorted, but the jobs are no longer sorted.

To Reproduce

  1. Go to a project in the UI.
  2. Sort on a column.
  3. Click Refresh Table.

Expected behavior
There are two ways to fix this: either have the "sorted" buttons all revert to their default state when Refresh Table is clicked, or have clicking Refresh Table actually retrieve the appropriately sorted jobs.

Sort project page to show most recent jobs first

Is your feature request related to a problem? Please describe.
When you visit the dashboard, you are most likely to look for the job you just submitted.

Similarly, all queued jobs should appear at the top as well.

Describe the solution you'd like
Sort jobs to most recently submitted by default.

[Data Model] Do a full analysis of all our redis usage

@hyun20005

Some shallow investigation into our redis usage shows that there may be some keys that are being stored but never being used.

We should first do a full analysis of our redis usage across all our projects. Useful information to gather would be the content, the type, where in the project it's set, and where in the project it's retrieved, for every single redis key we use.

[Data Model] Decide on a Redis key naming convention

Currently, there is no enforcement of a proper Redis key naming convention. We loosely follow a colon-delimited pattern, but there are issues inherent to this approach as well.

We should have a team-wide discussion agreeing upon a Redis key naming convention, and possibly migrate our system to use the new convention.

@hyun20005

[BUG] Cannot delete true local jobs in the UI

Describe the bug
Defining a true local job as a job that doesn't touch the scheduler (e.g. python main.py without submitting the job).
These jobs are in a half-implemented state with little working functionality. Here is a list of things you cannot do with them:

  • Adding tags through the SDK
  • Viewing logs through the UI
  • Viewing artifacts through the UI
  • Any CLI commands
  • Deleting the job through the UI (high priority in terms of user experience)

There is also no way for the user to distinguish between true local jobs and scheduler-run jobs.

To Reproduce

  1. Create a foundations project with foundations init.
  2. Run python main.py.
  3. Try to do functionality listed above.

Expected behavior
I think that the highest priority at the moment is to add some method of deleting the job through the UI. The rest of the problems require more discussion.

There is a debate on whether running true local jobs should even log things to redis in the first place (and whether they should bundle the job). This also requires more discussion.

[BUG] Job Details page does not show queued time

Describe the bug
Job details page does not show when a job was queued.

Expected behavior
The queued job should have a field in the table that shows me when the job was queued for me to better identify queued jobs.

Screenshots
image

Additional context
Currently, the datetime field is called "Launched", showing when the job was started. Either we need to rename the field or we need to create a new field showing the user when the job was queued.

[BUG] Mac Installation

Platform & setup
Mac OS 10.13.6
Docker Desktop 2.2.0.3 running with Docker Engine 19.03.5

Describe the bug
Installation fails with docker.errors.APIError: 500 Server Error: Internal Server Error ("Get https://us.gcr.io/v2/: Service Unavailable")

To Reproduce
I just ran python atlas_installer.py in a directory containing only the atlas_installer.py file, in a clean conda environment.

Additional context / Stacktrace

Loading image: us.gcr.io/dessa-atlas/foundations/archive_server:latest
Traceback (most recent call last):
  File "/Users/kbalafas/opt/anaconda3/envs/atlas/lib/python3.8/site-packages/docker/api/client.py", line 261, in _raise_for_status
    response.raise_for_status()
  File "/Users/kbalafas/opt/anaconda3/envs/atlas/lib/python3.8/site-packages/requests/models.py", line 939, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.35/images/create?tag=latest&fromImage=us.gcr.io%2Fdessa-atlas%2Ffoundations%2Farchive_server

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "atlas_installer.py", line 691, in <module>
    load_docker_images(args.use_specified_version)
  File "atlas_installer.py", line 112, in load_docker_images
    image_loader.pull_image(image['name'], tag=image['tag'] if use_specified_version else "latest")
  File "atlas_installer.py", line 70, in pull_image
    image = self._client.images.pull(repo, tag)
  File "/Users/kbalafas/opt/anaconda3/envs/atlas/lib/python3.8/site-packages/docker/models/images.py", line 443, in pull
    pull_log = self.client.api.pull(
  File "/Users/kbalafas/opt/anaconda3/envs/atlas/lib/python3.8/site-packages/docker/api/image.py", line 414, in pull
    self._raise_for_status(response)
  File "/Users/kbalafas/opt/anaconda3/envs/atlas/lib/python3.8/site-packages/docker/api/client.py", line 263, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/Users/kbalafas/opt/anaconda3/envs/atlas/lib/python3.8/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("Get https://us.gcr.io/v2/: Service Unavailable")

Rename the FOUNDATIONS_COMMAND_LINE environment variable to something more appropriate

Currently, the user needs to set the env variable FOUNDATIONS_COMMAND_LINE='True' in order for import foundations not to treat the Python process as a job.

This should be renamed because the user will need to set it so that their submission script (a script that only performs foundations submissions, e.g. a hyperparameter search script) is not tracked as a job, and FOUNDATIONS_COMMAND_LINE is a misnomer in that use case.
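Concretely, a submission script has to set the variable before the import. In the sketch below the environment-variable name and value are the ones from this issue; everything else is illustrative.

```python
import os

# Must be set before "import foundations"; otherwise the import treats this
# process itself as a job. The variable name and value are from this issue;
# the rest of the script is illustrative.
os.environ["FOUNDATIONS_COMMAND_LINE"] = "True"

# import foundations  # safe to import only after the variable is set
# ... perform foundations submissions here, e.g. a hyperparameter search ...
```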

[Data Model] Decide on a convention with Redis usage

Currently, the procedure with which we interact with redis is inconsistent. For example, accessing redis directly through the controller versus through the model, or via code in the SDK versus using the message router.

We should have a team-wide discussion to agree on a convention, then go back and refactor the places in the code that don't follow it.

Audit SDK's for code that shouldn't be exposed to users.

The SDKs should only contain user-consumed code at the top level. Other packages outside of the SDKs should not be reaching in for functionality. This has the added benefit of usability for auto-completion, as there is less noise for the user.

[BUG] Copying job ID from the UI creates empty space under table

Describe the bug
Copying a job ID in the UI creates a big empty white space under the footer. It probably has to do with the "copying to clipboard..." modal that pops up.

To Reproduce

  1. Open a job in the UI.
  2. Copy a job ID.
  3. Scroll down.

Expected behavior
Copying a job ID shouldn't create empty space at the bottom.

Screenshots
Gif with a dramatic reveal at the end:
scroll

User profile pictures for Atlas

Is your feature request related to a problem? Please describe.
This is just a UX enhancement. Giving a face to the usernames will make it easy to see who is working on a project.

Describe the solution you'd like
In the multi-node version of Atlas, give users the ability to upload a profile photo.

Jupyter Notebook support through the GUI

As a user, I should be able to go to the GUI, click on create a notebook, and specify the amount of GPU and RAM the notebook should use.

If those resources are available, Atlas will start a Jupyter notebook server (as a job?).
This will perpetually be running unless stopped.

Once the notebook is stopped, there should be a way to resume the notebook again through the GUI. Ideally you should be able to use all the Atlas SDK functions through the Jupyter Notebook.

This task requires significant UX research; it is to understand the user flow and create mockups.

Ability to filter jobs and metrics using tags via the GUI

Is your feature request related to a problem? Please describe.
A user may tag multiple jobs with the name of a particular architecture.
They should be able to filter all the job runs with that tag. This will essentially allow a sub-directory kind of user flow from the GUI.

Describe the solution you'd like

  1. Click on a tag, to filter the job table with jobs only using that tag.

[BUG] Y-axis in graph in Experiment Details cannot show superscripts properly

Describe the bug
The Y-axis of the graph in Experiment Details shows values needing scientific notation in an unreadable fashion. Specifically, a value like 1.0x10^20 shows up with the literal markup 1.0x10<sup>20</sup>

To Reproduce
Steps to reproduce the behavior:

  1. Log a parameter or metric with a large or small value (which will be expressed in the table in scientific notation)
  2. Look at graph, select the parameter or metric

Expected behavior
The values in the Y-axis should display superscripts appropriately

Screenshots
image

[BUG] Could not load dynamic library 'libcuda.so.1'

Platform & setup
I am running Ubuntu Server 18.04 with a GeForce 1080, driver version 440.64.00, and CUDA version 10.2 installed (output of nvidia-smi).
I installed Atlas on the server in a Python environment with Python 3.6.x.
I started the Atlas server with the "-g" flag (while starting, it says that Atlas is using GPU with ID [0]) and I ran the submit command with the "--num-gpus 1" flag.

Describe the bug
When running code using the TensorFlow GPU version, I get the following error and execution falls back to CPU:

2020-03-31 12:19:34.418493: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

2020-03-31 12:19:34.418513: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)

2020-03-31 12:19:34.418534: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist

However, when I run the code locally using the same tf version everything is fine, and running an nvidia docker container works as well.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
Use GPU and CUDA in tensorflow computation.

Multi-tenant Tensorboard server

Currently, if you go into a project and click on Send to Tensorboard, it will create a server that runs Tensorboard for that specific job. However, this is not compatible with a multi-user, multi-tenant Atlas hosted on a cluster: since there is only one instance of the Tensorboard service, all users will clash.

✨Job Details Page REVAMP (filter redesign, improve visual hierarchy, icons)

fixes:
** all text in din or din-bold

  1. filter redesign
    • searchbar should be fixed, the rest is scrollable
  2. table changes
    • header has white background with din-bold text
    • reduce column labels height, change background and border colour, reduce font size and weight to regular din
  3. icons replacing delete button and filter drop down,
    • delete icon should still have disabled state (decrease opacity)
    • icons are included in a zip file below
  4. update checkbox behaviour

icon behaviour

v2_jobdetails_stretched_1

Filter reference
v2_jobdetails_stretched_1_filter

icons.zip

[BUG] Putting colons in metric/parameter/tag keys breaks the REST API

Describe the bug
This bug potentially applies to any piece of information we store and retrieve from redis. Essentially, there are places in the project where values are stored in redis in the form foo:bar. On retrieval, these values are separated by naively splitting on the colon and unpacking into variables.

For example, from project_metrics_controller in _project_metrics, there's a line like

foo, bar = some_value_from_redis.decode().split(':')

This means that if foo or bar happens to have a colon in it, the above line would cause an unpacking error. One of such cases is if a metric key has a colon in it.
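The failure can be reproduced in plain Python; bounding the split with rsplit (assuming the value side cannot itself contain a colon) is one possible remediation, though not necessarily the one the project will choose.

```python
# A stored value whose key itself contains a colon, e.g. metric key "fo:o"
# serialized as "key:value".
value = "fo:o:1"

try:
    foo, bar = value.split(":")   # the naive unpack used today
except ValueError as exc:
    error = str(exc)              # "too many values to unpack (expected 2)"

# One possible fix: bound the split from the right, assuming the value side
# cannot itself contain a colon.
key, val = value.rsplit(":", 1)   # key == "fo:o", val == "1"
```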

To Reproduce

  1. Submit a job with the following line in it:
foundations.log_metric('fo:o', 1)
  2. Hit the /overview_metrics endpoint. This will cause a 500.

Expected behaviour
We should be able to handle any kind of information stored in redis that has a colon in it. Another option is to prevent the user from using colons altogether.

[Data Model] Make "user" a first-class concept

From @hyun20005

In terms of our data model, we conceptually have a User that is associated with every operation, but the implementation of a User is not very well defined. This gets even more blurry when getting into authentication and how the GUI currently deals with the user name.

We should have a discussion fleshing out the User entity within our data model and how we can integrate it into our architecture.

[BUG] If user navigates to wrong project name, they end up at an empty project details/overview page

Describe the bug
If the user navigates to /projects/NOT_A_VALID_PROJECT/overview or projects/NOT_A_VALID_PROJECT/detail, where NOT_A_VALID_PROJECT is not a valid project, the user sees an empty project rather than an error.

To Reproduce
Steps to reproduce the behavior:

  1. See description above

Expected behavior
A 404 error or similar

Screenshots
image
Note that the REST API for job listing returns a 404
image

[BUG] Stopping atlas server with parameters fails

Describe the bug
When starting atlas-server with an option (e.g. atlas-server start -t) and trying to stop it with Ctrl+C, it fails with the error __main__.py: error: unrecognized arguments: -t.

To Reproduce
Steps to reproduce the behavior:

  1. atlas-server start -t (wait for all services to start)
  2. ctrl+c (to stop services)
  3. error message displayed

Expected behavior
Cleanly shut down Atlas.

[BUG] Deleting a queued/running job through UI should show feedback

Describe the bug
Right now, attempting to delete a queued/running job through the UI does nothing and shows no feedback. This is because with our current implementation we do not allow the user to delete queued/running jobs.

To Reproduce

  1. Run a long job.
  2. Try to delete it through the UI

Expected behavior
Either we give some kind of feedback (e.g. a pop-up saying "you cannot delete running jobs"), or we give the delete command the functionality to stop/dequeue running/queued jobs before deleting them.

[BUG] Atlas Installer Fails on MacOS Catalina

Platform & setup
MacOS Catalina

Describe the bug
Traceback (most recent call last):
File "atlas_installer.py", line 657, in
log.info("Attempting to download the installation package from {}".format(file_url))

To Reproduce
Steps to reproduce the behavior:
Go through the installation guide.

Expected behavior
Change in pyinstaller.py: use logger.info instead of log.info.

[BUG] Atlas-Server start doesn't respect home directory used in installer with advanced option

Describe the bug
A user can use the Advanced flag in the installer to change the location of the foundations home directory. However, atlas-server start does:

    try:
        self._foundations_home = Path(environ['FOUNDATIONS_HOME'])
    except KeyError:
        self._foundations_home = Path.home() / ".foundations"

when loading the configs.

If the FOUNDATIONS_HOME variable is not set, startup fails.

Expected behavior
The installer should prompt the user to set their FOUNDATIONS_HOME environment variable if Advanced mode is used.

[Data Model] Remove unused redis keys

@hyun20005

After #87, look at the redis keys that are not in use.

We should either remove the redis keys altogether, or refactor the related functionality to take advantage of and use that data.

[BUG] Cannot delete job tags with a question mark

Describe the bug
Tags with question marks in them cannot be deleted. This is because the DELETE call includes the tag to delete in the URL (i.e. http://localhost:37722/api/v2beta/projects/test/job_listing/some_job_id/tags/foo?bar), where it is parsed as a query parameter.

To Reproduce

  1. Open GUI.
  2. Create a tag with a question mark in it.
  3. Try to remove it.

Expected behavior
Tags with question marks in them should be deletable. The recommended way of doing this is to send the tag within the body of the DELETE call.
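The truncation can be seen with the standard library's URL parser; percent-encoding the tag before building the URL is another possible remediation besides moving it into the request body.

```python
from urllib.parse import quote, urlsplit

# The DELETE call puts the tag in the path, so a '?' starts the query string
# and silently truncates the tag.
url = ("http://localhost:37722/api/v2beta/projects/test"
       "/job_listing/some_job_id/tags/foo?bar")
parts = urlsplit(url)             # path ends in ".../tags/foo", query is "bar"

# Percent-encoding the tag before building the URL is one alternative fix.
escaped = quote("foo?bar", safe="")   # "foo%3Fbar"
```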

Improve speed of data formatting for parallel coordinates graph

Currently, the frontend does some heavy processing (O(n^2)) to convert the job data from the backend to the format that the chart library expects. This leads to long load times when landing on the page.

We should either improve the algorithm in the frontend that does the formatting, store and send the data from the backend properly formatted, or investigate trying to fit our data into the external chart library.

CC: @hyun20005
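One way to avoid the quadratic pairing is to index the columns once and fill each row in a single pass; the job shape below is made up for illustration, since the real payload isn't shown in this issue.

```python
# Index the metric names once, then build each row in a single pass, instead
# of re-scanning all jobs for every column. The job shape is made up for
# illustration; the real payload isn't shown in this issue.
jobs = [
    {"job_id": "a", "metrics": {"loss": 0.3, "acc": 0.9}},
    {"job_id": "b", "metrics": {"loss": 0.2}},
]

columns = sorted({key for job in jobs for key in job["metrics"]})
rows = [[job["metrics"].get(col) for col in columns] for job in jobs]
```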

"End job" function in order to create multiple jobs from the same python script

My use case is tracking every iteration of a hyperparameter optimization process. I've created a function that roughly looks like:

def model_score(hyperparameters):
    xtrain, ytrain, xtest, ytest = get_data()
    model = create_model(hyperparameters)
    model.fit(xtrain, ytrain)
    score = model.evaluate(xtest, ytest)
    return score

I am then passing this function to my favorite optimization algorithm (like skopt's gp_minimize, hyperopt's fmin and so on) but I also want to log the inputs and outputs in the process:

def model_score(hyperparameters):
    foundations.log_params(hyperparameters)
    xtrain, ytrain, xtest, ytest = get_data()
    model = create_model(hyperparameters)
    model.fit(xtrain, ytrain)
    score = model.evaluate(xtest, ytest)
    foundations.log_metric(score)
    return score

While this code doesn't raise any errors, it only logs the first parameter iteration as a "job" and assigns all the metric iterations to it. I'm wondering if it would be possible to have something like a foundations.end_job() function that I would call once I log my metrics, so that the next time log_params is called, it creates a new job.

List of Flaky Tests

I'm going to log flaky tests in this ticket. They waste our time by forcing us to rerun the pipeline.

  • test_put_has_status_code_and_params
  • test_multiple_keys_with_at_most_singly_nested_values
  • test_config_list_returns_empty_listing_when_root_missing
  • test_foundations_create_syncable_directory_without_any_job_ids

[BUG] Send to Tensorboard button not working

Platform & setup
Ubuntu 18.04
Python 3.6

Describe the bug
There are two problems actually:

  1. To enable tensorboard as per the documentation, I do:
import foundations
foundations.set_tensorboard_logdir("/the/shared/dir/in/host/with/docker")

This fails with the error below:
TypeError: 'module' object is not callable

If you look into the foundations.__file__ directory you will see that set_tensorboard_logdir is a Python module there which has a set_tensorboard_logdir function inside. So, to call that set_tensorboard_logdir function we can do the following:

import foundations
import foundations.set_tensorboard_logdir as tb
tb.set_tensorboard_logdir("/the/shared/dir/in/host/with/docker")

Is this correct?

  2. My second question depends on the answer to the first one. With the workaround above, when I launch a project and click the "Send to Tensorboard" button, a new tab opens up with nothing in it; it will not load the "localhost:5959" address. If I manually enter that address it loads a tensorboard session, but it is empty; it seems that it cannot find the logdir for tensorboard. It is worth noting that the directory below is empty:
    ~/.foundations/tensorboard/work_dir/
    however the directory below has the tensorboard event files:
    ~/.foundations/job_data/archive/JOBID/synced_directories/tensorboard/

To Reproduce
Steps to reproduce the behavior:
Mentioned above
Expected behavior
Send to Tensorboard should launch the tensorboard tab.


Consolidate utils and helpers into one place

Find the utility and helper packages (two are within contrib) and merge them into a single place for this type of generic, useful code that is not really domain-specific. Move bits of utility functions here over time to reduce copy-pasted code.

CC: @amackillop

[BUG] Styling issues with having longer tags

Describe the bug
Two styling issues when having several long tags.

  1. This is caused by the logic of when the "..." appears; it seems to be based on the number of tags and doesn't consider their size.

Screen Shot 2019-09-28 at 5 47 49 PM

Screen Shot 2019-09-28 at 5 48 01 PM

  2. This is based on how our table is set up (with cells and rows). Long tags elongate the rows, causing them to misalign with the column headers.

Screen Shot 2019-09-28 at 5 48 16 PM

Screen Shot 2019-09-28 at 5 48 20 PM

To Reproduce

  1. Go to a project in the UI.
  2. Create a few long tags.

Expected behavior
Cutting off the tags in the tag collections should be based on the size of the tag elements.
Column headers should align with the row contents, regardless of what tags exist.
