uitml / springfield

Project "Springfield"

Home Page: https://uitml.github.io/springfield/

License: MIT License

Languages: Dockerfile 13.59%, Shell 86.41%

Topics: cuda, docker, gpu-cluster, kubernetes, machine-learning, research, tensorflow

springfield's People

Contributors

danieltrosten, jonasnm, madsadrian, mickam, siloekse, thomasjo

Forkers

madsadrian

springfield's Issues

Prepare authentication when SSH agent is not running

I came across an edge case where the SSH agent is not running when prepare-authentication.sh is executed. The key/pub files are created, but they are not added to the agent. The script only adds the files to the agent when they did not exist beforehand, so in this case the user has to start the SSH agent, add the keys manually, and then run prepare-authentication.sh again to configure k8s.
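One way to catch this case up front is to check for a reachable agent before touching any keys. The snippet below is a minimal sketch, not code from prepare-authentication.sh; the function name `ssh_agent_reachable` is made up for illustration. It relies on `ssh-add -l` exiting with status 2 when it cannot contact the agent.

```shell
# Returns 0 if an SSH agent is reachable, non-zero otherwise.
# Hypothetical helper; not part of prepare-authentication.sh.
ssh_agent_reachable() {
  # No socket path exported, or the path is not a socket: nothing to talk to.
  [ -n "${SSH_AUTH_SOCK:-}" ] && [ -S "$SSH_AUTH_SOCK" ] || return 1
  # ssh-add exits with status 2 when it cannot contact the agent.
  # Status 0 (keys listed) and 1 (no keys loaded) both mean the agent answered.
  status=0
  ssh-add -l >/dev/null 2>&1 || status=$?
  [ "$status" -ne 2 ]
}

if ! ssh_agent_reachable; then
  echo 'No SSH agent found. Start one with: eval "$(ssh-agent -s)"' >&2
fi
```

Warning early would let the user start the agent before the key pair is created, so the script's existing "add on creation" step still applies.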

Deploy dashboard

We should investigate deploying the standard k8s dashboard solution, which might make it easier for end users to deploy and monitor jobs, etc.

  • Find an authentication scheme
    • Experiment with client certificates
    • Experiment with OAuth via GitHub
  • Deploy dashboard
    • Test authentication
    • Expose service using ingress
  • Invite beta testers

Detect invalid file permissions on existing SSH key pairs

When a user executes prepare-authentication.sh with a pre-existing key pair, we should detect if the permissions are invalid.
And if the permissions are invalid, we should inform the user in a "friendly" manner and help them fix the permission issue.

An alternative approach would be to fix the permission issue automatically, but I'm a bit reluctant about changing file permissions on existing files; they might be "invalid" for a reason.

For reference, the .ssh folder needs to be 0700 and any file within that folder should be 0600.
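A report-only check along these lines could look as follows. This is a sketch under assumptions: `file_mode` and `check_mode` are hypothetical helper names, and the expected modes are the ones stated above, written without the leading zero because that is how `stat` prints them. Nothing is changed on disk; the user only gets a suggested `chmod` command.

```shell
# Print the octal permission bits of a path.
# Tries GNU stat first and falls back to BSD/macOS stat.
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# Compare a path's mode with the expected one and print a hint on mismatch.
# Returns 0 if the mode matches, 1 on mismatch, 2 if the path cannot be read.
check_mode() {
  path=$1
  expected=$2
  actual=$(file_mode "$path") || return 2
  if [ "$actual" != "$expected" ]; then
    echo "warning: $path has mode $actual, expected $expected" >&2
    echo "  to fix: chmod $expected $path" >&2
    return 1
  fi
}
```

For example, `check_mode "$HOME/.ssh" 700` followed by `check_mode "$HOME/.ssh/id_rsa" 600`, where `id_rsa` is only a placeholder for whichever key file the script manages.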

Document how to use the cluster as an end-user

We need to provide good documentation on how to get started with using the cluster once it's set up. Ideally we should provide some useful examples that can be used as a starting point.

Set up end-user authentication

We need to give end-users the ability to authenticate in order to access the cluster to schedule jobs/services, and so on.

The best option seems to be relying on GitHub OAuth. An alternative is OAuth via source.uit.no, but there doesn't seem to be a way of adding an OAuth application to the organization there, only to a user.

Update documentation on accessing cluster file storage

Currently the Getting Started guide walks the end-user through using port-forwarding to access the storage proxy (namespaced SSH server), but this is no longer necessary now that the service is exposed via NodePort. The documentation needs to be updated to reflect this. Essentially, replace the port-forwarding part with instructions on how to find the correct port number.
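As a starting point for those instructions, the port assigned to a NodePort service can be read with `kubectl` and a JSONPath query. This is a sketch under assumptions: the service name `storage-proxy` is a placeholder (the real name is not given here), and the function name is made up for illustration.

```shell
# Print the node port of the storage proxy service in the given namespace.
# "storage-proxy" is a placeholder service name, not taken from the project.
storage_proxy_port() {
  namespace=$1
  service=${2:-storage-proxy}
  kubectl get service "$service" \
    --namespace "$namespace" \
    --output jsonpath='{.spec.ports[0].nodePort}'
}
```

The result can then be passed to the SSH client, for example `ssh -p "$(storage_proxy_port my-namespace)" user@node-address`, where the namespace, user, and node address are again placeholders.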

Document how to configure client tools

Add information to the Getting Started guide about how to run the configuration script to correctly configure k8s client tools. It should probably be added between the sections on installing the client tools and connecting to the cluster file storage.

Set up resource quotas

We need to set resource quotas on the cluster based on role (researcher, student, etc.) in order to prevent someone from using, e.g., all GPUs or all CPU cores.
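One possible shape for this is a per-namespace ResourceQuota whose limits depend on the user's role. The sketch below assumes each user has their own namespace and that GPUs are exposed as the `nvidia.com/gpu` extended resource; the function names, the quota name `role-quota`, and all numbers are placeholders, not decisions made by the project.

```shell
# Map a role to a --hard limit string. All numbers are placeholders.
quota_limits_for_role() {
  case $1 in
    researcher) echo 'requests.cpu=16,requests.memory=64Gi,requests.nvidia.com/gpu=2' ;;
    student)    echo 'requests.cpu=4,requests.memory=16Gi,requests.nvidia.com/gpu=1' ;;
    *)          echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Create a ResourceQuota named "role-quota" in a user's namespace.
apply_role_quota() {
  namespace=$1
  role=$2
  limits=$(quota_limits_for_role "$role") || return 1
  kubectl create quota role-quota --namespace "$namespace" --hard="$limits"
}
```

Calling `apply_role_quota alice student` would then cap the hypothetical namespace `alice` at the student limits.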

Set up ingress

Finish setting up ingress on the cluster to allow external requests to be routed to a service running across one or more pods, on one or more nodes.

The two best alternatives look to be

  1. NGINX.
  2. Traefik.

Add some basic guidance on using sshfs with cluster file storage

To simplify the end-user workflow, sshfs can be used to map the cluster file storage to a "local" mount point. In other words, the cluster file storage will look like a normal directory to the end-user, which will greatly improve the overall workflow.

Consider automating this (to some extent) using a shell script.
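A first version of such a script could be little more than a wrapper around `sshfs`. This is a minimal sketch; the function name and all argument values are hypothetical, and it assumes the storage proxy accepts SSH connections on a known port.

```shell
# Mount the remote home directory of the storage proxy at a local mount point.
# Arguments: [user@]host, SSH port, local mount point (all supplied by the caller).
mount_cluster_storage() {
  host=$1
  port=$2
  mountpoint=$3
  # Create the mount point if it does not exist yet.
  mkdir -p "$mountpoint" || return 1
  # An empty path after the colon refers to the remote home directory.
  sshfs -p "$port" "$host:" "$mountpoint"
}
```

For example, `mount_cluster_storage user@node-address 30022 "$HOME/springfield"` with placeholder values. Unmounting is done with `fusermount -u` on Linux or `umount` on macOS.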

Document how to recreate the cluster

To improve the bus factor, we should document everything needed to recreate the cluster setup, in excruciating detail. The work has begun, but there is a lot left to be done. This task should have top priority.

Experiment with using a load balancer

Some time ago I ordered a Raspberry Pi 3 Model B+ with the idea that it could function as a basic load balancer of some sort. We should experiment with how feasible that is. It might be enough to use a simple NGINX setup. Might also be worth investigating using HAProxy as an alternative.
