tapis-project / pods_service

Network Accessible Pods API.

Home Page: https://tapis.readthedocs.io/en/latest/technical/pods.html

License: BSD 3-Clause "New" or "Revised" License

Languages: Dockerfile 0.50%, Makefile 1.75%, Shell 0.32%, Python 82.12%, Mako 0.25%, Jupyter Notebook 14.47%, Jinja 0.59%
Topics: kubernetes, tacc

pods_service's People

Contributors

jasonthekim, notchristiangarcia

Forkers

jasonthekim

pods_service's Issues

KG 0.44 - Pods description should be restricted in content and length.

Currently each pod gets a description field, defaulting to ''. This field is user-definable, however, and should have some hard limits.

In particular, description should:

  • Be shorter than 255 (arbitrary) characters.
  • Allow only regular characters: [a-z][A-Z][0-9]!.?@#',"-_ (more can be allowed if preferred).

If the description does not meet the above criteria, a friendly error message should be given to the user.

Additionally, after completing the initial task, add at least two tests: one to ensure valid descriptions work correctly, and a second to ensure invalid descriptions give the correct error message.
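
The checks described above could be sketched as follows. This is a minimal illustration, not the service's actual code: the function name validate_description and the constant names are hypothetical, and allowing a space character is an assumption that the issue does not state.

```python
import re

# Assumed limit; the issue itself calls 255 arbitrary.
MAX_DESCRIPTION_LENGTH = 255

# Characters listed in the issue: letters, digits, and !.?@#',"-_
# A space is also allowed here (assumption), since descriptions are prose.
_ALLOWED = re.compile(r"""[A-Za-z0-9 !.?@#',"\-_]*""")


def validate_description(description: str) -> str:
    """Return the description unchanged, or raise ValueError with a friendly message."""
    if len(description) >= MAX_DESCRIPTION_LENGTH:
        raise ValueError(
            f"description must be shorter than {MAX_DESCRIPTION_LENGTH} characters, "
            f"got {len(description)}."
        )
    if not _ALLOWED.fullmatch(description):
        # Collect the offending characters so the user knows what to remove.
        bad = sorted({ch for ch in description if not _ALLOWED.fullmatch(ch)})
        raise ValueError(
            f"description contains characters that are not allowed: {''.join(bad)!r}."
        )
    return description


def test_valid_description():
    assert validate_description("My neo4j pod, v2!") == "My neo4j pod, v2!"


def test_invalid_description():
    try:
        validate_description("bad <script>")
    except ValueError as e:
        assert "not allowed" in str(e)
    else:
        raise AssertionError("expected ValueError")
```

The two test functions mirror the two tests the issue asks for; in the real service they would presumably go through the API rather than call a helper directly.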

KG 0.45: Traefik overhaul and adding Postgres support to templated databases.

  • Get Traefik working (first with KubernetesCRDs, then ingress, then just a dynamic config file)
  • Get Postgres working with Traefik
  • Get Neo4j working with Traefik
  • Get HTTPS working with Traefik
  • Get Jinja template working for Traefik
  • Simplify direct IP address usage to use Kubernetes services
  • Have Traefik be a second layer behind the initial Nginx proxy so we don't have to change anything proxy-related in deployment.
  • Integrate Traefik with service
  • Final testing in development

PEARC 2023 Tapis Pods Service Short Paper

Short paper regarding the Pods service: an explanation of the service architecture, the service's benefits, current and possible use cases, and performance metrics.

KG 0.41 - Improve migrations.

Migrations are a bit confusing currently, as multiple tenants loop over one function to upgrade. We need to add some other options for later down the line, when we don't want every schema to match exactly, for example when dealing with the image allowlist.

Catalog and Template endpoints with admin template additions.

Adding a catalog endpoint for users to share volumes/snapshots/pods with read permissions.

Template endpoint with the ability for users to add their own templates. Looking to expand this further with user-added complex pod definitions, though not yet; for now, just images.

KG T1 - Possible to get around requiring TLS for TCP connections.

Currently TLS is required so that we can use "ssl_preread" at the Nginx level and route according to subdomain. With the Bolt driver, the user must have TLS both outgoing and incoming if the "encrypted" attr is True. That means we need to return TLS, which means certs.

It might be possible for a user to send us non-TLS TCP, which we convert to TLS-compliant TCP, and THEN we preread the subdomain information. This assumes that ssl_preread just works at that point. Nginx has the certs. We could then go back to sending non-TLS TCP to the pod. Bolt is happy in this case, because it is non-TLS out and non-TLS back.

KG 0.32 - Refactor logging

Logging was implemented very quickly and needs to be thought out more. Currently logging runs read_namespaced_pod_log with the Kubernetes Python client. This gives us all current logs for the pod, up to the maximum Kubernetes itself stores (10 MB by default). This means that whenever we update the pod's logs, we always have the latest 10 MB of logs.

Two possible problems:

  • We'll lose logs. It might be useful to have the entirety of a pod's logs, for example in the case of week- or month-long pod runs.
  • There might be a lot of churn: this would mean updating 10 MB of logs in the database every time health runs (which is often). At the least, it would be useful to have a function that only appends the diff of the logs.
  • Bonus problem: if a pod is stopped and restarted, we would currently lose the first instance's logs. That might not be preferred.
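
One way to append only the diff is sketched below. It is a minimal illustration under stated assumptions, not the service's implementation: the function name append_new_logs is hypothetical, and it assumes logs are compared line by line, with the tail of the stored logs overlapping the head of the freshly fetched window.

```python
def append_new_logs(stored: str, fetched: str) -> str:
    """Merge a freshly fetched log window into the stored logs without duplicating lines."""
    stored_lines = stored.splitlines()
    fetched_lines = fetched.splitlines()

    # Look for the largest overlap where the tail of the stored logs
    # equals the head of the fetched window.
    max_overlap = min(len(stored_lines), len(fetched_lines))
    for size in range(max_overlap, 0, -1):
        if stored_lines[-size:] == fetched_lines[:size]:
            return "\n".join(stored_lines + fetched_lines[size:])

    # No overlap: first fetch, a restarted pod, or a window that rolled over
    # completely. Keep everything rather than dropping the older logs.
    return "\n".join(stored_lines + fetched_lines)
```

Because a restarted pod produces no overlap, the earlier instance's logs are kept instead of being overwritten. Repeated identical lines can produce a false overlap; fetching logs with the client's timestamps option would make lines unique, at the cost of longer entries. The overlap search is quadratic in the worst case, so a production version would likely track the last stored timestamp instead.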

Deployer QOL fixes (abaco too)

Deployment now fails more gracefully. Abaco is now able to run without a few environment variables that it previously relied on, allowing for an easier deployment.

KG 0.37 - Need new model for running images.

Currently images are specified as:

  • template images - neo4j
    • Requires a new function and doesn't allow others to use first-level image specification.
    • Requires setting allowable templates in models.py so users can use them.
    • Clunky, really. We need a better way to specify templates.
  • custom images - custom-jstubbs/abaco_test
    • This is just bad. Annoying for users to specify.

Ideas for solutions:

  • pod-templates/neo4j could be the new way to specify a template. Let's just reserve pod-templates on Docker Hub and have the service understand that pod-templates actually means to run our code.

    • With this, users can just specify their images as jstubbs/abaco_test, as you would expect.
  • Making it easier to add templates is harder. Templates could be turned into a dictionary of values to use. That is hard, though, when we have to create custom credentials per pod or run whatever arbitrary code is required to run the template. Possible though.

  • Could also have a regex parse the kubernetes_templates.py file, find functions such as start_neo4j_pod, and parse just the image name. This might be the easier way.
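
The regex idea and the pod-templates/ prefix could be combined roughly as follows. This is a sketch under assumptions: only the start_neo4j_pod naming and the pod-templates prefix come from the issue; the helper names find_template_names and resolve_image, and the start_<name>_pod pattern generalised to other templates, are hypothetical.

```python
import re

# Matches top-level definitions such as `def start_neo4j_pod(`.
# The start_<name>_pod convention is inferred from the issue text.
_TEMPLATE_FUNC = re.compile(r"^def start_(\w+?)_pod\(", re.MULTILINE)

# Reserved prefix proposed in the issue.
TEMPLATE_PREFIX = "pod-templates/"


def find_template_names(source: str) -> list:
    """Return the template names found in the text of a templates module."""
    return sorted(set(_TEMPLATE_FUNC.findall(source)))


def resolve_image(image: str, templates: list) -> tuple:
    """Classify an image reference as ('template', name) or ('custom', image)."""
    if image.startswith(TEMPLATE_PREFIX):
        name = image[len(TEMPLATE_PREFIX):]
        if name not in templates:
            raise ValueError(
                f"Unknown template {name!r}. Available templates: {templates}"
            )
        return ("template", name)
    # Anything else is treated as a regular image reference.
    return ("custom", image)


if __name__ == "__main__":
    example_source = "def start_neo4j_pod(pod):\n    pass\n"
    names = find_template_names(example_source)
    print(resolve_image("pod-templates/neo4j", names))
    print(resolve_image("jstubbs/abaco_test", names))
```

With this split, the allowable templates no longer need to be listed by hand in models.py; they follow from whichever start functions exist.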

KG 0.46 - Catch error messages to other users' http apis.

Traefik allows error message middleware. If we combine this with users' HTTP APIs, we could check on the health of the pod and give useful information regarding the service.

For example, if someone attempts to access a service and it returns a 404, we could intercept the response and check the health of the pod. If the pod is supposed to be off, we can state that. Otherwise we can say there's a problem.

This could be extended to TCP. Unsure.
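
The decision logic behind such an error page might look like the sketch below. Everything here is illustrative: the function name, the status_requested field, and the "OFF" value are assumptions, and only the AVAILABLE status appears elsewhere in these issues. How the function would be wired into Traefik's error middleware is left out.

```python
def explain_upstream_error(http_status: int, pod_status: str, status_requested: str) -> str:
    """Turn an intercepted error response into a message about the pod's state."""
    if http_status < 400:
        # Not an error response; nothing to explain.
        return ""
    if status_requested == "OFF":
        # The pod is supposed to be off, so the error is expected.
        return "This pod is currently stopped. Start the pod to access its service."
    if pod_status != "AVAILABLE":
        return f"This pod is not ready yet (current status: {pod_status})."
    # Pod looks healthy from our side, so the problem is inside the user's service.
    return (
        f"The pod is running, but its service returned an error ({http_status}). "
        "There may be a problem with the application inside the pod."
    )
```

Keeping the logic as a pure function makes it easy to test independently of the proxy layer.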

KG 0.39 - Need pods_admin that can do whatever it wants.

Currently we can't interfere with a user's pod at all. We should have that access, to change permissions or whatever else.
Maybe not though.

What we gain from having the role:

  • We can change user permissions.
  • We can view the pods.
  • We can start/stop/delete a pod.
  • We can do all of this through the interface rather than going through the database.

Action logs for pods and improved health logic

"action_logs": [
    "23/09/07 23:41: Pod object created by 'cgarcia'",
    "23/09/07 23:41: spawner set status to CREATING",
    "23/09/07 23:42: health set status to AVAILABLE",
    "23/09/07 23:42: 'cgarcia' updated pod, updated_fields: {'description': 'test description update'}"
]
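
Entries in the format shown above could be produced by a small helper like this one. It is a sketch only: the name add_action_log is hypothetical, the timestamp is assumed to be yy/mm/dd hh:mm, and using UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import List, Optional


def add_action_log(action_logs: List[str], message: str,
                   now: Optional[datetime] = None) -> List[str]:
    """Return a new list with a timestamped entry appended."""
    # Default to the current UTC time (assumption); tests can pass a fixed value.
    now = now or datetime.now(timezone.utc)
    entry = f"{now.strftime('%y/%m/%d %H:%M')}: {message}"
    return action_logs + [entry]


if __name__ == "__main__":
    logs = add_action_log([], "Pod object created by 'cgarcia'")
    logs = add_action_log(logs, "spawner set status to CREATING")
    for line in logs:
        print(line)
```

Returning a new list instead of mutating the argument keeps the helper safe to use with values loaded from a database row.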

Automatic certificate creation and management

  • Working now with Traefik Let's Encrypt certs using an ACME TLS challenge.
  • Need extra work for Neo4j as it requires a cert to be inserted in container to work.
  • Need to get local development working again.

KG 0.31 - Additional Misc Features

  • Add character length limit for pod_id.
  • Image allowlist.
  • Catalog attr to show all pods in some catalog endpoint.
  • Add multi-port feature Pods so one pod can have multiple declared ports with different addresses.
  • Fix multi-port feature dropping TCP declarations due to colliding service names in Traefik.
  • Add an endpoint for allow/block list per tenant (requires a separate db table per tenant + global)
  • Get global tables working with the current alembic setup.
  • Fix update/creation timestamp attr
  • Pod Stop
  • Pod Restart
  • PVC for each database pod based on pod attr?
  • Add in init container support.
  • Create local dummy cert for local dev.
  • Rework image declaration and add support for tags
  • Catalog endpoint for templates + images
  • Pod attr switch between k8 deploy or pod creation. Currently creating pods, not deployment.
  • Ensure logs get snapshotted when a pod is complete or is errored out in the API.
  • Add a way for users to input their pod_password in environment variables without copy paste.
  • If pod doesn't get created properly, service won't either, even if pod gets to healthy. Needs health check + better logic.
  • Create admin and user creds for Neo4J
    • Need to delete existence of admin_password so that user can't access it. Currently an env.
      • This could be done with an init_container if there's a PVC for the neo db.
  • Require better logic for changing Nginx configmap. "Check if current matches correct, if not, update".
  • Maybe make the Nginx reload faster?
  • Environment variable proliferation
  • Fix terminating status pods acting as ready.

KG 0.36 - Security

"Whose idea was it to let third parties run whatever they wanted in a Kubernetes?" A story on how Christian should have done this with Docker.

  • Cert isolation - Can't have a bad cert affect our normal certs. Cert errors could be bad. Should probably create a new cert per thing? Maybe always have it be non-secured?
  • Service isolation - Pods shouldn't be able to use any service at all. No Egress. Only Ingress from nginx.
  • Network isolation - Pods shouldn't be able to make any calls via IP.
  • Pod isolation - Pods shouldn't have k8 control or access to other pods.
  • Environment Variable isolation - Block access to default environment variables.
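
Some of these isolation goals map onto standard Kubernetes objects. The sketch below builds them as plain Python dictionaries; it is an illustration under assumptions, not what the service does. The policy name and the labels app=user-pod and app=pods-proxy are hypothetical, and a NetworkPolicy only has an effect when the cluster's network plugin enforces it.

```python
def build_pod_isolation_policy(namespace: str) -> dict:
    """NetworkPolicy manifest: no egress, ingress only from the proxy pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "isolate-user-pods", "namespace": namespace},
        "spec": {
            # Hypothetical label that selects user-created pods.
            "podSelector": {"matchLabels": {"app": "user-pod"}},
            "policyTypes": ["Ingress", "Egress"],
            # Only pods carrying the (hypothetical) proxy label may connect.
            "ingress": [
                {"from": [{"podSelector": {"matchLabels": {"app": "pods-proxy"}}}]}
            ],
            # Egress is listed in policyTypes with no rules, so all egress is denied.
            "egress": [],
        },
    }


def harden_pod_spec(pod_spec: dict) -> dict:
    """Return a copy of a pod spec with API access and service env vars disabled."""
    hardened = dict(pod_spec)
    # Do not mount a service account token, so the pod cannot call the Kubernetes API.
    hardened["automountServiceAccountToken"] = False
    # Do not inject environment variables describing other services.
    hardened["enableServiceLinks"] = False
    return hardened
```

Denying all egress also blocks DNS lookups from the pod, which may or may not be acceptable for a given template. Cert isolation is not covered by this sketch.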

Host not found in upstream NGINX bug causing broken NGINX.

There might be real issues with the current implementation of a hot-reloading NGINX. Testing is really required; it could "just work". But NGINX currently kind of breaks if it doesn't find a pod. So if we delete a pod before NGINX accounts for it, it could mean breaking NGINX every so many seconds.

This now matters as we have stop/restart commands. It seems like there is indeed an issue. Needs work.

NGINX likes to error out when an "upstream" connection is missing, for example when a pod is gone. Try to remedy this. It might not be a problem when reloading NGINX, only on initial deploy. Investigate (see Stack Overflow suggestions).

  • Check whether or not an NGINX hot reload breaks a long-running query to the db (Neo4j upload).
