The pods_service from tapis-project

KG 0.44 - Pods description should be restricted in content and length.

Currently each pod gets a description field, defaulting to ''. This field is user-definable, however, and should have some hard limits.

In particular, description should:

Be shorter than 255 (arbitrary) characters.
Only allow regular characters: [a-z][A-Z][0-9]!.?@#',"-_ (More can be allowed if preferred)

If the description does not meet the above criteria, then a friendly error message should be given to the user.

Additionally, after completing the initial task, add at least two tests: one to ensure valid descriptions work correctly, and a second to ensure invalid descriptions give the correct error message.
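A minimal sketch of the check described above, not the service's actual implementation. The function name `validate_description` is hypothetical, and allowing the space character is an assumption (it is not in the listed character set, but descriptions such as 'test description update' need it).

```python
import string

MAX_DESCRIPTION_LENGTH = 255  # descriptions must be shorter than this

# Characters from the issue: [a-z][A-Z][0-9]!.?@#',"-_
# The trailing space is an assumption, not part of the original list.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + "!.?@#',\"-_ ")


def validate_description(description: str) -> str:
    """Return the description unchanged, or raise ValueError with a friendly message."""
    if len(description) >= MAX_DESCRIPTION_LENGTH:
        raise ValueError(
            f"description must be shorter than {MAX_DESCRIPTION_LENGTH} characters, "
            f"got {len(description)}."
        )
    # Collect each disallowed character once so the message stays short.
    bad_chars = sorted({ch for ch in description if ch not in ALLOWED_CHARS})
    if bad_chars:
        raise ValueError(
            "description contains characters that are not allowed: "
            + " ".join(repr(ch) for ch in bad_chars)
        )
    return description
```

The two requested tests could then assert that a valid description is returned unchanged and that an invalid one raises an error whose message names the offending characters.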

Fix Pods startup bug caused by new tenant

Auto-health can attempt to use tenants that SK hasn't updated the token_generator role on yet. Should just pass.

Pods multi-namespace testing

Now that Files is not required I can test multi-namespace and find issues.

Explore decimal allocation of GPUs as we'll hit issues soon.

KG 0.33 - Have pods be routed using `podname.pods.tenant.environment.tapis.io`

Issue with certs.
Issue with DNS not resolving
- Turned out to be the explicitly routed tacc.develop.tapis.io/dev.develop.tapis.io entries breaking wildcard DNS.
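To illustrate the hostname scheme from this issue, here is a small sketch that splits `podname.pods.tenant.environment.tapis.io` into its parts. The names `PodRoute` and `parse_pod_host` are hypothetical, and the sketch only handles the four-label form shown in the issue title.

```python
from typing import NamedTuple, Optional


class PodRoute(NamedTuple):
    pod_name: str
    tenant: str
    environment: str


def parse_pod_host(host: str, base_domain: str = "tapis.io") -> Optional[PodRoute]:
    """Parse 'podname.pods.tenant.environment.<base_domain>' or return None."""
    # Normalize: drop any port, a trailing dot, and letter case.
    host = host.strip().lower().split(":", 1)[0].rstrip(".")
    suffix = "." + base_domain
    if not host.endswith(suffix):
        return None
    labels = host[: -len(suffix)].split(".")
    # Expect exactly: podname, the literal "pods", tenant, environment.
    if len(labels) != 4 or labels[1] != "pods" or not all(labels):
        return None
    pod_name, _, tenant, environment = labels
    return PodRoute(pod_name, tenant, environment)
```

A plain tenant host such as tacc.develop.tapis.io returns None here, which is the distinction a wildcard route for pods has to preserve.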

KG 0.45: Traefik overhaul and adding Postgres support to templated databases.

Get Traefik working (first with KubernetesCRDs, then ingress, then just a dynamic config file)
Get Postgres working with Traefik
Get Neo4j working with Traefik
Get HTTPS working with Traefik
Get Jinja template working for Traefik
Simplify direct IP address usage to use Kubernetes services
Have Traefik be second layer to Nginx initial proxy so we don't have to change any proxy issues in deployment.
Integrate Traefik with service
Final testing in development

Fixed subset of error logs not being properly handled

Validation for models in models routed error messages improperly. Resulted in tapipy returning message:none

Implement nginx -> traefik to be ready for docker -> k8 push

KG 0.15 - Expose database pods to world.

TCP proxying at this point.

Fix nginx reliance on traefik on startup. Decouple Pods from environment.

Visual Analytics dev deployment and speed-ups

KG 0.38 - Testing is required.

For everything.

KG 0.12 - Tapis auth implementation for FastAPI/New flaskbase

PEARC 2023 Tapis Pods Service Short Paper

Short paper regarding the Pods service. An explanation of the service architecture, the service's benefits, current and possible use-cases, and performance metrics.

KG 0.16 - Expose HTTP pods to world.

A bit confusing currently, as multiple tenants loop over one function to upgrade. Need to add in some other options for later down the line when we don't want every schema to match exactly. For example, when dealing with the image allowlist.

Catalog and Template endpoints with admin template additions.

Adding in catalog endpoint for users to share volumes/snapshots/pods with read permissions.

Template endpoint with the ability for users to add their own templates. Looking to expand this further with user-added complex pod definitions. Not yet though, just images.

KG 0.40 - Migrations break when there's a new tenant.

Migrations were written per tenant and then deployed. That was not smart.

KG T1 - Possible to get around requiring TLS for TCP connections.

Currently TLS is required so we can "ssl_preread" at the nginx level and route according to subdomain. With the bolt driver, user has to have TLS outgoing and incoming if "encrypted" attr is True. Meaning, we need to return TLS, meaning certs.

It might be possible for a user to send us non-TLS TCP, we convert that to TLS compliant TCP, THEN we preread subdomain information? This assumes that ssl_preread just works at this point. Might be possible. Nginx has the certs. We can then go back to sending non-TLS TCP to the pod. Bolt is happy in this case, because non-TLS out and non-TLS back.

Deploy multi-namespace instance without cluster_roles

Can currently do multi-namespace work only if I have a cluster role. Trying to simplify.

Metrics overhaul for traefik

Get correct incoming IP from nginx.
Setup tracing instance?
Use health_central to do metrics work

Fixing break if network goes down.

KG 0.32 - Refactor logging

Logging was done very quickly, needs to be thought out more though. Currently logging is running read_namespaced_pod_log with the Kubernetes python client. This gives us all current logs for the pod, up to the maximum Kubernetes itself stores (10 MB by default). Meaning when we update the pod's logs, we always have the latest 10 MB of logs.

Two possible problems:

We'll lose logs. It might be useful to have the entirety of a pod's logs, thinking of week- or month-long pod runs.
There might be a lot of churn. This would mean updating 10 MB of logs in the database every time health runs (a lot). At the least, it would be useful to have a function that only appends the diff of logs.
Bonus problem: if a pod is stopped and restarted, we currently lose the first instance's logs. That might not be preferred.
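One possible shape for the "append only the diff" function mentioned above, as a sketch rather than the service's code. The name `merge_logs` is hypothetical. It assumes the stored logs and the latest snapshot overlap on whole lines, and it keeps the stored text intact when no overlap is found, which also covers a restarted pod whose new logs share nothing with the old ones.

```python
def merge_logs(stored: str, latest: str) -> str:
    """Append to `stored` only the lines of `latest` that it does not already end with."""
    stored_lines = stored.splitlines(keepends=True)
    latest_lines = latest.splitlines(keepends=True)
    max_overlap = min(len(stored_lines), len(latest_lines))
    # Look for the largest run of lines that ends `stored` and begins `latest`.
    for size in range(max_overlap, 0, -1):
        if stored_lines[-size:] == latest_lines[:size]:
            return stored + "".join(latest_lines[size:])
    # No overlap (first fetch, restarted pod, or rotated window): keep everything.
    return stored + latest
```

Repeated identical lines can make the overlap ambiguous, so timestamped log lines would make the match more reliable. The search is quadratic in the worst case, which may matter for snapshots near the 10 MB limit.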

Deployer QOL fixes (abaco too)

Nicer failing for deployment. Abaco is now able to run without a few environment variables that it previously relied on, allowing for an easier deployment.

KG 0.17 - Create templated NGINX config and get hot reloads working.

KG 0.37 - Need new model for running images.

Currently specify images as:

template images - neo4j
- Requires new function, doesn't allow others to use first level image specification
- Requires setting allowable templates in models.py so users can use it.
- Clunky really. We need a better way to specify templates.
custom images - custom-jstubbs/abaco_test
- This is just bad. Annoying to specify for users.

Ideas for solutions:

pod-templates/neo4j could be the new way to specify using a template. Let's just reserve pod-templates on dockerhub and understand in the service that pod-templates actually means to run our code.
- With this users can just specify their images as jstubbs/abaco_test as you would expect.
Making it easier to add templates is harder. Templates could be turned into a dictionary of values to use. Hard though when we have to create custom credentials per pod or run whatever arbitrary code is required to run the template. Possible though.
Could also have a regex parse the kubernetes_templates.py file, find functions such as start_neo4j_pod, and parse just the image name. This might be the easier way.
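The regex idea above could look roughly like this. It is a sketch that assumes template functions follow the `start_<name>_pod` naming seen in start_neo4j_pod; the helper name `find_template_names` is hypothetical.

```python
import re
from typing import List

# Matches "def start_<name>_pod(" at the start of a line, allowing indentation
# and an optional "async" prefix.
_TEMPLATE_FUNC = re.compile(
    r"^[ \t]*(?:async[ \t]+)?def[ \t]+start_(\w+)_pod[ \t]*\(",
    re.MULTILINE,
)


def find_template_names(source: str) -> List[str]:
    """Return the sorted, de-duplicated template names found in Python source text."""
    return sorted(set(_TEMPLATE_FUNC.findall(source)))
```

A usage example would read the kubernetes_templates.py file as text and pass it in, so the list of allowable templates no longer has to be maintained by hand in models.py.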

KG 0.46 - Catch error messages to other users' http apis.

Traefik allows error message middleware. If we combine this with users' HTTP APIs, we could check on the health of the pod and give useful information regarding the service.

For example, if someone attempts to access a service and it returns a 404, we could intercept and check the health of the pod. If the pod is supposed to be set to off, we can state that. Otherwise we can say there's a problem.

This could be extended to TCP. Unsure.
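The decision described in this issue can be sketched as a single function that an error-page service behind Traefik might call. This is an illustration only: the function name is hypothetical, and the STOPPED status is an assumed name for a pod that is "set to off" (CREATING and AVAILABLE appear in the action logs elsewhere in this list).

```python
def explain_upstream_error(status_code: int, pod_status: str) -> str:
    """Pick a user-facing message for an error response from a user's pod."""
    if status_code < 400:
        return ""  # not an error, nothing to explain
    if pod_status == "STOPPED":  # assumed status name for a pod that is off
        return "This pod is currently stopped. Start it to restore access."
    if pod_status == "CREATING":
        return "This pod is still starting up. Try again shortly."
    if pod_status == "AVAILABLE":
        return "The pod is running, but the service inside it returned an error."
    return "There is a problem with this pod."
```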

KG 0.39 - Need pods_admin that can do whatever it wants.

Currently we can't interfere with a user's pod at all. We should have that access, to change permissions or whatever else.
Maybe not though.

What we gain from having the role:

We can change user permissions.
We can view the pods.
We can start/stop/delete a pod.
We can do all of this through the interface rather than going through the database.

KG 0.05 - Local k8 deployment helpers

Makefile/jupyter lab. Light readme, etc.

KG 0.43.1 - Automatic NFS PKI creation and extraction

KG 0.14 - Visual Analytics group. Deploy database and server pods.

Reimplement nginx -> traefik fallback so traefik fails at tcp level

Action logs for pods and improved health logic

"action_logs": [
    "23/09/07 23:41: Pod object created by 'cgarcia'",
    "23/09/07 23:41: spawner set status to CREATING",
    "23/09/07 23:42: health set status to AVAILABLE",
    "23/09/07 23:42: 'cgarcia' updated pod, updated_fields: {'description': 'test description update'}"
]
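For reference, entries in this shape could be produced by a helper like the one below. It is a sketch: the function name is hypothetical, and reading the timestamp as year/month/day is an assumption based on the sample above.

```python
from datetime import datetime, timezone
from typing import Optional


def action_log_entry(message: str, when: Optional[datetime] = None) -> str:
    """Format one action log line as 'yy/mm/dd HH:MM: message'."""
    # Default to the current UTC time when no timestamp is supplied.
    when = when or datetime.now(timezone.utc)
    return f"{when.strftime('%y/%m/%d %H:%M')}: {message}"
```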

Tests for pods, volumes, and snapshots that work locally and with nfs working.

Yay.

Prepare for Gateways presentation

Creating slides, work on a fun demo of "Hoppscotch" + a database, documentation overhaul.

KG 0.10 - Start/stop database pods

Automatic certificate creation and management

Working now with Traefik LetsEncrypt certs with an ACME tls challenge.
Need extra work for Neo4j as it requires a cert to be inserted in container to work.
Need to get local development working again.

KG 0.31 - Additional Misc Features

KG 0.36 - Security

"Whose idea was it to let third-parties run whatever they wanted in a Kubernetes?" A story on how Christian should have done this with Docker.

Cert isolation - Can't have a bad cert affect our normal certs. Cert errors could be bad. Should probably create a new cert per thing? Maybe always have it be non-secured?
Service isolation - Pods shouldn't be able to use any service at all. No Egress. Only Ingress from nginx.
Network isolation - Pods shouldn't be able to make any calls via ip.
Pod isolation - Pods shouldn't have k8 control or access to other pods.
Environment Variable isolation - Block access to default environment variables.
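The "No Egress. Only Ingress from nginx." item maps naturally onto a Kubernetes NetworkPolicy. The sketch below builds such a manifest as a plain dictionary; the function name, the policy name, and the label values used to select user pods and the nginx pod are all hypothetical.

```python
from typing import Any, Dict


def build_isolation_policy(
    namespace: str,
    pod_labels: Dict[str, str],
    nginx_labels: Dict[str, str],
) -> Dict[str, Any]:
    """Build a NetworkPolicy manifest: ingress only from nginx, no egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "pods-isolation", "namespace": namespace},
        "spec": {
            # Which pods the policy applies to (the user pods).
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress", "Egress"],
            # Allow traffic in only from pods carrying the nginx labels.
            "ingress": [{"from": [{"podSelector": {"matchLabels": nginx_labels}}]}],
            # Listing Egress with no rules blocks all outgoing traffic.
            "egress": [],
        },
    }
```

Blocking all egress also blocks DNS lookups from the pod, and a policy like this only takes effect if the cluster's network plugin enforces NetworkPolicy.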

MySQL Support

For Carlos

KG 0.35 - Need to isolate pods from network sans their open ports.

Currently I believe arbitrary code can do basically anything to our cluster. Isolation via namespace does work, but in that case we need to move spawner into its own namespace (pods can still talk to each other though).

Note: This is also important for Abaco.

KG 0.34 - Initial documentation

Initial development docs
Initial live docs
Initial Tapis V3 documentation
Description for each operation

Testing VM -> K8 network routing for Traefik

Host not found in upstream NGINX bug causing broken NGINX.

There might be real issues with the current implementation of a hot reloading NGINX. Testing is really required, it could "just work". But Nginx currently kind of breaks if it doesn't find a pod. So if we delete a pod before NGINX accounts for it, it could mean breaking NGINX every so many seconds.

This now matters as we have stop/restart commands. Seems like there is indeed an issue. Needs work.

Nginx likes to error out when an "upstream" connection is missing. As if pod is gone. Try and remedy this. Might not be a problem when reloading nginx, only initial deploy. Investigate. - stack overflow suggestions

Check whether or not Nginx hot-reload breaks long-running queries to the db. (Neo4j upload)

GPU support with limit/requests

Fix periodic restarting of health pod.

Probably due to Files connections
