shopify / krane
A command-line tool that helps you ship changes to a Kubernetes namespace and understand the result
License: MIT License
Saw this:
[FATAL][2017-07-31 21:33:12 +0000] > Error from kubectl:
[FATAL][2017-07-31 21:33:12 +0000] Error from server (InternalError): error when applying patch:
[FATAL][2017-07-31 21:33:12 +0000] {"spec":{"containers":[{"name":"command-runner","resources":{"limits":{"cpu":"1000m"}}}]}}
[FATAL][2017-07-31 21:33:12 +0000] to:
[FATAL][2017-07-31 21:33:12 +0000] &{0xc42100b680 0xc420152070 <app> n upload-assets-7b4d9314-1f4a36a4 /tmp/Pod-upload-assets-7b4d9314-1f4a36a420170731-5277-12341p6.yml 0xc4205cca28 0xc420f4c000 82075679 false}
[FATAL][2017-07-31 21:33:12 +0000] for: "/tmp/Pod-upload-assets-7b4d9314-1f4a36a420170731-5277-12341p6.yml": an error on the server ("Internal Server Error: \"/api/v1/namespaces/<app>/pods/upload-assets-7b4d9314-1f4a36a4\": the server could not find the requested resource") has prevented the request from succeeding (patch pods upload-assets-7b4d9314-1f4a36a4)
[FATAL][2017-07-31 21:33:12 +0000] > Rendered template content:
It appears that the pod died while (or before) it was being updated with the resource limit.
Could this be a concurrency bug? The subsequent deploy passed fine.
The fact that deployment rollout monitoring currently looks at all pods associated with the deployment, rather than only the pods in the new ReplicaSet, has caused several different bugs:
[KUBESTATUS] {"group":"Deployments","name":"jobs","status_string":"Waiting for rollout to finish: 0 of 1 updated replicas are available...","exists":true,"succeeded":true,"failed":false,"timed_out":false,"replicas":{"updatedReplicas":1,"replicas":1,"availableReplicas":1},"num_pods":1}
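One way to scope monitoring correctly is to use the pod-template-hash label that the Deployment controller stamps on both the new ReplicaSet and its pods. This is only a sketch, not the gem's actual implementation; the helper name and input shapes (hashes mimicking `kubectl get ... -o json`) are assumptions.

```ruby
# Sketch: restrict rollout monitoring to pods owned by the deployment's
# newest ReplicaSet, identified via the pod-template-hash label.
def pods_in_latest_replica_set(deployment_pods, replica_sets)
  # Pick the ReplicaSet from the most recent revision of the deployment
  latest_rs = replica_sets.max_by do |rs|
    rs.dig("metadata", "annotations", "deployment.kubernetes.io/revision").to_i
  end
  hash = latest_rs.dig("metadata", "labels", "pod-template-hash")

  # Only pods carrying the same pod-template-hash belong to the new rollout
  deployment_pods.select do |pod|
    pod.dig("metadata", "labels", "pod-template-hash") == hash
  end
end
```

With this filter, old-ReplicaSet pods that are still Running can no longer make a new rollout look healthy.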
Related technical notes:
This feature will mostly be on the shipit-engine side, but will require some work in this gem. The primary purpose of the KUBESTATUS logs is to support such a visualization. Some thoughts:
"Basic" -> "Hello Cloud"
As of v1.6, kubectl apply can handle TPRs properly.
HorizontalPodAutoscaler is no longer supported in extensions/v1beta1 version. Use autoscaling/v1 instead.
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md#autoscaling-1
This causes us to fail deploys: https://shipit.shopify.io/shopify/k8s-cluster-services/production/deploys/435598
Bug appears to be this line:
https://github.com/Shopify/kubernetes-deploy/blob/22972eeb193e862b2521cec81e730e7bba9871ea/lib/kubernetes-deploy/runner.rb#L56
/usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/kubernetes_resource/daemon_set.rb:42:in `timeout_message': undefined method `map' for nil:NilClass (NoMethodError)
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/kubernetes_resource.rb:130:in `debug_message'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:168:in `block in record_statuses'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:168:in `each'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:168:in `record_statuses'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:124:in `run'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/exe/kubernetes-deploy:70:in `<top (required)>'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `load'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `<main>'
kubernetes-deploy ci ci-east --bindings=region=us-east1 exited with status 1
The monitoring for both the deployment (which currently has a guard clause requiring at least one replica to be available) and the service (which waits for endpoints in order to succeed) is causing this.
Secret Provisioner does not respect the no-prune flag
Ran into this trying to restart via shipit
Kube-restart failed
$ bundle exec kubernetes-restart identity-production tier3
pid: 20076
bundler: failed to load command: kubernetes-restart (/app/data/bundler/ruby/2.3.0/bin/kubernetes-restart)
@KnVerey @kirs right now folks trying to do a restart via shipit are hitting this bug
Instead of hardcoding details about what Shopify's TPR controllers are expected to create, we should provide generic support based on a conventional status field. According to @wfarr's investigation, such a status field will be easier to implement in the new CustomResourceDefinitions.
Hi Shopify,
I am trying to get started using this tool but would love an example or demo folder with sample app, template and template erb to hit the ground running. Is this something that exists in another repo?
Syntax errors in the k8s template error out, but the deploy still reports as successful:
[WARN] The following command failed: kubectl apply -f /tmp/testing-deployment.yml.erb20170209-12190-s9f69e --prune --all --prune-whitelist\=core/v1/ConfigMap --prune-whitelist\=core/v1/Pod --prune-whitelist\=core/v1/Service --prune-whitelist\=batch/v1/Job --prune-whitelist\=extensions/v1beta1/DaemonSet --prune-whitelist\=extensions/v1beta1/Deployment --prune-whitelist\=extensions/v1beta1/HorizontalPodAutoscaler --prune-whitelist\=extensions/v1beta1/Ingress --prune-whitelist\=apps/v1beta1/StatefulSet --namespace\=pipa-test --context\=pipa-test
[WARN] error: unable to decode "/tmp/testing-deployment.yml.erb20170209-12190-s9f69e": [pos 290]: json: expect char '"' but got char '8'
Waiting for Deployment/pipa-test
[KUBESTATUS] {"group":"Pipa-test deployment","name":"pipa-test-1149558948-5l604","status_string":"Running (Ready: true)","exists":true,"succeeded":true,"failed":false,"timed_out":false}
...
Spent 0.58s waiting for Deployment/pipa-test
Deploy succeeded!
Should this error be caught in shipit-engine or in this gem?
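Wherever it lives, the core of the fix is treating a non-zero exit from kubectl apply as fatal instead of a warning, so template syntax errors can't be reported as a successful deploy. A minimal sketch, assuming a `run_kubectl` callable that returns stdout, stderr, and a success flag (the helper names are assumptions, not the gem's API):

```ruby
# Sketch: escalate a failed `kubectl apply` to a fatal error instead of
# logging a warning and continuing.
class FatalDeploymentError < StandardError; end

def apply_all!(run_kubectl, args)
  out, err, ok = run_kubectl.call("apply", *args)
  raise FatalDeploymentError, "Command failed: kubectl apply\n#{err}" unless ok
  out
end
```

A deploy that hits the decode error above would then end with Result: FAILURE rather than "Deploy succeeded!".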
We currently require Kubernetes v1.6 and will soon move to v1.7. All three executables should check the version in the target cluster and abort early if it doesn't meet requirements.
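A sketch of what that early check could look like, assuming the caller has already extracted the server version string (e.g. from `kubectl version`); the constant and method names are illustrative, not the gem's actual API:

```ruby
# Sketch: abort early when the target cluster is older than the minimum
# supported Kubernetes version. Gem::Version handles semver comparison.
MIN_K8S_VERSION = Gem::Version.new("1.6.0")

def check_cluster_version!(server_version_string)
  # Strip a leading "v" and any build metadata suffix like "+coreos.0"
  version = Gem::Version.new(server_version_string.sub(/\Av/, "").split("+").first)
  return if version >= MIN_K8S_VERSION
  raise "Cluster is running v#{version}; this tool requires v#{MIN_K8S_VERSION} or higher"
end
```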
We dropped the numberAvailable component in the status field due to it being inconsistent, and use numberReady instead, but we need to reevaluate numberAvailable after a kubernetes version upgrade. Related to #148.
/cc: @Shopify/cloudplatform @jonpulsifer
We should be sure to investigate #51 before doing this.
A test in @karanthukral's PR points out that the Pod code is assuming the readinessProbes are HTTP:
^ This message makes no sense, oops!
I noticed today that that message can also get displayed when the probes are fine but the rollout was reeeallly slow so the container just happens to be starting. Maybe we should change the beginning to be less confident, e.g. "Your pods are running, but are not ready. Please make sure they're passing their readiness probes". And then either not push a probe-specific message if probe_location is blank, or adjust it to work / make sense for both types of probes.
@karanthukral would you be interested in fixing this?
Likely a simple matter of printing the output of the command that did the prune.
We nominally support ruby 2.1 or greater, and have Rubocop set up accordingly. However, it turns out this has not been true since we added the ActiveSupport dependency:
activesupport-5.1.1 requires ruby version >= 2.2.2, which is incompatible with the current version, ruby 2.1.8p440
Should we:
A) Change the dependency to fall in line with shipit-engine's minimum version as originally intended; or
B) Set an independent, higher requirement.
Given the extent to which we work with variable parsed JSON in this gem, the safe navigation operator introduced in 2.3 would be tremendously useful. And although we certainly designed this gem around Shipit's use case, it isn't all that functionally tied to Shipit. For those reasons, I vote for (B), and making that requirement ruby 2.3.
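To make the 2.3 argument concrete, here is the kind of nil-guard that `&.` and `Hash#dig` (also new in 2.3) collapse when walking parsed kubectl JSON. The helper name and input shape are illustrative only:

```ruby
# Sketch: pre-2.3 this requires chained `&&` nil guards; with 2.3 it is
# a single dig plus safe navigation.
def ready_condition(pod_hash)
  # conditions may be missing entirely on a pod that was just created
  pod_hash.dig("status", "conditions")&.find { |c| c["type"] == "Ready" }
end
```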
@Shopify/cloudplatform @kirs
If you deploy a SHA that doesn't exist, kubernetes-deploy will continue and the container will end up in ImagePullBackOff until the image is up.
Should we consider blocking in kubernetes-deploy, or failing the deploy early, instead of resorting to a timeout?
How can we check this with the appropriate credentials?
It's worth noting that at Shopify we do this check in our present deployment wrapper around kubernetes-deploy (Capistrano).
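One possible credential-light approach is the Docker Registry v2 API: a HEAD request on the manifest endpoint returns success iff the tag exists. This is a sketch under assumptions (public registry, no auth token flow, and a simple host/repo/tag split of the image reference), not something the gem does today:

```ruby
require "net/http"
require "uri"

# Sketch: pre-deploy image existence check against the Docker Registry v2
# manifest endpoint. Private registries would additionally need an auth
# token, which is omitted here.
def manifest_uri(registry, repo, tag)
  URI("https://#{registry}/v2/#{repo}/manifests/#{tag}")
end

def image_exists?(registry, repo, tag)
  req = Net::HTTP::Head.new(manifest_uri(registry, repo, tag))
  req["Accept"] = "application/vnd.docker.distribution.manifest.v2+json"
  res = Net::HTTP.start(registry, 443, use_ssl: true) { |http| http.request(req) }
  res.is_a?(Net::HTTPSuccess)
end
```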
Right now, app maintainers need to push a new commit, triggering a new container build, to "restart" their application. We need a true restart mechanism, ideally enabling users to choose to restart only a specific deployment rather than all of them if desired.
Note that there has been some discussion re: adding a feature like this to kubectl. It's still open, but my read on the tl;dr there is that the maintainers are conceptually opposed to implementing it, since restarts should not be necessary when no changes have been made to the spec and both liveness and readiness probes are passing.
As that k8s issue points out, rather than implementing selective pod deletion in accordance with the deployments rollout strategy, this can be achieved by patching the deployment's podspec, e.g. with an environment variable, a label or an annotation containing a timestamp.
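The timestamp-patch approach described above can be sketched as follows. Patching an annotation on the pod template changes the spec hash, so the Deployment controller performs a normal rolling update; the annotation key here is hypothetical:

```ruby
require "json"
require "time"

# Hypothetical annotation key for marking restarts
RESTARTED_AT = "kubernetes-deploy.shopify.io/restartedAt"

# Sketch: build a strategic-merge patch that bumps a pod-template
# annotation, triggering a rolling restart of one deployment.
def restart_patch(time = Time.now.utc)
  {
    "spec" => {
      "template" => {
        "metadata" => { "annotations" => { RESTARTED_AT => time.iso8601 } }
      }
    }
  }
end

# Usage (illustrative):
#   kubectl patch deployment web --namespace=my-ns -p "#{JSON.generate(restart_patch)}"
```

Because this targets a single deployment, it also naturally supports restarting only one deployment rather than all of them.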
The initial interval was chosen pretty arbitrarily, and I have the subjective impression that it is too short, leading deploys to be noisier than necessary. We should gather some data on this and optimize it.
Shipit exposes a $TASK_ID to its deploy scripts. IMO we should be using this when present instead of generating a new one.
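The change itself is small; a sketch, assuming the current behaviour is "always generate a random id" (the method name is illustrative):

```ruby
require "securerandom"

# Sketch: reuse Shipit's $TASK_ID when present, falling back to a
# freshly generated id otherwise.
def deploy_id(env = ENV)
  env["TASK_ID"] || SecureRandom.hex(4)
end
```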
Make Kubectl's commands use --kubeconfig instead of relying on the env var. This will enable us to stop modifying the env var in tests. We should still derive the value of the flag from the environment, as is standard in other tools, so this change will be transparent to end users.
Currently, Shipit rollbacks using this gem are identical to a deploy of the previous revision, which means that unmanaged pods will get run. This is likely contrary to expected behaviour (though I have not heard any specific complaints), and it unnecessarily slows down rollbacks while the pod is deployed and runs to completion. Note that our primary use case for unmanaged pods is Rails migrations, in which case this is launching a container with the old revision and running rake db:migrate as it would have when that revision was first deployed.
run and run! methods of each Task class).
Currently seeing:
[INFO][2017-09-11 20:25:03 +0000]
[INFO][2017-09-11 20:25:03 +0000] ------------------------------------Phase 1: Initializing deploy------------------------------------
[INFO][2017-09-11 20:25:03 +0000] All required parameters and files are present
[INFO][2017-09-11 20:25:03 +0000] Context pipa-test found
[INFO][2017-09-11 20:25:34 +0000]
[INFO][2017-09-11 20:25:34 +0000] ------------------------------------------Result: FAILURE-------------------------------------------
[FATAL][2017-09-11 20:25:34 +0000] Namespace pipa-test not found
This is a bit confusing to users since it says "Context pipa-test found" which seems to refer to the kube config file, but the actual error occurs at the namespace lookup.
We have more information on why it failed, so we should provide what we can to the user. For example, we could dump relevant pod logs and events, or minimally log an error along the lines of "Go look at your logs or bug tracker" for container-related failures. Note that the cloudplatform team is considering annotating namespaces with app info, possibly including logs/bugs urls, which could be used to enhance such messages if/when available.
If you kill the deploy somewhere between the initiation and completion of an action, the deploy summary printed will be inaccurate. For example, if you abort the deploy between the predeploy and the main deploy, the summary will say "No actions taken". Ideas for solutions:
We're currently using kubectl's built-in logic for selecting which pod to dump logs from for deployments/replicaSets. That code is here, and the tl;dr is that it tries to select a pod that is more likely to actually have logs. When all the pods in an RS are failing, this is perfect. However, when some are succeeding, this logic is likely to select the good ones, which often have a large volume of irrelevant content. We should consider something like:
if @pods.map(&:deploy_succeeded?).uniq.length > 1 # split-result ReplicaSet
most_useful_pod = @pods.find(&:deploy_failed?) || @pods.find(&:deploy_timed_out?)
most_useful_pod.fetch_logs
else
# current logic
end
It's worth noting that in most cases I've seen, the bad pods in a split-result ReplicaSet are failing at a very early stage (can't pull image, can't mount volume, etc.), so in practice the effect might be suppressing irrelevant logs rather than actually grabbing relevant ones.
I have an invalid ingress that I tried to deploy. It reported success but the ingress did not change or show me the error.
Error should have been: error: ingresses "web" could not be patched: cannot convert int64 to string
What I saw:
[INFO][2017-09-08 19:28:45 +0000] Deploying resources:
[INFO][2017-09-08 19:28:45 +0000] - Ingress/web (timeout: 30s)
Successfully deployed in 8.8s: Ingress/web
The class for ingresses currently uses a basic exists? check, although it isn't really ready until it has a public IP. It should be feasible to watch for this (see https://kubernetes.io/docs/api-reference/v1.6/#loadbalanceringress-v1-core), but I'm not 100% sure it is worthwhile. We'll need to look into whether there are cases when no IP would be expected and if so whether they're distinguishable. For example, IIRC the nginxinc ingress controller was not writing IPs back to ingress statuses when we were using it. And of course if you have no ingress controller deployed at all, you won't get an IP; I imagine this would be difficult to test in minikube.
We only selected a few that our own apps commonly use for the initial rollout. Currently unrecognized types will be kubectl applied and a warning will be logged about the fact that the script does not know how to check whether they actually came up.
Got this exception while trying to deploy wedge-viewer for the first time, looks like the @deploy_started variable hasn't been set by the time we try to use it there :(
Full trace:
[INFO][2017-06-06 23:07:21 +0000] ----------------------------Phase 2: Checking initial resource statuses-----------------------------
[INFO][2017-06-06 23:07:23 +0000]
[INFO][2017-06-06 23:07:23 +0000] ------------------------------------------Result: FAILURE-------------------------------------------
[FATAL][2017-06-06 23:07:23 +0000] No actions taken
/usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:82:in `block in fetch_logs': undefined method `to_datetime' for nil:NilClass (NoMethodError)
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:77:in `each'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:77:in `each_with_object'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:77:in `fetch_logs'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:98:in `display_logs'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:26:in `sync'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/runner.rb:90:in `each'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/runner.rb:90:in `run'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/exe/kubernetes-deploy:61:in `<top (required)>'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `load'
from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `<main>'
Deploy here: https://shipit.shopify.io/shopify/wedge-viewer/production/deploys/452708
The first deploy for this stack failed in an interesting way: I forced shipit to deploy before the container was ready (because I forgot there was a container build step), so maybe its failure condition has something to do with why this is happening now? See it here: https://shipit.shopify.io/shopify/wedge-viewer/production/deploys/452703
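The trace points at fetch_logs calling `.to_datetime` on a nil @deploy_started, so the likely fix is a nil guard before the time-scoped log fetch. A simplified sketch of the guard (the function shape here is a standalone stand-in for the gem's Pod#fetch_logs, and the parameter names are assumptions):

```ruby
# Sketch: guard against @deploy_started being unset when logs are fetched
# before the deploy has actually started any pods.
def fetch_logs(deploy_started, container_names)
  return {} unless deploy_started  # deploy never started; nothing to scope logs by

  container_names.each_with_object({}) do |name, logs|
    # the real implementation shells out to `kubectl logs` with a
    # --since-time flag derived from deploy_started
    logs[name] = []
  end
end
```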
The kubestatus logs when this happens are super unhelpful. We should probably make this abort the deploy.
See this deploy for example. We should either:
Theoretically, this could be caused by a condition that could resolve itself, but I don't recall ever seeing this be the case in practice. As a result I'd lean towards the simpler option (1) at least to start.
When I extracted the script from shipit-engine, I named the repo as the snippet was named in Shipit. However this isn't a great name for the project because all Shipit snippets are named in a very straightforward manner:
$ ls lib/snippets
deploy-to-gke
extract-gem-version
fetch-gem-version
fetch-heroku-version
push-to-heroku
...
These are great names for a bash snippet in bin/, but not for an individual project.
My point here is that we didn't call it "sysv-resource-limiter", we called it "Semian".
I think this project would benefit from an expressive name.
There is currently rudimentary detection for ImagePullBackOff and RunContainerError, but it results in warnings most people ignore. These states should fail the deploy either immediately or after they have been seen a couple of times in a row (livenessProbe-style).
Related #34.
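The livenessProbe-style option could be sketched as a consecutive-failure counter, so a transient ImagePullBackOff while an image is still propagating doesn't kill the deploy instantly. Class, constant names, and the threshold of 3 are all assumptions:

```ruby
# States that should eventually fail the deploy (hypothetical list)
FATAL_STATES = ["ImagePullBackOff", "RunContainerError"].freeze
FAILURE_THRESHOLD = 3

# Sketch: fail only after the bad state has been observed
# FAILURE_THRESHOLD times in a row, resetting on any healthy observation.
class ContainerStateWatcher
  def initialize
    @consecutive_failures = 0
  end

  def observe(state_reason)
    if FATAL_STATES.include?(state_reason)
      @consecutive_failures += 1
    else
      @consecutive_failures = 0
    end
  end

  def deploy_failed?
    @consecutive_failures >= FAILURE_THRESHOLD
  end
end
```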
Example: Pod/db-migrate-215a39ad-e344fa2b failed to deploy with status 'Failed (Reason: )