shopify / krane Goto Github PK

View Code? Open in Web Editor NEW

1.4K 421.0 114.0 9.23 MB

A command-line tool that helps you ship changes to a Kubernetes namespace and understand the result

License: MIT License

Ruby 99.47% Shell 0.53%

kubernetes deploy-tool

krane's Issues

Error: the server could not find the requested resource

Saw this:

[FATAL][2017-07-31 21:33:12 +0000]	> Error from kubectl:
[FATAL][2017-07-31 21:33:12 +0000]	    Error from server (InternalError): error when applying patch:
[FATAL][2017-07-31 21:33:12 +0000]	    {"spec":{"containers":[{"name":"command-runner","resources":{"limits":{"cpu":"1000m"}}}]}}
[FATAL][2017-07-31 21:33:12 +0000]	    to:
[FATAL][2017-07-31 21:33:12 +0000]	    &{0xc42100b680 0xc420152070 <app> n upload-assets-7b4d9314-1f4a36a4 /tmp/Pod-upload-assets-7b4d9314-1f4a36a420170731-5277-12341p6.yml 0xc4205cca28 0xc420f4c000 82075679 false}
[FATAL][2017-07-31 21:33:12 +0000]	    for: "/tmp/Pod-upload-assets-7b4d9314-1f4a36a420170731-5277-12341p6.yml": an error on the server ("Internal Server Error: \"/api/v1/namespaces/<app>/pods/upload-assets-7b4d9314-1f4a36a4\": the server could not find the requested resource") has prevented the request from succeeding (patch pods upload-assets-7b4d9314-1f4a36a4)
[FATAL][2017-07-31 21:33:12 +0000]	> Rendered template content:

It appears that the pod died while/before it was being updated with the resource limit.

Could this be a concurrency bug? Subsequent deploy passed fine.

Deployment rollout monitoring revamp

The fact that deployment rollout monitoring currently looks at all pods associated with that deployment instead of pods in the new ReplicaSet has caused several different bugs:

~~Deploys never succeed when there are evicted pods associated with the deployment, even though those pods are old~~ (fixed another way)
Pod warnings get shown for pods that are being shut down, in the case where the last deploy was bad and the current one is actually succeeding (very confusing output)
Deploy success is delayed by waiting for all old pods to fully disappear
~~This false-positive deploy result is likely caused by this.~~ ~~Last poll before it "succeeded" was:~~ (I now think it probably briefly became available before failing a probe, or something like that)

[KUBESTATUS] {"group":"Deployments","name":"jobs","status_string":"Waiting for rollout to finish: 0 of 1 updated replicas are available...","exists":true,"succeeded":true,"failed":false,"timed_out":false,"replicas":{"updatedReplicas":1,"replicas":1,"availableReplicas":1},"num_pods":1}

Related technical notes:

We are selecting related pods using an assumption that they are labelled with the deployment name. I believe this is a bad assumption that has flown under the radar so far because all our templates are labelled this way by convention. The new ReplicaSet version should not do this.
Here's how kubectl gets old/new rs

Increase detection of unrecoverable states

As in #54 and #34 which other states should be detected and similarly handled.

Support Shipit visualization

This feature will mostly be on the shipit-engine side, but will require some work in this gem. The primary purpose of the KUBESTATUS logs is to support such a visualization. Some thoughts:

Ideally these logs would be hidden from the deploy output if possible (though it probably isn't) to make the deploy output itself more human-friendly than it typically is today.
Having the visualization help educate app maintainers about what actually happens during kubernetes deploys should be a goal. For example:
- make surge/unavailability visible
- don't make it look like pods restart rather than are replaced
If we stay with representing the entities being rolled out with coloured boxes, perhaps we could add a hover state revealing more info about that entity, e.g. the name and status string, and the logs if it is a pod and has failed.

Remove special handling of TPR creation/update

As of v1.6, kubectl apply can handle TPRs properly.

K8s 1.6 HorizontalPodAutoscaler is no longer supported in extensions/v1beta1

HorizontalPodAutoscaler is no longer supported in extensions/v1beta1 version. Use autoscaling/v1 instead.

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md#autoscaling-1

This causes us to fail deploys: https://shipit.shopify.io/shopify/k8s-cluster-services/production/deploys/435598

Bug appears to be this line:
https://github.com/Shopify/kubernetes-deploy/blob/22972eeb193e862b2521cec81e730e7bba9871ea/lib/kubernetes-deploy/runner.rb#L56

Daemon set timeouts fail

/usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/kubernetes_resource/daemon_set.rb:42:in `timeout_message': undefined method `map' for nil:NilClass (NoMethodError)
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/kubernetes_resource.rb:130:in `debug_message'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:168:in `block in record_statuses'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:168:in `each'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:168:in `record_statuses'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/lib/kubernetes-deploy/runner.rb:124:in `run'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.12.0/exe/kubernetes-deploy:70:in `<top (required)>'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `load'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `<main>'
kubernetes-deploy ci ci-east --bindings=region=us-east1 exited with status 1

`spy shipit restart shopify` should roll k8s pods

Zero-replicas deployments never succeed

Example

The monitoring for both the deployment (which currently has a guard clause on at least one replica being available) and the service (which waits to have endpoints to succeed) is causing this.

Unit tests for polling logic

Particular attention to different k8s json that can be returned
- especially in failure cases that fail (or should fail) deploys, e.g. bad container images, referencing missing configmaps or secrets
turn on mocha stub safety: http://gofreerange.com/mocha/docs/Mocha/Configuration.html

Have Secret Provisioner respect no-prune flag

What?

Currently the Secret Provisioner does not respect the no-prune flag
The resource should be updated to respect the flag

kubernetes-restart command not found

Ran into this trying to restart via shipit

Kube-restart failed
$ bundle exec kubernetes-restart identity-production tier3
pid: 20076
bundler: failed to load command: kubernetes-restart (/app/data/bundler/ruby/2.3.0/bin/kubernetes-restart)

@KnVerey @kirs right now folks trying to do a restart via shipit are hitting this bug

Bad output when scaling Deployment to zero

The Deployment was in fact successfully scaled to zero. The zero-replica integration test deploys zero replicas to begin with; we need one that deploys 1 and then scales to 0 in a second deploy.

Generic Custom Resource Definition support

Instead of hardcoding details about what Shopify's TPR controllers are expected to create, we should provide generic support based on a conventional status field. According to @wfarr investigation, such a status field will be easier to implement in the new CustomResourceDefinitions.

We'd query the cluster for valid CRD types at the beginning of the deploy, before creating resource instances.
CRD instances would be expected to expose status fields from which we can derive their success/failure
- We should have a convention for this that lines up with the fields/values first-party controllers set
- We can provide an override mechanism, likely an annotation, for specifying custom fields/values

Command failure warnings should include the command itself

Demo Folder?

Hi Shopify,

I am trying to get started using this tool but would love an example or demo folder with sample app, template and template erb to hit the ground running. Is this something that exists in another repo?

Failed deploy reported successful

Syntax errors in the k8s template error out, but the deploy still reports as successful:

[WARN]	The following command failed: kubectl apply -f /tmp/testing-deployment.yml.erb20170209-12190-s9f69e --prune --all --prune-whitelist\=core/v1/ConfigMap --prune-whitelist\=core/v1/Pod --prune-whitelist\=core/v1/Service --prune-whitelist\=batch/v1/Job --prune-whitelist\=extensions/v1beta1/DaemonSet --prune-whitelist\=extensions/v1beta1/Deployment --prune-whitelist\=extensions/v1beta1/HorizontalPodAutoscaler --prune-whitelist\=extensions/v1beta1/Ingress --prune-whitelist\=apps/v1beta1/StatefulSet --namespace\=pipa-test --context\=pipa-test
[WARN]	error: unable to decode "/tmp/testing-deployment.yml.erb20170209-12190-s9f69e": [pos 290]: json: expect char '"' but got char '8'
Waiting for Deployment/pipa-test
[KUBESTATUS] {"group":"Pipa-test deployment","name":"pipa-test-1149558948-5l604","status_string":"Running (Ready: true)","exists":true,"succeeded":true,"failed":false,"timed_out":false}
...
Spent 0.58s waiting for Deployment/pipa-test
Deploy succeeded!

Should this error be caught in shipit-engine or in this gem?

Enforce minimum kubernetes version

We currently require Kubernetes v1.6 and will soon move to v1.7. All three executables should check the version in the target cluster and abort early if it doesn't meet requirements.

Reevaluate numberAvailable for Daemon Sets after k8s upgrade

What?

During testing with Daemon Sets and kubernetes deploy we ran into issues using numberAvailable component in the status field due to it being inconsistent
Currently we are using numberReady but we need to reevaluate numberAvailable after a kubernetes version upgrade

related to #148

Add Shipit stack for rubygems

https://rubygems.org/profiles/shopify-shipit should be an owner.
Shipit stack needs to be added, like: https://shipit.shopify.io/shopify/buildkit/rubygems

/cc: @Shopify/cloudplatform @jonpulsifer

Use kubeclient gem in place of kubectl in polling code

We should be sure to investigate #51 before doing this.

Readiness probe message makes bad assumptions

A test in @karanthukral PR points out that the Pod code is assuming the readinessProbes are HTTP:

^ This message makes no sense, oops! 😄

I noticed today that that message can also get displayed when the probes are fine but the rollout was reeeallly slow so the container just happens to be starting. Maybe we should change the beginning to be less confident, e.g. "Your pods are running, but are not ready. Please make sure they're passing their readiness probes". And then either not push a probe-specific message if probe_location is blank, or adjust it to work / make sense for both types of probes.

@karanthukral would you be interested in fixing this?

Need output re: which resources got pruned

Likely a simple matter of printing the output of the command that did the prune.

Revisit supported ruby version

We nominally support ruby 2.1 or greater, and have Rubocop set up accordingly. However, it turns out this has not been true since we added the ActiveSupport dependency:

activesupport-5.1.1 requires ruby version >= 2.2.2, which is incompatible with the current version, ruby 2.1.8p440

Should we:

A) Change the dependency to fall in line with shipit-engine's minimum version as originally intended; or

B) Set an independent, higher requirement.

Given the extent to which we work with variable parsed JSON in this gem, the safe navigation operator introduced in 2.3 would be tremendously useful. And although we certainly designed this gem around Shipit's use case, it isn't all that functionally tied to Shipit. For those reasons, I vote for (B), and making that requirement ruby 2.3.

@Shopify/cloudplatform @kirs

Check if Image exists

If you deploy a SHA that doesn't exist, kubernetes-deploy will continue and the container will end up in ImagePullBackOff until the image is up.

Should we consider blocking in kubernetes-deploy or failing the deploy early, instead of resorting to a timeout?

How can we check this with the appropriate credentials?

It's worth noting that in Shopify we do this check in our present deployment wrapper around kubernetes-deploy (Capistrano).

What do you think @KnVerey @kirs?

Need to support a "restart" button in Shipit

Right now, app maintainers need to push a new commit, triggering a new container build, to "restart" their application. We need a true restart mechanism, ideally enabling users to choose to restart only a specific deployment rather than all of them if desired.

Note that there has been some discussion re: adding a feature like this to kubectl. It's still open, but my read on the tl;dr there is that the maintainers are conceptually opposed to implementing it, since restarts should not be necessary when no changes have been made to the spec and both liveness and readiness probes are passing.

As that k8s issue points out, rather than implementing selective pod deletion in accordance with the deployments rollout strategy, this can be achieved by patching the deployment's podspec, e.g. with an environment variable, a label or an annotation containing a timestamp.

Increase integration test coverage

Regression tests for fixes we merged before the framework was ready (e.g. #31)
Cloudsql and Redis third party resources?
Up-front validation failures
Audit code for cases that result in hard deploy failures and make sure they're all covered

Support Helm-style templating

{env}.yaml with most values app maintainers will change parameterized
partials
probably not ERB (cloudplatform team did prototypes of this in both ERB and Golang)

Optimize polling interval

The initial interval was chosen pretty arbitrarily, and I have the subjective impression that is is too short, leading deploys to be noisier than necessary. We should gather some data on this and optimize it.

Use $TASK_ID as deploy ID when running in Shipit

Shipit exposes a $TASK_ID to its deploy scripts. IMO we should be using this when present instead of generating a new one.

Use --kubeconfig instead of env var in kubectl invocations

Make Kubectl's commands use --kubeconfig instead of relying on the env var. This will enable us to stop modifying the env var in tests. We should still derive the value of the flag from the environment, as is standard in other tools, so this change will be transparent to end users.

Rollback should skip unmanaged pods

Currently, Shipit rollbacks using this gem are identical to a deploy of the previous revision, which means that unmanaged pods will get run. This is likely contrary to expected behaviour (though I have not heard any specific complaints), and it unnecessarily slows down rollbacks while the pod is deployed and runs to completion. Note that our primary use case for unmanaged pods is Rails migrations, in which case this is launching a container with the old revision and running rake db:migrate as it would have when that revision was first deployed.

Try to find a way around having two kubeclient helpers in tests

#24 (comment)

Better documentation

Improve readme (better explanation of core functionality and options, screenshots)
Convert all existing comment-based docs to rdoc or yard
Add comment-based docs for the key methods you need to use the tasks from Ruby instead of from the CLI (i.e. the run and run! methods of each Task class).

Better error when the kube config has the wrong master IP

Currently seeing:

[INFO][2017-09-11 20:25:03 +0000]	
[INFO][2017-09-11 20:25:03 +0000]	------------------------------------Phase 1: Initializing deploy------------------------------------
[INFO][2017-09-11 20:25:03 +0000]	All required parameters and files are present
[INFO][2017-09-11 20:25:03 +0000]	Context pipa-test found
[INFO][2017-09-11 20:25:34 +0000]	
[INFO][2017-09-11 20:25:34 +0000]	------------------------------------------Result: FAILURE-------------------------------------------
[FATAL][2017-09-11 20:25:34 +0000]	Namespace pipa-test not found

This is a bit confusing to users since it says "Context pipa-test found" which seems to refer to the kube config file, but the actual error occurs at the namespace lookup.

More helpful output when deploys fail

We have more information on why it failed, so we should provide what we can to the user. For example, we could dump relevant pod logs and events, or minimally log an error along the lines of "Go look at your logs or bug tracker" for container-related failures. Note that the cloudplatform team is considering annotating namespaces with app info, possibly including logs/bugs urls, which could be used to enhance such messages if/when available.

Summary section inaccurate when deploy aborted

If you kill the deploy somewhere between the initiation and completion of an action, the deploy summary printed will be inaccurate. For example, if you abort the deploy between the predeploy and the main deploy, the summary will say "No actions taken". Ideas for solutions:

Introduce a mechanism for tracking the action currently being attempted and reporting it on failure
Don't print the summary section when the process has been killed
Change the message printed to be more ¯\_(ツ)_/¯

Prioritize failed pod as debug log source for split-result RS

We're currently using kubectl's built-in logic for selecting which pod to dump logs from for deployments/replicaSets. That code is here and tl;dr is trying to select a pod that is more likely to actually have logs. When all the pods in an RS are failing, this is perfect. However, when some are succeeding, this logic is likely to select the good ones, which often have a large volume of irrelevant content. We should consider something like:

if @pods.map(&:deploy_succeeded?).uniq.length > 1 # split-result ReplicaSet
  most_useful_pod = @pods.find(&:deploy_failed?) || @pods.find(&:deploy_timed_out?)
  most_useful_pod.fetch_logs
else
  # current logic
end

It's worth noting that in most cases I've seen, the bad pods in a split-result ReplicaSet are failing at a very early stage (can't pull image, can't mount volume, etc.), so in practice the effect might be suppressing irrelevant logs rather than actually grabbing relevant ones.

cc @kirs @karanthukral

ingress fails to deploy but reports success

I have an invalid ingress that I tried to deploy. It reported success but the ingress did not change or show me the error.

Error should have been: error: ingresses "web" could not be patched: cannot convert int64 to string

What I saw:

[INFO][2017-09-08 19:28:45 +0000]	Deploying resources:
[INFO][2017-09-08 19:28:45 +0000]	- Ingress/web (timeout: 30s)
Successfully deployed in 8.8s: Ingress/web

Make Ingress wait to receive an IP

The class for ingresses currently uses a basic exists? check, although it isn't really ready until it has a public IP. It should be feasible to watch for this (see https://kubernetes.io/docs/api-reference/v1.6/#loadbalanceringress-v1-core), but I'm not 100% sure it is worthwhile. We'll need to look into whether there are cases when no IP would be expected and if so whether they're distinguishable. For example, IIRC the nginxinc ingress controller was not writing IPs back to ingress statuses when we were using it. And of course if you have no ingress controller deployed at all, you won't get an IP--I imagine this would be difficult to test in minikube.

Support more resource types

We only selected a few that our own apps commonly use for the initial rollout. Currently unrecognized types will be kubectl applied and a warning will be logged about the fact that the script does not know how to check whether they actually came up.

pod.rb:82:in `block in fetch_logs': undefined method `to_datetime' for nil:NilClass

Got this exception while trying to deploy wedge-viewer for the first time, looks like the @deploy_started variable hasn't been set by the time we try to use it there :(

Full trace:


[INFO][2017-06-06 23:07:21 +0000]	----------------------------Phase 2: Checking initial resource statuses-----------------------------
[INFO][2017-06-06 23:07:23 +0000]	
[INFO][2017-06-06 23:07:23 +0000]	------------------------------------------Result: FAILURE-------------------------------------------
[FATAL][2017-06-06 23:07:23 +0000]	No actions taken
/usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:82:in `block in fetch_logs': undefined method `to_datetime' for nil:NilClass (NoMethodError)
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:77:in `each'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:77:in `each_with_object'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:77:in `fetch_logs'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:98:in `display_logs'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/kubernetes_resource/pod.rb:26:in `sync'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/runner.rb:90:in `each'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/lib/kubernetes-deploy/runner.rb:90:in `run'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/lib/ruby/gems/2.3.0/gems/kubernetes-deploy-0.7.4/exe/kubernetes-deploy:61:in `<top (required)>'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `load'
	from /usr/lib/ruby-shopify/ruby-shopify-2.3.3/bin/kubernetes-deploy:22:in `<main>'

Deploy here: https://shipit.shopify.io/shopify/wedge-viewer/production/deploys/452708

The first deploy for this stack failed in an interesting way where I forced shipit to deploy before the container was ready (because I forgot there was a container build step), so maybe it's failure condition has something to do with why this is happening now? See it here: https://shipit.shopify.io/shopify/wedge-viewer/production/deploys/452703

Add timestamps to logger(s)?

Detect init container failures

The kubestatus logs when this happens are super unhelpful. We should probably make this abort the deploy.

Pod CrashBackoffLoop detection and handling

See this deploy for example. We should either:

fail the deploy immediately; or
warn the first 1 or 2 times this condition is seen and fail it on the next occurrence.

Theoretically, this could be caused by a condition that could resolve itself, but I don't recall ever seeing this be the case in practice. As a result I'd lean towards the simpler option (1) at least to start.

Enable per-resource timeout overrides via an annotation

Originally suggested in #60. See also #62. If this pattern works nicely, it could come in handy for other types of overrides (e.g. priority resource identification) as well.

cc @kirs @sirupsen @wvanbergen @DazWorrall

Pick a name?

When I extracted the script from shipit-engine, I named the repo as the snippet was named in Shipit. However this isn't a great name for the project because all Shipit snippets are named in a very straightforward manner:

$ ls lib/snippets
deploy-to-gke
extract-gem-version
fetch-gem-version
fetch-heroku-version
push-to-heroku
...

These are great names for a bash snippet in bin/, but not for a individual project.
My point here is that we didn't call it "sysv-resource-limiter", we called it "Semian".
I think this project would benefit from an expressive name.

thoughts? @sirupsen @KnVerey

ImagePullBackOff and RunContainerError should fail deploys

There is currently rudimentary detection for ImagePullBackOff and RunContainerError, but it results in warnings most people ignore. These states should fail the deploy either immediately or after they have been seen a couple times in a row (livenessProbe-style).

Related #34.

Unmanaged pod failure reason is missing

Example: Pod/db-migrate-215a39ad-e344fa2b failed to deploy with status 'Failed (Reason: )

shopify / krane Goto Github PK

krane's Issues

What?

What?

Recommend Projects

Recommend Topics

Recommend Org