gardener-attic / gardener-resource-manager
Kubernetes resource reconciliation controller capable of performing health checks for managed resources.
Home Page: https://gardener.cloud
How to categorize this issue?
/area quality
/area usability
/kind bug
/priority normal
What happened:
grm applied a Service with .spec.clusterIP set (kube-dns with x.x.x.10), but the field was discarded while merging and the Service got assigned a different cluster IP. That way, the guestbook app failed to look up DNS entries on x.x.x.10.
What you expected to happen:
grm to use the provided clusterIP.
How to reproduce it (as minimally and precisely as possible):
Let grm apply a Service with spec.clusterIP set.
Anything else we need to know?:
Environment:
Kubernetes version (kubectl version): v1.18.2
How to categorize this issue?
/area robustness quality
/kind enhancement
/priority normal
What would you like to be added:
GRM could add a hash suffix to the names of ConfigMaps or Secrets that are mounted by Deployments/DaemonSets/etc. (it would have to adapt the names also in the .volumes[] section). The hash could be a checksum of the ConfigMap's/Secret's .data map.
This should be done by default, however, it should be possible to disable it per ManagedResource via a field in the .spec.
Additionally, the ConfigMaps/Secrets should be marked immutable.
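A minimal sketch of how such a checksum suffix could be computed (plain Go standard library; the helper name and the truncation to 8 characters are illustrative assumptions, not the actual GRM implementation):

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "sort"
)

// hashSuffix returns a short, stable checksum over a ConfigMap's .data map.
func hashSuffix(data map[string]string) string {
    keys := make([]string, 0, len(data))
    for k := range data {
        keys = append(keys, k)
    }
    sort.Strings(keys) // map iteration order is random, so sort for determinism

    h := sha256.New()
    for _, k := range keys {
        h.Write([]byte(k))
        h.Write([]byte(data[k]))
    }
    return hex.EncodeToString(h.Sum(nil))[:8]
}

func main() {
    data := map[string]string{"config.yaml": "foo: bar"}
    // The Deployment's .spec.template.spec.volumes[] entry would have to
    // reference the same suffixed name, e.g. "my-config-<suffix>".
    fmt.Println("my-config-" + hashSuffix(data))
}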
Why is this needed:
GRM creates the Kubernetes resources managed by a ManagedResource in parallel. This can lead to issues when e.g. a Deployment is mounting a ConfigMap/Secret and the Deployment is updated before the ConfigMap/Secret, and/or the update of the ConfigMap/Secret fails for whatever reason. In rare cases it can happen that a new pod is already rolled out and started but still mounting the old ConfigMap/Secret data. Eventually, GRM will also succeed updating the ConfigMap/Secret, however, the pod is already running and has started with the old configuration data. This can lead to issues when the new pod is incompatible with the old configuration data.
Another problematic case is when the kubelet's ConfigMap/Secret cache is stale. We have observed this very rarely in the past: the kubelet starts a pod but mounts the old/outdated configuration data. Eventually, the kubelet's cache will be up-to-date again, but the pod might have already been started and read the old config.
With the above implementation (hash suffix) both problematic scenarios cannot happen anymore because a new pod would only start with the new ConfigMap/Secret.
Implementation considerations:
We would need a mechanism to clean up the old/no longer used ConfigMaps/Secrets, perhaps with a separate controller.
Let's look at the following example:
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example2
  namespace: default
type: Opaque
data:
  other-objects.yaml: YXBpVmVyc2lvbjogYXBwcy92MSAjIGZvciB2ZXJzaW9ucyBiZWZvcmUgMS45LjAgdXNlIGFwcHMvdjFiZXRhMgpraW5kOiBEZXBsb3ltZW50Cm1ldGFkYXRhOgogIG5hbWU6IG5naW54LWRlcGxveW1lbnQKICBuYW1lc3BhY2U6IGRlZmF1bHQKc3BlYzoKICBzZWxlY3RvcjoKICAgIG1hdGNoTGFiZWxzOgogICAgICBhcHA6IG5naW54CiAgcmVwbGljYXM6IDIgIyB0ZWxscyBkZXBsb3ltZW50IHRvIHJ1biAyIHBvZHMgbWF0Y2hpbmcgdGhlIHRlbXBsYXRlCiAgdGVtcGxhdGU6CiAgICBtZXRhZGF0YToKICAgICAgbGFiZWxzOgogICAgICAgIGFwcDogbmdpbngKICAgIHNwZWM6CiAgICAgIGNvbnRhaW5lcnM6CiAgICAgIC0gbmFtZTogbmdpbngKICAgICAgICBpbWFnZTogbmdpbng6MS43LjkKICAgICAgICBwb3J0czoKICAgICAgICAtIGNvbnRhaW5lclBvcnQ6IDgwCg==
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: nginx-deployment
# spec:
#   selector:
#     matchLabels:
#       app: nginx
#   replicas: 2 # tells deployment to run 2 pods matching the template
#   template:
#     metadata:
#       labels:
#         app: nginx
#     spec:
#       containers:
#       - name: nginx
#         image: nginx:1.7.9
#         ports:
#         - containerPort: 80
The Deployment is a namespaced resource but has no namespace specified in the metadata section. Consequently, the Gardener-Resource-Manager creates the resource with the default namespace, either specified in the target Kubeconfig or by the in-cluster config via the mounted ServiceAccount (/var/run/secrets/kubernetes.io/serviceaccount/namespace).
Furthermore, the Deployment is written to the status of the ManagedResource but without any namespace information. After that, the Gardener-Resource-Manager is not able to delete the resource any more.
To solve the issue we have to address two challenges:
1.) Dynamically find out what namespace is used when applying the resource, or set it via a command line flag, e.g. --default-namespace.
2.) If the resource has no namespace specified, find out if it's a namespaced resource at all and set the namespace from 1.) if applicable. For this, we'd need to use a DiscoveryClient.
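For illustration, a hedged sketch of the second step using a plain client-go discovery client (the kubeconfig handling and the helper name are assumptions, not existing GRM code):

package main

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/tools/clientcmd"
)

// isNamespaced reports whether the given kind in the given group/version is a
// namespaced resource, based on the API server's discovery information.
func isNamespaced(dc discovery.DiscoveryInterface, groupVersion, kind string) (bool, error) {
    list, err := dc.ServerResourcesForGroupVersion(groupVersion)
    if err != nil {
        return false, err
    }
    for _, r := range list.APIResources {
        if r.Kind == kind {
            return r.Namespaced, nil
        }
    }
    return false, fmt.Errorf("kind %s not found in %s", kind, groupVersion)
}

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        panic(err)
    }
    namespaced, err := isNamespaced(dc, "apps/v1", "Deployment")
    if err != nil {
        panic(err)
    }
    fmt.Println("Deployment namespaced:", namespaced) // true
}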
How to categorize this issue?
/area ops-productivity
/kind bug
What happened:
I see a Shoot cluster whose SystemComponentsHealthy condition is flapping quite often between healthy and unhealthy:
- type: SystemComponentsHealthy
  status: 'False'
  lastTransitionTime: '2021-02-15T09:55:34Z'
  lastUpdateTime: '2021-02-15T09:53:33Z'
  reason: ApplyFailed
  message: >-
    Could not apply all new resources: 1 error occurred: Operation cannot be
    fulfilled on horizontalpodautoscalers.autoscaling "coredns": the object
    has been modified; please apply your changes to the latest version and
    try again
When I check the logs of the gardener-resource-manager, I see that the shoot-core ManagedResource is being reconciled for more than 20m:
$ cat ~/gardener-resource-manager-dccf959d5-kcpc4.log | grep '"object":"shoot--foo--bar/shoot-core"' | grep -v 'controller-runtime.manager.health-controller'
{"level":"info","ts":"2021-02-15T10:38:59.091Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:38:59.091Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:52:16.759Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:52:16.759Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:52:16.759Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:02:59.166Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:02:59.166Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:24:23.950Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:24:23.950Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:43:51.943Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:43:51.944Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:43:51.944Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:01:52.421Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:01:52.422Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:01:52.422Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:25:14.030Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:25:14.030Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:51:01.543Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:51:01.543Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:51:01.543Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:11:27.942Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:11:27.942Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:11:27.942Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:31:54.363Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:31:54.363Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:54:46.759Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:54:46.760Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:54:46.760Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
What you expected to happen:
Reconciliation of the shoot-core ManagedResource to take at most a few minutes.
How to reproduce it (as minimally and precisely as possible):
Not clear for now.
In the logs of the gardener-resource-manager I also see messages about throttled requests:
$ cat ~/gardener-resource-manager-dccf959d5-kcpc4.log | grep 'Throttling request'
# full output omitted
I0215 13:57:12.080636 1 request.go:621] Throttling request took 13.597998551s, request: GET:https://kube-apiserver/apis/crd.projectcalico.org/v1?timeout=32s
I0215 13:57:22.280577 1 request.go:621] Throttling request took 9.197729656s, request: GET:https://kube-apiserver/apis/servicecatalog.kyma-project.io/v1alpha1?timeout=32s
I0215 13:57:32.280683 1 request.go:621] Throttling request took 4.598267175s, request: GET:https://kube-apiserver/apis/settings.svcat.k8s.io/v1alpha1?timeout=32s
I0215 13:57:43.480787 1 request.go:621] Throttling request took 1.195492007s, request: GET:https://kube-apiserver/apis/apps/v1?timeout=32s
I0215 13:57:53.680666 1 request.go:621] Throttling request took 11.394955641s, request: GET:https://kube-apiserver/apis/serverless.kyma-project.io/v1alpha1?timeout=32s
I0215 13:58:03.880641 1 request.go:621] Throttling request took 6.99571505s, request: GET:https://kube-apiserver/apis/settings.svcat.k8s.io/v1alpha1?timeout=32s
I0215 13:58:14.080556 1 request.go:621] Throttling request took 2.597793766s, request: GET:https://kube-apiserver/apis/extensions/v1beta1?timeout=32s
I0215 13:58:24.080630 1 request.go:621] Throttling request took 12.597752422s, request: GET:https://kube-apiserver/apis/batch/v1?timeout=32s
I0215 13:58:34.280663 1 request.go:621] Throttling request took 8.107359855s, request: GET:https://kube-apiserver/apis/autoscaling/v1?timeout=32s
I0215 13:58:44.480527 1 request.go:621] Throttling request took 3.798117917s, request: GET:https://kube-apiserver/apis/install.istio.io/v1alpha1?timeout=32s
I0215 13:58:54.680493 1 request.go:621] Throttling request took 13.997952271s, request: GET:https://kube-apiserver/apis/ui.kyma-project.io/v1alpha1?timeout=32s
I0215 13:59:04.680516 1 request.go:621] Throttling request took 9.397612672s, request: GET:https://kube-apiserver/apis/extensions/v1beta1?timeout=32s
I0215 13:59:14.880495 1 request.go:621] Throttling request took 4.998066968s, request: GET:https://kube-apiserver/apis/flows.knative.dev/v1alpha1?timeout=32
Not sure, but from the logs it seems that gardener-resource-manager is doing quite a lot of discovery calls.
Anything else we need to know?:
Environment:
Kubernetes version (kubectl version): v1.18.12
How to categorize this issue?
/area networking
/kind bug
/priority normal
What happened:
In https://github.com/gardener/gardener-extension-provider-packet, the CCM sets the spec.loadBalancerIP field for Services of type: LoadBalancer to the Elastic IP assigned to each service. This IP is then picked up by MetalLB and announced for the Service.
However, for services managed by the Gardener-Resource-Manager like kube-system/vpn-shoot, the spec.loadBalancerIP gets emptied every minute when the Gardener-Resource-Manager runs its reconciliation loop. This breaks the Shoot VPN, making the Shoot cluster state unstable.
What you expected to happen:
When the spec.loadBalancerIP field is set on the target resource and not specified on the source resource, the Gardener-Resource-Manager should keep the value instead of removing it from the target resource.
How to reproduce it (as minimally and precisely as possible):
Create a Service object managed by the resource manager without spec.loadBalancerIP set, then set spec.loadBalancerIP via the API on the target cluster.
Anything else we need to know?:
Perhaps https://github.com/gardener/gardener-resource-manager/blob/master/pkg/controller/managedresource/merger.go#L227 could be changed to preserve this field.
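For illustration, a minimal sketch of the requested merge behaviour (the helper name and surrounding wiring are assumptions; the real change would live in merger.go):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// mergeServiceSpec keeps fields on the desired object that were only set on
// the live object by controllers in the target cluster.
func mergeServiceSpec(desired, live *corev1.Service) {
    // CCM/MetalLB may have filled this in on the target cluster; do not wipe it.
    if desired.Spec.LoadBalancerIP == "" {
        desired.Spec.LoadBalancerIP = live.Spec.LoadBalancerIP
    }
    // The same idea applies to other defaulted fields like spec.clusterIP.
    if desired.Spec.ClusterIP == "" {
        desired.Spec.ClusterIP = live.Spec.ClusterIP
    }
}

func main() {
    desired := &corev1.Service{}
    live := &corev1.Service{}
    live.Spec.LoadBalancerIP = "147.75.0.10"
    mergeServiceSpec(desired, live)
    fmt.Println(desired.Spec.LoadBalancerIP) // 147.75.0.10
}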
Environment:
Kubernetes version (kubectl version):
How to categorize this issue?
/area control-plane
/kind enhancement
/priority normal
What would you like to be added:
When resources are generated, would it be possible to add annotations to the resources to document which ManagedResource
and Secret
that the resource was defined in? This would greatly enhance usability, especially in a downtime situation to figure out why resources keep getting replaced.
Why is this needed:
When the resource manager creates a resource, it's unclear where it is generated from. Sometimes the resource is wrong and needs to be live patched, but it keeps getting overwritten. Finding the source is difficult because the secrets are base64 encoded.
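A hedged sketch of what such origin annotations could look like; the annotation keys are made up for illustration and are not an existing GRM convention:

package main

import "fmt"

// addOriginAnnotations records which ManagedResource and Secret an object
// came from. The annotation keys are hypothetical.
func addOriginAnnotations(annotations map[string]string, managedResource, secret string) map[string]string {
    if annotations == nil {
        annotations = map[string]string{}
    }
    annotations["resources.gardener.cloud/managed-resource"] = managedResource
    annotations["resources.gardener.cloud/source-secret"] = secret
    return annotations
}

func main() {
    ann := addOriginAnnotations(nil, "shoot--foo--bar/shoot-core", "managedresource-shoot-core")
    fmt.Println(ann)
}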
How to categorize this issue?
/area usability
/kind bug
/priority normal
What happened:
The GRM is blocked because it waits for a ClusterRole or ClusterRoleBinding to be deleted. However, both have the foregroundDeletion finalizer and cannot be deleted.
{"level":"info","ts":"2020-06-22T14:20:45.041Z","logger":"gardener-resource-manager.reconciler","msg":"Deletion is still pending","object":"shoot--dev--foo/shoot-core","err":"Could not clean all old resources: 9 errors occurred: [deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRole/default/system:vpa-admission-controller\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-evictionter-binding\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRole/default/system:vpa-checkpoint-actor\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-admission-controller\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRole/default/system:evictioner\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-checkpoint-actor\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-exporter-role-binding\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-target-reader-binding\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:metrics-reader\" is still pending]"}
Logs of KCM:
E0622 14:19:56.996355 1 garbagecollector.go:314] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"rbac.authorization.k8s.io/v1", Kind:"ClusterRole", Name:"system:vpa-admission-controller", UID:"bee40e0d-5816-4fc2-84c3-024ae6312837", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:""}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference(nil)}: clusterroles.rbac.authorization.k8s.io "system:vpa-admission-controller" is forbidden: user "system:serviceaccount:kube-system:generic-garbage-collector" (groups=["system:serviceaccounts" "system:serviceaccounts:kube-system" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:
What you expected to happen:
GRM is not blocked.
How to reproduce it (as minimally and precisely as possible):
Let GRM manage a ClusterRole/ClusterRoleBinding and then delete the ManagedResource.
Environment:
Gardener-Resource-Manager version: master
Kubernetes version (kubectl version): v1.18.0
How to categorize this issue?
/area robustness
/kind bug
/priority normal
What happened:
We have observed some situations where grm gets stuck reconciling a specific managed resource and does not act upon it anymore.
In all cases I observed, it was either happening in conjunction with a longer period of downtime of the source or target API server (before #95) or a large amount of secret data in the target cluster (like described in #92).
What you expected to happen:
grm should not get stuck and should reconcile all managed resources at the given sync interval.
How to reproduce it (as minimally and precisely as possible):
Not sure yet.
My guess would be that the worker goroutines get stuck in some WaitForCacheSync when the API server is unavailable for a longer period of time or the amount of watched data is too big.
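If that guess is right, one possible mitigation is to bound the cache sync with a timeout; a minimal sketch using client-go's generic WaitForCacheSync helper (the informer wiring is assumed, this is not GRM code):

package main

import (
    "context"
    "fmt"
    "time"

    toolscache "k8s.io/client-go/tools/cache"
)

// waitForSyncWithTimeout bounds the cache sync so a worker goroutine cannot
// block forever when the API server is down for a long time. The synced
// functions would come from the informers/caches grm uses.
func waitForSyncWithTimeout(timeout time.Duration, synced ...toolscache.InformerSynced) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    if ok := toolscache.WaitForCacheSync(ctx.Done(), synced...); !ok {
        return fmt.Errorf("timed out after %s waiting for caches to sync", timeout)
    }
    return nil
}

func main() {
    // With no synced funcs this returns immediately; shown only to make the
    // sketch compile and run.
    fmt.Println(waitForSyncWithTimeout(2 * time.Minute))
}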
Anything else we need to know?:
Environment:
Kubernetes version (kubectl version):
How to categorize this issue?
/area usability
/kind bug
/priority blocker
What happened:
Deletion of ManagedResources with Services can be blocked because of kubernetes/kubernetes#91287.
TL;DR: the EndpointSlice controller tries to create new EndpointSlices although the respective Service is already marked for deletion. That way, the Service deletion is blocked.
Similar to #60
What you expected to happen:
GRM to be able to delete the Service.
How to reproduce it (as minimally and precisely as possible):
Create a ManagedResource with a Service, let it be created and delete it again.
It does not always happen, but with high probability.
Sometimes the deletion succeeds after a few minutes, when the garbage collection controller runs at exactly the right moment.
Anything else we need to know?:
Environment:
Kubernetes version (kubectl version): v1.18.3
What happened:
If you remove a secretRef from a MR, grm will not remove its finalizer from the referenced Secret. Therefore the secret cannot be deleted without manual intervention.
With gardener/gardener#2210 (which creates dedicated secrets for each worker pool and references each of them in the respective MR), this will lead to a problem once released: the Shoot's namespace in the Seed can no longer be deleted if the end user has removed a worker pool from their Shoot, because the namespace will contain Secrets with stale finalizers.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
Environment:
How to categorize this issue?
/area scalability
/kind enhancement
/priority normal
What would you like to be added:
Once kubernetes-sigs/controller-runtime#1147 has been released and vendored, we should go back to the default leader election settings and switch to the Lease object for leader election.
Why is this needed:
We have seen significant scalability problems when running more than 100 Shoot grm instances in one Seed, each doing leader election on ConfigMaps every 2s.
We were able to workaround this issue with #72 and gardener/gardener#2668 but we should switch to leases for having a long-term solution.
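A sketch of what the switch could look like, assuming a controller-runtime version that already exposes LeaderElectionResourceLock; the election ID and namespace are illustrative:

package main

import (
    "k8s.io/client-go/tools/leaderelection/resourcelock"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
    // Default lease durations apply again; only the lock resource changes
    // from ConfigMaps to Leases.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
        LeaderElection:             true,
        LeaderElectionID:           "gardener-resource-manager",
        LeaderElectionNamespace:    "kube-system",
        LeaderElectionResourceLock: resourcelock.LeasesResourceLock,
    })
    if err != nil {
        panic(err)
    }
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        panic(err)
    }
}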
When the field externalTrafficPolicy of a Service is Local, it can't be updated by the resource manager.
The error message looks like this:
{
"level": "error",
"ts": "2020-01-18T13:59:03.230Z",
"logger": "gardener-resource-manager.reconciler",
"msg": "Could not apply all new resources",
"object": "shoot--core--v15/addons",
"error": "Errors occurred during applying: [error during apply of object \"v1/Service/kube-system/addons-nginx-ingress-controller\": Service \"addons-nginx-ingress-controller\" is invalid: spec.healthCheckNodePort: Invalid value: 0: cannot change healthCheckNodePort on loadBalancer service with externalTraffic=Local during update]",
"stacktrace": "
github.com/go-logr/zapr.(*zapLogger).Error
/root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).reconcile
/root/gopath/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:210
github.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).Reconcile
/root/gopath/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:97
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"
}
How to categorize this issue?
/area usability
/kind bug
/priority normal
What happened:
The status is not kept when the GRM reconciles a VerticalPodAutoscaler object because the VerticalPodAutoscaler CRD does not support subresources.
Most likely all other CRDs without support for the status subresource are affected as well.
What you expected to happen:
The GRM should not overwrite the status.
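For illustration, a hedged sketch of how the status could be preserved during merge for objects handled as unstructured (the helper name is an assumption, not the actual GRM code):

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// keepStatus copies .status from the live object into the desired object, so
// that an update does not wipe it for CRDs without a status subresource.
func keepStatus(desired, live *unstructured.Unstructured) error {
    status, found, err := unstructured.NestedFieldCopy(live.Object, "status")
    if err != nil || !found {
        return err
    }
    return unstructured.SetNestedField(desired.Object, status, "status")
}

func main() {
    live := &unstructured.Unstructured{Object: map[string]interface{}{
        "status": map[string]interface{}{"conditions": []interface{}{}},
    }}
    desired := &unstructured.Unstructured{Object: map[string]interface{}{}}
    if err := keepStatus(desired, live); err != nil {
        panic(err)
    }
    fmt.Println(desired.Object["status"])
}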
How to reproduce it (as minimally and precisely as possible):
Let GRM manage a VerticalPodAutoscaler and observe its status being overwritten.
Environment:
How to categorize this issue?
/area ops-productivity
/kind bug
/priority normal
What happened:
Currently, the CheckDaemonSet func does not handle all cases in which a DaemonSet is ready/not ready. See the following example:
$ k -n kube-system get po csi-disk-plugin-alicloud-xbwhj
NAME READY STATUS RESTARTS AGE
csi-disk-plugin-alicloud-xbwhj 1/2 ImagePullBackOff 0 19m
$ k -n kube-system get ds csi-disk-plugin-alicloud
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
csi-disk-plugin-alicloud 1 1 0 1 0 <none> 30m
$ k -n kube-system get ds csi-disk-plugin-alicloud -o yaml
status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 1
  observedGeneration: 4
  updatedNumberScheduled: 1
On the other hand, the corresponding MR is reported as healthy:
$ k -n shoot--foo--bar get mr extension-controlplane-shoot
NAME CLASS APPLIED HEALTHY AGE
extension-controlplane-shoot True True 8h
What you expected to happen:
Similar issues should be caught by the CheckDaemonSet func.
How to reproduce it (as minimally and precisely as possible):
Create a MR that manages a DaemonSet and set the image of the DaemonSet to a non-existing one.
Observe that the MR is still reported as healthy.
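For illustration, a stricter readiness check based only on the DaemonSet status numbers; this is a hedged sketch, not the actual CheckDaemonSet implementation:

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

// checkDaemonSetReady fails when any scheduled pod is unavailable or not
// ready, which also catches e.g. ImagePullBackOff pods.
func checkDaemonSetReady(ds *appsv1.DaemonSet) error {
    if ds.Status.ObservedGeneration < ds.Generation {
        return fmt.Errorf("observed generation outdated (%d/%d)", ds.Status.ObservedGeneration, ds.Generation)
    }
    if ds.Status.NumberUnavailable > 0 {
        return fmt.Errorf("%d unavailable pods", ds.Status.NumberUnavailable)
    }
    if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
        return fmt.Errorf("only %d of %d pods ready", ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
    }
    return nil
}

func main() {
    ds := &appsv1.DaemonSet{}
    ds.Status.DesiredNumberScheduled = 1
    ds.Status.NumberUnavailable = 1
    fmt.Println(checkDaemonSetReady(ds)) // 1 unavailable pods
}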
Environment:
How to categorize this issue?
/area ops-productivity
/kind enhancement
/priority normal
What would you like to be added:
We should introduce a /healthz endpoint for a liveness probe that can be used in the Deployment manifest for GRM so that it is automatically restarted when it does not function properly.
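A minimal sketch of how this could be wired up with controller-runtime's healthz support, assuming a version that offers HealthProbeBindAddress and AddHealthzCheck (the port and check name are illustrative):

package main

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
    // Serve /healthz on :8081; the Deployment's livenessProbe would then
    // point httpGet at this port/path so the kubelet restarts a wedged pod.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
        HealthProbeBindAddress: ":8081",
    })
    if err != nil {
        panic(err)
    }
    if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
        panic(err)
    }
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        panic(err)
    }
}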
Why is this needed:
I saw a running GRM pod that wasn't functioning, but it also didn't get restarted automatically (obviously). Those were the logs:
E1023 06:02:21.112646 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Role: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:21.113814 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.NetworkPolicy: Get https://kube-apiserver/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:21.114967 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1beta1.PodSecurityPolicy: Get https://kube-apiserver/apis/policy/v1beta1/podsecuritypolicies?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:21.116021 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v2beta1.HorizontalPodAutoscaler: Get https://kube-apiserver/apis/autoscaling/v2beta1/horizontalpodautoscalers?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/extension-worker-mcm-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/shoot-cloud-config-execution"}
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/extension-worker-mcm-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/shoot-cloud-config-execution"}
{"level":"info","ts":"2020-10-23T06:02:21.214Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/extension-controlplane-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.214Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/extension-controlplane-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.218Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/extension-controlplane-storageclasses"}
{"level":"info","ts":"2020-10-23T06:02:21.218Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/extension-controlplane-storageclasses"}
E1023 06:02:22.090404 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1beta1.PodDisruptionBudget: Get https://kube-apiserver/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.097324 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Secret: Get https://kube-apiserver/api/v1/secrets?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.098385 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.DaemonSet: Get https://kube-apiserver/apis/apps/v1/daemonsets?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.099407 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ServiceAccount: Get https://kube-apiserver/api/v1/serviceaccounts?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.100688 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Namespace: Get https://kube-apiserver/api/v1/namespaces?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.101760 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ConfigMap: Get https://kube-apiserver/api/v1/configmaps?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.103386 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ClusterRoleBinding: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.104123 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ClusterRole: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.105368 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Service: Get https://kube-apiserver/api/v1/services?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.106388 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.HorizontalPodAutoscaler: Get https://kube-apiserver/apis/autoscaling/v1/horizontalpodautoscalers?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.107496 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1beta1.CustomResourceDefinition: Get https://kube-apiserver/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.108534 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Deployment: Get https://kube-apiserver/apis/apps/v1/deployments?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.109591 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.APIService: Get https://kube-apiserver/apis/apiregistration.k8s.io/v1/apiservices?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.110817 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.RoleBinding: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.111773 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.StorageClass: Get https://kube-apiserver/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.112863 1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Role: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
...
How to categorize this issue?
/area usability
/kind bug
/priority normal
What happened:
{"level":"info","ts":"2020-06-22T14:14:44.780Z","logger":"gardener-resource-manager.health-reconciler","msg":"could not create new object of kind for health checks (probably not registered in the used scheme), falling back to unstructured request","object":"shoot--dev--foo/shoot-core","GroupVersionKind":"apiregistration.k8s.io/v1, Kind=APIService","error":"no kind \"APIService\" is registered for version \"apiregistration.k8s.io/v1\" in scheme \"github.com/gardener/gardener-resource-manager/cmd/gardener-resource-manager/app/app.go:124\""}
What you expected to happen:
No error message
How to reproduce it (as minimally and precisely as possible):
Let GRM manage an APIService object.
How to categorize this issue?
/kind enhancement
/priority normal
What would you like to be added:
There was an issue with a Gardener Shoot most likely caused by the GRM.
Apparently, there is some merging logic when reconciling an existing resource with an updated resource.
In the logs below, you can see that the managed resource was unhealthy and that the batch/v1 Job had an invalid spec when applying it to the target cluster.
Network extension (shoot--dxp--trial/trial) reports failing health check: Health check summary: 1/1 unsuccessful, 0/1 progressing, 0/1 successful. Unsuccessful checks: 1) ManagedResourceUnhealthy: managed resource extension-networking-cilium-config in namespace shoot--dxp--trial is unhealthy: condition "ResourcesApplied" has invalid status False (expected True) due to ApplyFailed: Could not apply all new resources: 1 error occurred: error during apply of object "batch/v1/Job/kube-system/hubble-generate-certs": Job.batch "hubble-generate-certs" is invalid: spec.template:
The issue got resolved after deleting the existing batch/v1 Job in the target cluster. This indicates that there is some problem with the merge of the existing and updated resource.
Why is this needed:
Since 0.6.0, the namespace for the ManagedResource resources is required. If the namespace is not provided, the g/gardener-resource-manager fails to apply the resources.
/kind bug
Steps to reproduce:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example1
type: Opaque
stringData:
  objects.yaml: |
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: test-1234
      annotations:
        resources.gardener.cloud/ignore: "true"
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: test-5678
EOF
$ cat <<EOF | kubectl apply -f -
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  secretRefs:
  - name: managedresource-example1
EOF
$ make start
{"level":"info","ts":"2019-11-07T16:46:00.297+0200","logger":"gardener-resource-manager.reconciler","msg":"Applying","resource":"v1/ConfigMap/default/test-1234"}
{"level":"error","ts":"2019-11-07T16:46:00.312+0200","logger":"controller-runtime.controller","msg":"Reconciler error","controller":"resource-controller","request":"default/example","error":"Errors occurred during applying: [error during apply of object \"v1/ConfigMap/default/test-1234\": the server does not allow this method on the requested resource error during apply of object \"v1/ConfigMap/default/test-5678\": the server does not allow this method on the requested resource]",
/cc @vpnachev
How to categorize this issue?
/area scalability
/kind cleanup
/priority normal
What would you like to be added:
With the recent improvements in controller-runtime (kubernetes-sigs/controller-runtime#1151), we can/should replace the DeferredDiscoveryRESTMapper by a DynamicRESTMapper to avoid explicit rediscovery.
This way, grm immediately rediscovers resources if some of them are not known (yet), so that no extra reconciliation is needed.
It will still adhere to the specified rate limits, so we don't do unnecessary extraneous discovery calls (even across managed resources - which wasn't the case before).
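A minimal sketch, assuming a controller-runtime version that ships apiutil.NewDynamicRESTMapper with this signature:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/runtime/schema"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client/apiutil"
)

func main() {
    cfg := ctrl.GetConfigOrDie()

    // Lazily (re)discovers unknown GroupKinds on demand, rate-limited,
    // instead of periodically resetting a DeferredDiscoveryRESTMapper.
    mapper, err := apiutil.NewDynamicRESTMapper(cfg)
    if err != nil {
        panic(err)
    }

    mapping, err := mapper.RESTMapping(schema.GroupKind{Group: "apps", Kind: "Deployment"}, "v1")
    if err != nil {
        panic(err)
    }
    fmt.Println(mapping.Resource)
}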
Error on Gardener:
Flow "Shoot cluster deletion" encountered task errors: [task "Waiting until managed resources have been deleted" failed: retry failed with context deadline exceeded, last error: not all managed resources have been deleted in the shoot cluster (still existing: [addons])]
Message on the addons ManagedResource:
Missing required Deployment with name "addons-nginx-ingress-nginx-ingress-k8s-backend".
Logs of resource manager pod:
{"level":"error","ts":"2019-11-25T13:13:04.240Z","logger":"gardener-resource-manager.reconciler","msg":"Deletion is still pending","object":"shoot--it--tmuog7z/addons","error":"Deletion of old resource v1/Service/kube-system/addons-nginx-ingress-controller is still pending","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).delete\n\t/go/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:235\ngithub.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).Reconcile\n\t/go/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":"2019-11-25T13:13:04.240Z","logger":"controller-runtime.controller","msg":"Reconciler error","controller":"resource-controller","request":"shoot--it--tmuog7z/addons","error":"Deletion of old resource v1/Service/kube-system/addons-nginx-ingress-controller is still pending","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
There are no pods or deployments left on the shoot cluster:
kubectl get all --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 100.104.0.1 <none> 443/TCP 42h
kube-system service/addons-nginx-ingress-controller LoadBalancer 100.108.88.55 172.18.120.41 443:30376/TCP,80:31097/TCP 42h
Infrastructure: Openstack
Resource Manager image version: 0.7.0
Shoot kubernetes version: 1.16.2
How to categorize this issue?
/area usability
/kind bug
/priority normal
What happened:
Shoot was stuck in deletion with the following error:
Waiting until extension resources have been deleted" failed: 1 error occurred:
* Failed to delete Extension shoot--foo--bar/shoot-dns-service: Error deleting Extension resource: timed out waiting for the condition
Checking the logs of the gardener-resource-manager revealed:
{"level":"info","ts":"2020-10-21T13:09:03.264Z","logger":"gardener-resource-manager.reconciler","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/extension-shoot-dns-service-seed"}
{"level":"info","ts":"2020-10-21T13:09:03.264Z","logger":"gardener-resource-manager.reconciler","msg":"Starting to delete ManagedResource","object":"shoot--foo--bar/extension-shoot-dns-service-seed"}
{"level":"info","ts":"2020-10-21T13:09:03.843Z","logger":"gardener-resource-manager.reconciler","msg":"All resources have been deleted, removing finalizers from ManagedResource","object":"shoot--foo--bar/extension-shoot-dns-service-seed"}
{"level":"error","ts":"2020-10-21T13:09:07.325Z","logger":"controller-runtime.controller","msg":"Reconciler error","controller":"resource-controller","request":"shoot--foo--bar/extension-shoot-dns-service-seed","error":"error removing finalizer from ManagedResource: Operation cannot be fulfilled on managedresources.resources.gardener.cloud \"extension-shoot-dns-service-seed\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
After that, the GRM never tried to remove the finalizer again, leading to the stuck ManagedResource in the system.
What you expected to happen:
GRM should retry to remove its finalizer.
How to reproduce it (as minimally and precisely as possible):
n/a yet, sorry
Environment:
Kubernetes version (kubectl version): v1.17.8
How to categorize this issue?
/area quality robustness cost
/kind bug
/priority normal
What happened:
We observed a Shoot cluster which had a lot of very large secrets (~3000 with ~1MiB each).
As the gardener-resource-manager caches all objects in the target cluster it touches, its cache had grown to an unacceptable size:
$ k top po
NAME CPU(cores) MEMORY(bytes)
gardener-resource-manager-77876b69b7-kvvkz 400m 5875Mi
Also, one of its worker routines trying to get a secret for applying was stuck: the client tried to acquire a watch, but the API server panicked (for some reason), so the client waited forever for the watch cache to sync:
I1127 13:55:38.982551 1 trace.go:201] Trace[4834220]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224 (27-Nov-2020 13:54:00.969) (total time: 60013ms):
Trace[4834220]: [1m0.013079589s] [1m0.013079589s] END
E1127 13:55:38.982571 1 reflector.go:178] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224: Failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 57251; INTERNAL_ERROR
What you expected to happen:
grm should be resilient to such user behaviour and should therefore allow disabling the target cache via a command line flag.
This way, gardener can deploy grm instances for Shoots without a cache for the Shoot API server, to avoid failing watches and minimize the memory footprint.
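One possible shape of this, sketched with an uncached controller-runtime client that reads directly from the API server (the flag plumbing is omitted and assumed; the namespace/name are illustrative):

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
    // An uncached client reads directly from the API server, so no informer
    // keeps thousands of large secrets in memory and no watch has to sync.
    c, err := client.New(ctrl.GetConfigOrDie(), client.Options{})
    if err != nil {
        panic(err)
    }

    secret := &corev1.Secret{}
    if err := c.Get(context.Background(), client.ObjectKey{Namespace: "kube-system", Name: "vpn-shoot"}, secret); err != nil {
        panic(err)
    }
    fmt.Println(secret.Name)
}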
/cc @mandelsoft @timuthy
Environment:
Kubernetes version (kubectl version):