Comments (9)

adnankobir commented on May 27, 2024

Here is a more severe case, where all pods are in a not-ready state:

❯ kubectl get po -n cert-manager
NAME                                                    READY   STATUS    RESTARTS   AGE
cert-manager-istio-csr-79ffc5bfd-q4qw8                  0/1     Running   0          20d
cert-manager-istio-csr-79ffc5bfd-vrjdd                  0/1     Running   0          20d
cert-manager-istio-csr-79ffc5bfd-xs9mj                  0/1     Running   0          20d
❯ kubectl describe po cert-manager-istio-csr-79ffc5bfd-xs9mj -n cert-manager
Name:         cert-manager-istio-csr-79ffc5bfd-xs9mj
Namespace:    cert-manager
Priority:     0
Node:         ip-10-136-208-186.ec2.internal/10.136.208.186
Start Time:   Wed, 17 Feb 2021 16:19:11 -0500
Labels:       app=cert-manager-istio-csr
              pod-template-hash=79ffc5bfd
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.136.212.19
IPs:
  IP:           10.136.212.19
Controlled By:  ReplicaSet/cert-manager-istio-csr-79ffc5bfd
Containers:
  cert-manager-istio-csr:
    Container ID:  docker://844832e7090dd643e7e296def2cbe8c3c9519d6f38537480a2510bf63a3ace7d
    Image:         quay.io/jetstack/cert-manager-istio-csr:v0.1.0
    Image ID:      docker-pullable://quay.io/jetstack/cert-manager-istio-csr@sha256:f9d473fa10520d0a255a4b60350a9f9057834da762129f9e5ecb9681955b1fd0
    Port:          6443/TCP
    Host Port:     0/TCP
    Command:
      cert-manager-istio-csr
    Args:
      --log-level=1
      --readiness-probe-port=6060
      --readiness-probe-path=/readyz
      --serving-address=0.0.0.0:6443
      --serving-certificate-duration=24h
      --root-ca-configmap-name=istio-ca-root-cert
      --certificate-namespace=istio-system
      --issuer-group=cert-manager.io
      --issuer-kind=ClusterIssuer
      --issuer-name=vault-issuer
      --max-client-certificate-duration=24h
      --preserve-certificate-requests=false
    State:          Running
      Started:      Wed, 17 Feb 2021 16:19:37 -0500
    Ready:          False
    Restart Count:  0
    Readiness:      http-get http://:6060/readyz delay=3s timeout=1s period=7s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cert-manager-istio-csr-token-h42zw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cert-manager-istio-csr-token-h42zw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cert-manager-istio-csr-token-h42zw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
  Warning  Unhealthy  2m43s (x139117 over 20d)  kubelet  Readiness probe failed: Get http://10.136.212.19:6060/readyz: dial tcp 10.136.212.19:6060: connect: connection refused

The only log of interest here is around RBAC:

E0226 17:37:13.286265       1 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".16675cce7821d48e", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"LeaderElection", Message:"cert-manager-istio-csr-79ffc5bfd-xs9mj_7e03f6c1-8793-4729-aa4d-4ca47180a174 stopped leading", Source:v1.EventSource{Component:"cert-manager-istio-csr-79ffc5bfd-xs9mj_7e03f6c1-8793-4729-aa4d-4ca47180a174", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc0066a5250ef3a8e, ext:764255996264212, loc:(*time.Location)(0x27b9ac0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc0066a5250ef3a8e, ext:764255996264212, loc:(*time.Location)(0x27b9ac0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:cert-manager:cert-manager-istio-csr" cannot create resource "events" in API group "" in the namespace "default"' (will not retry!)

from istio-csr.

JoshVanL commented on May 27, 2024

Thanks for opening this @adnankobir. This is quite a concerning bug. I will spend some time seeing if I can replicate the issue.

Roughly how long does it take for the probe to start failing?

adnankobir commented on May 27, 2024

Thanks @JoshVanL

I don't have a rough estimate; it appears to be completely random. Fortunately I have this deployed in 7 clusters: some exhibit this behaviour within a matter of hours, others after a couple of days.

As a workaround for now, what are the implications of removing the health checks or, better yet, simply doing a TCP check on the serving address?

JoshVanL commented on May 27, 2024

Very strange. This readiness endpoint is managed through controller-runtime, so that is where I'll be looking first. The only time this check is set to false after successful initialization is during termination.

The implication of removing the check is that the pod will receive request traffic before it has initialized (fetched a serving cert and begun serving). Changing to TCP may work, but it would be interesting to see TCP succeed where HTTP does not.
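To compare the two probe styles by hand, here is a minimal Python sketch. It is not part of istio-csr, and the helper names `tcp_check` and `http_check` are made up for illustration. The HTTP check mirrors the kubelet convention that any status from 200 up to (but not including) 400 counts as success. The TCP check only verifies that a connection can be opened.

```python
import socket
import urllib.error
import urllib.request

# Talk to the target directly, ignoring any proxy settings in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def tcp_check(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


def http_check(host, port, path="/readyz", timeout=1.0):
    """Return True if GET http://host:port/path answers with 2xx or 3xx."""
    url = f"http://{host}:{port}{path}"
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP errors (4xx/5xx) and connection failures both count as not ready.
        return False
```

Run from somewhere that can reach the pod network, `http_check("10.136.212.19", 6060)` and `tcp_check("10.136.212.19", 6443)` (the pod IP and ports from the describe output above) would show whether the serving port still accepts connections while the readiness port refuses them.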

Another option is to add a liveness probe with the same check, so long as it has an initialDelaySeconds long enough for initialization to complete (something very large like 10m is probably fine here). This would at least kill the pod so that it comes back up ready...
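As a rough sketch of that liveness-probe workaround, the Python below builds a patch body that adds a `livenessProbe` mirroring the existing readiness check. The path, port, period, timeout and failure threshold are copied from the describe output above, and 600 seconds stands in for the suggested 10m delay. The function name `liveness_patch` is hypothetical, and the container name is taken from the pod spec shown earlier.

```python
import json


def liveness_patch(container, path="/readyz", port=6060,
                   initial_delay_seconds=600, period_seconds=7,
                   timeout_seconds=1, failure_threshold=3):
    """Build a Deployment patch adding a livenessProbe to one container."""
    probe = {
        "httpGet": {"path": path, "port": port},
        # Large delay so the first initialization can finish before
        # the kubelet starts counting failures.
        "initialDelaySeconds": initial_delay_seconds,
        "periodSeconds": period_seconds,
        "timeoutSeconds": timeout_seconds,
        "failureThreshold": failure_threshold,
    }
    # Containers are matched by name when the patch is merged.
    return {"spec": {"template": {"spec": {
        "containers": [{"name": container, "livenessProbe": probe}],
    }}}}


if __name__ == "__main__":
    print(json.dumps(liveness_patch("cert-manager-istio-csr"), indent=2))
```

Saved to a file such as `liveness.json`, the output could be applied with something like `kubectl patch deployment cert-manager-istio-csr -n cert-manager -p "$(cat liveness.json)"`. The Deployment name here is an assumption inferred from the ReplicaSet name, so check it against your cluster first.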

schantaraud commented on May 27, 2024

In case it helps, I had the same issue on 2 clusters at the exact same time, 30 days after the pods were created. Restarting the pods seems to have fixed it for now.

irbekrm commented on May 27, 2024

I've briefly tested this (on GKE, k8s v1.19): it seems that istio-csr pods become not ready if the cert-manager webhook goes down whilst istio-csr is processing certificate requests.
They seem to remain not ready even after the webhook is healthy again.
Adding a liveness probe as @JoshVanL mentions above seems to fix the issue. I've not done any extensive testing on this though.

JoshVanL commented on May 27, 2024

Thanks all. I have managed to track down this issue:

If there is a transient network connectivity error or similar, istio-csr will lose leader election or fail to renew the lease. If this happens, the pod becomes unready. To resolve this, istio-csr now correctly exits, which allows either another istio-csr to assume leadership or the pod to come back up as the leader.

This has been fixed in this PR: #62
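For readers unfamiliar with the pattern, the exit-on-lost-lease behaviour described above can be illustrated with a small Python sketch. This is not the istio-csr code (which is Go) and not what the PR contains. `run_while_leader`, `main` and the `renew` callable are hypothetical names, and the timings are arbitrary defaults.

```python
import sys
import time


def run_while_leader(renew, retry_period=2.0, renew_deadline=10.0):
    """Block while the lease keeps renewing; return once it is lost.

    `renew` is a hypothetical callable that returns True when the lease
    was renewed and returns False or raises OSError on failure.
    """
    last_success = time.monotonic()
    while True:
        try:
            renewed = renew()
        except OSError:
            renewed = False  # treat network errors as a failed renewal
        if renewed:
            last_success = time.monotonic()
        elif time.monotonic() - last_success >= renew_deadline:
            return  # lease not renewed within the deadline: leadership lost
        time.sleep(retry_period)


def main(renew):
    run_while_leader(renew)
    # Exit non-zero rather than lingering as a running-but-unready process,
    # so the container is restarted and another replica can take the lease.
    sys.exit(1)
```

The point of the design is that a process which has lost the lease does not try to limp along: it exits, and the normal restart machinery brings up a fresh instance that goes through leader election again.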

JoshVanL commented on May 27, 2024

This fix has been merged as part of the v0.2.0 release.

Closing this for now, but please feel free to reopen it if you continue to have issues.

/close

jetstack-bot commented on May 27, 2024

@JoshVanL: Closing this issue.

