kubernetes-csi / livenessprobe


A sidecar container that can be included in a CSI plugin pod to enable integration with Kubernetes Liveness Probe.

License: Apache License 2.0


livenessprobe's People

Contributors

andrewsykim, andyzhangx, chrishenzie, darkowlzz, dependabot[bot], dobsonj, humblec, jiawei0227, jsafrane, k8s-ci-robot, lpabon, madhu-1, mauriciopoppe, morrislaw, mowangdk, msau42, mucahitkurt, namrata-ibm, nettoclaudio, pohly, raunakshah, saad-ali, sbezverk, scuzhanglei, sneha-at, spiffxp, stefansedich, sunnylovestiramisu, verult, xing-yang


livenessprobe's Issues

support exec style liveness probes in addition to http

I have scenarios where either or both of the node and controller services run in the hostNetwork namespace. Combined with the possibility of multiple deployments, this means I need to consume host ports (and reserve them), which is less than ideal.

What would be great is the ability to use exec-style probes and simply have a small binary connect to the UDS and exit as appropriate, as sketched below.
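For illustration, a minimal sketch of such an exec-style probe binary, assuming the CSI spec's Go bindings and an illustrative socket path (both are assumptions, not existing livenessprobe code): it dials the UDS, calls Identity.Probe, and exits non-zero if the driver is unhealthy, so no host port is needed.

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the driver's Unix domain socket (path is illustrative).
	conn, err := grpc.DialContext(ctx, "unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		fmt.Fprintln(os.Stderr, "connect:", err)
		os.Exit(1)
	}
	defer conn.Close()

	// Ask the driver's Identity service whether it is healthy.
	resp, err := csi.NewIdentityClient(conn).Probe(ctx, &csi.ProbeRequest{})
	if err != nil {
		fmt.Fprintln(os.Stderr, "probe:", err)
		os.Exit(1)
	}
	if r := resp.GetReady(); r != nil && !r.GetValue() {
		fmt.Fprintln(os.Stderr, "driver reports not ready")
		os.Exit(1)
	}
	// Exit 0: kubelet treats the exec probe as successful.
}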

High risk vulnerability with v2.9.0

Hi Team,

We have one high-risk vulnerability with v2.9.0:

golang.org/x/net version v0.4.0 has 1 vulnerability


Can you please help us by fixing this?
With that, livenessprobe would be vulnerability-free.

Thank you.

I'd like to understand your release strategy.

The last release was in May, and many CVE-mitigation PRs have been merged since then. How often do you release a new version of livenessprobe, and when can we expect a new release that contains these fixes?
Thank you!

Avoid CVEs with cron! (One simple trick, etc)

Hello, as mentioned in #135, it seems like this container is (for practical purposes) rarely free of CVEs with a high or critical score. Whether the container is actually vulnerable is another story. When referenced artifacts get flagged, it adds significant workload for developers who are unfamiliar with the provenance of projects like this and have to dismiss the flags. This is exacerbated when there is a new release and the process of dismissing flags starts anew.

It would be far easier if dependent builds could automatically upgrade to the latest versions of components like this rather than getting involved in deep analysis or improperly dismissing vulnerabilities with a broad (and incorrect) assessment.

I believe the prow configuration for this component uses the golang tag with no bugfix version (i.e. 1.18). A build referencing that tag will pull the latest version of 1.18 on the date of the build.

It follows that this project could automatically avoid CVEs in older versions of Go by simply releasing the artifact once a month with an incremented bugfix release. The golang tag will move to the latest version and all will be well. Orgs like ours that use compliance scanning as a blunt-force instrument will be satisfied, etc.

Security should be maintained since release branches are used: merges from master would go through human review, and contributions to master would not affect releases unless they are merged.

@pohly, does this seem sane to you? If it becomes useful, it's a pattern that could be propagated to similar projects as well.

Facing vulnerability issues with golang, net

We are facing vulnerability issues with Go and golang.org/x/net since these are outdated:

Package: golang.org/x/net
Location: livenessprobe/6233d3ca658768ca9c39e8ad55f01f84adff930c/sha256__32ca3d8516d3b0cd0311d54a49ea0616f4964c49e38b23f1d88f215e93fe1b7d.tar.gz/livenessprobe/golang.org/x/net
Affected versions: < 0.7.0
Fixed in: 0.7.0
Published: 2023-03-10T03:31:04Z

current env:
k8s.gcr.io/sig-storage/livenessprobe: 2.9.0
EKS 1.22

Are there upcoming releases that are going to have fixes for these vulnerabilities?

Memory leak in v2.1.0

This is a slice of the average memory usage plot for the liveness probe container to illustrate the issue (Y-axis in MiB):
[average memory usage plot]

We had been running the driver with a short 2-second liveness check interval to increase our chances of discovering any issues with it (a new cluster and CSI are uncharted territory for us), racking up ~1.5 GiB of memory usage per liveness probe container in around two weeks.

Otherwise the liveness probe chugs along happily with no anomalies in logs or other metrics for the probe itself or any of the related workloads.

CVE-2021-39293

Hello,

We are trying to use this image but our vulnerability scanner detected CVE-2021-39293 as a "High":

A vulnerability was found in archive/zip of the Go standard library. Applications written in Go can panic or potentially exhaust system memory when parsing malformed ZIP files. An attacker capable of submitting a crafted ZIP file to a Go application using archive/zip to process that file could cause a denial of service via memory exhaustion or panic. This particular flaw is an incomplete fix for a previous flaw.

Would it be possible to build this image using an updated version of Go? Per the vulnerability report, this is fixed in: 1.17.1, 1.16.8.

golang/go#47801

This also affects the "node-driver-registrar" and "csi-driver-smb" images we are trying to use (as it surely does any other images built using the same version of Go). I can open issues against those images as well if needed:

https://github.com/kubernetes-csi/node-driver-registrar
https://github.com/kubernetes-csi/csi-driver-smb

Thank you.

Create a SECURITY_CONTACTS file.

As per the email sent to kubernetes-dev[1], please create a SECURITY_CONTACTS
file.

The template for the file can be found in the kubernetes-template repository[2].
A description for the file is in the steering-committee docs[3], you might need
to search that page for "Security Contacts".

Please feel free to ping me on the PR when you make it, otherwise I will see when
you close this issue. :)

Thanks so much, let me know if you have any questions.

(This issue was generated from a tool, apologies for any weirdness.)

[1] https://groups.google.com/forum/#!topic/kubernetes-dev/codeiIoQ6QE
[2] https://github.com/kubernetes/kubernetes-template-project/blob/master/SECURITY_CONTACTS
[3] https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance-template-short.md

Broken link to the `contributor cheat sheet` needs to be fixed

Bug Report

I have observed the same kind of issue in various kubernetes-csi projects.
It happens because, after localization, many modifications were made across the various directories.
I have observed the same issue on this page as well.

It has one broken link to the contributor cheat sheet which needs to be fixed.
I will look through the other CSI repos as well and try to fix it as soon as I can.

/kind bug
/assign

panic: runtime error: invalid memory address or nil pointer dereference

Hi,

I have a CSI driver implemented according to the CSI standard. My node server includes the liveness probe (v2.8.0) as a sidecar. After running normally for some hours, I got the following (twice):

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x857ceb]

goroutine 130244 [running]:
google.golang.org/grpc.(*ClientConn).Close(0x0)
	/home/esepadm/jenkins-exclusive-executor/workspace/SEPReleaseLivenessProbe/3pp/vendor/google.golang.org/grpc/clientconn.go:1017 +0x4b
main.acquireConnection.func1()
	/home/esepadm/jenkins-exclusive-executor/workspace/SEPReleaseLivenessProbe/3pp/cmd/livenessprobe/main.go:103 +0x16a
created by main.acquireConnection
	/home/esepadm/jenkins-exclusive-executor/workspace/SEPReleaseLivenessProbe/3pp/cmd/livenessprobe/main.go:97 +0x1a5

This causes the node server DaemonSet pod to restart.
Can you help?

Best regards,
Antonio Vitiello

livenessprobe consumes too much memory

I ran the CSI driver with the liveness probe for several days, and the liveness probe's memory usage grew to 765 MB, which is abnormal compared to the other CSI sidecars.

The following is a screenshot from htop:

[htop screenshot]

Another side effect is that my CSI driver started crashing a lot more after the liveness probe was enabled:

NAMESPACE     NAME                                                    READY   STATUS    RESTARTS   AGE
kube-system   dns-controller-64db5996cd-wvrmv                         1/1     Running   0          6d7h
kube-system   ebs-csi-controller-0                                    6/6     Running   36         5d18h
kube-system   ebs-csi-node-djvrx                                      3/3     Running   20         5d18h
kube-system   ebs-csi-node-g2c2k                                      3/3     Running   9          5d18h

Here is my liveness probe config:

          livenessProbe:
            httpGet:
              path: /healthz
              port: healthz
            initialDelaySeconds: 10
            timeoutSeconds: 3
            periodSeconds: 10
            failureThreshold: 5

livenessprobe:v2.1.0 image has VA issues

The following are the VA issues in the livenessprobe:v2.1.0 image:

DLA-2424-1
  Policy Status: Active
  Summary: tzdata, the time zone and daylight-saving time data, has been updated to the latest version.
    - Revised predictions for Morocco's changes starting in 2023.
    - Macquarie Island has stayed in sync with Tasmania since 2011.
    - Casey, Antarctica is at +08 in winter and +11 in summer since 2018.
    - Palestine ends DST earlier than predicted, on 2020-10-24.
    - Fiji starts DST later than usual, on 2020-12-20.
  Official Notice: https://lists.debian.org/debian-lts-announce/2020/10/msg00037.html
  Affected Package: tzdata (Policy Status: Active)
  How to Resolve: Upgrade tzdata to >= 2020d-0+deb9u1 (DLA-2424-1)

Observing many logs in the format "Connecting to %s"

We are using livenessprobe v2.2.0 and see many logs in stderr that look like this:

1 connection.go:153] Connecting to unix:///csi/csi.sock

I tracked the line of code that generates these log messages to the golang connection library:

klog.V(5).Infof("Connecting to %s", address)

Source: https://github.com/kubernetes-csi/csi-lib-utils/blob/v0.9.1/connection/connection.go#L153. The V(5) was only added about 3 months ago (29 Dec 2020), in kubernetes-csi/csi-lib-utils@75fbafd.

We configure the livenessprobe sidecar like this:

      - args:
        - --csi-address=/csi/csi.sock
        - --health-port=9808
        - --v=3
        image: k8s.gcr.io/sig-storage/livenessprobe:v2.2.0
        imagePullPolicy: IfNotPresent
        name: liveness-probe
        resources:
          limits:
            cpu: 50m
            memory: 32Mi
          requests:
            cpu: 10m
            memory: 8Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /csi
          name: socket-dir

My understanding was that the V(5) in the connection logging would mean that if we used --v=5 we would see the log, but since we use --v=3, we shouldn't see it.

Are we doing something wrong? Or do we simply need to wait until v2.3 to pick up the new V(5) added to the connection library logging?


I noticed in the v2.2 CHANGELOG, there was a PR to reduce the default log level of the livenessprobe-sidecar to '4' (#88).


Currently we get about 20,000 logs like this per hour, which is wasting space in ElasticSearch. As a workaround we can filter out these logs with a logstash filter, for example:

    filter {
      if [kubernetes][container][name] == "liveness-probe" {
        if "Connecting to unix:///csi/csi.sock" in [message] {
          drop{}
        }
      }
    }

use nonroot user in Dockerfile

There is a security warning produced by Twistlock that this livenessprobe image should use a nonroot user. However, if I change the Dockerfile as follows (andyzhangx@42fc328), the liveness probe eventually fails, and I'm not sure what the right fix is to make this image use a nonroot user:

FROM gcr.io/distroless/static:nonroot
LABEL maintainers="Kubernetes Authors"
LABEL description="CSI Driver liveness probe"
ARG binary=./bin/livenessprobe

COPY ${binary} /livenessprobe
USER nonroot:nonroot
ENTRYPOINT ["/livenessprobe"]

failed events:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  3m23s                default-scheduler  Successfully assigned kube-system/csi-smb-node-z54xp to aks-agentpool-90924120-vmss000006
  Normal   Pulling    3m23s                kubelet            Pulling image "andyzhangx/livenessprobe:v2.12.0"
  Normal   Created    3m22s                kubelet            Created container node-driver-registrar
  Normal   Created    3m22s                kubelet            Created container liveness-probe
  Normal   Started    3m22s                kubelet            Started container liveness-probe
  Normal   Pulled     3m22s                kubelet            Container image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1" already present on machine
  Normal   Pulled     3m22s                kubelet            Successfully pulled image "andyzhangx/livenessprobe:v2.12.0" in 858.088069ms (858.101669ms including waiting)
  Normal   Started    3m22s                kubelet            Started container node-driver-registrar
  Warning  Unhealthy  23s (x5 over 2m23s)  kubelet            Liveness probe failed: Get "http://10.224.0.255:29643/healthz": dial tcp 10.224.0.255:29643: connect: connection refused
  Normal   Killing    23s                  kubelet            Container smb failed liveness probe, will be restarted
  Normal   Pulled     22s (x2 over 3m22s)  kubelet            Container image "gcr.io/k8s-staging-sig-storage/smbplugin:canary" already present on machine
  Normal   Created    22s (x2 over 3m22s)  kubelet            Created container smb
  Normal   Started    22s (x2 over 3m22s)  kubelet            Started container smb
root@andydev:~/go/src/github.com/kubernetes-csi/livenessprobe# k logs csi-smb-node-z54xp  -n kube-system liveness-probe
W0206 13:22:14.691040       1 connection.go:234] Still connecting to unix:///csi/csi.sock
W0206 13:22:24.690443       1 connection.go:234] Still connecting to unix:///csi/csi.sock
W0206 13:22:34.691048       1 connection.go:234] Still connecting to unix:///csi/csi.sock
W0206 13:22:44.691010       1 connection.go:234] Still connecting to unix:///csi/csi.sock

Rebuild with golang v1.18.6 or higher

Hi,

As mentioned in this issue, we saw some CVEs in the latest release due to the usage of Go 1.18.

Is there any chance that livenessprobe also gets updated to a later Go version?

Thanks!

v2.2.0-eks-1-18-5 has 1 High + 15 other vulnerabilities

Good afternoon,

I pulled and pushed v2.2.0-eks-1-18-5 into an ECR repository in my personal account and noticed it has 1 High + 15 other vulnerabilities. I see this also happens for v2.2.0-eks-1-20-1.

Some of these vulnerabilities are:

  • ALAS2-2021-1655 (High)
  • ALAS2-2021-1653 (Medium)
  • ALAS2-2021-1656 (Medium)

Would it be possible to release a new image anytime soon that addresses these vulnerabilities? Would you like me to take a look at this myself and submit a PR?

Thanks!

No Windows 2004 image available

Hi

According to curl -L https://mcr.microsoft.com/v2/oss/kubernetes-csi/livenessprobe/tags/list, there is no Windows image available other than for the 1809 kernel.

It would make sense to support the versions listed in the official windows-os-version-support documentation.

Probe requests still reporting ready status even when socket file doesn't exist anymore

Description
Once it has established a connection to the CSI driver's identity server, the livenessprobe server will not attempt to reconnect until a restart occurs. We rely on this persistent connection to dispatch the Probe calls to the identity server.

A side effect of this approach shows up when the socket file is removed (deliberately or not) shortly after the first connection is established. Under these conditions, the probe requests still reach the CSI driver's identity server and may return a healthy status as long as this connection is open.

Other components will not succeed in opening new connections to this CSI driver, leading to a stuck scenario. For example, the kubelet won't be able to contact the CSI driver's node server for NodePublishVolume calls, leaving pods pending forever (until human intervention).

What is expected
When the Unix domain socket file no longer exists, requests to /healthz should return a not ready/unhealthy status.
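A minimal sketch of the expected behavior, assuming the handler can stat the socket path it was configured with (the handler wiring and path are illustrative, not the project's actual code):

package main

import (
	"net/http"
	"os"
)

const csiSocketPath = "/csi/csi.sock" // illustrative path

func healthz(w http.ResponseWriter, r *http.Request) {
	// If the Unix domain socket file is gone, report unhealthy instead of
	// trusting the long-lived gRPC connection that was established earlier.
	if _, err := os.Stat(csiSocketPath); err != nil {
		http.Error(w, "CSI socket missing: "+err.Error(), http.StatusServiceUnavailable)
		return
	}
	// ...otherwise dispatch the CSI Probe call as usual...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":9808", nil)
}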

livenessprobe fails for controller only CSI driver

livenessprobe can be integrated into a CSI driver which only implements the Controller and Identity services, but no Node service.

The following makes an RPC call that is implemented by the node server. We should make the liveness probe universally usable for CSI drivers by depending only on Identity service RPC calls (see the sketch at the end of this issue).

https://github.com/kubernetes-csi/livenessprobe/blob/master/cmd/main.go#L56:
csiDriverNodeID, err := csiConn.NodeGetId(ctx)

Here is the log when it is integrated into the controller:

I1030 04:17:27.965740 1 main.go:109] Serving requests to /healthz on: 0.0.0.0:9809
I1030 04:17:40.913574 1 main.go:82] Request: /healthz from: 172.17.0.1:40302
I1030 04:17:40.913607 1 main.go:72] Attempting to open a gRPC connection with: /var/lib/csi/sockets/pluginproxy/csi.sock
I1030 04:17:40.913624 1 connection.go:70] Connecting to /var/lib/csi/sockets/pluginproxy/csi.sock
I1030 04:17:40.936761 1 connection.go:97] Still trying, connection is CONNECTING
I1030 04:17:40.937903 1 connection.go:94] Connected
I1030 04:17:40.937922 1 main.go:47] Calling CSI driver to discover driver name.
I1030 04:17:40.937933 1 connection.go:150] GRPC call: /csi.v0.Identity/GetPluginInfo
I1030 04:17:40.937943 1 connection.go:151] GRPC request:
I1030 04:17:40.950593 1 connection.go:153] GRPC response: name:"com.mapr.csi-kdf" vendor_version:"0.3.0"
I1030 04:17:40.950671 1 connection.go:154] GRPC error:
I1030 04:17:40.950680 1 main.go:52] CSI driver name: "com.mapr.csi-kdf"
I1030 04:17:40.950690 1 main.go:55] Calling CSI driver to discover node ID.
I1030 04:17:40.950699 1 connection.go:150] GRPC call: /csi.v0.Node/NodeGetId
I1030 04:17:40.950706 1 connection.go:151] GRPC request:
I1030 04:17:40.951186 1 connection.go:153] GRPC response:
I1030 04:17:40.951226 1 connection.go:154] GRPC error: rpc error: code = Unimplemented desc = unknown service csi.v0.Node
I1030 04:17:40.951289 1 main.go:95] Health check failed with: rpc error: code = Unimplemented desc = unknown service csi.v0.Node.

@sbezverk As discussed in #wg-csi, let me know if any additional information is required.
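A sketch of a health check that depends only on the Identity service, using the CSI spec's Go bindings directly (how this would be wired into livenessprobe's own connection handling is an assumption):

package healthcheck

import (
	"context"
	"fmt"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// identityOnlyHealthCheck probes a CSI driver using only Identity RPCs,
// which every driver (controller-only or not) must implement.
func identityOnlyHealthCheck(ctx context.Context, conn *grpc.ClientConn) error {
	identity := csi.NewIdentityClient(conn)

	// Discover the driver name via Identity.GetPluginInfo.
	info, err := identity.GetPluginInfo(ctx, &csi.GetPluginInfoRequest{})
	if err != nil {
		return fmt.Errorf("GetPluginInfo failed: %w", err)
	}

	// Check health via Identity.Probe instead of any Node RPC.
	resp, err := identity.Probe(ctx, &csi.ProbeRequest{})
	if err != nil {
		return fmt.Errorf("probing driver %s failed: %w", info.GetName(), err)
	}
	if r := resp.GetReady(); r != nil && !r.GetValue() {
		return fmt.Errorf("driver %s reports not ready", info.GetName())
	}
	return nil
}

Depending only on Identity.Probe keeps the same check valid for both node and controller deployments of a driver.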

memory leak on release-0.4

livenessprobe creates a new gRPC connection for every request and never closes it:

func getCSIConnection() (connection.CSIConnection, error) {
// Connect to CSI.
glog.Infof("Attempting to open a gRPC connection with: %s", *csiAddress)
csiConn, err := connection.NewConnection(*csiAddress, *connectionTimeout)
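A small self-contained sketch of the fix pattern this implies (illustrative; the real code goes through the csi-lib-utils connection package, so the names here are assumptions): whatever opens a connection per request must also close it.

package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func probeOnce(addr string) error {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	// Without this Close, every /healthz request leaks a gRPC client
	// connection (plus its goroutines and buffers), matching the steady
	// memory growth reported here and in the v2.1.0 issue above.
	defer conn.Close()

	// ...issue the CSI Probe call over conn here...
	return nil
}

func main() {
	_ = probeOnce("unix:///csi/csi.sock")
}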

Filtering out the health check begin/succeeded logs

Sending probe request to CSI driver "efs.csi.aws.com"
Health check succeeded

Currently these are noisy and I wish to filter them out. Would there be any opposition to adding a V(5) to these log lines so that one can use something like --v=3 to filter them out?
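A small sketch of the requested change (illustrative, not the actual diff): gating the per-probe lines behind klog.V(5) means they are emitted at -v=5 but suppressed at -v=3.

package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil) // registers -v and the other klog flags
	flag.Parse()

	driverName := "efs.csi.aws.com"

	// Visible only when running with -v=5 or higher.
	klog.V(5).Infof("Sending probe request to CSI driver %q", driverName)
	klog.V(5).Info("Health check succeeded")
	klog.Flush()
}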

support JSON log format for liveness-probe

  • Provide a flag to enable JSON log formatting.

We do this today for CSI driver implementations, but the same can't be configured for liveness-probe. It could be done by adding a log-format-json=true flag to liveness-probe and setting the log format based on it.
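One hedged sketch of how such a flag could work, assuming the sidecar keeps logging through klog: when --log-format-json is set, route klog output through a JSON-emitting logr backend (zap via zapr here). This illustrates the idea only, not the project's actual mechanism.

package main

import (
	"flag"

	"github.com/go-logr/zapr"
	"go.uber.org/zap"
	"k8s.io/klog/v2"
)

func main() {
	logFormatJSON := flag.Bool("log-format-json", false, "emit logs in JSON format")
	klog.InitFlags(nil)
	flag.Parse()

	if *logFormatJSON {
		// zap's production config writes structured JSON to stderr.
		zapLog, err := zap.NewProduction()
		if err != nil {
			klog.Fatalf("failed to build JSON logger: %v", err)
		}
		klog.SetLogger(zapr.NewLogger(zapLog))
	}

	klog.InfoS("livenessprobe starting", "logFormatJSON", *logFormatJSON)
	klog.Flush()
}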

need version compatibility for windows 2022

In the repo's Dockerfile.Windows, the base/addon image is the Windows "1809" image.

Images created this way do not seem to work on Windows Server 2022 due to version compatibility issues.

CIS issues in mcr.microsoft.com/oss/kubernetes-csi/livenessprobe:v2.10.0

Our Qualys scans are showing the following docker CIS benchmark issues for mcr.microsoft.com/oss/kubernetes-csi/livenessprobe:v2.10.0

  • SERIOUS: Status of the ADD instructions in Dockerfile. Remediation: use COPY rather than ADD instructions in Dockerfiles.
  • MEDIUM: Status of the HEALTHCHECK setting for the Docker images. Remediation: follow Docker documentation and rebuild the Docker images with a HEALTHCHECK instruction.

livenessprobe 0.4.2 image

The latest image for 0.4.1 doesn't include the following fix: #18

Can we cut a 0.4.2 CSI livenessprobe image which has the above fix?

New csi-lib-utils/connection.Connect logic can cause permanent CSI plugin outage

The livenessprobe code expects to try forever to connect with the CSI plugin via csi.sock on startup.

csiConn, err := acquireConnection(context.Background(), metricsManager)
if err != nil {
	// connlib should retry forever so a returned error should mean
	// the grpc client is misconfigured rather than an error on the network
	klog.Fatalf("failed to establish connection to CSI driver: %v", err)
}

However, this commit recently picked up a change in csi-lib-utils that returns an error after only 30 seconds.

According to the associated PR, the goal was to avoid a deadlock in which node-driver-registrar failed permanently to connect to a CSI plugin because it was referencing an old file descriptor.

In this analysis, I described a situation in which this new behavior caused a permanent outage of the Longhorn CSI plugin. Details are there, but essentially:

  • The CSI plugin fails to start for an ephemeral reason and enters a CrashLoopBackOff.
  • livenessprobe fails to connect and enters a CrashLoopBackOff.
  • Eventually, the CSI plugin can start successfully. Since livenessprobe is not running at that time, kubelet kills the plugin, increasing its backoff.
  • Every time livenessprobe starts, the CSI plugin is waiting in backoff, so livenessprobe crashes, increasing the backoff.

IMO, livenessprobe's previous behavior was correct. It should not crash unless it is misconfigured, so that it is always available to answer kubelet's liveness probes.

Assuming the csi-lib-utils change was necessary, my thinking is that we should recognize the timeout error in livenessprobe and ignore it during initialization. However, I'm not sure I understand the exact cause of kubernetes-csi/csi-lib-utils#131. Maybe a similar issue could leave the liveness probe stuck permanently in initialization?
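For illustration, a rough sketch of that idea (hypothetical, untested): keep retrying the initial connection and treat a connection timeout as transient during startup instead of fatal. acquireConnection and metricsManager refer to livenessprobe's own code quoted above; detecting the timeout via context.DeadlineExceeded is an assumption about what csi-lib-utils returns.

var csiConn *grpc.ClientConn
for {
	var err error
	csiConn, err = acquireConnection(context.Background(), metricsManager)
	if err == nil {
		break // connected; continue with normal startup
	}
	if errors.Is(err, context.DeadlineExceeded) {
		// The library now gives up after ~30s; during initialization, retry
		// instead of crashing so kubelet's probes can still be answered later.
		klog.Warningf("timed out connecting to CSI driver, retrying: %v", err)
		continue
	}
	// Anything else still looks like misconfiguration.
	klog.Fatalf("failed to establish connection to CSI driver: %v", err)
}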

cc @ConnorJC3 from the csi-lib-utils PR for any thoughts.
