airwallex / k8s-pod-restart-info-collector
Automated troubleshooting of Kubernetes Pods issues. Collect K8s pod restart reasons, logs, and events automatically.
Hey folks,
I've modified the code a bit for our use case (fairly large gaming company). How should I go about sending it back to you so everything stays in sync?
Motivation -
Some companies would prefer to send Slack alerts only for specific applications. For example, I may only be interested in failing pods that belong to critical applications, the ones for which we send "on-call alerts". Everything else can be ignored. We have no option to do that right now.
What's done?
In the Helm values.yaml, users can now supply the labels they want monitored. A new function, "NewControllerWithLabels", does everything "NewController" does, except that it only sends a message to Slack if the restarting pod has that label key on it.
This bypasses the "ignoredNamespace" and "ignoredPod" functions and relies only on the label key supplied in values.yaml.
This way, users can
I am still testing it out in our environment. I am not sure how to proceed: should I send the code back as a PR, and is there somebody I can review it with?
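The label check described above could look roughly like the sketch below. This is not the code from the modified fork: the function name shouldAlert, the watchedLabelKeys parameter, and the "oncall-alert" label key are all hypothetical, and treating an empty list as "alert on everything" is an assumption.

```go
package main

import "fmt"

// shouldAlert reports whether a restarting pod should trigger a Slack message.
// watchedLabelKeys is a hypothetical list that would be loaded from values.yaml.
// Assumption: an empty list means label filtering is off, so every pod qualifies.
func shouldAlert(podLabels map[string]string, watchedLabelKeys []string) bool {
	if len(watchedLabelKeys) == 0 {
		return true
	}
	for _, key := range watchedLabelKeys {
		// Only the presence of the key matters, not its value.
		if _, ok := podLabels[key]; ok {
			return true
		}
	}
	return false
}

func main() {
	watched := []string{"oncall-alert"} // hypothetical label key
	critical := map[string]string{"app": "payments", "oncall-alert": "true"}
	batch := map[string]string{"app": "nightly-report"}

	fmt.Println(shouldAlert(critical, watched)) // true
	fmt.Println(shouldAlert(batch, watched))    // false
}
```

Matching on the key alone keeps the values.yaml entry short; matching on key and value would need a map instead of a list.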
Hi,
I run a cluster that has a policy engine on it that forbids insecure pods/containers.
Currently there is a way to define a pod security context, but not a container security context.
Can we add this in please? It just needs to be a new line in the container spec.
This is what I require:
podSecurityContext:
  runAsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop: ["ALL"]
While Slack is most definitely the best thing since sliced bread, our corporate IT does not let us use it. :(
Can we please add support for some of the other major chat clients?
etc
In some production environments you are not allowed to post application logs to Slack,
because they may contain secrets or personal data.
Can we also integrate with ES or MS Teams?
Hi,
At the moment, it is possible to override the Slack channel via a label/annotation on your application.
Only legacy Slack webhooks can post to multiple channels with one webhook URL via the channel parameter.
Slack apps and "modern" incoming webhooks map each URL to a specific channel, so there is no way to override it.
Would it be difficult to add some kind of option (a better word would be mapping) that maps the annotation (or Slack channel)
alert-slack-channel: "your-slack-channel-name"
to specific webhook URL in the config?
something like:
WebhookURLMapping:
  default: https://hooks.foo.bar/xxxxxxx
  my_second_channel: https://hooks.foo.bar/yyyyyyyy
I understand that I can use legacy webhooks for now, but they will stop working eventually.
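One way the requested lookup might behave, as a minimal sketch: the function resolveWebhookURL and the "default" fallback key are assumptions taken from the example mapping above, not an existing option of the collector.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultKey is the mapping entry used when a channel has no dedicated URL.
// The key name follows the example mapping in the request; it is an assumption.
const defaultKey = "default"

// resolveWebhookURL returns the webhook URL for the channel named in the pod's
// alert-slack-channel annotation, falling back to the default entry.
func resolveWebhookURL(mapping map[string]string, channel string) (string, error) {
	if url, ok := mapping[channel]; ok {
		return url, nil
	}
	if url, ok := mapping[defaultKey]; ok {
		return url, nil
	}
	return "", errors.New("no webhook URL for channel and no default configured")
}

func main() {
	mapping := map[string]string{
		"default":           "https://hooks.foo.bar/xxxxxxx",
		"my_second_channel": "https://hooks.foo.bar/yyyyyyyy",
	}

	url, _ := resolveWebhookURL(mapping, "my_second_channel")
	fmt.Println(url) // https://hooks.foo.bar/yyyyyyyy

	url, _ = resolveWebhookURL(mapping, "unmapped-channel")
	fmt.Println(url) // https://hooks.foo.bar/xxxxxxx
}
```

Falling back to the default URL keeps existing single-webhook setups working when a pod carries an annotation that has no mapping entry.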
After the installation completed, testing ran into this problem:
Sending to Slack channel failed with failed to post webhook: Post "https://hooks.slack.com/services/***": x509: certificate is valid for *.github.com...
From what I have read, it may be a time zone problem, but there is no way to change the time zone in the devopsairwallex/k8s-pod-restart-info-collector container.
I also have no permission to enter the container and cannot install any tools. Is there another solution, or is there an image that allows sudo?
I0102 03:32:06.934618 1 controller.go:69] Ignore: metallb-system/speaker-6mwfj restartCount: 7714 > 30
I0102 03:32:12.356664 1 controller.go:64] Update: metallb-system/speaker-6mwfj
┌──[[email protected]]-[~/ansible/hook]
└─$date
Tue Jan  3 11:00:37 CST 2023
/ # date -s "22:12:00"
date: can't set date: Operation not permitted
Tue Jan 3 22:12:00 UTC 2023
/ # date
Tue Jan 3 03:12:24 UTC 2023
/ #
/ # apk add -U tzdata
fetch https://mirrors.aliyun.com/alpine/v3.15/main/x86_64/APKINDEX.tar.gz
SSL certificate subject doesn't match host mirrors.aliyun.com
ERROR: https://mirrors.aliyun.com/alpine/v3.15/main: Permission denied
.....
Currently there is no way to define custom envvars in the helm chart in https://github.com/airwallex/k8s-pod-restart-info-collector/blob/master/helm/templates/deployment.yaml#L33-L50
Use case: I have to set some additional envvars e.g. HTTP_PROXY and want to do this without forking the chart.
Hi Team,
I want to integrate alerts with Google Chat instead of Slack.
Slack is not used in my organisation, so I would like a solution for Google Chat integration.
Thanks in Advance
Vishal
Thanks for the great tool!
In my case there are some system pods from DaemonSets which get expected restarts while the Node is still being initialized.
It would be useful to have a way to ignore a set of namespaces or, even better, specific pods via a label selector.
Heya - quick one - can we have a release of the latest master please?
Want to use the regex functionality in watchNamespaces but it requires a new version.
Thanks!
Does it support Teams?
Currently this tool sends the last 50 lines of logs. If the log is dominated by health check entries, the actual error message falls outside those 50 lines and we cannot use it to resolve the issue.
Hi, is there a specific reason why pods with a restart count larger than 30 are ignored?
Relevant code - https://github.com/airwallex/k8s-pod-restart-info-collector/blob/master/controller.go#L67
Could we make this configurable?
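A configurable threshold might be read as in the sketch below. It keeps the current value of 30 as the default; the environment variable name MAX_RESTART_COUNT and both function names are hypothetical, not existing settings of the collector.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultMaxRestartCount matches the limit of 30 mentioned above.
const defaultMaxRestartCount = 30

// parseMaxRestartCount converts a raw setting into a threshold.
// Empty, non-numeric, or negative input falls back to the default.
func parseMaxRestartCount(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return defaultMaxRestartCount
	}
	return n
}

// shouldIgnore reports whether a pod has restarted more often than the limit.
func shouldIgnore(restartCount, limit int) bool {
	return restartCount > limit
}

func main() {
	// MAX_RESTART_COUNT is a hypothetical variable name used for illustration.
	limit := parseMaxRestartCount(os.Getenv("MAX_RESTART_COUNT"))
	fmt.Println(shouldIgnore(7714, limit))
}
```

Falling back to 30 on bad input means a typo in the setting leaves today's behavior unchanged instead of disabling the check.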
Thanks!
Hi All,
I tried to install k8s-pod-restart-info-collector through Helm on ARM-based worker nodes,
but I am getting an error like this:
exec /k8s-pod-restart-info-collector: exec format error
Is there any workaround to run it on ARM-based nodes?
Within the values.yaml
I would like to be able to reference a pre-existing secret to define my variables (I set them via a secrets manager for encryption/gitops). Something like how Grafana allows it:
existingSecret: "kube-prometheus-stack-grafana"
Thanks for making this open-source, nice work!
I didn't see this was available as a Helm Chart Repository. I've added it to my repository if this helps someone:
https://github.com/reefland/helm-charts/tree/main/charts/apps/pod-restart-info-collector
helm repo add reefland https://reefland.github.io/helm-charts
helm repo update
helm install pod-restart-info-collector reefland/k8s-pod-restart-info-collector
I also created an ArgoCD application deployment, which references the above Helm Repository:
https://github.com/reefland/ansible-k3s-argocd-renovate/tree/master/_extra_apps/pod-restart-info-collector
The Slack WebHook works perfectly with other Slack-compatible services. I used it with Mattermost.
We can add this blog to Readme later.
https://blogs.halodoc.io/troubleshooting-kubernetes-pod-restarts-with-info-collector/
Hi
I think watchNamespaces would be more usable than ignoreNamespaces. Whenever we add or drop a namespace, we have to update the collector's config. Could you add a mode in which the collector watches only the listed namespaces?
Or one in which the collector works without a ClusterRole, in a single namespace only.
If there are multiple containers in a pod
Currently it misses init container restarts; it would be great to monitor them as well.
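A sketch of how init containers could be included when comparing restart counts. The local types stand in for the InitContainerStatuses, ContainerStatuses, Name, and RestartCount fields of k8s.io/api/core/v1 so the example runs without dependencies; restartedContainers is a hypothetical helper, not code from the collector.

```go
package main

import "fmt"

// containerStatus and podStatus are local stand-ins for the identically named
// fields in k8s.io/api/core/v1, kept minimal for illustration.
type containerStatus struct {
	Name         string
	RestartCount int32
}

type podStatus struct {
	InitContainerStatuses []containerStatus
	ContainerStatuses     []containerStatus
}

// allStatuses returns init and regular container statuses as one slice.
func allStatuses(s podStatus) []containerStatus {
	out := make([]containerStatus, 0, len(s.InitContainerStatuses)+len(s.ContainerStatuses))
	out = append(out, s.InitContainerStatuses...)
	return append(out, s.ContainerStatuses...)
}

// restartedContainers lists containers, init containers included, whose
// restart count grew between the old and new pod status.
func restartedContainers(oldStatus, newStatus podStatus) []string {
	previous := make(map[string]int32)
	for _, c := range allStatuses(oldStatus) {
		previous[c.Name] = c.RestartCount
	}
	var restarted []string
	for _, c := range allStatuses(newStatus) {
		if c.RestartCount > previous[c.Name] {
			restarted = append(restarted, c.Name)
		}
	}
	return restarted
}

func main() {
	oldStatus := podStatus{
		InitContainerStatuses: []containerStatus{{Name: "migrate", RestartCount: 0}},
		ContainerStatuses:     []containerStatus{{Name: "app", RestartCount: 2}},
	}
	newStatus := podStatus{
		InitContainerStatuses: []containerStatus{{Name: "migrate", RestartCount: 1}},
		ContainerStatuses:     []containerStatus{{Name: "app", RestartCount: 2}},
	}
	fmt.Println(restartedContainers(oldStatus, newStatus)) // [migrate]
}
```

Keying by name alone works because container names are unique within a pod, init containers included.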
Hello,
I am seeing a lot of slack messages from the service for restarts with exit code 0. Can we get an option to disable posting for exit code 0?
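The requested switch could be as small as the sketch below. The option name ignoreExitCodeZero and the function shouldPost are hypothetical; the collector has no such setting as far as this request indicates.

```go
package main

import "fmt"

// shouldPost decides whether a restart is worth a Slack message.
// ignoreExitCodeZero is a hypothetical option; when set, restarts whose last
// termination exit code was 0 are skipped.
func shouldPost(exitCode int32, ignoreExitCodeZero bool) bool {
	if ignoreExitCodeZero && exitCode == 0 {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldPost(0, true))   // false
	fmt.Println(shouldPost(137, true)) // true
	fmt.Println(shouldPost(0, false))  // true
}
```

Defaulting the option to false would keep current behavior for users who still want to see clean exits.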
Thank you.