ocp4-dwatch

Watch for D-state processes on Red Hat OpenShift 4 clusters and log them.

About

The ocp4-dwatch utility runs as a DaemonSet across your OpenShift 4.X cluster and logs any occurrences of D-state processes. It does this by monitoring /dev/kmsg and by setting sysctls that make the kernel produce more logging. It prints stacks and logs, as appropriate, to the pod logs.

It runs as a privileged container in a DaemonSet, with full access to /proc and /dev/kmsg, and with root access to the node. It runs in its own security context, which tries to be as unintrusive as possible.

Each pod log comes from a different node, where the pod tracks occurrences of hung tasks and their respective stacks, along with timestamps.
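As a rough illustration of what detecting a hung task in the kernel log can look like, the sketch below filters kernel log text for the standard hung-task report line ("INFO: task NAME:PID blocked for more than N seconds.") and prints the PID, task name, and timeout. The function name hung_task_pids is hypothetical and is not taken from the ocp4-dwatch source; the real tool may parse the log differently.

```shell
# Hypothetical helper, not part of ocp4-dwatch: read kernel log text on stdin
# and print "PID NAME SECONDS" for every hung-task report line found.
hung_task_pids() {
    sed -nE 's/.*INFO: task (.+):([0-9]+) blocked for more than ([0-9]+) seconds.*/\2 \1 \3/p'
}
```

On a node, something like `dmesg | hung_task_pids` would list the tasks the kernel has already reported as blocked.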

Additional input to this project is always welcome, but we have to be careful not to over-utilize node resources while tracking. The original form of this project frequently read /proc/sched_debug and printed /proc/$PID/stack, which could cause soft lockups from time to time if performed too frequently and made things far worse. Be careful when testing and adding commands to each data-capture loop, and consider their cost to the system wherever possible.
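One possible way to keep a capture loop cheap is to cap how many stacks it reads per pass. The sketch below is an assumption about how a contributor might do this, not how ocp4-dwatch does it: print_stacks_capped is a hypothetical name, and the /hostproc fallback simply mirrors the documented PROCPATH default.

```shell
# Hypothetical guard, not part of ocp4-dwatch: read PIDs from stdin and print
# the kernel stack of at most $1 of them, so one loop cannot flood the node.
print_stacks_capped() {
    max="$1"
    count=0
    while read -r pid; do
        if [ "$count" -ge "$max" ]; then
            break
        fi
        count=$((count + 1))
        # The task may have exited already; ignore a missing stack file.
        cat "${PROCPATH:-/hostproc}/$pid/stack" 2>/dev/null || true
    done
}
```

For example, `printf '%s\n' 1234 5678 | print_stacks_capped 1` reads only the first PID's stack and skips the rest.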

This is a debugging tool provided without warranty. It is not supported by Red Hat or any other official source. Please use at your own risk.

Deploy on a Red Hat OpenShift 4.X cluster

  • Assuming you have set values as desired in the ocp4-dwatch-deploy.yaml file, create resources with:
$ oc create -f ocp4-dwatch-deploy.yaml
namespace/ocp4-dwatch created
serviceaccount/ocp4-dwatch-sa created
securitycontextconstraints.security.openshift.io/ocp4-dwatch-scc created
daemonset.apps/ocp4-dwatch-ds created
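Once the DaemonSet is running, each node's findings are in its pod log. The helper below is a small sketch for reading them all in one go. It assumes the pods carry the app=ocp4-dwatch label (the same label the removal command selects on) and live in the ocp4-dwatch namespace; the function name dwatch_logs and its arguments are hypothetical.

```shell
# Hypothetical helper: print the tail of every ocp4-dwatch pod log, one
# section per pod. $1 = namespace (default ocp4-dwatch), $2 = lines per pod.
dwatch_logs() {
    dwatch_ns="${1:-ocp4-dwatch}"
    dwatch_tail="${2:-20}"
    # "-o name" prints one "pod/<name>" entry per line.
    for dwatch_pod in $(oc get pods -n "$dwatch_ns" -l app=ocp4-dwatch -o name); do
        echo "== $dwatch_pod"
        oc logs -n "$dwatch_ns" "$dwatch_pod" --tail="$dwatch_tail"
    done
}
```

Running `oc get pods -n ocp4-dwatch -o wide` alongside this shows which node each pod, and therefore each log section, belongs to.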

Remove from a Red Hat OpenShift 4.X cluster

  • To remove all resources relating to ocp4-dwatch on your cluster, assuming you deployed with the above instructions, run:
$ oc delete all -l=app=ocp4-dwatch; oc delete scc ocp4-dwatch-scc; oc delete sa ocp4-dwatch-sa; oc delete project ocp4-dwatch
pod "ocp4-dwatch-ds-2q9c4" deleted
pod "ocp4-dwatch-ds-nc74k" deleted
securitycontextconstraints.security.openshift.io "ocp4-dwatch-scc" deleted
serviceaccount "ocp4-dwatch-sa" deleted
project.project.openshift.io "ocp4-dwatch" deleted

Deployment Options

A number of environment variables exist to change the behavior of ocp4-dwatch. They are listed below.

  • PROCPATH: The path inside the container at which the host's /proc is bind-mounted. Defaults to /hostproc to avoid conflict with /proc within the container namespace.

  • INTERVAL: The amount of time between checks of dmesg for D-state events. Defaults to 60 seconds.

  • KERNEL_HUNG_TASK_WARNINGS: The value to set kernel.hung_task_warnings to on the entire node. As defined by the kernel documentation:

The maximum number of warnings to report. During a check interval if a hung task is detected, this value is decreased by 1. When this value reaches 0, no more warnings will be reported. This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.

-1: report an infinite number of warnings.

  • KERNEL_HUNG_TASK_TIMEOUT: The value to set kernel.hung_task_timeout_secs to on the entire node. As defined by the kernel documentation:

When a task in D state did not get scheduled for more than this value report a warning. This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.

0: means infinite timeout - no checking done. Possible values to set are in range {0..LONG_MAX/HZ}.
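To make the relationship between these variables and the node's sysctls concrete, here is a minimal sketch of how a script could apply them through the bind-mounted host /proc. It is an assumption for illustration only: apply_hung_task_sysctls is a hypothetical name, and the actual ocp4-dwatch script may set the sysctls another way. The sysctl file names under sys/kernel are the standard kernel ones.

```shell
# Sketch only: the real script may apply these settings differently.
PROCPATH="${PROCPATH:-/hostproc}"   # documented default
INTERVAL="${INTERVAL:-60}"          # documented default, in seconds

apply_hung_task_sysctls() {
    # Write each value only if the corresponding variable is set and non-empty,
    # so the node's existing setting is left alone otherwise.
    if [ -n "${KERNEL_HUNG_TASK_WARNINGS:-}" ]; then
        printf '%s\n' "$KERNEL_HUNG_TASK_WARNINGS" > "$PROCPATH/sys/kernel/hung_task_warnings"
    fi
    if [ -n "${KERNEL_HUNG_TASK_TIMEOUT:-}" ]; then
        printf '%s\n' "$KERNEL_HUNG_TASK_TIMEOUT" > "$PROCPATH/sys/kernel/hung_task_timeout_secs"
    fi
}
```

Because the values are written to the host's /proc, they affect every process on the node, not just the container. On a running cluster, `oc set env daemonset/ocp4-dwatch-ds -n ocp4-dwatch INTERVAL=30` is one way to change a value after deployment, assuming the DaemonSet reads these variables from its pod spec.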

License

Copyright 2020 Robert Thomas Manes <[email protected]>

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
