process-exporter's Introduction

process-exporter

Prometheus exporter that mines /proc to report on selected processes.


Some apps are impractical to instrument directly, either because you don't control the code or they're written in a language that isn't easy to instrument with Prometheus. We must instead resort to mining /proc.

Installation

Either grab a package for your OS from the Releases page, or install via docker.

Running

Usage:

  process-exporter [options] -config.path filename.yml

or via docker:

  docker run -d --rm -p 9256:9256 --privileged -v /proc:/host/proc -v `pwd`:/config ncabatoff/process-exporter --procfs /host/proc -config.path /config/filename.yml

Important options (run process-exporter --help for full list):

-children (default:true) makes it so that any process that otherwise isn't part of its own group becomes part of the first group found (if any) when walking the process tree upwards. In other words, resource usage of subprocesses is added to their parent's usage unless the subprocess identifies as a different group name.

-threads (default:true) means that metrics will be broken down by thread name as well as group name.

-recheck (default:false) means that on each scrape the process names are re-evaluated. This is disabled by default as an optimization, but since processes can change their names, a process may end up in the wrong group if we happen to see it for the first time before it has assumed its proper name. You can use -recheck-with-time-limit to enable re-evaluation only for a specific duration after a process starts.

-procnames is intended as a quick alternative to using a config file. Details in the following section.

To disable any of these options, pass -option=false, e.g. -children=false.
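
Once the exporter is running, Prometheus needs a scrape job pointing at it. A minimal scrape-config sketch (the job name and target host are illustrative; 9256 is the exporter's default port):

scrape_configs:
  - job_name: process-exporter
    static_configs:
      - targets: ['myhost:9256']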

Configuration and group naming

To select and group the processes to monitor, either provide command-line arguments or use a YAML configuration file.

The recommended option is to use a config file via -config.path, but for convenience and backwards compatibility the -procnames/-namemapping options exist as an alternative.

Using a config file

The general format of the -config.path YAML file is a top-level process_names section, containing a list of name matchers:

process_names:
  - matcher1
  - matcher2
  ...
  - matcherN

The default config shipped with the deb/rpm packages is:

process_names:
  - name: "{{.Comm}}"
    cmdline:
    - '.+'

A process may only belong to one group: even if multiple items would match, the first one listed in the file wins.
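
For example, in this illustrative config the specific matcher must be listed before the catch-all; if the order were reversed, every process would match the catch-all first:

process_names:
  # Listed first: processes whose argv[0] basename is "prometheus"
  # get their own group (named by the default {{.ExeBase}} template).
  - exe:
    - prometheus
  # Catch-all: everything else is grouped by its comm value.
  - name: "{{.Comm}}"
    cmdline:
    - '.+'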

(Side note: to avoid confusion with the cmdline YAML element, we'll refer to the command-line arguments of a process /proc/<pid>/cmdline as the array argv[].)

Using a config file: group name

Each item in process_names gives a recipe for identifying and naming processes. The optional name tag defines a template to use to name matching processes; if not specified, name defaults to {{.ExeBase}}.

Template variables available:

  • {{.Comm}} contains the basename of the original executable, i.e. 2nd field in /proc/<pid>/stat
  • {{.ExeBase}} contains the basename of the executable
  • {{.ExeFull}} contains the fully qualified path of the executable
  • {{.Username}} contains the username of the effective user
  • {{.Matches}} map contains all the matches resulting from applying cmdline regexps
  • {{.PID}} contains the PID of the process. Note that using PID means the group will only contain a single process.
  • {{.StartTime}} contains the start time of the process. This can be useful in conjunction with PID because PIDs get reused over time.
  • {{.Cgroups}} contains (if supported) the cgroups of the process (/proc/self/cgroup). This is particularly useful for identifying to which container a process belongs.

Using PID or StartTime is discouraged: this is almost never what you want, and is likely to result in high cardinality metrics which Prometheus will have trouble with.
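
As an illustrative sketch (not one of the shipped configs), template variables can be combined, e.g. to group every process by executable name and effective user. Keep in mind this multiplies cardinality by the number of distinct users:

process_names:
  - name: "{{.Comm}}:{{.Username}}"
    cmdline:
    - '.+'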

Using a config file: process selectors

Each item in process_names must contain one or more selectors (comm, exe or cmdline); if more than one selector is present, they must all match. Each selector is a list of strings to match against a process's comm, argv[0], or in the case of cmdline, a regexp to apply to the command line. The cmdline regexp uses the Go syntax.

For comm and exe, the list of strings is an OR, meaning any process matching any of the strings will be added to the item's group.

For cmdline, the list of regexes is an AND, meaning they all must match. Any capturing groups in a regexp must use the ?P<name> option to assign a name to the capture, which is used to populate .Matches.

Performance tip: give an exe or comm clause in addition to any cmdline clause, so you avoid executing the regexp when the executable name doesn't match.


process_names:
  # comm is the second field of /proc/<pid>/stat minus parens.
  # It is the base executable name, truncated at 15 chars.
  # It cannot be modified by the program, unlike exe.
  - comm:
    - bash

  # exe is argv[0]. If no slashes, only basename of argv[0] need match.
  # If exe contains slashes, argv[0] must match exactly.
  - exe:
    - postgres
    - /usr/local/bin/prometheus

  # cmdline is a list of regexps applied to argv.
  # Each must match, and any captures are added to the .Matches map.
  - name: "{{.ExeFull}}:{{.Matches.Cfgfile}}"
    exe:
    - /usr/local/bin/process-exporter
    cmdline:
    - -config.path\s+(?P<Cfgfile>\S+)

Here's the config I use on my home machine:


process_names:
  - comm:
    - chromium-browse
    - bash
    - prometheus
    - gvim
  - exe:
    - /sbin/upstart
    cmdline:
    - --user
    name: upstart:-user

Using -procnames/-namemapping instead of config.path

Every name in the procnames list becomes a process group. The default name of a process is the value found in the second field of /proc/<pid>/stat ("comm"), which is truncated at 15 chars. Usually this is the same as the name of the executable.

If -namemapping isn't provided, every process with a comm value present in -procnames is assigned to a group based on that name, and any other processes are ignored.

The -namemapping option is a comma-separated list of alternating name,regexp values. It allows assigning a name to a process based on a combination of the process name and command line. For example, using

-namemapping "python2,([^/]+).py,java,-jar\s+([^/]+).jar"

will make it so that each different python2 and java -jar invocation is tracked with distinct metrics. Processes whose remapped name is absent from the procnames list will be ignored. On an Ubuntu Xenial machine being used as a workstation, here's a good way of tracking resource usage for a few different key user apps:

process-exporter -namemapping "upstart,(--user)" \
  -procnames chromium-browse,bash,gvim,prometheus,process-exporter,upstart:-user

Since upstart --user is the parent process of the X11 session, this will make all apps started by the user fall into the group named "upstart:-user", unless they're one of the others named explicitly with -procnames, like gvim.

Group Metrics

There's no meaningful way to define a group that will only ever contain a single process, so process-exporter assumes that every metric is attached to a group of processes - not a process group in the technical sense, just one or more processes that meet a configuration's specification of what should be monitored and how to name it.

All these metrics start with namedprocess_namegroup_ and have at minimum the label groupname.

num_procs gauge

Number of processes in this group.

cpu_seconds_total counter

CPU usage based on /proc/[pid]/stat fields utime(14) and stime(15) i.e. user and system time. This is similar to the node_exporter's node_cpu_seconds_total.
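
As with node_cpu_seconds_total, you would normally graph or record a rate of this counter rather than the raw value. A minimal recording-rule sketch (the rule name and 5m window are illustrative):

groups:
  - name: process-exporter-cpu
    rules:
      - record: groupname:namedprocess_namegroup_cpu:rate5m
        expr: sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m]))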

read_bytes_total counter

Bytes read based on /proc/[pid]/io field read_bytes. The man page says

Attempt to count the number of bytes which this process really did cause to be fetched from the storage layer. This is accurate for block-backed filesystems.

but I would take it with a grain of salt.

As /proc/[pid]/io is readable only by the process's own user (see #137), you should run process-exporter either as that user or as root to get these values. Otherwise they can't be read and the metric will be a constant 0.

write_bytes_total counter

Bytes written based on /proc/[pid]/io field write_bytes. As with read_bytes, somewhat dubious. May be useful for isolating which processes are doing the most I/O, but probably not measuring just how much I/O is happening.

major_page_faults_total counter

Number of major page faults based on /proc/[pid]/stat field majflt(12).

minor_page_faults_total counter

Number of minor page faults based on /proc/[pid]/stat field minflt(10).

context_switches_total counter

Number of context switches based on /proc/[pid]/status fields voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The extra label ctxswitchtype can have two values: voluntary and nonvoluntary.

memory_bytes gauge

Number of bytes of memory used. The extra label memtype can have three values:

resident: Field rss(24) from /proc/[pid]/stat, whose doc says:

This is just the pages which count toward text, data, or stack space. This does not include pages which have not been demand-loaded in, or which are swapped out.

virtual: Field vsize(23) from /proc/[pid]/stat, virtual memory size.

swapped: Field VmSwap from /proc/[pid]/status, translated from KB to bytes.

If gathering of the smaps file is enabled, two additional values for memtype are added:

proportionalResident: Sum of "Pss" fields from /proc/[pid]/smaps, whose doc says:

The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it.

proportionalSwapped: Sum of "SwapPss" fields from /proc/[pid]/smaps

open_filedesc gauge

Number of file descriptors, based on counting how many entries are in the directory /proc/[pid]/fd.

worst_fd_ratio gauge

Worst ratio of open filedescs to filedesc limit, amongst all the procs in the group. The limit is the fd soft limit based on /proc/[pid]/limits.

Normally Prometheus metrics ought to be as "basic" as possible (i.e. the raw values rather than a derived ratio), but we use a ratio here because nothing else makes sense. Suppose there are 10 procs in a given group, each with a soft limit of 4096; one of them has 4000 open fds and the others each have 40. Their total fd count is 4360 and their total soft limit is 40960, so the summed ratio is only about 0.1, yet one of the procs is about to run out of fds. With worst_fd_ratio we can see this: in the above example it would be 0.97, rather than the 0.10 you'd get by computing sum(open_filedesc) / sum(limit_filedesc).
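
A hypothetical alerting rule built on this metric (the 0.9 threshold and 5m duration are arbitrary choices for illustration):

groups:
  - name: process-exporter-fds
    rules:
      - alert: ProcessNearFdLimit
        expr: namedprocess_namegroup_worst_fd_ratio > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'Group {{ $labels.groupname }} has a process using over 90% of its fd soft limit'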

oldest_start_time_seconds gauge

Epoch time (seconds since 1970/1/1) at which the oldest process in the group started. This is derived from field starttime(22) from /proc/[pid]/stat, added to boot time to make it relative to epoch.
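
One common way to turn this into an approximate "uptime" is to subtract it from the current time; for example, a recording-rule sketch (the rule name is illustrative):

groups:
  - name: process-exporter-uptime
    rules:
      - record: groupname:namedprocess_namegroup_uptime:seconds
        expr: time() - namedprocess_namegroup_oldest_start_time_seconds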

num_threads gauge

Sum of the number of threads of all processes in the group, based on field num_threads(20) from /proc/[pid]/stat.

states gauge

Number of threads in the group in each of various states, based on the field state(3) from /proc/[pid]/stat.

The extra label state can have these values: Running, Sleeping, Waiting, Zombie, Other.

Group Thread Metrics

Since publishing thread metrics adds a lot of overhead, use -threads=false to disable them if necessary.

All these metrics start with namedprocess_namegroup_ and have at minimum the labels groupname and threadname. threadname is field comm(2) from /proc/[pid]/stat. Just as groupname breaks the set of processes down into groups, threadname breaks a given process group down into subgroups.

thread_count gauge

Number of threads in this thread subgroup.

thread_cpu_seconds_total counter

Same as cpu_seconds_total, but broken down per thread subgroup; the label cpumode is used to distinguish between user and system time.

thread_io_bytes_total counter

Same as read_bytes_total and write_bytes_total, but broken down per-thread subgroup. Unlike read_bytes_total/write_bytes_total, the label iomode is used to distinguish between read and write bytes.

thread_major_page_faults_total counter

Same as major_page_faults_total, but broken down per-thread subgroup.

thread_minor_page_faults_total counter

Same as minor_page_faults_total, but broken down per-thread subgroup.

thread_context_switches_total counter

Same as context_switches_total, but broken down per-thread subgroup.

Instrumentation cost

process-exporter will consume CPU in proportion to the number of processes in the system and the rate at which new ones are created. The most expensive parts - applying regexps and executing templates - are only applied once per process seen, unless the command-line option -recheck is provided.

If you have mostly long-running processes, process-exporter's overhead should be minimal: each time a scrape occurs, it will parse /proc/$pid/stat and /proc/$pid/cmdline for every process being monitored and add up a few numbers.

Dashboards

An example Grafana dashboard to view the metrics is available at https://grafana.net/dashboards/249

Building

Requires at least Go 1.21 installed.

make

Exposing metrics through HTTPS

web-config.yml

# Minimal TLS configuration example. Additionally, a certificate and a key file
# are needed.
tls_server_config:
  cert_file: server.crt
  key_file: server.key

Running

$ ./process-exporter -web.config.file web-config.yml &
$ curl -sk https://localhost:9256/metrics | grep process

# HELP namedprocess_scrape_errors general scrape errors: no proc metrics collected during a cycle
# TYPE namedprocess_scrape_errors counter
namedprocess_scrape_errors 0
# HELP namedprocess_scrape_partial_errors incremented each time a tracked proc's metrics collection fails partially, e.g. unreadable I/O stats
# TYPE namedprocess_scrape_partial_errors counter
namedprocess_scrape_partial_errors 0
# HELP namedprocess_scrape_procread_errors incremented each time a proc's metrics collection fails
# TYPE namedprocess_scrape_procread_errors counter
namedprocess_scrape_procread_errors 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.21
# HELP process_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which process_exporter was built.
# TYPE process_exporter_build_info gauge
process_exporter_build_info{branch="",goversion="go1.17.3",revision="",version=""} 1
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 10

For further information about TLS configuration, see the exporter-toolkit documentation.

process-exporter's People

Contributors

aidaip, c0va23, davidone, dependabot[bot], dmaganto, flixr, hoffie, itsx, klausman, knweiss, leitzler, maxi-mega, ncabatoff, pierref, rajatvig, rfratto, styxman, superq, szymonpk, timn


process-exporter's Issues

version latest, 0.3.1 & 0.3.2 are broken

  • Version used: docker pull ncabatoff/process-exporter

  • Even though 0.3.1 and 0.3.2 are listed as prereleases on the GitHub release page,
    the latest tag on Docker Hub points to 0.3.2.

  • Unfortunately 0.3.x are broken.

  • reproduction:

root@schulung:~# docker run --rm -i ncabatoff/process-exporter:0.2.12
2018/08/10 11:01:17 Reading metrics from /proc for procnames: []
root@schulung:~# docker run --rm -i ncabatoff/process-exporter:0.3.1 
standard_init_linux.go:178: exec user process caused "no such file or directory"
root@schulung:~# docker run --rm -i ncabatoff/process-exporter:0.3.2 
standard_init_linux.go:178: exec user process caused "no such file or directory"
root@schulung:~# docker run --rm -i ncabatoff/process-exporter:latest
standard_init_linux.go:178: exec user process caused "no such file or directory"

[feature request] Support for more process and thread metrics

Hi,

First of all many thanks for this great exporter!

I've been using it a lot and started to rely on it more and more for process-level visibility. Over time, I've found that I need a couple more metrics that can be really handy in performance analysis. To cover such needs, I had to resort to other tools such as atop or pidstat; however, I then clearly lose the benefit of having a centralized time series database like Prometheus.

Here is my current wishlist:

  • Process CPU time
    Total CPU user time
    Total CPU system time
  • Memory
    Peak resident size
  • Context switches:
    Number of voluntary context switches
    Number of involuntary context switches
  • Page faults:
    Number of minor faults
    Number of major faults
  • Process threads:
    Number of threads in state 'running' (R)
    Number of threads in state 'interruptible sleeping' (S)
    Number of threads in state 'uninterruptible sleeping' (D)
  • Waiting channel:
    Number of threads waiting on a specific wchan

Most of them come from the usual /proc/PID/stat files, while others require visiting process threads via /proc/PID/task.

Do you think they can be added to the process-exporter?

Thank you in advance

Detecting Ruby processes by parameter, RegEx in `namemapping` not working

Hi, first of all, thanks for your tool.
I tried to use process-exporter to monitor my Ruby process.
Here is the /proc/pid/cmdline of my Ruby process:

[medal@a5 3325]$ cat cmdline
unicorn worker[0] -c /opt/work/batman/current/config/unicorn/alpha.rb -E deployment -D

and here is the command used to run process-exporter:
./process-exporter -procnames ruby -namemapping "ruby,^unicorn\s.+batman.+alpha.rb\s.+" -once-to-stdout
The regex matches unicorn worker[0] -c /opt/work/batman/current/config/unicorn/alpha.rb -E deployment -D,
but I couldn't find any monitoring data for the Ruby process, so where is my mistake?

V 0.3.9 Error reading metrics for pid

Hi. I downloaded the docker image and the rpm file for Linux amd64 and saw this error. It occurs with any configuration file, including the standard one:
16:48:57 error reading metrics for {Pid:31819 StartTimeRel:1435632}: line 25 from status file without ':': 16:48:57 error reading metrics for {Pid:31820 StartTimeRel:1435632}: line 25 from status file without ':': 16:48:57 error reading metrics for {Pid:31821 StartTimeRel:1435632}: line 25 from status file without ':':
In version 0.3.7 everything works.

process-exporter omits some processes

I have the problem that process-exporter quite often misses some processes even though they are specified in the yml file (as comm).

The missing process is actually running, and e.g. running process-exporter -config.path foo.yml -once-to-stdout shows the process correctly.
Restarting process-exporter also works.

Also I'm not getting any scrape errors...

Any ideas on what is going on here?

Report on thread states

Split off from #16. New metrics desired:

Process threads:

  • Number of threads in state 'running' (R)
  • Number of threads in state 'interruptible sleeping' (S)
  • Number of threads in state 'uninterruptible sleeping' (D)

VERSION is inconsistent

The VERSION file contains "0.1.0" and was last updated 2 years ago. The latest release seems to be 0.2.12.

Intermittent test failure: TestTrackerBasic

This test sometimes fails with output like:

--- FAIL: TestTrackerBasic (0.00s)
	tracker_test.go:45: 2: update differs: (-got +want)
		{[]proc.Update}[0].GroupName:
			-: "g2"
			+: "g4"
		{[]proc.Update}[0].Start:
			-: s"1970-01-01 00:00:02 +0000 UTC"
			+: s"1970-01-01 00:00:03 +0000 UTC"
		{[]proc.Update}[1].GroupName:
			-: "g4"
			+: "g2"
		{[]proc.Update}[1].Start:
			-: s"1970-01-01 00:00:03 +0000 UTC"
			+: s"1970-01-01 00:00:02 +0000 UTC"

namedprocess_namegroup_read_bytes_total seems to be always zero

Hi,
it seems namedprocess_namegroup_read_bytes_total is always 0,
while namedprocess_namegroup_write_bytes_total looks OK.

Using latest version 0.4.0

# uname -a
Linux driver 3.8.0-44-generic #66~precise1-Ubuntu SMP Tue Jul 15 04:01:04 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Configuration:

process_names:
  - exe: 
    - /opt/java/jdk1.8.0_92/bin/java
    cmdline: 
    - com.connecty.collector.DataCollector
    name: collector
  - exe:
    - /opt/java/jdk1.8.0_92/bin/java
    cmdline:
    - -Dcatalina.base=/var/lib/tomcat/fm
    name: fm
  - exe:
    - /opt/java/jdk1.8.0_92/bin/java
    cmdline:
    - -Dcatalina.base=/var/lib/tomcat/mk3
    name: mk3
  - exe:
    - node
    name: node
  - exe:
    - /usr/bin/redis-server
    name: redis

report thread stats as well

The things I'm monitoring are often processes with quite a few threads, and the most interesting thing is actually not how many resources (CPU) each process needs, but how many the (named) threads in the process need.

It would be great if the per thread cpu usage (/proc/<pid>/task/<tid>/stat) would be reported as well.
The format is the same as the process stat file, so it seems that the procfs iterator could "just" be applied to the task dir of each process pid as well.

Provide Ubuntu .deb packages

The problem:

When I run the command from the bash command line:
./process-exporter --web.listen-address=:39256 --procnames=bash
The process exporter works fine and provides information about the bash process.

However, when I use a systemd unit file on Ubuntu 16.04:

[Unit]
Description=Prometheus process-exporter
Wants=basic.target
After=basic.target network.target

[Service]
EnvironmentFile=/etc/environment
User=process_exporter
Group=prometheus
ExecStart=/usr/local/bin/process-exporter \
--web.listen-address=:39256 \
--procnames=bash

ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=always

[Install]
WantedBy=multi-user.target

The process starts, but it fails to report any information about the bash process (or any other specified process, for that matter).

How to replicate:

Put the process-exporter binary into /usr/local/bin and execute it with the same command found above.
Then create a systemd unit, paste in the above code block, run systemctl daemon-reload, and run service process-exporter start.

Resolution attempts:

  1. Tried to use a config file argument; the config file argument gets ignored both on the command line and through systemd.
  2. Tried not to use the web.listen-address flag.
  3. Tried changing which procnames to mine for.

Mapping process according to users of the same process

My top example is as follows,

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13168 ahfvw0d1 30 10 498m 35m 20m S 0.0 0.2 0:03.49 php-cgi
8859 realnoni 30 10 495m 33m 20m S 0.0 0.2 0:11.27 php-cgi
6590 asjzdiwq 30 10 495m 32m 20m S 0.0 0.2 0:13.34 php-cgi
5657 holeyrai 30 10 495m 31m 19m S 0.0 0.2 0:04.47 php-cgi
14480 ripplecr 30 10 498m 31m 17m S 0.0 0.2 0:02.90 php-cgi
14442 ripplecr 30 10 497m 31m 17m S 0.0 0.2 0:02.00 php-cgi
10720 computer 30 10 496m 31m 18m S 0.0 0.2 0:08.75 php-cgi
16126 realnoni 30 10 484m 30m 18m S 0.0 0.2 0:12.08 php-cgi
31561 a0w4pkbp 30 10 496m 30m 17m S 0.0 0.2 0:03.54 php-cgi
31565 ahfvw0d1 30 10 484m 29m 17m S 0.0 0.2 0:05.80 php-cgi
21275 asjzdiwq 30 10 484m 29m 18m S 0.0 0.2 0:01.77 php-cgi

I'm using the YAML file approach to configure which kinds of processes to monitor, as such:

process_names:
  - comm:
    - httpd
    - php-cgi
    - memcached
    - node_exporter
  - exe:
    - /usr/bin/php-cgi
  - name: "{{.ExeFull}}:{{.Matches.Cfgfile}}"
    exe:
    - /usr/local/bin/process_exporter
    cmdline:
    - -config.path\s+(?P<Cfgfile>\S+)

What I'm getting in return is a summed value for php-cgi instead of a value per individual user:
summedphpcgi

How do I go about separating php-cgi to get a breakdown by user?

[ARM] no metrics extracted

I'm having a weird problem on my Nvidia Jetson TK1 board (ARMv7) with kernel 3.10.40.

The process-exporter runs without any errors, but there are no named processes collected.

I tried with a "catch-all" yml file

process_names:
  - cmdline: 
    - .+

as well as simply ./process-exporter -procnames bash -once-to-stdout

There are no errors in the build/test:

>> formatting code
>> vetting code
>> fetching promu
>> building binaries
 >   process-exporter
>> running short tests
?   	github.com/ncabatoff/process-exporter	[no test files]
?   	github.com/ncabatoff/process-exporter/cmd/process-exporter	[no test files]
ok  	github.com/ncabatoff/process-exporter/config	0.011s
ok  	github.com/ncabatoff/process-exporter/proc	0.115s

Scrape errors are zero; see the metrics output.

cpu statistics: add dimensions with mode="user|system" instead of separate counter names?

The cpu metrics were recently split from namedprocess_namegroup_cpu_seconds_total into namedprocess_namegroup_cpu_user_seconds_total and namedprocess_namegroup_cpu_system_seconds_total.
I'm wondering if it wouldn't be nicer to expose them under the old name and add an extra dimension with mode instead (basically like the node_exporter).
Then it would be backwards compatible and easier to extend if we wanted to expose other "modes" later as well...

Error Building Docker Image

Might just be a n00b mistake, but I'm getting this after a fresh clone on OSX:

โฏ docker --version
Docker version 17.09.0-ce, build afdb6d4

โฏ make docker
>> building docker image
Sending build context to Docker daemon  12.48MB
Step 1/6 : FROM golang
 ---> 59c0da3fc7cc
Step 2/6 : ADD . /go/src/github.com/ncabatoff/process-exporter
 ---> c3f532ec62a2
Step 3/6 : RUN make -C /go/src/github.com/ncabatoff/process-exporter
 ---> Running in 385e87743d56
make: Entering directory '/go/src/github.com/ncabatoff/process-exporter'
>> formatting code
>> vetting code
>> fetching promu
>> building binaries
 >   process-exporter
>> running short tests
?   	github.com/ncabatoff/process-exporter	[no test files]
?   	github.com/ncabatoff/process-exporter/cmd/process-exporter	[no test files]
ok  	github.com/ncabatoff/process-exporter/config	0.007s
--- FAIL: TestMissingIo (0.00s)
	read_test.go:160: got 0, want 1
FAIL
FAIL	github.com/ncabatoff/process-exporter/proc	0.007s
make: *** [test] Error 1
Makefile:36: recipe for target 'test' failed
make: Leaving directory '/go/src/github.com/ncabatoff/process-exporter'
The command '/bin/sh -c make -C /go/src/github.com/ncabatoff/process-exporter' returned a non-zero code: 2
make: *** [docker] Error 2

can't build binary from source

When cloning the source and then running ./build_static.sh results in error of missing dependencies:

cd cmd/process-exporter; CGO_ENABLED=0 go build -o ../../process-exporter -a -tags netgo
main.go:12:2: cannot find package "github.com/ncabatoff/fakescraper" in any of:
	/usr/lib/go/src/github.com/ncabatoff/fakescraper (from $GOROOT)
	/gopath/src/github.com/ncabatoff/fakescraper (from $GOPATH)
../../proc/grouper.go:6:2: cannot find package "github.com/ncabatoff/go-seq/seq" in any of:
	/usr/lib/go/src/github.com/ncabatoff/go-seq/seq (from $GOROOT)
	/gopath/src/github.com/ncabatoff/go-seq/seq (from $GOPATH)
../../proc/read.go:10:2: cannot find package "github.com/ncabatoff/procfs" in any of:
	/usr/lib/go/src/github.com/ncabatoff/procfs (from $GOROOT)
	/gopath/src/github.com/ncabatoff/procfs (from $GOPATH)
main.go:16:2: cannot find package "github.com/prometheus/client_golang/prometheus" in any of:
	/usr/lib/go/src/github.com/prometheus/client_golang/prometheus (from $GOROOT)
	/gopath/src/github.com/prometheus/client_golang/prometheus (from $GOPATH)
../../config/config.go:14:2: cannot find package "gopkg.in/yaml.v2" in any of:
	/usr/lib/go/src/gopkg.in/yaml.v2 (from $GOROOT)
	/gopath/src/gopkg.in/yaml.v2 (from $GOPATH)
make: *** [Makefile:29: build] Error 1

I've updated the build_static.sh file and added a go get ./... after the GOPATH export, this seems to have resolved the missing dependencies but unfortunately it lead to another error:

proc/read.go:129:12: undefined: procfs.ProcStatus

Some processes are not found by the comm: param

This is the config all.yml used:

process_names:

  - exe:
    - clickhouse-server
    - clickhouse-client
    - python

  - comm:
    - postmaster

Processes:

ps aux | grep postgres
postgres  3201  0.0  0.0 187028  4316 ?        S    19:36   0:00 /usr/pgsql-9.6/bin/postmaster -D /u01/postgres
postgres  3203  0.0  0.0 184904  1412 ?        Ss   19:36   0:00 postgres: logger process                      
postgres  3205  0.0  0.0 187160  1944 ?        Ss   19:36   0:00 postgres: checkpointer process                
postgres  3206  0.0  0.0 187028  1580 ?        Ss   19:36   0:00 postgres: writer process                      
postgres  3207  0.0  0.0 187028  1636 ?        Ss   19:36   0:00 postgres: wal writer process                  
postgres  3208  0.0  0.0 188140  2416 ?        Ss   19:36   0:00 postgres: autovacuum launcher process         
postgres  3209  0.0  0.0 191168  5236 ?        Ss   19:36   0:00 postgres: stats collector process             
pgagent   3235  0.0  0.0 232020  2616 ?        S    19:36   0:00 /usr/bin/pgagent-9.6 -s /var/log/pgagent-9.6.log hostaddr= 127.0.0.1 dbname= postgres user= postgres port= 5433
postgres  3237  0.0  0.0 193420  8008 ?        Ss   19:36   0:00 postgres: postgres postgres 127.0.0.1(36327) idle
root      5337  0.0  0.0 103328   896 pts/4    S+   20:12   0:00 grep postgres

Result in prometheus
hmm

Why are only 2 postgres processes found by the comm param? (They all have postmaster in their comm file.)

I'm using process-exporter-0.3.2.linux-386 on RHEL 6.
Thanks!

.ExeFull should be the fully-qualified path to the executable but isn't

My intent when I introduced the templating system for naming groups was that .ExeFull would yield the fully-qualified path to the executable. In practice it was implemented by looking at /proc/<pid>/cmdline (a NULL-delimited list of strings which corresponds to argv from the process's perspective) and taking the first element of the list. .ExeBase is then the basename of that.

I knew that a process can write to its argv to change what ps reports, but I somehow didn't think about the implications for .ExeFull while I was implementing it. If the process decides to change its argv then we're at the mercy of what it writes. It may even introduce spaces, which would never occur naturally since they should be replaced by NULLs.

I'm reluctant to change the existing behaviour because it may break existing configurations. One idea I had was to introduce a new 'exe' label which would have the value in /proc/<pid>/exe, but we can't assume that all members of a namegroup would share the same executable. It'll have to be a new field in the structure given to the template engine. Unfortunately .ExeFull is what it really ought to be called, but as I say, I don't want to break existing setups by changing its behaviour. I'm open to naming suggestions. If I don't hear anything better I'll go with .RealExe.

Inconsistent Handling of the Process Threads

It looks like different data is handled inconsistently for threads belonging to the process

for namedprocess_namegroup_cpu_user_seconds_total the value seems to include all the threads, same for namedprocess_namegroup_read_bytes_total however namedprocess_namegroup_context_switches_total seems to only report value from the main thread which is not meaningful value for threaded processes.

Process matches are weird

Hi, first of all thanks for this tool - I really appreciate your work.
I have process-exporter running on Ubuntu 16 and the number of matched processes doesn't seem to add up to the numbers I would expect.

I am using the following config.yml file:

process_names:
  - name: "Docker Container Prozesse"
    comm:
    - docker-containerd
    cmdline:
    - docker-containerd
    - unix
    - -l

(... more config follows, but everything there works as expected.)

I try to find the matched processes in this way:
ps -C docker-containerd | wc -l
20

But the exporter keeps reporting a far higher number of processes; is this a bug?
namedprocess_namegroup_num_procs{groupname="Docker Container Prozesse"} 34

If I can provide any more information that you need, just tell me.

Just to be more precise, I need to monitor this process (CMDLINE):
docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc
I don't need to keep track of any of the other processes, which mostly look like this:
docker-containerd-shim 4f646407d3139634c0a0a080946d06782f1a32dcf528a131d88bb9f10ce8d936 /var/run/docker/libcontainerd/4f646407d3139634c0a0a080946d06782f1a32dcf528a131d88bb9f10ce8d936 docker-runc

Thanks!

crash on multiple exe: match

I usually have just one mono process, but today I started another one and process-exporter crashed. Here are the processes:

# ps ax | grep mono
30225 pts/15   Sl+    0:01 mono wdmrc.exe -p 10801
32293 ?        Sl     1:40 mono --verify-all /usr/lib64/keepass/KeePass.exe

My config:

process_names:
  - comm:
    - bash
    - dropbox
    - firefox
    - goldendict
    - grafana-server
    - mc
    - mutt
    - mysqld
    - node-exporter
    - pidgin
    - process-exporter
    - prometheus
    - rtorrent main
    - urxvt
    - vi
    - VirtualBox

  - name: KeePass
    exe:
    - mono
    cmdline:
    - KeePass

Panic:

2017/02/06 01:37:14 Reading metrics from /proc based on "config.yml"
panic: runtime error: index out of range

goroutine 1 [running]:
panic(0x7b1ba0, 0xc420014090)
	/usr/lib/go/src/runtime/panic.go:500 +0x1a1
github.com/ncabatoff/process-exporter/config.(*cmdlineMatcher).Match(0xc42012e9c0, 0xc420380ca8, 0x4, 0xc420360c80, 0x4, 0x4, 0xc4202fc001)
	/home/powerman/gocode/src/github.com/ncabatoff/process-exporter/config/config.go:118 +0x1ed
github.com/ncabatoff/process-exporter/config.andMatcher.Match(0xc42012e9e0, 0x2, 0x2, 0xc420380ca8, 0x4, 0xc420360c80, 0x4, 0x4, 0xc4202fc000)
	/home/powerman/gocode/src/github.com/ncabatoff/process-exporter/config/config.go:126 +0x74
github.com/ncabatoff/process-exporter/config.(*matchNamer).MatchAndName(0xc42012ea00, 0xc420380ca8, 0x4, 0xc420360c80, 0x4, 0x4, 0x0, 0x0, 0x0)
	/home/powerman/gocode/src/github.com/ncabatoff/process-exporter/config/config.go:68 +0x85
github.com/ncabatoff/process-exporter/config.FirstMatcher.MatchAndName(0xc42012ea20, 0x2, 0x2, 0xc420380ca8, 0x4, 0xc420360c80, 0x4, 0x4, 0x9ffa40, 0x0, ...)
	/home/powerman/gocode/src/github.com/ncabatoff/process-exporter/config/config.go:60 +0x7f
github.com/ncabatoff/process-exporter/config.(*FirstMatcher).MatchAndName(0xc42012ea60, 0xc420380ca8, 0x4, 0xc420360c80, 0x4, 0x4, 0x0, 0x0, 0x0)
	<autogenerated>:1 +0x8b
github.com/ncabatoff/process-exporter/proc.(*Grouper).Update(0xc420128f90, 0x9e6660, 0xc4200572c0, 0x0, 0xc420129020, 0x50)
	/home/powerman/gocode/src/github.com/ncabatoff/process-exporter/proc/grouper.go:85 +0x295
main.NewProcessCollector(0x81b465, 0x5, 0xc42012ea01, 0x9e0680, 0xc42012ea60, 0x18, 0xc420130080, 0xc420125ad0)
	/home/powerman/gocode/src/github.com/ncabatoff/process-exporter/cmd/process-exporter/main.go:274 +0x2e3
main.main()
	/home/powerman/gocode/src/github.com/ncabatoff/process-exporter/cmd/process-exporter/main.go:218 +0x6b5

Process Exporter Default Binding in DEB package

Now we have a Debian package and startup script, which is great!

The problem, however, is that it changes the default from the previous behaviour to bind to localhost only:

root@mysql8:~# ps aux | grep process
root 15556 0.0 0.9 13260 9292 ? Ssl 15:31 0:00 /usr/bin/process-exporter --config.path /etc/process-exporter/all.yaml --web.listen-address=127.0.0.1:9256

This makes it not overly useful by default, as it is unlikely you have the Prometheus server running on the same host.

Report Busiest Thread in terms of CPU Usage

Process Exporter already reports the worst FD Usage Ratio for processes in the group

namedprocess_namegroup_worst_fd_ratio{groupname="agetty"} 0.00390625

It would be great to apply the same approach to get metrics for the busiest CPU thread in the group. This is helpful for understanding when a particular thread is CPU bound.

OkMeter has a good description of this idea:
https://okmeter.io/docs/processes

how to monitor both parent and child

I've got urxvt running bash running rtorrent. I'd like to monitor all 3 apps, but, of course, I don't want rtorrent's I/O counted against bash or urxvt instead of rtorrent. Is this possible?

Add VmSwap metric for process memory

Currently virtual and resident memory per process are reported. It would be great to also report the VmSwap amount, so it is easier to troubleshoot processes that have been swapped out.

Report on context switches

Split off from #16. New metrics desired:

Context switches:

  • Number of voluntary context switches
  • Number of involuntary context switches

These should be reported both per-group and, if -threads is enabled, per-threadname.

package naming convention

Hello,
Thank-you very much for that exporter.
I have two remarks to improve the binaries packaging.

The underscore separator in the 0.2.5 package name creates an inconsistency with other exporters (and applications):

elasticsearch_exporter-1.0.2.linux-amd64.ta
hadoop_exporter-1.0.linux-amd64.tar.gz
node_exporter-0.15.2.linux-amd64.tar.gz
process-exporter-0.1.0.linux-amd64.tar.gz
process-exporter_0.2.5_linux_amd64.tar.gz       <----<<< _0.2.5_linux_

I don't mind much, as it is just a convention, but it had a side effect on an Ansible deployment role: the role builds the package name from the version given as a parameter, so I will probably not be the only one to face this issue.

Another formal change is the package content.
The files used to be stored in a directory with the same name as the archive; now they go into the current directory:

$ tar tzf process-exporter-0.1.0.linux-amd64.tar.gz
process-exporter-0.1.0.linux-amd64/
process-exporter-0.1.0.linux-amd64/LICENSE
process-exporter-0.1.0.linux-amd64/process-exporter

$ tar tzf process-exporter_0.2.5_linux_amd64.tar.gz
LICENSE
README.md
cmd/process-exporter

This also makes it a bit more complicated for deployment scripts to match any version.

Question about monitoring multiple instances

Hi there, I'm trying to watch multiple Java processes but am failing at this task; could you please help me?
My Java processes look like this:

bbes 53784  0.2  5.2 4681488 320296 ?      Sl   Apr06   2:31 /opt/java/jdk/bin/java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/opt/atlassian/bitbucket/elasticsearch -cp /opt/atlassian/bitbucket/elasticsearch/lib/elasticsearch-2.3.1.jar:/opt/atlassian/bitbucket/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -d -p /var/atlassian/application-data/stash/shared/search/elasticsearch.pid -Dpath.conf=/var/atlassian/application-data/stash/shared/search -Dpath.logs=/var/atlassian/application-data/stash/log/search -Dpath.data=/var/atlassian/application-data/stash/shared/search/data
bb 53839  4.1 45.6 6593284 2796664 ?     Sl   Apr06  44:14 /opt/java/jdk/bin/java -Djava.util.logging.config.file=/opt/atlassian/bitbucket/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms2560m -Xmx2560m -XX:+UseG1GC -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Datlassian.standalone=BITBUCKET -Dorg.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER=true -Dmail.mime.decodeparameters=true -Dorg.apache.catalina.connector.Response.ENFORCE_ENCODING_IN_GET_WRITER=false -Dcom.sun.jndi.ldap.connect.pool.protocol=plain ssl -Dcom.sun.jndi.ldap.connect.pool.authentication=none simple DIGEST-MD5 -Djava.library.path=/opt/atlassian/bitbucket/lib/native:/var/atlassian/application-data/stash/lib/native -Dbitbucket.home=/var/atlassian/application-data/stash -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Djava.endorsed.dirs=/opt/atlassian/bitbucket/endorsed -classpath /opt/atlassian/bitbucket/bin/bitbucket-bootstrap.jar:/opt/atlassian/bitbucket/bin/bootstrap.jar:/opt/atlassian/bitbucket/bin/tomcat-juli.jar -Dcatalina.base=/opt/atlassian/bitbucket -Dcatalina.home=/opt/atlassian/bitbucket -Djava.io.tmpdir=/opt/atlassian/bitbucket/temp com.atlassian.stash.internal.catalina.startup.Bootstrap start
daemon     594  0.9  1.3 5587072 432556 ?      Sl   Mar29 125:54 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -classpath /opt/atlassian/confluence/temp/synchrony-standalone8170942077011237673.jar:/opt/atlassian/confluence/confluence/WEB-INF/lib/h2-1.3.176.jar -Xss2048k -Xmx1g synchrony.core sql
daemon     621  0.2  1.7 4834780 591276 ?      Sl   Mar29  27:36 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -classpath /opt/atlassian/confluence/temp/synchrony-standalone8438142208099368562.jar:/opt/atlassian/confluence/confluence/WEB-INF/lib/h2-1.3.176.jar -Xss2048k -Xmx1g synchrony.core sql
daemon    1435  0.4  1.7 6399608 587060 ?      Sl   Mar29  58:39 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -classpath /opt/atlassian/confluence/temp/synchrony-standalone3177267881670344377.jar:/opt/atlassian/confluence/confluence/WEB-INF/lib/h2-1.3.176.jar -Xss2048k -Xmx1g synchrony.core sql
daemon   17673  3.5  8.3 10416536 2736508 ?    Ssl  Mar29 452:41 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.util.logging.config.file=/opt/atlassian/confluence/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dconfluence.context.path= -Dorg.apache.tomcat.websocket.DEFAULT_BUFFER_SIZE=32768 -Dsynchrony.enable.xhr.fallback=true -Xms1024m -Xmx1024m -XX:+UseG1GC -Datlassian.plugins.enable.wait=300 -Djava.awt.headless=true -XX:G1ReservePercent=20 -Xloggc:/opt/atlassian/confluence/logs/gc-2017-03-29_14-17-12.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=2M -XX:-PrintGCDetails -XX:+PrintGCDateStamps -XX:-PrintTenuringDistribution -Xms256m -Xmx4096m -Djava.endorsed.dirs=/opt/atlassian/confluence/endorsed -classpath /opt/atlassian/confluence/bin/bootstrap.jar:/opt/atlassian/confluence/bin/tomcat-juli.jar -Dcatalina.base=/opt/atlassian/confluence -Dcatalina.home=/opt/atlassian/confluence -Djava.io.tmpdir=/opt/atlassian/confluence/temp org.apache.catalina.startup.Bootstrap start
daemon   17676  0.8  8.2 10373748 2724836 ?    Ssl  Mar29 106:51 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.util.logging.config.file=/opt/atlassian/jira/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms384m -Xmx768m -Djava.awt.headless=true -Datlassian.standalone=JIRA -Dorg.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER=true -Dmail.mime.decodeparameters=true -Dorg.dom4j.factory=com.atlassian.core.xml.InterningDocumentFactory -XX:+PrintGCDateStamps -XX:-OmitStackTraceInFastThrow -Djira.home=/var/atlassian/jira -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Xms384m -Xmx4096m -Datlassian.plugins.enable.wait=300 -classpath /opt/atlassian/jira/bin/bootstrap.jar:/opt/atlassian/jira/bin/tomcat-juli.jar -Dcatalina.base=/opt/atlassian/jira -Dcatalina.home=/opt/atlassian/jira -Djava.io.tmpdir=/opt/atlassian/jira/temp org.apache.catalina.startup.Bootstrap start
daemon   17721  4.5  7.0 10484692 2322232 ?    Ssl  Mar29 574:46 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.util.logging.config.file=/opt/atlassian/confluence/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dconfluence.context.path= -Dorg.apache.tomcat.websocket.DEFAULT_BUFFER_SIZE=32768 -Xms1024m -Xmx1024m -XX:+UseG1GC -Datlassian.plugins.enable.wait=300 -Djava.awt.headless=true -XX:G1ReservePercent=20 -Xloggc:/opt/atlassian/confluence/logs/gc-2017-03-29_14-17-12.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=2M -XX:-PrintGCDetails -XX:+PrintGCDateStamps -XX:-PrintTenuringDistribution -Xms384m -Xmx4G -Djava.endorsed.dirs=/opt/atlassian/confluence/endorsed -classpath /opt/atlassian/confluence/bin/bootstrap.jar:/opt/atlassian/confluence/bin/tomcat-juli.jar -Dcatalina.base=/opt/atlassian/confluence -Dcatalina.home=/opt/atlassian/confluence -Djava.io.tmpdir=/opt/atlassian/confluence/temp org.apache.catalina.startup.Bootstrap start
daemon   17876  4.6  5.9 7329312 1955060 ?     Ssl  Mar29 583:35 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Djava.util.logging.config.file=/opt/atlassian/confluence/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Dorg.apache.tomcat.websocket.DEFAULT_BUFFER_SIZE=32768 -Xms1024m -Xmx1024m -XX:+UseG1GC -Datlassian.plugins.enable.wait=300 -Djava.awt.headless=true -XX:G1ReservePercent=20 -Xloggc:/opt/atlassian/confluence/logs/gc-2017-03-29_14-17-13.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=2M -XX:-PrintGCDetails -XX:+PrintGCDateStamps -XX:-PrintTenuringDistribution -Djava.endorsed.dirs=/opt/atlassian/confluence/endorsed -classpath /opt/atlassian/confluence/bin/bootstrap.jar:/opt/atlassian/confluence/bin/tomcat-juli.jar -Dcatalina.base=/opt/atlassian/confluence -Dcatalina.home=/opt/atlassian/confluence -Djava.io.tmpdir=/opt/atlassian/confluence/temp org.apache.catalina.startup.Bootstrap start

Meanwhile, in the man page I see this statement:

  The -namemapping option allows assigning a group name based on a combination of
  the process name and command line.  For example, using 

    -namemapping "python2,([^/]+\.py),java,-jar\s+([^/]+).jar)"

which actually does not work:

$ ./process-exporter -namemapping "java,-jar\s+([^/]+).jar)"                                                                                        
2017/04/07 12:03:51 Error parsing -namemapping argument 'java,-jar\s+([^/]+).jar)': error compiling regexp '-jar\s+([^/]+).jar)': error parsing regexp: unexpected ): `-jar\s+([^/]+).jar)`
$ ./process-exporter -namemapping "python2,([^/]+\.py),java,-jar\s+([^/]+).jar)"
2017/04/07 12:03:59 Error parsing -namemapping argument 'python2,([^/]+\.py),java,-jar\s+([^/]+).jar)': error compiling regexp '-jar\s+([^/]+).jar)': error parsing regexp: unexpected ): `-jar\s+([^/]+).jar)`

So maybe a good way to map these processes would be something like:

  - name: "elasticsearch_2_3_1" #<which would be show up like {groupname="elasticsearch_2_3_1"}
    exe:
    - /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
    cmdline:
    - .*-cp /opt/atlassian/stash/4.9.1/elasticsearch/lib/elasticsearch-2.3.1.jar.*

Create histograms from per-thread counters

With -threads enabled, we already have CPU usage for each thread group. But using threadid as a label is problematic from a cardinality perspective, so we break it down by thread name instead. This works great for some apps, like Chromium, that name their threads. Most apps do not, so it's completely unhelpful.

As a result I'm considering dropping the existing per-thread metrics in favour of a new approach. Since I can't usefully name groups of threads or use ids, I'll settle for characterizing the distribution of threads in a process namegroup. The deltas of the counters for each thread (cpu, io bytes, page faults, context switches) each cycle become histogram observations, e.g.

namedprocess_namegroup_threads_cpuseconds_bucket{le="0.5", cpumode="user", groupname="bash"} = 2

says that there were two threads consuming between 0 and 0.5s of user cpu time in the group named 'bash'.

Figuring out good bucket sizes for each of these that will apply to all or even most workloads may be challenging.

Allow getting a user name label for processes

I recognize it is tricky with flexible group definitions, but I think it would be great to expose the user who owns the process as a label. That would allow aggregating the same data by user, which is helpful for some workloads.

namedprocess_namegroup_cpu_system_seconds_total{groupname="acpid"} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="agetty"} 0
namedprocess_namegroup_cpu_system_seconds_total{groupname="apt-check"} 0

It would be great to get user="root" label added

Possible regression in v0.3.1 amd64 release on Debian testing

I currently use this package on a fairly up-to-date Debian testing amd64 system, with the older process-exporter-0.2.12.linux-amd64 release.

Kernel version is:

$ uname -a
Linux [hostname] 4.17.0-1-amd64 #1 SMP Debian 4.17.8-1 (2018-07-20) x86_64 GNU/Linux

Version 0.2.12 works on this setup with a simple configuration:

process_names:
  - name: "{{.Comm}}"
    cmdline: 
    - '.+'

And this command line:

./process-exporter --config.path foo.yml

Many, many process metrics are then produced, which is fantastic.

$ curl http://localhost:9256/metrics 2> /dev/null | wc -l 
2084

Now if I shut that down and repeat the same process with process-exporter-0.3.1.linux-amd64, I get no metrics.

$ curl http://localhost:9256/metrics 2> /dev/null | wc -l 
129

The output indicates some issues were encountered, but I can't see any debug options which might explain what those issues were:

namedprocess_scrape_errors 0
# HELP namedprocess_scrape_partial_errors incremented each time a tracked proc's metrics collection fails partially, e.g. unreadable I/O stats
# TYPE namedprocess_scrape_partial_errors counter
namedprocess_scrape_partial_errors 0
# HELP namedprocess_scrape_procread_errors incremented each time a proc's metrics collection fails
# TYPE namedprocess_scrape_procread_errors counter
namedprocess_scrape_procread_errors 271

I next attempted to use strace to see if invalid files were being opened.

0.2.12 starts like this:

$ strace ./process-exporter --config.path foo.yml 2>&1 | grep open
openat(AT_FDCWD, "/proc/sys/net/core/somaxconn", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "foo.yml", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc//localtime", O_RDONLY) = 3
openat(AT_FDCWD, "/proc/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/1/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/1/io", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
openat(AT_FDCWD, "/proc/1/fd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
openat(AT_FDCWD, "/proc/1/limits", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/1/cmdline", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/2/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/2/io", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
openat(AT_FDCWD, "/proc/2/fd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
openat(AT_FDCWD, "/proc/2/limits", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/2/cmdline", O_RDONLY|O_CLOEXEC) = 3

While 0.3.1 starts like this:

$ strace ./process-exporter --config.path foo.yml 2>&1 | grep open
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/sys/net/core/somaxconn", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "foo.yml", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc//localtime", O_RDONLY) = 3
openat(AT_FDCWD, "/proc/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/1/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/1/status", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/2/stat", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proc/2/status", O_RDONLY|O_CLOEXEC) = 3

With no new files feeding into the erroneous output, I'm not really sure how to debug further, so I'm sticking with the older version for now. I can post examples of /proc/stat, /proc/1/stat and /proc/1/status, or more complete output, if you think it would be relevant.

As an aside, since #50 is reported to be affected by OS version rather than process-exporter version, I'm assuming these are different issues.

Doesn't work on Ubuntu 18.04

Hello!

I checked several Ubuntu 18.04 machines and the problem looks the same: process-exporter cannot read processes from the /proc directory. This counter shows it:

namedprocess_scrape_procread_errors 121

About OS and configs:
/etc/os-release

NAME="Ubuntu"
VERSION="18.04 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

uname -a

Linux server 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

/etc/process-exporter/all.yaml

process_names:
  - name: "{{.Comm}}"
    cmdline: 
    - '.+'

/lib/systemd/system/process-exporter.service

[Unit]
Description=Process Exporter for Prometheus

[Service]
User=root
Type=simple
ExecStart=/usr/bin/process-exporter --config.path /etc/process-exporter/all.yaml --web.listen-address=:9256
KillMode=process
Restart=always

[Install]
WantedBy=multi-user.target

On Ubuntu 16.04 machines process-exporter works as expected.

Report on per-thread states (Running, Sleeping, etc)

Hi,

Here is the vmstat data I see:

root@rocky:/var/lib/mysql# vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
7 1 0 60208724 249524 13975188 0 0 446 1264 1 0 10 2 84 5 0
9 2 0 60183048 249524 14000796 0 0 45019 71400 28571 120740 11 2 83 4 0
19 0 0 60151924 249532 14028544 0 0 45933 58183 30071 124420 11 2 84 3 0
5 0 0 60129124 249532 14054136 0 0 43394 59277 26325 119453 10 2 85 3 0
8 0 0 60096532 249536 14081696 0 0 47411 73015 30156 124027 12 2 81 5 0
7 0 0 60073072 249536 14108168 0 0 42472 59339 24350 117616 10 2 85 3 0

This shows there is a certain number of processes always runnable. I would expect similar results from process-exporter, yet I see:

image

If I simplify this to prometheus expression:

max_over_time(namedprocess_namegroup_states{instance=~"rocky", state="Running"}[30s])

I still get the value 0 returned for basically all processes:

image

This is:

root@rocky:/var/lib/mysql# uname -a
Linux rocky 4.4.0-124-generic #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I wonder if this is some issue with kernel reporting or with the information capture.

Alerting on missing processes upon startup

Hello,

First of all, thank you for creating and maintaining this project! It's been a big help for us.

We have an alert set up like this to fire if any expected processes are not running:

alert: ProcessNotRunning
expr: namedprocess_namegroup_num_procs
  < 1
for: 1m
labels:
  severity: page
annotations:
  description: '{{ $labels.groupname }} process missing on {{ $labels.instance }}'
  summary: '{{ $labels.groupname }} process on {{ $labels.instance }} has not been
    running for 1 min.'

One thing I just noticed is that if some expected processes are not running at the time I start process-exporter, then I get no metrics about them at all. This means that if an instance restarts, we have to manually check that all expected processes are running.

Wondering if I'm doing something wrong, or if there is another way to go about this (a possible workaround is sketched at the end of this report). Everything works great if the process is already running when I start process-exporter.

Here is how I'm starting process-exporter (using dummy processes top and tail for demo purposes):

  docker run \
   -d \
   --privileged \
   --name process_exporter \
   -v /proc:/host/proc \
   -p 9256:9256 \
   ncabatoff/process-exporter:0.3.9 \
   --procfs /host/proc \
   -procnames top,tail

Thanks!
Dave
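
One possible workaround, sketched here rather than taken from the exporter's documentation: pair the alert above with an absent() rule per expected group, since a group that has never matched produces no series at all. The group names below are just the ones from the -procnames example above; absent() has to be spelled out per group because it only fires when the whole selector returns nothing.

alert: ProcessNeverSeen
expr: absent(namedprocess_namegroup_num_procs{groupname="top"})
  or absent(namedprocess_namegroup_num_procs{groupname="tail"})
for: 1m
labels:
  severity: page
annotations:
  description: 'an expected process group is exporting no metrics at all'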

[feature request] add number of threads

Hi,

This is already immensely useful, but it would be even better if it included the number of threads (which is also available from /proc).

Cheers, Felix

Uptime graph

Currently, the stat oldest_start_time_seconds can be used to display the start time of the (oldest) process in a group.

However, there's no way to use this in a graph and produce meaningful results - the best I can do is something like time()-namedprocess_namegroup_oldest_start_time_seconds{groupname="some_process"} as a SingleStat to show the current "uptime".

I'd like to be able to graph the uptime of a process, but if I use a similar formula for a graph, it would just continually increase, even after a process restarted. I think the only way to get this information would be if process-exporter itself did the calculation and exposed uptime as now() - starttime (field 22 of /proc/<pid>/stat).

If there's a way I can do this in Grafana without changes to process-exporter, that would be awesome!

No values for cpu utilization in grafana.com dashboard

The dashboard shows No data points for CPU utilization:
[screenshot]

When I check /proc/<pid>/stat, as far as I can tell the values are there:
1799 (nginx) S 1 1799 1799 0 -1 4202816 79 0 0 0 0 0 0 0 20 0 1 0 4989 59432960 319 18446744073709551615 93830307962880 93830309116724 140731212184672 140731212183352 140600322262326 0 0 1073745920 402745863 18446744071579500377 0 0 17 0 0 0 0 0 0 93830311217320 93830311354432 93830313832448 140731212189544 140731212189600 140731212189600 140731212189672 0

I am not sure, but I think the correct metrics are gathered from /proc and exposed:

curl localhost:9256/metrics  | grep "^namedprocess_namegroup_cpu_user_seconds_total"

namedprocess_namegroup_cpu_user_seconds_total{groupname="java"} 9.640000000000004
namedprocess_namegroup_cpu_user_seconds_total{groupname="nginx.con"} 0
namedprocess_namegroup_cpu_user_seconds_total{groupname="nginx: worker process"} 89
namedprocess_namegroup_cpu_user_seconds_total{groupname="node_exporter"} 5.109999999999999
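
For reference, CPU utilization panels are usually built from these counters with a rate() over a range, along the lines of the sketch below; the actual query in the grafana.com dashboard may differ (it likely also includes cpu_system_seconds_total and template-variable filters):

rate(namedprocess_namegroup_cpu_user_seconds_total[5m])

If that expression returns data in the Prometheus console, the gap is more likely in the dashboard's query or variables than in the exporter itself.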

Detecting Java processes by parameter, RegEx in `cmdline` not working

I tried to translate a working regex into the shape that your config file needs, but failed. I also tried to "debug" this, but failed as well.

This regex on its own works:
-Dprocess_name=(?P<process>[\w\d_-]*)

I ran it through regex101 and it works as intended.

The same regex in the config does not yield a result:

process_names:
  - comm:
    - bash

  - exe:
    - nginx

  - name: "{{.ExeFull}}:{{.Matches.process}}"
    exe:
    - java
    cmdline:
    - -Dprocess_name=(?P<process>[\\w\\d_-]*)

I also checked with ps; there are enough Java processes that fit this pattern.

In the exported metrics I can see the process, but the process name from the -Dprocess_name argument is missing:

namedprocess_namegroup_cpu_system_seconds_total{groupname="/var/opt/jdk9/bin/java:"} 0.060000000000044906

Where is my mistake?
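
One likely culprit, though it can't be confirmed from the report alone: in plain or single-quoted YAML scalars a backslash is not an escape character, so \\w reaches Go's regexp engine as an escaped literal backslash followed by w, and the named group then captures nothing. Writing the pattern with single backslashes may be all that's needed, e.g.:

process_names:
  - name: "{{.ExeFull}}:{{.Matches.process}}"
    exe:
    - java
    cmdline:
    - '-Dprocess_name=(?P<process>[\w\d_-]*)'

The same consideration applies to the \\s and \\S in the README-derived config further down.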

I also tried your example from the readme (removing the scan for prometheus, postgres and bash):

process_names:
  - name: "{{.ExeFull}}:{{.Matches.Cfgfile}}"
    exe:
    - /usr/local/bin/process-exporter
    cmdline:
    - -config.path\\s+(?P<Cfgfile>\\S+)

Even this (proven?) example does not work. I run process-exporter and can see that the process is running:
root 14818 1.7 0.0 14848 6916 pts/2 Sl 07:42 0:00 ./process-exporter -config.path bsp.yml

When I run curl http://localhost:9256/metrics | grep "^namedprocess" this is all I get:

namedprocess_scrape_errors 0
namedprocess_scrape_partial_errors 0
namedprocess_scrape_procread_errors 0

states vs num_procs discrepancy

I have only one named group, running process-exporter 0.4.0, and I have noticed a strange discrepancy between the num_procs metric (namedprocess_namegroup_num_procs) and the sum of the states metric (namedprocess_namegroup_states{state="Sleeping"}).

num_procs shows 22 processes, which is correct; it matches the number of processes from ps aux ... | wc -l. All processes in the group are sleeping, yet the sum of sleeping processes shows exactly twice as much: 44.

I am not sure why. Is this some kind of bug? Is /proc read twice? Or did I misunderstand this metric?

I have also noticed that if num_procs is 1 it is not multiplied in the states metric; otherwise it is always doubled.

[feature request] Add maximum open file descriptors limit

Hi,

Thanks for this great exporter! I was wondering if it would be possible to add the maximum open file descriptors limit per tracked process, so it can be used together with the open fds metric to alert on the two together.
Unfortunately, I am not very familiar with Go exporters, but from a quick investigation, something similar is implemented in other exporters to monitor their own fds, e.g. https://github.com/prometheus/memcached_exporter/blob/69ad29b59bf4a790785061094a5f1bb0112aec77/vendor/github.com/prometheus/client_golang/prometheus/process_collector.go

// From client_golang's process collector: read the process's soft
// "max open files" limit (via /proc/<pid>/limits) and expose it as a gauge.
if limits, err := p.NewLimits(); err == nil {
	c.maxFDs.Set(float64(limits.OpenFiles))
	ch <- c.maxFDs
}

I would appreciate this addition! Thank you!

Publish docker image?

I couldn't find this on Docker Hub. Would it be possible to publish it there, or to another registry?

Thanks for putting this out there!

Next release

Hi, guys!

Can you share the plan for a new release? Our team needs #17, but so far it has only been merged to master.

The last release was on 16 Nov 2016, and we're not sure whether we can use master in production with the feature we need, or whether we should wait until the next release ships.
