
kayrus / prometheus-kubernetes


Most common Prometheus deployment example with alerts for a Kubernetes cluster

License: GNU General Public License v2.0

Shell 100.00%
prometheus prometheus-configuration kubernetes kubernetes-monitoring prometheus-deployment

prometheus-kubernetes's Introduction

See also the Elasticsearch+Kibana Kubernetes complete example.

Prerequisites

Kubectl

kubectl should be configured.
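
A quick sanity check that kubectl can reach the intended cluster:

kubectl config current-context
kubectl cluster-info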

Namespace

This example uses the monitoring namespace. If you wish to use your own namespace, just export the NAMESPACE=mynamespace environment variable.
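
For example, assuming deploy.sh and the update scripts honor NAMESPACE:

export NAMESPACE=mynamespace
./deploy.sh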

Upload etcd TLS keypair

If you use a TLS keypair and TLS auth for your etcd cluster, put the corresponding TLS keypair into the etcd-tls-client-certs secret:

kubectl --namespace=monitoring create secret generic etcd-tls-client-certs \
  --from-file=ca.pem=/path/to/ca.pem \
  --from-file=client.pem=/path/to/client.pem \
  --from-file=client-key.pem=/path/to/client-key.pem

Otherwise, create a dummy secret:

kubectl --namespace=monitoring create secret generic etcd-tls-client-certs \
  --from-literal=ca.pem=123 \
  --from-literal=client.pem=123 \
  --from-literal=client-key.pem=123
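
To confirm the secret exists and carries all three keys:

kubectl --namespace=monitoring describe secret etcd-tls-client-certs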

Upload Ingress controller server TLS keypairs

In order to provide a secure endpoint available through the Internet, you have to create the example-tls secret inside the monitoring Kubernetes namespace.

kubectl create --namespace=monitoring secret tls example-tls --cert=cert.crt --key=key.key

Detailed information is available here. Ingress manifest example.

Create Ingress basic auth entry

Create it with the internal-services-auth name. More info is here. Ingress manifest example.
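
A minimal sketch of creating that secret, assuming the common nginx Ingress controller convention of an htpasswd file stored under the auth key (the myuser account is illustrative):

htpasswd -c ./auth myuser
kubectl --namespace=monitoring create secret generic internal-services-auth --from-file=auth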

Set proper external URLs to have correct links in notifications

Run the following to deploy Prometheus monitoring configured to use https://my-external-prometheus.example.com as the base URL; otherwise the default value https://prometheus.example.com is used:

EXTERNAL_URL=https://my-external-prometheus.example.com ./deploy.sh

Assumptions

Disk mount points

This repo assumes that your Kubernetes worker nodes contain two observable mount points:

  • root mount point / which is mounted read-only as /root-disk inside the node-exporter pod
  • data mount point /localdata which is mounted read-only as /data-disk inside the node-exporter pod

If you wish to change these values, you have to modify node-exporter-ds.yaml, prometheus-rules/low-disk-space.rules, and grafana-import-dashboards-configmap, and then rebuild the configmap manifests before you run the ./deploy.sh script.
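
For orientation, a hedged sketch of how such host mounts are typically declared in a node-exporter DaemonSet (volume names are illustrative, not necessarily the ones used in node-exporter-ds.yaml):

      containers:
      - name: node-exporter
        volumeMounts:
        - name: root-disk
          mountPath: /root-disk
          readOnly: true
        - name: data-disk
          mountPath: /data-disk
          readOnly: true
      volumes:
      - name: root-disk
        hostPath:
          path: /
      - name: data-disk
        hostPath:
          path: /localdata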

Data storage

This repo uses emptyDir data storage, which means that every pod restart will cause data loss. If you wish to use persistent storage, please modify the corresponding manifests accordingly.
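
A minimal sketch of swapping emptyDir for a PersistentVolumeClaim (the claim name, size, and volume name are assumptions for illustration):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi

Then, in the deployment manifest, replace the emptyDir volume

  volumes:
  - name: data
    emptyDir: {}

with a reference to the claim:

  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: prometheus-data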

Grafana dashboards

Initial Grafana dashboards were taken from this repo and adjusted.

Ingress controller

Example of an Ingress resource providing access from outside:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/auth-realm: Authentication Required
    ingress.kubernetes.io/auth-secret: internal-services-auth
    ingress.kubernetes.io/auth-type: basic
    kubernetes.io/ingress.allow-http: "false"
  name: ingress-monitoring
  namespace: monitoring
spec:
  tls:
  - hosts:
    - prometheus.example.com
    - grafana.example.com
    secretName: example-tls
  rules:
  - host: prometheus.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-svc
          servicePort: 9090
      - path: /alertmanager
        backend:
          serviceName: alertmanager
          servicePort: 9093
  - host: grafana.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
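
Assuming the manifest above is saved as ingress-monitoring.yaml (file name illustrative), apply it with:

kubectl --namespace=monitoring apply -f ingress-monitoring.yaml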

If you still don't have an Ingress controller installed, you can use the manifests from the test_ingress directory for test purposes.

Alerting

Included alert rules

Prometheus alert rules which are already included in this repo:

  • NodeCPUUsage > 50%
  • NodeLowRootDisk > 80% (relates to /root-disk mount point inside node-exporter pod)
  • NodeLowDataDisk > 80% (relates to /data-disk mount point inside node-exporter pod)
  • NodeSwapUsage > 10%
  • NodeMemoryUsage > 75%
  • ESLogsStatus (alerts when Elasticsearch cluster status goes yellow or red)
  • NodeLoadAverage (alerts when the node's load average divided by the number of CPUs exceeds 1)
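
For reference, the repo uses Prometheus' pre-2.0 rule syntax. A hedged sketch of what the NodeLoadAverage rule computes (the expression is illustrative, not quoted verbatim from the repo):

ALERT NodeLoadAverage
  IF (node_load1 / on(instance) count by (instance) (node_cpu{mode="system"})) > 1
  FOR 2m
  LABELS {severity="page"}
  ANNOTATIONS {SUMMARY="{{$labels.instance}}: High load average"}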

Notifications

alertmanager-configmap.yaml contains smtp_* and slack_* settings inside the global section. Adjust them to meet your needs.
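
A hedged sketch of the relevant global keys (values are placeholders; the key names follow the standard Alertmanager configuration):

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX'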

Updating configuration

Prometheus configuration

Update command line parameters

Modify prometheus-deployment.yaml and apply the manifest:

kubectl --namespace=monitoring apply -f prometheus-deployment.yaml

If the deployment manifest was changed, all Prometheus pods will be restarted, with data loss.
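
To watch the rollout, assuming the Deployment object is named prometheus (as in the kubectl output quoted in the issues below):

kubectl --namespace=monitoring rollout status deployment/prometheus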

Update config file

Update prometheus-configmap.yaml or prometheus-rules directory contents and apply them:

./update_prometheus_config.sh
# or
./update_prometheus_rules.sh

These scripts will update the configmaps, wait until the changes are delivered into the pod volume (if the configmap was not changed, the script will wait forever), and reload the configs. You can also reload the configs manually using the commands below:

curl -XPOST --user "%username%:%password%" https://prometheus.example.com/-/reload
# or
kubectl --namespace=monitoring exec $(kubectl --namespace=monitoring get pods -l app=prometheus -o jsonpath={.items..metadata.name}) -- killall -HUP prometheus

Alertmanager configuration

Update command line parameters

Modify alertmanager-deployment.yaml and apply the manifest:

kubectl --namespace=monitoring apply -f alertmanager-deployment.yaml

If the deployment manifest was changed, all Alertmanager pods will be restarted, with data loss.

Update config file

Update alertmanager-configmap.yaml or alertmanager-templates directory contents and apply them:

./update_alertmanager_config.sh
# or
./update_alertmanager_templates.sh

These scripts will update the configmaps, wait until the changes are delivered into the pod volume (if the configmap was not changed, the script will wait forever), and reload the configs. You can also reload the configs manually using the commands below:

curl -XPOST --user "%username%:%password%" https://prometheus.example.com/alertmanager/-/reload
# or
kubectl --namespace=monitoring exec $(kubectl --namespace=monitoring get pods -l app=alertmanager -o jsonpath={.items..metadata.name}) -- killall -HUP alertmanager

Pictures

(Grafana screenshot)


prometheus-kubernetes's Issues

No kubernetes pod metrics

Hello @kayrus,
I found a problem in my environment. I used this repo to deploy Prometheus. Everything is correct except the kubernetes-pod target. In prometheus-configmap.yaml the kubernetes-pod target has been set, but no 'container_xxx' metrics are shown in Prometheus, and no kubernetes-pod target appears on the Prometheus dashboard's targets page. Can you help me?

My Environment

Kubernetes 1.7
Use `kayrus/prometheus-kubernetes` master branch

Problem Picture

(screenshot omitted)

Alert Manager Container Crash

I found an issue with alertmanager-deployment.yaml when using the most up-to-date Alertmanager image.

If anyone else sees this, the pod will deploy and then you will see the following events. See the FIX section below if you need to fix this.

This is fixed for me; I just wanted to leave this for awareness.

Normal Scheduled 46s default-scheduler Successfully assigned alertmanager-6c54bccc56-84k4g to aks-nodepool1-42287233-0
Normal SuccessfulMountVolume 46s kubelet, aks-nodepool1-42287233-0 MountVolume.SetUp succeeded for volume "alertmanager"
Normal SuccessfulMountVolume 46s kubelet, aks-nodepool1-42287233-0 MountVolume.SetUp succeeded for volume "config-volume"
Normal SuccessfulMountVolume 46s kubelet, aks-nodepool1-42287233-0 MountVolume.SetUp succeeded for volume "templates-volume"
Normal SuccessfulMountVolume 46s kubelet, aks-nodepool1-42287233-0 MountVolume.SetUp succeeded for volume "default-token-c658r"
Normal SuccessfulMountVolume 46s kubelet, aks-nodepool1-42287233-0 MountVolume.SetUp succeeded for volume "etcd-tls-client-certs"
Normal Pulling 25s (x3 over 45s) kubelet, aks-nodepool1-42287233-0 pulling image "prom/alertmanager:master"
Normal Pulled 24s (x3 over 40s) kubelet, aks-nodepool1-42287233-0 Successfully pulled image "prom/alertmanager:master"
Normal Created 24s (x3 over 40s) kubelet, aks-nodepool1-42287233-0 Created container
Normal Started 24s (x3 over 40s) kubelet, aks-nodepool1-42287233-0 Started container
Warning BackOff 13s (x4 over 37s) kubelet, aks-nodepool1-42287233-0 Back-off restarting failed container
Warning FailedSync 13s (x4 over 37s) kubelet, aks-nodepool1-42287233-0 Error syncing pod

# FIX

They have changed the syntax of the deployment file to require a "--" prefix on the flags; the spec section of the alertmanager-deployment.yaml file should be as follows:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:master
        args:
        - '--config.file=/etc/alertmanager/config.yml'
        - '--storage.path=/alertmanager'
        - '--web.external-url=$(EXTERNAL_URL)/alertmanager'

It previously was:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:master
        args:
        - '-config.file=/etc/alertmanager/config.yml'
        - '-storage.path=/alertmanager'
        - '-web.external-url=$(EXTERNAL_URL)/alertmanager'

dashboard import api

Hi,
I noticed that the code does some cool stuff to import dashboards via the HTTP API:

              for file in *-dashboard.json ; do
                if [ -e "$file" ] ; then
                  # wrap exported Grafana dashboard into valid json
                  echo "importing $file" &&
                  (echo '{"dashboard":';cat "$file";echo ',"inputs":[{"name":"DS_PROMETHEUS","pluginId":"prometheus","type":"datasource","value":"prometheus"}]}') | curl --silent --fail --show-error \
                    --request POST http://localhost:3000/api/dashboards/import \
                    --header "Content-Type: application/json" \
                    --data-binary @-;
                  echo "" ;
                fi
              done ;

This seems to be an undocumented API; at least I can't find it on the Grafana website. Where can I find more about this API? I've enabled persistent storage and I don't want to import the dashboards every time I bounce the container.

Thanks!

CPU alert works?

I have two questions.

Q1) README.md says

NodeCPUUsage > 50%

While prometheus-rules/cpu-usage.rules is as follows:

IF (100 - (avg by (instance) (irate(node_cpu{name="node-exporter",mode="idle"}[5m])) * 100)) > 75

So the CPU usage threshold is actually 75% rather than 50%, right?

Q2) The label "name" doesn't actually work. In my environment, the node_cpu metric looks as follows:

node_cpu{..., kubernetes_name="prometheus-node-exporter", ...} = 0.15...

So the alert rule's label condition should be as follows, right?

IF (100 - (avg by (instance) (irate(node_cpu{kubernetes_name="node-exporter",mode="idle"}[5m])) * 100)) > 75

node low data disk doesn't show alert

I defined the NodeLowDataDisk rule for testing as follows. NodeLowRootDisk shows an alert, but the data disk rule doesn't:
ALERT NodeLowDataDisk
IF ((node_filesystem_size{mountpoint="/var/lib/docker/"} - node_filesystem_free{mountpoint="/var/lib/docker/"}) / node_filesystem_size{mountpoint="/var/lib/docker/"} * 100) > 1
FOR 2m
LABELS {severity="page"}
ANNOTATIONS {DESCRIPTION="{{$labels.instance}}: Data disk usage is above 1% (current value is: {{ $value }})", SUMMARY="{{$labels.instance}}: Low data disk space"}

The disk is mounted at /var/lib/docker as follows:
[root@master2 prometheus]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/cl-root 37G 4.0G 33G 11% /
devtmpfs 3.9G 4.0K 3.9G 1% /dev
shm 64M 0 64M 0% /dev/shm
tmpfs 3.9G 247M 3.6G 7% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sdc 40G 26G 15G 64% /var/lib/docker

unable to access Prometheus

I have a problem with Prometheus.
All pods run fine:
kubectl -n monitoring get all
NAME READY STATUS RESTARTS AGE
po/alertmanager-6f5466ddb9-bmrdb 1/1 Running 0 34m
po/grafana-bdbbc4775-grqh5 1/1 Running 0 34m
po/kube-state-metrics-798b9487d7-sxmfn 2/2 Running 0 34m
po/prometheus-d56c4947-89m6b 1/1 Running 0 34m
po/prometheus-node-exporter-276wv 1/1 Running 0 34m
po/prometheus-node-exporter-4s8qj 1/1 Running 0 34m
po/prometheus-node-exporter-bpfx6 1/1 Running 0 34m

NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/alertmanager 10.3.206.109 9093:9093/TCP 34m
svc/grafana 10.3.216.104 3000:7788/TCP 34m
svc/kube-state-metrics 10.3.100.30 8080/TCP 35m
svc/prometheus-node-exporter None 9100/TCP 34m
svc/prometheus-svc 10.3.36.2 9090:8899/TCP 34m

NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/alertmanager 1 1 1 1 34m
deploy/grafana 1 1 1 1 34m
deploy/kube-state-metrics 1 1 1 1 34m
deploy/prometheus 1 1 1 1 34m

NAME DESIRED CURRENT READY AGE
rs/alertmanager-6f5466ddb9 1 1 1 34m
rs/grafana-bdbbc4775 1 1 1 34m
rs/kube-state-metrics-798b9487d7 1 1 1 34m
rs/kube-state-metrics-86dd5fbdcc 0 0 0 34m
rs/prometheus-d56c4947 1 1 1 34m

but Prometheus does not open in the browser (port 8899), while Grafana (port 7788) and Alertmanager (port 9093) open normally.
Also, Grafana is unable to access Prometheus (http://prometheus-svc.monitoring:9090).

Logs from the Prometheus pod:
https://gist.github.com/beatlejuse/f5a303ffd620548bf5799f07c8d93c87

I can't understand what the reason could be, and I can't find any hint in the error.

grafana-import-dashboards start error

Hi, I used this repo to start prometheus-kubernetes; all components start OK except grafana-import-dashboards.
Running kubectl logs -f shows the following:

syntax error: unexpected "do"

My k8s environment:

kubernetes: 1.6.3, set up with kubeadm
network: calico bgp

Can you help me?

Deploying Prometheus results in apiserver failure on Kubernetes 1.8

root@node1:/home/kubernetes/prometheus/prometheus-kubernetes# EXTERNAL_URL=https://prometheus.k8s.cn ./deploy.sh
Error from server (AlreadyExists): namespaces "monitoring" already exists
configmap "external-url" created
Set https://prometheus.k8s.cn as an external url
configmap "grafana-import-dashboards" created
configmap "prometheus-rules" created
configmap "alertmanager-templates" created
configmap "alertmanager" created
deployment "alertmanager" created
service "alertmanager" created
deployment "grafana" created
service "grafana" created
daemonset "node-exporter" created
configmap "prometheus-configmap" created
deployment "prometheus-deployment" created
configmap "prometheus-env" created
W1128 01:42:57.110043 46820 factory_object_mapping.go:423] Failed to download OpenAPI (Get https://192.168.48.128:6443/swagger-2.0.0.pb-v1: dial tcp 192.168.48.128:6443: getsockopt: connection refused), falling back to swagger
error: error validating "prometheus-svc.yaml": error validating data: Get https://192.168.48.128:6443/swaggerapi/api/v1: dial tcp 192.168.48.128:6443: getsockopt: connection refused; if you choose to ignore these errors, turn validation off with --validate=false
The connection to the server 192.168.48.128:6443 was refused - did you specify the right host or port?

apiserver logs:

logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12\xb6\x03\n\x06\n\x00\x12\x00\x1a\x00\x12\aFailure\x1a\xc9\x01Unable to refresh the initializer configuration: Get https://127.0.0.1:6443/apis/admissionregistration.k8s.io/v1alpha1/initializerconfigurations: dial tcp 127.0.0.1:6443: getsockopt: connection refused"\x14LoadingConfiguration*\xbd\x01\n\x06create\x12\x00\x1a\nnamespaces"\xa0\x01\n\x1fInitializerConfigurationFailure\x12{An error has occurred while refreshing the initializer configuration, no resources can be created until a refresh succeeds.\x1a\x00(\x012\x000\xf4\x03\x1a\x00"\x00"

Node exporter permission denied on root-disk

The node-exporter daemon can't provide storage information, returning permission errors:

time="2017-12-20T21:56:20Z" level=error msg="Error on statfs() system call for "/root-disk/run/docker/netns/029a1d875ff5": permission denied" source="filesystem_linux.go:57"
time="2017-12-20T21:56:20Z" level=error msg="Error on statfs() system call for "/root-disk/var/lib/kubelet/pods/94333d97-d07d-11e7-9cab-00155d008304/volumes/kubernetes.io~secret/flannel-token-mr8tv": permission denied" source="filesystem_linux.go:57"

Exec'ing into the container, it seems the nobody user can't read past /root-disk/run/docker.

Some other values, such as CPU and memory, are passed correctly to Prometheus and Grafana, but not storage.

Both root-disk and data-disk are mounted read-only.
