the-monitor's People

Contributors

abrilloyaenriquez, af12066, davidmlentz, dbenamydd, echang26, eslamelhusseiny, fossamagna, guacbot, hkaj, irabinovitch, jayjaym, jhotta, johnaxel, kimroen, kurochan, mallorym, mratheist, mstbbs, nuxero, robin-norwood, toshiya-matsuda, vagelim, z0331

the-monitor's Issues

Varnish dashboard doesn't work

The dashboard shown at https://github.com/DataDog/the-monitor/blob/master/varnish/monitor_varnish_using_datadog.md is blank.

This is probably related to DataDog/dd-agent#1459

Anyone following the guide will end up with a blank dashboard and wonder whether something went wrong, even though the graphs are visible under Infrastructure -> Apps -> Varnish.

Is there a way I can fix the dashboard (".MAIN.") without having to rewrite everything?

Edit: Varnish 5.2 and datadog-agent 5.18.1

Template Elasticsearch Timeboard indexing/query latency might be wrong

The template charts for indexing latency and query latency might be wrong. They show different values than Kibana.

Take query latency, for example. The template uses the rate of fetch time and the rate of query time, like this:
[screenshot of the template query]

It might be better to use derivatives to get the same chart as Kibana, like this:
(derivative of fetch time + derivative of query time) / derivative of query total

[screenshot of the resulting chart]

The JSON for this is:

{
  "viz": "timeseries",
  "status": "done",
  "requests": [
    {
      "q": "( derivative(sum:elasticsearch.search.fetch.time{$cluster}) + derivative(sum:elasticsearch.search.query.time{$cluster}) ) * 1000 / derivative(sum:elasticsearch.search.query.total{$cluster})",
      "aggregator": "avg",
      "conditional_formats": [],
      "type": "line"
    }
  ],
  "autoscale": true
}

DataDog Agent for PostgreSQL vs Tracing Postgres Queries with ddtrace

Hi,

We monitor our Postgres database using the Datadog Agent's PostgreSQL integration, as well as by tracing the queries our Go applications generate. We use ddtrace's gorm package to trace the database queries.

When investigating the requests for a query, the values for trace.postgres.query.hits and postgresql.queries.count are returning completely different results within the same timeframe.

Is this expected? Why is this happening?

Issue with applying kubernetes deployment for AWS EKS

https://www.datadoghq.com/blog/eks-monitoring-datadog/#create-and-deploy-the-cluster-agent-manifest

Based on the provided documentation, kubectl apply -f /path/to/datadog-cluster-agent.yaml should work, but it fails with error:

kubectl apply -f /path/to/datadog-cluster-agent.yaml
service/datadog-cluster-agent created
error: unable to recognize "/path/to/datadog-cluster-agent.yaml": no matches for kind "Deployment" in version "extensions/v1beta1"

Fixes required:

  1. Update apiVersion: extensions/v1beta1 to apiVersion: apps/v1
  2. Add the following underneath spec:
selector:
    matchLabels:
      app: datadog-cluster-agent
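
For reference, a minimal sketch of how the corrected manifest header might look. Only the apiVersion and selector changes come from the fixes above; the metadata, labels, and container details are assumed placeholders, not the blog's actual manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datadog-cluster-agent
spec:
  selector:
    matchLabels:
      app: datadog-cluster-agent
  template:
    metadata:
      labels:
        app: datadog-cluster-agent
    spec:
      containers:
        - name: cluster-agent
          image: datadog/cluster-agent   # rest of the container spec as in the blog's manifest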

Please update the documentation with these fixes; otherwise, other users will run into the same issue.

Spark APIs

Is there any roadmap to add documentation for the Spark APIs as well?

Document linking scheme for external contributors

Some of our articles contain internal links to other Datadog pages and articles, e.g.

/blog/monitoring-101-collecting-data/

These links won't work from within GitHub, and they can be confusing to external contributors (see for example #68), so we should either:

  1. Standardize on full, functional URLs for source files (though this has some drawbacks for us while the posts are in development/staging)
  2. Document clearly why we're doing what we're doing with links and other quirks so that well-meaning contributors don't spend time fixing them

Incorrect explanation of acceptCount in tomcat-architecture-and-performance.md

Quote from https://www.datadoghq.com/blog/tomcat-architecture-and-performance

Upon startup, Tomcat will create threads based on the value set for minSpareThreads and increase that number based on demand, up to the number of maxThreads. If the maximum number of threads is reached, and all threads are busy, incoming requests are placed in a queue (acceptCount) to wait for the next available thread. The server will only continue to accept a certain number of concurrent connections (as determined by maxConnections). When the queue is full and the number of connections hits maxConnections, any additional incoming clients will start receiving Connection Refused errors.

The description of acceptCount is incorrect: Tomcat will continue to accept new connections until the number of concurrent connections reaches maxConnections. Once maxConnections is reached, the OS queues new connections, up to acceptCount.

https://tomcat.apache.org/tomcat-8.5-doc/config/http.html
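
For context, these attributes are all set on the HTTP connector in conf/server.xml; a minimal sketch with illustrative values (not taken from the blog post):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           minSpareThreads="10"
           maxThreads="200"
           maxConnections="8192"
           acceptCount="100" />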

Question on how to get the Swap file size.

On Docker for Mac, if we go to Preferences > Advanced, we see a Swap field that defaults to 1 GB.
Can you please tell me if there is a command or API to get the current value set for swap?
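
Not from the article, but assuming you want the value as seen by the Docker Desktop Linux VM, one way is to read /proc/meminfo from any container, since it reflects the VM's kernel:

docker run --rm alpine grep SwapTotal /proc/meminfo

SwapTotal should match the value configured under Preferences > Advanced.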

Question: How to get frontend stats by host header?

Hi, I'm looking for a way to monitor frontends by Host header in an HAProxy config where we route by Host header. For example, consider the following config:

frontend http-in
    bind *:80
    log /dev/log    len 65535 local1 info
    capture request header User-Agent len 30
    capture request header X-Request-ID len 36
    capture request header Host len 32

    # Frontend rules for host header routing
    use_backend user if { hdr(Host) -i user user.example.com  }
    use_backend login if { hdr(Host) -i login login.example.com }


backend user
    mode http
    server-template user 10 _user._tcp.service.consul resolvers consul resolve-prefer ipv4 check

backend login
    mode http
    server-template login 10 _login._tcp.service.consul resolvers consul resolve-prefer ipv4 check

Is there a way to get stats for all frontends broken down by the Host header, for example haproxy.frontend.response.4xx by header:Host?
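
One note that may help frame this: HAProxy reports counters per named frontend and backend (via the stats endpoint or socket that the Datadog check reads), so per-host numbers generally require a dedicated frontend or backend per host; the per-host backends above already give that breakdown on the backend side. A minimal sketch of a stats listener, in case one isn't already enabled (the port and URI are assumptions):

listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /haproxy_stats
    stats refresh 10s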

NGINX log format renders with slashes in HTML

I was following the docs at https://www.datadoghq.com/blog/how-to-collect-nginx-metrics/#metrics-collection-nginx-logs and noticed extra slashes in the rendered output that aren't in the example.

It looks like the Markdown parser or HTML generator is inserting extra slashes before each $.

Expected:

log_format nginx '$remote_addr - $remote_user [$time_local] '
                 '"$request" $status $body_bytes_sent $request_time '
                 '"$http_referer" "$http_user_agent"';

Actual:
[screenshot of the rendered log_format with the extra slashes]
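
Presumably (reconstructed from the description above, not from the screenshot) the rendered version looks something like this, with each $ escaped:

log_format nginx '\$remote_addr - \$remote_user [\$time_local] '
                 '"\$request" \$status \$body_bytes_sent \$request_time '
                 '"\$http_referer" "\$http_user_agent"';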

Submitting a new blog for Apache APISIX's Datadog Plugin

Hello,

I am Yilin, from the Apache APISIX community. Apache APISIX is a cloud-native API gateway and a top-level project of the Apache Software Foundation. You can get more details from GitHub: https://github.com/apache/apisix.

We recently released a plugin that integrates Datadog with Apache APISIX. I think the plugin is very useful for developers and for both communities. In addition, it will help publicize Datadog and Apache APISIX and let more developers and companies know about us.

I am reaching out to see if we can have this blog posted here. What do you think?

The "Metric to watch: Volume queue length" Section in The "Part 1: Key metrics for Amazon EBS monitoring" Article

Hello All...

In the "Metric to watch: Volume queue length" Section in The "Part 1: Key metrics for Amazon EBS monitoring" article, it was mentioned that "A rule of thumb for SSD volumes is to aim for a queue length of one for every 500 IOPS available" and the source for that statement is https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_procedures.html#UnderstandingQueueLength , which has been updated to be "we recommend that you target a queue length of 1 for every 1000 IOPS available".

So, Could you please update your article to reflect the latest changes in the documentation?

Thank you,
Ahmed

Update Kafka version references

Reword this section, please.

Despite being pre-1.0, (current version is 0.9.0.1), it is production-ready

Kafka is now 2.0

No metric for Pod & Container CPU utilization?

https://www.datadoghq.com/blog/monitoring-kubernetes-performance-metrics/ says:

Metric to watch: CPU utilization
Tracking the amount of CPU your pods are using compared to their configured requests and limits, as well as CPU utilization at the node level, will give you important insight into cluster performance.

However, it doesn't explain how to do that. The https://docs.datadoghq.com/containers/kubernetes/data_collected/ page doesn't show any metric with pod- or container-level CPU usage information. What metric should queries use to get that information?
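
For reference, a query sketch along these lines (metric and tag names are assumptions based on the Kubernetes integration, not taken from the article, and may vary by Agent version):

{
  "viz": "timeseries",
  "requests": [
    {
      "q": "avg:kubernetes.cpu.usage.total{*} by {pod_name}",
      "type": "line"
    }
  ]
}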

Way to collect network statistics not complete

Hi,
The suggested method for looking at network stats works only if there is a single process inside the container. In most cases, we might have a shell script that launches all the other required processes and then enters an infinite loop or monitors the applications it launched.
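
Assuming the guide's /proc-based approach, one mitigation is that /proc/<pid>/net/dev is scoped to the network namespace rather than to an individual process, so any PID that belongs to the container should return container-wide counters even when multiple processes are running. A sketch:

# CONTAINER_ID is the container to inspect
PID=$(docker inspect --format '{{.State.Pid}}' $CONTAINER_ID)
cat /proc/$PID/net/dev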

HAProxy integration & SSL

If HAProxy isn't terminating SSL, the metrics look a bit misleading (a large red number for 2xx).

Are there any plans to offer an alternate, SSL-passthrough-centric dashboard (connections per second, response times, etc.)?

Filtering and collecting additional Elasticsearch metrics

In the Elasticsearch integration doc, I see that a few metrics are missing from the Metrics section, for example jvm.buffer_pools.* and jvm.classes.*. Can someone let me know:

  1. Is it possible to add these metrics in elastic.d/conf.yaml?
  2. Is there a way to filter the metrics collected by the Elasticsearch integration via conf.yaml? For example, say I am not interested in elasticsearch.cgroup.cpu.stat.number_of_times_throttled and don't want it to be collected.

Agent Version - 7.21.1
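
On point 2, newer Agent versions support a metric_patterns include/exclude option on integration instances; I'm not sure it is available on Agent 7.21.1, so treat this elastic.d/conf.yaml snippet as a sketch to verify against your Agent's documentation:

instances:
  - url: http://localhost:9200
    metric_patterns:
      exclude:
        - 'elasticsearch\.cgroup\.cpu\.stat\.number_of_times_throttled'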

Time Maps

Apologies if this isn't the proper forum for this kind of feedback. After reading the time series graph 101 blog post, I wanted to mention this kind of visualization in case it's not already on your roadmap. I haven't seen it available in any of the monitoring systems I've used.

Garbage Collection Metrics in Kubernetes

Hello, I went through the post https://www.datadoghq.com/blog/monitoring-kubernetes-performance-metrics/. It is very well written and good to read.

I have a question about some Kubernetes metrics that I cannot find clearly documented. Is it possible to get garbage collector metrics? I saw there are some for Go in cAdvisor, but is there anything specific for the Java JVM?

I saw the article below:
https://www.robustperception.io/measuring-java-garbage-collection-with-prometheus/

But do you have a specific way we can do it?
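
Not covered in the article, but JVM garbage collection metrics are typically collected with the Agent's JMX integration (in Kubernetes this is usually paired with autodiscovery annotations on the pod). A minimal jmx.d/conf.yaml sketch, assuming the JVM exposes remote JMX on port 9999:

init_config:

instances:
  - host: localhost
    port: 9999   # remote JMX port exposed by the JVM (assumption)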

Monitor Backends UP / DOWN

Hello,

I have an HAProxy question.

Is it possible to monitor the number of backends currently UP / DOWN (the kind of thing you can see by opening HATOP)?

I'd quite like to see that information and have monitors against it.
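
For what it's worth, the HAProxy integration reports a haproxy.count_per_status gauge and a haproxy.backend_up service check that may cover this; a query sketch (metric and tag names are my assumptions, not confirmed by this repo):

{
  "viz": "timeseries",
  "requests": [
    {
      "q": "sum:haproxy.count_per_status{status:up}",
      "type": "line"
    }
  ]
}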

Thanks for your help.

Measure server/client timeout values?

We have set server and client timeouts, but I am not able to measure them from the UI. I want to know how many clients are timing out and when, and the same for the server.

Wildcard in a timeseries graph

Just trying to create a timeseries graph based on a wildcard match on the instance name.

{
  "requests": [
    {
      "q": "avg:system.load.1{name:content*}",
      "type": "line",
      "conditional_formats": [],
      "aggregator": "avg"
    }
  ],
  "viz": "timeseries"
}

You get the idea: I'm trying to match any server whose name starts with content, but it breaks horribly. No carnage, it just turns red. Am I missing something? My Google-fu is failing me.

Pseudo file location on CoreOS

Hi folks,

Yesterday I ran into an issue collecting Docker metrics on CoreOS v1068.6.0 using a shell script. I was trying to collect the memory usage of a container by cat-ing the file /sys/fs/cgroup/memory/system.slice/docker-$CONTAINER_ID/memory.usage_in_bytes. I kept getting the same value, and it wasn't the right one. Digging around, I found that this file has been moved to /sys/fs/cgroup/memory/init.scope/system.slice/docker-$CONTAINER_ID/memory.usage_in_bytes.

So basically, the path for the metrics in newer versions of CoreOS has changed from /sys/fs/cgroup/<METRIC>/system.slice/docker-$CONTAINER_ID/<METRIC_VALUE> to /sys/fs/cgroup/<METRIC>/init.scope/system.slice/docker-$CONTAINER_ID/<METRIC_VALUE>.

The testing was done against two CoreOS clusters: one running version 1010.6.0 and the other 1068.6.0. The new path also exists in the latest version of CoreOS (1068.9.0).
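
For scripts that need to work across these CoreOS versions, a sketch that tries both locations (paths taken from above):

for base in /sys/fs/cgroup/memory/system.slice /sys/fs/cgroup/memory/init.scope/system.slice; do
    f="$base/docker-$CONTAINER_ID/memory.usage_in_bytes"
    [ -r "$f" ] && cat "$f" && break
done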

Maybe you should update the metrics collection page

Thanks!

Not-so-good JSON log format for NGINX in this guide

You give an example JSON log format for NGINX in this guide: https://www.datadoghq.com/blog/how-to-monitor-nginx-with-datadog/#use-json-logs-for-automatic-parsing
It's quite a poor example (a revised sketch follows below), because:

  • it doesn't match the standard attributes: http_user_agent should be nested inside an http element so that it becomes http.user_agent
  • there's a collision with the reserved status attribute; that should probably be http.status_code
  • a lot of interesting fields are still missing (X-Forwarded-For, ...)
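
A rough sketch of a revised format along the lines suggested above (field names and nesting are suggestions, not taken from the guide; requires nginx 1.11.8+ for escape=json):

log_format json_datadog escape=json
  '{'
    '"remote_addr": "$remote_addr",'
    '"http": {'
      '"method": "$request_method",'
      '"status_code": $status,'
      '"user_agent": "$http_user_agent",'
      '"referer": "$http_referer",'
      '"x_forwarded_for": "$http_x_forwarded_for"'
    '},'
    '"request_time": $request_time'
  '}';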

Broken links

Great documentation. However, the links to Parts 1 and 3 are broken. Thanks.
