dabz / ccloudexporter
Prometheus exporter for the Confluent Cloud Metrics API
Home Page: https://docs.confluent.io/current/cloud/metrics-api.html
In order to incorporate this as part of a production monitoring solution, I'm looking for a "health" endpoint that would fail if this app is down. Is there such an endpoint, other than the metrics endpoint, which seems to hit the cluster every time?
Thanks a lot for this project.
I noticed that the admin client is failing to connect. It appears unused (and should be unnecessary) so I recommend removing references to it: https://github.com/Dabz/ccloudexporter/blob/master/cmd/ccloudexporter/ccloudexporter.go#L30-L40
Hello,
I've followed the exact instructions for both docker and Go.
I'm getting the following error message, even though the API key is valid:
{
"Endpoint": "https://api.telemetry.confluent.cloud/v1/metrics/cloud/descriptors",
"StatusCode": 403,
"body": "eyJlcnJvciI6eyJjb2RlIjo0MDMsIm1lc3NhZ2UiOiJpbnZhbGlkIEFQSSBrZXkifX0K",
"level": "fatal",
"msg": "Received status code 403 instead of 200 for GET on https://api.telemetry.confluent.cloud/v1/metrics/cloud/descriptors. \n\n{\"error\":{\"code\":403,\"message\":\"invalid API key\"}}\n\n\n",
"time": "2021-01-14T00:20:31Z"
}
Would appreciate any help with this.
The Prometheus format is available and works really well. To elevate ccloudexporter to the de facto standard for gathering CCloud metrics, instead of relying on multiple different solutions, I want to propose adding a Kafka sink to the code base.
The defaults don't need to change at all, but if we can provide a Kafka sink, it would give customers the option to stream the data to Kafka. For customers who already stream data from Kafka to aggregation platforms like Splunk or Elasticsearch via a sink connector, this would mean just adding a topic name from this component to the sink connector config in order to stream the API data from Kafka as well.
P.S.: I do understand that keeping the data of the system being monitored on the system itself is an anti-pattern, but if we can give a choice, a lot of customers might like it.
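A rough sketch of what such a Kafka sink could look like, assuming the confluent-kafka-go client; the DataPoint and KafkaSink types and all other names here are hypothetical, not part of the current code base.

package sink

import (
    "encoding/json"
    "time"

    "github.com/confluentinc/confluent-kafka-go/kafka"
)

// DataPoint is a hypothetical representation of a single metric sample.
type DataPoint struct {
    Metric    string            `json:"metric"`
    Value     float64           `json:"value"`
    Labels    map[string]string `json:"labels"`
    Timestamp time.Time         `json:"timestamp"`
}

// KafkaSink publishes data points to a Kafka topic, in addition to (or
// instead of) exposing them on the Prometheus endpoint.
type KafkaSink struct {
    producer *kafka.Producer
    topic    string
}

// NewKafkaSink creates a producer pointed at the given cluster and topic.
func NewKafkaSink(bootstrapServers, topic string) (*KafkaSink, error) {
    p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": bootstrapServers})
    if err != nil {
        return nil, err
    }
    return &KafkaSink{producer: p, topic: topic}, nil
}

// Send serializes a data point as JSON and produces it asynchronously.
func (s *KafkaSink) Send(dp DataPoint) error {
    payload, err := json.Marshal(dp)
    if err != nil {
        return err
    }
    return s.producer.Produce(&kafka.Message{
        TopicPartition: kafka.TopicPartition{Topic: &s.topic, Partition: kafka.PartitionAny},
        Value:          payload,
    }, nil)
}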
Hello.
Is it possible to pass a configuration with other metrics, or to change the metrics we export from Confluent, when using Docker or Kubernetes?
Thank you in advance.
I started using the latest image and I am getting 403 instead of 200 when posting to https://api.telemetry.confluent.cloud//v1/metrics/cloud/query, and I am not getting all the metrics per topic:
Received status code 403 instead of 200
I never got the 403 when using the Dockerfile at tree 296442c (but the size of the image is huge).
Recommend using the 'attributes' endpoint to get the topic list to avoid any future dependency on the admin client: https://github.com/Dabz/ccloudexporter/blob/master/cmd/internal/scrapper/scrapper.go#L38-L54
https://api.telemetry.confluent.cloud/v1/metrics/{dataset}/attributes
Hi all,
I have followed the https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration
documentation, and currently we have a cluster-wide omsagent replicaset (singleton) scraping the ccloudexporter deployment in AKS via a Kubernetes service.
The kubectl logs <ccloudexporter> pod says "Listening on http://:2112/metrics\n", and as an example, one of the lines in the kubectl logs <omsagent rs> pod says:
> prometheus, address=x.x.x.x, scrapeUrl=http://ccloudexportersvc.ccloudexporternamespace.svc.cluster.local:2112/metrics go_memstats_heap_objects=2577
However, in the Azure portal > Kubernetes service > one of the clusters > Logs, I've run queries such as
InsightsMetrics
| where Namespace contains "prometheus"
| where Computer contains "<hostname/node of the omsagent rs>"
but the query returns no result.
When running the ccloud-exporter container within Kubernetes, the CCloud Metrics API rate limit of 50 requests / minute is easily hit, even when the pod has livenessProbe and readinessProbe disabled. Example of such an error:
{
"error": "Received status code 429 instead of 200 for POST on https://api.telemetry.confluent.cloud//v2/metrics/cloud/query ()",
"level": "error",
"msg": "Query did not succeed",
"optimizedQuery": {
"aggregations": [
{
"agg": "SUM",
"metric": "io.confluent.kafka.server/partition_count"
}
],
"filter": {
"op": "AND",
"filters": [
{
"op": "OR",
"filters": [
{
"field": "resource.kafka.id",
"op": "EQ",
"value": "<redacted>"
}
]
}
]
},
"granularity": "PT1M",
"group_by": [],
"intervals": [
"2021-09-06T17:09:00Z/PT1M"
],
"limit": 1000
},
"response": {
"data": null
},
"time": "2021-09-06T17:11:30Z"
}
Once the API rate limit is triggered, it tends to self-sustain in an infinite loop, probably because ccloud-exporter retries in quick succession without giving enough wait time between metric collections.
Add a new config option, secondsBetweenRetry, as the waiting time between retries when an access to the CCloud Metrics API has failed. Ideally this pause should apply to each individual API request, and not to the batch of requests (like 9 at a time as in config.simple.yaml).
Even better, this pause duration should follow an "exponential backoff": for example, start at 5 seconds, then double at each new retry, capped at, let's say, 5 minutes.
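A minimal sketch of such a retry loop, assuming the suggested 5-second initial pause doubling up to a 5-minute cap; retryWithBackoff and its parameters are hypothetical names, not existing ccloudexporter configuration.

package main

import (
    "fmt"
    "time"
)

// retryWithBackoff retries the given call, sleeping between attempts and
// doubling the pause each time until it reaches maxDelay.
func retryWithBackoff(do func() error, initialDelay, maxDelay time.Duration, maxRetries int) error {
    delay := initialDelay
    var err error
    for attempt := 1; attempt <= maxRetries; attempt++ {
        if err = do(); err == nil {
            return nil
        }
        fmt.Printf("attempt %d failed (%v), retrying in %s\n", attempt, err, delay)
        time.Sleep(delay)
        delay *= 2
        if delay > maxDelay {
            delay = maxDelay
        }
    }
    return err
}

func main() {
    calls := 0
    _ = retryWithBackoff(func() error {
        calls++
        if calls < 3 {
            return fmt.Errorf("simulated 429 from the Metrics API")
        }
        return nil
    }, 5*time.Second, 5*time.Minute, 5)
}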
I am not sure if this feature already exists. This may be another major iteration, or may need a different app altogether.
The Metrics API recently saw a downtime of a few hours.
The idea here is to have something like a simple API call to back-fill those few hours of data, with an overridable endpoint + metric format (e.g. Influx or Datadog), all while maintaining the same metric names as exposed by Prometheus in real time.
When this issue was raised with Confluent support, they simply asked us to implement a retry mechanism, as recommended in the documentation. Is there a way this can be implemented in ccloudexporter?
{
"Endpoint": "https://api.telemetry.confluent.cloud//v1/metrics/cloud/query",
"StatusCode": 503,
"body": "upstream connect error or disconnect/reset before headers. reset reason: overflow",
"level": "error",
"msg": "Received invalid response",
"time": "2021-04-08T05:53:22Z"
}
{
"Endpoint": "https://api.telemetry.confluent.cloud//v1/metrics/cloud/query",
"StatusCode": 500,
"body": "{\"errors\":[{\"status\":\"500\",\"detail\":\"There was an error processing your request. It has been logged (ID xxxxxxxxxxxxxxx).\"}]}",
"level": "error",
"msg": "Received invalid response",
"time": "2021-04-08T05:59:30Z"
}
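For illustration only (this is not an existing ccloudexporter feature), back-filling a missed window would essentially mean issuing the usual query with an explicit past interval. In the sketch below the metric, label, and cluster values are placeholders.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "os"
    "time"
)

func main() {
    // Window to back-fill, e.g. the hours missed during the outage above.
    from := time.Date(2021, 4, 8, 5, 0, 0, 0, time.UTC)
    to := time.Date(2021, 4, 8, 8, 0, 0, 0, time.UTC)

    query := map[string]interface{}{
        "aggregations": []map[string]string{
            {"agg": "SUM", "metric": "io.confluent.kafka.server/received_bytes"},
        },
        "filter": map[string]string{
            "field": "resource.kafka.id", "op": "EQ", "value": "lkc-xxxxx",
        },
        "granularity": "PT1M",
        "intervals":   []string{fmt.Sprintf("%s/%s", from.Format(time.RFC3339), to.Format(time.RFC3339))},
        "group_by":    []string{"metric.topic"},
        "limit":       1000,
    }

    body, _ := json.Marshal(query)
    req, _ := http.NewRequest("POST", "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query", bytes.NewReader(body))
    req.SetBasicAuth(os.Getenv("CCLOUD_API_KEY"), os.Getenv("CCLOUD_API_SECRET"))
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    // The response would then be re-formatted and pushed to the target
    // system (Influx, Datadog, ...) using the same metric names.
    fmt.Println("status:", resp.Status)
}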
We are running this exporter and the Kubernetes pod keeps crashing and restarting. We aren't sure what's happening since no logging is occurring; we just know that the probe on /metrics fails and Kubernetes restarts the pod. Is there a way to get additional logging to figure out what's going on?
TYPE confluent_kafka_server_request_bytes gauge
All queries are sent synchronously; as the number of metrics increases (and thus the number of queries to send), the scrape duration increases.
The exporter should execute the queries asynchronously in order to reduce the scrape duration (and avoid reaching the scrape timeout).
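A minimal sketch of the asynchronous approach using goroutines and a WaitGroup; runQueriesAsync is a hypothetical helper, not the exporter's actual code.

package main

import (
    "fmt"
    "sync"
    "time"
)

// runQueriesAsync fires all queries concurrently, so the scrape duration is
// bounded by the slowest query rather than by the sum of all of them.
func runQueriesAsync(queries []string, run func(string) error) []error {
    var wg sync.WaitGroup
    errs := make([]error, len(queries))
    for i, q := range queries {
        wg.Add(1)
        go func(i int, q string) {
            defer wg.Done()
            errs[i] = run(q)
        }(i, q)
    }
    wg.Wait()
    return errs
}

func main() {
    queries := []string{"received_bytes", "sent_bytes", "retained_bytes"}
    start := time.Now()
    runQueriesAsync(queries, func(q string) error {
        time.Sleep(500 * time.Millisecond) // stand-in for one Metrics API call
        return nil
    })
    fmt.Println("all queries finished in", time.Since(start))
}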
It's easy to forget you have a ccloudexporter
instance running against a cluster that has been torn down. This results in spamming 403 errors until someone notices. Since 403 errors are usually permanent failures that require user intervention to fix, consider crashing the ccloudexporter
to fail fast if one is returned.
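A minimal sketch of the fail-fast idea; checkAuthFailure is a hypothetical helper, not existing exporter code.

package main

import (
    "log"
    "net/http"
)

// checkAuthFailure exits the process on a 403: a forbidden response usually
// means a bad API key or a deleted cluster, which retrying will never fix,
// so crashing is more visible than logging the same error forever.
func checkAuthFailure(resp *http.Response, endpoint string) {
    if resp.StatusCode == http.StatusForbidden {
        log.Fatalf("received 403 for %s: credentials or cluster are invalid, exiting", endpoint)
    }
}

func main() {
    // Example usage with a fake response; in the exporter this would be
    // called right after each Metrics API request.
    checkAuthFailure(&http.Response{StatusCode: http.StatusOK}, "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query")
}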
It seems that grouping per partition generates a lot of data and makes it harder to use. On top of that, it seems that not all metrics will be able to be grouped per partition in the future.
We should remove the grouping per partition; maybe we could reintroduce it later on, but with more restrictions (e.g. only when a single topic is specified).
It seems that, if you have multiple clusters configured in the configuration file, the data points are no longer grouped by cluster.
Cause: the exporter relies on the "labels" field of the descriptor endpoint to find out which labels can be used to "group by" the metrics (https://github.com/Dabz/ccloudexporter/blob/master/cmd/internal/collector/collector.go#L144-L152). It seems that the Metrics API no longer exposes the cluster_id in the list of labels.
Workaround: have multiple rules in your configuration file, or multiple instances of the exporter.
Hello,
While using ccloudexporter to access the Metrics API v1, we occasionally see a 429 response code from the Metrics API. When we reached out to Confluent, they pointed us to this document - https://api.telemetry.confluent.cloud/docs#section/Object-Model/Datasets
Basically, there is a limit of 50 requests/minute per IP.
Question: Does the exporter respect this limit for V1 API as of now? If not, are there plans to support it?
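One way an exporter could respect such a limit client-side is a token-bucket throttle; below is a sketch using golang.org/x/time/rate, not something ccloudexporter is known to do today.

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // Allow at most 50 requests per minute, matching the documented per-IP limit.
    limiter := rate.NewLimiter(rate.Every(time.Minute/50), 1)
    for i := 0; i < 3; i++ {
        if err := limiter.Wait(context.Background()); err != nil {
            panic(err)
        }
        fmt.Printf("query %d allowed at %s\n", i+1, time.Now().Format(time.RFC3339))
        // ... issue the Metrics API request here ...
    }
}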
Hi All,
I am trying to run the exporter using the Docker command below to extract metrics from our Confluent Cloud setup.
docker run \
-e CCLOUD_API_KEY=$CCLOUD_API_KEY \
-e CCLOUD_API_SECRET=$CCLOUD_API_SECRET \
-e CCLOUD_CLUSTER=lkc-abc123 \
-p 2112:2112 \
dabz/ccloudexporter:latest
But I am getting the following error:
{
"error": "Get \"https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\": x509: certificate signed by unknown authority",
"level": "fatal",
"msg": "HTTP query for the descriptor endpoint failed",
"time": "2021-12-08T08:42:53Z"
}
Is it because we enabled the X.509 certificate at Confluent Cloud? Does anyone know how to solve it? Appreciate your help. Thanks.
Hi,
We have connectors running in dev clusters and more than one dedicated cluster in our environments. We are able to see the metrics of the Kafka cluster, which have the kafka.id dimension, whereas the connector metrics don't have any dimension (kafka.id). Would it be possible to add the cluster for the connectors, so that we can find which connector is running associated with which kafka.id?
I don't see a kafka.id label when I query the Metrics API manually.
thanks
Niranjan
Converted the ccloudexporter Kubernetes files into a Helm chart and am running into a timeout issue.
The deployment has these env vars set:
env:
  - name: CCLOUD_API_KEY
    value: "vault:secret/grafana/kafka/ccloud#CCLOUD_API_KEY"
  - name: CCLOUD_API_SECRET
    value: "vault:secret/grafana/kafka/ccloud#CCLOUD_API_SECRET"
  - name: CCLOUD_CLUSTER
    value: {{ .Values.cluster }}
Seeing:
kubectl logs -n grafana ccloud-exporter-deployment-cdcbbbb67-wq9hr -f
{
"error": "Get \"https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\": dial tcp 52.38.184.52:443: i/o timeout",
"level": "fatal",
"msg": "HTTP query for the descriptor endpoint failed",
"time": "2021-08-31T18:18:57Z"
}
which tells me the env vars (api_key|secret) are valid but the request is timing out.
Did a little test with a test pod:
› cat test.yaml ☠️
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-name
  namespace: grafana
spec:
  containers:
    - name: test-pod-name
      env:
        - name: CCLOUD_API_KEY
          value: vault:secret/grafana/kafka/ccloud#CCLOUD_API_KEY
        - name: CCLOUD_API_SECRET
          value: vault:secret/grafana/kafka/ccloud#CCLOUD_API_SECRET
      command: ["/bin/bash", "-c"]
      args:
        - curl -u $CCLOUD_API_KEY:$CCLOUD_API_SECRET https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\?resource_type\=kafka
and I see:
› kubectl logs -n grafana test-pod-name -f
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 590 100 590 0 0 2 0 0:04:55 0:03:35 0:01:20 131
{"data":[{"type":"kafka","description":"A Kafka cluster","labels":[{"description":"ID of the Kafka cluster","key":"kafka.id"}]},{"type":"connector","description":"A Kafka Connector","labels":[{"description":"ID of the connector","key":"connector.id"}]},{"type":"ksql","description":"A ksqlDB application","labels":[{"description":"ID of the ksqlDB application","key":"ksql.id"}]},{"type":"schema_registry","description":"A schema registry","labels":[{"description":"ID of the schema registry","key":"schema_registry.id"}]}],"meta":{"pagination":{"page_size":100,"total_size":4}},"links":{}}%
and sometimes:
› kubectl logs -n grafana test-pod-name -f
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:04:17 --:--:-- 0
curl: (28) Failed to connect to api.telemetry.confluent.cloud port 443: Connection timed out
Looks like it's taking too long and ends up timing out at times.
Any idea what could cause this in EKS?
As of 2021-09-02, ccloud-exporter exposes only one endpoint, localhost:2112/metrics. When an HTTP request is made on this /metrics endpoint, ccloud-exporter makes outgoing requests to the Confluent Cloud Metrics API, which is the normal and expected behaviour.
In the context of Kubernetes, ccloud-exporter runs within a pod with a livenessProbe and a readinessProbe. As /metrics is the only endpoint exposed by ccloud-exporter, we might be tempted to use this endpoint to probe the readiness status of the ccloud-exporter container.
As a result, each time the /metrics endpoint is probed (and the probe frequency is high, every 5 seconds in this example), the probe request triggers a collection of requests to the Confluent Cloud Metrics API. The quick repeats of probing on the /metrics endpoint then exhaust the CCloud Metrics API rate limit of 50 requests / minute.
{
"Endpoint": "https://api.telemetry.confluent.cloud//v2/metrics/cloud/query",
"StatusCode": 429,
"body": "",
"level": "error",
"msg": "Received invalid response",
"time": "2021-09-02T14:36:40Z"
}
{
"error": "Received status code 429 instead of 200 for POST on https://api.telemetry.confluent.cloud//v2/metrics/cloud/query ()",
"level": "error",
"msg": "Query did not succeed",
... etc...
}
In the case of this example, the API rate limit error (status 429) occurs within 15 seconds. Then ccloud-exporter is stuck in an infinite loop of "StatusCode": 429, because Kubernetes will endlessly probe the /metrics endpoint to check the health of the pod.
Add a separate endpoint for self health-check, for example localhost:2113/selfcheck, which returns OK if ccloud-exporter is in good shape. This helps Kubernetes manage the life cycle of the container, for example to restart it if it is stuck in a non-functional state.
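A minimal sketch of such a self-check endpoint; the port and /selfcheck path are only examples, as above.

package main

import (
    "fmt"
    "log"
    "net/http"
)

// The self-check handler reports on internal state only and never calls the
// Metrics API, so Kubernetes probes do not consume the API rate limit.
func main() {
    http.HandleFunc("/selfcheck", func(w http.ResponseWriter, r *http.Request) {
        // In a real implementation this would check internal state,
        // e.g. whether the last collection succeeded.
        w.WriteHeader(http.StatusOK)
        fmt.Fprintln(w, "OK")
    })
    log.Fatal(http.ListenAndServe(":2113", nil))
}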
To reproduce: uncomment the livenessProbe and readinessProbe sections in the manifest below, set the CCLOUD_... environment variables, and deploy it on your Kubernetes cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ccloud-exporter
  namespace: monitoring
  labels:
    app: ccloud-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ccloud-exporter
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: ccloud-exporter
    spec:
      containers:
        - name: ccloud-exporter
          image: dabz/ccloudexporter:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: CCLOUD_API_KEY
              value: CloudAPIKey?????
            - name: CCLOUD_API_SECRET
              value: CloudAPISecret?????
            - name: CCLOUD_CLUSTER
              value: lkc-?????
          ports:
            - name: metrics
              containerPort: 2112
              protocol: TCP
          # livenessProbe:
          #   httpGet:
          #     path: /metrics
          #     port: metrics
          #     scheme: HTTP
          #   initialDelaySeconds: 30
          #   timeoutSeconds: 30
          #   periodSeconds: 15
          #   successThreshold: 1
          #   failureThreshold: 3
          # readinessProbe:
          #   httpGet:
          #     path: /metrics
          #     port: metrics
          #     scheme: HTTP
          #   initialDelaySeconds: 30
          #   timeoutSeconds: 30
          #   periodSeconds: 5
          #   successThreshold: 1
          #   failureThreshold: 3
          resources:
            requests:
              cpu: "250m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: ccloud-exporter-service
  namespace: monitoring
  labels:
    app: ccloud-exporter
spec:
  ports:
    - name: metrics
      protocol: TCP
      port: 2112
      targetPort: 2112
  selector:
    app: ccloud-exporter
The V2 API added the resource io.confluent.kafka.schema_registry and an initial metric, io.confluent.kafka.schema_registry/schema_count. The exporter could collect this metric.
Sample query:
{
"aggregations": [
{
"agg": "SUM",
"metric": "io.confluent.kafka.schema_registry/schema_count"
}
],
"filter": {
"field": "resource.schema_registry.id",
"op": "EQ",
"value": "lsrc-xxxxx"
},
"granularity": "PT1H",
"intervals": [
"2021-02-23T11:00:00+11:00/P0Y0M0DT1H0M0S"
],
"group_by": [
"resource.schema_registry.id"
]
}
Example response/metric:
{
"data": [
{
"timestamp": "2021-02-23T00:00:00Z",
"value": 9.0,
"resource.schema_registry.id": "lsrc-rw6m7"
}
]
}
Please add a central LICENSE document declaring the whole repository under the MIT license.
For easier integration with k8s secrets?
Thanks for this project!
Hi,
I created a config.yml file based on the default configuration on the README.md page, using:
rules:
  - clusters:
      - $CCLOUD_CLUSTER
On metrics fetch, I see errors such as:
Received status code 403 instead of 200 for POST on https://api.telemetry.confluent.cloud//v2/metrics/cloud/query ({\"errors\":[{\"status\":\"403\",\"detail\":\"Query must filter by at least one of your authorized resources
Running with docker-compose, I exec a shell in the ccloud_exporter container and display the environment variables:
$ env
...
CCLOUD_CLUSTER=lkc-xxxx1
...
When I set the real value (lkc-xxxx1) instead of the environment variable $CCLOUD_CLUSTER, metrics are correctly fetched for the cluster.
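One possible fix on the exporter side, sketched under the assumption that the configuration file is read as raw text before parsing: expand environment variables with os.ExpandEnv so that $CCLOUD_CLUSTER resolves to the real cluster id. The loadConfig helper below is hypothetical.

package main

import (
    "fmt"
    "os"
)

// loadConfig reads the raw config file and expands $VAR references, so a
// value like `- $CCLOUD_CLUSTER` becomes the actual lkc-... id before the
// YAML is parsed.
func loadConfig(path string) (string, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return "", err
    }
    return os.ExpandEnv(string(raw)), nil
}

func main() {
    os.Setenv("CCLOUD_CLUSTER", "lkc-xxxx1")
    expanded, err := loadConfig("config.yml")
    if err != nil {
        fmt.Println("could not read config:", err)
        return
    }
    fmt.Println(expanded)
}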
It seems that you have more control over the timestamp of the metrics if you implement a custom collector. This could be useful in the case of Confluent Cloud as the latest data point might not be accurate and we might need to update old data points.
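A minimal sketch of such a custom collector, using client_golang's prometheus.NewMetricWithTimestamp to attach the timestamp returned by the Metrics API instead of the scrape time; the metric name and values below are placeholders.

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// ccloudCollector emits one gauge whose sample carries an explicit timestamp
// (e.g. the timestamp of the Metrics API data point) rather than "now".
type ccloudCollector struct {
    desc *prometheus.Desc
}

func (c *ccloudCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.desc
}

func (c *ccloudCollector) Collect(ch chan<- prometheus.Metric) {
    // In the real exporter, value and timestamp would come from the Metrics API response.
    value := 42.0
    ts := time.Now().Add(-2 * time.Minute)
    m := prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, value, "lkc-xxxxx")
    ch <- prometheus.NewMetricWithTimestamp(ts, m)
}

func main() {
    collector := &ccloudCollector{
        desc: prometheus.NewDesc("confluent_kafka_server_received_bytes",
            "Example metric with an explicit timestamp", []string{"cluster_id"}, nil),
    }
    registry := prometheus.NewRegistry()
    registry.MustRegister(collector)
    http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
    log.Fatal(http.ListenAndServe(":2112", nil))
}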
Hi,
This kind of error happened for the active_connection_count metric.
Received status code 400 instead of 200 for POST on https://api.telemetry.confluent.cloud/v1/metrics/cloud/query with {"aggregations":[{"agg":"SUM","metric":"io.confluent.kafka.server/active_connection_count"}],"filter":{"op":"AND","filters":[{"field":"metric.label.cluster_id","op":"EQ","value":"lkc-aaaaa"}]},"granularity":"PT1M","group_by":["metric.label.topic"],"intervals":["2020-03-17T09:38:25+01:00/2020-03-17T09:39:25+01:00"],"limit":1000}
According to Confluent Cloud support, you can remove "group_by":["metric.label.topic"]
While running 'docker-compose up -d' I am getting the error below.
I installed Docker and docker-compose, and added my user to the docker group.
Traceback (most recent call last):
File "urllib3/connectionpool.py", line 677, in urlopen
File "urllib3/connectionpool.py", line 392, in _make_request
File "http/client.py", line 1277, in request
File "http/client.py", line 1323, in _send_request
File "http/client.py", line 1272, in endheaders
File "http/client.py", line 1032, in _send_output
File "http/client.py", line 972, in send
File "docker/transport/unixconn.py", line 43, in connect
PermissionError: [Errno 13] Permission denied
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "requests/adapters.py", line 449, in send
File "urllib3/connectionpool.py", line 727, in urlopen
File "urllib3/util/retry.py", line 410, in increment
File "urllib3/packages/six.py", line 734, in reraise
File "urllib3/connectionpool.py", line 677, in urlopen
File "urllib3/connectionpool.py", line 392, in _make_request
File "http/client.py", line 1277, in request
File "http/client.py", line 1323, in _send_request
File "http/client.py", line 1272, in endheaders
File "http/client.py", line 1032, in _send_output
File "http/client.py", line 972, in send
File "docker/transport/unixconn.py", line 43, in connect
urllib3.exceptions.ProtocolError: ('Connection aborted.', PermissionError(13, 'Permission denied'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "docker/api/client.py", line 214, in _retrieve_server_version
File "docker/api/daemon.py", line 181, in version
File "docker/utils/decorators.py", line 46, in inner
File "docker/api/client.py", line 237, in _get
File "requests/sessions.py", line 543, in get
File "requests/sessions.py", line 530, in request
File "requests/sessions.py", line 643, in send
File "requests/adapters.py", line 498, in send
requests.exceptions.ConnectionError: ('Connection aborted.', PermissionError(13, 'Permission denied'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "docker-compose", line 3, in
File "compose/cli/main.py", line 80, in main
File "compose/cli/main.py", line 189, in perform_command
File "compose/cli/command.py", line 70, in project_from_options
File "compose/cli/command.py", line 153, in get_project
File "compose/cli/docker_client.py", line 43, in get_client
File "compose/cli/docker_client.py", line 170, in docker_client
File "docker/api/client.py", line 197, in init
File "docker/api/client.py", line 222, in _retrieve_server_version
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', PermissionError(13, 'Permission denied'))
[3257] Failed to execute script docker-compose
I'm seeing pretty regular gaps in the data, probably due to interval misses. I noticed here we're scraping between now and the previous minute:
https://github.com/Dabz/ccloudexporter/blob/master/cmd/internal/scrapper/query.go#L67-L76
If Prometheus easily supports updating values, I'd recommend just widening the query window to 5 minutes so you get data points. Updates on the fly would also help with values that are still stabilizing (you'll notice that we expose data right away rather than waiting for late arrivals, then update on the fly).
This looks like a substitution error, but it's very misleading, as it makes you think that there is something wrong with the host name:
{
"level": "info",
"msg": "Listening on http://:2112/metrics\n",
"time": "2021-07-22T18:34:31Z"
}
Recommend a find-and-replace of scrapper -> scraper to fix the typo, since it's used in many places.
It would be useful to be able to support exporting for multiple clusters. The Metrics API can do this by using an OR:
"filter": { "filters": [ { "field": "metric.label.cluster_id", "op": "EQ", "value": "lkc-XXXX1" }, { "field": "metric.label.cluster_id", "op": "EQ", "value": "lkc-XXXX2" } ], "op": "OR" },
Results from the query can also be grouped by both the cluster id and the topic name to avoid collisions of topic names across clusters:
"group_by": [ "metric.label.cluster_id", "metric.label.topic" ],
Finally, you can also use the new GROUPED format by passing in the format parameter to make the grouping more explicit, if that is helpful.
Right now the list of metrics is hardcoded:
https://github.com/Dabz/ccloudexporter/blob/master/cmd/internal/scrapper/query.go#L71-L73
I recommend using the descriptors endpoint https://api.telemetry.confluent.cloud/v1/metrics/{dataset}/descriptors
An example on getting the available metrics here:
https://docs.confluent.io/current/cloud/metrics-api.html#list-the-available-metrics
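A sketch of discovering the available metrics from the descriptors endpoint at startup instead of hardcoding them; the shape of the response decoded below is an assumption based on the documentation linked above and may need adjusting.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

// descriptorResponse keeps only the field we need from the descriptors reply.
type descriptorResponse struct {
    Data []struct {
        Name string `json:"name"`
    } `json:"data"`
}

func main() {
    req, err := http.NewRequest("GET", "https://api.telemetry.confluent.cloud/v1/metrics/cloud/descriptors", nil)
    if err != nil {
        panic(err)
    }
    req.SetBasicAuth(os.Getenv("CCLOUD_API_KEY"), os.Getenv("CCLOUD_API_SECRET"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var descriptors descriptorResponse
    if err := json.NewDecoder(resp.Body).Decode(&descriptors); err != nil {
        panic(err)
    }
    for _, metric := range descriptors.Data {
        fmt.Println(metric.Name) // build the query list from this instead of a hardcoded slice
    }
}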
Username/password authentication to the Metrics API was only supported in the preview phase. For GA release, only API key/secret authentication is officially supported.
Rename the CCLOUD_USER and CCLOUD_PASSWORD environment variables to CCLOUD_APIKEY and CCLOUD_APISECRET, respectively.
The number of metrics available in Confluent Cloud Metrics API is increasing every day. As we have more and more data, exposing all of them does not make sense. Instead, we should:
Currently, ccloudexporter allows filtering based on the topics specified.
It should also allow an exclusion list, where the listed topics are excluded from the Prometheus metrics endpoint.
When the query is filtering to a single metric.label.cluster_id:
"filter" : {
"op": "EQ",
"field": "metric.label.cluster_id",
"value": "lkc-12345"
}
there is no need to also specify a group_by, since we know that all results have the same cluster_id (lkc-12345 in this example).
"group_by": [
"metric.label.cluster_id"
]
This superfluous group_by causes the query to be more expensive on the backend. We can explore optimizing this out on the backend, but it is difficult to do since the filter can contain arbitrarily complex boolean expressions.
Some of the topics on CCloud are listed with production and consumption as null; these are not fetched. How can we fetch these as well?
The API now also exposes numbers for successful authentications.
Currently, the Metric API does not expose the consumer lag. But we could retrieve it in multiple ways, e.g. the exporter could rely on the Admin API to expose it.
The User-Agent should be specified, with the format ccloudexporter/<commit version>, in order to help the Confluent Cloud team identify the origin of requests and have a way to easily trace the source of unusual workloads.
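A minimal sketch of setting such a User-Agent on every request; the version variable and its build-time injection are assumptions, not the exporter's current behaviour.

package main

import (
    "fmt"
    "net/http"
)

// version would typically be injected at build time, e.g. with
// -ldflags "-X main.version=<commit>".
var version = "dev"

// newMetricsAPIRequest builds a request carrying the ccloudexporter/<version>
// User-Agent so Confluent can trace the origin of the traffic.
func newMetricsAPIRequest(method, url string) (*http.Request, error) {
    req, err := http.NewRequest(method, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", fmt.Sprintf("ccloudexporter/%s", version))
    return req, nil
}

func main() {
    req, _ := newMetricsAPIRequest("GET", "https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources")
    fmt.Println(req.Header.Get("User-Agent"))
}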
Hi
Using this code, I tried to connect to Confluent using the docker-compose method. The code is able to connect and pull the metrics only when I pass a single cluster in CCLOUD_CLUSTER. I am having the following issue and need it resolved.
I'd appreciate it if anyone is able to help me with this.
ccloud_exporter container logs below:
{
"error": "Get \"https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\": dial tcp: lookup api.telemetry.confluent.cloud on 127.0.0.11:53: read udp 127.0.0.1:57113-\u003e127.0.0.11:53: i/o timeout",
"level": "fatal",
"msg": "HTTP query for the descriptor endpoint failed",
"time": "2021-09-21T14:12:54Z"
}
I tried to increase the timeout to 120 seconds as provided in README.md but no luck.
flag.IntVar(&Context.HTTPTimeout, "timeout", 120, "Timeout, in second, to use for all REST call with the Metric API")
Thanks in advance!
During a recent vulnerability scan we ran internally, this was identified in the ccloudexporter binary.
Could I ask for a fix for this, please?
{
"Target": "ccloudexporter",
"Type": "gobinary",
"Vulnerabilities": [
{
"VulnerabilityID": "CVE-2019-11254",
"PkgName": "gopkg.in/yaml.v2",
"InstalledVersion": "v2.2.5",
"FixedVersion": "v2.2.8",
"Layer": {
"DiffID": "sha256:c87148c01e568bde3a58ce90550eb43596a0d9c36bb0bfcb25d31df097c8439f"
},
"SeveritySource": "nvd",
"PrimaryURL": "https://nvd.nist.gov/vuln/detail/CVE-2019-11254",
"Title": "kubernetes: Denial of service in API server via crafted YAML payloads by authorized users",
"Description": "The Kubernetes API Server component in versions 1.1-1.14, and versions prior to 1.15.10, 1.16.7 and 1.17.3 allows an authorized user who sends malicious YAML payloads to cause the kube-apiserver to consume excessive CPU cycles while parsing YAML.",
"Severity": "MEDIUM",
"CVSS": {
"nvd": {
"V2Vector": "AV:N/AC:L/Au:S/C:N/I:N/A:P",
"V3Vector": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
"V2Score": 4,
"V3Score": 6.5
},
"redhat": {
"V3Vector": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
"V3Score": 6.5
}
},
"References": [
"https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-11254",
"https://github.com/kubernetes/kubernetes/issues/89535",
"https://groups.google.com/d/msg/kubernetes-announce/ALL9s73E5ck/4yHe8J-PBAAJ",
"https://groups.google.com/forum/#!topic/kubernetes-security-announce/wuwEwZigXBc",
"https://linux.oracle.com/cve/CVE-2019-11254.html",
"https://linux.oracle.com/errata/ELSA-2020-5653.html",
"https://security.netapp.com/advisory/ntap-20200413-0003/"
],
"PublishedDate": "2020-04-01T21:15:00Z",
"LastModifiedDate": "2020-10-02T17:37:00Z"
}
]
}
Even though this has only 2 endpoints, it would be nice to add a swagger.yaml to the repo and serve it as a path. This would formalise the API and also make automation easier.
The query interval timeFrom is taken by applying the configured delay to time.Now(). Instead, the start time should be time.Now() with the seconds truncated (i.e. rounded down to the nearest minute). Since the Metrics API only stores data at minutely granularity, using time.Now() is effectively rounding up to the next minute, which makes the effective delay less than the configured delay.
For example, with a configured delay of 120 seconds and time.Now() = 00:10:05:
- query interval sent: 00:08:05 / PT1M
- data point matched: 00:09:00 / PT1M (only metrics with timestamp 00:09:00 will be matched)
- effective delay: 65 seconds (00:10:05 - 00:09:00)
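A small sketch of the proposed truncation, using the numbers from the example above.

package main

import (
    "fmt"
    "time"
)

func main() {
    delay := 120 * time.Second
    now := time.Date(2021, 1, 1, 0, 10, 5, 0, time.UTC) // stands in for time.Now() = 00:10:05

    current := now.Add(-delay)                        // 00:08:05 -> only 00:09:00 matches, effective delay 65s
    proposed := now.Truncate(time.Minute).Add(-delay) // 00:08:00 -> effective delay is at least the configured 120s

    fmt.Println("current timeFrom: ", current.Format("15:04:05"))
    fmt.Println("proposed timeFrom:", proposed.Format("15:04:05"))
}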