azure / prometheus-collector Goto Github PK

License: Other

Dockerfile 1.90% Go 56.08% Ruby 7.79% Smarty 0.11% Makefile 0.38% Shell 1.16% Python 0.03% HTML 0.11% Mustache 0.52% Jsonnet 22.08% PowerShell 3.83% Batchfile 0.16% JavaScript 0.10% TypeScript 0.96% Bicep 3.49% HCL 1.30%

prometheus-collector's Introduction

Build Status

Dev

Step	Status
Linux
Windows
Chart
Deploy

Prod

Step	Status
Publish
Deploy

Project

This project is Azure Monitor managed service for Prometheus, which is the agent based solution to collect Prometheus metrics to be sent to managed Azure Monitor store.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Telemetry

The software may collect information about you and your use of the software and send it to Microsoft. Microsoft may use this information to provide services and improve our products and services. You may turn off the telemetry as described in the repository. There are also some features in the software that may enable you and Microsoft to collect data from users of your applications. If you use these features, you must comply with applicable law, including providing appropriate notices to users of your applications together with a copy of Microsoft’s privacy statement. Our privacy statement is located at https://go.microsoft.com/fwlink/?LinkID=824704. You can learn more about data collection and use in the help documentation and our privacy statement. Your use of the software operates as your consent to these practices.

To disable telemetry, you can set the environment variable TELEMETRY_DISABLED to true for the container either by yaml or in the Dockerfile.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

prometheus-collector's People

Contributors

Stargazers

Watchers

prometheus-collector's Issues

Any issue in templatizing the cloudName variable?

The cloudName variable in prometheus-collector secretProviderClass isn't templatized. Is there any known issue in doing that?

I tried "AzureChinaCloud" and tried the chart in Mooncake and it seems to be working for me.

container label used in cpu usage alert does not exist

The query is: "sum (rate(container_cpu_usage_seconds_total{image!="", container_name!="POD"}[5m])) by (pod,cluster,container,namespace) / sum(container_spec_cpu_quota{image!="", container_name!="POD"}/container_spec_cpu_period{image!="", container_name!="POD"}) by (pod,cluster,container,namespace) > .95",

The label being used in the query is container_name but when I use grafana to explore the data I only see a label for container and a seprate label for name. I believe this one should be using container.

I did not check the other alerts but may be a good idea to go through the rest of them as well.

Support auto-Sharding for scrape config across multiple collector instances

Prometheus Collector Fails Liveness Probe When Scraping Metrics from Flink TaskManager in Kubernetes

I am attempting to gather metrics from Flink running on a Kubernetes cluster. I've configured the setup using the flink-kubernetes-operator along with the Prometheus metricreporter. While I can successfully scrape metrics from the Job Manager, I encounter issues when trying to collect metrics from the TaskManager.

When the Prometheus collector attempts to scrape metrics from the TaskManager, it fails the liveness probe, causing the container to restart. For detailed error logs, please refer to collector-log.txt.

Initially, I suspected that the issue might be due to insufficient memory resources. However, even after scaling up to higher node instances, the problem persists. The issue seems to be isolated to scraping metrics from the TaskManager, leading me to believe there may be a compatibility issue with Flink.

I've also tried using both static scrape configurations and pod annotations, but the issue remains unresolved.

Any guidance or suggestions would be greatly appreciated, especially since this is a managed service and I don't have much insight into its inner workings.

tls_config for the promethues-collector

Hello ,

We have current scenario where would like to use the tls_config to use CA certificate to validate API server certificate . For this we will have to mount the certificates as volumes into the collector pod and use them as part of the ama-metrics-prometheus-config-configmap .

Is there a way to accomplish this? Not sure about the approach as we know the Prometheus collector get deployed automatically when the monitoring metrics profile is enabled.

Running the az cli commands to enable prometheus metrics seems to have westeurope statically defined in the naming

I have my Azure Monitor workspace and AKS cluster deployed in norwayeast.

When I run the command to enable monitoring and associate this workspace to the cluster;

az aks update --enable-azure-monitor-metrics -n $clusterName -g $clusterResourceGroupName --azure-monitor-workspace-resource-id $workspaceResourceId

It creates the dataCollectionEndpoint and datacollectionRules with westeurope in the name (but in the right location).

I would expect the location to be retrieved from where the Azure monitor workspace is actually located (like in the bicep templates).

Fundamentals - OSS compliance for hosting & shipping dependencies

Legal & compliance guidelines for shipping -

open telemetry collector
node exporter (linux)
kube-state-metrics
windows exporter (as needed)

Installation of node-exporter often fails

Installing node-exporter under the following conditions often results in failure.

enable with Terraform
- azurerm provider 3.53.0
- use argument monitor_metrics of azurerm_kubernetes_cluster resource
Kubernetes 1.26.3
Azure CNI Overlay network plugin

The success and failure rates are equal, with a 50-50 split. As a result, node-exporter is not installed on the nodes as follows.

root [ / ]# ps -ef | grep exp
root        8433    5186  0 05:55 ?        00:00:00 grep --color=auto exp

root [ / ]# ls -asl /usr/local/bin/
total 344336
     4 drwxr-xr-x 2 root root        4096 Apr 28 05:48 .
     4 drwxr-xr-x 7 root root        4096 Mar 21 10:44 ..
 34556 -rwxr-x--- 1 root root    35384960 Apr 10 16:04 bpftrace
     4 -rwxr-xr-x 1 root root         705 Apr 10 16:02 ci-syslog-watcher.sh
 51008 -rw-r----- 1 root root    52232184 Apr 10 16:03 containerd-shim-slight-v1
 44276 -rw-r----- 1 root root    45334640 Apr 10 16:03 containerd-shim-spin-v1
 49136 -rwxr-xr-x 1 1001 docker  50311268 Aug 26  2022 crictl
     4 -r-xr--r-- 1 root root        2462 Apr 10 16:02 health-monitor.sh
 46912 -rwxr-xr-x 1 root root    48037888 Mar 21 00:57 kubectl
118432 -rwxr-xr-x 1 root root   121272408 Mar 21 00:57 kubelet

Additionally, in cases where Cilium was enabled along with it, all installations failed. I informed you about the network plugin for your reference, although it is unclear whether it has any impact.

It should be noted that all ama-metrics-node-* DaemonSets are running, and metrics can be collected from kubelet and cAdvisor.

What probable causes can you think of? Any advice would be appreciated.

Parameterize Target Cloud in SecretProviderClass

We've rolled out prometheus-collector to all of our public regions and I've been thinking about what we need to do before we can roll out to sovereign clouds.

Our clusters in FF + MC send their metrics back to our dedicated stamp in public cloud so I'm wondering if the collector might work in those regions without much modification in our scenario?

The only blocker I could identify is that the SecretProviderClass needs to support the CloudName.

I know Sovereign clouds are on the roadmap for Zn, but I'm curious if I'm crazy or if it might just work for us if we make that change?

Terraform template do not link aks prometheus to monitor workspace (sometimes)

Integration with metrics workspace/container

Also includes MSI based auth support for metrics ingestion

Support multi-homing for metrics ingestion

Add support for ingesting same metrics to multiple accounts.

Resource usage %% metrics do not work if containers do not specify requests/limits

For example, Prometheus doesn't specify requests/limits for most of its containers. This makes dashboards plotting CPU/memory usage %% less useful/misleading, because only e.g., config-reloader is visible. In this case, pod metrics (sum of all containers) is even stranger, because the whole Prometheus' resource usage is divided by requests/limits of only config-reloader.

What's the difference between "controlplane-apiserver" and "apiserver" targets

By default, controlplane-apiserver target is enabled but apiserver is not.
What's the difference between these two?

With the default config I treid to query using PromQL for metrics collected from controlplane-apiserver target (https://learn.microsoft.com/en-us/azure/azure-monitor/containers/prometheus-metrics-scrape-configuration-minimal#:~:text=node_uname_info%22-,controlplane%2Dapiserver,-apiserver_request_total) but looks like all of them were empty.

invalid Azure Monitor Workspace Resource Id error

When I try to deploy the Bicep FullAzureMonitorMetricsProfile.bicep template using the command in the readme, I am receiving an error:
Inner Errors:
{"code": "InvalidAzureMonitorWorkspaceResourceId", "message": "'/subscriptions/subid/resourcegroups/rg/providers/microsoft.operationalinsights/workspaces/workspace name' seems like an invalid Azure Monitor Workspace Resource Id. It should be like '/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/my-resource-group/providers/microsoft.monitor/accounts/my-azure-monitor-workspace'"}

I checked all of the Log Analytics Workspaces in my subscription and all of them are in /providers/microsoft.operationalinsights/workspaces and not /providers/microsoft.operationalinsights/accounts

Am I missing something or is this an issue that needs to be addressed with support?

Improve scrape config validation

Currently we use basic validation using promtool, but that doesn;t cover all the ifs & buts for collector. We woul dneed to make config validation more reliable & friendlier, and also have clear versioning dependency with prometheus schema.

Improve validation
Include custom validations for collector
Make this a standalone tool for using while authoring config
Use APIs from # 3 above for collector config validation at startup to be consistent with the tooling

Fundamentals - CI/CD & Release automation

CI/CD for every PR merge to deploy to dev/int clusters
CI/CD for deploying to prod cluster for prod releases
Release automation (chart & image auto versioning)

livenessprobe.cmd does not work as expected.

In https://github.com/Azure/prometheus-collector/blob/main/otelcollector/build/windows/scripts/livenessprobe.cmd, exit /b 1 does not set the exit code to 1 after if errorlevel 1.

Below works:

            if errorlevel 1 (
                echo "Metrics Extension is not running (configuration exists)"
                exit
            )

[BUG] Default prometheus configuration doesn't include kubelet, cadvisor or node jobs

When deploying AMA add-on with AKS, deployment is using a replicaSet. #249 added a hardcoded variable @sendDSUpMetric that bypasses entirely the "if" conditions blocks for kubelet, cadvisor and node exporter (issue highlighted as comment in code below):

if currentControllerType == @replicasetControllerType
  if advancedMode == false
    UpdateScrapeIntervalConfig(@cadvisorDefaultFileRsSimple, cadvisorScrapeInterval)
    if !cadvisorMetricsKeepListRegex.nil? && !cadvisorMetricsKeepListRegex.empty?
      AppendMetricRelabelConfig(@cadvisorDefaultFileRsSimple, cadvisorMetricsKeepListRegex)
    end
    defaultConfigs.push(@cadvisorDefaultFileRsSimple)
  elsif @sendDSUpMetric == true # == if currentControllerType == "replicaset" && advancedMode == true && @sendDSUpMetric == true. but @sendDSUpMetric is ALWAYS false
    UpdateScrapeIntervalConfig(@cadvisorDefaultFileRsAdvanced, cadvisorScrapeInterval)
    defaultConfigs.push(@cadvisorDefaultFileRsAdvanced)
  end
  # no if-statement for currentControllerType == "replicaset" && advancedMode == true && @sendDSUpMetric == false. Whole block is doing nothing for that use case which is AMA AKS add-on's mode.
else
  if advancedMode == true && ENV["OS_TYPE"].downcase == "linux"
    UpdateScrapeIntervalConfig(@cadvisorDefaultFileDs, cadvisorScrapeInterval)
    if !cadvisorMetricsKeepListRegex.nil? && !cadvisorMetricsKeepListRegex.empty?
      AppendMetricRelabelConfig(@cadvisorDefaultFileDs, cadvisorMetricsKeepListRegex)
    end
    contents = File.read(@cadvisorDefaultFileDs)
    contents = contents.gsub("$$NODE_IP$$", ENV["NODE_IP"])
    contents = contents.gsub("$$NODE_NAME$$", ENV["NODE_NAME"])
    File.open(@cadvisorDefaultFileDs, "w") { |file| file.puts contents }
    defaultConfigs.push(@cadvisorDefaultFileDs)
  end
end

Hence, default scrape configs for kubelet, cadvisor and node jobs are not generated. All other components are not using @sendDSUpMetric variable and their default scrape configs get generated correctly.

KubeContainerOOMKilledCount never resolves because relies on terminated reason

hi,

the rule KubeContainerOOMKilledCount keeps being active as long as the pod has containerStatuses.lastState.terminated.reason OOMKilled.

is this intended behaviour? Should the metric not exclude pods that are healthy again after some time.

thank you for your time

sum by (cluster,container,controller,namespace)(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} * on(cluster,namespace,pod) group_left(controller) label_replace(kube_pod_owner, "controller", "$1", "owner_name", "(.*)")) > 0

Unable to add external labels to ama-metrics pods

I've updated both "ama-metrics-prometheus-config" and "ama-metrics-prometheus-config-node" configmaps with the following settings:

data:
  prometheus-config: |
    global:
      external_labels:
        my_fqdn: sample_fqdn

However only the pods for "ama-metrics-node-*" receive the external_labels configuration.
Why are the other pods ignoring the configmap?

thank you for your time.

prometheus-collector keeps crashing with no_proxy setup in AKS

When setting up an AKS cluster with a configuration similar to this:

resource "azurerm_kubernetes_cluster" "aks" {
  # ...
  monitor_metrics {}

  http_proxy_config {
    https_proxy = "http://proxy-ip-address:port"
    no_proxy      = "localhost"
  }
  # ...
}

The ama-metrics-node pods keep crashing and show these logs:

******************Done Merging Default and Custom Prometheus Config*******************
prom-config-validator::No custom prometheus config found. Only using default scrape configs
prom-config-validator::Config file provided - /opt/defaultsMergedConfig.yml
prom-config-validator::Successfully generated otel config
prom-config-validator::Loading configuration...
prom-config-validator::Successfully loaded and validated prometheus config
prom-config-validator::Prometheus default scrape config validation succeeded, using this as collector config
prom-config-validator::Use default prometheus config: true
checking health of token adapter after 1 secs
checking health of token adapter after 2 secs
checking health of token adapter after 3 secs
checking health of token adapter after 4 secs
checking health of token adapter after 5 secs
checking health of token adapter after 6 secs
checking health of token adapter after 7 secs
checking health of token adapter after 8 secs
checking health of token adapter after 9 secs
checking health of token adapter after 10 secs
checking health of token adapter after 11 secs
checking health of token adapter after 12 secs
checking health of token adapter after 13 secs
checking health of token adapter after 14 secs
checking health of token adapter after 15 secs
checking health of token adapter after 16 secs
checking health of token adapter after 17 secs
checking health of token adapter after 18 secs
checking health of token adapter after 19 secs
checking health of token adapter after 20 secs
checking health of token adapter after 21 secs
checking health of token adapter after 22 secs
checking health of token adapter after 23 secs
checking health of token adapter after 24 secs
checking health of token adapter after 25 secs
checking health of token adapter after 26 secs
checking health of token adapter after 27 secs
checking health of token adapter after 28 secs
checking health of token adapter after 29 secs
checking health of token adapter after 30 secs
checking health of token adapter after 31 secs
checking health of token adapter after 32 secs
checking health of token adapter after 33 secs
checking health of token adapter after 34 secs
checking health of token adapter after 35 secs
checking health of token adapter after 36 secs
checking health of token adapter after 37 secs
checking health of token adapter after 38 secs
checking health of token adapter after 39 secs
checking health of token adapter after 40 secs
checking health of token adapter after 41 secs
checking health of token adapter after 42 secs
checking health of token adapter after 43 secs
checking health of token adapter after 44 secs
checking health of token adapter after 45 secs
checking health of token adapter after 46 secs
checking health of token adapter after 47 secs
checking health of token adapter after 48 secs
checking health of token adapter after 49 secs
checking health of token adapter after 50 secs
checking health of token adapter after 51 secs
checking health of token adapter after 52 secs
checking health of token adapter after 53 secs
checking health of token adapter after 54 secs
checking health of token adapter after 55 secs
checking health of token adapter after 56 secs
checking health of token adapter after 57 secs
checking health of token adapter after 58 secs
checking health of token adapter after 59 secs
checking health of token adapter after 60 secs
Connecting to proxy-ip-address:port (proxy-ip-address:port)
  HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden
giving up waiting for token adapter to become healthy after 61 secs

This suggest that wget is trying to go through the proxy even though no_proxy is seated up. We have also confirmed that this is the case by looking at the logs from the proxy directly.

Do you have a suggestion on how we can workaround this, or if there is a fix for this?

Support managing collector instances thru open telemetry operator

label_limit exceeded (metric: kafka_controller_ActiveBrokerCount_Value, number of labels: 73, limit: 63

Hi team
I had enabled pod monitor for kafka cluster, but it does not work due to label, according to the configuration of Managed Prometheus,
I guess it caused by the specify setting below. Could you please help me on this issue.

separator: ;
regex: _meta_kubernetes_pod_label(.+)
replacement: $1
action: labelmap

Create policy to delete resources created by prometheus-collector after AKS clusters delete.

Using the process described in here https://github.com/Azure/prometheus-collector/tree/main/AddonPolicyTemplate, we can create a policy to monitor AKS clusters.

But when user deletes the AKS cluster, the resources created by this policy ie

Data collection endpoint,
Data collection rule,
Prometheus rule group (preview)

are not getting deleted.

Can you please provide a policy (or some other automated process) that will delete all three type of resources, when the AKS cluster doesn't exist anymore?

Move to OTLP > 0.7.x

We are pinned to 0.7.x, and we need to move to latest (0.9.x at the time of creating this issue).

Ensure ME understands latest OTLP
Update collector to latest
Validate data quality
Validate perf & scale
Ensure no regressions
Validate new features (staleness markers mainly)

Cannot enable annotation based scraping for endpoints (Kubernetes services)

I am modifying Configmap "ama-metrics-prometheus-config". My scrape job is:

      - job_name: kubernetes-service-endpoints
        scrape_interval: 30s
        label_limit: 63
        label_name_length_limit: 511
        label_value_length_limit: 1023
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: (.+)(?::\d+);(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name

But the prometheus-collector cannot load this configuration.

 prometheus-collector ******************Done Merging Default and Custom Prometheus Config*******************                                  │
│ prometheus-collector prom-config-validator::Config file provided - /opt/promMergedConfig.yml                                                 │
│ prometheus-collector prom-config-validator::Successfully generated otel config                                                               │
│ prometheus-collector prom-config-validator::Loading configuration...                                                                         │
│ prometheus-collector prom-config-validator::Cannot load configuration: cannot unmarshal the configuration: 1 error(s) decoding:              │
│ prometheus-collector                                                                                                                         │
│ prometheus-collector * error decoding 'receivers': error reading configuration for "prometheus": prometheus receiver failed to unmarshal yam │
│ prometheus-collector prom-config-validator::Prometheus custom config validation failed. The custom config will not be used                   ││ prometheus-collector prom-config-validator::Running validator on just default scrape configs                                                 ││ prometheus-collector prom-config-validator::Config file provided - /opt/defaultsMergedConfig.yml                                             ││ prometheus-collector prom-config-validator::Successfully generated otel config                                                               ││ prometheus-collector prom-config-validator::Loading configuration...                                                                         ││ prometheus-collector prom-config-validator::Successfully loaded and validated prometheus config                                              ││ prometheus-collector prom-config-validator::Use default prometheus config: true

We do not use annotations on pods in our product, we use them on services.

Deployment not working

There seems to be an issue with the bicep files. running them as is returns errors and the deployment fails.

Looks like there is an extra '}' at the end of the resource names.
{
"status": "Failed",
"error": {
"code": "BadRequest",
"message": "The alert rule name can’t contain the following characters: @ / # % & + * < > ? : \ { } Activity ID: ec76a6b0-6e8b-46bd-b60c-856a01be59cb."
}
}

i changed the cluster name from var to a param and added to the params file.
the dashboard was created but the link to the cluster is broken

It was also missing the prem for the dashboard id on the AMW. added that but it didn't fix the broken link.

this resource class 'Microsoft.ContainerService/managedClusters/providers/dataCollectionRuleAssociations@2021-09-01-preview', in the nested_azuremonitormetrics_dcra_clusterResourceId.bicep file doesn't exit. Replaced it with Microsoft.Insights/dataCollectionRuleAssociations@2021-09-01-preview, but still not getting the cluster connection.

(had to add the 'existing' to get the cluster scope)

This is compared to running the update command from the cli

az aks update --enable-azuremonitormetrics -n $aksName -g $aksResourceGroupName --azure-monitor-workspace-resource-id $azMonWorkspaceId --grafana-resource-id $grafanaDashboardId

I think the main issue is with the MSProm name. When I hard coded EUS, the short name for eastus it finally worked (with all the previous changes intact).

and I found no API that would return the short name. I just got the EUS from previous command line runs

The last thing is, it's not recognizing the Windows node pools. This is even true when turning on from the cli. I know there's a Windows node exporter, I assume it should be installed when you have Windows node pools. This is W2022, did not try with W2019
They do at least show up on the Kubernetes / Compute Resources / Node (Pods) dashboard, but with no data

Actually one more thing. A few weeks back, you would also get dashboards for Azure Monitor Container Insights

That's no longer happening even when the insights is enabled for the cluster.

The two questions I have are

How do we get the Windows servers supported?
How do we get the Azure Monitor Container Insights dashboard's back.

Configure remote_write / custom endpoint

Is it possible to configure remote_write with custom endpoint?
I would like to use the ama agent and send data to elastic, along managed Prometheus.

Clarify how to set ama-metrics-settings-configmap, and add troubleshooting tips

Discussed in #611

^{Originally posted by DoyleRavPMPC September 29, 2023}
Update: I worked out my issue (see below). See suggestions and >> updates below, hope this helps others.

Suggestions to improve docs:

Add a note to the docs near the links to the ama-metrics-settings-configmap.yaml explaining that the file should be renamed before applying the configmap. Perhaps a comment could be added to the file itself in the repo, aka: "apply this file as follows..."
Add a note in the troubleshooting section so folks verify the data section. I used the azure portal to browse my cluster config, and noticed this in the YAML view for the config map:

data:
  ama-metrics-settings-configmap.yaml: "kind:

which needed to be this:

data:
  ama-metrics-settings-configmap: "kind:

The first (with the yaml extension) will not work.

from this Q&A discussion:

Hello,

We are using the managed prometheus service in our AKS cluster. I would like to add annotations to my pods to configure prometheus scraping. I have questions on how to do that and some follow ups on how to troubleshoot. I've read over the default/custom/troubleshooting topics.

My goal is to have the minimal ingestion targets enabled (cadvisor, kubelet, etc) and in addition, use annotations to indicate which pods in selected namespaces should also be scraped.

Here are the steps I'm following:

I have some nginx pods in the cluster that already have the prometheus.io annotations on them (scrape = true and port set to 10254). They are in namespace ingress-nginx. My deployed pods are in the namespace "dev" and do not have annotations (yet).

I copy the ama-metrics-settings-configmap.yaml in this repo, and I've editted these three fields:

I have ensured I do not have a configmap named ama-metrics-prometheus-config set (see Q2) and there is no other configmaps with the prefix ama-metrics. I am assuming these settings are all I need, and that I would use the prometheus config for static scrape config targets.

I upload the settings as follows:

Updated (IMPORTANT): I saw this tip in another topic and realized the file name is significant here - ensure your settings file is named ama-metrics-settings-configmap
If you copied the file from the repo, it will have a yaml extension, and that extension will be used for more than just locating the source file in the command. Here is an example of how to apply the file correctly:

copy ama-metrics-settings-configmap.yaml ama-metrics-settings-configmap
kubectl create configmap ama-metrics-settings --from-file=ama-metrics-settings-configmap -n kube-system

(original content with follow ups):
I don't see a pod restart, which seems odd. I don't see errors in logs, but I don't see obvious new activity on merging the config (as I would if I loaded a prometheus configmap). If I portforward 9090 to see Prometheus on my ama-metrics and ama-metrics-node pods, I see Configuration reload unsuccessful - that doesn't seem good. If I delete the configmap, I do not see the expected list of the minimal ingestion targets reappear in Prometheus as scrape targets. I only see kube-state-metrics on ama-metrics, and the four other default kube-system targets on ama-metrics-node.

Q1: any obvious issues with my configmap name, settings or commands above?

It was the yaml extension on the file name

Q2: Do I need to additionally add a job in prometheus configmap for service discovery? I wasn't clear if setting the podannotationsnamespace was enough (as long as annotations are present on pods in the namespace).

Yes on adding a job to prometheus-config and creating that configmap. here's an example of my prometheus-config in case it helps others (note my settings file specifies namespaces for pod annotation based scraping)

kubectl create configmap ama-metrics-prometheus-config --from-file=prometheus-config -n kube-system
contents of prometheus-config:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: kubernetes_pods
    scheme: http
    kubernetes_sd_configs:
        - role: pod

Q3: Do you have a simple "hello world" example of pod based annotation scraping in managed prometheus?

This would still be useful to others for the docs, but I am past this issue

Q4: Can I expect to be able to combine pod annotation scraping with the scraping of the minimal ingestion kube-system targets? What configmaps are needed for that to work as expected?

This does appear to work, once I got the settings applied without the file extension. Note that if you use port forwarding to see th targets, you'll see kube-state scraping in the ama-metrics pod and you'll see the other "minimal ingestion" targets (cadvisor, etc) in the ama-metrics-node pods.

Troubleshooting questions:

Which logs should I look at for the load/parsing of the ama-metrics-settings-configmap? I'd like to troubleshoot the "configuration load unsuccessful" status reported in prometheus, I assume that indicates an issue.

I saw more info in the logs once I fixed the create configmap command for the settings file

Is there a way to validate the ama-metrics-settings-configmap after editting, similar to the promconfigvalidator process?
Should I see a pod restart when I create or delete the ama metrics settings configmap? (I do not see any restarts in kube-system, which seems odd). Should I be forcing nodes to restart after creating or deleting settings (and if so - which nodes)?

Perhaps the "no pod restart" was a key clue on the settings not being recognized

Thanks in advance,
Doyle

Prometheus collector with temporarily disconnected cluster

I have an Arc-enabled K8s cluster, and I enabled Prometheus metrics like described here.
I see a couple of ama-metrics pods with prometheus-collector running. What would happen if there were connectivity/bandwidth issues for an hour on the edge location? Would metrics be dropped, or stored locally for some time? Is there any configuration of it?

@vishiy, @gracewehner

Prometheus collector running on windows nodes

collector, ME running on a server 2019 container
being able to scrape & ingest local node targets from windows nodes
collect telemetry from windows nodes
measure perf/scale for windows container
Dashboard support for default windows targets

Collector Perf @ scale

Measure per instance max/recommended
Document recommendations (based on speech team's scale requirements [7-12 mil events/min], for that scale)

Missing recommended alerts

Cx is migrating from Custom metrics alerts (retiring on 5/31/2024) to Prometheus alerts.
In below docs, https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-metric-alerts?tabs=arm-template%2Cazure-portal#recommended-alert-rules

Code: https://raw.githubusercontent.com/Azure/prometheus-collector/main/GeneratedMonitoringArtifacts/Default/recommendedMetricAlerts.json

It is given that source code is available in GitHub but for few recommended alerts like CPU and disk of nodes as highlighted below, it is not there in GitHub as we checked. Cx needs the source code to complete deployments.

Failed to install the Microsoft.AzureMonitor.Containers.Metrics Arc extension on AKS EE

Following these docs I'm trying to collect Prometheus metrics from an Arc-enabled AKS Edge Essentials Kubernetes cluster. When executing the following statement the operation times out after several minutes.

az k8s-extension create --name azuremonitor-metrics --cluster-name $Env:CLUSTER_NAME --resource-group $Env:RESOURCE_GROUP --cluster-type connectedClusters --extension-type Microsoft.AzureMonitor.Containers.Metrics

(ExtensionOperationFailed) The extension operation failed with the following error:  
Error: [ InnerError: [Helm installation failed : 
Timed out waiting for the resource to come to a ready/completed stateDeployment is not ready: 
kube-system/ama-metrics. 0 out of 1 expected pods are ready : 
Recommendation Please contact Microsoft support for further inquiries : 
InnerError [release azuremonitor-metrics failed, and has been uninstalled due to atomic being set: 
timed out waiting for the condition]]] occurred while doing the operation : 
[Create] on the config.

From kubectl logs the prometheus-collector container is stuck at ContainerCreating and the following can be seen from the events

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    3m47s                default-scheduler  Successfully assigned kube-system/ama-metrics-794786cdfb-cfsln to <windows-device-name>
  Warning  FailedMount  3m46s                kubelet            MountVolume.SetUp failed for volume "ama-metrics-proxy-cert" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount  104s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[anchors-ubuntu], unattached volumes=[settings-vol-config prometheus-config-vol host-log-containers host-log-pods anchors-mariner anchors-ubuntu ama-metrics-proxy-cert kube-api-access-866x4]: timed out waiting for the condition
  Warning  FailedMount  99s (x9 over 3m47s)  kubelet            MountVolume.SetUp failed for volume "anchors-ubuntu" : mkdir /usr/local/share/ca-certificates/: read-only file system

Config:
Windows 11 Enterprise 22H2
AKS EE: v1.2.414.0
Arc Region: eastus2
Distro: k8s

"LinuxNode": {
        "CpuCount": 4,
        "MemoryInMB": 8192,
        "DataSizeInGB": 30
}

Why create a Managed Resource Group, instead allow a choice for where resources are deployed

As per this documentation:

https://learn.microsoft.com/en-Us/azure/azure-monitor/essentials/azure-monitor-workspace-overview

problem

Having the new resource created with a random name goes against declarative deployment techniques and best practice.

solution

When creating the monitor accounts, please either allow one of the following options:

Allow to pre-create dataCollectionEndpointResource and dataCollectionRuleResource, then link them back to the account
Allow the Managed Resource Group were the dataCollectionEndpointResource and dataCollectionRuleResource to be manually defined.

resource monitorAccount 'Microsoft.Monitor/accounts@2021-06-03-preview' = {
  name: '${DeploymentURI}Monitor'
  location: resourceGroup().location
  properties: {
    #disable-next-line BCP037
    publicNetworkAccess: 'Enabled'
    defaultIngestionSettings: {
      dataCollectionEndpointResourceId: dataCollectorEPLinux.id
      dataCollectionRuleResourceId: dataCollectorEPLinuxRule.id
    }
  }
}

example of the current behaviour

random "new" managed resource group

preferred way to do it

Allow me to pre-deploy these other resources, then link them back to the account
- e.g. in the same Resource Group
- e.g. with my chosen names for the resources

recommendedMetricAlerts throws Property name: Rules.Count, Attempted value: 40, Error: 'Rules Count' must be between 1 and 20. You entered 40. Activity ID: fa7d4520-dfe5-4b1a-81dc-75018f0b5f8d. (Code: BadRequest)

Hi,

I am trying to deploy the recommended alerts that are mentioned here through the Custom Deployment resource on Azure. When deployed, I get the following error.

Property name: Rules.Count, Attempted value: 40, Error: 'Rules Count' must be between 1 and 20. You entered 40. Activity ID: fa7d4520-dfe5-4b1a-81dc-75018f0b5f8d. (Code: BadRequest).

Does anyone know what this means?

Steps to Reproduce:

Download the recommendMetricAlerts

Log into the Azure Portal

Redirect to "Deploy a Custom Template"

Click "Build your own template in the editor."

"Load file"

Choose the downloaded file.

Click "Save"

Fill in the Project Details and Instance Detail

Click Review + Create

Scrapping Metrics from Authenticated Endpoints

As an user of managed Prometheus , is there a possibility of scraping data from authenticated endpoints ? if so how can that be achieved.

CI Recomended Alert for Pod Restart incorrect

The recommended Alert for Pod Restart in https://github.com/Azure/prometheus-collector/blob/main/mixins/kubernetes/rules/recording_and_alerting_rules/templates/ci_recommended_alerts.json does not seem to be correct.

The query is: "sum by (namespace, pod, container, cluster) (kube_pod_container_status_restarts_total{job="kube-state-metrics", namespace="kube-system"}) > 0"

The metric being used in the query is a counter so it will always be increasing. This means that if a pod has ever restarted it will fire forever. I think the correct approach would be using a function such as increase over a period of time.

Also why does the alert only restrict to pods in the kube-system namespace? I believe it would be useful to be across all namespaces.

Maximum allowed content length: 1048576 bytes (1 MB). Provided content length: 1175590 bytes

Hi Team,

As far as I can understand the architecture there is a collector service running that scrapes the target and pushes to the metric extension service which is running on port 55680.
Now, this metricsExtenstion service is doing a remote write to managed Prometheus.
I can see you have given config for the metricsExtention.
But the problem is now the managed Prometheus is not accepting payloads of more than 1 MB and metricExtention is trying to push more data and ultimately failing.
We need to fix this config mismatch by either providing support and more documentation on how to control metricExtention parameters I can see they are hard coded as of now.
I can contribute once the issue Is confirmed.

Need a way to specify the priorityClassName for the deployment

We request a lot of CPU and memory to our prom-collector deployment. When our AKS cluster is at max node capacity, the Pod would sometimes be in pending because there isn't a single node with enough resource to accommodate it.
Setting a higher priority to the deployment would allow prom-collector to evict other nodes to make sure it can find a node with enough resources

Support collector and its dependencies running in all clouds

public, ussec, usnat, mc

Missing step in AddonTerraformTemplate

Hi all,

For the below we already had a support ticket at Microsoft. Their conclusion was that there are no issue with the Azure backplane regarding this specific issue, and it should be a problem in the code. They reproduced our situation, and had the exact same results.

The issue
We're setting up managed Prometheus on some of our private AKS clusters, using the code provided here:
https://github.com/Azure/prometheus-collector/tree/main/AddonTerraformTemplate

It seems to be missing one essential step, or in any case it doesn't work: the data collection endpoint and data collection rule aren't properly linked (I believe this is called the data collection rule association). See screenshot.

When this is not set properly, managed Prometheus is not working, and the ama-metrics pods on the cluster keep crashing.

We've tested this in different subscriptions with different clusters, with different versions of the Azure TerrAform providers even, the results are always the same. So there seems to be a step missing, in the code, Terraform provider, a combination of those, or something else. As said, the Microsoft support team had the same results with the same code.

When we manually select and save the proper endpoint, managed Prometheus starts working and pods stop crashing. We need to do this via the Terraform code though, and not manually.

Any help and input is greatly appreciated.

Thanks and regards!

How to ignore kube-system namespace for KubePodContainerRestart

Problem
We configured the recommended alerts, the KubePodContainerRestart triggers quite often because services in kube-system restart.

Is it possible for KubePodContainerRestart to ignore kube-system (or other namespaces)?

prometheus path annotation not honored for service discovery?

Hello again,

TL:DR; I've added a metrics path annotation to some of my pods, and I see the annotations picked up by managed prometheus, but the default path is used (and doesn't work). Am I missing a step in overriding the default path for service discovery with pod annotation?

Details:

I've configured a pod with prometheus.io annotations to enable scraping it. One of the annotations is path, ie:
prometheus.io/path: '/api/myservice/telemetry/metrics'
I've also set the following:

       prometheus.io/scrape: 'true'
       prometheus.io/port: '80'
       prometheus.io/scheme: 'http'

My prometheus config is pretty simple:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: kubernetes_pods
    kubernetes_sd_configs:
      - role: pod

I've set the pod annotation namespace in ama-metrics-settings-configmap to include my service's namespace (as well as nginx-ingress).

I apply the config for settings and prometheus to the ama-metrics config maps - so far so good. I wait for a bit for the config to take effect. My deployment and pod show the expected annotations for prometheus.io.

I use kubectl exec and curl to ensure I see metrics at the given url, ala:
kubectl exec -n <mynamespace> <mypodthathascurl> curl http://<myserviceip>/api/<myservice>/telemetry/metrics

I see metrics at the given url.

I then use port-forward to see the prometheus ui on the ama-metrics pod. I see a lot of targets show up for service discovery, which is good. I see the nginx metrics show up on 10254, as expected (nginx annotations have scrape=true and port=10254, but does not override the default /metrics path). I see a bunch of kube-system targets, some of which have annotations apparently and are getting scraped successfully.

I find my pod in targets, but the path listed for the service appears to be the default, not my annotated override:

http://10.224.0.173/metrics	DOWN

I look under service discovery and I see the following relevant annotations for my service:

__meta_kubernetes_pod_annotation_prometheus_io_path="/api/myservice/telemetry/metrics"
__meta_kubernetes_pod_annotation_prometheus_io_scrape="true"
__meta_kubernetes_pod_annotationpresent_prometheus_io_path="true"
__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape="true"
...
__metrics_path__="/metrics"
__scheme__="http"
__scrape_interval__="15s"
__scrape_timeout__="10s"

Is there something else I need to configure or specify to get my path annotation honored for certain pods that use a non-default metrics path?

Thanks,
Doyle

Address private preview feedbacks

catch all for all feedbacks (fixed & new)

Sync etcd metrics to self-managed Prometheus

Hi Team

Is there any way to sync etcd metrics to external self-managed Prometheus? customer hope the united data management

Setting to allow config parse error => abort

My colleague was trying to add a new scrape target to our Prometheus configuration, and was confused when no metrics showed up. I helped them debug the error and we realized there was a typo in the configuration. They asked if config parse errors could fail the deployment rather than revert back to the default scrape config.

I'm not entirely sure whether that's desirable or not, but I wanted to bring it up here for discussion. I think you could do something like this by having the option to abort on parse error which will put the pod in CrashLoopBackoff