canonical / cos-proxy-operator

A machine charm that provides a single integration point between the machine world and the Kubernetes-based COS bundle.

Home Page: https://charmhub.io/cos-proxy

License: Apache License 2.0

Python 100.00%
grafana juju kubernetes machine observability prometheus juju-charm

cos-proxy-operator's Introduction

COS Proxy charm


This Juju machine charm provides a single integration point between the machine world and the Kubernetes-based COS bundle.

This charm is designed to be easy to integrate into bundles and Juju-driven appliances, reducing the setup needed to integrate with the Kubernetes-based COS to simply connecting the COS Proxy charm to it.

Proxying support is provided for:

  • Prometheus
  • Grafana Dashboards
  • NRPE (through nrpe_exporter, which sends NRPE results to Prometheus)

Deployment

The cos-proxy charm is used as a connector between a Juju model hosting applications on machines and COS charms running within Kubernetes. In the following example, our machine charms run on an OpenStack cloud, and the Kubernetes is MicroK8s running on a separate host. There must be network connectivity from each of the endpoints to the Juju controller.

For example, we have two models. One, named 'reactive', hosts machine charms running on OpenStack. There is a Telegraf application, cs:telegraf, collecting metrics from units, and we wish to relate it to Prometheus and Grafana running in another model, named 'cos', on Kubernetes.

If you already have a working COS Lite deployment, you can skip creating another one, as well as the steps where you would deploy the COS Lite components one by one.
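
If you do need a fresh COS, the whole COS Lite bundle can alternatively be deployed in a single step once the cos model exists, instead of deploying its components one by one as shown further below (a sketch; exact channel and options may vary):

juju deploy -m cos cos-lite --trust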

Here are the steps to create the models:

$ juju clouds
Only clouds with registered credentials are shown.
There are more clouds, use --all to see them.

Clouds available on the controller:
Cloud             Regions  Default      Type
microk8s-cluster  1        localhost    k8s
serverstack       1        serverstack  openstack

juju add-model reactive serverstack
juju add-model cos microk8s-cluster

Next we'll deploy the COS charms in the cos model and an example application in the reactive model; both models should be hosted on a single Juju controller that manages both MicroK8s and the OpenStack cloud:

juju deploy -m cos prometheus-k8s
juju deploy -m cos grafana-k8s
juju deploy -m reactive cs:ubuntu --series focal -n 3
juju deploy -m reactive cs:telegraf
juju relate -m reactive telegraf:juju-info ubuntu:juju-info

To relate Telegraf to Prometheus in order to add scrape targets and alerting rules, we must use a cross model relation.

Offer the relation in the cos model:

juju offer microk8s-cluster:cos.prometheus-k8s:metrics-endpoint

Deploy the cos-proxy charm in a new machine unit on the target model:

juju deploy -m reactive cos-proxy  # or ./cos-proxy_ubuntu-20.04-amd64.charm
juju relate -m reactive telegraf:prometheus-client cos-proxy:prometheus-target
juju relate -m reactive telegraf:prometheus-rules cos-proxy:prometheus-rules

Add the cross model relation:

juju consume -m reactive microk8s-cluster:cos.prometheus-k8s
juju relate -m reactive prometheus-k8s cos-proxy:downstream-prometheus-scrape

Now we can do the same for Grafana:

juju offer microk8s-cluster:cos.grafana-k8s:grafana-dashboard
juju relate -m reactive telegraf:dashboards cos-proxy:dashboards

Add the cross model relation:

juju consume -m reactive microk8s-cluster:cos.grafana-k8s
juju relate -m reactive grafana-k8s cos-proxy:downstream-grafana-dashboard

A complete set of relations in the consuming model will appear as:

Model       Controller        Cloud/Region         Version  SLA          Timestamp
reactive    overlord          localhost/localhost  3.0.3    unsupported  20:40:45+01:00

SAAS     Status  Store               URL
grafana  active  microk8s            admin/cos.grafana
loki     active  microk8s            admin/cos.loki
metrics  active  microk8s            admin/cos.metrics

App        Version  Status  Scale  Charm      Channel  Rev  Exposed  Message
cos-proxy  n/a      active      1  cos-proxy  edge      15  no       
filebeat   6.8.23   active      1  filebeat   stable    49  no       Filebeat ready.
nrpe                active      1  nrpe       stable    97  no       Ready
telegraf            active      1  telegraf   stable    65  no       Monitoring ubuntu/0 (source version/commit 23.01)
ubuntu     20.04    active      1  ubuntu     stable    21  no       

Unit           Workload  Agent  Machine  Public address  Ports          Message
cos-proxy/0*   active    idle   1        10.218.235.237                 
ubuntu/0*      active    idle   0        10.218.235.198                 
  filebeat/0*  active    idle            10.218.235.198                 Filebeat ready.
  nrpe/0*      active    idle            10.218.235.198  icmp,5666/tcp  Ready
  telegraf/0*  active    idle            10.218.235.198  9103/tcp       Monitoring ubuntu/0 (source version/commit 23.01)

Machine  State    Address         Inst id        Series  AZ  Message
0        started  10.218.235.198  juju-732594-0  focal       Running
1        started  10.218.235.237  juju-732594-1  focal       Running

NRPE Exporting

NRPE targets may appear on multiple relations. To capture all jobs, cos-proxy should be related to BOTH the existing reactive NRPE subordinate charm and the application that charm is subordinate to, since the monitors interface may appear on either: the principal charm provides "host-level" checks, while the subordinate nrpe provides application-level ones.
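
For illustration only (the principal application's name and endpoint are placeholders here; endpoint names vary by charm, so use whichever of its endpoints carries the monitors interface), this typically means two relations into cos-proxy:

juju relate nrpe:monitors cos-proxy:monitors
juju relate <principal-application>:monitors cos-proxy:monitors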

cos-proxy-operator's People

Contributors

abuelodelanada, ca-scribner, dnegreira, dstathis, francescodesimone, ibraaoad, lucabello, marcusboden, mmkay, mthaddon, przemeklal, rbarry82, samuelallan72, sed-i, simskij

cos-proxy-operator's Issues

Variable queries of imported dashboards are updated, breaking the dashboards.

Bug Description

When importing dashboards via cos-proxy, variable queries get updated.

For example, I import the libvirt-exporter dashboard via the cos-proxy dashboards relation. This dashboard contains a host variable which uses the following query:

https://git.launchpad.net/charm-prometheus-libvirt-exporter/tree/src/files/grafana-dashboards/libvirt.json#n1649

When imported into COS, this variable query is updated to:

label_values(up{juju_model="$juju_model",juju_model_uuid="$juju_model_uuid",juju_application="$juju_application"},host)

To Reproduce

  1. juju add-relation prometheus-libvirt-exporter:dashboards cos-proxy:dashboards

Environment

cos-proxy rev. 22 from channel edge

Relevant log output

None

Additional context

No response

Add support for the filebeat logstash interface or any other log collection mechanism that would work with machine charms

Enhancement Proposal

In the LMA stack, a combination of Filebeat, Graylog and Elasticsearch charms can be used to aggregate and store logs.

COS Proxy implements relations for Grafana dashboards and Prometheus scrape targets, and there is also a solution to get NRPE checks to work, but there is no 1:1 replacement for Filebeat's logstash relation. This is a potential blocker for migrating from the LMA stack to COS.

nrpe_exporter: default rules cause one firing alert to fire all alerts on the same unit

Bug Description

The alerts generated on the fly for the individual NRPE checks all fire together, regardless of which alert is actually firing on the Nagios screen. This is cos-proxy on current latest/edge, rev 14. If I am not mistaken, this is because the generated alert expression filters only by the unit, not by the check as well.


To Reproduce

1 - deploy
  - model test
    - nrpe
    - principal charm
    - cos-proxy
  - model cos
    - cos-lite
  - model lma
    - nagios
2 - relate
  - nrpe <-> lma.nagios
  - cos.prometheus-metrics <-> cos-proxy <-> nrpe <-> principal

Environment

miles@mertkirpici-bastion:~$ juju status --relations -m cos
Model  Controller  Cloud/Region    Version  SLA          Timestamp
cos    vader       mk8s/localhost  2.9.38   unsupported  07:23:20Z

App           Version  Status   Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager  0.23.0   active       1  alertmanager-k8s  edge      53  10.152.183.122  no       
catalogue              active       1  catalogue-k8s     edge      13  10.152.183.118  no       
grafana       9.2.1    active       1  grafana-k8s       edge      69  10.152.183.125  no       
loki          2.4.1    active       1  loki-k8s          edge      65  10.152.183.189  no       
prometheus    2.42.0   active       1  prometheus-k8s    edge     110  10.152.183.64   no       
traefik       2.9.6    waiting    1/4  traefik-k8s       edge     117  10.5.0.100      no       installing agent

Unit             Workload  Agent  Address       Ports  Message
alertmanager/0*  active    idle   10.1.117.153         
catalogue/0*     active    idle   10.1.117.155         
grafana/0*       active    idle   10.1.117.157         
loki/0*          active    idle   10.1.117.154         
prometheus/0*    active    idle   10.1.117.160         
traefik/0        error     lost   10.1.117.167         crash loop backoff: back-off 5m0s restarting failed container=charm-init pod=traefik-0_cos(e4cfd170-716c-4cef-883a-43...
traefik/1        error     lost   10.1.117.168         crash loop backoff: back-off 5m0s restarting failed container=charm-init pod=traefik-1_cos(720b22fc-f0f2-48b5-a78a-42...
traefik/2        error     lost   10.1.117.169         crash loop backoff: back-off 5m0s restarting failed container=charm-init pod=traefik-2_cos(46357296-3857-4bc9-91d5-eb...
traefik/3*       active    idle   10.1.117.171         

Offer                            Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager-karma-dashboard     alertmanager  alertmanager-k8s  53   0/0        karma-dashboard       karma_dashboard          provider
grafana-dashboards               grafana       grafana-k8s       69   0/0        grafana-dashboard     grafana_dashboard        requirer
loki-logging                     loki          loki-k8s          65   0/0        logging               loki_push_api            provider
prometheus-receive-remote-write  prometheus    prometheus-k8s    110  0/0        receive-remote-write  prometheus_remote_write  provider
prometheus-scrape                prometheus    prometheus-k8s    110  1/1        metrics-endpoint      prometheus_scrape        requirer

Relation provider                   Requirer                     Interface              Type     Message
alertmanager:alerting               loki:alertmanager            alertmanager_dispatch  regular  
alertmanager:alerting               prometheus:alertmanager      alertmanager_dispatch  regular  
alertmanager:grafana-dashboard      grafana:grafana-dashboard    grafana_dashboard      regular  
alertmanager:grafana-source         grafana:grafana-source       grafana_datasource     regular  
alertmanager:replicas               alertmanager:replicas        alertmanager_replica   peer     
alertmanager:self-metrics-endpoint  prometheus:metrics-endpoint  prometheus_scrape      regular  
catalogue:catalogue                 alertmanager:catalogue       catalogue              regular  
catalogue:catalogue                 grafana:catalogue            catalogue              regular  
catalogue:catalogue                 prometheus:catalogue         catalogue              regular  
grafana:grafana                     grafana:grafana              grafana_peers          peer     
grafana:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
loki:grafana-dashboard              grafana:grafana-dashboard    grafana_dashboard      regular  
loki:grafana-source                 grafana:grafana-source       grafana_datasource     regular  
loki:metrics-endpoint               prometheus:metrics-endpoint  prometheus_scrape      regular  
prometheus:grafana-dashboard        grafana:grafana-dashboard    grafana_dashboard      regular  
prometheus:grafana-source           grafana:grafana-source       grafana_datasource     regular  
prometheus:prometheus-peers         prometheus:prometheus-peers  prometheus_peers       peer     
traefik:ingress                     alertmanager:ingress         ingress                regular  
traefik:ingress                     catalogue:ingress            ingress                regular  
traefik:ingress-per-unit            loki:ingress                 ingress_per_unit       regular  
traefik:ingress-per-unit            prometheus:ingress           ingress_per_unit       regular  
traefik:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
traefik:traefik-route               grafana:ingress              traefik_route          regular  




miles@mertkirpici-bastion:~$ juju status --relations
Model      Controller  Cloud/Region             Version  SLA          Timestamp
openstack  vader       serverstack/serverstack  2.9.38   unsupported  07:23:52Z

SAAS               Status  Store  URL
nagios             active  vader  admin/lma.nagios
prometheus-scrape  active  vader  admin/cos.prometheus-scrape

App              Version  Status  Scale  Charm                 Channel      Rev  Exposed  Message
cos-proxy        n/a      active      1  cos-proxy             edge          14  no       
keystone         21.0.0   active      1  keystone              yoga/stable  595  no       Application Ready
keystone-router  8.0.32   active      1  mysql-router          8.0/stable    35  no       Unit is ready
mysql            8.0.32   active      3  mysql-innodb-cluster  8.0/stable    43  no       Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
nrpe                      active      1  nrpe                  stable        97  no       Ready

Unit                  Workload  Agent  Machine  Public address  Ports          Message
cos-proxy/0*          active    idle   4        10.5.0.176                     
keystone/0*           active    idle   0        10.5.1.11       5000/tcp       Unit is ready
  keystone-router/0*  active    idle            10.5.1.11                      Unit is ready
  nrpe/0*             active    idle            10.5.1.11       icmp,5666/tcp  Ready
mysql/0               active    idle   1        10.5.1.59                      Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
mysql/1               active    idle   2        10.5.2.43                      Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
mysql/2*              active    idle   3        10.5.0.32                      Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.

Machine  State    Address     Inst id                               Series  AZ    Message
0        started  10.5.1.11   904940b2-3b5e-4431-894d-b1452b5ae87f  jammy   nova  ACTIVE
1        started  10.5.1.59   f6d4d004-99d7-4b37-89e6-aab1972b4179  jammy   nova  ACTIVE
2        started  10.5.2.43   fbd97bd8-0640-4da7-9a25-10faa4a822bb  jammy   nova  ACTIVE
3        started  10.5.0.32   2e5d4e6d-25c7-4a25-a975-d445ed3552e4  jammy   nova  ACTIVE
4        started  10.5.0.176  2125e627-5a29-44b4-89e6-7b4dac920dd4  focal   nova  ACTIVE

Relation provider                       Requirer                            Interface             Type         Message
cos-proxy:downstream-prometheus-scrape  prometheus-scrape:metrics-endpoint  prometheus_scrape     regular      
keystone-router:shared-db               keystone:shared-db                  mysql-shared          subordinate  
keystone:cluster                        keystone:cluster                    keystone-ha           peer         
keystone:nrpe-external-master           nrpe:nrpe-external-master           nrpe-external-master  subordinate  
mysql:cluster                           mysql:cluster                       mysql-innodb-cluster  peer         
mysql:coordinator                       mysql:coordinator                   coordinator           peer         
mysql:db-router                         keystone-router:db-router           mysql-router          regular      
nrpe:monitors                           cos-proxy:monitors                  monitors              regular      
nrpe:monitors                           nagios:monitors                     monitors              regular      

Relevant log output

miles@mertkirpici-bastion:~$ juju debug-log --replay -i cos-proxy                                                                                                                                                                             
unit-cos-proxy-0: 12:39:47 INFO juju Starting unit workers for "cos-proxy/0"                                                                                                                                                                  
unit-cos-proxy-0: 12:39:47 INFO juju.worker.apicaller [34b851] "unit-cos-proxy-0" successfully connected to "10.5.3.181:17070"                                                                                                                
unit-cos-proxy-0: 12:39:47 INFO juju.worker.apicaller [34b851] password changed for "unit-cos-proxy-0"                                                                                                                                        
unit-cos-proxy-0: 12:39:47 INFO juju.worker.apicaller [34b851] "unit-cos-proxy-0" successfully connected to "10.5.3.181:17070"                                                                                                                
unit-cos-proxy-0: 12:39:48 INFO juju.worker.migrationminion migration phase is now: NONE                                                                                                                                                      
unit-cos-proxy-0: 12:39:48 INFO juju.worker.logger logger worker started                                                                                                                                                                      
unit-cos-proxy-0: 12:39:48 INFO juju.worker.upgrader no waiter, upgrader is done                                                                                                                                                              
unit-cos-proxy-0: 12:39:48 ERROR juju.worker.meterstatus error running "meter-status-changed": charm missing from disk                                                                                                                        
unit-cos-proxy-0: 12:39:48 INFO juju.worker.uniter unit "cos-proxy/0" started                                                                                                                                                                 
unit-cos-proxy-0: 12:39:48 INFO juju.worker.uniter resuming charm install                                                                                                                                                                     
unit-cos-proxy-0: 12:39:48 INFO juju.worker.uniter.charm downloading ch:amd64/focal/cos-proxy-14 from API server                                                                                                                              
unit-cos-proxy-0: 12:39:55 INFO juju.worker.uniter hooks are retried true                                                                                                                                                                     
unit-cos-proxy-0: 12:39:55 INFO juju.worker.uniter.storage initial storage attachments ready                                                                                                                                                  
unit-cos-proxy-0: 12:39:55 INFO juju.worker.uniter found queued "install" hook                                                                                                                                                                
unit-cos-proxy-0: 12:39:55 INFO unit.cos-proxy/0.juju-log Running legacy hooks/install.                                                                                                                                                       
unit-cos-proxy-0: 12:39:56 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)                                                                                                                       
unit-cos-proxy-0: 12:39:56 INFO juju.worker.uniter found queued "leader-elected" hook                                                                                                                                                         
unit-cos-proxy-0: 12:39:56 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)                                                                                                                
unit-cos-proxy-0: 12:39:57 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)                                                                                                                
unit-cos-proxy-0: 12:39:57 INFO juju.worker.uniter found queued "start" hook                                                                                                                                                                  
unit-cos-proxy-0: 12:39:58 INFO unit.cos-proxy/0.juju-log Running legacy hooks/start.                                                                                                                                                         
unit-cos-proxy-0: 12:39:58 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)                                                                                                                         
unit-cos-proxy-0: 12:44:41 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 12:49:55 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 12:55:27 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 13:01:22 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 13:06:38 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 13:11:27 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 13:15:44 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 13:20:07 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)                                                                                                                 
unit-cos-proxy-0: 13:24:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
...

Additional context

No response

Vector consumes 90G of memory

Bug Description

I suspect that vector installed by cos-proxy is leaking memory.

After running for 3 months it was consuming 90G. After a restart it consumes 129M of memory.

To Reproduce

Run cos-proxy for 3 months

Environment

vector 0.27.0 (x86_64-unknown-linux-musl 5623d1e 2023-01-18)

cos-proxy channel edge rev. 36

Relevant log output

# Before restart
root@juju-733f2f-3-lxd-27:~# systemctl status vector.service                                                                                                           
โ— vector.service - "Vector - An observability pipelines tool"                                                                                                          
     Loaded: loaded (/etc/systemd/system/vector.service; enabled; vendor preset: enabled)                                      
     Active: active (running) since Wed 2023-03-29 11:36:09 UTC; 3 months 9 days ago                                           
       Docs: https://vector.dev/                                                                                                                                       
   Main PID: 25005 (vector)                                                                                                                                            
      Tasks: 133 (limit: 314572)                                                                                                                                       
     Memory: 90.6G                                                                                                                                                     
     CGroup: /system.slice/vector.service 
             ├─25005 /usr/local/bin/vector -w
             └─25172 journalctl --follow --all --show-cursor --output=json --boot --since=2000-01-01


# After restart
root@juju-733f2f-3-lxd-27:~# systemctl status vector.service 
โ— vector.service - "Vector - An observability pipelines tool"
     Loaded: loaded (/etc/systemd/system/vector.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2023-07-08 08:06:42 UTC; 2min 12s ago
       Docs: https://vector.dev/
    Process: 320 ExecStartPre=/usr/local/bin/vector validate (code=exited, status=0/SUCCESS)
   Main PID: 761 (vector)
      Tasks: 132 (limit: 314572)
     Memory: 129.2M
     CGroup: /system.slice/vector.service
             ├─ 761 /usr/local/bin/vector -w
             └─1032 journalctl --follow --all --show-cursor --output=json --boot --since=2000-01-01

Additional context

No response

Also build for Jammy

Field has reached out to ask whether we could also make the COS Proxy available for deployment on jammy. Currently, we only support focal, but I see no reason why we couldn't add another series to the charm.
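
Once a jammy base is published for the charm, deploying on jammy would presumably look like the following (a sketch only; the exact channel and flags depend on how the series is released):

juju deploy cos-proxy --channel edge --series jammy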

Removing dashboard relation is broken

Bug Description

When trying to remove an existing dashboard relation, the dashboards-relation-broken hook fails and cos-proxy goes into an error state.

To Reproduce

  1. juju remove-relation prometheus-libvirt-exporter:dashboards cos-proxy:dashboards

Environment

cos-proxy rev. 22 from the edge channel

Relevant log output

unit-cos-proxy-0: 08:52:21 INFO juju.worker.uniter.operation ran "dashboards-relation-departed" hook (via hook dispatching script: dispatch)
unit-cos-proxy-0: 08:52:27 INFO juju.worker.uniter awaiting error resolution for "relation-broken" hook
unit-cos-proxy-0: 08:52:27 DEBUG unit.cos-proxy/0.juju-log dashboards:478: Operator Framework 2.1.1+2.geb8e25a up and running.
unit-cos-proxy-0: 08:52:27 DEBUG unit.cos-proxy/0.juju-log dashboards:478: Emitting Juju event dashboards_relation_broken.
unit-cos-proxy-0: 08:52:27 ERROR unit.cos-proxy/0.juju-log dashboards:478: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 464, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1692, in remove_dashboards
    app_ids = _type_convert_stored(self._stored.id_mappings[event.app.name])  # type: ignore
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 1218, in __getitem__
    return _wrap_stored(self._stored_data, self._under[key])
KeyError: 'prometheus-libvirt-exporter'
unit-cos-proxy-0: 08:52:28 ERROR juju.worker.uniter.operation hook "dashboards-relation-broken" (via hook dispatching script: dispatch) failed: exit status 1
unit-cos-proxy-0: 08:52:28 INFO juju.worker.uniter awaiting error resolution for "relation-broken" hook
unit-cos-proxy-0: 08:52:38 INFO juju.worker.uniter awaiting error resolution for "relation-broken" hook
unit-cos-proxy-0: 08:52:38 DEBUG unit.cos-proxy/0.juju-log dashboards:478: Operator Framework 2.1.1+2.geb8e25a up and running.
unit-cos-proxy-0: 08:52:38 DEBUG unit.cos-proxy/0.juju-log dashboards:478: Emitting Juju event dashboards_relation_broken.
unit-cos-proxy-0: 08:52:38 ERROR unit.cos-proxy/0.juju-log dashboards:478: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 464, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1692, in remove_dashboards
    app_ids = _type_convert_stored(self._stored.id_mappings[event.app.name])  # type: ignore
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 1218, in __getitem__
    return _wrap_stored(self._stored_data, self._under[key])
KeyError: 'prometheus-libvirt-exporter'
unit-cos-proxy-0: 08:52:38 ERROR juju.worker.uniter.operation hook "dashboards-relation-broken" (via hook dispatching script: dispatch) failed: exit status 1


Additional context

No response

Data is missing from `nrpe_lookup.csv`

Bug Description

In production deployments of ~100 units, information for about 15 units is missing from the charm-generated /etc/vector/nrpe_lookup.csv. As a result, metrics and alerts for the missing units are also missing from Prometheus.

def _modify_enrichment_file(self, endpoints: Optional[List[Dict[str, Any]]] = None):
    path = Path("/etc/vector/nrpe_lookup.csv")

    nrpe_endpoints = []
    nrpe_alerts = []  # type: List[Dict]
    for relation_name in self._relation_names.keys():
        for relation in self._charm.model.relations[relation_name]:

To Reproduce

After discussing with @przemeklal, a potential reproducer may look like this:

bundle: kubernetes
saas:
  remote-3df866476d10436281a222fc8d2fb7b0: {}
applications:
  prom:
    charm: prometheus-k8s
    channel: edge
    revision: 154
    resources:
      prometheus-image: 131
    scale: 1
    constraints: arch=amd64
    storage:
      database: kubernetes,1,1024M
    trust: true
relations:
- - prom:metrics-endpoint
  - remote-3df866476d10436281a222fc8d2fb7b0:downstream-prometheus-scrape
--- # overlay.yaml
applications:
  prom:
    offers:
      prom:
        endpoints:
        - metrics-endpoint
        acl:
          admin: admin
--- # machine bundle
series: jammy
saas:
  prom:
    url: k8s2:admin/nrpetest.prom
applications:
  cos-proxy:
    charm: cos-proxy
    channel: edge
    revision: 46
    num_units: 1
    to:
    - "0"
    constraints: arch=amd64
  nrpe:
    charm: nrpe
    channel: edge
    revision: 114
  ubuntu:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 3
    to:
    - "1"
    - "2"
    - "3"
    constraints: arch=amd64
    storage:
      block: loop,100M
      files: rootfs,100M
machines:
  "0":
    constraints: arch=amd64
  "1":
    constraints: arch=amd64
  "2":
    constraints: arch=amd64
  "3":
    constraints: arch=amd64
relations:
- - ubuntu:juju-info
  - nrpe:general-info
- - nrpe:monitors
  - cos-proxy:monitors
- - cos-proxy:downstream-prometheus-scrape
  - prom:metrics-endpoint

Environment

Jammy machines/series/bases.
cos-proxy latest (rev 46).

Relevant log output

2023-09-20 09:23:37 ERROR unit.cos-proxy/0.juju-log server.go:316 monitors:580: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 2723, in _run
    result = subprocess.run(args, **kwargs)  # type: ignore
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-cos-proxy-0/relation-set', '-r', '581', '--app', '--file', '-')' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 495, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 344, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 841, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 930, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/nrpe_exporter/v0/nrpe_exporter.py", line 310, in _on_nrpe_relation_changed
    self.on.nrpe_targets_changed.emit(  # pyright: ignore
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 344, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 841, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 930, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 448, in _on_nrpe_targets_changed
    self.metrics_aggregator.set_target_job_data(
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/prometheus_k8s/v0/prometheus_scrape.py", line 2108, in set_target_job_data
    relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 1494, in __setitem__
    self._commit(key, value)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 1498, in _commit
    self._backend.update_relation_data(self.relation.id, self._entity, key, value)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 3059, in update_relation_data
    self.relation_set(relation_id, key, value, isinstance(_entity, Application))
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 2828, in relation_set
    self._run(*args, input_stream=content)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 2725, in _run
    raise ModelError(e.stderr)
ops.model.ModelError: ERROR cannot write relation settings

2023-09-20 09:23:37 ERROR juju.worker.uniter.operation runhook.go:153 hook "monitors-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

No response

Rendering a template with `host="$host"` breaks some legends

Bug Description

There are some dashboards which use {{ host }} in the legend, e.g. charm-telegraf. However, the grafana_dashboard lib renders every {{ host }} as $host, and this breaks the legend in those dashboards.

Snippet from the telegraf dashboard in grafana-k8s, related through cos-proxy:

"targets": [
        {
          "datasource": {
            "uid": "${prometheusds}"
          },
          "expr": "mem_used{juju_application=~\"$juju_application\"}",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 2,
          "legendFormat": "Used - $host",
          "refId": "A",
          "step": 2
        },

To Reproduce

juju deploy telegraf
juju relate nova-compute:juju-info telegraf:juju-info
juju relate telegraf:dashboards cos-proxy:dashboards
juju relate telegraf:prometheus-client cos-proxy:prometheus-target

Environment

cos-proxy is running on LXD with channel edge, revision 34 and series jammy
cos-lite is deployed on Microk8s with channel edge and revision 75

Relevant log output

I think no logs are needed for this bug; if they are, I can provide them.

Additional context

No response

Grafana panel legend format strings missing variables

Bug Description

The legend format strings in the charm source code of the form

"legendFormat": "{{project_name}} | {{instance_name}} | {{uuid}}"

Take the form

"legendFormat": " |  |  "

upon proxying the dashboards through cos-proxy.
This issue is not observed upon relating with the legacy grafana charm.

To Reproduce

juju deploy --channel edge cos-proxy
juju deploy --channel edge prometheus-libvirt-exporter ple
juju relate ple nova-compute
juju relate ple cos-proxy:prometheus-target
juju relate ple cos-proxy:dashboards

Environment

miles@mertkirpici-bastion:~$ juju status -m cos
Model  Controller  Cloud/Region  Version  SLA          Timestamp
cos    vader       k/localhost   2.9.38   unsupported  14:18:44Z

App                           Version  Status  Scale  Charm                         Channel  Rev  Address         Exposed  Message
alertmanager                  0.25.0   active      1  alertmanager-k8s              edge      64  10.152.183.66   no       
catalogue                              active      1  catalogue-k8s                 edge      14  10.152.183.140  no       
grafana                       9.2.1    active      1  grafana-k8s                   edge      76  10.152.183.231  no       
loki                          2.7.4    active      1  loki-k8s                      edge      80  10.152.183.246  no       
prometheus                    2.42.0   active      1  prometheus-k8s                edge     119  10.152.183.71   no       
prometheus-scrape-config-k8s  n/a      active      1  prometheus-scrape-config-k8s  edge      39  10.152.183.132  no       
traefik                       2.9.6    active      1  traefik-k8s                   edge     124  10.5.100.100    no 
miles@mertkirpici-bastion:~$ juju status -m openstack
Model      Controller  Cloud/Region             Version  SLA          Timestamp
openstack  vader       serverstack/serverstack  2.9.38   unsupported  14:19:16Z

SAAS                          Status  Store  URL
grafana-dashboards            active  vader  admin/cos.grafana-dashboards
prometheus-scrape-config-k8s  active  vader  admin/cos.prometheus-scrape-config-k8s

App                                 Version  Status   Scale  Charm                          Channel       Rev  Exposed  Message
cinder                              20.1.0   active       1  cinder                         yoga/stable   603  no       Unit is ready
cinder-mysql-router                 8.0.32   active       1  mysql-router                   8.0/edge       61  no       Unit is ready
cos-proxy                           n/a      active       1  cos-proxy                      edge           36  no       
glance                              24.1.0   active       1  glance                         yoga/stable   562  no       Unit is ready
glance-mysql-router                 8.0.32   active       1  mysql-router                   8.0/edge       61  no       Unit is ready
grafana-agent                                blocked      7  grafana-agent                  edge            5  no       Missing relation: 'logging-consumer'
horizon                             22.1.0   active       1  openstack-dashboard            yoga/stable   566  no       Unit is ready
keystone                            21.0.0   active       1  keystone                       yoga/stable   595  no       Application Ready
keystone-mysql-router               8.0.32   active       1  mysql-router                   8.0/edge       61  no       Unit is ready
mysql                               8.0.32   active       3  mysql-innodb-cluster           8.0/stable     43  no       Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
neutron-api                         20.2.0   active       1  neutron-api                    yoga/stable   547  no       Unit is ready
neutron-api-mysql-router            8.0.32   active       1  mysql-router                   8.0/edge       61  no       Unit is ready
neutron-api-plugin-ovn              20.2.0   active       1  neutron-api-plugin-ovn         yoga/stable    29  no       Unit is ready
nova-cloud-controller               25.1.0   active       1  nova-cloud-controller          yoga/stable   634  no       Unit is ready
nova-cloud-controller-mysql-router  8.0.32   active       1  mysql-router                   8.0/edge       61  no       Unit is ready
nova-compute                        25.1.0   active       6  nova-compute                   yoga/stable   650  no       Unit is ready
nrpe                                         active       1  nrpe                           edge           99  no       Ready
ovn-central                         22.03.0  active       3  ovn-central                    22.03/stable   57  no       Unit is ready (leader: ovnnb_db)
ovn-chassis                         22.03.0  active       6  ovn-chassis                    22.03/stable  118  no       Unit is ready
placement                           7.0.0    active       1  placement                      yoga/stable    74  no       Unit is ready
placement-mysql-router              8.0.32   active       1  mysql-router                   8.0/edge       61  no       Unit is ready
ple                                          active       6  prometheus-libvirt-exporter    edge           16  no       Ready
poe                                          active       1  prometheus-openstack-exporter                 18  no       Ready
rabbitmq-server                     3.9.13   active       1  rabbitmq-server                3.9/stable    165  no       Unit is ready
vault                               1.7.9    active       1  vault                          1.7/stable     95  no       Unit is ready (active: true, mlock: enabled)
vault-mysql-router                  8.0.32   active       1  mysql-router                   8.0/edge       61  no       Unit is ready

Relevant log output

ubuntu@juju-43758e-openstack-39:/var/log/juju$ cat unit-cos-proxy-5.log 
2023-04-10 12:02:25 INFO juju unit_agent.go:289 Starting unit workers for "cos-proxy/5"
2023-04-10 12:02:25 INFO juju.worker.apicaller connect.go:163 [656710] "unit-cos-proxy-5" successfully connected to "10.5.3.181:17070"
2023-04-10 12:02:25 INFO juju.worker.apicaller connect.go:260 [656710] password changed for "unit-cos-proxy-5"
2023-04-10 12:02:25 INFO juju.worker.apicaller connect.go:163 [656710] "unit-cos-proxy-5" successfully connected to "10.5.3.181:17070"
2023-04-10 12:02:26 INFO juju.worker.migrationminion worker.go:142 migration phase is now: NONE
2023-04-10 12:02:26 INFO juju.worker.logger logger.go:120 logger worker started
2023-04-10 12:02:26 INFO juju.worker.upgrader upgrader.go:216 no waiter, upgrader is done
2023-04-10 12:02:26 ERROR juju.worker.meterstatus runner.go:91 error running "meter-status-changed": charm missing from disk
2023-04-10 12:02:26 INFO juju.worker.uniter uniter.go:326 unit "cos-proxy/5" started
2023-04-10 12:02:26 INFO juju.worker.uniter uniter.go:631 resuming charm install
2023-04-10 12:02:26 INFO juju.worker.uniter.charm bundles.go:78 downloading ch:amd64/jammy/cos-proxy-36 from API server
2023-04-10 12:02:33 INFO juju.worker.uniter uniter.go:344 hooks are retried true
2023-04-10 12:02:33 INFO juju.worker.uniter.storage resolver.go:127 initial storage attachments ready
2023-04-10 12:02:33 INFO juju.worker.uniter resolver.go:149 found queued "install" hook
2023-04-10 12:02:34 INFO unit.cos-proxy/5.juju-log server.go:316 Running legacy hooks/install.
2023-04-10 12:02:39 INFO juju.worker.uniter.operation runhook.go:146 ran "install" hook (via hook dispatching script: dispatch)
2023-04-10 12:02:39 INFO juju.worker.uniter resolver.go:149 found queued "leader-elected" hook
2023-04-10 12:02:39 INFO juju.worker.uniter.operation runhook.go:146 ran "leader-elected" hook (via hook dispatching script: dispatch)
2023-04-10 12:02:40 INFO juju.worker.uniter.operation runhook.go:146 ran "config-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:02:40 INFO juju.worker.uniter resolver.go:149 found queued "start" hook
2023-04-10 12:02:40 INFO unit.cos-proxy/5.juju-log server.go:316 Running legacy hooks/start.
2023-04-10 12:02:41 INFO juju.worker.uniter.operation runhook.go:146 ran "start" hook (via hook dispatching script: dispatch)
2023-04-10 12:08:09 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:09:59 INFO juju.worker.uniter.operation runhook.go:146 ran "downstream-prometheus-scrape-relation-created" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:00 INFO juju.worker.uniter.operation runhook.go:146 ran "downstream-prometheus-scrape-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:01 INFO juju.worker.uniter.operation runhook.go:146 ran "downstream-prometheus-scrape-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:11 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-created" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:12 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:13 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:13 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:14 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:15 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:16 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:16 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:17 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:18 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:19 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:20 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:21 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:22 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:23 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:24 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-created" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:25 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:26 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:26 INFO juju.worker.uniter.operation runhook.go:146 ran "prometheus-target-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:29 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-created" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:30 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:31 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:10:32 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:11:17 INFO juju.worker.uniter.operation runhook.go:146 ran "downstream-grafana-dashboard-relation-created" hook (via hook dispatching script: dispatch)
2023-04-10 12:11:18 INFO juju.worker.uniter.operation runhook.go:146 ran "downstream-grafana-dashboard-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 12:11:18 INFO juju.worker.uniter.operation runhook.go:146 ran "downstream-grafana-dashboard-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 12:12:48 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:17:26 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:21:45 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:27:30 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:32:11 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:37:58 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:42:44 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:48:34 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:53:09 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 12:58:54 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:03:14 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:07:31 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:13:27 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:17:30 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:22:13 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:26:26 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:31:57 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:37:56 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:42:48 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:00 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-created" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:01 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:01 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: NOTHING!
2023-04-10 13:45:01 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: Could not find dashboard data after a relation change for <ops.model.Application ple>
2023-04-10 13:45:01 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:02 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:02 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: NOTHING!
2023-04-10 13:45:02 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: Could not find dashboard data after a relation change for <ops.model.Application ple>
2023-04-10 13:45:03 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:03 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:04 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: NOTHING!
2023-04-10 13:45:04 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: Could not find dashboard data after a relation change for <ops.model.Application ple>
2023-04-10 13:45:04 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:05 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:06 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:06 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:07 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: NOTHING!
2023-04-10 13:45:07 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: Could not find dashboard data after a relation change for <ops.model.Application ple>
2023-04-10 13:45:07 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:08 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-joined" hook (via hook dispatching script: dispatch)
2023-04-10 13:45:08 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: NOTHING!
2023-04-10 13:45:08 WARNING unit.cos-proxy/5.juju-log server.go:316 dashboards:164: Could not find dashboard data after a relation change for <ops.model.Application ple>
2023-04-10 13:45:09 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 13:48:05 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:53:20 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 13:58:17 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 14:02:34 INFO juju.worker.uniter.operation runhook.go:146 ran "dashboards-relation-changed" hook (via hook dispatching script: dispatch)
2023-04-10 14:04:09 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 14:08:43 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 14:14:03 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 14:19:37 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-04-10 14:24:18 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)

Additional context

https://git.launchpad.net/charm-prometheus-libvirt-exporter/tree/src/files/grafana-dashboards/libvirt.json#n739

Add dns_name of the target to the labels of metrics added to prometheus-k8s via cos-proxy

Enhancement Proposal

The legacy charm prometheus2 adds a label "dns_name" [1] which is used by many dashboards to correlate metrics from different targets. This label is missing when targets are added to prometheus-k8s via cos-proxy, which breaks those existing dashboards. It would be good if cos-proxy added this label as well to stay compatible with existing dashboards; a sketch of what that could look like follows the reference below.

  1. https://git.launchpad.net/charm-prometheus2/tree/src/reactive/prometheus.py#n722
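
A minimal sketch of what adding such a label could look like when cos-proxy builds its scrape jobs; the job layout and the reverse-DNS lookup below are illustrative assumptions, not the charm's actual code:

import socket

def add_dns_name_label(scrape_job: dict) -> dict:
    """Attach a dns_name label to every static target of a scrape job."""
    for static_config in scrape_job.get("static_configs", []):
        labels = static_config.setdefault("labels", {})
        host = static_config["targets"][0].split(":")[0]
        try:
            # One possible way to obtain the DNS name; prometheus2 may derive it differently.
            labels["dns_name"] = socket.gethostbyaddr(host)[0]
        except OSError:
            labels["dns_name"] = host  # fall back to the raw address
    return scrape_job

# Illustrative job as it might appear in relation data:
print(add_dns_name_label({"static_configs": [{"targets": ["10.0.0.7:9103"]}]}))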

Errors when relating with ceph-dashboard

A fairly vanilla setup with Ceph running on LXD:

root@dell:~# juju status --relations
Model  Controller  Cloud/Region  Version  SLA          Timestamp
ceph   controller  maas1         2.9.14   unsupported  02:54:05-06:00

SAAS        Status  Store  URL
grafana     active  cos    admin/cos.grafana-dashboards
prometheus  active  cos    admin/cos.prometheus-scrape

App             Version  Status   Scale  Charm           Store       Channel  Rev  OS      Message
ceph-dashboard           blocked      3  ceph-dashboard  charmstore  stable     3  ubuntu  Charm config option grafana-api-url not set
ceph-fs         15.2.14  active       1  ceph-fs         charmhub    stable    36  ubuntu  Unit is ready
ceph-mon        15.2.14  active       3  ceph-mon        charmstore  stable    62  ubuntu  Unit is ready and clustered
ceph-osd        15.2.14  active       5  ceph-osd        charmstore  stable   316  ubuntu  Unit is ready (1 OSD)
ceph-radosgw    15.2.14  active       1  ceph-radosgw    charmhub    stable   499  ubuntu  Unit is ready
cos-proxy                error        1  cos-proxy       charmhub    edge       1  ubuntu  hook failed: "dashboards-relation-changed"
easyrsa         3.0.1    active       1  easyrsa         charmstore  stable   441  ubuntu  Certificate Authority connected.

Unit                 Workload  Agent  Machine  Public address  Ports   Message
ceph-fs/0*           active    idle   2/lxd/1  192.168.122.66          Unit is ready
ceph-mon/0*          active    idle   0/lxd/0  192.168.122.15          Unit is ready and clustered
  ceph-dashboard/0*  blocked   idle            192.168.122.15          Charm config option grafana-api-url not set
ceph-mon/1           active    idle   1/lxd/0  192.168.122.12          Unit is ready and clustered
  ceph-dashboard/2   blocked   idle            192.168.122.12          Charm config option grafana-api-url not set
ceph-mon/2           active    idle   2/lxd/0  192.168.122.27          Unit is ready and clustered
  ceph-dashboard/1   blocked   idle            192.168.122.27          Dashboard is not enabled
ceph-osd/0           active    idle   0        192.168.122.40          Unit is ready (1 OSD)
ceph-osd/1*          active    idle   1        192.168.122.39          Unit is ready (1 OSD)
ceph-osd/2           active    idle   2        192.168.122.9           Unit is ready (1 OSD)
ceph-osd/4           active    idle   4        192.168.122.75          Unit is ready (1 OSD)
ceph-osd/5           active    idle   5        192.168.122.70          Unit is ready (1 OSD)
ceph-radosgw/0*      active    idle   1/lxd/1  192.168.122.80  80/tcp  Unit is ready
cos-proxy/0*         error     idle   6        192.168.122.91          hook failed: "dashboards-relation-changed" for ceph-dashboard:grafana-dashboard
easyrsa/1*           active    idle   0/lxd/2  192.168.122.25          Certificate Authority connected.

Machine  State    DNS             Inst id              Series  AZ       Message
0        started  192.168.122.40  o7k1                 focal   default  Deployed
0/lxd/0  started  192.168.122.15  juju-8022af-0-lxd-0  focal   default  Container started
0/lxd/2  started  192.168.122.25  juju-8022af-0-lxd-2  focal   default  Container started
1        started  192.168.122.39  o7k2                 focal   default  Deployed
1/lxd/0  started  192.168.122.12  juju-8022af-1-lxd-0  focal   default  Container started
1/lxd/1  started  192.168.122.80  juju-8022af-1-lxd-1  focal   default  Container started
2        started  192.168.122.9   o7k3                 focal   default  Deployed
2/lxd/0  started  192.168.122.27  juju-8022af-2-lxd-0  focal   default  Container started
2/lxd/1  started  192.168.122.66  juju-8022af-2-lxd-1  focal   default  Container started
4        started  192.168.122.75  o7k4                 focal   default  Deployed
5        started  192.168.122.70  o7k5                 focal   default  Deployed
6        started  192.168.122.91  o7k6                 focal   default  Deployed

Relation provider                       Requirer                     Interface          Type         Message
ceph-dashboard:grafana-dashboard        cos-proxy:dashboards         grafana-dashboard  regular      
ceph-mon:dashboard                      ceph-dashboard:dashboard     ceph-dashboard     subordinate  
ceph-mon:mds                            ceph-fs:ceph-mds             ceph-mds           regular      
ceph-mon:mon                            ceph-mon:mon                 ceph               peer         
ceph-mon:osd                            ceph-osd:mon                 ceph-osd           regular      
ceph-mon:prometheus                     cos-proxy:prometheus-target  http               regular      
ceph-mon:radosgw                        ceph-radosgw:mon             ceph-radosgw       regular      
ceph-radosgw:cluster                    ceph-radosgw:cluster         swift-ha           peer         
ceph-radosgw:gateway                    cos-proxy:prometheus-target  http               regular      
cos-proxy:downstream-grafana-dashboard  grafana:grafana-dashboard    grafana_dashboard  regular      
cos-proxy:downstream-prometheus-scrape  prometheus:metrics-endpoint  prometheus_scrape  regular      
easyrsa:client                          ceph-dashboard:certificates  tls-certificates   regular 

Error trace as follows:

unit-cos-proxy-0: 02:51:41 ERROR unit.cos-proxy/0.juju-log dashboards:11: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 260, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 426, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 276, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 736, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 783, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1029, in update_dashboards
    self._upset_dashboards_on_event(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1033, in _upset_dashboards_on_event
    dashboards = self._handle_reactive_dashboards(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1150, in _handle_reactive_dashboards
    t = self._strip_existing_datasources(t)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1096, in _strip_existing_datasources
    dash = template["dashboard"]
KeyError: 'dashboard'
unit-cos-proxy-0: 02:51:41 ERROR juju.worker.uniter.operation hook "dashboards-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

nrpe_relation_joined should ensure the vector binary is present

Bug Description

The relation between cos-proxy and nrpe is not sufficient for logs to reach Loki when cos-proxy is related to Loki cross-model. The reason is that vector does not start when the nrpe relation is joined. Adding filebeat to the model makes vector start successfully.

To address this bug, modify _nrpe_relation_joined() so that it ensures the vector binary is present, the same way _filebeat_relation_joined() already does; a rough sketch follows.
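
A rough sketch of the proposed change, mirroring what the filebeat path already does; the binary path and the install helper are assumptions, not the charm's actual implementation:

import os
import shutil
import subprocess

def ensure_vector_running(install_vector) -> None:
    """Install and start vector if it is not present yet.

    install_vector stands in for whatever helper the charm uses on the
    filebeat path; the binary path below is an assumption.
    """
    if shutil.which("vector") is None and not os.path.exists("/usr/local/bin/vector"):
        install_vector()
    subprocess.run(["systemctl", "start", "vector"], check=False)

# Proposed handler shape (names taken from the issue description):
# def _nrpe_relation_joined(self, event):
#     ensure_vector_running(self._install_vector)  # hypothetical helper
#     ... existing nrpe handling ...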

To Reproduce

machine model

  1. juju deploy ubuntu
  2. juju deploy nrpe
  3. juju deploy cos-proxy
  4. juju relate nrpe ubuntu
  5. juju relate cos-proxy nrpe

k8s model

  1. juju deploy cos-lite --channel=edge
  2. cross-model relate cos-proxy to loki

Environment

∮ microk8s version
MicroK8s v1.26.1 revision 4596
∮ juju --version
3.1.0-genericlinux-amd64
∮ lxd --version
5.12

Relevant log output

-

Additional context

No response

Add severity label to support PD dynamic notifications

Enhancement Proposal

All alerts coming from nrpe-exporter have a hardcoded severity set to critical.

In order to support PagerDuty dynamic notifications, it would be beneficial if the status of the check were reflected in the severity label of the alert.

An nrpe check can return one of the following status codes, which need to be translated to a severity label as follows (a small sketch follows the list):

  • 0 = OK -> this doesn't produce any alert

  • 1 = Warning -> severity=warning

  • 2 = Critical -> severity=critical

  • 3 = Unknown -> severity=error
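
A small sketch of the proposed mapping; the function name is illustrative, not existing charm code:

from typing import Optional

def nrpe_status_to_severity(status_code: int) -> Optional[str]:
    """Map an NRPE check status code to an alert severity label."""
    mapping = {
        0: None,        # OK -> no alert is produced
        1: "warning",   # Warning
        2: "critical",  # Critical
        3: "error",     # Unknown
    }
    return mapping.get(status_code, "error")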

README - commands that include `-m reactive` give errors

I am trying to follow the steps in the README to get cos-proxy up and running, but it seems that the commands including -m reactive don't work out of the box. I need to manually switch to the reactive model for them to work (and then they also work without the -m reactive argument, of course).


Also, to consume the cross-model relation I have to juju switch to the reactive controller and do:
j consume microk8s:admin/cos.prometheus

Otherwise the consume command fails (error screenshot omitted).

There might be more problems later on; I didn't get there yet. I will keep this updated.

cos-proxy fails on hook "filebeat-relation-changed"

Bug Description

In SQA testrun 5703ae6a-aea7-4a77-947b-aa86d70b55b9, cos-proxy fails in the "filebeat-relation-changed" hook.

To Reproduce

To reproduce, deploy cos and then the openstack bundle, which includes cos-proxy. This issue is not consistently reproducible; we have seen this bundle deploy without issues before.

Environment

The environment is a juju maas controller hosting a charmed openstack deployment. This deployment is connected to cos, which is hosted on a microk8s running on the same juju maas controller.

Relevant log output

2023-10-04 08:33:49 DEBUG juju.worker.uniter.operation executor.go:132 preparing operation "run relation-changed (167; unit: filebeat/53) hook" for cos-proxy/0
2023-10-04 08:33:49 DEBUG juju.worker.uniter.operation executor.go:132 executing operation "run relation-changed (167; unit: filebeat/53) hook" for cos-proxy/0
2023-10-04 08:33:49 DEBUG juju.worker.uniter agent.go:22 [AGENT-STATUS] executing: running filebeat-relation-changed hook for filebeat/53
2023-10-04 08:33:49 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server  {unix @/var/lib/juju/agents/unit-cos-proxy-0/agent.socket <nil>}
2023-10-04 08:33:49 DEBUG unit.cos-proxy/0.juju-log server.go:316 filebeat:167: Operator Framework 2.1.1+2.geb8e25a up and running.
2023-10-04 08:33:50 DEBUG unit.cos-proxy/0.juju-log server.go:316 filebeat:167: Emitting Juju event filebeat_relation_changed.
2023-10-04 08:33:50 DEBUG unit.cos-proxy/0.juju-log server.go:316 filebeat:167: Emitting custom event <VectorConfigChangedEvent via COSProxyCharm/VectorProvider[filebeat_downstream-logging]/on/config_changed[3235]>.
2023-10-04 08:33:50 ERROR unit.cos-proxy/0.juju-log server.go:316 filebeat:167: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/usr/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/usr/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 475, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/vector/v0/vector.py", line 225, in _on_log_relation_changed
    self.on.config_changed.emit(config=self.config)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 359, in _write_vector_config
    r = request.urlopen(dest)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
2023-10-04 08:33:51 ERROR juju.worker.uniter.operation runhook.go:153 hook "filebeat-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

No response

remove composite_key duplication in nrpe_lookup table

Bug Description

Once vector is started when relating to nrpe (either after #59 is resolved or by additionally deploying filebeat), the file /etc/vector/nrpe_lookup.csv contains duplicate lines for composite keys.

If you juju ssh cos-proxy/0 you would see that

ubuntu@juju-142d1e-0:~$ cat /etc/vector/nrpe_lookup.csv
composite_key,juju_application,juju_unit,command,ipaddr
10.211.43.179_check_conntrack,ubuntu,ubuntu/0,check_conntrack,10.211.43.179
10.211.43.179_check_systemd_scopes,ubuntu,ubuntu/0,check_systemd_scopes,10.211.43.179
10.211.43.179_check_reboot,ubuntu,ubuntu/0,check_reboot,10.211.43.179
10.211.43.179_check_conntrack,ubuntu,ubuntu/0,check_conntrack,10.211.43.179
10.211.43.179_check_systemd_scopes,ubuntu,ubuntu/0,check_systemd_scopes,10.211.43.179
10.211.43.179_check_reboot,ubuntu,ubuntu/0,check_reboot,10.211.43.179

This duplication causes the log format to be inconsistent, so it is not parsed correctly by Grafana when using the Loki datasource.
The code responsible for this duplication should be here; a deduplication sketch follows.
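
A minimal sketch of deduplicating the lookup file by composite_key while keeping the original order; the path is the one shown above, but the function itself is illustrative, not the charm's code:

import csv

def dedup_nrpe_lookup(path: str = "/etc/vector/nrpe_lookup.csv") -> None:
    """Rewrite the lookup CSV, keeping only the first row per composite_key."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        seen = set()
        rows = []
        for row in reader:
            if row["composite_key"] not in seen:
                seen.add(row["composite_key"])
                rows.append(row)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)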

To Reproduce

machine model

  1. juju deploy ubuntu
  2. juju deploy nrpe
  3. juju deploy cos-proxy
  4. juju relate nrpe ubuntu
  5. juju relate cos-proxy nrpe

k8s model

  1. juju deploy cos-lite --channel=edge
  2. cross-model relate cos-proxy to loki

Environment

∮ microk8s version
MicroK8s v1.26.1 revision 4596
∮ juju --version
3.1.0-genericlinux-amd64
∮ lxd --version
5.12

Relevant log output

-

Additional context

No response

After changing nagios_host_context in nrpe related to cos-proxy, the old check names are still present in nrpe_lookup.csv

Bug Description

After changing nagios_host_context in nrpe related to cos-proxy, all nrpe targets are re-added to nrpe_lookup.csv with the updated names. Old checks, however, can still be found there.

To Reproduce

  1. Deploy cos-proxy and nrpe, relate them using the monitors relation
  2. Wait for the model to settle, inspect nrpe_lookup.csv contents
  3. Change nrpe config:
juju config nrpe-container nagios_host_context=bootstack-test
  4. Wait for the model to settle and check nrpe_lookup.csv. New (the same but updated) checks with the "bootstack-test" prefix will show up in the juju_unit column, but the old entries without the prefix will still be there.

Environment

rev 46 of cos-proxy; I haven't tried to reproduce it on rev 47 yet

Relevant log output

It's the same env that was reported in #88.

Additional context

No response

Error when NRPE checks are disabled

Bug Description

As reported by @chanchiwai-ray, if NRPE checks are disabled (which is apparently new functionality they are integrating), the code that generates alerts and scrape jobs fails here with AttributeError: 'NoneType' object has no attribute 'values'. A sketch of the missing guard is shown below.
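
As a rough illustration only (the function and argument names are placeholders, not the actual charm code), the generating code presumably needs a guard of this kind:

from typing import Optional

def build_alert_specs(monitors: Optional[dict]) -> list:
    """Return alert specs for the given monitors, tolerating disabled checks."""
    if monitors is None:
        # NRPE checks disabled: return nothing instead of crashing with
        # AttributeError: 'NoneType' object has no attribute 'values'.
        return []
    return list(monitors.values())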

To Reproduce

Environment

Relevant log output

-

Additional context

No response

Inconsistent use of spaces and tabs in alert message produces an invalid YAML

Bug Description

After sending the NRPE alert generated by cos-proxy to PagerDuty, it looks like this:

{
  "client": "Alertmanager",
  "client_url": "http://redacted:80/cos-alertmanager/#/alerts?receiver=pagerduty",
  "contexts": null,
  "description": "[FIRING:1] redacted-hostname openstack 9fd94096-2f9f-4946-8bb2-e6e5b00fc37b redacted-hostname (CheckIpmiNrpeAlert check_ipmi redacted_hostname.maas redacted_ip redacted_ip:5666 juju_openstack_9fd9409_redacted_redacted_redacted_redacted_redacted_hostname_check_ipmi_prometheus_scrape nrpe-host nrpe-host/23 critical)",
  "event_type": "trigger",
  "incident_key": "6a2aa69cd89b7dcdd1f4d6af3208f42ddf07015f053b4a5276d993ae712ec052",
  "service_key": "5043bd8b1c644105c03cb37fad9c4115",
  "details": {
    "firing": "Labels:
 - alertname = CheckIpmiNrpeAlert
 - command = check_ipmi
 - dns_name = redacted_hostname.maas
 - host = redacted_ip
 - instance = redacted_ip:5666
 - job = juju_openstack_9fd9409_redacted_redacted_redacted_redacted_redacted_hostname_check_ipmi_prometheus_scrape
 - juju_application = redacted-hostname
 - juju_model = openstack
 - juju_model_uuid = 9fd94096-2f9f-4946-8bb2-e6e5b00fc37b
 - juju_unit = redacted-hostname
 - nrpe_application = nrpe-host
 - nrpe_unit = nrpe-host/23
 - severity = critical
Annotations:
 - description = Check provided by nrpe_exporter in model openstack is failing.
Failing check = check_ipmi
Unit = redacted-hostname
Value = 2
Legend:
\tStatusOK        = 0
\tStatusWarning   = 1
\tStatusCritical  = 2
\tStatusUnknown   = 3
 - summary = Unit redacted-hostname: check_ipmi critical.
Source: http://redacted:80/cos-prometheus-0/graph?g0.expr=avg_over_time%28command_status%7Bcommand%3D%22check_ipmi%22%2Cjuju_unit%3D%22redacted-hostname%22%7D%5B15m%5D%29+%3E+1+or+%28absent_over_time%28up%7Bjuju_unit%3D%22redacted-hostname%22%7D%5B10m%5D%29+%3D%3D+1%29&g0.tab=1
",
    "num_firing": "1",
    "num_resolved": "0",
    "resolved": ""
  }
}

In the details field, under Legend, all legend entries use \t instead of spaces, which breaks any client that tries to parse this as YAML; the short snippet below demonstrates the failure.
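
A quick way to see the problem, as a standalone snippet (assuming PyYAML is available); YAML forbids tabs for indentation, so a tab-indented legend is rejected while a space-indented one parses:

import yaml

legend_with_tabs = "Legend:\n\tStatusOK: 0\n\tStatusWarning: 1\n"
legend_with_spaces = "Legend:\n  StatusOK: 0\n  StatusWarning: 1\n"

try:
    yaml.safe_load(legend_with_tabs)
except yaml.YAMLError as err:
    print("tab-indented legend is rejected:", err)

print(yaml.safe_load(legend_with_spaces))  # {'Legend': {'StatusOK': 0, 'StatusWarning': 1}}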

To Reproduce

Integrate cos-proxy, nrpe, alertmanager and Pagerduty and generate the alert.

Environment

latest/edge of all COS components available on the date of this bug report.

Relevant log output

Event Details output from PD is pasted above.

Additional context

No response

All alerts are green, but nrpe-exporter in error trying to collect command_status

Bug Description

While deploying cos-proxy in an environment running an old version of nrpe which didn't provide target-address in the relation data, nrpe-exporter was not able to collect command_status, since /etc/vector/nrpe_lookup.csv was missing the IP addresses of the nrpe units to check.

Although command_status was missing, all alerts were green in Prometheus, meaning the environment was not monitored at all. If command_status for a check is missing, I think this should be reported with an alert; a sketch of such a rule is shown below.
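
As an illustration, an alert rule along these lines could flag the missing metric; the rule name, label matchers, and time windows are assumptions, not something cos-proxy ships today:

# Hypothetical rule, in the dict form carried over the prometheus_scrape relation.
missing_command_status_rule = {
    "alert": "NrpeCommandStatusMissing",
    "expr": 'absent_over_time(command_status{juju_application="nrpe"}[15m]) == 1',
    "for": "5m",
    "labels": {"severity": "critical"},
    "annotations": {
        "summary": "No command_status samples were scraped for NRPE targets.",
        "description": "nrpe_exporter produced no command_status metric in the last 15 minutes; the checks may not be monitored at all.",
    },
}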

To Reproduce

Use cos-proxy related to an old nrpe revision

Environment

N/A

Relevant log output

Dec 04 12:02:01 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:01.299Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:01 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:01.719Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:01 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:01.846Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:01 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:01.871Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:01 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:01.975Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:02 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:02.486Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:02 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:02.611Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:02 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:02.735Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:02 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:02.854Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"
Dec 04 12:02:03 juju-46917d-17-lxd-22 nrpe-exporter[62968]: ts=2023-12-04T12:02:03.035Z caller=log.go:124 level=error msg="Error dialing NRPE server" err="dial tcp :5666: connect: connection refused"

Additional context

No response

Improve readability of NRPE alert titles

Enhancement Proposal

Titles of NRPE alerts coming from cos-proxy are very long and difficult to read once they end up in tools like PagerDuty, especially in environments that use long Nagios host context values and long hostnames, for example:

[FIRING:1] xxxxxxxxx-yyyyyyyyyy-zzzzzzzzzzzz-site-longhostname OpenStack
00000000-00000000-00000000-00000000 xxxxxxxxx-yyyyyyyyyy-zzzzzzzzzzzz-site-longhostname
(CheckNtpmonNrpeAlert check_ntpmon brxyz.longhostname.maas 10.256.256.1 10.256.256.1:5666
juju_openstack_1234567_xxxxxxxxx_yyyyyyyyyy_zzzzzzzzzzzz_site_longhostname_check_ntpmon_prometheus_scrape
remote-00000000000000000000000000000000 remote-00000000000000000000000000000000/2 critical)

The data is anonymised of course but I kept the original length of each string.

As an operator, I find many parts of the title unhelpful, especially:

  • remote app and unit names: remote-00000000000000000000000000000000 and remote-00000000000000000000000000000000/2 aren't very useful for human operators
  • repeated host IP and NRPE endpoint: 10.256.256.1 10.256.256.1:5666, there's also an FQDN there
  • juju model uuid 00000000-00000000-00000000-00000000 doesn't need to be in the title (in my opinion)
  • hostname is repeated at least 3 times

The same alert title looks like this when it's sent by Nagios:

CRITICAL: 'xxxxxxxxx-yyyyyyyyyy-zzzzzzzzzzzz-site-longhostname-check_ntpmon' on 'xxxxxxxxx-yyyyyyyyyy-zzzzzzzzzzzz-site-longhostname'

It's not perfect but it gives me all the necessary information.

An example of a leaner NRPE alert title coming from COS Proxy that is much more readable:

[FIRING:1] xxxxxxxxx-yyyyyyyyyy-zzzzzzzzzzzz-site-longhostname CheckNtpmonNrpeAlert critical

All the extra details could be still available in the alert description.

I believe it's worth starting a discussion.

cos-proxy unit is stuck in blocked/idle

Bug Description

After setting up relations between prometheus/scrape-config-interval and cos-proxy and nrpe, all NRPE alerts showed up in Prometheus, yet the unit stayed in the blocked state, complaining about Missing one of (Prometheus|target|nrpe) relation(s).

This is strange since all relations are in place.

To Reproduce

Just deploy cos-proxy rev 52 as always, details below.

Environment

COS:

scrape-interval-config-monitors/0*  active    idle
prometheus/0*                       active    idle

relations:
scrape-interval-config-monitors:metrics-endpoint  prometheus:metrics-endpoint  prometheus_scrape      regular

OpenStack:

cos-proxy-monitors/0*                    blocked   idle   3/lxd/24                        Missing one of (Prometheus|target|nrpe) relation(s)

relations:
cos-proxy-monitors:downstream-prometheus-scrape                      cos-scrape-interval-config-monitors:configurable-scrape-jobs  prometheus_scrape               regular
nrpe-compute:monitors                                                cos-proxy-monitors:monitors                                   monitors                        regular
nrpe-control:monitors                                                cos-proxy-monitors:monitors                                   monitors                        regular
nrpe-lxd:monitors                                                    cos-proxy-monitors:monitors                                   monitors                        regular
nrpe-storage:monitors                                                cos-proxy-monitors:monitors                                   monitors                        regular

Relevant log output

$ juju show-status-log cos-proxy-monitors/0 
Time                   Type       Status     Message
28 Nov 2023 15:33:44Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/78
28 Nov 2023 15:33:46Z  juju-unit  executing  running monitors-relation-joined hook for nrpe-lxd/79
28 Nov 2023 15:33:47Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/79
28 Nov 2023 15:33:49Z  juju-unit  executing  running monitors-relation-joined hook for nrpe-lxd/8
28 Nov 2023 15:33:49Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/8
28 Nov 2023 15:33:52Z  juju-unit  executing  running monitors-relation-joined hook for nrpe-lxd/80
28 Nov 2023 15:33:52Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/80
28 Nov 2023 15:33:54Z  juju-unit  executing  running monitors-relation-joined hook for nrpe-lxd/81
28 Nov 2023 15:33:55Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/81
28 Nov 2023 15:33:57Z  juju-unit  executing  running monitors-relation-joined hook for nrpe-lxd/9
28 Nov 2023 15:33:58Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/9
28 Nov 2023 15:34:00Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/0
28 Nov 2023 15:34:02Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/1
28 Nov 2023 15:34:05Z  juju-unit  idle       
28 Nov 2023 15:51:17Z  juju-unit  executing  running downstream-prometheus-scrape-relation-created hook
28 Nov 2023 15:51:18Z  juju-unit  idle       
28 Nov 2023 16:17:56Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/5
28 Nov 2023 16:42:55Z  juju-unit  executing  running monitors-relation-changed hook for nrpe-lxd/4
28 Nov 2023 17:33:51Z  juju-unit  idle       
29 Nov 2023 16:04:15Z  workload   blocked    Missing one of (Prometheus|target|nrpe) relation(s)

Additional context

jammy, cos-proxy rev 52, all COS components using latest/edge revisions as of 2023-11-28

Update 2023-11-30:

I noticed that one nrpe unit's relation data is missing:

      nrpe-lxd/38:
        in-scope: false
        data: {}

While normally data should look like this:

      nrpe-lxd/75:
        in-scope: true
        data:
          egress-subnets: redacted/32
          ingress-address: redacted
          machine_id: 3/lxd/21
          model_id: redacted
          monitors: '{''monitors'': {''remote'': {''nrpe'': {''corosync_proc'': {''command'':
            ''check_corosync_proc''}, ''crm_status'': {''command'': ''check_crm_status''},
            ''pacemakerd_proc'': {''command'': ''check_pacemakerd_proc''}, ''prometheus_grok_exporter_http'':
            {''command'': ''check_prometheus_grok_exporter_http''}, ''telegraf_http'':
            {''command'': ''check_telegraf_http''}, ''check_conntrack'': ''check_conntrack'',
            ''check_systemd_scopes'': ''check_systemd_scopes'', ''check_reboot'':
            ''check_reboot''}}}, ''version'': ''0.3''}'
          private-address: redacted
          target-address: redacted
          target-id: redacted-placement-0

I also noticed that this missing nrpe unit was never processed:

$ juju show-status-log cos-proxy-monitors/0 --days 7 | grep nrpe-lxd/38
<empty>

versus

$ juju show-status-log cos-proxy-monitors/0 --days 7 | grep nrpe-lxd/75
28 Nov 2023 15:33:35Z  juju-unit  executing    running monitors-relation-joined hook for nrpe-lxd/75
28 Nov 2023 15:33:36Z  juju-unit  executing    running monitors-relation-changed hook for nrpe-lxd/75

cos-proxy fails on hook "downstream-logging-relation-changed" because of SSL: CERTIFICATE_VERIFY_FAILED

Bug Description

In SQA testrun ce8325a0-c0fe-46f8-af40-acd7b287c8de, cos-proxy fails in the "downstream-logging-relation-changed" hook.

To Reproduce

To reproduce, deploy cos and then the charmed kubernetes bundle, which includes cos-proxy.
This issue seems to happen after Dec 13, after some cos update. This issue is not consistently reproducible; we have seen this bundle deploy without it before.

Environment

The environment is a juju maas controller hosting a charmed kubernetes deployment. This deployment is connected to cos, which is hosted on a microk8s running on the same juju maas controller.

Relevant log output

2023-12-13 21:56:09 DEBUG juju.worker.uniter agent.go:22 [AGENT-STATUS] executing: running downstream-logging-relation-changed hook for cos-loki/0
2023-12-13 21:56:09 DEBUG juju.worker.uniter.runner runner.go:719 starting jujuc server  {unix @/var/lib/juju/agents/unit-cos-proxy-0/agent.socket <nil>}
2023-12-13 21:56:09 DEBUG unit.cos-proxy/0.juju-log server.go:325 downstream-logging:32: ops 2.8.0+8.g26c6e95 up and running.
2023-12-13 21:56:09 DEBUG unit.cos-proxy/0.juju-log server.go:325 downstream-logging:32: Emitting Juju event downstream_logging_relation_changed.
2023-12-13 21:56:09 DEBUG unit.cos-proxy/0.juju-log server.go:325 downstream-logging:32: Emitting custom event <VectorConfigChangedEvent via COSProxyCharm/VectorProvider[filebeat_downstream-logging]/on/config_changed[133]>.
2023-12-13 21:56:09 ERROR unit.cos-proxy/0.juju-log server.go:325 downstream-logging:32: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/lib/python3.10/http/client.py", line 1455, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/usr/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib/python3.10/ssl.py", line 1100, in _create
    self.do_handshake()
  File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 517, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 340, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 842, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/vector/v0/vector.py", line 227, in _on_log_relation_changed
    self.on.config_changed.emit(config=self.config)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 340, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 842, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 395, in _write_vector_config
    r = request.urlopen(dest)
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)>

Additional context

No response

COS-Proxy in an LXD in the error state (`hook failed: "downstream-logging-relation-changed"`) due to DNS errors after trying to resolve Kubernetes service FQDN

Bug Description

I encountered a cos-proxy instance in the error state, failing on DNS resolution. I managed to trace DNS queries and it appears that an LXD in my Openstack model tries to resolve Loki's Kubernetes service address:

13:18:45.046592 00:16:3e:xx:xx:xx > xx:xx:xx:xx:xx:xx, ethertype IPv4 (0x0800), length 145: (tos 0x0, ttl 64, id 41035, offset 0, flags [none], proto UDP (17), length 131)
    x.x.x.x.54973 > y.y.y.y.53: 28560+ [1au] AAAA? loki-0.loki-endpoints.cos.svc.cluster.redacted.redacted.redacted. (103)                                                                                           

Since Openstack LXDs don't have access to Kubernetes DNS, it fails and the charm is stuck in the error state.

To Reproduce

Relate cos-proxy in a MAAS Juju model running in an LXD to loki running on microk8s.

Environment

cos-proxy rev 46 (latest/edge) in a MAAS model (running in a jammy LXD)
loki ch:loki-k8s rev 91 (stable) on microk8s

Relation:

cos-loki:logging                                                     cos-proxy:downstream-logging                         loki_push_api                   regular      

Relevant log output

# juju debug-log -i cos-proxy/0

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 495, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 344, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 841, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 930, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/vector/v0/vector.py", line 227, in _on_log_relation_changed
    self.on.config_changed.emit(config=self.config)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 344, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 841, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 930, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 376, in _write_vector_config
    r = request.urlopen(dest)
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 1377, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

# juju status cos-proxy

Unit                           Workload  Agent  Machine  Public address  Ports     Message
cos-proxy/0*                   error     idle   3/lxd/6  y.z.z.z                   hook failed: "downstream-logging-relation-changed"
...

Additional context

No response

Add nrpe unit name label to the metrics exported by the nrpe exporter

Enhancement Proposal

Currently the metrics exported by the nrpe exporter have the label juju_application, which contains the name of the nrpe application; however, there's no way of knowing which unit of the application the check relates to.

This might be a useful label to use in the summary section of the firing alert to give the operator a pre-built command to get the NRPE check message, which cannot be proxied. Example (after calling label_replace() to get the check field):
juju run-action --wait {{ $labels.juju_unit }} run-nrpe-check name={{ $labels.check }}

The operator receiving the alert would just copy and paste the command in the terminal to run the check manually and read the NRPE check message.

Hook dashboards-relation-changed fails

Bug Description

When relating cos-proxy with prometheus-libvirt-exporter on the dashboards relation, the hook fails with AttributeError: 'str' object has no attribute 'get'; a defensive-parsing sketch is shown below.
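
A sketch of how a template arriving as a JSON string could be coerced before .get() is called; this is illustrative only and does not reflect the actual grafana_dashboard library code:

import json

def coerce_dashboard_template(template):
    """Return a dict for a dashboard template that may arrive as a JSON string."""
    if isinstance(template, str):
        try:
            template = json.loads(template)
        except json.JSONDecodeError:
            return {}  # not a payload we can use
    return template if isinstance(template, dict) else {}

# With the coercion in place, reading the nested dashboard is safe:
template = coerce_dashboard_template('{"dashboard": {"title": "libvirt"}}')
dash = template.get("dashboard", {}) or template
print(dash)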

To Reproduce

  1. juju deploy cos-proxy --channel edge --series focal --to lxd:15
  2. juju add-relation cos-proxy:dashboards prometheus-libvirt-exporter:dashboards

Environment

cos-proxy rev. 52 from edge channel series focal
prometheus-libvirt-exporter rev 1 channel stable

Relevant log output

unit-cos-proxy-3: 17:22:58 ERROR unit.cos-proxy/3.juju-log dashboards:386: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 519, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1651, in update_dashboards
    self._upset_dashboards_on_event(event)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1655, in _upset_dashboards_on_event
    dashboards = self._handle_reactive_dashboards(event)
  File "/var/lib/juju/agents/unit-cos-proxy-3/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1797, in _handle_reactive_dashboards
    dash = t.get("dashboard", {}) or t
AttributeError: 'str' object has no attribute 'get'
unit-cos-proxy-3: 17:22:58 ERROR juju.worker.uniter.operation hook "dashboards-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

No response

Can't import ceph-dashboards via cos-proxy.

Bug Description

When importing ceph-dashboards via cos-proxy, the dashboards-relation-changed hook goes into an error state.

To Reproduce

  1. juju add-relation ceph-dashboard:grafana-dashboard cos-proxy:dashboards

Environment

cos-proxy rev. 22 channel edge

Relevant log output

unit-cos-proxy-0: 09:26:52 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-cos-proxy-0: 09:27:12 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-cos-proxy-0: 09:27:13 DEBUG unit.cos-proxy/0.juju-log dashboards:479: Operator Framework 2.1.1+2.geb8e25a up and running.
unit-cos-proxy-0: 09:27:13 DEBUG unit.cos-proxy/0.juju-log dashboards:479: Emitting Juju event dashboards_relation_changed.
unit-cos-proxy-0: 09:27:13 ERROR unit.cos-proxy/0.juju-log dashboards:479: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 464, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1659, in update_dashboards
    self._upset_dashboards_on_event(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1663, in _upset_dashboards_on_event
    dashboards = self._handle_reactive_dashboards(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1809, in _handle_reactive_dashboards
    dash = json.dumps(dash)
  File "/usr/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Circular reference detected
unit-cos-proxy-0: 09:27:13 ERROR juju.worker.uniter.operation hook "dashboards-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
unit-cos-proxy-0: 09:27:13 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook

Additional context

No response

downstream-prometheus-scrape-relation-joined hook crashes in rev 51

Bug Description

I've experienced a crash in cos-proxy rev 51 in the downstream-prometheus-scrape-relation-joined hook.

To Reproduce

Deploy cos-proxy rev 51 and relate it to scrape-interval-config.
Relate an additional nrpe application to cos-proxy.

juju show-status-log cos-proxy-monitors/4

23 Nov 2023 13:28:15Z  juju-unit  executing    running monitors-relation-changed hook for nrpe-lxd/1                                                                                                                                           
23 Nov 2023 13:28:17Z  juju-unit  idle                                                                                 
23 Nov 2023 13:29:15Z  juju-unit  executing    running downstream-prometheus-scrape-relation-created hook                                                                                                                                      
23 Nov 2023 13:29:16Z  juju-unit  executing    running downstream-prometheus-scrape-relation-joined hook for cos-scrape-interval-config-monitors/0                                                                                             
23 Nov 2023 13:42:21Z  juju-unit  error        hook failed: "downstream-prometheus-scrape-relation-joined"                                                                                                                                     
23 Nov 2023 13:42:26Z  juju-unit  executing    running downstream-prometheus-scrape-relation-joined hook for cos-scrape-interval-config-monitors/0                                                                                             
23 Nov 2023 13:55:41Z  juju-unit  error        hook failed: "downstream-prometheus-scrape-relation-joined"                                                                                                                                     
23 Nov 2023 13:55:51Z  juju-unit  executing    running downstream-prometheus-scrape-relation-joined hook for cos-scrape-interval-config-monitors/0                                                                                             
23 Nov 2023 14:09:18Z  juju-unit  error        hook failed: "downstream-prometheus-scrape-relation-joined"                                                                                                                                     
23 Nov 2023 14:09:38Z  juju-unit  executing    running downstream-prometheus-scrape-relation-joined hook for cos-scrape-interval-config-monitors/0                                                                                             
23 Nov 2023 14:23:19Z  juju-unit  error        hook failed: "downstream-prometheus-scrape-relation-joined"                                                                                                                                     
23 Nov 2023 14:24:00Z  juju-unit  executing    running downstream-prometheus-scrape-relation-joined hook for cos-scrape-interval-config-monitors/0                                                                                             
23 Nov 2023 14:37:25Z  juju-unit  error        hook failed: "downstream-prometheus-scrape-relation-joined"                                                                                                                                     

Environment

cos-proxy rev 51

Relevant log output

unit-cos-proxy-monitors-4: 14:46:39 ERROR unit.cos-proxy-monitors/4.juju-log downstream-prometheus-scrape:566: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/./src/charm.py", line 517, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/venv/ops/framework.py", line 340, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/venv/ops/framework.py", line 842, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/./src/charm.py", line 442, in _downstream_prometheus_scrape_relation_joined
    self._on_nrpe_targets_changed(None)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-4/charm/./src/charm.py", line 475, in _on_nrpe_targets_changed
    for alert in event.current_alerts:
AttributeError: 'NoneType' object has no attribute 'current_alerts'

Additional context

No response

Add support for scrape interval and scrape timeout.

Enhancement Proposal

Services configured by cos-proxy in Prometheus have a default scrape_interval of 1m and a default scrape_timeout of 10s. These intervals may not be appropriate for nrpe checks, which are usually executed every 5 minutes by Nagios with a timeout of 30 seconds.

The default intervals may overload the units being monitored and produce false negatives.

The scrape interval should default to 5m.
The scrape timeout should default to 30s.

Also, as an operator, I would like both to be configurable to handle corner cases; a sketch of the resulting scrape job is shown below.
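
A sketch of what the resulting scrape job could look like, with hypothetical charm config options feeding the standard Prometheus fields:

def build_scrape_job(targets, interval="5m", timeout="30s"):
    """Build a scrape job dict with a configurable interval and timeout.

    interval and timeout would come from (hypothetical) charm config options;
    the field names are the standard Prometheus ones.
    """
    return {
        "scrape_interval": interval,
        "scrape_timeout": timeout,
        "static_configs": [{"targets": targets}],
    }

print(build_scrape_job(["10.0.0.7:5666"]))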

Improve status message when settling is taking too long

Enhancement Proposal

When deploying COS Proxy at scale, it takes quite a while for the units to settle if the proxy is related to COS prior to being related to the workloads.

This is easily resolved by changing the order of operations: wait for the proxy to settle before relating it to COS.

Currently, the status message is empty while the proxy is processing. This should be improved by setting a status message whenever it takes too long (>5m?) for it to settle.

It should also say that it's processing proxied relations when that is the case; a sketch of such a status update is shown below.
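
A minimal sketch of the kind of status update being proposed, using the standard ops statuses; the message wording and how "pending" is counted are placeholders:

from ops.model import ActiveStatus, MaintenanceStatus

def update_settling_status(unit, pending_relations: int) -> None:
    """Surface progress instead of leaving the status message empty."""
    if pending_relations:
        unit.status = MaintenanceStatus(
            "processing {} proxied relation(s)".format(pending_relations)
        )
    else:
        unit.status = ActiveStatus()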

Alerts rules are missing if more than one nrpe:monitors relation exists

Bug Description

After relating cos-proxy to nrpe over monitors, prometheus alert rules are created only for the last added relation. Scrape jobs work fine for all added nrpe applications. The last added nrpe relation "wins"; existing alert rules are overwritten.

To Reproduce

Relate more than one nrpe application to cos-proxy and list the alert rules.

Environment

cos-proxy edge 42
ch:nrpe latest/stable rev 97 and rev 86

Relevant log output

Nothing in the logs, alert_rules in relation data are missing.

Additional context

No response

cos-proxy fails on hook "downstream-logging-relation-changed"

Bug Description

In SQA testrun 651a309e-a3a6-44ab-b8a7-7905303fbc0a, cos-proxy fails in the "downstream-logging-relation-changed" hook.

To Reproduce

To reproduce, deploy cos and then the charmed kubernetes bundle, which includes cos-proxy. This issue is not consistently reproducible; we have seen this bundle deploy without issues before.

Environment

The environment is a juju maas controller hosting a charmed kubernetes deployment. This deployment is connected to cos, which is hosted on a microk8s running on the same juju maas controller.

Relevant log output

In the debug-log we see the following message:


2023-09-06 15:57:34 ERROR unit.cos-proxy/0.juju-log server.go:325 downstream-logging:35: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/lib/python3.10/http/client.py", line 942, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.10/socket.py", line 845, in create_connection
    raise err
  File "/usr/lib/python3.10/socket.py", line 833, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 475, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/vector/v0/vector.py", line 225, in _on_log_relation_changed
    self.on.config_changed.emit(config=self.config)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 359, in _write_vector_config
    r = request.urlopen(dest)
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 1377, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

My guess is that there was a hiccup in the networking. If that is the case, though, I would expect it to resolve when the hook is retried.
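
If it really is a transient network problem, one mitigation would be to catch the error and defer the event instead of letting the hook fail. A rough sketch, with the URL and function signature simplified from the traceback above:

# Sketch: tolerate a transient connection failure when fetching the
# config endpoint, deferring the event instead of crashing the hook.
import logging
from urllib import request
from urllib.error import URLError

logger = logging.getLogger(__name__)


def _write_vector_config(self, event):  # simplified signature, illustrative only
    dest = "http://localhost:8686/health"  # placeholder URL for the sketch
    try:
        request.urlopen(dest, timeout=10)
    except URLError as err:
        logger.warning("Could not reach %s (%s); deferring event", dest, err)
        event.defer()
        return
    # ... continue writing the vector config as before ...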

Additional context

More logs and configs can be found here: https://oil-jenkins.canonical.com/artifacts/651a309e-a3a6-44ab-b8a7-7905303fbc0a/index.html

downstream-logging-relation-changed fails when Loki uses self-signed certs

Bug Description

cos-proxy should provide an option, similar to grafana-agent, to skip TLS verification for insecure endpoints; otherwise the relation with Loki fails when COS Lite is deployed with the provided tls-overlay.
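
A minimal sketch of what such an option could look like, assuming the failing call is the urlopen shown in the traceback below; the skip_tls_verify flag name is illustrative, not an existing config option:

# Sketch: optionally disable certificate verification for the HTTPS call,
# gated behind a (hypothetical) skip_tls_verify charm config option.
import ssl
from urllib import request


def open_endpoint(url: str, skip_tls_verify: bool = False):
    """Open url, optionally accepting self-signed certificates."""
    context = None
    if skip_tls_verify:
        # Similar in spirit to grafana-agent's insecure_skip_verify option.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return request.urlopen(url, context=context, timeout=10)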

To Reproduce

  1. juju deploy ./cos_lite.yaml --trust --overlay ./offers_overlay.yaml --overlay ./storage_overlay.yaml --overlay tls-overlay.yaml
  2. juju consume cos-microk8s-localhost:admin/cos.loki-logging cos-loki-logging
  3. juju deploy cos-proxy
  4. juju add-relation cos-proxy:downstream-logging cos-loki-logging:logging

Environment

cos-proxy rev. 52 channel edge

Relevant log output

Traceback (most recent call last):
  File "./src/charm.py", line 519, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/vector/v0/vector.py", line 227, in _on_log_relation_changed
    self.on.config_changed.emit(config=self.config)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 395, in _write_vector_config
    r = request.urlopen(dest)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)>

Additional context

No response

cos-proxy charm tries to scrape `http://<nrpe_host>:5666/metrics`

Bug Description

After relating cos-proxy to nrpe over the monitors interface, the charm ended up trying to scrape http://<nrpe_host>:5666/metrics from each related nrpe unit, which fails:

2023-08-09 10:48:35 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:543: Could not scrape target: [Errno 104] Connection reset by peer

Hostname and port 5666 confirmed after adding extra debug logs:

./unit-cos-proxy-monitors-1.log:322479:2023-08-10 04:19:52 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:557: przemeklal: 10.193.0.130 5666
./unit-cos-proxy-monitors-1.log:322481:2023-08-10 04:19:52 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:557: przemeklal: 10.193.0.88 5666

The same could be found with tcpdump.

Also, on the nagios-nrpe-server side, the failed connection attempts initiated by cos-proxy look like this:

Aug 09 12:16:39 juju-7df8cc-26-lxd-0 nrpe[3008317]: Error: (!log_opts) Could not complete SSL handshake with 10.193.0.59: 1

nagios-nrpe-server listens on port 5666, but that is not a scrapable HTTP endpoint: it speaks the NRPE protocol, usually also secured by SSL. After the relation was added, the charm spent roughly 18 hours in the active/executing state, repeatedly trying to scrape http://nrpe:5666/metrics on every nrpe unit, before eventually settling down into active/idle.

At the same time, vector and nrpe-exporter worked fine on the cos-proxy unit and all nrpe targets were showing up in prometheus.

So apart from the slowness and being stuck in executing for 18 hours, everything else worked.

Please see the attached juju debug logs: cos-proxy-monitors-1_2023-08-10_logs.tar.gz
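
One way to avoid hammering the NRPE daemon would be to skip any HTTP probe against targets on the raw NRPE port, since only the nrpe_exporter endpoint speaks HTTP. A hedged sketch, not the charm's actual target handling:

# Sketch: never try to fetch /metrics directly from the NRPE daemon.
# Targets on the NRPE port should be scraped via nrpe_exporter instead.
NRPE_PORT = 5666


def is_http_scrapable(target: str) -> bool:
    """Return False for raw NRPE endpoints such as '10.193.0.130:5666'."""
    host, _, port = target.rpartition(":")
    return bool(host) and port != str(NRPE_PORT)


assert is_http_scrapable("10.193.0.130:9103") is True
assert is_http_scrapable("10.193.0.130:5666") is False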

To Reproduce

  1. Deploy cos proxy rev 42.
  2. Relate nrpe:monitors to it.
  3. Watch debug log.

Environment

cos-proxy rev 42 (edge)
nrpe rev 65 / nrpe rev 42 (two applications)

Relevant log output

As attached in the full /var/log/juju in the bug description:

2023-08-09 10:48:38 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:543: Could not scrape target: [Errno 104] Connection reset by peer
2023-08-09 10:48:38 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:543: Could not scrape target: [Errno 104] Connection reset by peer
2023-08-09 10:48:38 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:543: Could not scrape target: [Errno 104] Connection reset by peer
2023-08-09 10:48:38 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:543: Could not scrape target: [Errno 104] Connection reset by peer
2023-08-09 10:48:38 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:543: Could not scrape target: [Errno 104] Connection reset by peer
2023-08-09 10:48:38 DEBUG unit.cos-proxy-monitors/1.juju-log server.go:319 monitors:543: Could not scrape target: [Errno 104] Connection reset by peer

Additional context

No response

NRPE metrics not properly exported

When cos-proxy is related over nrpe-external-master, the nrpe_exporter only exports its internal Prometheus metrics rather than the full NRPE check results.

The target is visible in Prometheus and is labelled correctly, but the actual check metrics are not available.
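
For the exporter to return real check results, the scrape job presumably needs to tell it which NRPE command to run against which host. The sketch below illustrates that idea; the metrics_path and params keys are assumptions about the exporter's query interface, not confirmed behaviour:

# Sketch: a Prometheus scrape job that asks nrpe_exporter to execute a
# specific NRPE check, instead of scraping the exporter's bare /metrics.
def nrpe_scrape_job(exporter_address: str, nrpe_unit_address: str, command: str) -> dict:
    return {
        "job_name": f"nrpe_{command}",
        "metrics_path": "/export",                    # assumed exporter endpoint
        "params": {
            "command": [command],                     # assumed query parameter
            "target": [f"{nrpe_unit_address}:5666"],  # assumed query parameter
        },
        "static_configs": [
            {
                "targets": [exporter_address],
                "labels": {"nrpe_command": command},
            }
        ],
    }


job = nrpe_scrape_job("10.0.0.5:9275", "10.193.0.130", "check_load")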

cos-proxy fails on hook "monitors-relation-changed"

Bug Description

In SQA testrun ea8cc747-5fbe-4a8b-b846-1e1e8139a53b, cos-proxy fails in hook "monitors-relation-changed".

To Reproduce

To reproduce, deploy cos and then the charmed kubernetes bundle, which includes cos. This issue is not consistently reproducible; we have seen this bundle deploy without issues before.

Environment

The environment is a Juju MAAS controller hosting a Charmed Kubernetes deployment. This deployment is connected to COS, which runs on a MicroK8s managed by the same Juju MAAS controller.

Relevant log output

From the debug log

2023-09-26 10:40:44 DEBUG juju.worker.uniter agent.go:22 [AGENT-STATUS] executing: running monitors-relation-changed hook for nrpe/0
2023-09-26 10:40:44 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server  {unix @/var/lib/juju/agents/unit-cos-proxy-0/agent.socket <nil>}
2023-09-26 10:40:44 DEBUG unit.cos-proxy/0.juju-log server.go:316 monitors:44: Operator Framework 2.1.1+2.geb8e25a up and running.
2023-09-26 10:40:44 DEBUG unit.cos-proxy/0.juju-log server.go:316 monitors:44: Emitting Juju event monitors_relation_changed.
2023-09-26 10:40:44 DEBUG unit.cos-proxy/0.juju-log server.go:316 monitors:44: Operator Framework 2.1.1+2.geb8e25a up and running.
2023-09-26 10:40:44 DEBUG unit.cos-proxy/0.juju-log server.go:316 monitors:44: Emitting Juju event monitors_relation_changed.
2023-09-26 10:40:44 ERROR unit.cos-proxy/0.juju-log server.go:316 monitors:44: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 2615, in _run
    result = run(args, **kwargs)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-cos-proxy-0/network-get', 'monitors', '-r', '44', '--format=json')' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 475, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/nrpe_exporter/v0/nrpe_exporter.py", line 292, in _on_nrpe_relation_changed
    endpoints, alerts = self._generate_data(relation)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/lib/charms/nrpe_exporter/v0/nrpe_exporter.py", line 387, in _generate_data
    exporter_address = self._charm.model.get_binding(relation).network.bind_address
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 835, in network
    self._network = self._network_get(self.name, self._relation_id)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 828, in _network_get
    return Network(self._backend.network_get(name, relation_id))
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 2900, in network_get
    network = self._run(*cmd, return_output=True, use_json=True)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/model.py", line 2617, in _run
    raise ModelError(e.stderr)
ops.model.ModelError: ERROR no network config found for binding "monitors"

2023-09-26 10:40:45 ERROR juju.worker.uniter.operation runhook.go:153 hook "monitors-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
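
Until the missing binding itself is explained, a defensive pattern in the library could be to catch the ModelError and defer. This is only a sketch, not the current nrpe_exporter library code:

# Sketch: handle a missing network binding gracefully in the relation handler.
import logging

from ops.model import ModelError

logger = logging.getLogger(__name__)


def _bind_address_or_defer(charm, relation, event):
    """Return the bind address for the relation, or defer the event and return None."""
    try:
        return charm.model.get_binding(relation).network.bind_address
    except ModelError as err:
        logger.warning("No network binding for %s yet (%s); deferring", relation.name, err)
        event.defer()
        return None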

Additional context

No response

Scaling up the cos-proxy ends up with non-leaders erroring

Bug Description

COS Proxy tries to write to application data even from units that are not currently the leader. This makes them crash.
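
The usual guard is to write application relation data only from the leader unit. A minimal sketch, modelled loosely on the grafana_dashboard call in the traceback below:

# Sketch: only the leader unit may write application relation data.
import json


def publish_dashboards(charm, relation, stored_data) -> None:
    if not charm.unit.is_leader():
        # Non-leaders must not touch the app databag; skipping avoids
        # the RelationDataAccessError seen below.
        return
    relation.data[charm.app]["dashboards"] = json.dumps(stored_data)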

To Reproduce

  1. Deploy COS Proxy
  2. Scale it to two
  3. 💣

Environment

N/A

Relevant log output

unit-cos-proxy-5: 16:16:53 ERROR unit.cos-proxy/5.juju-log downstream-grafana-dashboard:614: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 431, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1688, in _update_remote_grafana
    grafana_relation.data[self._charm.app]["dashboards"] = json.dumps(stored_data)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/venv/ops/model.py", line 1473, in __setitem__
    self._validate_write(key, value)
  File "/var/lib/juju/agents/unit-cos-proxy-5/charm/venv/ops/model.py", line 1459, in _validate_write
    raise RelationDataAccessError(
ops.model.RelationDataAccessError: cos-proxy/5 is not leader and cannot write application data.

Additional context

No response

Generate alert rules on the fly

Whenever a new NRPE target is discovered, the cos-proxy should generate an alert rule on the fly and pass that on to Prometheus
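
A rough sketch of what such an on-the-fly rule could look like, assuming the exporter exposes a per-check status metric; the command_status metric name and label set are illustrative:

# Sketch: build a Prometheus alert rule for a newly discovered NRPE check.
def nrpe_alert_rule(unit: str, command: str) -> dict:
    return {
        "alert": f"{command}_{unit.replace('/', '_')}_critical",
        "expr": f'command_status{{juju_unit="{unit}", command="{command}"}} > 1',
        "for": "5m",
        "labels": {"severity": "critical"},
        "annotations": {
            "summary": f"NRPE check {command} on {unit} is failing",
        },
    }


rule = nrpe_alert_rule("nrpe-host/0", "check_load")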

cos-proxy fails hook "downstream-prometheus-scrape-relation-joined" due to attribute error

Bug Description

In test run https://solutions.qa.canonical.com/testruns/9e28d172-edf3-4c8c-9412-43473f369a4f, the deployment fails because cos-proxy is in error state:

cos-proxy/0*                                error        idle       2/lxd/4   10.246.165.58                   hook failed: "downstream-prometheus-scrape-relation-joined"
  filebeat/9                                active       idle                 10.246.165.58                   Filebeat ready.
  nrpe/15                                   active       idle                 10.246.165.58   5666/tcp icmp   Ready
  prometheus-grok-exporter/9                active       idle                 10.246.165.58   9144/tcp        Unit is ready
  telegraf/9                                active       idle                 10.246.165.58   9103/tcp        Monitoring cos-proxy/0 (source version/commit 23.10)
  ubuntu-advantage/9                        active       idle                 10.246.165.58                   Attached (esm-apps,esm-infra)

To Reproduce

  1. Deploy Cos
  2. Deploy the openstack bundle

Environment

Cos-proxy charm is latest/candidate rev 51

Cos itself has the following versions

App         Version  Status  Scale  Charm            Channel     Rev  Address  Exposed  Message
controller           active      1  juju-controller  3.1/stable   14           no       

Unit           Workload  Agent  Address     Ports      Message
controller/0*  active    idle   10.1.51.68  37017/TCP  
Model  Controller            Cloud/Region              Version  SLA          Timestamp
cos    foundations-microk8s  microk8s_cloud/localhost  3.1.6    unsupported  03:26:05Z

App           Version  Status  Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager  0.25.0   active      1  alertmanager-k8s  stable    86  10.152.183.161  no       
catalogue              active      1  catalogue-k8s     stable    24  10.152.183.217  no       
grafana       9.2.1    active      1  grafana-k8s       stable    92  10.152.183.120  no       
loki          2.7.4    active      1  loki-k8s          stable    97  10.152.183.93   no       
prometheus    2.46.0   active      1  prometheus-k8s    stable   146  10.152.183.212  no       
traefik       2.10.4   active      1  traefik-k8s       stable   148  10.246.167.206  no       

Unit             Workload  Agent  Address      Ports  Message
alertmanager/0*  active    idle   10.1.51.70          
catalogue/0*     active    idle   10.1.51.69          
grafana/0*       active    idle   10.1.83.205         
loki/0*          active    idle   10.1.83.206         
prometheus/0*    active    idle   10.1.67.197         
traefik/0*       active    idle   10.1.51.71          

Offer         Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager  alertmanager  alertmanager-k8s  86   0/0        karma-dashboard       karma_dashboard          provider
grafana       grafana       grafana-k8s       92   1/1        grafana-dashboard     grafana_dashboard        requirer
loki          loki          loki-k8s          97   1/1        logging               loki_push_api            provider
prometheus    prometheus    prometheus-k8s    146  2/2        metrics-endpoint      prometheus_scrape        requirer
                                                              receive-remote-write  prometheus_remote_write  provider

Relevant log output

In the debug log we see the following message repeated:

2023-11-29 04:51:08 DEBUG juju.worker.uniter.runner runner.go:719 starting jujuc server  {unix @/var/lib/juju/agents/unit-cos-proxy-0/agent.socket <nil>}
2023-11-29 04:51:08 DEBUG unit.cos-proxy/0.juju-log server.go:325 downstream-prometheus-scrape:162: ops 2.8.0+8.g26c6e95 up and running.
2023-11-29 04:51:08 DEBUG unit.cos-proxy/0.juju-log server.go:325 downstream-prometheus-scrape:162: Emitting Juju event downstream_prometheus_scrape_relation_joined.
2023-11-29 04:51:08 ERROR unit.cos-proxy/0.juju-log server.go:325 downstream-prometheus-scrape:162: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 517, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 340, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 842, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 442, in _downstream_prometheus_scrape_relation_joined
    self._on_nrpe_targets_changed(None)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 475, in _on_nrpe_targets_changed
    for alert in event.current_alerts:
AttributeError: 'NoneType' object has no attribute 'current_alerts'
2023-11-29 04:51:08 ERROR juju.worker.uniter.operation runhook.go:180 hook "downstream-prometheus-scrape-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1
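
The immediate failure is that _downstream_prometheus_scrape_relation_joined calls the NRPE handler with event=None while the handler unconditionally reads event.current_alerts. A hedged sketch of a guard; self.nrpe_exporter below is a stand-in for wherever the charm keeps its current state:

# Sketch: tolerate being called without an event by falling back to the
# currently known state instead of dereferencing None.
def _on_nrpe_targets_changed(self, event=None):
    if event is not None:
        alerts = event.current_alerts
        targets = event.current_targets
    else:
        # Fallback when invoked directly (e.g. from relation-joined);
        # `self.nrpe_exporter` is a stand-in for the charm's real state source.
        alerts = self.nrpe_exporter.alerts()
        targets = self.nrpe_exporter.targets()
    # ... continue processing `targets` and `alerts` as before ...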

Additional context

More logs, including crashdumps, can be found under: https://oil-jenkins.canonical.com/artifacts/9e28d172-edf3-4c8c-9412-43473f369a4f/index.html

Monitors relation takes a long time to settle.

Bug Description

In an environment with > 2k nrpe checks and 196 nrpe units, cos-proxy is still executing monitors-relation-joined and monitors-relation-changed after more than 24 hours.

It seems there is room to improve the efficiency of those hooks.
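
One possible optimization, sketched here under the assumption that most monitors events carry unchanged data, is to remember a digest of each unit's monitor payload and skip re-parsing when it has not changed:

# Sketch: avoid re-parsing identical monitor definitions on every hook run
# by remembering a digest of each unit's payload.
import hashlib


class MonitorCache:
    def __init__(self):
        self._digests = {}

    def changed(self, unit_name: str, raw_monitors: str) -> bool:
        """Return True only if this unit's monitor data differs from last time."""
        digest = hashlib.sha256(raw_monitors.encode()).hexdigest()
        if self._digests.get(unit_name) == digest:
            return False
        self._digests[unit_name] = digest
        return True


cache = MonitorCache()
assert cache.changed("nrpe/0", "check_load: ...") is True
assert cache.changed("nrpe/0", "check_load: ...") is False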

To Reproduce

Relate cos-proxy to nrpe:monitors in an environment with ~200 nrpe units.

Environment

cos-proxy rev. 33 channel edge

Relevant log output

unit-cos-proxy-monitors-0: 14:33:24 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via hook dispatching script: dispatch)
unit-cos-proxy-monitors-0: 14:33:24 INFO juju.worker.uniter.operation ran "monitors-relation-joined" hook (via hook dispatching script: dispatch)
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:25 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:26 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:26 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:26 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:26 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:26 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:26 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.
unit-cos-proxy-monitors-0: 14:33:26 WARNING unit.cos-proxy-monitors/0.juju-log monitors:788: Monitor data is not a dict after parsing. Skipping.

Additional context

No response

NRPE dashboard not available

Bug Description

When relating cos-proxy to grafana-k8s over the grafana-dashboard relation, the charm goes into a blocked state and the NRPE dashboard is not added to Grafana.

To Reproduce

  1. juju consume cos-microk8s-localhost:admin/cos.grafana-dashboards cos-grafana-dashboards
  2. juju relate cos-grafana-dashboards cos-proxy-monitors

Environment

cos-proxy rev. 52 channel edge
grafana-k8s rev. 93 channel edge

Relevant log output

$ juju status cos-proxy-monitors
...
Unit                   Workload  Agent  Machine    Public address  Ports  Message
cos-proxy-monitors/5*  blocked   idle   17/lxd/25  10.243.165.135         Missing one of (Grafana|dashboard) relation(s)
...

$ juju status --relations | grep cos-proxy-monitors
cos-proxy-monitors             n/a          blocked          1  cos-proxy                      edge      52  no       Missing one of (Grafana|dashboard) relation(s)
cos-proxy-monitors/5*              blocked      idle   17/lxd/25  10.243.165.135                                           Missing one of (Grafana|dashboard) relation(s)
cos-proxy-monitors:downstream-grafana-dashboard                      cos-grafana-dashboards:grafana-dashboard                      grafana_dashboard        regular      
cos-proxy-monitors:downstream-prometheus-scrape                      cos-scrape-interval-config-monitors:configurable-scrape-jobs  prometheus_scrape        regular      
nrpe-container:monitors                                              cos-proxy-monitors:monitors                                   monitors                 regular      
nrpe-controller:monitors                                             cos-proxy-monitors:monitors                                   monitors                 regular      
nrpe-host:monitors                                                   cos-proxy-monitors:monitors                                   monitors                 regular      
nrpe-maas-infra:monitors                                             cos-proxy-monitors:monitors                                   monitors                 regular

Additional context

No response

Add tests

Enhancement Proposal

Tests need to be added to capture issues such as the ones reported above.

For example:

  • utest: begin_with_initial_hooks with all the relations in place, and with relations added incrementally after the initial hooks (see the sketch after this list)
  • itest: upgrade the Charmhub edge charm with a local charm
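
A minimal sketch of the first item, using ops.testing.Harness; the relation endpoint names come from the issues above and the charm import path is an assumption:

# Sketch: exercise begin_with_initial_hooks with relations already in place.
import unittest

from ops.testing import Harness

from charm import COSProxyCharm  # assumed import path


class TestInitialHooks(unittest.TestCase):
    def test_initial_hooks_with_relations(self):
        harness = Harness(COSProxyCharm)
        self.addCleanup(harness.cleanup)

        # Relations present before the charm starts.
        rel_id = harness.add_relation("monitors", "nrpe")
        harness.add_relation_unit(rel_id, "nrpe/0")
        harness.add_relation("downstream-prometheus-scrape", "prometheus-k8s")

        harness.set_leader(True)
        harness.begin_with_initial_hooks()  # must not raise

        self.assertIsNotNone(harness.charm)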

Alert rules for NRPE are not created for all units

Bug Description

While comparing LMA and COS alerts, we noticed that alert rules for nrpe are not created for all units, only for a few (one per nrpe application instead of one per nrpe unit).

To Reproduce

juju status --relations 
...
cos-proxy-monitors:downstream-prometheus-scrape                      cos-scrape-interval-config:configurable-scrape-jobs  prometheus_scrape               regular
nrpe-compute:monitors                                                cos-proxy-monitors:monitors                          monitors                        regular      
nrpe-container:monitors                                              cos-proxy-monitors:monitors                          monitors                        regular      
nrpe-control-node:monitors                                           cos-proxy-monitors:monitors                          monitors                        regular      
nrpe-storage:monitors                                                cos-proxy-monitors:monitors                          monitors                        regular      
...

The unit data shows scrape targets for all units of the related application, but not the corresponding _alert_rules:

juju show-unit scrape-interval-config/0

Remove the relation and add it again:

juju remove-relation cos-proxy-monitors:downstream-prometheus-scrape cos-scrape-interval-config:configurable-scrape-jobs
juju relate cos-proxy-monitors:downstream-prometheus-scrape cos-scrape-interval-config:configurable-scrape-jobs

There is still no sign of alert_rules for every unit:

juju show-unit scrape-interval-config/0 | grep alert_rules

Environment

juju --version 
2.9.45-ubuntu-amd64

App                        Version                  Charm                         Channel
alertmanager               0.25.0                   alertmanager-k8s              stable
catalogue                  active                   stable                        19
grafana                    9.2.1                    grafana-k8s                   stable
loki                       2.7.4                    loki-k8s                      stable
prometheus                 2.43.0                   prometheus-k8s                stable
scrape-interval-config     n/a                      prometheus-scrape-config-k8s  latest/edge
traefik                    2.9.6                    traefik-k8s                   stable

Relevant log output

n/a

Additional context

No response

Regression: downstream-prometheus-scrape-relation-joined fails

Bug Description

It seems that #89 introduced a regression.

To Reproduce

  1. Deploy cos-proxy
  2. Relate it to nrpe (:monitors)
  3. Relate cos-proxy to scrape-interval-config (or probably directly to Prometheus)

Environment

cos-proxy latest/edge (rev 47)

Relevant log output

unit-cos-proxy-monitors-0: 06:00:28 ERROR unit.cos-proxy-monitors/0.juju-log downstream-prometheus-scrape:660: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/./src/charm.py", line 501, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/venv/ops/framework.py", line 340, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/venv/ops/framework.py", line 842, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/./src/charm.py", line 429, in _downstream_prometheus_scrape_relation_joined
    self._on_nrpe_targets_changed(None)
  File "/var/lib/juju/agents/unit-cos-proxy-monitors-0/charm/./src/charm.py", line 450, in _on_nrpe_targets_changed
    nrpes = cast(List[Dict[str, Any]], event.current_targets)
AttributeError: 'NoneType' object has no attribute 'current_targets'
unit-cos-proxy-monitors-0: 06:00:28 ERROR juju.worker.uniter.operation hook "downstream-prometheus-scrape-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

It was a fresh deployment of cos-proxy.

Pull nrpe_exporter as part of packing

Enhancement Proposal

Similar to cos-tool (e.g. in the Prometheus charm), it would be handy if nrpe_exporter were automatically packed into the charm so users would not have to download it manually and attach it with --resource.
