timescale / tobs Goto Github PK

tobs - The Observability Stack for Kubernetes. Easy install of a full observability stack into a k8s cluster with Helm charts.

License: Apache License 2.0

Shell 49.68% Makefile 20.96% Mustache 29.36%

metrics observability monitoring kubernetes-monitoring kube-prometheus kubernetes opentelemetry opentelemetry-collector prometheus prometheus-operator

tobs's Introduction

Warning

Tobs has been discontinued and is deprecated.

The code in this repository is no longer maintained.

Learn more.

tobs - The Observability Stack for Kubernetes

Tobs is a tool that aims to make it as easy as possible to install a full observability stack into a Kubernetes cluster. Currently this stack includes:

Kube-Prometheus the Kubernetes monitoring stack
- Prometheus to collect metrics
- AlertManager to fire the alerts
- Grafana to visualize what's going on
- Node-Exporter to export metrics from the nodes
- Kube-State-Metrics to get metrics from kubernetes api-server
- Prometheus-Operator to manage the life-cycle of Prometheus and AlertManager custom resource definitions (CRDs)
Promscale (design doc) to store metrics for the long-term and allow analysis with both PromQL and SQL.
TimescaleDB for long term storage of metrics and provides ability to query metrics data using SQL.
- Postgres-Exporter to get metrics from PostgreSQL server
Opentelemetry-Operator to manage the lifecycle of OpenTelemetryCollector Custom Resource Definition (CRDs)

We plan to expand this stack over time and welcome contributions.

Tobs provides a helm chart to make deployment and operations easier. It can be used directly or as a sub-chart for other projects.

Quick start

Prerequisites

Using tobs to install full observability stack with openTelemetry support currently requires installation of cert-manager. To do install it please follow cert-manager documentation.

Note: cert-manager is not required when using tobs with opentelemetry support disabled.

Installing the helm chart

The following command will install Kube-Prometheus, OpenTelemetry Operator, TimescaleDB, and Promscale into your Kubernetes cluster:

helm repo add timescale https://charts.timescale.com/
helm repo update
helm install --wait <release_name> timescale/tobs

Note: --wait flag is necessary for successfull installation as tobs helm chart can create opentelemetry Custom Resources only after opentelemetry-operator is up and running. This flag can be omitted when using tobs without opentelemetry support.

For detailed configuration and usage instructions, take a look at the helm chart's README.

Configuring the stack

All configuration for all components happens through the helm values.yaml file. You can view the self-documenting default values.yaml in the repo. We also have additional documentation about individual configuration settings in our Helm chart docs.

Compatibility matrix

Tobs vs. Kubernetes

Tobs Version	Kubernetes Version
12.0.x	v1.23 to v1.24
0.11.x	v1.23 to v1.24
0.10.x	v1.21 to v1.23
0.9.x	v1.21 to v1.23
0.8.x	v1.21 to v1.23
0.7.x	v1.19 to v1.21

Contributing

We welcome contributions to tobs, which is licensed and released under the open-source Apache License, Version 2. The same Contributor's Agreement applies as in TimescaleDB; please sign the Contributor License Agreement (CLA) if you're a new contributor.

tobs's People

Contributors

Stargazers

Watchers

Forkers

spolcyn atanasovskib antekresic seletz sravankumarkayitha tuapuikia vineethreddy02 cybernetics nexclipper heyleke elserhumano svendowideit heartshare marvel-works clix-dev-llc devopstoday11 sokoow devopsbox ddiiwoong zerow0000 beogacdo jacquesdonnelly dnck suryatmodulus taupiknr26 gperreymond paulfantom asfin rnrneverdies rnrlabs nsrwissam rykelley guettli youngwookim ssola jamessewell ramonguiu devsusu avisheksarkar20 uwentaway nhudson sysrex rizkyramadhanch onprem djhaskin987 dhomane umgbhalla mshuler arajkumar sam-huang1223 tmcarr maverick0bg angel-studios liweiming0611 seyunpark tvelmachos morningswait billcleveland-usra gamulambala fjsnogueira

tobs's Issues

prometheus-node-exporter pod stuck on pending

When installing the chart after one release of the chart is already installed, the prometheus-node-exporter pod gets stuck on pending and never comes up.

The gg chart was installed first, and the ff chart second.

NAME                                                  READY   STATUS      RESTARTS   AGE
ff-grafana-c7874b854-ntgjf                            2/2     Running     4          9m51s
ff-grafana-db-hhmrv                                   0/1     Completed   4          9m51s
ff-kube-state-metrics-85944bdd8b-697lf                1/1     Running     0          9m51s
ff-prometheus-node-exporter-zsbvr                     0/1     Pending     0          9m51s
ff-prometheus-server-94ffbdb5c-jxszt                  2/2     Running     0          9m51s
ff-timescale-prometheus-5cff84c58f-txv6v              1/1     Running     4          9m51s
ff-timescale-prometheus-drop-chunk-1593707400-vtnb9   0/1     Completed   0          5m27s
ff-timescaledb-0                                      1/1     Running     0          9m50s
gg-grafana-7b8b96bbc8-q6mf8                           2/2     Running     2          17m
gg-grafana-db-wnxr6                                   0/1     Completed   0          17m
gg-kube-state-metrics-5dd59f65fb-v8h78                1/1     Running     0          17m
gg-prometheus-node-exporter-n8b6w                     1/1     Running     0          17m
gg-prometheus-server-b47f7c7dc-4ps97                  2/2     Running     0          17m
gg-timescale-prometheus-5b9455669f-mhtjj              1/1     Running     5          17m
gg-timescale-prometheus-drop-chunk-1593707400-cmcsp   0/1     Completed   0          5m27s
gg-timescaledb-0                                      1/1     Running     0          17m

configuring the stack docs fail

did a tobs install, then

sven@x1carbon:~$ tobs helm show-values > values.yaml
sven@x1carbon:~$ tobs install -f values.yaml
Adding Timescale Helm Repository
"timescale" already exists with the same configuration, skipping
Fetching updates from repository
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "timescale" chart repository
Update Complete. ⎈Happy Helming!⎈
Installing The Observability Stack
Error: could not install The Observability Stack: exit status 1 
Output: 
Error: cannot re-use a name that is still in use

Activate automatic compression using the helm chart

Hi. I installed a setup with timescaledb + timescale-prometheus + prometheus using this helm chart. However, I store a lot of metrics (multiple cadvisors and ~20 node exporters) and my timescale db fills up at the rate of ~1Gb/hour.

I saw that it exists a compression mechanism for timescaledb which should greatly help keeping the disk usage low but I can't see a way to set it up easily with this helm chart and its subcharts.

Is there a way to do so or any alternatives to help? I saw that the cronjob doing cleanups on data runs a CALL prom.drop_chunks() command based on the retention period. However, it just outright delete data so this isn't what I want.

Ideally, I would want to automatically compress data older than 6 hours.

Thanks for your help!

&variables in helm values

When you do tobs helm show values the output shows odd defaults:
e.g.

dbName: &metricDB postgres
secretTemplate: &dbPassSecret "{{ .Release.Name }}-timescaledb-passwords"

I haven't seen that in a helm template before. It leads me to think that maybe it's not on purpose.

Constraint violation months after the chart is up

Hi,

I have the charts deployed on several clusters and i have an issue on one cluster where the promscale pod is not pushing data to the db because of this error :

{"caller":"write.go:79","err":"ERROR: new row for relation "aws_alb_target_connection_error_count_maximum" violates check constraint "aws_alb_target_connection_error_count_maximum_labels_check" (SQLSTATE 23514)","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"ts":"2021-04-26T12:57:38.333Z"}

I have not changed anything in the setup in a while so it must be something new with the data. Can you please provide me a few commands for timescale to check the constraint , modify it, drop it or at least understand what it's details are ?

Thanks

Full support for multi-cluster

There are some issues with dashboards

metrics whitelisting and scrape durations as chart parameters

All,

I've noticed that tobs is indeed a pretty cool complete gizmo now, but there's a bigger problem when using it full time.

So, by default exporters and prometheus scrape all the metrics, invoking disk writes south of 40 megs/s - pretty big strain.I assume that if you runestablished server bases, with grafana dashboards created already for your needs, then you probably can easily list a subset of metrics you're only interested in.

As a measure of decreasing mad scraping durations x number of default metrics gathered, I'd suggest an approach to parametrize both of these in the install process. What you think ?

Timescaledb password error

I tried to install tobs using the CLI tool 'tobs install'. However, the promscale pod got a 'CrashLoopBackOff' status.

pod/tobs-promscale-6bc7f8f689-qltwh 0/1 CrashLoopBackOff 5 4m35s

Checking the promscale pod log, I got some kind of authentication error

kubectl logs tobs-promscale-6bc7f8f689-qltwh

level=info ts=2021-06-25T16:33:20.098Z caller=runner.go:29 msg="Version:0.4.1; Commit Hash: level=info ts=2021-06-25T16:33:20.098Z caller=runner.go:30 config="&{ListenAddr::9201

 PgmodelCfg:{CacheConfig:{SeriesCacheInitialSize:250000 seriesCacheMemoryMaxFlag:{kind:0 value:50} 
SeriesCacheMemoryMaxBytes:1625900646 MetricsCacheSize:10000 LabelsCacheSize:10000} AppName:[email protected] 
Host:tobs.default.svc.cluster.local Port:5432 User:postgres password:**** Database:postgres SslMode:require DbConnectRetries:0 
DbConnectionTimeout:1m0s IgnoreCompressedChunks:false AsyncAcks:false ReportInterval:0 WriteConnectionsPerProc:4 
MaxConnections:-1 UsesHA:false DbUri: EnableStatementsCache:true} LogCfg:{Level:info Format:logfmt} APICfg:
{AllowedOrigin:^(?:.*)$ ReadOnly:false HighAvailability:false AdminAPIEnabled:false TelemetryPath:/metrics Auth:0xc000079c70 
MultiTenancy:<nil> EnableFeatures: EnabledFeaturesList:[] MaxQueryTimeout:2m0s SubQueryStepInterval:1m0s 
LookBackDelta:5m0s MaxSamples:50000000 MaxPointsPerTs:11000} LimitsCfg:{targetMemoryFlag:{kind:0 value:80} 
TargetMemoryBytes:3251801292} TenancyCfg:{SkipTenantValidation:false EnableMultiTenancy:false AllowNonMTWrites:false 
ValidTenantsStr:allow-all ValidTenantsList:[]} ConfigFile:config.yml TLSCertFile: TLSKeyFile: HaGroupLockID:0 
PrometheusTimeout:-1ns ElectionInterval:5s Migrate:true StopAfterMigrate:false UseVersionLease:true InstallExtensions:true 
UpgradeExtensions:true UpgradePrereleaseExtensions:false}"

level=error ts=2021-06-25T16:33:20.188Z caller=runner.go:40 msg="aborting startup due to error" err="failed to connect to 
`host=tobs.default.svc.cluster.local user=postgres database=postgres`: server error (FATAL: password authentication failed for 
user \"postgres\" (SQLSTATE 28P01))"

I am using the version 0.4.1

tobs version -d

Tobs CLI Version: 0.4.1, deployed tobs helm chart version: 0.4.1

The other components are running.

Any clues?
Thank you

0.4.1 might not be creating a secret properly

I get this after trying to install from 0.4.1:

Events:
  Type     Reason     Age               From                                  Message
  ----     ------     ----              ----                                  -------
  Normal   Scheduled  <unknown>         default-scheduler                     Successfully assigned default/tobs-promscale-544d8c5f46-f8clm to XXX
  Normal   Pulled     9s (x5 over 51s)  kubelet, XXX Container image "timescale/promscale:0.1.4" already present on machine
  Warning  Failed     9s (x5 over 51s)  kubelet, XXX  Error: secret "tobs-timescaledb-passwords" not found

could someone confirm ?

Failed to install tobs OR timescaldb-single : FATAL: "/var/lib/postgresql/data" is not a valid data directory

Tried to install tobs on k8s. Followed the basic steps:

curl --proto '=https' --tlsv1.2 -sSLf  https://tsdb.co/install-tobs-sh |sh
tobs helm show-values > values.yaml

And then changed the storage class to match storage class I have in the cluster.
Then installed:

tobs install -f values.yaml

Helm installation finished.
However the tobs-timescaledb-0 pod keeps crushing for :

2021-02-13 18:48:09.610 GMT [80] LOG:  skipping missing configuration file "/var/run/postgresql/timescaledb.conf"
2021-02-13 18:48:09.611 GMT [80] LOG:  skipping missing configuration file "/var/run/postgresql/timescaledb.conf"
2021-02-13 18:48:09.611 GMT [80] LOG:  skipping missing configuration file "/var/lib/postgresql/data/postgresql.auto.conf"
2021-02-13 18:48:09.611 GMT [80] FATAL:  "/var/lib/postgresql/data" is not a valid data directory
2021-02-13 18:48:09.611 GMT [80] DETAIL:  File "/var/lib/postgresql/data/PG_VERSION" is missing.
running bootstrap script ... /var/run/postgresql:5432 - no response
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/patroni/__init__.py", line 138, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/lib/python3/dist-packages/patroni/__init__.py", line 108, in run
    super(Patroni, self).run()
  File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/lib/python3/dist-packages/patroni/__init__.py", line 111, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1452, in run_cycle
    info = self._run_cycle()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1346, in _run_cycle
    return self.post_bootstrap()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1242, in post_bootstrap
    self.cancel_initialization()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1235, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'

I had later tried to install only timescaledb-single from the chart, and got the same failure.
I had modified the sts to sleep before running the init script on patroni, and I see directory /var/lib/postgresql/data is created by the pod command and exist.
However, indeed PG_VERSION does not exist there, not sure who shoold create it and when.

Full logs of timescaledb pod:

2021-02-13 18:47:56 - restore_or_initdb - Invoking initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
fixing permissions on existing directory /var/lib/postgresql/wal/pg_wal ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 20
selecting default shared_buffers ... 400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
Bus error (core dumped)
child process exited with exit code 135
initdb: removing contents of data directory "/var/lib/postgresql/data"
initdb: removing contents of WAL directory "/var/lib/postgresql/wal/pg_wal"
2021-02-13 18:48:09,364 WARNING: max_connections setting is missing from pg_controldata output
2021-02-13 18:48:09,364 WARNING: max_prepared_xacts setting is missing from pg_controldata output
2021-02-13 18:48:09,364 WARNING: max_locks_per_xact setting is missing from pg_controldata output
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=archive_command value=/etc/timescaledb/scripts/pgbackrest_archive.sh %p from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=archive_mode value=on from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=archive_timeout value=1800s from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=autovacuum_analyze_scale_factor value=0.02 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=autovacuum_max_workers value=10 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=autovacuum_vacuum_scale_factor value=0.05 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=cluster_name value=tobs from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=hot_standby value=on from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=listen_addresses value=0.0.0.0 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_autovacuum_min_duration value=0 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_checkpoints value=on from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_connections value=on from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_disconnections value=on from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_line_prefix value=%t [%p]: [%c-%l] %u@%d,app=%a [%e]  from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_lock_waits value=on from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_min_duration_statement value=1s from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=log_statement value=ddl from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=max_connections value=100 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=max_locks_per_transaction value=64 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=max_prepared_transactions value=150 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=max_replication_slots value=10 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=max_wal_senders value=10 from the config
2021-02-13 18:48:09,365 WARNING: Removing unexpected parameter=max_worker_processes value=8 from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=port value=5432 from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=shared_preload_libraries value=timescaledb,pg_stat_statements from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=ssl value=on from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=ssl_cert_file value=/etc/certificate/tls.crt from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=ssl_key_file value=/etc/certificate/tls.key from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=tcp_keepalives_idle value=900 from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=tcp_keepalives_interval value=100 from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=temp_file_limit value=1GB from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=track_commit_timestamp value=off from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=unix_socket_directories value=/var/run/postgresql from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=unix_socket_permissions value=0750 from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=wal_level value=hot_standby from the config
2021-02-13 18:48:09,366 WARNING: Removing unexpected parameter=wal_log_hints value=on from the config
2021-02-13 18:48:09.610 GMT [80] LOG:  skipping missing configuration file "/var/run/postgresql/timescaledb.conf"
2021-02-13 18:48:09.611 GMT [80] LOG:  skipping missing configuration file "/var/run/postgresql/timescaledb.conf"
2021-02-13 18:48:09.611 GMT [80] LOG:  skipping missing configuration file "/var/lib/postgresql/data/postgresql.auto.conf"
2021-02-13 18:48:09.611 GMT [80] FATAL:  "/var/lib/postgresql/data" is not a valid data directory
2021-02-13 18:48:09.611 GMT [80] DETAIL:  File "/var/lib/postgresql/data/PG_VERSION" is missing.
running bootstrap script ... /var/run/postgresql:5432 - no response
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/patroni/__init__.py", line 138, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/lib/python3/dist-packages/patroni/__init__.py", line 108, in run
    super(Patroni, self).run()
  File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/lib/python3/dist-packages/patroni/__init__.py", line 111, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1452, in run_cycle
    info = self._run_cycle()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1346, in _run_cycle
    return self.post_bootstrap()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1242, in post_bootstrap
    self.cancel_initialization()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1235, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'

"invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable" when using KUBECONFIG env variable

Some parts of tobs do not respect KUBECONFIG env variable when using tobs install, leading to an error message

invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

Seems the tobs helm parts are not suffering from this issue

promscale is missing nodeSelector

promscale gets currently scheduled on any available node since it's missing nodeSelector.
We're using node pools and it would be great to ensure promscale stays in our o11y node pool.

Can not upgrade with timescaledb chart disabled

Hi we are running tobs with timescale outside kubernetes.
When I try to upgrade from tobs 0.2.1 with this command
helm upgrade tobs timescale/tobs --namespace tobs -f custom-values.yaml
I get this error
Error: UPGRADE FAILED: template: tobs/templates/NOTES.txt:158:94: executing "tobs/templates/NOTES.txt" at <{{template "timescaledb.fullname" $tsEnv}}>: template "timescaledb.fullname" not defined

As I understand it the template timescaledb.fullname is defined in the timescaledb chart and hence undefined.

Finalize/Test helm charts for on-premise deployments

Prometheus

Research if config is merged or replaced when using additional config maps
Fix parsing error when generating a config map with timescale-prometheus as a remote_write from the same deployment
Debug why prometheus is not sending samples to remote but no error is logged
Document all values in the readme
Include the proper labels in the prometheus config mag
Configure remote_read for our connector

Grafana

Deployment
Set up both timescale and prometheus as data sources
Example dashboards
Use timescaledb as a database for grafana

Please add --devel to the docs for every helm command.

Problem

The docs fail to mention --devel in the step where you'd download and customise your values.yaml.

Ppl like me who copy-paste tutorial code will get confused, helm won't find a valid chart.

As mentioned by @mike in the slack channel the beta will come soon, but maybe it will save others some headaches.

PR coming.

Deploy Jaeger UI with tobs

Make it possible to deploy the Jaeger UI to explore and query traces in Promscale with tobs.

Deployment will be optional (i.e. not by default).

cli/tests/e2e-tests.sh is not posix sh compliant and breaks

Running the cli/tests/e2e-tests.sh produces an error in bash:

Creating cluster "tobs" ...
 ✓ Ensuring node image (kindest/node:v1.21.1) 🖼
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-tobs"
You can now use your cluster with:

kubectl cluster-info --context kind-tobs

Thanks for using kind! 😊
trap: SIGHUP: bad trap

I believe this is due to the following:

In POSIX sh, prefixing signal names with 'SIG' is undefined.

https://github.com/koalaman/shellcheck/wiki/SC3048

Switch to or vendor kube-prometheus

We already have a project with a lot of the same goals in https://github.com/prometheus-operator/kube-prometheus. It has an amazing set of alerts, dashboards and recording rules and comes with Alertmanager too. The community is also behind that project.

Having said that, I like the UX of tobs. I think it will be better to wrap tobs on top of kube-prometheus rather than using helm charts like you do now. I know its a big change but it will literally make the project much much better. Hear me out:

Ship Prometheus Operator instead of Prometheus as adding relabelling rules is not trivial and operator makes it much easier. The example here:

tobs/chart/values.yaml

Lines 98 to 103 in c161c7f

    
           #extraScrapeConfigs: | 
        
             #example of adding hypotetical scrape job for https://github.com/AICoE/prometheus-anomaly-detector 
        
             #- job_name: prometheus-anomaly 
        
             #  static_configs: 
        
             #    - targets: 
        
             #      - prometheus-anomaly-svc.default:8080

is wrong because you shouldn't scrape the service endpoint but rather individual pods.

Ship the dashboards, alerts and recording rules from kube-prometheus. They are maintained by the experts with a lot of collaboration from the community.
Also add alertmanager to the stack and make it easier to configure.

At this point it looks like kube-prometheus :) Why not just use it?

Sporadic errors when deploying

Hi. I have been installing/deleting the timescale-observability chart a lot in my tests.

However, when I am installing the chart, it fails 1/2 of the times. I reproduced it in version 0.1.0-alpha.2 and 0.1.0-alpha.3.

The problem is the timescaledb which stays stuck in an "non ready" state which eventually fails the install. Here are the logs:

kubectl logs test-timescaledb-0
2020-05-07 03:02:03,186 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2020-05-07 03:02:03,186 ERROR: failed to bootstrap (without leader)
2020-05-07 03:02:13,191 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2020-05-07 03:02:13,191 ERROR: failed to bootstrap (without leader)
2020-05-07 03:02:23,188 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2020-05-07 03:02:23,189 ERROR: failed to bootstrap (without leader)
2020-05-07 03:02:33,187 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2020-05-07 03:02:33,187 ERROR: failed to bootstrap (without leader)
2020-05-07 03:02:43,185 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2020-05-07 03:02:43,186 ERROR: failed to bootstrap (without leader)
2020-05-07 03:02:53,185 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2020-05-07 03:02:53,185 ERROR: failed to bootstrap (without leader)
2020-05-07 03:03:03,187 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
2020-05-07 03:03:03,188 ERROR: failed to bootstrap (without leader

If I delete the botched release and try again, it works fine. I attached to this issue the minimal values.yml with which I reproduce the bug. I do not use complicated stuff like HA or backups so I am very surprised by the logs.

I also have sometimes a kubernetes service object which fails to be deleted by helm:

$ helm delete test                                                                                                                                                       
release "test" uninstalled
$ helm upgrade --install --wait test timescale/timescale-observability --version=0.1.0-alpha.2  -f minimal-values.yaml
Release "test" does not exist. Installing it now.
Error: rendered manifests contain a resource that already exists. Unable to continue with install: Service "test-config" in namespace "monitoring" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "test"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "monitoring"

It must be either a race condition or a resource helm failed to delete between the installs.

could not access file "$libdir/timescaledb-2.0.0-rc2": No such file or directory

Hi, I restarted my timescaledb pod today and I started to see cascade of those errors:

patroni.exceptions.PostgresConnectionException: 'connection problems'
2020-11-23 20:33:51 UTC [179]: [5fbc1caf.b3-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2020-11-23 20:33:51 UTC [179]: [5fbc1caf.b3-2] postgres@postgres,app=[unknown] [00000] LOG:  connection authorized: user=postgres database=postgres application_name=Patroni
2020-11-23 20:33:51 UTC [179]: [5fbc1caf.b3-3] postgres@postgres,app=Patroni [58P01] ERROR:  could not access file "$libdir/timescaledb-2.0.0-rc2": No such file or directory

I searched for this timescaledb-2.0.0-rc2 in the container and found only timescaledb-2.0.0-rc3 for pg12. The timescaledb pod is reported running but promscale fails to connect to it:

level=error ts=2020-11-23T20:35:59.338Z caller=runner.go:110 msg=\"aborting startup due to error\" err=\"failed to connect to `host=<pod name>.<namespace>.svc.cluster.local user=postgres database=postgres`: dial error (dial tcp <ip>:5432: connect: connection refused)\"

I had default values for the timescaledb image:

   image:
     repository: timescaledev/timescaledb-ha
     tag: pg12-ts1.7-latest

I fixed my problem by switching to pg12.4-ts1.7-latest image tag.

how to update prometheus configs

What is the recommended way to update your prometheus configuration with tobs?

e.g. I want to add scrape configs, alerting rules, etc.

Increase default wal and data disk size

Need to experiment a bit but data should probably be ~50GB wal maybe 5GB

Storage rotation/limiting

Is there any gauge in configs to make storage limits/rotation working ? I think I've seen a 150gig setting somewhere in chart values, but it just doesn't work (it crosses 150 gig and tips my cluster over). Let us know and thanks in advance.

Could not install The Observability Stack 0.3.0

Hi, I'm little bit confused that no one else reported this issue, but I'm not able to install 0.3.0 due to "Error: could not install The Observability Stack: failed to parse helm show values from yaml to json yaml: mapping values are not allowed in this context"

I've tried tobs install atop of k8s 1.20 and 1.19 and got the same result. Can anyone confirm that it works on your side? Thanks.

tobs install error on GKE

Hi,

I created a new GKE cluster and tried to deploy Promscale using Tobs. But after the installation, the process exit with the following error message.

$>tobs install
Adding Timescale Helm Repository
"timescale" has been added to your repositories
Fetching updates from repository
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "timescale" chart repository
Update Complete. ⎈ Happy Helming!⎈
Installing The Observability Stack
Waiting for pods to initialize...
2020/10/08 15:26:36 no Auth Provider found for name "gcp"

$>tobs grafana get-password
2020/10/08 15:30:20 no Auth Provider found for name "gcp"

kubeconf context

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://XXX.XXX.XXX.XXX
  name: gke_cluster
contexts:
- context:
    cluster: gke_cluster
    user: gke_cluster
  name: gke_cluster
current-context: gke_cluster
kind: Config
preferences: {}
users:
- name: gke_cluster
  user:
    auth-provider:
      config:
        access-token: OMITTED
        cmd-args: config config-helper --format=json
        cmd-path: gcloud
        expiry: "2020-10-08T08:17:55Z"
        expiry-key: '{.credential.token_expiry}'
        token-key: '{.credential.access_token}'
      name: gcp

tobs charts missing from charts.timescale.com

While attempting to run tobs install, I receive the error message:

Error: could not install The Observability Stack: exit status 1 
Output: Error: failed to download "timescale/tobs" (hint: running `helm repo update` may help)

Digging around a bit more, it seems like the tobs and promscale charts are missing from charts.timescale.com.

The only 2 charts currently in the index.yaml file are the ones for timescaledb:

~> helm  search repo timescale
NAME                            CHART VERSION   APP VERSION     DESCRIPTION                      
timescale/timescaledb-multinode 0.8.0                           TimescaleDB Multinode Deployment.
timescale/timescaledb-single    0.8.1                           TimescaleDB HA Deployment. 

~> curl https://charts.timescale.com/index.yaml  2>/dev/null | yq eval '.entries.[] | path | .[-1]' -
timescaledb-multinode
timescaledb-single

Looks like maybe the index file was clobbered when an update was made earlier today (based on timestamps in index.yaml)

alertmanager configuration differs from the upstream prometheus chart

I've been unable to get alertmanager working with this helm chart. No matter what options I try, the alarm never appears to make it to alertmanager.

I noticed some differences in this chart vs the prometheus upstream.

The current configuration within the tobs chart in prometheus.conf:

 alerting:
{{- if $root.Values.prometheus.alertRelabelConfigs }}
{{ $root.Values.prometheus.alertRelabelConfigs | toYaml  | trimSuffix "\n" | indent 6 }}
{{- end }}
      alertmanagers:
{{- if $root.Values.prometheus.server.alertmanagers }}
{{ toYaml $root.Values.prometheus.server.alertmanagers | indent 8 }}
{{- else }}
      - kubernetes_sd_configs:
          - role: pod
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        {{- if $root.Values.prometheus.alertmanager.prefixURL }}
        path_prefix: {{ $root.Values.prometheus.alertmanager.prefixURL }}
        {{- end }}
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          regex: {{ $root.Release.Namespace }}
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_app]
          regex: {{ template "prometheus.name" $root }}
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_component]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
          regex: {{ index $root.Values.prometheus.alertmanager.podAnnotations "prometheus.io/probe" | default ".*" }}
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          regex:
          action: drop

The configuration in the actual upstream prometheus configuration in cm.yaml under server folder has:

alerting:
{{- if $root.Values.alertRelabelConfigs }}
{{ $root.Values.alertRelabelConfigs | toYaml  | trimSuffix "\n" | indent 6 }}
{{- end }}
      alertmanagers:
{{- if $root.Values.server.alertmanagers }}
{{ toYaml $root.Values.server.alertmanagers | indent 8 }}
{{- else }}
      - kubernetes_sd_configs:
          - role: pod
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        {{- if $root.Values.alertmanager.prefixURL }}
        path_prefix: {{ $root.Values.alertmanager.prefixURL }}
        {{- end }}
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          regex: {{ $root.Release.Namespace }}
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_app]
          regex: {{ template "prometheus.name" $root }}
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_component]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
          regex: {{ index $root.Values.alertmanager.podAnnotations "prometheus.io/probe" | default ".*" }}
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          regex: "9093"
          action: keep

The difference being the last regex match is a drop in the tobs helmchart vs a specific port and keep in the prometheus chart.

Any particular reason for this?

Refactor remote_write values structure to avoid confussion with prometheus.remoteWrite

The prometheus helm chart has a remoteWrite config section where you can put in your configuration for the endpoints. We have a remote_write section for the timescale-prometheus connector that is used to manage the queue config. This causes understandable confusion

Custom postgres settings needed

Is there any way to inject custom postgres.conf values? I'm struggling with setting up correct WAL behaviour and maximum worker count - default tobs install causes a lot of ioload later in the day, when it starts writing to disks in a manic way.

recommend way of configuring Prometheus

Once the stack is installed and running, what's the best way of configuring Prometheus? Adding targets, alert rules, alert manager receivers, etc.

Create tobs Kubernetes supportability matrix for CI & docs.

0.3.0 installer broken?

All, I think 0.3 installer is broken:

# tobs install
Adding Timescale Helm Repository
"timescale" already exists with the same configuration, skipping
Fetching updates from repository
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "minio" chart repository
...Successfully got an update from the "timescale" chart repository
Update Complete. ⎈Happy Helming!⎈
Error: could not install The Observability Stack: failed to parse helm show values from yaml to json yaml: mapping values are not allowed in this context

Automatic configuration of prometheus remote_read broken

Hi. I updated yesterday to the 0.1.0-alpha.3 release but since then, the prometheus instance spawned with the chart fails to read from the database.

I tracked down the error into a commit in the timescale-prometheus repository. This commit changes the name of the timescale-prometheus connector's service name.

However, this change wasn't taken into account in the autogenerated prometheus configuration:

We have two ways to fix this:

revert the service name commit
(quick fix) update the name of the service in the remote_read and remote_write urls in the prometheus configuration template (this is what I did to fix the issue)
(long fix) update the prometheus configuration and find a way to keep it in sync between the two charts (a variable should do). This will require to update both charts. Don't know if this is worth the effort though.

I tried to workaround the bug by specifying the remote_read myself using the .Values.prometheus.server.remoteRead key but wasn't able to make it work.

I have an indentation issue:

helm upgrade --install --dry-run --debug metrics timescale/timescale-observability --version=0.1.0-alpha.3 -f helm/metrics.yaml | grep "remote_read" -A 5
    remote_read:
      - url: http://metrics-timescale-prometheus.monitoring.svc.cluster.local:9201/read
    - url: http://metrics-timescale-prometheus-connector.monitoring.svc.cluster.local:9201/read

I managed to fix the chart by changing this line:
{{ $root.Values.prometheus.server.remoteRead | toYaml | indent 4 }}
into:
{{ $root.Values.prometheus.server.remoteRead | toYaml | indent 6 }}

I can fix both those issues and submit a PR if this helps.

Deploy OpenTelemetry Collector with Tobs

Make deployment of the OpenTelemetry Collector possible via tobs with default configurations for ingesting Jaeger and Zipkin traces/spans.

failed to create HA

Hi,

I am trying to install tobs with ha timescale.
In my values file timescaledb-single.replicaCount is set to 3 as mentioned in documentation.

First one goes fine but second gets stucked with following error:
│ timescaledb 2021-06-07 20:27:51,185 ERROR: Error when fetching backup: pg_basebackup exited with code=1 │ │ timescaledb 2021-06-07 20:27:51,185 ERROR: failed to bootstrap from leader 'monitoring-timescaledb-0' │ │ timescaledb 2021-06-07 20:27:56,050 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1 │ │ timescaledb pg_basebackup: error: could not connect to server: Connection timed out │ │ timescaledb Is the server running on host "10.244.32.15" and accepting │ │ timescaledb TCP/IP connections on port 5432?

There are a few links to `docs/cli-usage.md` in the tobs READMEs, but this file does not appear to exist in the repo.

for example:
Our full usage docs give a good overview of what you can do with the CLI tool.

cannot change disk sizes

when trying to update disk sizes via values on an already running deploy get

helm upgrade t5  -f timescale.yml timescale/timescale-observability --devel
Error: UPGRADE FAILED: cannot patch "t5-timescaledb" with kind StatefulSet: StatefulSet.apps "t5-timescaledb" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden

May be related to helm/charts#8594.

Thanks to @ismailyenigul for the report

Add an option to configure SQL queries to execute on TimescaleDB as pre/post upgrade hooks in tobs helm chart upgrade process without use of CLI

There should be a way to install the helm chart without having to create the secrets manually.

Helm supports pre-install and pre-upgrade hooks. Perhaps a nice way to implement this would be to run a job with tobs CLI container and give it a script to execute.

If this is implemented, it would also be possible to implement a post-install and post-upgrade hook in a similar manner, allowing the user to execute an arbitrary script - e.g. configure retention, create extra users, etc.

Both of these changes would greatly reduce the complexity of installing the stack, CD tools like ArgoCD can't use a custom CLI tool without extensive configuration and people usually don't want to install another CLI tool in their CI pipeline.

All in all, the CLI tool is useful but I think it should be the other way around - enable us to use the tool from the helm chart (as hooks), don't require us to download it and use it.

Adjust for moving of the upstream prom helm repo

The repo has moved to https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus

At least the Helm hub repo link in https://github.com/timescale/tobs/tree/master/chart#additional-configuration-for-prometheus needs to be adjusted. There may be other things as well

tobs install failure on gcp

With the kubeconfig

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: OMIT 
    server: OMIT
  name: OMIT
contexts:
- context:
    cluster: OMIT
    namespace: OMIT
    user: OMIT
  name: OMIT
current-context: OMIT
kind: Config
preferences: {}
users:
- name: OMIT
  user:
    auth-provider:
      config:
        access-token: OMIT
        cmd-args: config config-helper --format=json
        cmd-path: gcloud
        expiry: "2021-09-22T14:53:32Z"
        expiry-key: '{.credential.token_expiry}'
        token-key: '{.credential.access_token}'
      name: gcp

Running,

./tobs install --chart-reference ../chart

from the cli directory fails,

... no Auth Provider found for name "gcp"

This seems to be a regression to #68.

Cronjob drop-chunk failing

I have installed the timescale-observability stack (timescaledb + timescale-prometheus + prometheus), version 0.1.0-alpha.3 and managed to make it work.

However, I notice from times to times a pod of the drop-chunk cronjob in CrashLoopBackOff state:
metrics-timescale-prometheus-drop-chunk-1588905000-9t4rs 0/1 CrashLoopBackOff 5 5m1s

When I look at the logs, I have:

kubectl logs metrics-timescale-prometheus-drop-chunk-1588905000-9t4rs                                                                       SIGINT(2) 
ERROR:  schema "prom" does not exist
LINE 1: CALL prom.drop_chunks();
             ^

When I check in the database, the schema "prom" indeed does not exists.

postgres=# \dn
          List of schemas
          Name           |  Owner
-------------------------+----------
 _prom_catalog           | postgres
 _prom_ext               | postgres
 _timescaledb_cache      | postgres
 _timescaledb_catalog    | postgres
 _timescaledb_config     | postgres
 _timescaledb_internal   | postgres
 prom_api                | postgres
 prom_data               | postgres
 prom_data_series        | postgres
 prom_info               | postgres
 prom_metric             | postgres
 prom_series             | postgres
 public                  | postgres
 timescaledb_information | postgres
(14 rows)

However, I do see the "drop_chunks" procedure in the prom_api schema:

postgres=# select proname,nspname from pg_catalog.pg_proc JOIN pg_namespace ON pg_catalog.pg_proc.pronamespace = pg_namespace.oid where proname = 'drop_chunks';
   proname   | nspname
-------------+----------
 drop_chunks | prom_api
 drop_chunks | public
(2 rows)

tobs CLI default installation fails

$ curl --proto '=https' --tlsv1.2 -sSLf  https://tsdb.co/install-tobs-sh | sh
sh: 31: Syntax error: redirection unexpected

Because of the redirection opertion at if grep -Fxq "${checksum}" <<< "${checksumlist}"; then

Instead of using sh, using bash is making this work.

We should either update README.md to use curl <> | bash or find an alternative to redirection operation.

I am using ubuntu 20.04.

Grafana timescale datasource permissions

Hi,

I have installed the stack successfully into my cluster and have a problem with the TimescaleDB.
My expectation was that the TimescaleDB data source in grafana would enable querying the data ingested by Promscale, instead it looks like TimescaleDB data source only offers tables from the grafana database in the query builder.

Is this the intended behavior? I expected the data source to have read access to the Prometheus data and not the grafana database.

grafana pod error

after tobs install

Its error comes up this

root@master:~/tobs/chart# k logs -f tobs-grafana-74b75c676d-tmlv7
error: a container name must be specified for pod tobs-grafana-74b75c676d-tmlv7, choose one of: [grafana-sc-dashboard grafana] or one of the init containers: [grafana-sc-datasources]

how can i solve it?

Set up Grafana to show up traces from Promscale

cli command helm show-values fails

The following steps produce an error:

git clone https://github.com/timescale/tobs.git
cd tobs/cli 
go build .
./cli helm show-values 
>> Error: failed to get helm values: failed to download "timescale/tobs" (hint: running `helm repo update` may help)

I'm scratching my head. This seems like the most minimal first-example any sane person might run, and it does not work.

Of course, I can use helm directly, but since the cli tool is here, I would not mind using it if it worked...

failed to run chart on ubuntu 20.04 with microk8s

Hello,
I'm running into an issue trying out this chart on a stock install of ubuntu 20.04 in a vm. Here's what I've done:

snap install microk8s --classic
microk8s.enable helm3
microk8s.helm3 install --devel stats timescale/timescale-observability

After installing :

> microk8s.kubectl get pod
NAME                                          READY   STATUS             RESTARTS   AGE
stats-grafana-79b5c7d6cf-pnwt4                1/2     CrashLoopBackOff   13         24m
stats-kube-state-metrics-5cccd67c88-x4bjd     1/1     Running            0          24m
stats-prometheus-node-exporter-tdjrx          1/1     Running            0          24m
stats-prometheus-server-cfcbbd46b-bj2kp       0/2     Pending            0          24m
stats-timescale-prometheus-6748f8b67d-tp8c4   0/1     CrashLoopBackOff   13         24m
stats-timescaledb-0                           0/1     Pending            0          24m

I've tried with microk8s.helm and microk8s.helm3, with and without microk8s.dns enabled. Is this expected to work? Anything else I should try?

Thanks!

Make it easy to install with forge using the tobs cli

Make it easy to use the hosted platform Forge for running TimescaleDB

loki support?

Hi All, well done on a pretty solid stack, I like.

Do you foresee adding loki stack into it ? If you need help with that, can you write a short manual on how one would just splice something like that into it and set you a PR ? Greetings

Test with long release names

One error people get is:

helm install timescale-observability timescale/timescale-observability --devel
Error: CronJob.batch "timescale-observability-timescale-prometheus-drop-chunk" is invalid: metadata.name: Invalid value: "timescale-observability-timescale-prometheus-drop-chunk": must be no more than 52 characters.

	#extraScrapeConfigs: \|
	#example of adding hypotetical scrape job for https://github.com/AICoE/prometheus-anomaly-detector
	#- job_name: prometheus-anomaly
	# static_configs:
	# - targets:
	# - prometheus-anomaly-svc.default:8080