First of all: Awesome work, this an amazingly fast and awesome tool. Thanks for sharin

Hello <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Signal Handling from Orchestrator about consul-templaterb HOT 9 CLOSED

criteo commented on June 18, 2024

Signal Handling from Orchestrator

from consul-templaterb.

Comments (9)

pierresouchay commented on June 18, 2024 2

Hello @MillsyBot,

Thanks a lot, it is really great to receive a nice comment about our work and it is a real pleasure to share our work with people being grateful.

We are using daily this tool with signals to regenerate on the fly our configurations for dozens of different very large configurations in many teams with many instances of Prometheus (Prometheus receiving the signal) and this is working for a very long time, so I suspect a big bug would have been notified, but it is still possible, especially since we are not using very high --wait values (max is around 10s).

I also added the signal reload for your exact usage some time ago, so it is exactly intended to do so. But there are a few warnings (that are supposed to be explained in the README.md, but I take the time to re-explain (and feel free to modify the README in a PR if it is not clear).

The most common problem

First, in your example, you are using --wait 60. This means that templates are evaluated every 1 minute, it has several impacts: it means that the template will be generated at a maximum of 1/m, so the signals will be sent every second at the maximum, whatever the number of changes/s. I suspect this is what you want, but globally, even in a very noisy cluster, haproxy will reload its config every 60s max. I never tried to use it with such high values in production (while it should not have any impact, it is still possible).

Second, the order of --sig-reload and --exec are very important:

if you consider this example:

consul-templaterb --exec /my/monitoring --sig-reload NONE --exec "python -m http.server 8080" [somefiles]

/my/monitoring and python servers would be launched once all files are properly generated and will receive signals each time one of those files is changing. However:

/my/monitoring is supposed to receive SIGHUP (the default value), because -sig-reload is after --exec /my/monitoring
python -m http.server 8080 will receive no signal (because --sig-reload NONE is before --exec python command).

It means that for instance appending --sig-reload TERM at the end of your commands has no effect because no other --exec flag is present).
This behavior, while powerful, is not very common (and I have been bitten a few times with it myself), and several times, when looking at it, I did not understand while my python command did receive SIGHUP while I wanted NONE, appended at the end of my command line.

Another possible issue

From the doc:

If you generate several files at the same time, the signal will be sent only once the rendering of all templates is completed, so if your process is using several configuration files, all files will be modified and consistent before the signal is sent to process.

It means that the signal is supposed to be generated when all the files are considered not dirty, it means all the data for all the files have been retrieved and properly computed.

Consider this example:

<%
counts = {}
services.each do |service_name, tags|
  counts[service_name] = service(service_name).count
end
%><%= JSON.pretty_print(counts) %>

If you have a cluster where 1 service is renamed every 30s and you used --wait 60, for instance:
service-30, then service-60, service-90 [...] service-3600

At startup, consul-templaterb ignores --wait and try to fetch data as fast as possible. But once the template has been rendered, it would watch.

every 60s, the template is evaluated, then wait...

so, at t+60: consul-templaterb finds new services service-30 and service-60, if schedules the retrival of those 2 additional services, mark the template has dirty (incomplete), does not render it and wait for 60s.
at t+120, consul-templaterb did get values for service-30 and service-60, but in the meantime, service-90 and service-120 have been added, so values are here, but 2 new schedules have been added service-90 and service-120, so template is still dirty, rescheduled in 60s... still no signal sent, if the list of services is changing fast enought and the --wait 60s is there, you might end with your templates being all rendered not that often, thus signal would be never sent, or very rarely. Note that in order to send the signal, consul-templaterb sends the signal ONLY if all templates are non-dirty. The fact that you are using several templates, means that at least one of those templates is not rendered, nothing happens

This might be fixed in at least 3 ways:

Hack the file consul-templaterb and change https://github.com/criteo/consul-templaterb/blob/master/bin/consul-templaterb#L69 to use min-value = 180 (refresh at the maximum the list of services every 3 minutes) => would work for my example, since the value would be greater than 60s, new services would never popup every 30s and it would converge with a maximum 60s. If it is the case, a clean fix would be to be able to override from command line those values and it would work in any case. Another possible fix would be to have values here that are always at least 1.5 times the --wait value
Do not use signals and use a binary instead, in this binary ensure to reload the command ONLY if some time has been spent. So, you could use --wait 60 BUT reload haproxy only every 60s, example --template haproxy.conf.erb:haproxy.conf::./reload_haproxy_when_needed.sh

with ./reload_haproxy_when_needed.shbeing (not tested):

#!/bin/bash

# Test if .last_reload is older than 1 min
if test "`find .last_reload -mmin +1`"
then
  kill -USR2 $(pidof haproxy)
  touch .last_reload
fi

=> thus, each time haproxy.conf would be modified, the script would be launch, but only reload haproxy every 60s max, meaning that you could use --wait 1 to have faster changes.

If it something else

You might consider running it with debug flags (which displays if there are some dirty templates), and if you don't find, try posting the debug messages here.

About --wait

wait is a bit tricky as it both handles files changes and evaluation of templates. This is clearly not the best idea I had to link both, and initially, it was more about either reducing a bit CPU usage on very complex templates (but not changing data all the time) OR avoid reloading the process all the time (but more with 10s than 60s).

This might be fixed by having an option to delay --sig-reload to launch only every x seconds (might not be that hard) and decorrelate it from --wait.

I would be glad to accept such PR if you are interested

Best
Pierre

from consul-templaterb.

pierresouchay commented on June 18, 2024 1

Hello @MillsyBot,

I added support from what I described in my previous comment in 70808c6

This will let you use -w 1 (render every second), but send to the process signals every 60s, using the new flag -W 60,
will be available in the next coming release (in a few minutes). This might solve your issues if my theory is correct.

from consul-templaterb.

pierresouchay commented on June 18, 2024 1

@robloxrob glad it helps

from consul-templaterb.

MillsyBot commented on June 18, 2024 1

@chuckyz says thanks as well.

from consul-templaterb.

robloxrob commented on June 18, 2024

Thanks for all the love here! This really makes my day.

from consul-templaterb.

MillsyBot commented on June 18, 2024

@pierresouchay WOW, thank you so much for the detailed response, and for releasing a new version addressing my issues! So cool, I will buy you a beer at the next HAProxy Conf (we meet there last year).

Out of curiosity: what type of performance impact are you seeing with HAProxy reloading every 1sec? Are the impacts significant to connection reuse when the FD are constantly passing from one process to another?

from consul-templaterb.

pierresouchay commented on June 18, 2024

@MillsyBot It really depends on the versions. In my tests a few years ago (Version 1.7.x), it used to break many things (including opened connections, so while it is ok to loose a connection or 2 every minute, not acceptable every sec), in recent, HAProxy has been doing lots of work to avoid this (esp. with 2.x releases), so I suspect it is not such a big issue with recent kernels and recent HAProxy versions.

On our side, we know use systems interconnected with Consul to pilot HAProxy (most notably: https://github.com/haproxytech/haproxy-consul-connect that uses dataplane recent additions to change on the fly HAProxy config)

But anyway, we also had this exact same issue with our prometheus instances (we used -wait 10s as well), so I think it is a welcomed addition anyway, avoiding lots of crappy workarounds.

from consul-templaterb.

pierresouchay commented on June 18, 2024

@MillsyBot Don't forget to tell me if it is fixing your issue, in that case, please close the issue!
Thank you!

from consul-templaterb.

MillsyBot commented on June 18, 2024

@pierresouchay Super helpful.

from consul-templaterb.

Signal Handling from Orchestrator about consul-templaterb HOT 9 CLOSED

Comments (9)

The most common problem

Another possible issue

If it something else

About --wait

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent