Comments (9)

pierresouchay commented on June 18, 2024

Hello @MillsyBot,

Thanks a lot, it is really great to receive a nice comment about our work and it is a real pleasure to share our work with people being grateful.

We use this tool daily, with signals, to regenerate on the fly dozens of very large configurations across many teams and many Prometheus instances (Prometheus receiving the signal). This has been working for a very long time, so I suspect a big bug would have been reported, but it is still possible, especially since we are not using very high --wait values (our max is around 10s).

I also added the signal reload for your exact usage some time ago, so it is intended to do exactly that. But there are a few caveats (they are supposed to be explained in the README.md, but I will take the time to re-explain them here; feel free to modify the README in a PR if it is not clear).

The most common problem

First, in your example, you are using --wait 60. This means that templates are evaluated every minute, which has several impacts: the templates will be generated at most once per minute, so the signals will also be sent at most once per minute, whatever the number of changes per second. I suspect this is what you want: even in a very noisy cluster, haproxy will reload its config every 60s at most. I have never tried such high values in production (it should not have any impact, but it is still possible).

Second, the order of --sig-reload and --exec is very important.

Consider this example:

consul-templaterb --exec /my/monitoring --sig-reload NONE --exec "python -m http.server 8080" [somefiles]

Both /my/monitoring and the python server are launched once all files are properly generated, and both are eligible for a signal each time one of those files changes. However:

  • /my/monitoring receives SIGHUP (the default value), because --sig-reload NONE comes after --exec /my/monitoring
  • python -m http.server 8080 receives no signal, because --sig-reload NONE comes before the --exec of the python command

This means, for instance, that appending --sig-reload TERM at the end of your command line has no effect, because no other --exec flag follows it.
This behavior, while powerful, is not very common (and I have been bitten by it a few times myself). Several times, when looking at it, I did not understand why my python command received SIGHUP while I wanted NONE, which I had appended at the end of my command line.
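
To make the ordering rule above concrete, here is a minimal sketch of a parser that binds each --exec to the reload signal in effect at the moment it is read. It is only an illustration of the behavior described in this comment, not consul-templaterb's actual option handling; the bind_signals function and its return shape are made up for the example.

```ruby
# Illustrative model only: NOT consul-templaterb's real option parser.
# Each --exec captures the reload signal in effect when it is seen.
def bind_signals(argv, default_signal: 'HUP')
  current = default_signal
  bindings = []
  args = argv.dup
  until args.empty?
    flag = args.shift
    case flag
    when '--sig-reload'
      current = args.shift # applies only to the --exec flags that follow
    when '--exec'
      bindings << { command: args.shift, signal: current }
    end
  end
  bindings
end

argv = ['--exec', '/my/monitoring',
        '--sig-reload', 'NONE',
        '--exec', 'python -m http.server 8080',
        '--sig-reload', 'TERM'] # trailing flag: no later --exec, so no effect

bind_signals(argv).each do |b|
  puts "#{b[:command]} -> #{b[:signal]}"
end
```

In this model, /my/monitoring is bound to HUP and the python server to NONE, while the trailing TERM is bound to nothing.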

Another possible issue

From the doc:

If you generate several files at the same time, the signal will be sent only once the rendering of all templates is completed, so if your process is using several configuration files, all files will be modified and consistent before the signal is sent to process.

This means that the signal is only sent when none of the files is considered dirty, i.e. when all the data for all the files has been retrieved and properly computed.

Consider this example:

<%
counts = {}
services.each do |service_name, tags|
  counts[service_name] = service(service_name).count
end
%><%= JSON.pretty_generate(counts) %>

If you have a cluster where 1 service is renamed every 30s and you used --wait 60, for instance:
service-30, then service-60, service-90 [...] service-3600

At startup, consul-templaterb ignores --wait and tries to fetch data as fast as possible. But once the template has been rendered, it switches to watching.

Every 60s, the template is evaluated, then it waits...

So, at t+60, consul-templaterb finds the new services service-30 and service-60. It schedules the retrieval of those 2 additional services, marks the template as dirty (incomplete), does not render it and waits for 60s.
At t+120, consul-templaterb has the values for service-30 and service-60, but in the meantime service-90 and service-120 have been added. So those values are there, but 2 new retrievals have been scheduled (service-90 and service-120): the template is still dirty and is rescheduled in 60s... still no signal sent. If the list of services is changing fast enough with --wait 60, your templates might be fully rendered only rarely, so the signal would be sent very rarely, or never.

Note that consul-templaterb sends the signal ONLY if all templates are non-dirty. Since you are using several templates, as long as at least one of them is not rendered, nothing happens.
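
The timeline above can be reproduced with a toy model. This is a simplified sketch under my own assumptions (a service discovered at one evaluation is fully fetched by the next one, and nothing else changes); clean_evaluations is a hypothetical helper, not the tool's real scheduler.

```ruby
# Toy model of the scenario above; hypothetical, not the tool's actual scheduler.
# A new service appears every `new_service_every` seconds; the template is
# evaluated every `wait` seconds. An evaluation is "clean" only when every
# service it sees was already fetched by a previous evaluation.
def clean_evaluations(wait:, new_service_every:, duration:)
  fetched = 0 # number of services whose data has already been retrieved
  clean_times = []
  (wait..duration).step(wait) do |t|
    existing = t / new_service_every # services that exist at time t
    if existing > fetched
      fetched = existing             # retrieval scheduled; template stays dirty
    else
      clean_times << t               # nothing new: the template can be rendered
    end
  end
  clean_times
end

# One new service every 30s with --wait 60: never clean in this model.
p clean_evaluations(wait: 60, new_service_every: 30, duration: 600)
# Services changing more slowly than the wait interval: some clean evaluations.
p clean_evaluations(wait: 60, new_service_every: 180, duration: 600)
```

With a new service every 30s the first call returns an empty list, which matches the "signal never sent" symptom; the second call shows that the model converges once changes arrive less often than the evaluation interval.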

This might be fixed in at least 3 ways:

  • Hack the file consul-templaterb and change https://github.com/criteo/consul-templaterb/blob/master/bin/consul-templaterb#L69 to use min-value = 180 (refresh the list of services every 3 minutes at most) => this would work for my example: since the value would be greater than 60s, new services would no longer pop up every 30s and it would converge within 60s at most. If that is the case, a clean fix would be to allow overriding those values from the command line, and it would work in any case. Another possible fix would be to have values here that are always at least 1.5 times the --wait value
  • Do not use signals and use a binary instead; in this binary, make sure to reload ONLY if some time has elapsed. So, you could use --wait 1 BUT reload haproxy only every 60s, example --template haproxy.conf.erb:haproxy.conf::./reload_haproxy_when_needed.sh

with ./reload_haproxy_when_needed.sh being (not tested):

#!/bin/bash

# Reload if .last_reload does not exist yet or is older than 1 min
if [ ! -e .last_reload ] || [ -n "$(find .last_reload -mmin +1)" ]
then
  kill -USR2 $(pidof haproxy)
  touch .last_reload
fi

=> thus, each time haproxy.conf is modified, the script is launched, but it only reloads haproxy every 60s at most, meaning that you could use --wait 1 to pick up changes faster.

If it is something else

You might consider running it with debug flags (which display whether there are some dirty templates), and if you don't find the cause, try posting the debug messages here.

About --wait

--wait is a bit tricky as it handles both file changes and the evaluation of templates. Linking the two was clearly not my best idea; initially, it was more about either slightly reducing CPU usage on very complex templates (whose data is not changing all the time) OR avoiding reloading the process all the time (but more with 10s than 60s).

This might be fixed by having an option to delay --sig-reload so that signals are sent only every x seconds (might not be that hard), decorrelating it from --wait.
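
As a rough sketch of that idea, the snippet below forwards at most one reload per interval, however often a render asks for one. It is only an illustration: ThrottledReloader, its parameters and the fake clock are hypothetical names made up for this example, and this is not how consul-templaterb implements anything.

```ruby
# Hypothetical helper, not part of consul-templaterb: forwards at most one
# reload signal per `min_interval` seconds, however often it is asked to.
class ThrottledReloader
  def initialize(min_interval:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }, &send_signal)
    @min_interval = min_interval
    @clock = clock
    @send_signal = send_signal
    @last_sent = nil
  end

  # Returns true if the signal was actually sent, false if it was throttled.
  def request_reload
    now = @clock.call
    return false if @last_sent && (now - @last_sent) < @min_interval

    @send_signal.call
    @last_sent = now
    true
  end
end

# Usage with a fake clock so the behaviour is easy to follow.
fake_time = 0
sent = []
reloader = ThrottledReloader.new(min_interval: 60, clock: -> { fake_time }) { sent << fake_time }
(0..150).each do |t| # one render per second, as with a 1s wait
  fake_time = t
  reloader.request_reload
end
p sent
```

One limitation of this simple version: a request that is throttled is dropped, so the very last change could stay unsignaled until the next render; a real implementation would probably want to remember that a reload is pending.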

I would be glad to accept such a PR if you are interested.

Best
Pierre

from consul-templaterb.

pierresouchay commented on June 18, 2024

Hello @MillsyBot,

I added support for what I described in my previous comment in 70808c6

This will let you use -w 1 (render every second) but send signals to the process only every 60s, using the new flag -W 60.
It will be available in the next release (in a few minutes). This might solve your issues if my theory is correct.


pierresouchay commented on June 18, 2024

@robloxrob glad it helps


MillsyBot commented on June 18, 2024

@chuckyz says thanks as well.


robloxrob commented on June 18, 2024

Thanks for all the love here! This really makes my day.


MillsyBot commented on June 18, 2024

@pierresouchay WOW, thank you so much for the detailed response, and for releasing a new version addressing my issues! So cool, I will buy you a beer at the next HAProxy Conf (we met there last year).

Out of curiosity: what kind of performance impact are you seeing with HAProxy reloading every second? Is the impact on connection reuse significant when the FDs are constantly being passed from one process to another?


pierresouchay commented on June 18, 2024

@MillsyBot It really depends on the version. In my tests a few years ago (version 1.7.x), it used to break many things, including open connections (while it is ok to lose a connection or 2 every minute, it is not acceptable every second). In recent versions, HAProxy has done a lot of work to avoid this (esp. with the 2.x releases), so I suspect it is not such a big issue with recent kernels and recent HAProxy versions.

On our side, we now use systems interconnected with Consul to pilot HAProxy (most notably: https://github.com/haproxytech/haproxy-consul-connect, which uses recent dataplane additions to change the HAProxy config on the fly).

But anyway, we also had this exact same issue with our Prometheus instances (we used --wait 10 as well), so I think it is a welcome addition anyway, avoiding lots of crappy workarounds.


pierresouchay commented on June 18, 2024

@MillsyBot Don't forget to tell me whether it fixes your issue; if it does, please close the issue!
Thank you!


MillsyBot commented on June 18, 2024

@pierresouchay Super helpful.

