
docker-swarm-cluster

Combines tooling for creating a solid Docker Swarm cluster.

Overview

HTTP(S) Ingress

  • Caddy

Cluster Management

  • Swarm Dashboard
  • Portainer
  • Docker Janitor

Metrics Monitoring

Installation

  • Install Ubuntu on all VMs you mean to use in your Swarm cluster

    • See #Cloud provider tips section for details
  • Install the latest Docker package on all VMs (https://docs.docker.com/engine/install/ubuntu/)

  • The best practice is to have:

    • 3 VMs as Swarm managers (these may be small VMs)
    • any number of VMs as Swarm workers (larger VMs)
    • Run only essential services on the managers
      • This way, if your services exhaust the cluster resources, you will still have access to Portainer and Grafana to react to the crisis
      • Keep your services off the manager machines by using placement constraints:
yourservice:
  ...
  deploy:
    placement:
        constraints:
          - node.role != manager
      • Verify that the firewall is either disabled on those internal hosts, or has the correct ports open for the mesh routing and internal Docker overlay network requirements (https://docs.docker.com/network/overlay/#publish-ports-on-an-overlay-network): 2377/tcp for cluster management, 7946/tcp+udp for node communication and 4789/udp for overlay network traffic. These problems are hard to identify, especially when only ONE VM is affected

Ingress

  • Use Caddy to handle TLS (with Let's Encrypt) and load balancing

    • Suitable for most applications
    • Just point your DNS entries to the public IPs of the VMs that are part of the cluster and they will handle requests and balance between container instances.
  • Use a cloud LB to handle front TLS certificates and load balancing

    • Suitable for heavily loaded or critical sites
    • Your cloud provider LB will handle TLS certificates and balance between Swarm nodes. Each node will have Caddy listening on port 80 through the Swarm mesh, so that when a request arrives over HTTP, Caddy will proxy it to the correct container services based on the Host header (according to the configured labels)
    • In this case, disable HTTPS support in Caddy with the following label so that it won't try to generate a certificate by itself
  caddy-server:
    deploy:
      labels:
        - caddy.auto_https=off
        - caddy_controlled_server=
    ...
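The Host-header routing above is driven by service labels. A minimal sketch, assuming the lucaslorentz/caddy-docker-proxy label conventions; the `myapp` service, its image and domain are hypothetical:

```yaml
# Hypothetical service routed by Caddy based on the Host header.
# Label names follow caddy-docker-proxy conventions; adjust the
# domain and the upstream port to your setup.
myapp:
  image: myorg/myapp:latest
  deploy:
    labels:
      - caddy=myapp.mycluster.org            # Host to match
      - caddy.reverse_proxy={{upstreams 80}} # proxy to the service's port 80
    placement:
      constraints:
        - node.role != manager
```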
  • On one of the VMs:

    • Execute docker swarm init on the first VM with role manager
      • If your machine is connected to more than one network, it may ask you to use --advertise-addr to indicate which network to use for Swarm communications
    • Copy the provided command/token to run on the worker machines (not managers)
    • Execute docker swarm join-token manager and keep the output to run on the manager machines
  • On machines selected to be managers (min 3)

    • Run the command from the previous step for managers, adding --advertise-addr [localip] with a local IP that connects those machines, so that Swarm traffic doesn't go over a public IP (through the Internet link)
      • Ex.: docker swarm join --advertise-addr 10.120.0.5 --token ...
  • On machines selected to be workers

    • Run the command obtained on any manager via docker swarm join-token worker, adding --advertise-addr [localip] with a local IP that connects those machines, so that Swarm traffic doesn't go over a public IP (through the Internet link)
      • Ex.: docker swarm join --advertise-addr 10.120.0.5 --token ...
  • Apply Docker daemon configurations on all machines

    • This has to be done after joining the Swarm, so that the 172.18.0.1 gateway address already exists (!)
    • Use journald for logging on all VMs (defaults to a maximum usage of 10% of the disk)
    • Enable the native Docker Prometheus exporter
    • Unleash ulimits for memory lock (fixes problems with Caddy) and stack size
    • Run the following on each machine (workers and managers):
cat > /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "journald",
  "metrics-addr": "172.18.0.1:9323",
  "experimental": true,
  "default-ulimits": {
    "memlock": { "Name": "memlock", "Hard": -1, "Soft": -1 },
    "stack": { "Name": "stack", "Hard": -1, "Soft": -1 }
  }
}
EOF
service docker restart
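A malformed daemon.json prevents the Docker daemon from starting, so it is worth validating the file before restarting. A sketch using a temp file; the final mv is commented out and illustrative:

```shell
# Write the daemon config to a temp file, validate it as JSON, and only
# then install it; a syntax error here would keep Docker from starting.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
{
  "log-driver": "journald",
  "metrics-addr": "172.18.0.1:9323",
  "experimental": true
}
EOF
if python3 -m json.tool "$tmp" > /dev/null; then
  echo "daemon.json is valid"
  # sudo mv "$tmp" /etc/docker/daemon.json && sudo service docker restart
else
  echo "daemon.json is invalid; not restarting" >&2
fi
```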
  • Start basic cluster services

    • git clone https://github.com/flaviostutz/docker-swarm-cluster.git
    • Take a look at docker-compose-* files for understanding the cluster topology
    • Setup .env parameters
    • Run create.sh
  • On one of the VMs, run curl -kLv --user whoami:whoami123 localhost and verify that the request succeeds

Security

Optimal elastic topology

If you need elasticity (to grow or shrink server capacity depending on app traffic), a good topology is to have two cluster "sizes": one we call "idle", with minimal sizing for when few users are on, and a "hot" configuration for when traffic is high.

For the "idle" state, we use:

  • 1 VM with 1vCPU 2GB RAM (Swarm Manager + Prometheus)
  • 2 VMs with 1vCPU 1GB RAM (Swarm Manager)
  • 1 VM as worker with 2vCPU 4GB RAM (App services)

For the "hot" state, we use:

  • 1 VM with 1vCPU 2GB RAM (Swarm Manager + Prometheus) - same as "idle"
  • 2 VMs with 1vCPU 1GB RAM (Swarm Manager) - same as "idle"
  • Any number of worker VMs for handling user load

HA practices

  • Use the "spread" placement preference in your services so that replicas are placed on different nodes
    • In this example, spread groups nodes by role (manager/worker), but you can group by any other label value
...
      placement:
        preferences:
          - spread: node.role
...
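Preferences and constraints can be combined. A sketch for a hypothetical service that spreads replicas across a custom "zone" node label while staying off the managers (service name, image and the zone label are assumptions):

```yaml
# Hypothetical service: 4 replicas, never on managers, balanced across
# nodes grouped by a custom "zone" node label.
myservice:
  image: myorg/myservice:latest
  deploy:
    replicas: 4
    placement:
      constraints:
        - node.role != manager       # hard rule: skip managers entirely
      preferences:
        - spread: node.labels.zone   # soft rule: balance across zones
```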

Service URLs

Services will be accessible at these URLs:

  • http://portainer.mycluster.org
  • http://dashboard.mycluster.org
  • http://grafana.mycluster.org
  • http://unsee.mycluster.org
  • http://alertmanager.mycluster.org
  • http://prometheus.mycluster.org

Services that don't have built-in authentication will use Caddy's basic auth. Change the password accordingly. Defaults to admin/admin123admin123
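With caddy-docker-proxy-style labels, basic auth for a service can be sketched like this. The basicauth path, user name and hash are assumptions; generate a real hash with caddy hash-password:

```yaml
# Hypothetical: protect a service with Caddy basic auth.
# Generate the bcrypt hash with: caddy hash-password --plaintext 'yourpass'
# Note: "$" in the hash must be doubled to "$$" inside a compose file.
mytool:
  image: myorg/mytool:latest
  deploy:
    labels:
      - caddy=mytool.mycluster.org
      - caddy.reverse_proxy={{upstreams 80}}
      - caddy.basicauth=/*
      - caddy.basicauth.admin=$$2a$$14$$EXAMPLEHASHEXAMPLEHASHEX
```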

The following services have published ports on the hosts so that you can use the Swarm mesh network to access the admin services directly when Caddy is not accessible:

  • portainer: 8181
  • grafana: 9191

So point your browser at any public IP of a member VM on one of these ports to access the service.

Common Operations

Force service rebalancing among nodes

for service in $(docker service ls -q); do
  docker service update --force --detach=false "$service"
done

WARNING: user-facing service disruption will happen while doing this, as some containers will be stopped during the operation

Add a new VM to the cluster

  • Create the new VM at your cloud provider on the same VPC (see Cloud provider tips for specific instructions)
  • SSH into a Swarm manager node and execute docker swarm join-token worker to get a Swarm join token
  • Copy the command and execute it on the new VM
    • Add --advertise-addr [local-network-interface-ip] to the command if your host has multiple NICs
    • Execute the command on the worker VM. Ex.: docker swarm join --token aaaaaaaaaaaa 10.120.0.2:2377 --advertise-addr 10.120.0.1
  • All services in "global" mode will get a container on this node immediately
  • Even if other hosts are full (containers using too much memory/CPU), their containers won't be rebalanced as soon as this node is added to the cluster. New containers will be placed on this node only when services are restarted (this is by design, to minimize user disruption)
  • Add the newly created VM to the HTTP load balancer (if you use one from a cloud provider) so that incoming requests are routed to Caddy through the Swarm mesh network
  • Check firewall configuration (either disabled, or configured properly with service mesh and internal overlay network requirements as in https://docs.docker.com/network/overlay/#publish-ports-on-an-overlay-network)
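The ports from the overlay-network requirements can be opened explicitly. A ufw sketch, assuming Ubuntu's default firewall; in practice restrict the source to your VPC address range (no test, firewall config fragment):

```shell
# Swarm control/data-plane ports, per the Docker overlay network docs.
ufw allow 2377/tcp   # cluster management (swarm join / raft)
ufw allow 7946/tcp   # node-to-node communication
ufw allow 7946/udp
ufw allow 4789/udp   # overlay network (VXLAN) data traffic
```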

Production tips

Optimal Topology

  • Have a small VM in your Swarm cluster dedicated to basic cluster services. Avoid running any other services on this server, so that if your cluster runs out of resources you will still have access to monitoring and admin tools (Grafana, Portainer etc.) and can diagnose what is going on and decide on cluster expansion, for example.

PLACE IMAGE HERE

OOM

  • If a node suffers severe resource exhaustion, the Docker daemon presents strange behavior (services not scheduled well, some commands failing saying the node is not part of a Swarm cluster etc.). It's better to reboot these VMs after solving the root cause.
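Resource exhaustion can largely be prevented by declaring limits and reservations per service. A compose sketch; the service name and values are illustrative:

```yaml
# Hypothetical limits: the scheduler reserves 256M / 0.25 CPU per replica,
# and the kernel caps each container at 512M / 0.50 CPU, so one runaway
# service can't take a whole node down.
myservice:
  image: myorg/myservice:latest
  deploy:
    resources:
      limits:
        cpus: "0.50"
        memory: 512M
      reservations:
        cpus: "0.25"
        memory: 256M
```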

Tricks

  • Caddy has an "internal" TLS mode that uses a self-signed certificate, useful while not in production. Just add the label - caddy.tls=internal to your service.
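In a compose file this looks like the following; the service name and domain are illustrative:

```yaml
# Hypothetical staging service served with Caddy's self-signed certificate
mystaging:
  deploy:
    labels:
      - caddy=staging.mycluster.org
      - caddy.tls=internal   # self-signed cert instead of Let's Encrypt
```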

Customizations

  1. Change the desired compose file for specific cluster configurations
  2. Run create.sh to update the modified services

docker-compose files

  • Swarm stacks don't support .env automatically (yet). You have to run export $(cat .env) && docker stack... for those parameters to work
  • docker-compose-ingress.yml
  • docker-compose-admin.yml
  • docker-compose-metrics.yml
  • docker-compose-devtools.yml
    • export $(cat .env) && docker stack deploy --compose-file docker-compose-devtools.yml devtools
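Note that export $(cat .env) breaks when values contain spaces or the file has comments; a more robust sketch uses set -a (allexport) to source the file. The sample .env written here is only to keep the sketch self-contained:

```shell
# Source .env with allexport so every assignment is exported, including
# quoted values with spaces, which `export $(cat .env)` would mangle.
cat > .env <<'EOF'
CLUSTER_NAME=mycluster
ADMIN_USER="admin user"
EOF
set -a          # allexport on: plain VAR=value lines get exported
. ./.env
set +a          # allexport off again
echo "$CLUSTER_NAME"   # -> mycluster
echo "$ADMIN_USER"     # -> admin user
```

After sourcing, run docker stack deploy as usual and the variables will be visible to the compose file.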

TODO

Volume management

  • AWS/DigitalOcean block storage

Logs aggregation

  • FluentBit
  • Kafka
  • Graylog

Metrics Monitoring

  • Telegrambot

Cloud provider tips

Digital Ocean

  • For HTTPS certificates, use Let's Encrypt on the Load Balancer if you are using an apex domain (something like stutz.com.br). We couldn't manage to make it work with subdomains (like poc.stutz.com.br).

  • For subdomains, use certbot and create a wildcard certificate (ex.: *.poc.stutz.com.br) manually and then upload it to Digital Ocean's Load Balancer.

apt-get install letsencrypt
certbot certonly --manual --preferred-challenges=dns -m [email protected] --server https://acme-v02.api.letsencrypt.org/directory --agree-tos -d '*.poc.me.com'

VMs

  • Use image Marketplace -> Docker
  • Check "Monitoring" to have native basic VM monitoring from DO panel


