The Electronic Babylonian Library platform is hosted on two VMs on Docker Swarm in LRZ.
Six backup images of the servers are made daily, at 02:20, 06:20, 10:20, 14:20, 18:20 and 22:20. The images are kept for 14 days. The LRZ can restore an image at short notice; open a ticket at their Service Desk. If a backup has to be restored, restore the images of all nodes of the replica set, not just one of them, as otherwise the nodes will be out of sync.
Each server needs to be part of the Docker Swarm.
- Install Docker Engine - Community from Docker's repositories (currently installed: `Docker version 19.03.1, build 74b1e89`).
- Perform post-install steps.
- Add `ebladmin` to the `docker` group. (The group should already have been created by the install process.)
- Copy daemon.json to `/etc/docker/daemon.json` to set up log rotation and metrics. (The metrics are required by swarmprom.) If the daemon is already running, it needs to be restarted: `service docker restart`. The log configuration only affects new containers.
- Configure Docker to start on boot. Check the status with `sudo service docker status`.
- Configure the firewall. (Published ports are opened automatically by Docker with iptables and do not appear in the ufw rules.)
  - Allow connections from all other nodes:

        sudo ufw allow proto tcp from <node IP> to any port 2377,7946 comment 'Docker Swarm'
        sudo ufw allow proto udp from <node IP> to any port 7946,4789 comment 'Docker Swarm'

  - Allow metrics:

        sudo ufw allow from 172.18.0.0/16 to any port 9323 comment 'Docker Metrics'
- On the first VM, create a new swarm.
- On the other VMs, join the swarm. (See the output from creating the swarm or run `docker swarm join-token worker` on the manager.)
- On all VMs, add pruning of old Docker images to the crontab: `0 4 * * * docker image prune -f --filter "until=24h"`.
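The steps for creating and joining the swarm could be scripted roughly as follows. This is a minimal sketch, not the commands that were actually used: the `run` helper, the function names, and the `192.0.2.11` address are placeholders introduced for illustration, and by default the commands are only printed.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# On the first VM: create the swarm, advertising the given address.
init_swarm() {
  run docker swarm init --advertise-addr "$1"
}

# On the other VMs: join using the token printed by
# `docker swarm join-token worker` on the manager.
join_swarm() {
  run docker swarm join --token "$1" "$2:2377"
}

# Placeholder values, not the real node address or token.
init_swarm 192.0.2.11
join_swarm "<token>" 192.0.2.11
```

Set `DRY_RUN=0` to execute the commands once the printed output looks right.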
- On a manager node, install Swarmpit with the default options:

      docker run -it --rm \
        --name swarmpit-installer \
        --volume /var/run/docker.sock:/var/run/docker.sock \
        swarmpit/install:1.8

  Swarmpit is now accessible at port 888.
- Log in to create an admin user.
- Add placement to the `db` and `influxdb` services so that they have access to the original volumes (both are currently on `lmkwitg-ebl02`).
- The following steps can be performed via the swarm manager or the command line, as preferred.
We use a setup based on Docker Swarm Rocks.
- Set up the DNS to send `*.cluster.ebabylon.org` to the swarm.
- Create a network: `docker network create --driver=overlay traefik-public`.
- Create a config `traefik-config` from traefik.toml.
- Create a secret `basic_auth_users` containing the basic auth users. A hashed password can be created with `openssl passwd -apr1 <password>`.
- Create the stack `traefik-consul` from traefik-consul.yml.
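On the command line, the Traefik steps above might look like the following sketch. It assumes that traefik.toml and traefik-consul.yml are in the current directory and that the secret is a file of `user:hash` lines; `users.htpasswd`, `run`, `basic_auth_entry`, and `setup_traefik` are hypothetical names used only for this example. Commands are printed, not executed, unless `DRY_RUN=0` is set.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build one "user:hash" line (assumed format of the basic auth secret).
basic_auth_entry() {
  printf '%s:%s\n' "$1" "$(openssl passwd -apr1 "$2")"
}

setup_traefik() {
  run docker network create --driver=overlay traefik-public
  run docker config create traefik-config traefik.toml
  # users.htpasswd is a hypothetical file holding basic_auth_entry lines.
  run docker secret create basic_auth_users users.htpasswd
  run docker stack deploy -c traefik-consul.yml traefik-consul
}

setup_traefik
```

A line for the secret file can be produced with, for example, `basic_auth_entry admin '<password>' >> users.htpasswd`.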
Set up Swarmpit to use Traefik. See: Swarmpit web user interface for your Docker Swarm cluster.
- Remove ports from stack config.
- Add Traefik network and labels as in swarmpit.yml.
See: Docker Swarm Rocks Swarmprom for real-time monitoring and alerts.
- Create a webhook in Slack.
- Create a network: `docker network create --driver=overlay monitoring`.
- Create the configs:
  - `dockerd_config` from swarmprom Caddyfile.
  - `node_rules` from swarm_node.rules.yml.
  - `task_rules` from swarm_task.rules.yml.
- Create the stack `swarmprom` from swarmprom.yml. Because swarmprom's Dockerfile defines `GF_SECURITY_ADMIN_PASSWORD`, it is not possible to use `GF_SECURITY_ADMIN_PASSWORD__FILE`.
- Import the Traefik dashboard to Grafana.
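The network, config, and stack steps for swarmprom could be scripted along these lines. This is a sketch under assumptions: the three source files are taken to sit in the current directory under the names `Caddyfile`, `swarm_node.rules.yml`, and `swarm_task.rules.yml`, and `run` and `setup_swarmprom` are hypothetical helpers. The Slack webhook and the Grafana import are not covered.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_swarmprom() {
  run docker network create --driver=overlay monitoring
  # Each line below is "<config name> <source file>" (file paths assumed).
  while read -r name file; do
    run docker config create "$name" "$file"
  done <<EOF
dockerd_config Caddyfile
node_rules swarm_node.rules.yml
task_rules swarm_task.rules.yml
EOF
  run docker stack deploy -c swarmprom.yml swarmprom
}

setup_swarmprom
```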
- Create secrets `mongo_admin_user` and `mongo_admin_password`, which will be used to create the admin user on the first deploy.
- Create the stack `ebl-mongodb` from mongodb.yml. Initdb functionality does not work well with SSL, so we enable it in the next step. See: docker-library/mongo#239 and docker-library/mongo#172.
See: Deploy a Replica Set, Configure mongod and mongos for TLS/SSL, and Use x.509 Certificate for Membership Authentication.
- Create certificates for the root CA and all of the servers. See: MongoDB: Deploy a Replica Set With Transport Encryption: Part 3.
- Create secrets `mongoCA.crt`, `ebl01.pem`, and `ebl02.pem` from the respective certificates.
- Redeploy the stack with the replica set and SSL enabled from mongodb-replica_set.yml.
- Initiate the replica set. Log in to mongo and run the following. (The hosts must have the full address, otherwise it is not possible to connect to the replica set from outside the stack.)

      rs.initiate({
        _id: "rs-ebl1",
        members: [
          { _id: 0, host: "lmkwitg-ebl01.srv.mwn.de:27017" },
          { _id: 1, host: "lmkwitg-ebl02.srv.mwn.de:27018" }
        ]
      })
See: eses/mongodb_exporter and .
- Create a user for the exporter:

      db.getSiblingDB("admin").createUser({
        user: "mongodb_exporter",
        pwd: "<password>",
        roles: [
          { role: "clusterMonitor", db: "admin" },
          { role: "read", db: "local" }
        ]
      })
- Update the `swarmprom` stack:
  - Add the `mongodb-exporter` service:

        mongodb-exporter:
          image: bitnami/mongodb-exporter
          command:
            - --mongodb.direct-connect=false
            - --mongodb.uri=mongodb://<user>:<password>@lmkwitg-ebl01.srv.mwn.de:27017,lmkwitg-ebl02.srv.mwn.de:27018/?tls=true&tlsCAFile=/run/secrets/mongoCA.crt
          secrets:
            - mongoCA.crt
          networks:
            - net
          deploy:
            resources:
              reservations:
                memory: 64M
              limits:
                memory: 128M

  - Add the `mongodb-exporter` job to `prometheus`: `JOBS: traefik:8080 mongodb-exporter:9104`
  - Add `mongoCA.crt` to `secrets`.
- Import the MongoDB dashboard to Grafana.
  - Edit the dashboard JSON and change the metric prefix from `mongodb_` to `mongodb_mongod_`.
Create configs `registry_config` and `docker-registry-ui_config` from registry_config.yml and docker-registry-ui_config.
Create secrets:
- `httpass`: bcrypt-encrypted htpasswd file with the users for the registry.
- `registry_htpasswd`: password of the registry user used by the registry UI.
Create stack from registry.yml.
Create stack from ebl.yml. The Docker images should be in the registry before deploying the stack. The ai-api service is optional and can be left out. The `EBL_AI_API` environment variable on the api has to be present. The ebl-ai-api repository is here.
- Update the `swarmprom` stack:
  - Add the `redis-exporter` service:

        redis-exporter:
          image: bitnami/redis-exporter:1
          environment:
            REDIS_ADDR: redis://redis:6379
          networks:
            - net
            - monitoring

  - Add the `redis-exporter` job to `prometheus`: `JOBS: traefik:8080 mongodb-exporter:9216 redis-exporter:9121`
- Scale replicas to 0.
- Scale the replicas back to the desired value.
When the new replicas join, Consul should clean up old nodes and elect a new leader. To avoid stale nodes in the config, the replicas should be shut down before the leader.
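Scaling the replicas down and back up can be done from the command line as sketched below. The service name `traefik-consul_consul-replica` and the replica count `2` are assumptions based on the stack name used here, so check `docker service ls` for the real values. `run` and `cycle_replicas` are hypothetical helpers; commands are only printed unless `DRY_RUN=0` is set.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Scale a service to 0 and then back to the desired number of replicas.
cycle_replicas() {
  service="$1"
  replicas="$2"
  run docker service scale "$service=0"
  run docker service scale "$service=$replicas"
}

# Assumed service name and replica count; verify with `docker service ls`.
cycle_replicas traefik-consul_consul-replica 2
```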
Long commands (e.g. `node-exporter`) get messed up and `$` is unescaped in the "Current engine state". If you have to redeploy a service/stack, edit the "Last deployed" state instead, or copy the correct values from this repository.
The Grafana admin password is set only on the first run. It can be reset later via the CLI: `docker exec -ti <container id> grafana-cli admin reset-admin-password <new password>`.
There is a bug in the exporter (PMM-4375). We should update it as soon as the fix is released. The exporter has been updated.
Docker can be restarted from the command line by running `sudo service docker restart` on all affected instances. If it is not possible to connect via SSH, ask ITG to reboot/investigate.
Disk space can be freed by removing old Docker images etc. See: https://stackoverflow.com/questions/32723111/how-to-remove-old-and-unused-docker-images/32723127#32723127
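A cautious cleanup might look like the sketch below. It removes stopped containers and images older than 24 hours that no container uses, and deliberately leaves volumes alone. `run` and `free_disk_space` are hypothetical helpers, and the 24-hour cutoff is only borrowed from the crontab entry; commands are printed unless `DRY_RUN=0` is set.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

free_disk_space() {
  # Show what is using space before cleaning up.
  run docker system df
  # Remove stopped containers.
  run docker container prune -f
  # Remove all unused images (not just dangling ones) older than 24 hours.
  run docker image prune -a -f --filter "until=24h"
  # Show the result.
  run docker system df
}

free_disk_space
```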
The certificates are handled automatically by Let's Encrypt and certbot (managed in the Traefik configuration, cf. the Traefik docs). If this fails, the website becomes unavailable via HTTPS once the certificate expires. It might be necessary to manually restart the `traefik-consul_traefik` service in Swarmpit or on the server.
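From a manager node, checking the certificate and restarting the service could look like this sketch. `cert_expiry`, `restart_traefik`, and `run` are hypothetical helpers; the restart is only printed unless `DRY_RUN=0` is set, and `cert_expiry` needs network access to the given host.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Print the expiry date of the certificate served by a host on port 443.
cert_expiry() {
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -enddate
}

# Force the service to redeploy its tasks without changing its configuration.
restart_traefik() {
  run docker service update --force traefik-consul_traefik
}

restart_traefik
```

Call `cert_expiry <hostname>` before and after the restart to see whether a new certificate was issued.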