Electronic Babylonian Library Infrastructure

The Electronic Babylonian Library platform is hosted on two VMs on Docker Swarm in LRZ.

Backup copies

Six backup images of the servers are made every day, at 02:20, 06:20, 10:20, 14:20, 18:20 and 22:20. The images are kept for 14 days. The LRZ can restore an image on short notice if a ticket is opened on their Service Desk. If a backup copy has to be restored, restore the images of all nodes of the Replica Set, not just one of them, as otherwise the nodes will be out of sync.

LRZ Server Setup

Each server needs to be part of the Docker Swarm.

  • Install Docker Engine - Community from Docker's repositories (currently installed Docker version 19.03.1, build 74b1e89)
  • Perform post-install steps.
    • Add ebladmin to docker group. (The group should be already created by install process.)
    • Copy daemon.json to /etc/docker/daemon.json to set up log rotation and metrics (The metrics are required by swarmprom.). If the daemon is already running, it needs to be restarted: service docker restart. The log configuration only affects new containers.
    • Configure Docker to start on boot. Check the status with sudo service docker status.
  • Configure the firewall. (Published ports are opened automatically by Docker with iptables and do not appear in ufw rules.)
    • Allow connections from all other nodes:
      sudo ufw allow proto tcp from <node IP> to any port 2377,7946 comment 'Docker Swarm'
      sudo ufw allow proto udp from <node IP> to any port 7946,4789 comment 'Docker Swarm'
      
    • Allow metrics:
      sudo ufw allow from 172.18.0.0/16 to any port 9323 comment 'Docker Metrics'
      
  • On the first VM, create a new swarm
  • On the other VMs, join the swarm. (See the output from creating the swarm or run docker swarm join-token worker on the manager.)
  • On all VMs add pruning old Docker images to crontab: 0 4 * * * docker image prune -f --filter "until=24h".
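
The daemon.json mentioned in the post-install steps is not reproduced in this document. As a rough sketch of what such a file might contain, the function below writes one with JSON log rotation and a metrics endpoint on port 9323 (the port opened by the 'Docker Metrics' firewall rule). The function name write_daemon_json, the rotation limits, and the experimental flag are assumptions for illustration, not the repository's actual settings.

```shell
# Sketch only: write a daemon.json with log rotation and a metrics endpoint.
# All values are assumptions; compare with the daemon.json in the repository.
write_daemon_json() {
  target="${1:-./daemon.json}"
  cat > "$target" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
EOF
}
```

A possible use is write_daemon_json ./daemon.json, followed by copying the file to /etc/docker/daemon.json and restarting the daemon as described in the list.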

Docker Swarm Setup

Swarm Manager

  • On a manager node install Swarmpit with default options:
    docker run -it --rm \
      --name swarmpit-installer \
      --volume /var/run/docker.sock:/var/run/docker.sock \
    swarmpit/install:1.8
    
    Swarmpit is now accessible at port 888.
  • Log in to create an admin user.
  • Add placement constraints to the db and influxdb services so that they have access to the original volumes (both are currently on lmkwitg-ebl02).
  • The following steps can be performed via the swarm manager or command line as preferred.
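
On the command line, the placement step could look roughly like the sketch below. It only prints the docker service update commands so they can be reviewed first. The helper name print_placement_cmds and the service names in the usage note are hypothetical; the real service names depend on the name of the Swarmpit stack.

```shell
# Print (do not run) the commands that pin services to one node by hostname.
# Usage: print_placement_cmds <node hostname> <service>...
print_placement_cmds() {
  node="$1"
  shift
  for service in "$@"; do
    echo "docker service update --constraint-add node.hostname==$node $service"
  done
}
```

For example, print_placement_cmds lmkwitg-ebl02 swarmpit_db swarmpit_influxdb prints two commands, which can be piped to sh once they look right.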

HTTPS and Monitoring

We use a setup based on Docker Swarm Rocks.

See: Traefik Proxy with HTTPS

  • Set up the DNS to send *.cluster.ebabylon.org to the swarm.
  • Create a network: docker network create --driver=overlay traefik-public.
  • Create a config traefik-config from traefik.toml.
  • Create a secret basic_auth_users containing the basic auth users. A hashed password can be created with openssl passwd -apr1 <password>
  • Create stack traefik-consul from traefik-consul.yml
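
The network, config, and secret steps above can be sketched on the command line as follows. The first function builds one "<user>:<hash>" line for the basic auth secret using the openssl call from the list; the second only prints the Docker commands. The function names and the file name basic_auth_users.txt are assumptions for illustration.

```shell
# Build one line of the basic auth secret: "<user>:<apr1 hash>".
basic_auth_line() {
  user="$1"
  password="$2"
  printf '%s:%s\n' "$user" "$(openssl passwd -apr1 "$password")"
}

# Print (do not run) the Docker commands for the network, config and secret.
# basic_auth_users.txt is a hypothetical file holding the lines built above.
print_traefik_setup() {
  echo "docker network create --driver=overlay traefik-public"
  echo "docker config create traefik-config traefik.toml"
  echo "docker secret create basic_auth_users basic_auth_users.txt"
}
```

A possible workflow is basic_auth_line admin '<password>' > basic_auth_users.txt, followed by running the printed commands on a manager node.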

Swarmpit

Set up Swarmpit to use Traefik. See: Swarmpit web user interface for your Docker Swarm cluster.

  • Remove ports from stack config.
  • Add Traefik network and labels as in swarmpit.yml.

See: Docker Swarm Rocks Swarmprom for real-time monitoring and alerts.

Frontend Environment

  • Define frontend environment variables directly in main.yml. Put sensitive values in secrets.

MongoDB

  • Create secrets mongo_admin_user and mongo_admin_password which will be used to create the admin user on the first deploy.
  • Create stack ebl-mongodb from mongodb.yml. Initdb functionality does not work well with SSL, so we enable it in the next step. See: docker-library/mongo#239 and docker-library/mongo#172.

Replica Set and SSL

See: Deploy a Replica Set, Configure mongod and mongos for TLS/SSL, and Use x.509 Certificate for Membership Authentication.

  • Create certificates for the root CA and all of the servers. See: MongoDB: Deploy a Replica Set With Transport Encryption: Part 3.
  • Create secrets mongoCA.crt, ebl01.pem, and ebl02.pem from the respective certificates.
  • Redeploy stack with replica set and SSL enabled from mongodb-replica_set.yml.
  • Initiate the replica set. Log in to mongo and run the following. (The hosts must be given with their full address, otherwise it is not possible to connect to the replica set from outside the stack.)
    rs.initiate( {
       _id : "rs-ebl1",
       members: [
          { _id: 0, host: "lmkwitg-ebl01.srv.mwn.de:27017" },
          { _id: 1, host: "lmkwitg-ebl02.srv.mwn.de:27018" }
       ]
    })
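
Because the replica set is only reachable from outside the stack when the members use full addresses, it may help to check the host strings before running rs.initiate. The helper below is a minimal sketch with a hypothetical name. It only checks the shape "<host with at least one dot>:<numeric port>" and does not resolve the name or contact the server.

```shell
# Succeed if the argument looks like "<fully.qualified.host>:<port>".
is_full_address() {
  # Require a dot somewhere before a colon.
  case "$1" in
    *.*:*) ;;
    *) return 1 ;;
  esac
  # Everything after the last colon must be a non-empty run of digits.
  port="${1##*:}"
  case "$port" in
    ''|*[!0-9]*) return 1 ;;
  esac
  return 0
}
```

For example, is_full_address lmkwitg-ebl01.srv.mwn.de:27017 succeeds, while a short name such as ebl01:27017 is rejected.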
    

Monitoring

See: eses/mongodb_exporter.

  • Create a user for the exporter.
    db.getSiblingDB("admin").createUser({
        user: "mongodb_exporter",
        pwd: "<password>",
        roles: [
            { role: "clusterMonitor", db: "admin" },
            { role: "read", db: "local" }
        ]
    })
    
  • Update the swarmprom stack:
    • Add mongodb-exporter service:
        mongodb-exporter:
          image: bitnami/mongodb-exporter
          command:
           - --mongodb.direct-connect=false
           - --mongodb.uri=mongodb://<user>:<password>@lmkwitg-ebl01.srv.mwn.de:27017,lmkwitg-ebl02.srv.mwn.de:27018/?tls=true&tlsCAFile=/run/secrets/mongoCA.crt
          secrets:
           - mongoCA.crt
          networks:
           - net
          deploy:
            resources:
              reservations:
                memory: 64M
              limits:
                memory: 128M
      
    • Add mongodb-exporter job to prometheus:
            JOBS: traefik:8080 mongodb-exporter:9216
      
    • Add mongoCA.crt to secrets.
  • Import MongoDB dashboard to Grafana.
  • Edit dashboard JSON and change metric prefix from mongodb_ to mongodb_mongod_.
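
The prefix change in the last step could be scripted instead of edited by hand. The sketch below is one possible approach, assuming the metric names appear as whole identifiers in the dashboard JSON. The function name rewrite_metric_prefix is hypothetical. It rewrites every identifier starting with mongodb_, so the result should be reviewed before importing it.

```shell
# Read dashboard JSON on stdin and write it to stdout with the metric prefix
# changed from mongodb_ to mongodb_mongod_.
# Names that already start with mongodb_mongod_ are left unchanged, so the
# filter can be applied more than once.
rewrite_metric_prefix() {
  sed -E 's/(^|[^A-Za-z0-9_])mongodb_(mongod_)?/\1mongodb_mongod_/g'
}
```

A possible use is rewrite_metric_prefix < dashboard.json > dashboard-mongod.json, where both file names are placeholders.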

Docker registry

Create configs registry_config and docker-registry-ui_config from registry_config.yml and docker-registry-ui_config.

Create secrets:

  • httpass: bcrypt-encrypted htpasswd file with users for the registry.
  • registry_htpasswd: password of the registry user used by the registry UI.

Create stack from registry.yml.

eBL application

Create stack from ebl.yml. The Docker images should be in the registry before deploying the stack. The ai-api service is optional and can be left out, but the EBL_AI_API environment variable on the api service has to be present. The ebl-ai-api repository is here.

  • Update the swarmprom stack:
    • Add redis-exporter service:
      redis-exporter:
        image: bitnami/redis-exporter:1
        environment:
          REDIS_ADDR: redis://redis:6379
        networks:
         - net
         - monitoring
      
    • Add redis-exporter job to prometheus:
            JOBS: traefik:8080 mongodb-exporter:9216 redis-exporter:9121
      

Troubleshooting

Consul fails to elect a leader

  • Scale replicas to 0.
  • Scale the replicas back to the desired value.

When the new replicas join, Consul should clean up old nodes and elect a new leader. To avoid stale nodes in the config, the replicas should be shut down before the leader.
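
On the command line, the two scaling steps could be sketched as below. The function only prints the docker service scale commands. Its name, the service name, and the replica count used in the usage note are assumptions; check the actual service name with docker service ls.

```shell
# Print (do not run) the commands that scale a service to 0 and back up.
# Usage: print_consul_restart <service name> <desired replica count>
print_consul_restart() {
  service="$1"
  replicas="$2"
  echo "docker service scale $service=0"
  echo "docker service scale $service=$replicas"
}
```

For example, print_consul_restart traefik-consul_consul-replica 3 prints both commands for a hypothetical replica service with three replicas.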

Redeployment fails

Long commands (e.g. node-exporter) get messed up and $ is unescaped in the "Current engine state". If you have to redeploy a service or stack, edit the "Last deployed" state instead, or copy the correct values from this repository.

Forgotten Grafana password

The Grafana admin password is set up only on the first run. It can be reset later via the CLI: docker exec -ti <container id> grafana-cli admin reset-admin-password <new password>.

"invalid memory address or nil pointer dereference" from mongodb_exporter

This error was caused by a bug in the exporter (PMM-4375). The exporter has since been updated.

The cluster becomes unresponsive

Docker can be restarted from the command line by running sudo service docker restart on all the affected instances. If it is not possible to connect with SSH, ask ITG to reboot or investigate.

Low disk space

Disk space can be freed by removing old Docker images etc. See: https://stackoverflow.com/questions/32723111/how-to-remove-old-and-unused-docker-images/32723127#32723127
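
Before pruning, it can be useful to check how full a filesystem actually is. The sketch below reads the usage percentage from df. The function names and the 80 percent default threshold are assumptions for illustration.

```shell
# Print the usage of the filesystem containing the given path as a bare
# number (e.g. 42 for 42%). Defaults to the root filesystem.
disk_usage_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage is at or above the threshold (default 80).
needs_cleanup() {
  [ "$(disk_usage_percent "${1:-/}")" -ge "${2:-80}" ]
}
```

A possible use is needs_cleanup / 80 && docker image prune -a -f --filter "until=24h". Note that the -a flag removes all unused images, not only dangling ones.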

Expired certificate

The certificates are handled automatically by Let's Encrypt and certbot (managed in the Traefik configuration, cf. the Traefik docs). If the renewal fails, the website becomes unavailable via HTTPS once the certificate expires. It might be necessary to manually restart the traefik-consul_traefik service in Swarmpit or on the server.
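
To notice a failed renewal before the certificate expires, the served certificate can be checked with openssl. The helper below is a sketch with a hypothetical name. It reads a PEM certificate on stdin and succeeds if the certificate expires within the given number of days.

```shell
# Succeed if the PEM certificate on stdin expires within <days> days.
# Unreadable input also counts as "expiring", so a broken check is noticed.
cert_expires_within() {
  days="$1"
  ! openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1
}
```

One possible use, with <host> as a placeholder for the site's hostname, is echo | openssl s_client -connect <host>:443 -servername <host> 2>/dev/null | cert_expires_within 14 && echo "certificate expires soon". If the check fires, the service can be restarted with docker service update --force traefik-consul_traefik.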

Contributors

jlaasonen, khoidt, ycobanoglu, ejimsan, fsimonjetz
