ministryofjustice / staff-infrastructure-monitoring

Terraform module that deploys infrastructure for our monitoring solution including Grafana and Prometheus, etc.

Home Page: https://ministryofjustice.github.io/cloud-operations/#cloud-operations

License: MIT License

HCL 91.34% Ruby 3.47% Makefile 0.45% Shell 4.74%

staff-infrastructure-monitoring's Introduction

This repository has been archived. Its functionality is now delivered by staff-infrastructure-monitoring-cluster.

Infrastructure Monitoring and Alerting Platform

About this repository

The Infrastructure Monitoring and Alerting (IMA) Platform aims to allow service owners and support teams to monitor the health of the MoJ infrastructure and identify failures as early as possible ahead of the end users reporting them. For alerting see this repository.

Our repositories

Getting started

Prerequisites

Before you start you should ensure that you have installed the following:

  • AWS Command Line Interface (CLI) - to manage AWS services
  • AWS Vault (>= 6.0.0) - to easily manage and switch between AWS account profiles on the command line
  • tfenv - to easily manage and switch between Terraform versions
  • Terraform (1.1.x installed via tfenv) - to manage infrastructure

You should also have AWS account access to at least the Dev and Shared Services AWS accounts.

Authenticate with AWS

Terraform is run locally in a similar way to how it is run on the build pipelines.

It assumes an IAM role defined in the Shared Services account and uses that role to gain access to the target AWS account for the Development environment. This is done in the Terraform AWS provider with the assume_role configuration.

Authentication is made with the Shared Services AWS account, which then assumes the role into the target environment.

Assuming you have been granted the necessary access permissions to the Shared Services account, follow the step-by-step guide in the CloudOps best practices to configure AWS Vault and the AWS CLI with AWS SSO.

Prepare the variables

  1. Copy .env.example to .env
  2. Modify the .env file and provide values for variables as below:
Variables and how to set them:

  • AWS_PROFILE= your AWS CLI profile name for the Shared Services AWS account. Check this guide if you need help.
  • AWS_DEFAULT_REGION= eu-west-2
  • ENV= your unique Terraform workspace name. See the Create Terraform workspace section to find out how to create a Terraform workspace.
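
As a quick sanity check before running make init, a small shell function can confirm that the three variables above are set. This is only a sketch and not part of the repository: the function name check_env is hypothetical, and it assumes .env contains plain KEY=value lines that a POSIX shell can source.

```shell
# Hypothetical helper (not part of this repository): report which of the
# variables expected in .env are missing or empty in the current shell.
check_env() {
  missing=""
  for name in AWS_PROFILE AWS_DEFAULT_REGION ENV; do
    # Indirect lookup of the variable called $name, portable across POSIX shells.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing variables:$missing" >&2
    return 1
  fi
  echo "All required variables are set"
}

# Usage: load .env into the current shell, then check it.
#   set -a; . ./.env; set +a; check_env
```

The function returns a non-zero status when anything is missing, so it can gate later commands, for example check_env && make init.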

Initialize your Terraform

make init

Switch to an isolated workspace

If you do not have a Terraform workspace created already, use the command below to create a new workspace.

Create Terraform workspace

AWS_PROFILE=mojo-shared-services-cli terraform workspace new "YOUR_UNIQUE_WORKSPACE_NAME"

This should create a new workspace and select that new workspace at the same time.

If you already have a workspace, use the commands below to list your workspaces and select the right one before continuing.

View Terraform workspace list

AWS_PROFILE=mojo-shared-services-cli terraform workspace list

Select a Terraform workspace

AWS_PROFILE=mojo-shared-services-cli terraform workspace select "YOUR_WORKSPACE_NAME"
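
The select and create commands above can be combined into one step. The sketch below assumes terraform is on your PATH and that AWS_PROFILE is already exported; the function name use_workspace is hypothetical and not part of this repository.

```shell
# Hypothetical helper: select the workspace named by $1, creating it first
# if the select fails (for example because the workspace does not exist yet).
# Assumes terraform is installed and AWS_PROFILE is exported.
use_workspace() {
  name="$1"
  if [ -z "$name" ]; then
    echo "usage: use_workspace WORKSPACE_NAME" >&2
    return 2
  fi
  # "workspace new" also switches to the workspace it creates.
  terraform workspace select "$name" 2>/dev/null || terraform workspace new "$name"
}

# Usage:
#   export AWS_PROFILE=mojo-shared-services-cli
#   use_workspace "YOUR_UNIQUE_WORKSPACE_NAME"
```

Note that any failure of the select step, not just a missing workspace, falls through to the create step, so check the Terraform output if the result is unexpected.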

Verify your email address for receiving emails

  1. Go to AWS Simple Email Service's Email Addresses section under Identity Management
  2. Click on Verify a New Email Address
  3. Enter your email address and click Verify This Email Address

You should then receive an Email Address Verification Request email.

  4. Click on the link provided in the email

This should update your Verification Status to Verified in AWS.

Set up your own development infrastructure

  1. Run make generate-tfvars. This pulls down the tfvars file from AWS Parameter Store. You will have to complete some values yourself, or replace placeholders with your workspace name.
$ cp terraform.tfvars.example terraform.tfvars
  2. Set values for all the variables, with grafana_db_name and grafana_db_endpoint set to foo for now. These values will be set after creating your own infrastructure.

  3. Create your infrastructure by running:

$ make apply
  4. Move into the database directory and initialise Terraform using:
$ cd database/ && aws-vault exec moj-pttp-dev -- terraform init
  5. Duplicate terraform.tfvars.example and rename the file to terraform.tfvars:
$ cp terraform.tfvars.example terraform.tfvars

You will find the values for these tfvars output in the console after running make apply in step 3.

  6. Set values for all the variables using the Terraform outputs from creating your infrastructure in step 3
  7. Create your database by running:
$ aws-vault exec moj-pttp-dev -- terraform apply
  8. Move back into the root directory:
$ cd ../
  9. Update your terraform.tfvars values for grafana_db_name and grafana_db_endpoint to the values output by Terraform in step 7
  10. Apply your changes by running:
$ aws-vault exec moj-pttp-shared-services -- terraform apply

This will enable you to use Grafana but not Prometheus, blackbox exporter and SNMP exporter. You'll need to push a Docker image to the corresponding AWS ECR repository that this repository created in order to utilise those components. To do so, see the README for each:

  1. Before you move on to any other repos, run the following to export your Terraform outputs to Parameter Store:
$ export ENV=<your-workspace-name>
$ aws-vault exec moj-pttp-shared-services -- ./scripts/publish_terraform_outputs.sh

Usage

Running the code for development

To create an execution plan:

$ make plan

To execute changes:

$ make apply

To execute changes that require a longer session e.g. creating a database:

$ aws-vault clear && aws-vault exec moj-pttp-shared-services --duration=2h -- terraform apply

To minimise costs and keep the environment clean, regularly run teardown in your workspace using:

$ make destroy

To view your changes within the AWS Management Console:

Note: You log in to the Dev AWS account, even though infrastructure changes are applied using moj-pttp-shared-services, because that profile assumes a role in Dev.

$ aws-vault login moj-pttp-dev

To run Selenium tests, use:

$ make test

Documentation

For documentation, see our docs.

License

MIT License

staff-infrastructure-monitoring's People

Contributors

andycohen, astrobinson, bagg3rs, chubberlisk, davesammut, devotox, elcorbs, elena-vi, gary-h9, github-actions[bot], kyphutruong, mitchdawson1982, mtouhid, paulmchenry, renovate-bot, staff-infrastructure-moj, thip, wanieldilson

staff-infrastructure-monitoring's Issues

♻️ Add routes to ARK datacentres

Currently, the routes to 172.16.181.0/24 and 172.16.181.0/24 have been manually configured and must be provisioned as code.

These routes are for production only and should be configured here

A branch protection setting is not enabled: administrators require review

Hi there
The default branch protection setting called administrators require review is not enabled for this repository
See repository settings/Branches/Branch protection rules
Either add a new Branch protection rule or edit the existing branch protection rule and select the Require a pull request before merging option
See the repository standards: https://github.com/ministryofjustice/github-repository-standards
See the report: https://operations-engineering-reports.cloud-platform.service.justice.gov.uk/github_repositories
Please contact Operations Engineering on Slack #ask-operations-engineering, if you need any assistance

Missing Metrics 'DHCP ECS Task count' (Pre-Prod / Development)

User Story
As a Cloud Ops Engineer
I want to see metrics from all environments
So that I can successfully and easily manage DHCP.

Value / Purpose
All data is available on Grafana in one place, making it easy to monitor our environments.

Definition of Done (DoD)
metric 'DHCP ECS Task count' is available for dev / pre-prod in Grafana.

Screenshots for Prod and for Pre-Prod / Dev were attached to the original issue.

📌 SPIKE: Scrape Microsoft Graph into IMA

timebox 1hr

User Story

As an Infrastructure Engineer for Devices and Apps
I want a view of live data from Intune in Grafana
So that I can create a consolidated view of application deployments

Value / Purpose

Intune shows individual deployments with no way to get an overall view of deployments.

Useful Contacts

Matt W, Rich B

Additional Information

Microsoft Graph API docs

Definition of Done (DoD)

  • Prometheus or exporter is available to scrape MS GraphAPI JSON

  • Test data has been scraped into Prometheus.
  • Any further issues created
  • Documentation has been written / updated

Reference

How to write good user stories

✨ Add `development` and `pre-production` namespaces in `production` EKS cluster

User Story

As a… developer in CloudOps team
I want to… test my code in development environment in Production EKS cluster
So that… I can test my code in isolation without having to create a whole new EKS cluster

Value / Purpose

This will reduce the cost of running additional EKS clusters for non-production environments.

Useful Contacts

Touhidul Islam, Rich Baguley

Additional Information

No response

Definition of Done

  • Production EKS cluster has all environments in separate namespaces
  • Documentation has been written / updated
  • README has been updated
  • User docs have been updated
  • Another team member has reviewed
  • Tests are green

🐛 Bug Report - S3 bucket encryption

Describe the problem

As part of the Terraform upgrade, the AWS provider needed to be upgraded also. Upgrading the AWS provider meant making a lot of changes to the S3 bucket resources as per the new syntax.

The KMS key used to be part of the s3 bucket module but is now split out into an individual resource. This has caused buckets that were previously encrypted to enter a state of flux whereby the bucket shows as unencrypted but the objects within it remain encrypted with the old KMS key (now deleted).

S3 buckets will not successfully encrypt with the new KMS key.

Service Discovery - DNS

User Story

As a network engineer
I want to view my device or instance by DNS hostname
So that I don't have to cross-reference IP addresses to hostnames

Value

Consistency of instance naming across the platform

Questions / Assumptions

  • An internal zone exists within our DNS platform, e.g. network.service
  • We can join metrics if the instance changes from IP to hostname

Definition of done

  • Add internal zone
  • IMAP to use MoJO DNS
  • Network devices to use MoJO DNS
  • Prometheus to use dns_sd_config

Reference

How to write good user stories

Security Groups - Unrestricted Access

Several security groups have unrestricted access:

  • mojo-production-ima-alb-grafana-sg tcp 443-3000 0.0.0.0/0

  • mojo-production-ima-db-in tcp 5432 0.0.0.0/0

  • mojo-production-ima-route53-resolver -1 0.0.0.0/0

  • test-bastion tcp 22 0.0.0.0/0

  • Update Trusted Advisor

Restrict these security groups.
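
To confirm which groups are affected before and after the change, the AWS CLI can list security groups that have an inbound rule open to 0.0.0.0/0. This is a sketch rather than the team's audit process: the function name list_open_security_groups is hypothetical, and it assumes the AWS CLI is installed and credentials for the target account are already active in the shell.

```shell
# Hypothetical helper: list the ID and name of every security group with at
# least one inbound rule whose IPv4 CIDR is 0.0.0.0/0.
# Assumes the AWS CLI is installed and credentials are already active.
list_open_security_groups() {
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query 'SecurityGroups[].[GroupId,GroupName]' \
    --output text
}

# Usage:
#   list_open_security_groups
```

The filter matches on inbound rules only, so a group with open egress but restricted ingress is not listed.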

📚 IMA S3 Bucket review

User Story

As a member of the CloudOps team
I want to understand and document s3 use throughout IMA, and ensure all resources are necessary
So that I am better prepared for any s3 related issues in future

Value / Purpose

A better understanding of S3's role within the IMA platform.

Additional Information

Please see #181

Definition of Done (DoD)

S3 section added to the documentation, any s3 buckets not being used by another resource should be deleted.

  • README has been updated
  • pttp-prometheus-thanos-storage buckets are removed from each environment

Custom Metric Historic Data < 24 Hours

Describe the bug
There appears to be no data held longer than 24 hours for custom metrics in grafana.
To Reproduce
Steps to reproduce the behavior:

  1. Go to a service dashboard and identify a custom metric graph.
  2. Click on 'Time period > 24 Hours'
  3. See that no data is visible

Expected behavior
Data is seen in custom metric graphs when the time period is > 24 hours.

Desktop:

  • OS: Win10 Dev Build
  • Browser: Chrome

⬆ Upgrade EKS Kubernetes version 1.21 -> 1.22

User Story

As an… Engineer
I need/want/expect to… be running the latest versions of software
So that… we don't fall out of date, stay up to date security-wise, get new features etc etc

Value / Purpose

No response

Useful Contacts

No response

Additional Information

Definition of Done

  • Read upgrade docs
  • Document the steps taken in this ticket - a runbook for future upgrades
  • Update the terraform and any dependencies
  • Test in Dev / Pre-Prod
  • Allow into Live

🔥 Remove SNMP_Exporter, Pipeline, Alerts from IMAP

User Story

Network Operations are now monitoring physical devices.

We can remove this functionality from IMAP

Value / Purpose

Reduce IMAP complexity
Remove unneeded AWS resources
🔥 Code!

Useful Contacts

Rich B, Touhid, Ronak

Additional Information

Archive
https://github.com/ministryofjustice/staff-infrastructure-monitoring-snmpexporter
Remove

https://github.com/ministryofjustice/staff-infrastructure-monitoring-deployments/blob/main/kubernetes/infrastructure-monitoring/templates/prometheus-jobs/_snmp-exporter-production.tpl

https://github.com/ministryofjustice/staff-infrastructure-monitoring-deployments/blob/454f8316247b34aeb90164a6928bf7bd517b9eb9/kubernetes/infrastructure-monitoring/templates/prometheus-configmap.yaml#L156

Definition of Done

  • AlertManager config removed and channels archived
  • ECS or EKS resources 🔥
  • README has been updated
  • Repos archived
  • Tests are green

🧱 Upgrade VPC CNI version from v1.7.x or lower to v1.8.0 or above

Affected Environments:

  • mojo-development-ima-cluster
  • mojo-pre-production-ima-cluster
  • mojo-production-ima-cluster

VPC CNI versions starting with v1.8.0 have the required fix to address this issue. Please upgrade to VPC CNI v1.8.0 or above if you are currently on the affected CNI versions on EKS v1.21 clusters. Our recommendation is to upgrade one minor version at a time.

For example, if your current minor version is 1.8 and you want to update to 1.10, you should update to the latest patch version of 1.9 first, then update to the latest patch version of 1.10. Please refer to EKS documentation for instructions on upgrading the VPC CNI plugin [2].
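
The one-minor-version-at-a-time rule can be sketched as a small shell function that lists the intermediate minor versions to step through. The function name cni_upgrade_path is hypothetical, and it assumes both versions are given as MAJOR.MINOR with the same major version. It does not look up the latest patch release for each step.

```shell
# Hypothetical helper: print each minor version to step through, one per
# line, when moving from one VPC CNI minor version to another.
# Both arguments are MAJOR.MINOR, for example 1.8 and 1.10.
cni_upgrade_path() {
  from="$1"
  to="$2"
  major="${from%%.*}"
  from_minor="${from#*.}"
  to_minor="${to#*.}"
  if [ "$major" != "${to%%.*}" ]; then
    echo "major versions differ: $from vs $to" >&2
    return 1
  fi
  minor=$((from_minor + 1))
  while [ "$minor" -le "$to_minor" ]; do
    echo "$major.$minor"
    minor=$((minor + 1))
  done
}

# Usage:
#   cni_upgrade_path 1.8 1.10
#   prints 1.9 then 1.10, one per line
```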

NOTE:

We are currently configured to use the self managed add-on.

Procedure for migrating from the self managed add on to the eks add-on - https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#adding-vpc-cni-eks-add-on

Procedure to update the self managed add-on - https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-add-on

Procedure to update the eks add-on - https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on

If you have any questions or concerns, please contact AWS Support [3].

[1] https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
[2] https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on
[3] https://aws.amazon.com/support

📖 Investigate Bastion ec2

User Story

As a member of the CloudOps team
I want to make sure unnecessary ec2 instances aren't live in AWS.
So that we don't waste money and we improve our security posture

Value / Purpose

Ensure bastion ec2 isn't spun up when it shouldn't be

Useful Contacts

Additional Information

Definition of Done (DoD)

Any test bastion ec2 instances are feature flagged and not created unnecessarily

🔥 Kill unused resources running on Fargate!

User Story

As a member of the CloudOps team
I want Prometheus and exporters to be running in the EKS cluster only
So that we aren't wasting money on unused resources in AWS Fargate

Value / Purpose

Prometheus and exporters have been running successfully in EKS for some time. It's now time to locate the resources in ECS Fargate and make sure they are removed as they're no longer used and are wasting money.

Useful Contacts

No response

Additional Information

No response

Definition of Done

  • Documentation has been written / updated

  • README has been updated
  • User docs have been updated
  • Another team member has reviewed
  • Tests are green

⚰ Remove EKS Clusters from non-production environments

User Story

As a… developer in the CloudOps team
I want to… update the terraform so that it does not create EKS clusters for dev and pre-prod envs
So that… only the prod env has an EKS cluster running

Value / Purpose

This will reduce the running costs of EKS clusters

Useful Contacts

Touhidul Islam, Rich Baguley

Additional Information

No response

Definition of Done

  • Terraform no longer creates EKS clusters in development and in pre-production by default
  • Documentation has been written / updated
  • README has been updated
  • User docs have been updated
  • Another team member has reviewed
  • Tests are green

Move slack_webhook_urls to parameter store from code

User Story

As a Cloud Ops engineer
I want to ensure secrets are stored securely in the parameter store
So that they are not visible in our repositories

Value / Purpose

Ensure secrets are securely stored and not openly available in code, to improve security.

Useful Contacts

Aaron Robinson and CloudOps team

Additional Information

https://github.com/ministryofjustice/staff-infrastructure-monitoring/security/secret-scanning/1

Definition of Done (DoD)

slack_webhook_urls are referenced from the parameter store rather than being stored in code.

🔒 Cluster nodes obtaining IP addresses from both public and private subnets

The EKS cluster is configured to allow nodes to obtain IP addresses from both public and private subnets. See code snippet below:

module "monitoring_alerting_cluster" {
  source                          = "terraform-aws-modules/eks/aws"
  version                         = "14.0.0"
  create_eks                      = var.is_eks_enabled
  cluster_name                    = "${var.prefix}-cluster"
  cluster_version                 = "1.19"
  manage_aws_auth                 = false
  cluster_endpoint_private_access = true
  cluster_enabled_log_types       = ["api", "authenticator", "controllerManager"]
  tags                            = var.tags

  subnets = concat(aws_subnet.private.*.id, aws_subnet.public.*.id)
  vpc_id  = aws_vpc.main.id

  worker_groups = [
    {
      name                 = "prometheus-worker-group"
      instance_type        = "t3.medium"
      asg_desired_capacity = 3
      root_volume_type     = "gp2"
    },
  ]
}

This presents a security concern as nodes should not be exposed to the internet unnecessarily.

👷‍♀️ Monitor EKS Cluster version with Renovate

User Story

As a… Cloud Ops Engineer
I expect Renovate to create an Issue when we are not running the latest version of EKS in our clusters
So that… we don't fall behind, out of date etc.

Value / Purpose

  • Team won't fall behind / have a better view of the work we need to do
  • Software will be more up to date / secure etc etc

Useful Contacts

No response

Additional Information

If this is done before the 1.21 -> 1.22 upgrade, it can easily be tested.

Definition of Done

  • Renovate creates an issue when we are not running the latest K8s version of EKS
