ministryofjustice / staff-infrastructure-monitoring

Terraform module that deploys infrastructure for our monitoring solution including Grafana and Prometheus, etc.

Home Page: https://ministryofjustice.github.io/cloud-operations/#cloud-operations

License: MIT License

HCL 91.34% Ruby 3.47% Makefile 0.45% Shell 4.74%

staff-infrastructure-monitoring's Introduction

This repository has been archived - its functionality is now delivered by staff-infrastructure-monitoring-cluster.

Infrastructure Monitoring and Alerting Platform

About the project
- Our repositories
- Architecture Decision Record (ADR)
Getting started
Usage
- Running the code for development
Documentation
License

About this repository

The Infrastructure Monitoring and Alerting (IMA) Platform aims to allow service owners and support teams to monitor the health of the MoJ infrastructure and identify failures as early as possible ahead of the end users reporting them. For alerting see this repository.

Our repositories

IMA Platform Infrastructure - to provision the infrastructure that the IMA Platform is deployed on
Configuration - to provision dashboards, alerts, and datasources that monitor MoJ infrastructure and physical devices on the IMA Platform
Deployments - to deploy the IMA Platform's core services onto our infrastructure
SNMP Exporter - to scrape data from physical devices (Docker image)
Blackbox Exporter - to probe endpoints over HTTP, HTTPS, DNS, TCP and ICMP (Docker image)
Metric Aggregation Server - to pull data from the SNMP exporter (Docker image)
Shared Services Infrastructure - to manage our CI/CD pipelines

Getting started

Prerequisites

Before you start you should ensure that you have installed the following:

AWS Command Line Interface (CLI) - to manage AWS services
AWS Vault (>= 6.0.0) - to easily manage and switch between AWS account profiles on the command line
tfenv - to easily manage and switch between Terraform versions
Terraform (1.1.x installed via tfenv) - to manage infrastructure

You should also have AWS account access to at least the Dev and Shared Services AWS accounts.

Authenticate with AWS

Terraform is run locally in a similar way to how it is run on the build pipelines.

It assumes an IAM role defined in the Shared Services account to gain access to the target AWS account, here the Development environment. This is done in the Terraform AWS provider with the assume_role configuration.

Authentication is made with the Shared Services AWS account, which then assumes the role into the target environment.

Assuming you have been granted the necessary access permissions to the Shared Services account, please follow the step-by-step guide in the CloudOps best practices to configure AWS Vault and the AWS CLI with AWS SSO.

Prepare the variables

Copy .env.example to .env
Modify the .env file and provide values for variables as below:

Variables	How?
`AWS_PROFILE=`	your AWS-CLI profile name for the Shared Services AWS account. Check this guide if you need help.
`AWS_DEFAULT_REGION=`	`eu-west-2`
`ENV=`	your unique terraform workspace name. 🔔

🔔 HELP
See the Create Terraform workspace section to find out how to create a Terraform workspace!
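
As a rough sanity check before running make init, the sketch below verifies that a .env file sets each of the three variables from the table. The function name check_env is hypothetical and not part of this repository, and it assumes the file uses plain NAME=value lines.

```shell
# check_env FILE
# Hypothetical helper (not part of this repository): prints "ok" if FILE
# sets all three variables from the table, otherwise lists the variables
# that are missing or empty and returns a non-zero status.
# Assumes plain NAME=value lines with no quoting or "export" prefix.
check_env() {
  env_file="$1"
  missing=""
  for name in AWS_PROFILE AWS_DEFAULT_REGION ENV; do
    # Require "NAME=" followed by at least one character.
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      missing="${missing}${name} "
    fi
  done
  if [ -n "$missing" ]; then
    # Strip the trailing space before reporting.
    echo "missing: ${missing% }"
    return 1
  fi
  echo "ok"
}
```

After copying .env.example to .env and filling it in, `check_env .env` should print ok.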

Initialize your Terraform

make init

Switch to an isolated workspace

If you do not have a Terraform workspace created already, use the command below to create a new workspace.

Create Terraform workspace

AWS_PROFILE=mojo-shared-services-cli terraform workspace new "YOUR_UNIQUE_WORKSPACE_NAME"

This should create a new workspace and select that new workspace at the same time.

If you already have a workspace created, use the commands below to select the right workspace before continuing.

View Terraform workspace list
AWS_PROFILE=mojo-shared-services-cli terraform workspace list
Select a Terraform workspace
AWS_PROFILE=mojo-shared-services-cli terraform workspace select "YOUR_WORKSPACE_NAME"
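
The create-or-select choice above can be folded into one step. The sketch below tries terraform workspace select first and falls back to terraform workspace new. The name ensure_workspace is a hypothetical helper, not something this repository provides, and it assumes AWS_PROFILE is already exported as in the commands above.

```shell
# ensure_workspace NAME
# Hypothetical helper (not part of this repository): selects the Terraform
# workspace NAME if it exists, otherwise creates it. "terraform workspace new"
# also switches to the newly created workspace.
# Assumes AWS_PROFILE is exported, e.g. AWS_PROFILE=mojo-shared-services-cli,
# and that "make init" has already been run.
ensure_workspace() {
  name="$1"
  if terraform workspace select "$name" >/dev/null 2>&1; then
    echo "selected existing workspace: $name"
  elif terraform workspace new "$name" >/dev/null; then
    echo "created workspace: $name"
  else
    echo "could not select or create workspace: $name" >&2
    return 1
  fi
}
```

For example, `ensure_workspace "YOUR_UNIQUE_WORKSPACE_NAME"` leaves you on that workspace whether or not it existed beforehand.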

4. Verify your email address for receiving emails

Go to AWS Simple Email Service's Email Addresses section under Identity Management
Click on Verify a New Email Address
Enter your email address and click Verify This Email Address

You should then receive an Email Address Verification Request email.

Click on the link provided in the email

This should update your Verification Status to Verified.

5. Set up your own development infrastructure

Run make generate-tfvars. This pulls down the tfvars file from AWS Parameter Store. You will have to complete some values yourself, or replace placeholders with your workspace name.

$ cp terraform.tfvars.example terraform.tfvars

Set values for all the variables with grafana_db_name and grafana_db_endpoint set to foo for now. These values will be set after creating your own infrastructure.
Create your infrastructure by running:

$ make apply

Move into the database directory and initialise Terraform using:

$ cd database/ && aws-vault exec moj-pttp-dev -- terraform init

Duplicate terraform.tfvars.example and rename the file to terraform.tfvars

$ cp terraform.tfvars.example terraform.tfvars

You will find the values for these tfvars output in the console after running the command in step 3.

Set values for all the variables using the Terraform outputs from creating your infrastructure in Step 1
Create your database by running:

$ aws-vault exec moj-pttp-dev -- terraform apply

Move back into the root directory

$ cd ../

Update your terraform.tfvars values for grafana_db_name and grafana_db_endpoint to what is outputted by Terraform at Step 5
Apply your changes by running:

$ aws-vault exec moj-pttp-shared-services -- terraform apply

This will enable you to use Grafana but not Prometheus, blackbox exporter and SNMP exporter. You'll need to push a Docker image to the corresponding AWS ECR repository that this repository created in order to utilise those components. To do so, see the README for each:

Before you move on to any other repos, run the following to export your Terraform outputs to Parameter Store:

$ export ENV=<your-workspace-name>
$ aws-vault exec moj-pttp-shared-services -- ./scripts/publish_terraform_outputs.sh

Usage

Running the code for development

To create an execution plan:

$ make plan

To execute changes:

$ make apply

To execute changes that require a longer session e.g. creating a database:

$ aws-vault clear && aws-vault exec moj-pttp-shared-services --duration=2h -- terraform apply

To minimise costs and keep the environment clean, regularly run teardown in your workspace using:

$ make destroy

To view your changes within the AWS Management Console:

Note: You log in to the Dev AWS account even though Terraform is run using moj-pttp-shared-services, because that profile assumes a role in Dev.

$ aws-vault login moj-pttp-dev

To run Selenium tests, use:

$ make test

Documentation

For documentation, see our docs.

License

MIT License

staff-infrastructure-monitoring's Issues

♻️ Add routes to ARK datacentres

Currently, the routes to 172.16.181.0/24 and 172.16.181.0/24 have been manually configured and must be provisioned as code.

These routes are for production only and should be configured here

A branch protection setting is not enabled: administrators require review

Hi there
The default branch protection setting called administrators require review is not enabled for this repository
See repository settings/Branches/Branch protection rules
Either add a new Branch protection rule or edit the existing branch protection rule and select the Require a pull request before merging option
See the repository standards: https://github.com/ministryofjustice/github-repository-standards
See the report: https://operations-engineering-reports.cloud-platform.service.justice.gov.uk/github_repositories
Please contact Operations Engineering on Slack #ask-operations-engineering, if you need any assistance

Missing Metrics 'DHCP ECS Task count' (Pre-Prod / Development)

User Story
As a Cloud Ops Engineer
I want to see metrics from all environments
So that I can successfully and easily manage DHCP.

Value / Purpose
All data is available on Grafana in one place, making it easy to monitor our environments.

Definition of Done (DoD)
metric 'DHCP ECS Task count' is available for dev / pre-prod in Grafana.

Prod:

Pre-Prod / Dev:

📌 SPIKE: Scrape Microsoft Graph into IMA

timebox 1hr

User Story

As an Infrastructure Engineer for Devices and Apps
I want a view of live data from Intune in Grafana
So that I can create a consolidated view of application deployments

Value / Purpose

Intune shows individual deployments with no way to get an overall view of deployments.

Useful Contacts

Matt W, Rich B

Additional Information

Microsoft Graph API docs

Definition of Done (DoD)

Prometheus or exporter is available to scrape MS GraphAPI JSON

Test data has been scraped into Prometheus.
Any further issues created
Documentation has been written / updated

Reference

How to write good user stories

✨ Add `development` and `pre-production` namespaces in `production` EKS cluster

User Story

As a… developer in CloudOps team
I want to… test my code in development environment in Production EKS cluster
So that… I can test my code in isolation without having to create a whole new EKS cluster

Value / Purpose

This will reduce the cost of running additional EKS clusters for non-production environments.

Useful Contacts

Touhidul Islam, Rich Baguley

Additional Information

No response

Definition of Done

Production EKS cluster has all environments in separate namespaces
Documentation has been written / updated
README has been updated
User docs have been updated
Another team member has reviewed
Tests are green

Please define external collaborators in code

Hi there

We now have a process to manage github collaborators in code:
https://github.com/ministryofjustice/github-collaborators/blob/main/README.md#github-external-collaborators

Please follow the procedure described there to grant @andycohen access to this repository.

If you have any questions, please post in #ask-operations-engineering on Slack.

🐛 Bug Report - S3 bucket encryption

Describe the problem

As part of the Terraform upgrade, the AWS provider needed to be upgraded also. Upgrading the AWS provider meant making a lot of changes to the S3 bucket resources as per the new syntax.

The KMS key used to be part of the s3 bucket module but is now split out into an individual resource. This has caused buckets that were previously encrypted to enter a state of flux whereby the bucket shows as unencrypted but the objects within it remain encrypted with the old KMS key (now deleted).

S3 buckets will not successfully encrypt with the new KMS key.

Service Discovery - DNS

User Story

As a network engineer
I want to view my device or instance by DNS hostname
So that I don't have to cross-reference IP addresses to hostnames

Value

Consistency of instance naming across the platform

Questions / Assumptions

Internal Zone within our DNS platform for e.g. network.service
we can join metrics if the instance changes from IP to hostname

Definition of done

Add internal zone
IMAP to use MoJO DNS
Network devices to use MoJO DNS
Prometheus to use dns_sd_config

Reference

How to write good user stories

Security Groups - Unrestricted Access

Several security groups have unrestricted access:

mojo-production-ima-alb-grafana-sg tcp 443-3000 0.0.0.0/0
mojo-production-ima-db-in tcp 5432 0.0.0.0/0
mojo-production-ima-route53-resolver -1 0.0.0.0/0
test-bastion tcp 22 0.0.0.0/0
Update Trusted Advisor

Restrict these security groups.
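
One way to re-check this finding from the command line is to ask EC2 for every security group that has an inbound rule open to 0.0.0.0/0. This is a sketch, not the method used to produce the list above. The name list_open_security_groups is hypothetical, and it assumes credentials for the relevant account are already in the environment (for example via aws-vault exec).

```shell
# list_open_security_groups
# Hypothetical helper (not part of this repository): prints the ID and name
# of every security group with at least one inbound IPv4 rule whose source
# is 0.0.0.0/0. Assumes AWS credentials and region are already configured.
list_open_security_groups() {
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query 'SecurityGroups[].[GroupId,GroupName]' \
    --output text
}
```

The filter only matches inbound IPv4 rules, so the output still needs a manual review of which ports and protocols each rule exposes.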

📚 IMA S3 Bucket review

User Story

As a member of the CloudOps team
I want to understand and document s3 use throughout IMA, and ensure all resources are necessary
So that I am better prepared for any s3 related issues in future

Value / Purpose

A better understanding of S3's role within the IMA platform.

Additional Information

Please see #181

Definition of Done (DoD)

S3 section added to the documentation, any s3 buckets not being used by another resource should be deleted.

README has been updated
pttp-prometheus-thanos-storage buckets are removed from each environment

Custom Metric Historic Data < 24 Hours

Describe the bug
There appears to be no data held longer than 24 hours for custom metrics in grafana.
To Reproduce
Steps to reproduce the behavior:

Go to service dashboard identify a custom metric graph.
Click on 'Time period > 24 Hours'
Scroll down to '....'
See that no data is visible

Expected behavior
Data is shown in custom metric graphs when the time period is > 24 hours.

Screenshots

Desktop (please complete the following information):

OS: Win10 Dev Build
Browser: Chrome

⬆ Upgrade EKS Kubernetes version 1.21 -> 1.22

User Story

As an… Engineer
I need/want/expect to… be running the latest versions of software
So that… we don't fall out of date, stay up to date security-wise, get new features etc etc

Value / Purpose

No response

Useful Contacts

No response

Additional Information

Definition of Done

Read upgrade docs
Document the steps taken in this ticket - a runbook for future upgrades
Update the terraform and any dependencies
Test in Dev / Pre-Prod
Allow into Live

👁‍🗨 Issue to add labels

🔥 Remove SNMP_Exporter, Pipeline, Alerts from IMAP

User Story

Network Operations are now monitoring physical devices.

We can remove this functionality from IMAP

Value / Purpose

Reduce IMAP complexity
Remove unneeded AWS resources
🔥 Code!

Useful Contacts

Rich B, Touhid, Ronak

Additional Information

Archive
https://github.com/ministryofjustice/staff-infrastructure-monitoring-snmpexporter
Remove

https://github.com/ministryofjustice/staff-infrastructure-monitoring-deployments/blob/main/kubernetes/infrastructure-monitoring/templates/prometheus-jobs/_snmp-exporter-production.tpl

https://github.com/ministryofjustice/staff-infrastructure-monitoring-deployments/blob/454f8316247b34aeb90164a6928bf7bd517b9eb9/kubernetes/infrastructure-monitoring/templates/prometheus-configmap.yaml#L156

Definition of Done

🧱Upgrade VPC CNI version from v1.7.x or lower to v1.8.0 or above

Affected Environments:

mojo-development-ima-cluster
mojo-pre-production-ima-cluster
mojo-production-ima-cluster

VPC CNI versions starting with v1.8.0 have the required fix to address this issue. Please upgrade to VPC CNI v1.8.0 or above if you are currently on the affected CNI versions on EKS v1.21 clusters. Our recommendation is to upgrade one minor version at a time.

For example, if your current minor version is 1.8 and you want to update to 1.10, you should update to the latest patch version of 1.9 first, then update to the latest patch version of 1.10. Please refer to EKS documentation for instructions on upgrading the VPC CNI plugin [2].
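
The one-minor-version-at-a-time rule can be made concrete with a small helper that lists the minor versions to step through. This is an illustrative sketch only. The name upgrade_path is hypothetical, it only handles versions written as MAJOR.MINOR within the same major version, and it does not look up the latest patch release for each step.

```shell
# upgrade_path CURRENT TARGET
# Hypothetical helper: prints, one per line and in order, each minor version
# to upgrade to when moving from CURRENT to TARGET one minor at a time.
# Both arguments must be MAJOR.MINOR (e.g. 1.8) with the same MAJOR.
upgrade_path() {
  major="${1%%.*}"          # text before the first dot, e.g. "1"
  current_minor="${1#*.}"   # text after the first dot, e.g. "8"
  target_minor="${2#*.}"
  next=$((current_minor + 1))
  while [ "$next" -le "$target_minor" ]; do
    echo "${major}.${next}"
    next=$((next + 1))
  done
}
```

With the versions from the example above, `upgrade_path 1.8 1.10` lists 1.9 and then 1.10.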

NOTE:

We are currently configured to use the self managed add-on.

Procedure for migrating from the self managed add on to the eks add-on - https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#adding-vpc-cni-eks-add-on

Procedure to update the self managed add-on - https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-add-on

Procedure to update the eks add-on - https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on

If you have any questions or concerns, please contact AWS Support [3].

[1] https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
[2] https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on
[3] https://aws.amazon.com/support

📖 Investigate Bastion ec2

User Story

As a member of the CloudOps team
I want to make sure unnecessary ec2 instances aren't live in AWS.
So that we don't waste money and increase our security posture

Value / Purpose

Ensure bastion ec2 isn't spun up when it shouldn't be

Useful Contacts

Additional Information

Definition of Done (DoD)

Any test bastion ec2 instances are feature flagged and not created unnecessarily

🔥 Kill unused resources running on Fargate!

User Story

As a member of the CloudOps team
I want Prometheus and exporters to be running in the EKS cluster only
So that we aren't wasting money on unused resources in AWS Fargate

Value / Purpose

Prometheus and exporters have been running successfully in EKS for some time. It's now time to locate the resources in ECS Fargate and make sure they are removed as they're no longer used and are wasting money.

Useful Contacts

No response

Additional Information

No response

Definition of Done

Documentation has been written / updated

README has been updated
User docs have been updated
Another team member has reviewed
Tests are green

⚰ Remove EKS Clusters from non-production environments

User Story

As a… developer in the CloudOps team
I want to… update the terraform so that it does not create EKS clusters for dev and pre-prod envs
So that… only the prod env has an EKS cluster running

Value / Purpose

This will reduce the running costs of EKS clusters

Useful Contacts

Touhidul Islam, Rich Baguley

Additional Information

No response

Definition of Done

Terraform no longer creates EKS clusters in development and in pre-production by default
Documentation has been written / updated
README has been updated
User docs have been updated
Another team member has reviewed
Tests are green

🧱 Upgrade Terraform Version

Aim for the latest version.
Information here: https://www.terraform.io/language/upgrade-guides/1-1

Move slack_webhook_urls to parameter store from code

User Story

As a Cloud Ops engineer
I want to ensure secrets are stored securely in the parameter store
So that they are not visible in our repositories

Value / Purpose

Ensure secrets are securely stored and not openly available in code to improve security.

Useful Contacts

Aaron Robinson and CloudOps team

Additional Information

https://github.com/ministryofjustice/staff-infrastructure-monitoring/security/secret-scanning/1

Definition of Done (DoD)

slack_webhook_urls are referenced from the parameter store rather than being stored in code.
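
To illustrate what reading the secret from the parameter store could look like, the sketch below fetches a decrypted value with the AWS CLI. The helper name get_slack_webhook and the parameter name used in the usage note are hypothetical; the issue does not say what the parameter is called or how the Terraform code will reference it.

```shell
# get_slack_webhook PARAMETER_NAME
# Hypothetical helper (not part of this repository): prints the decrypted
# value of the given SSM parameter. Assumes AWS credentials and region are
# already configured and that the caller may read and decrypt the parameter.
get_slack_webhook() {
  aws ssm get-parameter \
    --name "$1" \
    --with-decryption \
    --query 'Parameter.Value' \
    --output text
}
```

For example, `get_slack_webhook "/example/slack_webhook_urls"` would print the stored value, where /example/slack_webhook_urls is a placeholder name.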

🔒 Cluster nodes obtaining IP addresses from both public and private subnets

The EKS cluster is configured to allow nodes to obtain IP addresses from both public and private subnets. See code snippet below:

module "monitoring_alerting_cluster" {
  source                          = "terraform-aws-modules/eks/aws"
  version                         = "14.0.0"
  create_eks                      = var.is_eks_enabled
  cluster_name                    = "${var.prefix}-cluster"
  cluster_version                 = "1.19"
  manage_aws_auth                 = false
  cluster_endpoint_private_access = true
  cluster_enabled_log_types       = ["api", "authenticator", "controllerManager"]
  tags                            = var.tags

  subnets = concat(aws_subnet.private.*.id, aws_subnet.public.*.id)
  vpc_id  = aws_vpc.main.id

  worker_groups = [
    {
      name                 = "prometheus-worker-group"
      instance_type        = "t3.medium"
      asg_desired_capacity = 3
      root_volume_type     = "gp2"
    },
  ]
}

This presents a security concern as nodes should not be exposed to the internet unnecessarily.

Review after date expiry is upcoming for user: richrace

Hi there

The user @richrace has its access for this repository maintained in code here:
https://github.com/ministryofjustice/github-collaborators/blob/main/README.md#github-external-collaborators

The review_after date is due to expire within one month, please update this via a PR if they still require access.

If you have any questions, please post in #ask-operations-engineering on Slack.

📝 Update README to include TF_VAR_thanos_image_repository_url variable

The README does not currently contain this variable and should be updated accordingly.

👷‍♀️ Monitor EKS Cluster version with Renovate

User Story

As a… Cloud Ops Engineer
I expect Renovate to create an Issue when we are not running the latest version of EKS in our clusters
So that… we don't fall behind, out of date etc.

Value / Purpose

Team won't fall behind / have a better view of the work we need to do
Software will be more up to date / secure etc etc

Useful Contacts

No response

Additional Information

If this is done prior to the 1.21 -> 1.22 upgrade, it can easily be tested.

Definition of Done

Renovate creates an issue when we are not running the latest K8s version of EKS
