DoSomething.org Infrastructure

This is DoSomething.org's infrastructure as code, built using Terraform. We use it to manage and provision resources in Fastly, Heroku, and AWS (EC2, RDS, SQS, S3, IAM users, amongst others). It's a work in progress.

Installation

Install Terraform 0.12. On macOS, this is easy with Homebrew:

brew install terraform

Create a Terraform Cloud account with your work email & ask for an invite to our organization in #dev-infrastructure. Don't forget to enable two-factor auth! Then, create your API token and place it in your ~/.terraformrc file, like so:

credentials "app.terraform.io" {
  token = "xxxxxx.atlasv1.zzzzzzzzzzzzz"
}

Run make init from this directory to install a githook to check formatting before you commit changes. You can run make format at any time to format your code, or install the Terraform extension for your editor.

Alright, now you're ready to build some infrastructure!! πŸ—

Usage

Terraform allows us to create & modify infrastructure declaratively. The files in this repository define what infrastructure (apps, databases, queues, domains, etc.) we should have, and Terraform figures out what changes it needs to make to get there based on what currently exists.

We separate our configuration into workspaces. We also build reusable modules in the applications/ and components/ directories that can be used to provision the same type of thing in multiple places.
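
For example, a workspace might instantiate one of those shared modules roughly like this (a minimal sketch; the module name, source path, and arguments here are hypothetical):

module "northstar_qa" {
  source = "../applications/northstar"

  name        = "dosomething-northstar-qa"
  environment = "qa"
}

Each environment then differs only in the arguments it passes, rather than repeating the full resource definitions.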

See Terraform's Getting Started guide & documentation for more details.

Plan

We use workspaces to separate different contexts (e.g. the main application vs. our data stack) and environments (production, QA, and development). Each workspace exists as a top-level folder in this repository.

To make changes in a workspace, first cd into the workspace's directory and run terraform init to pull down dependencies. Then, make your changes to the Terraform configuration files with your text editor.

You can make a plan to find out how your changes will affect the current state of the system:

terraform plan

Once you're satisfied with Terraform's plan for your changes, commit your work & make a pull request. Your pull request will automatically run a plan for all workspaces (even if they're not affected by your change).

Apply

After your pull request is reviewed and merged, you can then apply your change to update the actual infrastructure. Terraform Cloud will make your changes, update the remote state, and ensure nobody else makes any changes until you're done.

To apply pending changes to a workspace, visit Terraform Cloud and open the latest run for the workspace you want to modify. Review the plan & then choose "Confirm & Apply" to make the change.

Security Vulnerabilities

We take security very seriously. Any vulnerabilities should be reported to [email protected], and will be promptly addressed. Thank you for taking the time to responsibly disclose any issues you find.

License

Β© DoSomething.org. This config is free software, and may be redistributed under the terms specified in the LICENSE file. The name and logo for DoSomething.org are trademarks of Do Something, Inc and may not be used without permission.

Issues

QUESTION: Fastly pass-through property for vote.ds?

Coming out of the National Voter Reg Day pre-mortem, we've asked ourselves whether the easiest way to redirect vote.ds as a contingency is to position a Fastly layer now, set with a minimal pass-through configuration. If we do need to do any redirects, and with those any URL/param transformations, we could do them via Fastly.

Migrate Footlocker/Scholarship Apps to Heroku

A few times a year we run into issues with the whitelabel scholarship apps, whether related to Supervisor, code file updates, or general DevOps: Supervisor/email send issues, DB exports (although these are much better than they used to be), and sometimes deployment issues.

These take up a fair amount of DevOps time, and are unpredictable in whether resolution is simple, complicated, or even doable. Almost every fix is somewhat DDF (Duct-tape Driven Fix).

Migrating to Heroku would simplify this process, especially since the servers we're using are infrequently updated and set up in a fairly non-standard way.

Would love to hear @DFurnes's and @katiecrane's thoughts on level of effort and complexity here, as well as thoughts on the best timing and the tradeoffs of a code migration effort vs. addressing issues as they continue to come up.

Update Rogue Prod and Dev `binlog_format` to be ROW instead of MIXED

We need to update the default parameter group to have the binlog_format variable set to ROW instead of MIXED, to allow for DMS streaming replication of Rogue's Campaigns table into Quasar.

Currently Rogue Prod and Dev are on MariaDB 10.2. Talking to @DFurnes, these aren't captured in Terraform yet, so I'm going to manually do the following:

  • Notify #announce-tech of update/potential small downtime window.
  • Set upgrade to MariaDB 10.3 for Rogue Dev and Prod
  • Create parameter group with binlog_format set to ROW based on the MariaDB 10.3 param family.
  • Confirm on Sunday evening that the MariaDB upgrade has taken, and swap out MariaDB parameter group and reboot if necessary.
  • Notify #announce-tech if there's any downtime and when work is complete.

@DFurnes is going to capture these changes in Terraform, likely slated for next sprint.
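
When that happens, the parameter group change might look roughly like the sketch below (the group name and family are assumptions; the real values should match what exists in RDS):

resource "aws_db_parameter_group" "rogue" {
  name   = "rogue-mariadb-10-3"
  family = "mariadb10.3"

  # Matches the manual change described above: enable row-based binlogs for DMS.
  parameter {
    name  = "binlog_format"
    value = "ROW"
  }
}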

Simplify front-end (Phoenix & Ashes) Fastly configs.

BUG REQUEST

Current Behavior

We simplified a lot of our Fastly config in #39, and moved the majority of properties into Terraform so that we can track changes in code. The two outstanding services that still need to be cleaned up are DoSomething.org (the O.G), and thor.dosomething.org.

These contain routing rules for Phoenix & Ashes, including relatively sizable dictionaries for redirects and backend assignments. As we've moved more stuff into Phoenix, this has created an increasing workload (for devops & now the product team) since every new URL needs to be manually assigned.

Desired Behavior

First of all, let's move these properties into Terraform!

Since we're creating the majority of (…or all??) new content on Phoenix, I'd like to flip the "default" backend to that application. This should remove all the work of pool assignments in one fell swoop.

I'd also like to see if we can simplify redirect logic, since currently PMs need to create redirects for every distinct URL that a user may visit (including query strings, like UTMs!). In nearly every case, we only really care about the path when creating a URL redirect.
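
Once these properties are in Terraform, flipping the default could look something like the sketch below (all names, addresses, and conditions are invented for illustration). In the Fastly provider, a backend with no request_condition acts as the default, so Phoenix would catch everything that isn't explicitly routed to Ashes:

resource "fastly_service_v1" "dosomething" {
  name = "DoSomething.org"

  domain {
    name = "www.dosomething.org"
  }

  # No request_condition, so Phoenix becomes the default backend.
  backend {
    name    = "phoenix"
    address = "dosomething-phoenix.herokuapp.com"
    port    = 443
  }

  # Only legacy paths are explicitly routed to Ashes.
  condition {
    name      = "ashes_paths"
    type      = "REQUEST"
    statement = "req.url ~ \"^/legacy/\""
  }

  backend {
    name              = "ashes"
    address           = "ashes.dosomething.org"
    port              = 443
    request_condition = "ashes_paths"
  }
}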

Relevant Screenshots + Links

N/A

Make it clearer what environment an app is in.

It can sometimes be confusing to see what environment you're in. Even with our simplified naming scheme, you still have to look at the URL to see where you're at, e.g. activity-qa vs. activity. This makes it easier than it should be to make a change in the wrong place, or wonder why your local changes aren't applying because you're refreshing QA! Oof!

I introduced the idea of "environment badges" in GraphQL last year, but never had a chance to roll them out further. I'd like to build a small library to make it easy to add the same feature everywhere:

(Screenshot: environment badge example.)

Figure Out Where to Put Infrastructure Documentation

I'd love to get thoughts from everyone tagged on this ticket on where to house cross-eng infrastructure documentation. Dave and I have discussed this a little bit, but I think it warrants a standardized approach in 2019. Examples of documentation would be:

  1. How do you connect to MongoDB databases from your dev machine?
  2. How do you connect to MariaDB databases from your dev machine?
  3. What are the normal caching rules for our Fastly properties?

While something like point 3 can probably exist in the infrastructure repo, I think it'd be great to have at-a-glance infrastructure info that anyone can look at for our most pertinent information. Do we want to use GitBooks in this repo? Readme in a dedicated infra-doc repo? Something else entirely?

Allow some crawlers on servers outside the US access to our environments.

(Screenshot: Facebook crawler request coming from outside the US.)

When trying to debug and test meta tags in our HTML pages, we ran into an issue where a www-dev.dosomething.org URL was returning the "Sorry" page for all users outside of the US, due to our current approach to dealing with GDPR.

This poses a problem for testing, since crawlers located outside the US can't read our pages properly and get redirected instead.
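
If the GDPR block lives in a Fastly condition, one illustrative option (purely a sketch, not a decision; the condition name and logic are assumptions about how the block is implemented) would be to exempt known crawler user agents inside that condition, within the relevant fastly_service_v1 resource:

condition {
  name = "gdpr_block_non_us"
  type = "REQUEST"

  # Skip the "Sorry" page for Facebook's crawler, even when it comes from outside the US.
  statement = "client.geo.country_code != \"US\" && req.http.User-Agent !~ \"facebookexternalhit\""
}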

Simplify how we configure New Relic agent in Heroku.

We've historically configured New Relic on our Heroku apps by attaching a "placeholder" free-tier New Relic add-on and then replacing the auto-generated NEW_RELIC_LICENSE_KEY environment variable with our own. While this works, it results in some inconsistent behavior –

  1. New Relic must be installed on QA environments, since that's where builds happen. If the application that a build is promoted from doesn't have the add-on, New Relic won't be installed. (And moving forward, we likely won't want New Relic on QA environments to reduce usage!)
  2. Each app gets a standalone unused New Relic account, which has led to confusion.

When setting this up for Longshot, I was able to simplify setup to make this more reliable:

For future reference, the steps followed:

  1. Add the ext-newrelic PHP extension to the application, instead of using a Heroku add-on. This has the advantage of not clobbering environment variables or provisioning a parallel "mini-account" in Heroku that goes unused. (DoSomething/longshot#923)
  2. Add support for per-environment newrelic.enabled, so we can install the New Relic agent in the initial Heroku build but only enable it when the same compiled slug is promoted to a production app. (DoSomething/longshot#924)
  3. Set the proper environment variables per-application, via with_newrelic arg. (#59)

If we continue using New Relic, we should take this approach in other applications as well.
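
In Terraform terms, the per-app wiring could look roughly like the sketch below (the variable mirrors the with_newrelic arg above, but the names, app, and config var keys are illustrative):

variable "with_newrelic" {
  description = "Whether this app should report to New Relic."
  default     = false
}

variable "newrelic_license_key" {
  description = "Shared New Relic license key, supplied via Terraform Cloud."
}

resource "heroku_app" "example" {
  name   = "dosomething-example"
  region = "us"

  # Only set the New Relic variables when the app opts in.
  config_vars = var.with_newrelic ? {
    NEW_RELIC_APP_NAME    = "dosomething-example"
    NEW_RELIC_LICENSE_KEY = var.newrelic_license_key
  } : {}
}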

Migrate Infrastructure Jenkins Jobs to Code/Heroku

BUG

Current Behavior

  • A miscellaneous set of tasks remains in the production Jenkins environment, handling things like weekly DB refreshes from Prod to QA.

Desired Behavior

  • Most of these tasks are just wrapper scripts around bash jobs or other similar utilities. They can and should be moved to code in a repo and run via the Heroku scheduler in one place.

Why This Matters

  • We're not capturing this critical part of infrastructure in code. It provides simple but vital services, and should be run with production grade attention to detail, documentation, and tracking.

Do not serve cached edge content to authenticated requests.

We added caching to Northstar's user profile endpoint in an attempt to improve broadcast performance. Rafa flagged the other day that he was seeing caching on authenticated requests too, though, which is not expected behavior.

From some further investigation this morning, I've confirmed that once a profile is cached in Fastly, we will continue to return that cached "public" profile, even for requests that come in with a privileged authentication token. That's not helpful!

To fix this, we should add a "pass" cache setting for requests with an Authorization header.
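
In the Northstar service's Terraform config, that could be expressed as a request condition plus a request setting with the pass action, roughly like this (names are arbitrary; the blocks live inside the service's fastly_service_v1 resource):

condition {
  name      = "has_authorization_header"
  type      = "REQUEST"
  statement = "req.http.Authorization"
}

# Bypass the cache entirely for authenticated requests.
request_setting {
  name              = "pass_authenticated_requests"
  request_condition = "has_authorization_header"
  action            = "pass"
}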

Revisit how we run extra Rogue RDS backups.

We created a Jenkins job to perform an additional daily backup of the Rogue RDS instance in DoSomething/internal#369, in response to the issues we ran into with DoSomething/internal#357. This "Rogue DB Backup" job ran successfully until November 12th, when it silently failed because we'd hit our cap of 100 manual snapshots.

I noticed this when trying to take final snapshots before terminating databases in #49, and got this error: cannot create more than 100 manual snapshots (Service: AmazonRDS; Status Code: 400; Error Code: SnapshotQuotaExceeded). Bummer!

The short-term fix was to delete the 49 snapshots that had been piling up over the past few months to get us back safely under the limit. Longer-term, we'd like to investigate whether there are better ways to do this, like RDS's point-in-time restores (or at least adding some alerting to that Jenkins job).

Upgrade Heroku apps to Heroku-18 stack.

Heroku released their Heroku-18 stack a while back, running on Ubuntu 18.04. This is the default for newly created apps, but existing apps aren't automatically upgraded.

When we have some time, it'd be nice to test that this is a safe upgrade for each application that's still running on Heroku-16 (Ubuntu 16.04) and standardize this across the board using the stack option on our heroku_app resources in Terraform.
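
Per app, the change should be a one-liner on the existing resource, something like the sketch below (the app name is a placeholder):

resource "heroku_app" "northstar_qa" {
  name   = "dosomething-northstar-qa"
  region = "us"

  # Pin the stack explicitly so the upgrade from heroku-16 is tracked in code.
  stack = "heroku-18"
}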

Update Cert for HRBlock

The SSL cert for caps.hrblock.com is set to expire on 12/9 and needs to be updated. The ideal solution would be to use Let's Encrypt with auto-renew so we don't have to worry about it anymore, but there are some issues with that. A few entries that run before all others are preventing us from using a standalone webserver behind HAProxy to verify the caps URL:

redirect prefix https://longshot-qa.dosomething.org code 301 if { hdr(host) -i longshot-qa.dosomething.org }
redirect prefix https://footlockerscholarathletes.com code 301 if { hdr(host) -i footlockerscholarathletes.com }
redirect prefix https://footlockerscholarathletes.com code 301 if { hdr(host) -i www.footlockerscholarathletes.com }
redirect prefix https://caps.hrblock.com code 301 if { hdr(host) -i caps.hrblock.com }

I need to add these entries to allow Let's Encrypt to validate the URL and enable auto renew:

# Test URI to see if it's a Let's Encrypt request
acl letsencrypt-acl path_beg /.well-known/acme-challenge/
use_backend letsencrypt-backend if letsencrypt-acl  

# LE Backend
backend letsencrypt-backend
    server letsencrypt 127.0.0.1:8888

The issue is that because HAProxy always runs redirect rules before use_backend and sends all requests straight to the Heroku app, Let's Encrypt isn't able to use our standalone server to respond properly to the verification. Are we able to move or remove those entries, or accomplish their purpose in another way? @sheyd @DFurnes If we can, we'll be able to use the same approach for Footlocker as well.

Here's the command to set up the certs initially:

sudo certbot certonly --standalone -d caps.hrblock.com \
    --non-interactive --agree-tos --email [email protected] \
    --http-01-port=8888

Here's what the auto-renew script will look like for reference:

#!/usr/bin/env bash

# Renew the certificate
certbot renew --force-renewal --tls-sni-01-port=8888

# Concatenate new cert files, with less output (avoiding the use of tee and its output to stdout)
bash -c "cat /etc/letsencrypt/live/caps.hrblock.com/fullchain.pem /etc/letsencrypt/live/caps.hrblock.com/privkey.pem > /etc/ssl/demo.scalinglaravel.com/caps.hrblock.com.pem"

# Reload  HAProxy
service haproxy reload  

There are some additional setup steps not listed here; I'm following this tutorial as a guide: https://serversforhackers.com/c/letsencrypt-with-haproxy

Heroku provider doesn't support auto-scaling.

We're currently using Heroku Autoscaling to automatically spin up extra Northstar dynos when server load increases (usually during an SMS broadcast). Unfortunately, this is not supported by Terraform's Heroku provider, so any plan run during a broadcast marks this resource as "dirty" and attempts to scale back to the configured quantity.
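
One possible stopgap (an assumption on my part, not something we've settled on) is to let Heroku's autoscaler own the dyno count and tell Terraform to ignore it; heroku_app.northstar here is assumed to be defined elsewhere:

resource "heroku_formation" "northstar_web" {
  app      = heroku_app.northstar.name
  type     = "web"
  size     = "Performance-M"
  quantity = 1

  lifecycle {
    # The autoscaler changes quantity at runtime; don't try to scale it back on apply.
    ignore_changes = [quantity]
  }
}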

We've also seen somewhat iffy results with autoscaling (detailed in DoSomething/devops#435) since it's based on response time (rather than request queueing or throughput), and so will tend to thrash between over- and under-provisioning as response time stabilizes and deteriorates.

Simplify Fastly config & track changes in code.

BUG REQUEST

Background

From Matt's write-up in the Q3-Q4 technology memo:

We’ve built up a pretty large catalog of Fastly services, in many cases using a separate service per-environment and per-app. This increases the number of places a change must be made, and the chance that things might get out of sync between environments.

We’ve also had trouble with discoverability and the review process around updating VCLs for applications, as it’s not always clear to application developers when a change has been made or how it will impact their application. We should investigate simplifying our systems to rely on fewer separate configs (re: Fastly ALTITUDE talks from USA Today & Conde Nast), and investigate better processes or tooling around making changes more visible.

Current Behavior

We have 19(!!) Fastly properties, of which DoSomething.org (Phoenix & Ashes), API/Northstar, Rogue, GraphQL, and CatchAll (some redirects) receive the most use.

We also have 4 services for different voting app instances, a search property for Solr, Ashes Staging & Thor, and a property with vanity redirects for two campaigns. The remaining 7 don't seem to be receiving any traffic and can probably be deleted.

Desired Behavior

It'd be great to consolidate these into fewer properties so we can roll out changes across the board more easily (e.g. gzipping or geolocation headers). We should also have separate QA & production configs for our other services (and ideally, an easy way to promote changes from QA to prod).

Finally, it's not always clear what changes are made to our configs and why (e.g. if we forget to drop a note in #deploys), or whether a draft config is safe to push to production or should be discarded (which should be aided by Morgan's new discussions for the DoSomething.org and API/Northstar properties).

Suggested Solution

An easy first step is to audit our existing configs and clean them up! This includes removing unnecessary origins, conditions, custom VCLs, etc. We can also probably delete a bunch of those unused properties, or consolidate ones that are infrequently changed (like the voting apps).

I'd also like to experiment with configuring & tracking our Fastly config (and other infrastructure!) in code with Terraform. This builds off some of the work we've done with CloudFormation on Bertly, but lets us configure more things (like Fastly, but also AWS, Heroku, DNS, Papertrail, etc.) in one place.

Relevant Screenshots + Links

N/A

Rename Blink instances to match new naming scheme.

FEATURE

Current Behavior

We currently have some inconsistencies with domains & environment naming. Let's update those to be more consistent for Blink, following the discussion in #366.

Desired Behavior

We should have the following Heroku apps in a dosomething-blink pipeline:

(Does it make sense to add "dev" environments for Blink?)

Why This Matters

This is a step towards reducing confusion about apps & environments!

Countries Affected (optional)

N/A

OS/Browser (optional)

N/A

Relevant Screenshots + Links

N/A

Rename Gambit instances to match new naming scheme.

FEATURE

Current Behavior

We currently have some inconsistencies with domains & environment naming. Let's update those to be more consistent for Gambit, following the discussion in #366.

Desired Behavior

We should have the following Heroku apps in a gambit-conversations pipeline:

And the following Heroku apps in a gambit-campaigns pipeline:

(Does it make sense to add "dev" environments for Gambit?)

Why This Matters

This is a step towards reducing confusion about apps & environments!

Countries Affected (optional)

N/A

OS/Browser (optional)

N/A

Relevant Screenshots + Links

N/A

Incident: Footlocker Scholarship timeouts.

INCIDENT

What's gone wrong?

On Friday morning, the Foot Locker Scholarship site went down with a Heroku "application error" page. New Relic showed lots of request queueing and we were seeing a slew of H12s in Papertrail.

Restarting the Heroku dyno fixed the issue, although the underlying cause is still unclear.

Timeline

As the incident develops, the "incident lead" should continue to fill in this timeline:

Follow-up:

  • Jen created Ghost Inspector monitoring tests for Footlocker & New Relic.
  • Dave added a New Relic uptime monitor to our alert policy for this application.
  • Dave reached out to New Relic support to figure out why our alert policy didn't trigger.

Relevant Screenshots + Links

Add S3 buckets to Terraform.

We use Amazon S3 for storage in many of our applications. We should move the rest of these buckets, permissions & IAM roles, and environment variable config into Terraform. As a stretch goal, we also have tons of empty & unused buckets that'd be nice to clean up so the S3 admin panel is less intimidating.

Incident: Users unable to create posts on Phoenix.

INCIDENT

What's gone wrong?

We've been receiving support tickets that users are unable to report back on Phoenix, receiving an "Unauthenticated" message in the uploader when they try to submit:

(Screenshot: "Unauthenticated" error in the report back uploader.)

Timeline

  • Deployed Rogue on Thursday at 10:47am EST (diff).
  • Deployed Northstar on Friday at 2:49pm EST (diff).
  • Deployed Phoenix on Friday at 1:22pm EST (diff), and at 2:49pm EST (diff).
  • Help tickets began coming in Friday at 12:31am EST (an unrelated scholarship question) and 9:33pm EST, and continued through the weekend. Hannah found and compiled them, and raised them in #team-product on Monday at 1:50am EST.
  • Matt CC'd the issue in #dev-phoenix at 7:21am.
  • Mendel jumped in and started digging into Rogue errors at 9:46am.
  • Dave saw Hannah's message in #team-product at 10:10am & created this issue. πŸ€“
  • Dave rolled back Phoenix to v206 at 10:35am, resolving the issue in production.
  • Mendel figured out the underlying issue at 10:58am, and pushed up a fix in DoSomething/legacy-website#1182 at 11:17am. This fixed things up on QA.
  • We re-deployed master with that fix at 2:20pm, and ran through manual testing of signup, photo/text/share post, quiz, and article flows on production to make sure no new issues appeared.
  • We reached out to members who were affected by the bug via email on Monday at 6:08pm.

Relevant Screenshots + Links

Improve Northstar performance & concurrency.

This is a follow-up ticket from DoSomething/devops#383.

Background

We successfully moved Northstar from AWS (an m4.xlarge EC2 instance) to Heroku (1-5 autoscaled Performance-M dynos). This simplified our operations & also allows us to scale up-and-down based on demand more easily. I've monitored performance and made some adjustments since then:

From July 17th 2018's broadcast:

We ran a couple of broadcasts yesterday - a smaller 11:00am broadcast, and a 2:00-8:30pm full-list broadcast (which is continuing today). Both were set to run at 75rps, and seem to have happily run at between 70-85rps(!!). Here's what New Relic had to say:

(Screenshot: New Relic response time graph for the broadcast.)

And here's that same response time graph overlaid over auto-scaling events:

(Screenshot: the same response time graph overlaid with auto-scaling events.)

It's interesting to see how much scaling fluctuates between 2-4 dynos when under load, which perhaps suggests we should drop our desired p95 response time a little more (currently at 750ms, curious to try 500ms again).

And the following day when we finished that broadcast:

I think one problem we're running into with auto-scaling is that it's entirely based on response time, so if we get good performance at 3 dynos, we scale down to 2 until things bog down, and then back up... for example, here's the hour from 12:50-1:50pm with a target 500ms p95 response time:

(Screenshot: 12:50-1:50pm with a 500ms p95 response time target.)

Current Behavior

I think we've found the sweet spot for p95 response time (750ms). This seems to let us spin up quickly enough when throughput spikes, but still happily provisions down to 1 dyno when throughput drops.

I think the remaining scaling issues we're seeing come from alternately over- and under-provisioning, since Heroku's autoscaling is based on the past hour's performance (so we scale up when things get too slow, but then as soon as they're under control we scale back down again... gah!!)

While things seem to be working mostly okay, we do still see some slow requests & 503s when in one of those underprovisioned "dips" (until we scale back up again).

Desired Behavior

We may want to consider handling scaling ourselves (based on throughput), or further tuning performance to get more out of each individual dyno.

Why This Matters

We want to make sure we're getting the best performance bang for our buck! Specifically, we want to make sure that we can get the most throughput when we need to make a ton of API requests to Northstar during a broadcast.

Checklist: Migrate campaign metadata from Ashes & remove campaign run IDs.

Overview

In order to retire Ashes, we need to move legacy campaigns to the Rogue database. This document has more information about what data will be migrated to Rogue and how we will deprecate runs. Below is a checklist of the order of operations for the migration to successfully run.

Order of Operations

First, we need to create the new Campaign IDs table in Rogue (based on old IDs/Run IDs):

  • Team Bleed to merge PR to migrate all legacy campaigns from Ashes to Rogue.
  • Morgan to complete taking the Campaigns table from Rogue and piping it into Quasar
  • Team Bleed to run/test migration script to get all legacy campaigns into Rogue on QA.
  • ALL TEAMS: Check that data looks good & everything still works as expected on QA!
  • Run migration script to get all legacy campaigns into Rogue on production.
  • ALL TEAMS: Check that data looks good & everything still works as expected on production!

Once we have that table, we'll update signups & posts from old runs to their new canonical IDs:

  • Team Bleed to write script to update all signup and post campaign_ids in Rogue's signups and posts table according to new data in campaigns table.
  • Remove logic from Rogue which queries on campaign_run_id on QA! (DoSomething/rogue#788)
  • Deploy the Rogue fix for campaign_run_id and new signups to production. (DoSomething/rogue#788) Whoops, never mind!
  • Update Rogue Admin to read campaign data from the above table, so we don't get errors when trying to query Ashes for these new consolidated campaign IDs. This has been tested by @DFurnes and @katiecrane on QA (pre-script) and all looks well!
  • Team Bleed to run the script to update signups/post's campaign_id on QA.
  • Team Storm will then update the signup/post campaign IDs on Quasar QA based on these SQL queries.
  • Team Bleed: Check that data looks good & everything still works as expected on QA.
  • Team Storm: Check that data looks good & everything still works as expected on Quasar QA.
  • Team Storm: Check that data looks good & everything still works as expected on Looker QA.
  • Team Bleed to run the script to update signups/post's campaign_id on production.
  • Team O'Doyle to deploy DoSomething/gambit-conversations#456, which updates it to filter by campaign_id when checking posts/signups (it was previously filtering only by campaign_run_id).
  • Team Bleed to update Rogue to ignore campaign_run_id values when querying, creating, and updating signups or posts (by deploying DoSomething/rogue#798 to production).
  • Team Storm will then update the signup/post campaign IDs on Quasar Production based on these SQL queries.
  • Team Bleed: Check that data looks good & everything still works as expected on production.
  • Team Storm: Check that data looks good & everything still works as expected on Quasar production.
  • Team Storm: Check that data looks good & everything still works as expected on Looker production.
  • ALL TEAMS: Check that data looks good & everything still works as expected on production!

Once this script has run, we can update front-ends to exclude Run IDs anytime:

  • Phoenix will stop sending campaign_run_id on the next production deploy.
  • Gambit to exclude campaign_run_id from all Rogue requests (altering the signups filtering to use campaign_id instead of campaign_run_id when checking for signup.why_participated).

Other things to think about in 2019:

  • Galleries previously showed all runs for a multi-run campaign; now they'll just show the "latest" campaign. We may need to rethink this for pre-seeding campaigns.

Migrate custom VCLs into Fastly snippets.

The upcoming 0.4.0 release of Terraform's Fastly provider includes support for VCL snippets. This is a welcome change from maintaining a completely custom VCL, and we should consider migrating these once that release is shipped.
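
Once we're on that provider version, a snippet could be declared directly on the service, roughly like this (the name, type, and content are placeholders):

snippet {
  name     = "mark_requests"
  type     = "recv"
  priority = 100

  # Arbitrary example: tag requests with a header in vcl_recv.
  content = "set req.http.X-Managed-By = \"terraform\";"
}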

Convert HAProxy instance to a Fastly property w/ Fastly Anycast.

ISSUE

Current Behavior

  • The Footlocker cert expired, and I've renewed it using Let's Encrypt. We'd like to continue using Let's Encrypt going forward for all our properties because it's free and because it auto-renews. The certs have to be renewed every 90 days, which would be a non-issue if we were pointing directly at the webserver (a cron job would handle it), but because we're using HAProxy we have to go through somewhat extraordinary measures to use the Let's Encrypt auto-renew features.

Desired Behavior

  • Per @sheyd's suggestion, we should migrate the old HAProxy box to Nginx so we can take advantage of the auto-renew features and eliminate these issues for good. This will also have the added benefit of cleaning up that config and removing cruft that's no longer necessary.

Why This Matters

  • We don't want to have to think about certs anymore!

Properties Affected (optional)

  • Including but not limited to Footlocker, HRBlock, and other properties that go through the current standalone HAProxy box

NOTE: Footlocker will expire 1/29/19 and HRBlock will expire 12/9/18

Standardize more common patterns with modules.

We tend to duplicate a lot of the same boilerplate for each service (e.g. Rogue and Northstar look pretty much identical), and likewise for the same app across environments (see Northstar Dev, Northstar QA, and Northstar Production).

We should consider standardizing some of these common patterns with reusable DoSomething-specific modules (like a heroku_php_app module that sets up the app, buildpack, standard environment variables, domain, log drain, and just has arguments for things we may customize per-app).
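
A hypothetical heroku_php_app module interface might look like the sketch below (everything here is invented for illustration; the real module would expose whichever arguments we actually customize per-app):

variable "name" {}
variable "domain" {}
variable "environment" {}
variable "papertrail_url" {}

resource "heroku_app" "app" {
  name   = var.name
  region = "us"

  buildpacks = ["heroku/php"]

  config_vars = {
    APP_ENV = var.environment
  }
}

resource "heroku_domain" "domain" {
  app      = heroku_app.app.name
  hostname = var.domain
}

resource "heroku_drain" "papertrail" {
  app = heroku_app.app.name
  url = var.papertrail_url
}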

Move all Longshot environments for Foot Locker & HR Block to RDS

FEATURE OVERVIEW

User Story

As DS, we want every Longshot environment for Foot Locker (internal and external) and HR Block to be on RDS, so that RDS handles automated backups, upgrades, and availability.

Additional Information

This was done for Foot Locker production here

Related ticket

This was also resurfaced in a Longshot post-mortem as a next step here

Tentative soft launch date for HR Block is Nov 6, hard launch is Nov 7

Why This Matters

We want more stability with the Longshot app across environments

Definition of Done

Given that I'm a DS dev
When I look at any Longshot environment
Then it's on RDS and things are working as expected

Additional Things to Consider for Done

(1) Test needed?
(2) Documentation needed?

Add S3 Bucket Permissions for Blockbuster App to Terraform

We use the dosomething-blockbuster S3 bucket for the Gala and Summit every year. We want to be able to toggle public access on/off, as well as define the IAM role and permissions for write access to the bucket, in Terraform so we don't have to manually muck about with the web console.
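
A starting point might look like the sketch below (the bucket name matches the issue; the ACL, user, and policy details are assumptions):

resource "aws_s3_bucket" "blockbuster" {
  bucket = "dosomething-blockbuster"

  # Flip to "public-read" for the event, then back to "private" afterwards.
  acl = "private"
}

resource "aws_iam_user" "blockbuster" {
  name = "blockbuster-app"
}

resource "aws_iam_user_policy" "blockbuster_write" {
  name = "blockbuster-s3-write"
  user = aws_iam_user.blockbuster.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:PutObject", "s3:GetObject"]
      Resource = "${aws_s3_bucket.blockbuster.arn}/*"
    }]
  })
}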

Automate dependency updates across applications.

We have some recurring cards to run npm update && composer update in Northstar and Rogue, but it's easy to forget to do this until GitHub sends out a security alert (and that's only for JavaScript dependencies for now). We should look into something like Greenkeeper (just npm), DependenCI (just Composer, killer name) or Snyk (everything) to automate this.

Add RDS databases to Terraform.

We use Amazon RDS for most of our application databases because it offers good pricing, performance, and automated backups. We currently configure and hook these up to applications by hand, but this is prone to errors. We should audit these & manage them with Terraform.
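
Bringing an existing database under management might look roughly like this (engine, sizes, and identifiers are placeholders; real values must match the running instance):

variable "rogue_db_username" {}
variable "rogue_db_password" {}

resource "aws_db_instance" "rogue" {
  identifier     = "rogue-production"
  engine         = "mariadb"
  engine_version = "10.3"
  instance_class = "db.m4.large"

  allocated_storage       = 100
  backup_retention_period = 7
  multi_az                = true

  username = var.rogue_db_username
  password = var.rogue_db_password

  # Guard against an accidental destroy of a production database.
  deletion_protection = true
}

The existing instance could then be adopted with terraform import aws_db_instance.rogue rogue-production rather than recreated.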

Add SQS queues to Terraform.

We use Amazon SQS for queuing in many of our applications (Northstar, Rogue, and Chompy) because it's affordable, simple, and reliable. Like other third-party services, though, we configure queues & hook them up to applications by hand!

We should move these queues, IAM roles, and environment variable config into Terraform.
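
As a sketch, one queue plus its wiring into an app's environment could look like this (the queue, app, and config var names are illustrative):

resource "aws_sqs_queue" "northstar" {
  name = "northstar-production"

  # Give workers a minute to process each job before it becomes visible again.
  visibility_timeout_seconds = 60
}

resource "heroku_app" "northstar" {
  name   = "dosomething-northstar"
  region = "us"

  config_vars = {
    # aws_sqs_queue.id is the queue URL, so the app never needs it configured by hand.
    SQS_DEFAULT_QUEUE = aws_sqs_queue.northstar.id
  }
}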

Add DNS records to Terraform.

We should add our DNS records to Terraform so we can manage subdomains & hook them up to the right backend in code. I could've sworn DNSMadeEasy was a third-party provider last time I looked, but it turns out it's included out-of-the-box.
