
services-engineering's Introduction

services-engineering

This is a meta repo for all things related to the Services Engineering team that don't belong specifically to another repo.

How we work:

  • We follow these Rust security-related best practices in our Rust-based projects. Other general Rust tips:
    • If you're having trouble getting Rust to compile, try rm Cargo.lock && cargo clean - this forces an upgrade to your dependencies.
  • We use this estimation guide when adding estimates to tasks. You can see these estimates used as labels throughout our GitHub repos.
  • You can usually get a sense of what we're up to on our GitHub project board.

Where else to find us:

services-engineering's People

Contributors

jrconlin, tublitzed


services-engineering's Issues

Resolve autopush load testing issues

QA is reporting issues with load testing scripts for Autopush & Autoendpoint.

(Noting issue here because it potentially spans multiple repos)

Create PRD for giving users control over their data

Details

This started as a possible project to allow users to export their data. We first need to understand and explore the user problems here. This work will cover doing potential UR (user research) and similar discovery in order to inform these requirements.

Related links:

1437589

Sync errors are constantly received while syncing an account with a large data volume

Build:
Stage

Affected platforms:
Windows 10

Prerequisites:
Set identity.fxaccounts.autoconfig.uri to https://accounts.stage.mozaws.net
and services.sync.log.appender.file.logOnSuccess to true

Steps to reproduce:

  1. Log in on 2 Firefox profiles using the following testing account: [email protected] - sync ID 5752141
  2. On the first profile, spend some time creating some new browsing data.
  3. Perform a Sync Now on the first profile.
  4. Trigger a Sync Now on the second profile.

Expected results:
All the new data created while browsing on the first profile is successfully synced to the second profile without encountering any errors.

Actual Results:
All the new data from the first profile is synced to the second profile, but only error log files are displayed in about:sync-log.

I am attaching a sync error file:
syncerror.txt


Port tokenserver to pypy

Check to see if tokenserver can run under pypy and what changes might be needed.

Requirements:

  • Needs to cover the tokenserver scripts included in this repo.
  • Needs to verify via load tests that we're not seeing significant differences.

Refine migration script

As a follow-up to the initial migration script, there are a few issues mentioned in the PR that may need some investigation (thanks @pjenvey for pointing them out):

  • (probably) handle users exceeding the batch limit
  • how to handle errors during a user migration: can the entire user's move be in one transaction (so if an error happens, we cleanly abort)? And if we abort a user's migration, should we retry, or just report the user ID that failed for fixing up later? (See the sketch below.)

If we think of more, let's just add them to the list here; if this list gets too big, we can break it down into smaller tasks.
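
To make the discussion concrete, here's a minimal sketch of the one-transaction-per-user shape with bounded retries and failure reporting. It's an illustration only: the table and column names, the source query, and the retry policy are assumptions, not the real schema or migration code.

```python
# Hypothetical sketch: table/column names and the retry policy are
# assumptions for illustration, not the real schema or migration code.
import logging

import pymysql                     # pip install pymysql
from google.cloud import spanner   # pip install google-cloud-spanner


def migrate_user(mysql_conn, spanner_db, uid, max_attempts=3):
    """Move one user's rows inside a single Spanner transaction.

    If anything fails mid-way, the transaction aborts cleanly; after
    max_attempts we give up and report the uid for later fix-up.
    """
    # Pull the user's rows from the legacy MySQL-backed sync storage.
    with mysql_conn.cursor() as cur:
        cur.execute(
            "SELECT collection, id, modified, payload FROM bso WHERE userid = %s",
            (uid,),
        )
        rows = cur.fetchall()

    def _insert(txn):
        # All inserts ride in one transaction, so a failure leaves no
        # partial data for this user behind in Spanner.
        txn.insert(
            table="bsos",
            columns=("fxa_uid", "collection_id", "bso_id", "modified", "payload"),
            values=[(uid,) + tuple(row) for row in rows],
        )

    for attempt in range(1, max_attempts + 1):
        try:
            spanner_db.run_in_transaction(_insert)
            return True
        except Exception:
            logging.exception("attempt %d failed for uid %s", attempt, uid)

    logging.error("giving up on uid %s; flag for manual fix-up", uid)
    return False


if __name__ == "__main__":
    # Placeholder connection details for a local dev run.
    mysql_conn = pymysql.connect(host="localhost", user="sync", password="", db="syncstorage")
    spanner_db = spanner.Client().instance("sync-dev").database("syncstorage-dev")
    migrate_user(mysql_conn, spanner_db, uid=12345)
```

Note that this shape doesn't address the first bullet: a user whose data exceeds the per-transaction mutation/batch limit would still need to be split up or handled out of band, which is exactly the decision we need to make.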

Move sync testing doc into the ecosystem platform docs

i.e., move the contents of this doc into the new Ecosystem Platform docs (aka dochub) here.

This is the first step in a team KR for Q1 2020 to allow for complete Sync engine testing against staging.

The idea is that in the process of moving (and completely testing) these docs, I'll be able to pinpoint what's missing in our ability to test, and get the appropriate issues filed with relevant teams.

Since this likely will require getting those issues filed early in order to get them in for Q1, I'm prioritizing this one now.

Specifically, this needs to cover testing against staging for:

  • Lockwise
  • Fenix (not doing Fennec yet; by the time this is done, we'll ideally be moved over - if we need to come back to it, we will)
  • iOS
  • Desktop (test against OSX and Windows to make sure there are no differences that need to be called out)

Verify that user data is not completely re-synced post migration.

The plan here is as follows:

  • 1. Find a single test user account still pointed at legacy sync nodes. (Sync ID 128881199 - [email protected]). Currently pointed at this legacy sync node.
  • 2. Enable local trace logging using about:sync. Enable success logging by setting services.sync.log.appender.file.logOnSuccess to true in about:config.
  • 3. Replace bookmarks collection with this collection via the bookmarks UI. This will force an entirely fresh upload of the collection.
  • 4. Collect the logs from this completely fresh upload, and save them. Logs live here
  • 5. On April 6th, this sync user will be migrated to spanner.
  • 6. Once step 5 is complete, test the sync service by adding a new bookmark. Collect and analyze the logs to confirm that a) the user is pointed at Durable Sync and b) the user did NOT re-upload the entire sync collection after being migrated. (See the log-scan sketch below.)
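
To make the log analysis in step 6 repeatable, a small scan over the saved success logs could tally how many records were uploaded. This is only a hypothetical sketch: the log directory, filename glob, and the "outgoing" line pattern are assumptions and would need adjusting to the actual about:sync log format.

```python
# Hypothetical sketch: the log directory, filename glob, and line pattern
# are assumptions; adjust them to the real about:sync success log format.
import pathlib
import re

LOG_DIR = pathlib.Path("weave/logs")               # assumed log location
UPLOAD_RE = re.compile(r"(\d+)\s+outgoing", re.I)  # assumed log phrasing


def count_uploaded_records(log_dir=LOG_DIR):
    total = 0
    for log_file in sorted(log_dir.glob("success-sync-*.txt")):
        for line in log_file.read_text(errors="replace").splitlines():
            match = UPLOAD_RE.search(line)
            if match:
                total += int(match.group(1))
    return total


if __name__ == "__main__":
    # A post-migration sync should show roughly one new bookmark being
    # uploaded, not the entire collection from step 3.
    print("records uploaded:", count_uploaded_records())
```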

[spanner] Reevaluate the BsoModified index

The secondary index on the modified column doesn't get nearly as much use as we intended. Let's reevaluate its usefulness, because it may not be worth keeping.

Removing it would save a small number of mutations and some storage space, and, though likely a long shot, might provide a small write-speed improvement for the migration.
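
For reference, if we do decide to drop it, the change itself is a one-line schema update via the Python Spanner client. The instance and database IDs below are placeholders; only the index name comes from this issue.

```python
# Rough sketch: instance and database IDs are placeholders, not the real
# deployment names; only the index name comes from this issue.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("sync-instance").database("syncstorage")

# update_ddl returns a long-running operation; block until it completes.
operation = database.update_ddl(["DROP INDEX BsoModified"])
operation.result()
print("BsoModified index dropped")
```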

Testing and regular release plan for DS

To help with stability of the service, we're going to create a regular testing and release plan that we can share out with other teams.

I'll draft this and then circulate it for feedback. Once ready, we can determine where its final home should be.

Write a test migration script for the initial DS migration test.

This can be rough for now, as we're testing a single sync user first; it will be used for our initial test of migrating one sync user.

  • The script should be able to point at an existing Sync DB (dev is fine) and pull data by sync ID (or some identifier that we can get from the sync ID).
  • It should then be able to insert the same data into the Spanner dev DB.

As long as the DB hosts are easily configurable, we should eventually be able to modify this to run against stg/prod as we refine it (see the configuration sketch below).
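
Here's a rough sketch of the configuration side, to keep the DB hosts swappable between dev/stg/prod. The flag names and defaults are made up for illustration; the actual pull/insert would follow the per-user transaction sketch earlier on this page.

```python
# Hypothetical sketch: flag names and defaults are illustrative only.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description="Copy one sync user's data from a Sync MySQL DB to a Spanner dev DB."
    )
    parser.add_argument("--mysql-host", default="localhost",
                        help="existing Sync DB host (dev is fine)")
    parser.add_argument("--mysql-db", default="syncstorage")
    parser.add_argument("--spanner-instance", default="sync-dev")
    parser.add_argument("--spanner-database", default="syncstorage-dev")
    parser.add_argument("--sync-id", required=True,
                        help="identifier used to pull the user's data")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # Connect using args.mysql_host / args.spanner_instance here, pull rows
    # by args.sync_id, and insert them into the Spanner dev DB (see the
    # per-user migration sketch above for that shape).
    print(f"would migrate sync id {args.sync_id} "
          f"from {args.mysql_host}/{args.mysql_db} "
          f"to {args.spanner_instance}/{args.spanner_database}")
```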

Durable Sync Storage Migration Plan

Description

We want to have a reasonable understanding of:

  1. Technical challenges / risk
  2. The value we are providing to our end users
  3. Agreement from other teams we need to collaborate with to deliver value
  4. A very rough estimate of the timeline to deliver to production.

Acceptance Criteria

Note: Good examples of engineering discovery docs are Sync Manager RFC and trusted networking doc

Related issues:

Tracked here, in this Meta Issue

Documents

[meta] sync quota tracking bug

Spanner has a limitation on how much data we can store per user collection. We need to find a way to move forward with the migration in a way that doesn't negatively impact Spanner.

Affected users tracked in 1653022

This will track:

Ops requirements needed for production rollout of Durable Sync.

We need some details around what's needed from an Ops perspective to feel comfortable going from Phase 1 to Phase 2 here in the plan to roll out Durable Sync for new users only.

So we can get some more accurate timelines attached to the above plan, can you provide us with details around what you need in order to feel comfortable routing 10% of new users to Spanner?

As far as details around our timeline...we'd like to get to Phase 2 before the end of the year.

Find a way to connect Durable Sync to FF locally.

i.e., it'd be really nice to be able to do the same thing you can do with existing sync: set a flag in FF to point at a local token server, and watch Sync Now point to your local DB.

Let's find a way to make sure we don't lose this ability with Durable Sync. Phil had a good idea: perhaps just point FF directly at the local syncstorage-rs instance and mock a tokenserver endpoint (a rough sketch of that is below). Or, we could tweak the token server to allow this as well (or at least document how to do it, if we're not making/committing changes).

Let's scope this to just getting this working against MySQL-based Durable Sync. Then, a follow-up can be to document how to connect to spanner-dev.
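
To illustrate the mock-tokenserver idea, here's a minimal stdlib sketch that answers the token request with a response pointing at a local syncstorage-rs instance. The route, field names, and values are assumptions modeled on the Sync token server's response shape; treat them as placeholders to adjust against the real protocol.

```python
# Hypothetical sketch: the route and the response fields are assumptions
# modeled on the Sync token server response; adjust to the real protocol.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOCAL_SYNCSTORAGE = "http://localhost:8000/1.5/1"  # assumed local endpoint


class MockTokenServer(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/1.0/sync/1.5"):
            self.send_error(404)
            return
        body = json.dumps({
            "id": "fake-token",
            "key": "fake-secret",
            "uid": 1,
            "api_endpoint": LOCAL_SYNCSTORAGE,
            "duration": 300,
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Point the browser's token server pref at
    # http://localhost:5000/1.0/sync/1.5 for local testing.
    HTTPServer(("localhost", 5000), MockTokenServer).serve_forever()
```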

Provide better support for self-host builds

The Google Spanner library continues to be unpublished. We need to provide better support so that self-hosters can start helping with this.

Two possible options:

  1. Put the Spanner functions behind a feature flag; self-hosters only get MySQL for now (preferable).
  2. Vendor the Spanner library with LOTS OF WARNINGS about its beta, unsupported nature.

Create generic project runbook

Once the generic version is ready, customize it for Durable Sync and make sure it's easily accessible to all.

Things to include:

  • a checklist for issues like, say, the Rust logging problem we ran into for autopush in prod and Durable Sync in stg. Possibly even set up integration tests like the one suggested here to make sure we don't accidentally break this again.

  • a release calendar here that we can share with QA.

  • a guide for dependency upgrades (i.e., get them on a regular cadence so we don't fall too far behind)

Useful references:

Create an internal "Rust Best Practices" doc

We've learned a few lessons recently (e.g., the verbose logging) around dos and don'ts for working with Rust. It would be nice to have some sort of best practices doc that we could add learnings like this to, and then share internally with other teams.

Sync master doc - link to all the rest

We have a TON of internal docs related to sync. Let's get a single master doc together that links to them all, so we can start navigating the sea of docs a little more easily and hopefully pinpoint areas where we're missing clear docs or details.

Create a technical project handoff plan for fxa-email-server

This will be a deep dive into the existing fxa email service. We need to document what it would take specifically in order for our team to take this over.

Specifically, let's try to answer the following:

  • What specifically needs to change for us to take on ownership of this project?
  • Are there things we'd like to change/improve upon over the longterm that may not make sense for the initial handoff? If so, what?
  • How much time do we anticipate spending doing the work to take over the project?
  • How much time do we anticipate spending to maintain the project?
  • Who else do we need to coordinate with here in order to make sure this handoff is successful?
  • What kinds of monitoring/metrics exist around the project?

Create release calendar for syncstorage-rs

One of the follow-ups from this work is to create a regular release calendar. This should include:

  1. Assigning release managers (to rotate amongst team members)
  2. Clarifying roles for dev/ops at each stage
  3. Sharing with dev/ops/qa/product
  4. Linking to calendar within ds runbook

Future of Sync blog post

We've been fairly silent about what our plans for Sync are (we've discussed them internally a lot, but not with the wider public, and that's bad). This leads to things like this, but more often leads to FUD.

We need to put together a lengthier post highlighting what our plans are, how they will impact folks, and some of the reasons we picked what we did.

Document: Create "new project acceptance criteria"

Create a checklist of required items before the Services Engineering group can sign off on accepting a new project.
e.g., access to sources, issues, and Bugzilla; up-to-date documentation; the location of metrics, logs, and dashboards; a system topology chart; etc. Basically, all the things we would need to seamlessly operate and possibly improve a system. Yes, this will require the original team to do some work, not just throw a project over a wall and run like hell. If they do that, we sunset the project since we can't maintain it.

Test migration behaviour against client with fresh FF install

After running through these tests, we want to check against the following scenario:

When a user with this collection of bookmarks gets migrated, what happens when they log in for the very first time on a brand new sync client?

This can go one of two ways:

a) A sync is performed; data is fetched from Spanner and populated on the device.
b) A sync is performed; the empty dataset that exists in the fresh install replaces the existing dataset, and data is lost.

Create rollout plan for Durable Sync for new users.

Let's ensure we're all on the same page about the % of users we're rolling this out to, the timeline, and our thresholds for enabling it for all users.

Once complete, this should be shared with the Eng/Ops/Sec and PM teams for review and feedback.

Update sync testing docs.

We started adding to the docs here and it looks like a few things have happened since then that are worth documenting to save folks time in the future:

  1. Testing Fenix against both staging and self-hosted sync.
  2. Building iOS locally pointed to local Durable sync.

In at least the case of #2, it's worth making sure we write this up somewhere, as it's time-consuming to figure out otherwise.

Explore sync logs in bigquery.

Specifically, I'd like to try to get at how many users we may be preventing from losing data via the move to Durable Sync.

In addition to what's going to re:dash, we've got a table called log_storage in moz-fx-sync-prod-3f0c over in GCP.
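
As a starting point, here's a rough BigQuery client sketch for poking at that table. Only the project and table names come from this issue; the dataset name and the columns in the query are placeholders until the log_storage schema is confirmed.

```python
# Rough sketch: only the project and table names come from this issue;
# the dataset name and query columns are placeholders.
from google.cloud import bigquery  # pip install google-cloud-bigquery

DATASET = "sync_logs"  # placeholder: confirm the real dataset name
TABLE = f"moz-fx-sync-prod-3f0c.{DATASET}.log_storage"

client = bigquery.Client(project="moz-fx-sync-prod-3f0c")

# Hypothetical query shape; swap in real columns once the schema is known.
query = f"""
    SELECT DATE(timestamp) AS day, COUNT(*) AS log_count
    FROM `{TABLE}`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 30
"""

for row in client.query(query).result():
    print(row.day, row.log_count)
```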

Prepare datasets for profiles for QA

In prep for QA to test the migration script using profiles with varying amounts of data, we need to first:

  1. Determine what kinds of datasets we want to test against.
  2. Prep as much data as we can to send along to SV to assist with QA-ing.

Prototype migration strategy

Related to our end goal of migrating users to Durable Sync, we need to start testing out our implementation plan.

Blocked by:

  • #18 - Manual Migration Script

Testing will happen in 2 phases:

  • Manual testing, with the approach documented here, for several internal users.

  • Assuming this goes well, we'll get a PI request out for Softvision to test this more extensively.

Document: Project Operational checklist

Every system needs regular maintenance. Create a checklist that specifies the actions and scheduling needed to ensure a system stays operational:

e.g., clear out logs monthly, update libraries quarterly, rotate keys yearly, etc.
