
services-engineering's Introduction

services-engineering

This is a meta repo for all things related to the Services Engineering team that don't belong specifically to another repo.

How we work:

  • We follow these Rust security-related best practices in our Rust-based projects. Other general Rust tips:
    • If you're having trouble getting Rust to compile, try rm Cargo.lock && cargo clean - this forces an upgrade to your dependencies.
  • We use this estimation guide when adding estimates to tasks. You can see these estimates used as labels throughout our GitHub repos.
  • You can usually get a sense of what we're up to on our GitHub project board.

Where else to find us:

services-engineering's People

Contributors

jrconlin, tublitzed


services-engineering's Issues

Resolve autopush load testing issues

QA is reporting issues with load testing scripts for Autopush & Autoendpoint.

(Noting issue here because it potentially spans multiple repos)

Create PRD for giving users control over their data

Details

This started as a possible project to allow users to export their data. We first need to understand and explore the user problems here. This work will cover doing potential UR (user research) and similar discovery in order to inform these requirements.

Related links:

1437589

Sync errors are constantly received while syncing an account with a large data volume

Build:
Stage

Affected platforms:
Windows 10

Prerequisites:
Set identity.fxaccounts.autoconfig.uri to https://accounts.stage.mozaws.net
and services.sync.log.appender.file.logOnSuccess to true

Steps to reproduce:

  1. Log in on 2 Firefox profiles using the following testing account: [email protected] - sync ID 5752141
  2. On the first profile, spend some time creating some new browsing data.
  3. Perform a Sync Now on the first profile.
  4. Trigger a Sync Now on the second profile.

Expected results:
All the new data created while browsing on the first profile is successfully synced to the second profile without encountering any errors.

Actual Results:
All the new data from the first profile is synced to the second profile, but only error log files are displayed in about:sync-log.

I am attaching a sync error file:
syncerror.txt


Port tokenserver to pypy

Check to see if tokenserver can run under pypy and what changes might be needed.

Requirements:

  • Needs to cover the tokenserver scripts included in this repo.
  • Needs to verify via load tests that we're not seeing significant differences.

Refine migration script

As a follow-up to the initial migration script, there are a few issues mentioned in the PR that may need some investigation (thanks @pjenvey for pointing them out):

  • (probably) handle users exceeding the batch limit
  • how to handle errors during a user migration: can the entire user's move be in one transaction (so if an error happens, we cleanly abort)? And if we abort a user's migration, should we retry, or just report the user ID that failed for fixing up later? (See the sketch below.)

If we think of more, let's just add them to the list here; if this list gets too big, we can break it down into smaller tasks.
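
To make the discussion concrete, here's a minimal sketch of the one-transaction-per-user shape with bounded retries and failure reporting. It's an illustration only: the table and column names, the source query, and the retry policy are assumptions, not the real schema or migration code.

```python
# Hypothetical sketch: table/column names and the retry policy are
# assumptions for illustration, not the real schema or migration code.
import logging

import pymysql                     # pip install pymysql
from google.cloud import spanner   # pip install google-cloud-spanner


def migrate_user(mysql_conn, spanner_db, uid, max_attempts=3):
    """Move one user's rows inside a single Spanner transaction.

    If anything fails mid-way, the transaction aborts cleanly; after
    max_attempts we give up and report the uid for later fix-up.
    """
    # Pull the user's rows from the legacy MySQL-backed sync storage.
    with mysql_conn.cursor() as cur:
        cur.execute(
            "SELECT collection, id, modified, payload FROM bso WHERE userid = %s",
            (uid,),
        )
        rows = cur.fetchall()

    def _insert(txn):
        # All inserts ride in one transaction, so a failure leaves no
        # partial data for this user behind in Spanner.
        txn.insert(
            table="bsos",
            columns=("fxa_uid", "collection_id", "bso_id", "modified", "payload"),
            values=[(uid,) + tuple(row) for row in rows],
        )

    for attempt in range(1, max_attempts + 1):
        try:
            spanner_db.run_in_transaction(_insert)
            return True
        except Exception:
            logging.exception("attempt %d failed for uid %s", attempt, uid)

    logging.error("giving up on uid %s; flag for manual fix-up", uid)
    return False


if __name__ == "__main__":
    # Placeholder connection details for a local dev run.
    mysql_conn = pymysql.connect(host="localhost", user="sync", password="", db="syncstorage")
    spanner_db = spanner.Client().instance("sync-dev").database("syncstorage-dev")
    migrate_user(mysql_conn, spanner_db, uid=12345)
```

Note that this shape doesn't address the first bullet: a user whose data exceeds the per-transaction mutation/batch limit would still need to be split up or handled out of band, which is exactly the decision we need to make.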

Move sync testing doc into the ecosystem platform docs

i.e., move the contents of this doc into the new Ecosystem Platform docs (aka dochub) here.

This is the first step in a team KR for Q1 2020 to allow for complete Sync engine testing against staging.

The idea is that in the process of moving (and completely testing) these docs, I'll be able to pinpoint what's missing in our ability to test, and get the appropriate issues filed with relevant teams.

Since this likely will require getting those issues filed early in order to get them in for Q1, I'm prioritizing this one now.

Specifically, this needs to cover testing against staging for:

  • Lockwise
  • Fenix (not doing Fennec yet; by the time this is done, we'll ideally be moved over - if we need to come back to it, we will)
  • iOS
  • Desktop (test against OSX and Windows to make sure there are no differences that need to be called out)

Verify that user data is not completely re-synced post migration.

The plan here is as follows:

  • 1. Find a single test user account still pointed at legacy sync nodes. (Sync ID 128881199 - [email protected]). Currently pointed at this legacy sync node.
  • 2. Enable local trace logging using about:sync. Enable success logging by setting services.sync.log.appender.file.logOnSuccess to true in about:config.
  • 3. Replace bookmarks collection with this collection via the bookmarks UI. This will force an entirely fresh upload of the collection.
  • 4. Collect the logs from this completely fresh upload, and save them. Logs live here
  • 5. On April 6th, this sync user will be migrated to spanner.
  • 6. Once step 5 is complete, test the sync service by adding a new bookmark. Collect and analyze the logs to confirm that a) the user is pointed at Durable Sync and b) the user did NOT re-upload the entire sync collection after being migrated. (See the log-scan sketch below.)
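
To make the log analysis in step 6 repeatable, a small scan over the saved success logs could tally how many records were uploaded. This is only a hypothetical sketch: the log directory, filename glob, and the "outgoing" line pattern are assumptions and would need adjusting to the actual about:sync log format.

```python
# Hypothetical sketch: the log directory, filename glob, and line pattern
# are assumptions; adjust them to the real about:sync success log format.
import pathlib
import re

LOG_DIR = pathlib.Path("weave/logs")               # assumed log location
UPLOAD_RE = re.compile(r"(\d+)\s+outgoing", re.I)  # assumed log phrasing


def count_uploaded_records(log_dir=LOG_DIR):
    total = 0
    for log_file in sorted(log_dir.glob("success-sync-*.txt")):
        for line in log_file.read_text(errors="replace").splitlines():
            match = UPLOAD_RE.search(line)
            if match:
                total += int(match.group(1))
    return total


if __name__ == "__main__":
    # A post-migration sync should show roughly one new bookmark being
    # uploaded, not the entire collection from step 3.
    print("records uploaded:", count_uploaded_records())
```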

[spanner] Reevaluate the BsoModified index

The secondary index on the modified column doesn't get nearly as much use as we intended. Let's reevaluate its usefulness, because it may not be worth keeping.

Removing it would save a small number of mutations and some storage space, and, though likely a long shot, might provide a small write-speed improvement for the migration.
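
For reference, if we do decide to drop it, the change itself is a one-line schema update via the Python Spanner client. The instance and database IDs below are placeholders; only the index name comes from this issue.

```python
# Rough sketch: instance and database IDs are placeholders, not the real
# deployment names; only the index name comes from this issue.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("sync-instance").database("syncstorage")

# update_ddl returns a long-running operation; block until it completes.
operation = database.update_ddl(["DROP INDEX BsoModified"])
operation.result()
print("BsoModified index dropped")
```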

Testing and regular release plan for DS

To help with stability of the service, we're going to create a regular testing and release plan that we can share out with other teams.

I'll draft this and then circulate it for feedback. Once ready, we can determine where its final home should be.

Write a test migration script for the initial DS migration test.

This can be rough for now, as we're testing a single sync user first; it will be used for our initial test of migrating one sync user.

  • The script should be able to point at an existing Sync DB (dev is fine) and pull data by sync ID (or some identifier that we can get from the sync ID).
  • It should then be able to insert the same data into the Spanner dev DB.

As long as the DB hosts are easily configurable, we should eventually be able to modify this to run against stg/prod as we refine it (see the configuration sketch below).
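
Here's a rough sketch of the configuration side, to keep the DB hosts swappable between dev/stg/prod. The flag names and defaults are made up for illustration; the actual pull/insert would follow the per-user transaction sketch earlier on this page.

```python
# Hypothetical sketch: flag names and defaults are illustrative only.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description="Copy one sync user's data from a Sync MySQL DB to a Spanner dev DB."
    )
    parser.add_argument("--mysql-host", default="localhost",
                        help="existing Sync DB host (dev is fine)")
    parser.add_argument("--mysql-db", default="syncstorage")
    parser.add_argument("--spanner-instance", default="sync-dev")
    parser.add_argument("--spanner-database", default="syncstorage-dev")
    parser.add_argument("--sync-id", required=True,
                        help="identifier used to pull the user's data")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # Connect using args.mysql_host / args.spanner_instance here, pull rows
    # by args.sync_id, and insert them into the Spanner dev DB (see the
    # per-user migration sketch above for that shape).
    print(f"would migrate sync id {args.sync_id} "
          f"from {args.mysql_host}/{args.mysql_db} "
          f"to {args.spanner_instance}/{args.spanner_database}")
```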

Durable Sync Storage Migration Plan

Description

We want to have a reasonable understanding of:

  1. Technical challenges / risk
  2. The value we are providing to our end users
  3. Agreement from other teams we need to collaborate with to deliver value
  4. A very rough estimate of the timeline to deliver to production.

Acceptance Criteria

Note: Good examples of engineering discovery docs are Sync Manager RFC and trusted networking doc

Related issues:

Tracked here, in this Meta Issue

Documents

[meta] sync quota tracking bug

Spanner has a limitation on how much data we can store per user collection. We need to find a way to move forward with the migration in a way that doesn't negatively impact Spanner.

Affected users tracked in 1653022

This will track:

Ops requirements needed for production rollout of Durable Sync.

We need some details around what's needed from an Ops perspective to feel comfortable going from Phase 1 to Phase 2 here in the plan to roll out Durable Sync for new users only.

So we can get some more accurate timelines attached to the above plan, can you provide us with details around what you need in order to feel comfortable routing 10% of new users to Spanner?

As far as details around our timeline...we'd like to get to Phase 2 before the end of the year.

Find a way to connect Durable Sync to FF locally.

i.e., it'd be really nice to be able to do the same thing you can do with existing sync: set a flag in FF to point at a local token server, and watch Sync Now point to your local DB.

Let's find a way to make sure we don't lose this ability with Durable Sync. Phil had a good idea: perhaps just point FF directly at the local syncstorage-rs instance and mock a tokenserver endpoint (a rough sketch of that is below). Or, we could tweak the token server to allow this as well (or at least document how to do it, if we're not making/committing changes).

Let's scope this to just getting this working against MySQL-based Durable Sync. Then, a follow-up can be to document how to connect to spanner-dev.
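
To illustrate the mock-tokenserver idea, here's a minimal stdlib sketch that answers the token request with a response pointing at a local syncstorage-rs instance. The route, field names, and values are assumptions modeled on the Sync token server's response shape; treat them as placeholders to adjust against the real protocol.

```python
# Hypothetical sketch: the route and the response fields are assumptions
# modeled on the Sync token server response; adjust to the real protocol.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOCAL_SYNCSTORAGE = "http://localhost:8000/1.5/1"  # assumed local endpoint


class MockTokenServer(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/1.0/sync/1.5"):
            self.send_error(404)
            return
        body = json.dumps({
            "id": "fake-token",
            "key": "fake-secret",
            "uid": 1,
            "api_endpoint": LOCAL_SYNCSTORAGE,
            "duration": 300,
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Point the browser's token server pref at
    # http://localhost:5000/1.0/sync/1.5 for local testing.
    HTTPServer(("localhost", 5000), MockTokenServer).serve_forever()
```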

Provide better support for self-host builds

The Google Spanner library continues to be unpublished. We need to provide better support so that self-hosters can start helping with this.

Two possible options:

  1. Put the Spanner functions behind a feature flag; self-hosters only get MySQL for now (preferable).
  2. Vendor the Spanner library with LOTS OF WARNINGS about its beta, unsupported nature.

Create generic project runbook

Once the generic version is ready, customize it for Durable Sync and make sure it's easily accessible to all.

Things to include:

  • a checklist for issues like, say, the Rust logging problem we ran into for autopush in prod and Durable Sync in stg. Possibly even set up integration tests like the one suggested here to make sure we don't accidentally break this again.

  • a release calendar here that we can share with QA.

  • a guide for dependency upgrades (i.e., get them on a regular cadence so we don't fall too far behind)

Useful references:

Create an internal "Rust Best Practices" doc

We've learned a few lessons recently (e.g., the verbose logging) around dos and don'ts for working with Rust. It would be nice to have some sort of best practices doc that we could add learnings like this to, and then share internally with other teams.

Sync master doc - link to all the rest

We have a TON of internal docs related to sync. Let's get a single master doc together that links to them all, so we can start navigating the sea of docs a little more easily and hopefully pinpoint areas where we're missing clear docs or details.

Create a technical project handoff plan for fxa-email-server

This will be a deep dive into the existing fxa email service. We need to document what it would take specifically in order for our team to take this over.

Specifically, let's try to answer the following:

  • What specifically needs to change for us to take on ownership of this project?
  • Are there things we'd like to change/improve upon over the longterm that may not make sense for the initial handoff? If so, what?
  • How much time do we anticipate spending doing the work to take over the project?
  • How much time do we anticipate spending to maintain the project?
  • Who else do we need to coordinate with here in order to make sure this handoff is successful?
  • What kinds of monitoring/metrics exist around the project?

Create release calendar for syncstorage-rs

One of the follow-ups from this work is to create a regular release calendar. This should include:

  1. Assigning release managers (to rotate amongst team members)
  2. Clarifying roles for dev/ops at each stage
  3. Sharing with dev/ops/qa/product
  4. Linking to calendar within ds runbook

Future of Sync blog post

We've been fairly silent about what our plans for Sync are (we've discussed them internally a lot, but not with the wider public, and that's bad). This leads to things like this, but more often leads to FUD.

We need to put together a lengthier post highlighting what our plans are, how they will impact folks, and some of the reasons we picked what we did.

Document: Create "new project acceptance criteria"

Create a checklist of required items before the Services Engineering group can sign off on accepting a new project.
e.g., access to sources, issues, and Bugzilla; up-to-date documentation; the location of metrics, logs, and dashboards; a system topology chart; etc. Basically, all the things we would need to seamlessly operate and possibly improve a system. Yes, this will require the original team to do some work, not just throw a project over a wall and run like hell. If they do that, we sunset the project since we can't maintain it.

Test migration behaviour against client with fresh FF install

After running through these tests, we want to check against the following scenario:

When a user with this collection of bookmarks gets migrated, what happens when they log in for the very first time on a brand new sync client?

This can go one of two ways:

a) A sync is performed; data is fetched from Spanner and populated on the device.
b) A sync is performed; the empty dataset that exists in the fresh install replaces the existing dataset, and data is lost.

Create rollout plan for Durable Sync for new users.

Let's ensure we're all on the same page about the % of users we're rolling this out to, the timeline, and our thresholds for enabling it for all users.

Once complete, this should be shared with the Eng/Ops/Sec and PM teams for review and feedback.

Update sync testing docs.

We started adding to the docs here and it looks like a few things have happened since then that are worth documenting to save folks time in the future:

  1. Testing Fenix against both staging and self-hosted sync.
  2. Building iOS locally pointed to local Durable sync.

In at least the case of #2, it's worth making sure we write this up somewhere, as it's time-consuming to figure out otherwise.

Explore sync logs in bigquery.

Specifically, I'd like to try to get at how many users we may be preventing from losing data via the move to Durable Sync.

In addition to what's going to re:dash, we've got a table called log_storage in moz-fx-sync-prod-3f0c over in GCP.
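
As a starting point, here's a rough BigQuery client sketch for poking at that table. Only the project and table names come from this issue; the dataset name and the columns in the query are placeholders until the log_storage schema is confirmed.

```python
# Rough sketch: only the project and table names come from this issue;
# the dataset name and query columns are placeholders.
from google.cloud import bigquery  # pip install google-cloud-bigquery

DATASET = "sync_logs"  # placeholder: confirm the real dataset name
TABLE = f"moz-fx-sync-prod-3f0c.{DATASET}.log_storage"

client = bigquery.Client(project="moz-fx-sync-prod-3f0c")

# Hypothetical query shape; swap in real columns once the schema is known.
query = f"""
    SELECT DATE(timestamp) AS day, COUNT(*) AS log_count
    FROM `{TABLE}`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 30
"""

for row in client.query(query).result():
    print(row.day, row.log_count)
```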

Prepare datasets for profiles for QA

In prep for QA to test the migration script using profiles with varying amounts of data, we need to first:

  1. Determine what kinds of datasets we want to test against.
  2. Prep as much data as we can to send along to SV to assist with QA-ing.

Prototype migration strategy

Related to our end goal of migrating users to Durable Sync, we need to start testing out our implementation plan.

Blocked by:

  • #18 - Manual Migration Script

Testing will happen in 2 phases:

  • Manual testing, with the approach documented here, for several internal users.

  • Assuming this goes well, we'll get a PI request out for Softvision to test this more extensively.

Document: Project Operational checklist

Every system needs regular maintenance. Create a checklist that specifies the actions and scheduling needed to ensure a system stays operational:

e.g., clear out logs monthly, update libraries quarterly, rotate keys yearly, etc.
