mozilla-releng / balrog

Mozilla's Update Server

Home Page: http://mozilla-balrog.readthedocs.io/en/latest/index.html

License: Mozilla Public License 2.0

Python 81.17% Shell 0.60% JavaScript 17.74% HTML 0.07% Dockerfile 0.42%

balrog's People

Contributors

ahal, aksareen, allan-silva, alweezy, amuzy, bhearsum, callek, catlee, collin5, davemo, dependabot-preview[bot], dependabot[bot], foxandxss, gabrielbusta, gbrownmozilla, glasserc, harikishen, jcristau, johanlorenzo, michellemounde, ninadbhat, njirap, nthomas-mozilla, nurav, peterbe, pyup-bot, searls, tarekziade, tomprince, tyagi-iiitv


balrog's Issues

use blob schemas where possible in API validation

We've got jsonschemas that can validate all of the different types of Release blobs. It would be great if we could use those as part of the swagger specs, because it would allow us to generate clients that could do client-side blob validation, which can make clients much friendlier.

Swagger only supports a subset of jsonschema, so it may not be possible for all of the current blob types.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1381514)

Rebuild Release blob submission

Submitting data to Balrog Release blobs has become increasingly problematic over the years. Much of this boils down to the fact that we hit tons of data races trying to update a Release: the API requires that clients submit the data version that they're basing their update on, and the database layer rejects the update if the current data version does not match the given one. The problem has been aggravated recently by the addition of more balrogworker instances.

Whatever solution we end up choosing should get us to a place where a submission with valid data works 99.99% of the time. We do not want a solution that still leaves us subject to data races or other scaling issues.

This work is likely to involve the API (I'd like to take this opportunity to get rid of or fix up https://github.com/mozilla-releng/balrog/blob/master/src/auslib/web/admin/views/releases.py, which is really, really ugly and hacky), balrogscript, and possibly the database layer.
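The data race above stems from Balrog's optimistic-concurrency scheme. As an illustrative sketch only (not Balrog's actual client or API; all names here are hypothetical), this is the kind of re-read-and-retry loop submitters end up needing today:

```python
# Hypothetical sketch of a client-side retry loop for Balrog-style optimistic
# concurrency: when the server rejects a stale data_version, re-fetch the
# current one and re-apply the change. All names are illustrative.

class StaleDataVersion(Exception):
    """Raised when the submitted data_version no longer matches the server's."""

def update_release(store, name, new_data, max_retries=5):
    """Retry an update until it wins the data_version race or gives up."""
    for _ in range(max_retries):
        current = store.get(name)  # {"data_version": int, "data": dict}
        try:
            store.put(name, new_data, data_version=current["data_version"])
            return True
        except StaleDataVersion:
            continue  # someone else updated first; re-read and retry
    return False

class FakeStore:
    """Minimal in-memory stand-in for the server side of the protocol."""

    def __init__(self):
        self.rows = {}

    def get(self, name):
        return self.rows.get(name, {"data_version": 0, "data": {}})

    def put(self, name, data, data_version):
        current = self.get(name)
        if current["data_version"] != data_version:
            raise StaleDataVersion()
        self.rows[name] = {"data_version": data_version + 1, "data": data}
```

With many balrogworker instances racing, every worker pays this retry cost, which is exactly the scaling issue the rebuild aims to remove.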

verify history tables entries in tests

bug 1246993 describes a long-standing bug where bad entries were made to history tables while making updates. If the tests for this code had verified the history table entries (in addition to the primary tables), this bug would never have been introduced.

We should make a point of always verifying the history table entries whenever we write tests that modify the database. This bug is to track updating all of the existing tests to do so.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1405337)

footgun protection against undefined behaviour when multiple rules have the same priority

Right now, if an update query matches multiple rules with the same priority, it's not possible to guarantee that a certain one is chosen. This is undefined behaviour and can lead to lots of confusion.

We should do something to make this less of a footgun. In the past we've talked about possibly choosing the "most matching" rule (that is, the most specific one; e.g. a rule that requires build_target+channel is more matching than one that requires just channel).

Another idea could be to just disallow rules with the same priority. Or maybe disallow rules with the same priority when product is the same.
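The "most matching" idea could be sketched as a specificity score over the fields a rule constrains. This is an illustration of the proposal, not Balrog's implementation; the field names are assumptions:

```python
# Illustrative sketch of the proposed "most matching" tie-break: among rules
# with equal priority, prefer the one that constrains the most fields.
# Field names are assumptions, not Balrog's actual schema.

def specificity(rule):
    """Count how many optional fields a rule actually constrains."""
    fields = ("product", "channel", "build_target", "locale", "os_version")
    return sum(1 for f in fields if rule.get(f) is not None)

def pick_rule(matching_rules):
    """Choose the highest-priority rule; break priority ties by specificity."""
    return max(matching_rules, key=lambda r: (r["priority"], specificity(r)))
```

A rule constraining build_target+channel then deterministically beats one constraining only channel at the same priority.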

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1283568)

regularly validate release URLs

We recently discovered an issue where we cleaned up some data on S3 that was thought to be unused, and it broke updates for a significant number of users.

One thing that would help here is to regularly check all of the mapped-to Releases, and make sure that all of the URLs they point to return 200s. I wrote a hacky script to do this as a one-off: https://github.com/mozilla/balrog/compare/master...mozbhearsum:find-bad-mars?expand=1

find-active-mar-urls2.py finds everything that is pointed at and outputs that to a json file. check-urls.py goes through a json file and does a HEAD request on each of them.

This needs to be polished and probably enhanced before we can run it in automation or anything.
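The check-urls half of that workflow could look roughly like the following standard-library sketch. It assumes the JSON file holds a flat list of URL strings, which is a guess about the script's format, not something confirmed here:

```python
# Hedged sketch of the check-urls step: issue a HEAD request against each URL
# and report anything that isn't a 200. Standard library only; the JSON layout
# (a flat list of URLs) is an assumption about the one-off script's output.
import json
import urllib.request

def check_urls(urls, timeout=10):
    """Return (url, status-or-error) pairs for every URL that isn't a 200."""
    bad = []
    for url in urls:
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status != 200:
                    bad.append((url, resp.status))
        except Exception as e:  # malformed URL, DNS failure, 4xx/5xx, timeout
            bad.append((url, str(e)))
    return bad

def load_urls(path):
    """Load the list of URLs produced by the find-active-mar-urls step."""
    with open(path) as f:
        return json.load(f)
```

Running this in automation would also want retries and rate limiting, which the sketch omits.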

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1337148)

figure out how to make scheduled change "when" behaviour more consistent

The current behaviour of Scheduled Changes that use "when" is:

  • You cannot schedule them in the past when you create or update them
  • Scheduled Changes are not enacted until the "when" time has arrived and the required signoffs have been met.

This means that an already scheduled change may end up being scheduled "in the past" if signoffs do not arrive in time. It's confusing and inconsistent that you can't schedule a change in the past, but it can drift there.

A couple of ideas about how to fix this:

  1. Allow changes to be scheduled in the past. The original reason for this safeguard was that we didn't want them enacted immediately. Now that we use Signoffs for most channels, this generally won't happen.
  2. Support a value like "immediately" or "ASAP" as a "when". Changes could be initially scheduled with this, and they would be automatically changed to this if they are not enacted prior to the original "when". This would effectively have the same behaviour as a "when" in the past, but it would be more obvious.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1392720)

don't fire onInsert/Update/Delete until after a transaction successfully commits

We have callbacks that are meant to fire when changes are made to the database. Right now, they fire before the transaction is completed (eg: https://github.com/mozilla/balrog/blob/9f8de88056be59332faa9b79ba2517ad2b0caffa/auslib/db.py#L345), which means the callbacks may send e-mail or otherwise announce changes that end up failing to commit.

I suspect the reason they ended up here is because we pass the query to them, so that interface may need to change.
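One common shape for the fix is to buffer the callbacks on the transaction and only fire them after a successful commit. This is a generic sketch of that pattern, not Balrog's transaction class:

```python
# Sketch of the proposed fix (illustrative, not Balrog's code): queue
# onInsert/onUpdate/onDelete notifications on the transaction and fire them
# only once commit() succeeds, so they never describe rolled-back changes.

class Transaction:
    def __init__(self):
        self._pending_callbacks = []
        self.committed = False

    def defer(self, callback, *args, **kwargs):
        """Queue a notification to run only after a successful commit."""
        self._pending_callbacks.append((callback, args, kwargs))

    def commit(self):
        # ... the real commit against the database would happen here ...
        self.committed = True
        for cb, args, kwargs in self._pending_callbacks:
            cb(*args, **kwargs)
        self._pending_callbacks = []

    def rollback(self):
        self._pending_callbacks = []  # dropped, never fired
```

Since the callbacks currently receive the query object, deferring them this way would indeed require changing that interface, as noted above.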

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1332412)

add tests to ensure that API specs are OpenAPI 3.0 compliant

We've recently started using OpenAPI-style specs to define both our admin and public APIs. We use Connexion to load and create routes for them in Flask. It turns out that Connexion has some extensions that make it possible for us to have Connexion-compatible specs that aren't OpenAPI 2.0 compliant. The consequence of this is that we cannot make use of other OpenAPI 2.0 tools, such as https://github.com/swagger-api/swagger-codegen.

We're still working on becoming OpenAPI 2.0 compliant, but we should add tests to make sure it doesn't regress once we get there.
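A real regression test would run the specs through a full validator (e.g. a tool like openapi-spec-validator; whether that fits the project is an assumption). As a first line of defense, even a minimal structural check catches gross regressions:

```python
# Minimal structural sanity check for a Swagger 2.0 spec, as one could run in
# CI before a full validator is wired up. A real compliance test should use a
# dedicated validator; this only guards the basic skeleton.

def check_spec(spec):
    """Return a list of problems; an empty list means the skeleton is OK."""
    problems = []
    if spec.get("swagger") != "2.0":
        problems.append("missing or wrong 'swagger: 2.0' version field")
    if "title" not in spec.get("info", {}):
        problems.append("missing 'info.title'")
    if not isinstance(spec.get("paths"), dict) or not spec["paths"]:
        problems.append("missing or empty 'paths'")
    return problems
```

Connexion's extensions would still load a spec that fails such checks, which is exactly why a separate test is needed.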

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1387063)

Need to wait for #1084 first.

rule change sandbox

@catlee suggested a way to test out rule changes before applying them. You'd have something that lets you modify one or more rules before they affect actual requests, and check them against a set of criterion.

I interpreted the criterion as specifying a release, with the option to restrict to a particular build target/locale/OS Version/etc to simulate a query. That could save on the nitty gritty of build target strings and buildIDs in the url, but we could leave the general case there too. Then specify which mapping should be used to serve the update, and verify that's what the rules deliver.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1141801, original comment from @nthomas-mozilla)

balrog agent fails to process subsequent changes if an exception is hit for an earlier one

We discovered this while testing https://bugzilla.mozilla.org/show_bug.cgi?id=1310226. Some of the scheduled changes used in testing would generate errors, because signoffs hadn't been given. When looking at sc_id 2, an error was generated (in that case, because signoffs weren't done), and that caused sc_id 3 to never be processed. We probably need to enhance the error handling in https://github.com/mozilla/balrog/blob/dc79a6b06ae1f38fd2d3fb20d8df20e5a7481d35/agent/balrogagent/cmd.py to continue in the two most inner loops if any exceptions are hit.
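The enhancement described amounts to wrapping each scheduled change in its own try/except so one failure cannot block the rest. A sketch of that loop structure (function and field names are assumptions, not the agent's real API):

```python
# Illustrative sketch of the suggested error handling: attempt every scheduled
# change, log and collect failures, and keep going instead of aborting the
# loop. Names are hypothetical, not the balrog agent's actual interface.
import logging

log = logging.getLogger(__name__)

def process_scheduled_changes(changes, enact):
    """Attempt every change; collect failures instead of aborting the loop."""
    failures = []
    for sc in changes:
        try:
            enact(sc)
        except Exception as e:
            log.exception("failed to enact sc_id %s", sc["sc_id"])
            failures.append((sc["sc_id"], e))
    return failures
```

With this shape, an sc_id 2 that fails on missing signoffs no longer prevents sc_id 3 from being processed.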

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1342191)

web apis for history of signoffs to scheduled changes

We keep track of when signoffs to scheduled changes happen in Balrog's database, but we don't expose them anywhere in the API.

The current state of signoffs gets returned as part of the list of scheduled changes, eg: in a GET to /scheduled_changes/rules.

We could do this in two ways:

  1. Add a separate endpoint to get the history of signoffs to a scheduled change, eg: /scheduled_changes/rules/:sc_id/signoffs/revisions
  2. Integrate with regular scheduled change history (eg: /scheduled_changes/rules/:sc_id/revisions).

The former would be simpler to implement, but the latter is more consistent with the existing scheduled changes api, where we've been treating each scheduled change as one object, despite the fact that they are stored across 3 tables (scheduled_changes, conditions, and signoffs). Going this route may mean we need to increase data_version in each of these tables whenever something from one of them changes.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1340170)

improve or kill change notification

Currently we have an extremely simplistic change notification system. For certain tables, we send e-mail to a mailing list whenever a change to them is made. This has turned out to be extremely spammy, and likely goes unread most of the time.

I think a better change notification system has at least these two requirements:

  • Ability to subscribe to some types of changes but not others.
    -- Product and Channel seem like the most obvious filters we'd want. Possibly being able to filter on object type (eg: Rule, Release, etc.) would be useful too.
  • Self serve subscriptions

There are probably other considerations that I haven't thought of.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1337892)

We should also consider just killing the system, as nobody looks at the notifications.

enforce access to permissions and roles at the database layer

In #218, we added a new endpoint that allows someone to query for the permissions and roles of a named user. Nick correctly pointed out that we should restrict this to admins, and those users who are able to manipulate permissions. I implemented this for the new endpoint as part of that PR, but we should move this enforcement down to the database level to make sure that it is obeyed by all endpoints.

We'll need to modify the interface of AUSTable.select() to do this, because it requires knowing the current user. We already pass this as "changed_by" for insert/update/delete, so we should probably add a similar argument to select().
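The proposed interface change could look like the following. The class, the `admins` concept, and the error type are all hypothetical stand-ins for illustration; only the `changed_by` argument mirrors the description above:

```python
# Sketch of select() growing a changed_by argument (matching insert/update/
# delete) so permission enforcement lives at the database layer. The class
# and its admin check are illustrative, not Balrog's real table code.

class PermissionDeniedError(Exception):
    pass

class PermissionsTable:
    """Stand-in for the real table; `admins` lists privileged usernames."""

    def __init__(self, rows, admins):
        self.rows = rows
        self.admins = set(admins)

    def select(self, changed_by, where=None):
        if changed_by not in self.admins:
            raise PermissionDeniedError(
                "%s is not allowed to read permissions" % changed_by)
        if where is None:
            return list(self.rows)
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in where.items())]
```

Pushing the check down like this means every endpoint that reads the table gets the enforcement for free.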

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1340167)

show diff against base table when sending e-mail alerts for scheduled changes

Currently we send mail such as:
Row to be inserted:

{'base_alias': 'firefox-release',
 'base_backgroundRate': 100,
 'base_buildID': None,
 'base_buildTarget': None,
 'base_channel': 'release',
 'base_comment': 'default release rule updated by buildbot, DO NOT DELETE',
 'base_data_version': 147,
 'base_distVersion': None,
 'base_distribution': None,
 'base_fallbackMapping': None,
 'base_headerArchitecture': None,
 'base_locale': None,
 'base_mapping': 'Firefox-54.0-build3-whatsnew',
 'base_osVersion': None,
 'base_priority': 90,
 'base_product': 'Firefox',
 'base_rule_id': 145,
 'base_systemCapabilities': None,
 'base_update_type': 'minor',
 'base_version': None,
 'change_type': 'update',
 'csrf_token': '1498007702##fe6927e9a30927fb3b0baebef3ea7f86bf04e314',
 'data_version': 1,
 'scheduled_by': '[email protected]'}

Unfortunately, scheduled changes are not terribly useful without context. The most useful thing to know is "what is this scheduled change going to change". In the above case, backgroundRate and fallbackMapping were different vs. the base, so that's what should've been highlighted.
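Computing that highlight is mostly a matter of diffing the `base_*` columns against the current row. A sketch of the idea (illustrative, not the mailer's actual code):

```python
# Sketch of the requested e-mail improvement: report only the fields whose
# base_* value differs from the current row, so the notification highlights
# what the scheduled change will actually change. Illustrative only.

def scheduled_change_diff(sc_row, current_row):
    """Map field name -> (current value, scheduled value) for changed fields."""
    diff = {}
    for key, new_value in sc_row.items():
        if not key.startswith("base_"):
            continue  # skip scheduled-change bookkeeping columns
        field = key[len("base_"):]
        old_value = current_row.get(field)
        if old_value != new_value:
            diff[field] = (old_value, new_value)
    return diff
```

For the example above, the mail would then call out only backgroundRate and fallbackMapping instead of dumping the whole row.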

Can be closed if we end up killing them (see #1071).

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1375010)

log data about requests

We've long wanted to log information about how each request to Balrog is served (see https://bugzilla.mozilla.org/show_bug.cgi?id=758373). At the very least, which version we served for the request, and whether the request was served the primary mapping or the fallback mapping.

At this point in time, the most sensible place to put it would be BigQuery.

It's worth noting that even if/when we start doing this in the Balrog app, we still have both an nginx and a Cloudfront cache in front of it, so the data won't actually contain information about all updates served, just those that make it to the app.

publish blobs as a separate package?

Balrog blobs have two main jobs:

  1. They store and verify the schema of Releases.
  2. They create responses based on the Release data.

#2 is something that generally only the server cares about, but #1 is something that would be extremely useful to clients as well as the server.

I think it would be a good idea to investigate if it would be possible to publish auslib.blobs as a separate package that the server and clients could depend on. This might be trickier now that we have multifile updates...we may need to look at refactoring the code such that #1 and #2 are isolated. It might not even be viable, but I think it's worth looking into further.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1312868)

consider not saving history for nightly releases

We've been having some issues dealing with nightly release history cleanup (details of which can be found in bug 1283492). We were talking about them today and it got me thinking: rather than keep nightly history for such a short period of time (7-14 days), maybe we can just not keep history for nightly releases in the first place. We'd still need to do cleanup of the releases table to remove old nightlies, but that is quick and easy in comparison.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1294493)

improve fileUrls to eliminate duplication

fileUrls already supports a special "*" channel to eliminate the need to list our main release + cdn test channel separately, but in cases where we have multiple sets of channels that are the same we have to duplicate one set. Eg, for RCs we have:

  "fileUrls": {
    "beta": {
      "partials": {
        "Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0build2-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
      },
      "completes": {
        "*": "http://download.mozilla.org/?product=firefox-34.0build2-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
      }
    },
    "*": {
      "partials": {
        "Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
      },
      "completes": {
        "*": "http://download.mozilla.org/?product=firefox-34.0-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
      }
    },
    "beta-cdntest": {
      "partials": {
        "Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0build2-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
      },
      "completes": {
        "*": "http://download.mozilla.org/?product=firefox-34.0build2-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
      }
    },
    "release-localtest": {
      "partials": {
        "Firefox-33.1-build3": "http://dev-stage01.srv.releng.scl3.mozilla.com/pub/mozilla.org/firefox/candidates/34.0-candidates/build2/update/%OS_FTP%/%LOCALE%/firefox-33.1-34.0.partial.mar"
      },
      "completes": {
        "*": "http://dev-stage01.srv.releng.scl3.mozilla.com/pub/mozilla.org/firefox/candidates/34.0-candidates/build2/update/%OS_FTP%/%LOCALE%/firefox-34.0.complete.mar"
      }
    }
  },

"*" handles release-cdntest and release. But we also have beta-cdntest and beta. Those two channels serve exactly the same content, but are different from their release counterparts. There should be a way to combine these together into a single entry.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1122557)

kill -latest blobs?

The -latest blobs that we currently use in Balrog are very long lived, and get continually updated with the latest nightly build information. Every other type of release in Balrog (dated nightly blobs, beta/release blobs, CDM blobs, etc.) just contain information about one "set" of things, and we create new release blobs whenever we generate a new set of things, so -latest blobs are a bit strange in comparison.

We're starting to bump into areas where this makes things harder. For example, when Varun implemented merge logic in https://bugzilla.mozilla.org/show_bug.cgi?id=1223872, we considered making conflicts between partial+complete lists mergeable, but couldn't because -latest blobs need to fully overwrite them at times. Other blob types are append-only in these sections.

I think it would be worthwhile trying to find an alternative to -latest blobs that would still allow us to get nightly updates out in a timely fashion. Some random ideas:

  • Update mapping after all nightly updates are done
    ** This might not let us get things out in a timely fashion, or maybe at all if even one repack fails.
  • Use the fallback mappings from https://bugzilla.mozilla.org/show_bug.cgi?id=1282891, and set mapping to the latest dated blob, and fallback mapping to the n-1 dated blob.
    ** This might cause issues if we have a locale failing multiple nights in a row (they wouldn't have entries in either mapping).

It's possible that -latest is already the best all around solution, so this might end up being WONTFIX.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1286842)

Finish Guardian -> FirefoxVPN rename

In #1042 we made a small change to allow FirefoxVPN to be used in all the ways that Guardian was. The latter name is now deprecated; we should remove support for it and use FirefoxVPN in its place.

/users/<username> sometimes hits auth0 rate limit errors

Presumably this is happening because of the call we make to auth0's /userinfo endpoint, which is rate limited to 5 requests per minute with bursts of up to 10 requests per user id (from https://auth0.com/docs/policies/rate-limits#authentication-api).

We do cache the results of these calls, but we make one request per username at roughly the same time when /users is loaded, and we have multiple admin webheads, so it can take a while for all the webheads to have cached results for all of the users.

Off the top of my head, the only way I can think to fix this is to cache the results of the /userinfo queries somewhere persistent that can be shared between webheads. Right now, the only thing we have that persists is the mysql database, but we've talked about adding memcache at some point.

There may also be a more clever fix that I haven't considered.

don't make any change to database if row hasn't actually changed

After we enabled the change notifier we found at least one case where automation repeatedly makes the same change to the database many times. This is silly, and unnecessarily invalidates caches. We should try to make clients smarter about this where we can, but we can also do better on the backend and simply check if things will change prior to making a write.

We should watch out for potential performance penalties when it comes to Releases. Single-locale updates already need to retrieve the current version of the blob before changing it, but updates that intend to replace the full contents of a Release blob may not currently retrieve the entire blob first.
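The backend half of the fix is a compare-before-write guard. A minimal sketch (the table wrapper here is an in-memory stand-in, not Balrog's db layer):

```python
# Minimal sketch of the backend check described above: compare the proposed
# columns against the current row and skip the write (and the history entry
# and cache invalidation that come with it) when nothing would change.

class DictTable:
    """In-memory stand-in for a database table, counting real writes."""

    def __init__(self):
        self._rows = {}
        self.writes = 0

    def get(self, name):
        return self._rows.get(name)

    def put(self, name, row):
        self._rows[name] = row
        self.writes += 1

def update_if_changed(table, name, new_columns):
    """Return True if a write happened, False for a no-op update."""
    current = table.get(name)
    if current is not None and all(
        current.get(k) == v for k, v in new_columns.items()
    ):
        return False  # identical update; don't touch the database
    merged = dict(current or {})
    merged.update(new_columns)
    table.put(name, merged)
    return True
```

For Releases, the cost concern above is that `table.get(name)` may itself be an expensive full-blob read that full-replacement updates don't currently pay.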

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1313631)

test suite to validate the state of Balrog rules

Rail and Catlee have both mentioned recently that it would be useful to have a test suite that can validate that the current state of a set of Balrog rules returns all the right things for all the right inputs. This would be helpful both to give us reassurance when making changes to the Rules, and also if we ever need to rebuild them from scratch (eg: if we somehow lose the database).

A few other random thoughts:

  • Ideally, this would test a wide range of old versions of Firefox and other products. which means the test suite would need to be aware of watersheds.
  • Rather than making requests and comparing XML output, it may be better to simply compare the name of the Release that the XML would be generated from. This avoids a lot of issues related to ordering of lines in the output and other differences that don't change behaviour. We'd still need to go through AUS.evaluateRules() to have all the rule matching logic run. Not sure how this would fit with multifile updates yet, as that logic is very closely tied to XML generation. Maybe return the name of the superblob and all response blobs?

--

Aki brought up the idea of "test driven development" for Balrog Rules recently. Boiled down, fixing this bug is the primary piece of work to make that possible.

(In reply to Ben Hearsum (:bhearsum) from comment #0)

  • Rather than making requests and comparing XML output, it may be better to
    simply compare the name of the Release that the XML would be generated from.
    This avoids a lot of issues related to ordering of lines in the output and
    other differences that don't change behaviour. We'd still need to go through
    AUS.evaluateRules() to have all the rule matching logic run. Not sure how
    this would fit with multifile updates yet, as that logic is very closely
    tied to XML generation. Maybe return the name of the superblob and all
    response blobs?

I'm going to backtrack on this now that I'm seeing it with fresh eyes - if we really care about validating Balrog's state, we really need to run through the entire rule matching + XML generation logic. There are too many things we could miss if we only look at mappings.

Given that, I think the test suite simply ends up being "given a list of update URLs and expected results, do the URLs return what is expected?" Where things get tricky is how we create that list. We obviously can't have humans writing thousands of update URLs, so we need a config that we generate them from. Our requirements for that are:

  • Reasonably easy for humans to manipulate
  • Be able to define expected foreground and background check responses
  • Be able to handle at least Firefox, GMP, and System Addon updates
  • The ability to handle exception cases (for example: XP and Vista users receiving an ESR release while other versions of Windows receive a vanilla release)
  • Be as concise as possible

Other requirements:

  • Must be reasonably fast to run

Open questions:

  • Do we need to test all locales? We could significantly cut down the number of test cases by testing a handful instead of all.
  • Do we need to test all old versions? Trimming out versions between watersheds would also reduce the number of test cases.
  • Will this run continually, or just when we're getting ready to make changes?

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1320373)

require fallbackMapping when backgroundRate is <100?

Now that it exists, fallbackMapping is something we almost always use when running a throttled rollout. I can't think of any case where we'd want users to get no updates instead of the previously released version during these times. We should consider enforcing this in the backend, and rejecting any request to change backgroundRate to <100 unless fallbackMapping is set, or is being set as part of the change.

We'll need to ensure that these would actually work for all different types of updates (GMP, SystemAddons, Firefox) before deciding to move forward.
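The validation itself is simple once the current rule and the incoming change are merged. A sketch with hypothetical names (not Balrog's actual validation hooks):

```python
# Sketch of the proposed backend validation: reject a rule change that
# throttles backgroundRate below 100 without a fallbackMapping either already
# present or arriving in the same change. Names are illustrative.

class ValidationError(Exception):
    pass

def validate_throttle(current_rule, changes):
    """Merge the change into the rule and enforce the fallbackMapping rule."""
    merged = dict(current_rule)
    merged.update(changes)
    if merged.get("backgroundRate", 100) < 100 and not merged.get("fallbackMapping"):
        raise ValidationError("backgroundRate < 100 requires a fallbackMapping")
    return merged
```

Checking the merged row (rather than the change alone) is what allows fallbackMapping to arrive in the same request as the throttle.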

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1391013)

Implement a "service" layer to manipulate table objects

Discussed with :bhearsum while implementing [1]. In a few words, the service layer represents the whole business logic of a process by dividing the business process into smaller ones; the smallest ones are the table objects.
The idea of the back-end architecture would be the following: the web layer takes data out of the HTTP request and passes it to the service. The service is then in charge of delegating to all the tables, and making sure data is consistent across tables.

Here are some call diagram examples:

  • Add a new scheduled change [2]
  • Update a condition [3]
  • Update a scheduled target value [4]

[1] #151
[2] https://www.websequencediagrams.com/?lz=dGl0bGUgQWRkIGEgbmV3IHNjaGVkdWxlZCBjaGFuZ2UKClMACghSdWxlc0FQSVZpZXctPgAJDlNlcnZpY2U6IGFkZF9uZXcocnVsZV9pZCwgcnVsZXNfdmFsdWVzLCBjb25kaXRpb25zLCBhdXRob3IpAFIPAEMHAFEQVGFibGU6IGluc2VydABZBQBLCgBBCG5vdGUgcmlnaHQgb2YgACsVYWxzbyBzYXZlcyB0aW1lc3RhbXAAgVMPAGUFAIFHGQCCFwlfAIFbBwCCERAAgSoWQwCBbwkAgTsOAD4RAIIXDCk&s=rose
[3] https://www.websequencediagrams.com/?lz=dGl0bGUgVXBkYXRlIGNvbmRpdGlvbnMKClNjaGVkdWxlZFJ1bGVzQVBJVmlldy0-AAkOU2VydmljZTogdQA8BV8ANwoocwA4CF9ydWxlX2lkLABUCywgYXV0aG9yKQBYDwBJBwBYD0MAgQ4JVGFibABlCQBFHgA1JgBEDl9sYXN0XwCBFQYAgS0UAIEtCA&s=rose
[4] https://www.websequencediagrams.com/?lz=dGl0bGUgVXBkYXRlIHNjaGVkdWxlZCB0YXJnZXQgdmFsdWVzCgpTABEIUnVsZXNBUElWaWV3LT4ACQ5TZXJ2aWNlOiB1AEkFKABFCV9ydWxlX2lkLCBydWxlc18AUAYsIGF1dGhvcikATw8AQAcAThBUYWJsACkz&s=rose

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1313742, original comment by @JohanLorenzo)

don't require signoff from everyone for product-less permissions

One of the rough edges to the new Multiple Signoffs system is that product-less permissions (eg: full admins) end up requiring signoff from all groups that are listed in any permissions required signoff.

For example, if we have the following Permissions Required Signoffs:

  • Firefox Permissions, 1 releng, 1 relman
  • SystemAddons Permissions, 1 releng, 1 relman, 1 gofaster
  • Thunderbird Permissions, 1 releng, 1 tbird

...then adding a full fledged admin requires signoff from 1 releng, 1 relman, 1 gofaster, and 1 tbird.

I can think of a few ways to improve this, but each has drawbacks:

  • Ignore signoffs for permissions that don't specify a product (lets us add full-fledged admins with no oversight).
  • If product isn't specified, look explicitly at only Firefox permission signoffs, because those are likely to be a good set of signoffs to require (probably ends up needing relman signoff for things they don't care about; kind of hacky).
  • If product isn't specified, use a sentinel value in its place. Eg: look for permissions required signoffs that apply to a "NOPRODUCT" product. This would give us the best control over this case, but it's still kind of hacky.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1343904)

put blob-specific submission code in balrog repo

Currently in the backend, we have Blob classes with jsonschemas that have two main functions:

  • ensure that blobs are valid before they go into the database
  • turn blobs into update responses (eg: the ... that Firefox requires)

Separately, we have a bunch of functions in https://github.com/mozilla/build-tools/blob/cc9e80196f7c67abaa9acf3a3e434f6554fd0977/lib/python/balrog/submitter/cli.py whose job it is to turn raw data into valid Blobs.

Because these pieces of code are disconnected we sometimes run into issues where cli.py generates invalid Blobs, which often ends up breaking nightly, or even release, updates.

Over in https://bugzilla.mozilla.org/show_bug.cgi?id=1303106#c9, Rail suggested that if these pieces of code were in the same place, we could run integration tests on them. We could also consider making the turn-raw-data-into-valid-blobs piece an additional function of the blob classes (but it doesn't have to be).

Doing this would probably necessitate doing bug 1312868 as well, so that api clients would have an easy way to get/run the new code.

Related to #1063

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1320949)

stop storing future dates in UTC

Nick pointed out this series of tweets (https://twitter.com/tef_ebooks/status/949350236392181760?s=03) which explains that it's not safe to store future dates in UTC, because local timezone offsets shift around with DST, while UTC does not. This means that if someone in North America schedules a change 2 days before a DST change, but we schedule it to take place 2 days AFTER the DST change (aka 4 days in the future), it will end up being off by an hour.

The thread recommends storing future dates as time + named timezone (that is, "US/Los_Angeles" - or something like that), because timezones ("PST") can change in the future too.

I think this has a couple of implications for Balrog:

  • We should absolutely stop storing scheduled change times as UTC. We should probably store as Mountain View time ("America/Los_Angeles"), since that is considered to be Mozilla Standard Time.
  • Scheduled Changes should be required to be submitted in this timezone. This will prevent issues if someone in a different timezone, which has different DST rules, submits a Scheduled Change near a California DST-change day. We should also only display this timezone when presenting Scheduled Changes for the same reason.

There may be other necessary changes, too.

I think history tables are unaffected by this (I don't think that timezones or DST can retroactively be changed), but we may want to consider storing history in the same timezone for consistency. History UI can almost certainly be presented as user-local time regardless of what we do on the backend.
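The drift is easy to demonstrate with Python's zoneinfo (3.9+): the same local wall-clock time maps to different UTC offsets on either side of a DST transition, so a future date normalised to UTC at scheduling time ends up an hour off. The specific dates below are just an example around the 2024-03-10 spring-forward:

```python
# Demonstration of why future dates can't be normalised to UTC: the same
# 09:00 wall-clock time in America/Los_Angeles has a different UTC offset
# before and after the spring-forward DST transition on 2024-03-10.
from datetime import datetime
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")

before = datetime(2024, 3, 8, 9, 0, tzinfo=LA)   # two days before DST change
after = datetime(2024, 3, 12, 9, 0, tzinfo=LA)   # two days after DST change

# PST is UTC-8; PDT is UTC-7 -- an hour of drift for a UTC-normalised date.
hours_before = before.utcoffset().total_seconds() / 3600
hours_after = after.utcoffset().total_seconds() / 3600
```

Storing the wall-clock time plus the zone name sidesteps this, because the offset is resolved when the time arrives, not when it is scheduled.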

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1431793)

For posterity, the aforementioned tweets:

psa: you can’t store future dates as UTC because local time offsets can and will change
and you can't store time zones as utc offsets because dst rules change too
if a user can pick the time, and it could be in the future, I'm sorry, you can't normalise to utc
what you want to store, isn’t timezone offset, or name, but location

or why tzinfo uses ‘US/Los_Angeles’ as the key
normalising future dates to utc or offset, or even storing ‘PST’ means ignoring dst or local timezone changes you need ‘US/Los_Angeles’

Means to "explode" priority values to create gaps.

This is something I've been musing on, and bhearsum told me to file a bug. (It may not belong in this exact component).

Background:

  • our rules are evaluated on a priority integer at present.
  • We have rules that can match multiple channels (or even multiple products)
  • When gaps close (e.g. adjacent rules at priorities 91 and 90) and you need to put a new rule between those two, it's often difficult: there is a need to edit potentially many rules, and to reason about which need to be changed without breaking rule order for other products/channels.

This tool (UI, or manual script, etc) should:

  • [Optionally?] take a specified channel or product
  • Read in the list of all rules that match.
  • Explode the rules' priorities, leaving order intact.

E.g. take the following arbitrary set:

#0 (Template):

#1 firefox:beta:96
#2 firefox:beta:95
#3 firefox:beta-cdntest:94
#4 firefox:beta*:93
#5 firefox:beta:92
#6 fennec:beta:92
#7 fennec:beta*:91
#8 firefox:beta:90
#9 <no_product>:beta:89
#10 fennec:release:88
#11 firefox:beta:70

running this script, with the channel/product set to firefox/beta would explode to be like so (unchanged omitted):

#1 firefox:beta:148
#2 firefox:beta:138 # XXX: Should we do 139 to preserve the non-identical prior even though no conflict
#3 firefox:beta-cdntest:138
#4 firefox:beta*:128
#5 firefox:beta:118
#6 fennec:beta:111
#7 fennec:beta*:110
#8 firefox:beta:108
#9 <no_product>:beta:98

Yielding a proper, rule-order-preserving mapping.

Note, since this was "firefox/beta" we still bumped priority for fennec in #6 and #7, and beta-cdntest for firefox in #3 because other matching rules got bumped, and we preserved the existing gap...

The logic of this can be tweaked from my proposal of course.
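As a rough illustration of the re-spacing step, here is a minimal sketch (the rule representation and gap size are assumptions, not Balrog's schema; this uniform-gap version is simpler than the gap-preserving behaviour described above):

```python
def explode_priorities(rules, gap=10):
    """Reassign priorities to an ordered list of rules, preserving their
    relative order while opening a fixed gap between neighbours.
    `rules` is highest-priority-first, as Balrog evaluates them."""
    n = len(rules)
    for i, rule in enumerate(rules):
        # The first (highest) rule gets the largest priority.
        rule["priority"] = (n - i) * gap
    return rules

rules = [
    {"name": "firefox:beta", "priority": 96},
    {"name": "firefox:beta-cdntest", "priority": 94},
    {"name": "firefox:beta*", "priority": 93},
]
explode_priorities(rules)
print([r["priority"] for r in rules])  # [30, 20, 10]
```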

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1301045)

requests can map to deleted releases for short periods of time

We had a new Traceback show up in Sentry recently that showed a request try to retrieve a Release that didn't exist:

IndexError: list index out of range
  File "flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "auslib/web/views/client.py", line 57, in get
    release, update_type = AUS.evaluateRules(query)
  File "auslib/AUS.py", line 99, in evaluateRules
    release = dbo.releases.getReleases(name=rule['mapping'], limit=1)[0]

In this case, the rule in question was the main Firefox release rule, and the cached copy of it still mapped to Firefox-50.1.0-build2-prod. At the time of the request, the rule in the database pointed at Firefox-50.1.0-build2, and Firefox-50.1.0-build2-prod no longer existed. After some digging with jlund I discovered that he had changed the mapping of that Rule and deleted Firefox-50.1.0-build2-prod in short succession. Because Rules are cached, we ended up with a short window in which requests were using the cached Rule (which pointed at Firefox-50.1.0-build2-prod), but didn't have that Release cached.

This is a pretty rare occurrence, but definitely possible to hit again. We only cache Rules for 30 seconds, so that's the maximum amount of time we could stay in this state.

There's no obvious easy fix for this. We can't prevent people from deleting Releases that are still pointed at by a cached Rule, because the admin app doesn't know anything about the caches on the public side.

One thing we might be able to try is to ensure that the mappings (aka Releases) of all cached Rules are always cached in the public app. This could be tricky though, and possibly cause a big performance penalty.
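For illustration, one defensive shape for the lookup (function and field names are hypothetical, not Balrog's actual API) would be to treat the cache-window miss as an empty update instead of letting the IndexError escape:

```python
def lookup_release(db, rule):
    """Fetch a rule's mapped release, tolerating the short window where a
    cached rule points at a release that was just deleted."""
    releases = db.get_releases(name=rule["mapping"], limit=1)
    if not releases:
        # The mapping vanished between cache refreshes; serve an empty
        # update instead of a 500, for at most the 30s rule-cache TTL.
        return None
    return releases[0]

class FakeDB:
    def get_releases(self, name, limit):
        return []  # simulate the deleted-release window

print(lookup_release(FakeDB(), {"mapping": "Firefox-50.1.0-build2-prod"}))  # None
```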

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1325605)

add XML comment to balrog responses when 500 error is caught

Catlee suggested that this would be a good way to make such errors debuggable after the fact, without the need to be able to reproduce them. It's also useful because it lets us distinguish between an actual empty update and an error.

I don't think we can or should put the full traceback in, but something that hints at the error (perhaps the top of the traceback stack) would be useful. We can probably use it to find more information via logs or newrelic.
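A sketch of what such a response could look like (hypothetical helper, not existing Balrog code); note that `--` is illegal inside XML comments and must be stripped from the hint:

```python
from xml.sax.saxutils import escape

def empty_update_xml(error_hint=None):
    """Build an empty <updates/> response, optionally carrying a short,
    sanitized hint about a caught server-side error as an XML comment."""
    lines = ['<?xml version="1.0"?>', "<updates>", "</updates>"]
    if error_hint:
        # "--" may not appear inside an XML comment; collapse it first.
        safe = escape(error_hint).replace("--", "-")
        lines.append("<!-- error: {} -->".format(safe))
    return "\n".join(lines)

print(empty_update_xml("KeyError: 'release-google'"))
```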

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1191320)

proposal: rules should be ordered by specifying which rule should come before it

Rather than setting a priority of a rule to determine where it sits in order, it would be neat if you could say which rule should come before it and have Balrog internally decide the priority based on that.

This would be similar to a linked list solution where you can insert based on appending at an indexed value and all nodes in the list would be re-ordered based on that.

Put another way, the priority wouldn't be exposed for mutation.

Motivation:

Often as releaseduty, we have to re-order the priorities of many rules to fit others in. This opens us up to silly human mistakes when scheduling many rule changes, and creates more churn than needed, since both releng and relman must sign off on what is effectively a no-op change.

This would require front-end work, too.
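A minimal sketch of the linked-list idea (field names are hypothetical): each rule records which rule precedes it, and the evaluation order is derived by walking the chain, so priorities become an internal detail:

```python
def order_from_links(rules):
    """Derive an evaluation order from predecessor links instead of explicit
    priorities. Each rule names the rule that precedes it via 'after'
    (None for the head of the list)."""
    by_after = {r["after"]: r for r in rules}
    ordered, cursor = [], None
    while cursor in by_after:
        rule = by_after[cursor]
        ordered.append(rule["id"])
        cursor = rule["id"]
    return ordered

rules = [
    {"id": "beta-main", "after": None},
    {"id": "beta-cdntest", "after": "beta-main"},
    {"id": "beta-localtest", "after": "beta-cdntest"},
]
print(order_from_links(rules))  # ['beta-main', 'beta-cdntest', 'beta-localtest']
```

Inserting a rule then only requires setting its `after` pointer and re-pointing one successor, rather than renumbering many priorities.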

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1426218 by @lundjordan)

Balrog should be able to generate accurate update URLs

One of the things we often struggle with is generating useful update URLs to test with. Balrog has enough data that it should be able to generate these URLs for us. Because the necessary information is stored in Release blobs, I think it would be best to integrate around them in some way. For example, there could be a button beside the Mapping field in the Rules UI called "Test URLs". When clicked, the user would be prompted for a small amount of information (see below), and then an update URL would be returned, or possibly opened in a new tab. Since we know the Release, we can pull most of the data we need for the update URL from it. We'll still need the user to choose an OS, locale, and possibly channel.

As an example, let's see how we could generate this URL: https://aus5.mozilla.org/update/6/Firefox/53.0.2/20170504105526/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml

We've got the following parts to deal with:

  • Domain - available in app config
  • URL version - Depends on Product Version. We'll probably need to hardcode a list of product version <-> update URL version mappings in the app.
  • Product - Available in the Release
  • Product Version - Available in the Release
  • Buildid - Available in the Release, depends on Locale.
  • Build Target - Available in the Release, depends on OS Version chosen.
  • Locale - User input
  • Channel - User input, but we can probably suggest the most likely one.
  • OS Version - User input
  • System Requirements - Probably just hardcode this to SSE3. Only present in URL version 6.
  • Distribution - Hardcode to "default". Only present in URL version 3 & 6.
  • Distribution Version - Hardcode to "default". Only present in URL version 3 & 6.
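Putting the parts together, a sketch of the URL assembly (helper and parameter names are assumptions, not existing Balrog code):

```python
from urllib.parse import quote

def build_update_url(domain, product, version, buildid, build_target,
                     locale, channel, os_version,
                     system_caps="SSE3", dist="default", dist_version="default"):
    """Assemble a version-6 update URL from its components. Path segments
    such as the OS version may contain spaces and parens, so quote them."""
    parts = ["update", "6", product, version, buildid, build_target,
             locale, channel, os_version, system_caps, dist, dist_version]
    return "https://{}/{}/update.xml".format(
        domain, "/".join(quote(p, safe="") for p in parts))

print(build_update_url(
    "aus5.mozilla.org", "Firefox", "53.0.2", "20170504105526",
    "WINNT_x86-msvc-x64", "en-US", "release", "Windows_NT 10.0.0.0 (x64)"))
```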

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1398202)

cope with builds that have "release-google" as hardcoded channel

Apparently we shipped something with channel set to release-google-cck-realnetworks at some point (full url is https://aus5.mozilla.org/update/2/Firefox/2.0.0.11/2007112718/WINNT_x86-msvc/ja/release-google-cck-realnetworks/Windows_NT%205.1/update.xml). These builds currently get exceptions when trying to update, eg:

KeyError: 'release-google'
  File "flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "connexion/decorators/decorator.py", line 66, in wrapper
    response = function(request)
  File "connexion/decorators/validation.py", line 293, in wrapper
    return function(request)
  File "connexion/decorators/produces.py", line 38, in wrapper
    response = function(request)
  File "connexion/decorators/response.py", line 85, in wrapper
    response = function(request)
  File "connexion/decorators/decorator.py", line 42, in wrapper
    response = function(request)
  File "connexion/decorators/parameter.py", line 195, in wrapper
    return function(**kwargs)
  File "auslib/web/public/client.py", line 126, in get_update_blob
    app.config["SPECIAL_FORCE_HOSTS"]))
  File "auslib/blobs/apprelease.py", line 166, in getInnerXML
    patches = self._getPatchesXML(localeData, updateQuery, whitelistedDomains, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 279, in _getPatchesXML
    xml = self._getSpecificPatchXML(patchKey, patchKey, patch, updateQuery, whitelistedDomains, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 97, in _getSpecificPatchXML
    url = self._getUrl(updateQuery, patchKey, patch, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 248, in _getUrl
    url = self['fileUrls'][getFallbackChannel(updateQuery['channel'])]

Probably the simplest thing to do is to add "release-google" to whichever blob is serving updates for them.
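For illustration, assuming the fallback logic strips `-cck-*` partner suffixes (as the `getFallbackChannel` frame in the traceback suggests), a sketch of the failure and the blob-side fix (the `fileUrls` fragment is hypothetical):

```python
def get_fallback_channel(channel):
    # Mirrors the assumed fallback behaviour: strip the "-cck-*" partner suffix.
    return channel.split("-cck-")[0]

# Hypothetical fragment of a release blob's fileUrls section.
file_urls = {
    "release": "http://download.mozilla.org/?product=firefox-latest",
}

channel = "release-google-cck-realnetworks"
fallback = get_fallback_channel(channel)
print(fallback)               # release-google
print(fallback in file_urls)  # False -> the KeyError above

# The fix: alias the hardcoded channel in the serving blob.
file_urls["release-google"] = file_urls["release"]
print(fallback in file_urls)  # True
```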

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1379281)

restructure Balrog

Balrog's design has evolved a bit over time, but some rough edges have crept in, particularly around having two applications share the same library. For example, it's very difficult to have global objects (such as a database), because the two applications live in different places (and for a while, had their own database objects). We can work around that with hacks like in https://github.com/mozilla/balrog/blob/master/auslib/__init__.py, but it's not ideal.

I'm also finding that making caching only happen on the non-admin application as part of bug 671488 to be more complicated because of this.

In any case, it seems like we should be looking towards some sort of structure that allows the common parts of Balrog (db.py, blobs/, maybe AUS.py) to be in an importable library, and the app-specific parts to live in their own place. We'll have to consider what this means for deployment (particularly in cases where we need a synced deployment for admin+non-admin), and there might be better options than this too.

This is low on the priority list given the feature work on the horizon, but it would be nice to do for future maintainability.

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1109295)

don't allow rules to have a mapping to a release with a non-matching product

While thinking about bug 1309656, I realized that it's currently possible to do something very silly and point a Firefox update rule at a non-Firefox Release. Even in a multifile update world I can't see a scenario where this would be desired, and there may be some potential for privilege escalation or creating confusion (eg: create a release with product=Thunderbird, name=Firefox-$version-we're-about-to-ship) that might lead to the wrong thing being served.

This probably needs a bit more thought about whether or not it's a good idea.
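A sketch of what the validation could look like (hypothetical helper, not existing Balrog code):

```python
class ValidationError(Exception):
    pass

def check_mapping_product(rule_product, release_product):
    """Reject a rule whose mapping points at a release for a different
    product. A product-less rule (which matches every product) is exempt,
    since it can legitimately map across products."""
    if rule_product is None:
        return
    if rule_product != release_product:
        raise ValidationError(
            "rule product %r does not match mapped release product %r"
            % (rule_product, release_product))

check_mapping_product("Firefox", "Firefox")  # ok
try:
    check_mapping_product("Firefox", "Thunderbird")
except ValidationError as e:
    print(e)
```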

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1309877)

active data extraction script should ignore releases that are only mapped to by completed scheduled changes

I noticed today that we have a whole bunch of recent releases in the latest production db dump - most of which are not currently mapped to. I think this is because we grab any releases that are mapped to by any scheduled rule change (https://github.com/mozilla/balrog/blob/752d1d548840f1753a8592af405846c6612f7f3c/scripts/manage-db.py#L137), instead of only incomplete scheduled rule changes.
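A sketch of the intended filter (the `complete` flag and `base_mapping` field are assumptions about the scheduled-change rows):

```python
def releases_to_keep(scheduled_changes):
    """Collect mapping names only from scheduled rule changes that have not
    yet been enacted, so releases referenced solely by completed changes
    are not dragged into the data dump."""
    return {
        sc["base_mapping"]
        for sc in scheduled_changes
        if not sc["complete"] and sc.get("base_mapping")
    }

changes = [
    {"base_mapping": "Firefox-99.0-build1", "complete": True},
    {"base_mapping": "Firefox-100.0-build1", "complete": False},
]
print(releases_to_keep(changes))  # {'Firefox-100.0-build1'}
```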

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1377460)

keep track of last time an account was used in Balrog

While talking about ways to make account rotation easier, Catlee suggested that we should keep track of the last time an account is used in Balrog, which would avoid the need to grovel through logs like we're doing now.

I originally thought we might be able to query this from the existing history tables, but I've since realized that we'd want to include GETs here as well, so we'll probably need something different for this. Maybe just a table that is updated whenever a request is made to the admin interface, and updates the timestamp for the user?

We'd also need an API + some UI for it.
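A sketch of the idea (in-memory stand-in; a real version would UPSERT a hypothetical `user_last_seen` table from a request hook that fires on every admin request, GETs included):

```python
import time

# Stand-in for a `user_last_seen` table keyed on username.
last_seen = {}

def record_request(username, now=None):
    """Update the user's last-activity timestamp on every admin request,
    including read-only GETs that never touch the history tables."""
    last_seen[username] = now if now is not None else int(time.time())

record_request("bhearsum", now=1700000000)
record_request("jlund", now=1700000100)
print(last_seen["bhearsum"])  # 1700000000
```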

(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1372250)
