mozilla-releng / balrog
Mozilla's Update Server
Home Page: http://mozilla-balrog.readthedocs.io/en/latest/index.html
License: Mozilla Public License 2.0
We've got jsonschemas that can validate all of the different types of Release blobs. It would be great if we could use those as part of the swagger specs, because it would allow us to generate clients that could do client-side blob validation, which can make clients much friendlier.
Swagger only supports a subset of jsonschema, so it may not be possible for all of the current blob types.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1381514)
Submitting data to Balrog Release blobs has become increasingly problematic over the years. Much of this boils down to the fact that we hit tons of data races when trying to update a Release: the API requires that clients submit the data version they're basing their update on, and the database layer rejects the update if the current data version does not match the given one. This has been aggravated in recent times by the addition of more balrogworker instances.
Whatever solution we end up choosing should get us to a place where a submission with valid data works 99.99% of the time. We do not want a solution that still leaves us subject to data races or other scaling issues.
This work is likely to involve the API (I'd like to take this opportunity to get rid of or fix up https://github.com/mozilla-releng/balrog/blob/master/src/auslib/web/admin/views/releases.py, which is really, really ugly and hacky), balrogscript, and possibly the database layer.
bug 1246993 describes a long standing bug where bad entries were made to history tables while making updates. If the tests for this code had verified the history table entries (in addition to the primary tables), this bug would never have been introduced.
We should make a point of always verifying the history table entries whenever we write tests that modify the database. This bug is to track updating all of the existing tests to do so.
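A shared test helper could assert on both tables after every write. The layout assumed here (each history row carrying an extra `change_id`/`changed_by` pair on top of the primary columns) mirrors Balrog's history convention, but the exact interface is a sketch:

```python
# Sketch of a test helper that checks a history table stays in sync
# with its primary table after a modification. Column names
# (change_id, changed_by, data_version) are illustrative.

def verify_history(primary_rows, history_rows):
    """Assert that the newest history entry mirrors a primary row.

    primary_rows: list of dicts, current table contents.
    history_rows: list of dicts, ordered oldest-to-newest, with extra
                  'change_id' and 'changed_by' columns.
    """
    assert history_rows, "every write must create a history entry"
    latest = dict(history_rows[-1])
    latest.pop("change_id", None)
    latest.pop("changed_by", None)
    assert latest in primary_rows, "history out of sync with primary table"
    return True
```

Calling this at the end of every test that writes to the database would have caught bug 1246993 when it was introduced.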
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1405337)
Right now, if an update query matches multiple rules with the same priority, it's not possible to guarantee that a certain one is chosen. This is undefined behaviour and can lead to lots of confusion.
We should do something to make this less of a footgun. In the past we've talked about possibly choosing the "most matching" rule (that is, the one with the most specificity; eg: one that requires build_target+channel is more matching than one that requires just channel).
Another idea could be to just disallow rules with the same priority. Or maybe disallow rules with the same priority when product is the same.
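If we went the "most matching" route, the tiebreaker might look something like this (field names mirror the rules table columns; the scoring itself is purely a sketch):

```python
# Hypothetical "most matching" tiebreaker: among rules with equal
# priority, prefer the one that constrains the most fields. Which
# fields count, and whether they should be weighted, is an open
# question -- this treats them all equally.

MATCHABLE_FIELDS = ("product", "channel", "buildTarget", "locale", "osVersion")

def specificity(rule):
    """Count how many matchable fields the rule actually constrains."""
    return sum(1 for f in MATCHABLE_FIELDS if rule.get(f) is not None)

def pick_rule(matching_rules):
    """Among equal-priority matches, pick the most specific one."""
    return max(matching_rules, key=specificity)
```

Even with this, two rules could tie on specificity as well, so it reduces rather than eliminates the undefined behaviour.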
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1283568)
We recently discovered an issue where we had cleaned up some stuff on S3 that was thought to be unused, and it broke updates for a significant number of users.
One thing that would help here is to regularly check all of the mapped-to Releases, and make sure that all of the URLs they point to return 200s. I wrote a hacky hacky script to do this as a one off: https://github.com/mozilla/balrog/compare/master...mozbhearsum:find-bad-mars?expand=1
find-active-mar-urls2.py finds everything that is pointed at and outputs it to a JSON file. check-urls.py goes through that JSON file and does a HEAD request on each URL.
This needs to be polished and probably enhanced before we can run it in automation or anything.
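A cleaned-up core of the check-urls.py idea might look like this, using only the standard library (the real script's behaviour may differ):

```python
# Issue a HEAD request for every URL and report any that don't return
# 200. The status function is injectable so the filtering logic can
# be tested without the network.
import urllib.error
import urllib.request

def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or an error string."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError as e:
        return str(e.reason)

def find_bad_urls(urls, status=head_status):
    """Return (url, status) pairs for everything that isn't a 200."""
    return [(u, s) for u, s in ((u, status(u)) for u in urls) if s != 200]
```

Running this in automation would still need retry logic and alerting on top.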
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1337148)
The current behaviour of Scheduled Changes that use "when" is:
This means that an already scheduled change may end up being scheduled "in the past" if signoffs do not arrive in time. It's confusing and inconsistent that you can't schedule a change in the past, but it can drift there.
A couple of ideas about how to fix this:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1392720)
We have callbacks that are meant to fire when changes are made to the database. Right now, they fire before the transaction is completed (eg: https://github.com/mozilla/balrog/blob/9f8de88056be59332faa9b79ba2517ad2b0caffa/auslib/db.py#L345), which means the callbacks may send e-mail or other notifications about changes that ultimately fail to commit.
I suspect the reason they ended up here is because we pass the query to them, so that interface may need to change.
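One possible shape of the fix, sketched with illustrative names: queue the callbacks on the transaction and only run them once the commit succeeds. Since we currently pass the query to the callbacks, the real interface change would be more involved than this.

```python
# Defer notification callbacks until after commit: collect
# (callback, args) pairs during the transaction and fire them only if
# the commit goes through. Rolled-back transactions drop their queue.

class Transaction:
    def __init__(self):
        self._post_commit = []
        self.committed = False

    def on_commit(self, callback, *args):
        """Queue a callback to run only if the transaction commits."""
        self._post_commit.append((callback, args))

    def commit(self):
        self.committed = True  # real code would commit to the DB here
        for callback, args in self._post_commit:
            callback(*args)

    def rollback(self):
        self._post_commit.clear()
```

SQLAlchemy-level hooks could achieve the same thing, but the queue makes the ordering guarantee explicit.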
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1332412)
It looks like these haven't been updated to reflect a few things:
We've recently started using OpenAPI-style specs to define both our admin and public APIs. We use Connexion to load and create routes for them in Flask. It turns out that Connexion has some extensions that make it possible for us to have Connexion-compatible specs that aren't OpenAPI 2.0 compliant. The consequence of this is that we cannot make use of other OpenAPI 2.0 tools, such as https://github.com/swagger-api/swagger-codegen.
We're still working on becoming OpenAPI 2.0 compliant, but we should add tests to make sure it doesn't regress once we get there.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1387063)
Need to wait for #1084 first.
@catlee suggested a way to test out rule changes before applying them. You'd have something that lets you modify one or more rules before they affect actual requests, and check them against a set of criteria.
I interpreted the criteria as specifying a release, with the option to restrict to a particular build target/locale/OS version/etc to simulate a query. That could save on the nitty gritty of build target strings and buildIDs in the URL, but we could leave the general case there too. Then specify which mapping should be used to serve the update, and verify that's what the rules deliver.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1141801, original comment from @nthomas-mozilla)
This idea came up today and shortly afterwards I was helping debug why nothing was being returned for a particular update URL. It turned out that a different rule was being matched than we thought. Had we had the release name somewhere in the response (like a header!), this would've been much more obvious.
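As a sketch, the public app could attach something like the following to each response (these header names are made up, not anything Balrog currently sends):

```python
# Hypothetical debug headers for update responses: which rule matched
# and which Release the XML was generated from. Header names are
# illustrative.

def debug_headers(rule_id, release_name):
    """Build headers to attach to an update response for debugging."""
    headers = {}
    if rule_id is not None:
        headers["X-Balrog-Rule-ID"] = str(rule_id)
    if release_name is not None:
        headers["X-Balrog-Release-Name"] = release_name
    return headers
```

With that in place, the debugging session described above would have been a single curl -I away.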
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1425594)
We started signing responses for FirefoxVPN, which has given us a model for how to do it for other product types. I won't paste it all here, but https://bugzilla.mozilla.org/show_bug.cgi?id=1304782 has a lot of additional discussion on this.
We discovered this while testing https://bugzilla.mozilla.org/show_bug.cgi?id=1310226. Some of the scheduled changes used in testing would generate errors, because signoffs hadn't been given. When looking at sc_id 2, an error was generated (in that case, because signoffs weren't done), and that caused sc_id 3 to never be processed. We probably need to enhance the error handling in https://github.com/mozilla/balrog/blob/dc79a6b06ae1f38fd2d3fb20d8df20e5a7481d35/agent/balrogagent/cmd.py to continue in the two innermost loops if any exceptions are hit.
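The proposed error handling could look roughly like this (the loop structure and function names are illustrative, not the agent's actual code):

```python
# Keep iterating over scheduled changes even when one of them raises,
# so one failing signoff check can't block later sc_ids.
import logging

def process_scheduled_changes(endpoints, fetch, enact):
    """fetch(endpoint) -> list of scheduled changes; enact(sc) applies one."""
    for endpoint in endpoints:
        try:
            changes = fetch(endpoint)
        except Exception:
            logging.exception("couldn't fetch %s", endpoint)
            continue
        for sc in changes:
            try:
                enact(sc)
            except Exception:
                # eg: signoff requirements not met; move on to the next one
                logging.exception("couldn't enact sc_id %s", sc.get("sc_id"))
                continue
```

The bare `except Exception` is deliberate here: any single bad scheduled change should be logged and skipped rather than abort the whole run.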
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1342191)
This is to help situations where a scheduled rule is forgotten and requires a sign off.
I'm thinking it could start via a channel poke, then escalate via email to the people within the associated group that can sign off.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1426221, filed by @lundjordan)
We keep track of when signoffs to scheduled changes happen in Balrog's database, but we don't expose them anywhere in the API.
The current state of signoffs gets returned as part of the list of scheduled changes, eg: in a GET to /scheduled_changes/rules.
We could do this in two ways:
The former would be simpler to implement, but the latter is more consistent with the existing scheduled changes api, where we've been treating each scheduled change as one object, despite the fact that they are stored across 3 tables (scheduled_changes, conditions, and signoffs). Going this route may mean we need to increase data_version in each of these tables whenever something from one of them changes.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1340170)
Currently we have an extremely simplistic change notification system. For certain tables, we send e-mail to a mailing list whenever a change to them is made. This has turned out to be extremely spammy, and likely goes unread most of the time.
I think a better change notification system has at least these two requirements:
There's probably other considerations that I haven't thought of.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1337892)
We should also consider just killing the system, as nobody looks at the notifications.
While reviewing #145 I found an issue with one of the yml files it was changing. To my surprise, the tests still passed. It looks like this is because we don't have any tests for .validate() except for some apprelease blobs, and the whitelist blob. We should add tests for all the remaining blobs that use .validate().
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1310000)
In #218, we added a new endpoint that allows someone to query for the permissions and roles of a named user. Nick correctly pointed out that we should restrict this to admins, and those users who are able to manipulate permissions. I implemented this for the new endpoint as part of that PR, but we should move this enforcement down to the database level to make sure that it is obeyed by all endpoints.
We'll need to modify the interface of AUSTable.select() to do this, because it requires knowing the current user. We already pass this as "changed_by" for insert/update/delete, so we should probably add an arg like that to select().
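A rough sketch of what that interface change might look like (the method names follow the existing insert/update/delete convention, but the hook and exception are assumptions):

```python
# Sketch: AUSTable.select() grows a changed_by argument, and
# permission-sensitive tables override a hook that can reject the
# read at the database level, so every endpoint gets the check.

class PermissionDeniedError(Exception):
    pass

class AUSTable:
    def _assertReadAllowed(self, changed_by):
        """Default: anyone may read."""

    def select(self, where=None, changed_by=None):
        self._assertReadAllowed(changed_by)
        return self._query(where)

    def _query(self, where):
        return []  # stand-in for the real SQLAlchemy query

class PermissionsTable(AUSTable):
    def __init__(self, admins):
        self.admins = admins

    def _assertReadAllowed(self, changed_by):
        if changed_by not in self.admins:
            raise PermissionDeniedError(changed_by)
```

The real check would need to consult the permissions table itself rather than a static admin set, which raises a bootstrapping question the sketch glosses over.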
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1340167)
Steps to reproduce:
While working on bug-1355477, I realized that it is possible to download the current version of a release and upload it as an update without making any changes to the file.
Actual results:
The update was accepted as a scheduled change for a release.
Expected results:
The system should have rejected scheduling of updates with no changes.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1417355)
Currently we send mail such as:
Row to be inserted:
{'base_alias': 'firefox-release',
'base_backgroundRate': 100,
'base_buildID': None,
'base_buildTarget': None,
'base_channel': 'release',
'base_comment': 'default release rule updated by buildbot, DO NOT DELETE',
'base_data_version': 147,
'base_distVersion': None,
'base_distribution': None,
'base_fallbackMapping': None,
'base_headerArchitecture': None,
'base_locale': None,
'base_mapping': 'Firefox-54.0-build3-whatsnew',
'base_osVersion': None,
'base_priority': 90,
'base_product': 'Firefox',
'base_rule_id': 145,
'base_systemCapabilities': None,
'base_update_type': 'minor',
'base_version': None,
'change_type': 'update',
'csrf_token': '1498007702##fe6927e9a30927fb3b0baebef3ea7f86bf04e314',
'data_version': 1,
'scheduled_by': '[email protected]'}
Unfortunately, scheduled changes are not terribly useful without context. The most useful thing to know is "what is this scheduled change going to change". In the above case, backgroundRate and fallbackMapping were different vs. the base, so that's what should've been highlighted.
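A hypothetical helper that computes exactly that, using the `base_` prefix convention from the mail above:

```python
# Diff a scheduled change row against the rule it's based on and
# report only the columns that will actually change, so the mail can
# highlight e.g. backgroundRate and fallbackMapping instead of
# dumping the whole row.

def changed_fields(scheduled_row, base_rule):
    """Map each changed column name to its (old, new) pair."""
    diff = {}
    for key, new in scheduled_row.items():
        if not key.startswith("base_"):
            continue  # skip bookkeeping columns like change_type
        column = key[len("base_"):]
        old = base_rule.get(column)
        if old != new:
            diff[column] = (old, new)
    return diff
```

The mail template could then render just the diff, with the full row available behind a link for anyone who wants it.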
Can be closed if we end up killing them (see #1071).
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1375010)
Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1374638
We've long wanted to log information about how each request to Balrog is served (see https://bugzilla.mozilla.org/show_bug.cgi?id=758373). At the very least: which version we served for the request, and whether the request got served the primary mapping or the fallback mapping.
At this point in time, the most sensible place to put it would be BigQuery.
It's worth noting that even if/when we start doing this in the Balrog app, we still have both an nginx and a Cloudfront cache in front of it, so the data won't actually contain information about all updates served, just those that make it to the app.
Balrog blobs have two main jobs:
#2 is something that generally only the server cares about, but #1 is something that would be extremely useful to clients as well as the server.
I think it would be a good idea to investigate if it would be possible to publish auslib.blobs as a separate package that the server and clients could depend on. This might be trickier now that we have multifile updates...we may need to look at refactoring the code such that #1 and #2 are isolated. It might not even be viable, but I think it's worth looking into further.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1312868)
We've been having some issues dealing with nightly release history cleanup (details of which can be found in bug 1283492). We were talking about them today and it got me thinking: rather than keep nightly history for such a short period of time (7-14 days), maybe we can just not keep history for nightly releases in the first place. We'd still need to do cleanup of the releases table to remove old nightlies, but that is quick and easy in comparison.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1294493)
Not hugely important, but it can catch some bad practices and potential issues.
https://github.com/openstack/bandit
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1335428)
fileUrls already supports a special "*" channel to eliminate the need to list our main release + cdn test channel separately, but in cases where we have multiple sets of channels that are the same we have to duplicate one set. Eg, for RCs we have:
"fileUrls": {
"beta": {
"partials": {
"Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0build2-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
},
"completes": {
"*": "http://download.mozilla.org/?product=firefox-34.0build2-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
}
},
"*": {
"partials": {
"Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
},
"completes": {
"*": "http://download.mozilla.org/?product=firefox-34.0-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
}
},
"beta-cdntest": {
"partials": {
"Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0build2-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
},
"completes": {
"*": "http://download.mozilla.org/?product=firefox-34.0build2-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
}
},
"release-localtest": {
"partials": {
"Firefox-33.1-build3": "http://dev-stage01.srv.releng.scl3.mozilla.com/pub/mozilla.org/firefox/candidates/34.0-candidates/build2/update/%OS_FTP%/%LOCALE%/firefox-33.1-34.0.partial.mar"
},
"completes": {
"*": "http://dev-stage01.srv.releng.scl3.mozilla.com/pub/mozilla.org/firefox/candidates/34.0-candidates/build2/update/%OS_FTP%/%LOCALE%/firefox-34.0.complete.mar"
}
}
},
"*" handles release-cdntest and release. But we also have beta-cdntest and beta. Those two channels serve exactly the same content, but are different from their release counterparts. There should be a way to combine these together into a single entry.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1122557)
The -latest blobs that we currently use in Balrog are very long lived, and get continually updated with the latest nightly build information. Every other type of release in Balrog (dated nightly blobs, beta/release blobs, CDM blobs, etc.) just contain information about one "set" of things, and we create new release blobs whenever we generate a new set of things, so -latest blobs are a bit strange in comparison.
We're starting to bump into areas where this makes things harder. For example, when Varun implemented merge logic in https://bugzilla.mozilla.org/show_bug.cgi?id=1223872, we considered making conflicts between partial+complete lists mergeable, but couldn't because -latest blobs need to fully overwrite them at times. Other blob types are append-only in these sections.
I think it would be worthwhile trying to find an alternative to -latest blobs that would still allow us to get nightly updates out in a timely fashion. Some random ideas:
It's possible that -latest is already the best all around solution, so this might end up being WONTFIX.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1286842)
In #1042 we did a small change to allow FirefoxVPN to be used in all the ways that Guardian was. The latter name is now deprecated, and we should remove support for it and use FirefoxVPN in its place.
Presumably this is happening because of the call we make to auth0's /userinfo endpoint, which is rate limited to 5 requests per minute with bursts of up to 10 requests per user id (from https://auth0.com/docs/policies/rate-limits#authentication-api).
We do cache the results of these calls, but we make one request per username at roughly the same time when /users is loaded, and we have multiple admin webheads, so it can take a while for all the webheads to have cached results for all of the users.
Off the top of my head, the only way I can think of to fix this is to cache the results of the /userinfo queries somewhere persistent that can be shared between webheads. Right now, the only thing we have that persists is the MySQL database, but we've talked about adding memcache at some point.
There may also be a more clever fix that I haven't considered.
This should be a pretty simple change now that we're on .taskcluster.yml v1.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1363495)
After we enabled the change notifier we found at least one case where automation repeatedly makes the same change to the database many times. This is silly, and unnecessarily invalidates caches. We should try to make clients smarter about this where we can, but we can also do better on the backend and simply check if things will change prior to making a write.
We should watch out for potential performance penalties when it comes to Releases. Single locale updates already need to retrieve the current version of the blob before changing it, but updates that intend to replace a Release blob's full contents may not currently retrieve the entire blob first.
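A minimal sketch of the check-before-write idea, assuming we only compare the columns actually being written:

```python
# Compare the incoming values against the current row and skip the
# UPDATE entirely when nothing would change, so no-op writes stop
# creating history entries and invalidating caches.

def is_noop_update(current_row, new_values):
    """True if applying new_values would leave the row unchanged."""
    return all(current_row.get(col) == val for col, val in new_values.items())

def update_row(current_row, new_values, do_write):
    """Perform the write only if it changes something. Returns whether
    a write happened."""
    if is_noop_update(current_row, new_values):
        return False  # nothing to do; don't touch history or caches
    do_write(new_values)
    return True
```

For full-blob Release replacements, this is exactly where the extra read-before-write cost mentioned above would show up.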
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1313631)
Rail and Catlee have both mentioned recently that it would be useful to have a test suite that can validate that the current state of a set of Balrog rules returns all the right things for all the right inputs. This would be helpful both to give us reassurance when making changes to the Rules, and also if we ever need to rebuild them from scratch (eg: if we somehow lose the database).
A few other random thoughts:
--
Aki brought up the idea of "test driven development" for Balrog Rules recently. Boiled down, fixing this bug is the primary piece of work to make that possible.
(In reply to Ben Hearsum (:bhearsum) from comment #0)
- Rather than making requests and comparing XML output, it may be better to simply compare the name of the Release that the XML would be generated from. This avoids a lot of issues related to ordering of lines in the output and other differences that don't change behaviour. We'd still need to go through AUS.evaluateRules() to have all the rule matching logic run. Not sure how this would fit with multifile updates yet, as that logic is very closely tied to XML generation. Maybe return the name of the superblob and all response blobs?
I'm going to backtrack on this now that I'm seeing it with fresh eyes - if we really care about validating Balrog's state, we really need to run through the entire rule matching + XML generation logic. There's too many things we could miss if we only look at Mappings.
Given that, I think the test suite simply ends up being "given a list of update URLs and expected results, do the URLs return what is expected?" Where things get tricky is how we create that list. We obviously can't have humans writing thousands of update URLs, so we need a config that we generate them from. Our requirements for that are:
Other requirements:
Open questions:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1320373)
Now that it exists, fallbackMapping is something we almost always use when running a throttled rollout. I can't think of any case where we'd want users to go no updates instead of the previously released version during these times. We should consider enforcing this in the backend, and rejecting any requests to change backgroundRate to <100 unless fallbackMapping is set, or is getting set as part of the change.
We'll need to ensure that these would actually work for all different types of updates (GMP, SystemAddons, Firefox) before deciding to move forward.
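The backend check might look roughly like this (field names match the rules table; the validation itself is the proposal, not existing code):

```python
# Refuse a rule change that throttles (backgroundRate < 100) without
# a fallbackMapping, judged against the post-change state of the rule
# so the fallbackMapping can be set in the same request.

class ValidationError(Exception):
    pass

def validate_rule_change(current, changes):
    """current: the rule as stored; changes: columns being modified.
    Returns the merged rule, or raises ValidationError."""
    merged = {**current, **changes}
    rate = merged.get("backgroundRate", 100)
    if rate is not None and rate < 100 and not merged.get("fallbackMapping"):
        raise ValidationError("backgroundRate < 100 requires a fallbackMapping")
    return merged
```

Per the caveat above, this would probably need to be conditional on product type until we're sure GMP and SystemAddons handle fallbacks correctly.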
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1391013)
Discussed with :bhearsum while implementing [1]. In a few words, the service layer represents the whole business logic of a process, dividing the business process into smaller ones; the smallest ones are the table objects.
The idea for the back-end architecture would be the following: the web layer takes data out of the HTTP request and passes it to the service. The service is then in charge of delegating to all the tables, and making sure data is consistent across them.
Here are some call diagram examples:
[1] #151
[2] https://www.websequencediagrams.com/?lz=dGl0bGUgQWRkIGEgbmV3IHNjaGVkdWxlZCBjaGFuZ2UKClMACghSdWxlc0FQSVZpZXctPgAJDlNlcnZpY2U6IGFkZF9uZXcocnVsZV9pZCwgcnVsZXNfdmFsdWVzLCBjb25kaXRpb25zLCBhdXRob3IpAFIPAEMHAFEQVGFibGU6IGluc2VydABZBQBLCgBBCG5vdGUgcmlnaHQgb2YgACsVYWxzbyBzYXZlcyB0aW1lc3RhbXAAgVMPAGUFAIFHGQCCFwlfAIFbBwCCERAAgSoWQwCBbwkAgTsOAD4RAIIXDCk&s=rose
[3] https://www.websequencediagrams.com/?lz=dGl0bGUgVXBkYXRlIGNvbmRpdGlvbnMKClNjaGVkdWxlZFJ1bGVzQVBJVmlldy0-AAkOU2VydmljZTogdQA8BV8ANwoocwA4CF9ydWxlX2lkLABUCywgYXV0aG9yKQBYDwBJBwBYD0MAgQ4JVGFibABlCQBFHgA1JgBEDl9sYXN0XwCBFQYAgS0UAIEtCA&s=rose
[4] https://www.websequencediagrams.com/?lz=dGl0bGUgVXBkYXRlIHNjaGVkdWxlZCB0YXJnZXQgdmFsdWVzCgpTABEIUnVsZXNBUElWaWV3LT4ACQ5TZXJ2aWNlOiB1AEkFKABFCV9ydWxlX2lkLCBydWxlc18AUAYsIGF1dGhvcikATw8AQAcAThBUYWJsACkz&s=rose
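A minimal sketch of the layering from diagram [2], with illustrative names (the real service would also handle signoffs and validation):

```python
# The view parses the HTTP request; the service coordinates the
# tables (scheduled_changes + conditions here) and keeps them
# consistent, so no view talks to a table directly.

class ScheduledRuleService:
    def __init__(self, sc_table, conditions_table):
        self.sc_table = sc_table
        self.conditions_table = conditions_table

    def add_new(self, rule_id, rule_values, conditions, author):
        """One business operation spanning multiple tables."""
        sc_id = self.sc_table.insert(rule_id=rule_id, author=author, **rule_values)
        self.conditions_table.insert(sc_id=sc_id, **conditions)
        return sc_id
```

Wrapping both inserts in a single transaction is what makes the service the right place for the consistency guarantee.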
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1313742, original comment by @JohanLorenzo)
One of the rough edges to the new Multiple Signoffs system is that product-less permissions (eg: full admins) end up requiring signoff from all groups that are listed in any permissions required signoff.
For example, if we have the following Permissions Required Signoffs:
...then adding a full fledged admin requires signoff from 1 releng, 1 relman, 1 gofaster, and 1 tbird.
I can think of a few ways to improve this, but each has drawbacks:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1343904)
Currently in the backend, we have Blob classes with jsonschemas that have two main functions:
Separately, we have a bunch of functions in https://github.com/mozilla/build-tools/blob/cc9e80196f7c67abaa9acf3a3e434f6554fd0977/lib/python/balrog/submitter/cli.py whose job it is to turn raw data into valid Blobs.
Because these pieces of code are disconnected we sometimes run into issues where cli.py generates invalid Blobs, which often ends up breaking nightly, or even release, updates.
Over in https://bugzilla.mozilla.org/show_bug.cgi?id=1303106#c9, Rail suggested that if these pieces of code were in the same place, we could run integration tests on them. We could also consider making the turn-raw-data-into-valid-blobs piece an additional function of the blob classes (but it doesn't have to be).
Doing this would probably necessitate doing bug 1312868 as well, so that api clients would have an easy way to get/run the new code.
Related to #1063
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1320949)
We're still on 2.0.
Nick pointed out this series of tweets (https://twitter.com/tef_ebooks/status/949350236392181760?s=03) which explains that it's not safe to store future dates in UTC, because local timezone offsets shift around with DST, while UTC does not. This means that if someone in North America schedules a change 2 days before a DST change, and schedules it to take place 2 days AFTER the DST change (aka 4 days in the future), it will end up being off by an hour.
The thread recommends storing future dates as time + named timezone (that is, "US/Los_Angeles" - or something like that), because timezones ("PST") can change in the future too.
I think this has a couple of implications for Balrog:
There may be other necessary changes, too.
I think history tables are unaffected by this (I don't think that timezones or DST can retroactively be changed), but we may want to consider storing history in the same timezone for consistency. History UI can almost certainly be presented as user-local time regardless of what we do on the backend.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1431793)
For posterity, the aforementioned tweets:
psa: you can’t store future dates as UTC because local time offsets can and will change
and you can't store time zones as utc offsets because dst rules change too
if a user can pick the time, and it could be in the future, I'm sorry, you can't normalise to utc
what you want to store, isn’t timezone offset, or name, but location. or: why tzinfo uses ‘US/Los_Angeles’ as the key
normalising future dates to utc or offset, or even storing ‘PST’ means ignoring dst or local timezone changes. you need ‘US/Los_Angeles’
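The drift is easy to demonstrate with the stdlib zoneinfo module (the 2018 US DST change happened on March 11; the same 6am Los Angeles wall time maps to different UTC instants on either side of it):

```python
# Normalising a future local wall time to UTC pins the offset that is
# in effect *now*, not the one that will be in effect *then*. The
# same 6am in Los Angeles is 14:00 UTC before the 2018 DST switch and
# 13:00 UTC after it -- an hour of drift for anything scheduled
# across the boundary.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")

def stored_as_utc(year, month, day, hour):
    """Naive approach: convert the local wall time to UTC and keep that."""
    return datetime(year, month, day, hour, tzinfo=LA).astimezone(timezone.utc)

before = stored_as_utc(2018, 3, 9, 6)   # PST (UTC-8) applies -> 14:00 UTC
after = stored_as_utc(2018, 3, 12, 6)   # PDT (UTC-7) applies -> 13:00 UTC
offset_shift_hours = before.hour - after.hour
```

Storing the wall time plus "America/Los_Angeles" instead, and converting to UTC only at execution time, avoids the problem.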
This is something I've been musing on, and bhearsum told me to file a bug. (It may not belong in this exact component).
Background:
This tool (UI, or manual script, etc) should:
E.g. take the following arbitrary set:
#0 (Template): <product>:<channel>:<priority>
#1 firefox:beta:96
#2 firefox:beta:95
#3 firefox:beta-cdntest:94
#4 firefox:beta*:93
#5 firefox:beta:92
#6 fennec:beta:92
#7 fennec:beta*:91
#8 firefox:beta:90
#9 <no_product>:beta:89
#10 fennec:release:88
#11 firefox:beta:70
running this script, with the channel/product set to firefox/beta would explode to be like so (unchanged omitted):
#1 firefox:beta:148
#2 firefox:beta:138 # XXX: Should we do 139 to preserve the non-identical prior even though no conflict
#3 firefox:beta-cdntest:138
#4 firefox:beta*:128
#5 firefox:beta:118
#6 fennec:beta:111
#7 fennec:beta*:110
#8 firefox:beta:108
#9 <no_product>:beta:98
Yielding a proper, rule-order-preserving mapping.
Note, since this was "firefox/beta" we still bumped priority for fennec in #6 and #7, and beta-cdntest for firefox in #3 because other matching rules got bumped, and we preserved the existing gap...
The logic of this can be tweaked from my proposal of course.
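One possible core for such a tool, much simpler than the worked example above: insert a rule relative to another and respace priorities to preserve order (the names and the spacing of 10 are arbitrary choices):

```python
# Insert a new rule just above a target rule without hand-editing
# priorities: splice it into the ordered list, then reassign evenly
# spaced priorities top-down so relative order is preserved and gaps
# remain for future insertions. Mutates the rule dicts in place.

def insert_before(rules, new_rule, before_id, spacing=10):
    """rules: dicts with 'rule_id' and 'priority'. Returns a new list
    including new_rule just above the rule with rule_id == before_id."""
    out = []
    for rule in sorted(rules, key=lambda r: -r["priority"]):
        if rule["rule_id"] == before_id:
            out.append(new_rule)
        out.append(rule)
    top = spacing * len(out)
    for i, rule in enumerate(out):
        rule["priority"] = top - i * spacing
    return out
```

Unlike the gap-preserving proposal above, this flattens existing gaps; either policy could sit behind the same "insert before rule X" interface.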
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1301045)
We had a new Traceback show up in Sentry recently that showed a request trying to retrieve a Release that didn't exist:
IndexError: list index out of range
File "flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "flask/views.py", line 84, in view
return self.dispatch_request(*args, **kwargs)
File "flask/views.py", line 149, in dispatch_request
return meth(*args, **kwargs)
File "auslib/web/views/client.py", line 57, in get
release, update_type = AUS.evaluateRules(query)
File "auslib/AUS.py", line 99, in evaluateRules
release = dbo.releases.getReleases(name=rule['mapping'], limit=1)[0]
In this case, the rule in question was the main Firefox release rule, and the Release it mapped to was Firefox-50.1.0-build2-prod. At the time of the request, that release rule pointed at Firefox-50.1.0-build2, and Firefox-50.1.0-build2-prod didn't exist. After some digging with jlund I discovered that he changed the mapping of that Rule and deleted Firefox-50.1.0-build2-prod in short succession. Because Rules are cached, we ended up with a short period of time where requests were using the cached Rule (that pointed at Firefox-50.1.0-build2-prod), but didn't have that Release cached.
This is a pretty rare occurrence, but definitely possible to hit again. We only cache Rules for 30s, so that's the maximum amount of time we could stay in this state for.
There's no obvious easy fix for this. We can't prevent people from deleting Releases that are still pointed at by a cached Rule, because the admin app doesn't know anything about the caches on the public side.
One thing we might be able to try is to ensure that the mappings (aka Releases) of all cached Rules are always cached in the public app. This could be tricky though, and possibly cause a big performance penalty.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1325605)
Catlee suggested that this would be a good way to make such errors debuggable after the fact, without the need to be able to reproduce them. It's also useful because it lets us distinguish between an actual empty update and an error.
I don't think we can or should put the full traceback in, but something that hints at the error (perhaps the top of the traceback stack) would be useful. We can probably use it to find more information via logs or newrelic.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1191320)
Rather than setting the priority of a rule to determine where it sits in order, it would be neat if you could say which rule should come before it and have Balrog internally decide the priority based on that.
This would be similar to a linked list, where you insert at an indexed position and all subsequent nodes are re-ordered.
Put another way, the priority wouldn't be exposed for mutation.
Motivation:
Often as releaseduty, we have to re-order the priority of many rules to fit others in. This opens us up to a silly human mistake if you are scheduling many rule changes and also more churn than needed as both releng and relman must sign off on a no-op change.
This would require front end work too
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1426218 by @lundjordan)
One of the things we often struggle with is generating useful update URLs to test with. Balrog has enough data that it should be able to generate these URLs for us. Because the necessary information is stored in Release blobs, I think it would be best to integrate around them in some way. For example, there could be a button beside the Mapping field in the Rules UI called "Test URLs". When clicked, the user would be prompted for a small amount of information (see below), and then an update URL would be returned, or possibly opened in a new tab. Since we know the Release, we can pull most of the data we need for the update URL from it. We'll still need the user to choose an OS, locale, and possibly channel.
As an example, let's see how we could generate this URL: https://aus5.mozilla.org/update/6/Firefox/53.0.2/20170504105526/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml
We've got the following parts to deal with:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1398202)
Apparently we shipped something with channel set to release-google-cck-realnetworks at some point (full url is https://aus5.mozilla.org/update/2/Firefox/2.0.0.11/2007112718/WINNT_x86-msvc/ja/release-google-cck-realnetworks/Windows_NT%205.1/update.xml). These builds currently get exceptions when trying to update, eg:
KeyError: 'release-google'
File "flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "connexion/decorators/decorator.py", line 66, in wrapper
response = function(request)
File "connexion/decorators/validation.py", line 293, in wrapper
return function(request)
File "connexion/decorators/produces.py", line 38, in wrapper
response = function(request)
File "connexion/decorators/response.py", line 85, in wrapper
response = function(request)
File "connexion/decorators/decorator.py", line 42, in wrapper
response = function(request)
File "connexion/decorators/parameter.py", line 195, in wrapper
return function(**kwargs)
File "auslib/web/public/client.py", line 126, in get_update_blob
app.config["SPECIAL_FORCE_HOSTS"]))
File "auslib/blobs/apprelease.py", line 166, in getInnerXML
patches = self._getPatchesXML(localeData, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 279, in _getPatchesXML
xml = self._getSpecificPatchXML(patchKey, patchKey, patch, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 97, in _getSpecificPatchXML
url = self._getUrl(updateQuery, patchKey, patch, specialForceHosts)
File "auslib/blobs/apprelease.py", line 248, in _getUrl
url = self['fileUrls'][getFallbackChannel(updateQuery['channel'])]
Probably the simplest thing to do is to add a "release-google" entry to whichever blob is serving updates for them.
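The KeyError falls out of the fallback-channel computation: before the fileUrls lookup, Balrog strips the "-cck-<partner>" suffix from the channel, so release-google-cck-realnetworks falls back to release-google, which has no fileUrls entry. A simplified reproduction of that lookup (get_fallback_channel here is a stand-in for auslib's getFallbackChannel, not the exact source):

```python
def get_fallback_channel(channel):
    # Balrog drops the partner ("-cck-<partner>") suffix before looking the
    # channel up in a blob's fileUrls; simplified reproduction.
    return channel.split("-cck-")[0]

# A blob with no "release-google" entry in fileUrls...
file_urls = {"release": "https://download.mozilla.org/..."}

fallback = get_fallback_channel("release-google-cck-realnetworks")
# file_urls[fallback] would raise KeyError: 'release-google',
# matching the traceback above.
```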
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1379281)
We should make sure that hashes submitted to Release blobs match the hashFunction used by the blob.
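One cheap way to enforce this is a format check on the submitted hex digest: the digest lengths below are standard for each algorithm, but the validation hook itself (hash_matches_function) is a hypothetical sketch, not existing Balrog code:

```python
import re

# Expected hex-digest lengths per hash function (assumption: these cover the
# values a Release blob's hashFunction field can name).
DIGEST_LENGTHS = {"md5": 32, "sha1": 40, "sha256": 64, "sha384": 96, "sha512": 128}

def hash_matches_function(hash_value, hash_function):
    """Consistency check: the submitted hash must be hex and have the digest
    length implied by the blob's hashFunction."""
    expected = DIGEST_LENGTHS.get(hash_function.lower())
    if expected is None:
        return False
    return re.fullmatch(r"[0-9a-fA-F]{%d}" % expected, hash_value) is not None
```

A sha256 digest submitted to a blob whose hashFunction is sha512, for example, would be rejected for having the wrong length.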
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1138418)
Balrog's design has evolved a bit over time, but some rough edges have crept in, particularly around having two applications share the same library. For example, it's very difficult to have global objects (such as a database), because the two applications live in different places (and for a while, had their own database objects). We can work around that with hacks like in https://github.com/mozilla/balrog/blob/master/auslib/__init__.py, but it's not ideal.
I'm also finding that making caching happen only on the non-admin application (as part of bug 671488) is more complicated because of this.
In any case, it seems like we should be looking towards some sort of structure that allows the common parts of Balrog (db.py, blobs/, maybe AUS.py) to be in an importable library, and the app-specific parts to live in their own place. We'll have to consider what this means for deployment (particularly in cases where we need a synced deployment for admin+non-admin), and there might be better options than this too.
This is low on the priority list given the feature work on the horizon, but it would be nice to do for future maintainability.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1109295)
While thinking about bug 1309656, I realized that it's currently possible to do something very silly and point a Firefox update rule at a non-Firefox Release. Even in a multifile update world I can't see a scenario where this would be desired, and there may be some potential for privilege escalation or for creating confusion (e.g. creating a release with product=Thunderbird, name=Firefox-$version-we're-about-to-ship) that might lead to the wrong thing being served.
This probably needs a bit more thought about whether or not it's a good idea.
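If we did decide it's worthwhile, the guard could be as simple as comparing products when a rule's mapping is set. A sketch (validate_mapping is a hypothetical helper, not existing Balrog code):

```python
def validate_mapping(rule_product, release):
    """Refuse to point a rule at a Release whose product field disagrees
    with the rule's product. Hypothetical guard, not Balrog's current code."""
    if rule_product is not None and release["product"] != rule_product:
        raise ValueError(
            "Rule product %r does not match release product %r"
            % (rule_product, release["product"])
        )

# A Firefox rule mapped to a Firefox release passes silently;
# mapping it to a Thunderbird release would raise ValueError.
validate_mapping("Firefox", {"product": "Firefox", "name": "Firefox-53.0.2-build1"})
```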
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1309877)
I noticed today that we have a whole bunch of recent releases in the latest production db dump, most of which are not currently mapped to. I think this is because we grab any releases that are mapped to by any scheduled rule change (https://github.com/mozilla/balrog/blob/752d1d548840f1753a8592af405846c6612f7f3c/scripts/manage-db.py#L137), instead of only those mapped to by incomplete scheduled rule changes.
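The fix amounts to filtering on the scheduled change's completion flag before collecting mappings. A hedged sketch, assuming each scheduled change row exposes complete and mapping fields (the field names mirror the rules scheduled-changes table, but this is not the manage-db.py code itself):

```python
def releases_to_keep(scheduled_rule_changes):
    """Collect Release names only from scheduled rule changes that have not
    yet been enacted. Sketch: assumes each row is a dict with 'complete'
    and 'mapping' keys."""
    return {
        change["mapping"]
        for change in scheduled_rule_changes
        if not change["complete"] and change["mapping"]
    }
```

Enacted (complete) scheduled changes contribute nothing, so their old mappings no longer keep stale releases alive in the dump.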
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1377460)
While talking about ways to make account rotation easier, Catlee suggested that we should keep track of the last time an account is used in Balrog, which would avoid the need to grovel through logs like we're doing now.
I originally thought we might be able to query this from the existing history tables, but I've since realized that we'd want to include GETs here as well, so we'll probably need something different for this. Maybe just a table that is updated whenever a request is made to the admin interface, refreshing the timestamp for that user?
We'd also need an API + some UI for it.
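A minimal sketch of the table idea, using an in-memory SQLite database and a hypothetical touch_user() helper called on every admin request (the table and column names are assumptions, not Balrog's schema; the upsert syntax needs SQLite 3.24+):

```python
import sqlite3
import time

# Hypothetical one-row-per-user table tracking the last admin request,
# including GETs, which history tables never record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_last_active (username TEXT PRIMARY KEY, last_seen INTEGER)"
)

def touch_user(username, now=None):
    """Upsert the user's last-seen timestamp; intended to run on every
    request to the admin interface."""
    now = int(now if now is not None else time.time())
    conn.execute(
        "INSERT INTO user_last_active (username, last_seen) VALUES (?, ?) "
        "ON CONFLICT(username) DO UPDATE SET last_seen = excluded.last_seen",
        (username, now),
    )

# Two requests from the same account: only the newest timestamp is kept.
touch_user("balrogadmin", now=1000)
touch_user("balrogadmin", now=2000)
```

An API endpoint and UI column reading from this table would then replace the current log-grovelling during account rotation.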
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1372250)