mozilla-releng / balrog
Mozilla's Update Server
Home Page: http://mozilla-balrog.readthedocs.io/en/latest/index.html
License: Mozilla Public License 2.0
We've got jsonschemas that can validate all of the different types of Release blobs. It would be great if we could use those as part of the swagger specs, because it would allow us to generate clients that could do client-side blob validation, which can make clients much friendlier.
Swagger only supports a subset of jsonschema, so it may not be possible for all of the current blob types.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1381514)
Submitting data to Balrog Release blobs has become increasingly problematic over the years. Much of this boils down to the fact that we hit tons of data races when trying to update a Release: the API requires that clients submit the data version they're basing their update on, and the database layer rejects the update if the current data version does not match the given one. This has been aggravated in recent times by the addition of more balrogworker instances.
Whatever solution we end up choosing should get us to a place where a submission with valid data works 99.99% of the time. We do not want a solution that still leaves us subject to data races or other scaling issues.
This work is likely to involve the API (I'd like to take this opportunity to get rid of or fix up https://github.com/mozilla-releng/balrog/blob/master/src/auslib/web/admin/views/releases.py, which is really, really ugly and hacky), balrogscript, and possibly the database layer.
bug 1246993 describes a long standing bug where bad entries were made to history tables while making updates. If the tests for this code had verified the history table entries (in addition to the primary tables), this bug would never have been introduced.
We should make a point of always verifying the history table entries whenever we write tests that modify the database. This bug is to track updating all of the existing tests to do so.
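A shared test helper could assert on both tables after every write. The layout assumed here (each history row carrying an extra `change_id`/`changed_by` pair on top of the primary columns) mirrors Balrog's history convention, but the exact interface is a sketch:

```python
# Sketch of a test helper that checks a history table stays in sync
# with its primary table after a modification. Column names
# (change_id, changed_by, data_version) are illustrative.

def verify_history(primary_rows, history_rows):
    """Assert that the newest history entry mirrors a primary row.

    primary_rows: list of dicts, current table contents.
    history_rows: list of dicts, ordered oldest-to-newest, with extra
                  'change_id' and 'changed_by' columns.
    """
    assert history_rows, "every write must create a history entry"
    latest = dict(history_rows[-1])
    latest.pop("change_id", None)
    latest.pop("changed_by", None)
    assert latest in primary_rows, "history out of sync with primary table"
    return True
```

Calling this at the end of every test that writes to the database would have caught bug 1246993 when it was introduced.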
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1405337)
Right now, if an update query matches multiple rules with the same priority, it's not possible to guarantee that a certain one is chosen. This is undefined behaviour and can lead to lots of confusion.
We should do something to make this less of a footgun. In the past we've talked about possibly choosing the "most matching" rule (that is, the one with the most specificity; eg: one that requires build_target+channel is more matching than one that requires just channel).
Another idea could be to just disallow rules with the same priority. Or maybe disallow rules with the same priority when product is the same.
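If we went the "most matching" route, the tiebreaker might look something like this (field names mirror the rules table columns; the scoring itself is purely a sketch):

```python
# Hypothetical "most matching" tiebreaker: among rules with equal
# priority, prefer the one that constrains the most fields. Which
# fields count, and whether they should be weighted, is an open
# question -- this treats them all equally.

MATCHABLE_FIELDS = ("product", "channel", "buildTarget", "locale", "osVersion")

def specificity(rule):
    """Count how many matchable fields the rule actually constrains."""
    return sum(1 for f in MATCHABLE_FIELDS if rule.get(f) is not None)

def pick_rule(matching_rules):
    """Among equal-priority matches, pick the most specific one."""
    return max(matching_rules, key=specificity)
```

Even with this, two rules could tie on specificity as well, so it reduces rather than eliminates the undefined behaviour.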
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1283568)
We recently discovered an issue where we had cleaned up some stuff on S3 that was thought to be unused, and it broke updates for a significant number of users.
One thing that would help here is to regularly check all of the mapped-to Releases, and make sure that all of the URLs they point to return 200s. I wrote a hacky hacky script to do this as a one off: https://github.com/mozilla/balrog/compare/master...mozbhearsum:find-bad-mars?expand=1
find-active-mar-urls2.py finds everything that is pointed at and outputs it to a JSON file. check-urls.py goes through that JSON file and does a HEAD request on each URL.
This needs to be polished and probably enhanced before we can run it in automation or anything.
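A cleaned-up core of the check-urls.py idea might look like this, using only the standard library (the real script's behaviour may differ):

```python
# Issue a HEAD request for every URL and report any that don't return
# 200. The status function is injectable so the filtering logic can
# be tested without the network.
import urllib.error
import urllib.request

def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or an error string."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError as e:
        return str(e.reason)

def find_bad_urls(urls, status=head_status):
    """Return (url, status) pairs for everything that isn't a 200."""
    return [(u, s) for u, s in ((u, status(u)) for u in urls) if s != 200]
```

Running this in automation would still need retry logic and alerting on top.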
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1337148)
The current behaviour of Scheduled Changes that use "when" is:
This means that an already scheduled change may end up being scheduled "in the past" if signoffs do not arrive in time. It's confusing and inconsistent that you can't schedule a change in the past, but it can drift there.
A couple of ideas about how to fix this:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1392720)
We have callbacks that are meant to fire when changes are made to the database. Right now, they fire before the transaction is completed (eg: https://github.com/mozilla/balrog/blob/9f8de88056be59332faa9b79ba2517ad2b0caffa/auslib/db.py#L345), which means the callbacks may send e-mail or other notifications about changes that ultimately fail to commit.
I suspect the reason they ended up here is because we pass the query to them, so that interface may need to change.
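One possible shape of the fix, sketched with illustrative names: queue the callbacks on the transaction and only run them once the commit succeeds. Since we currently pass the query to the callbacks, the real interface change would be more involved than this.

```python
# Defer notification callbacks until after commit: collect
# (callback, args) pairs during the transaction and fire them only if
# the commit goes through. Rolled-back transactions drop their queue.

class Transaction:
    def __init__(self):
        self._post_commit = []
        self.committed = False

    def on_commit(self, callback, *args):
        """Queue a callback to run only if the transaction commits."""
        self._post_commit.append((callback, args))

    def commit(self):
        self.committed = True  # real code would commit to the DB here
        for callback, args in self._post_commit:
            callback(*args)

    def rollback(self):
        self._post_commit.clear()
```

SQLAlchemy-level hooks could achieve the same thing, but the queue makes the ordering guarantee explicit.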
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1332412)
It looks like these haven't been updated to reflect a few things:
We've recently started using OpenAPI-style specs to define both our admin and public APIs. We use Connexion to load and create routes for them in Flask. It turns out that Connexion has some extensions that make it possible for us to have Connexion-compatible specs that aren't OpenAPI 2.0 compliant. The consequence of this is that we cannot make use of other OpenAPI 2.0 tools, such as https://github.com/swagger-api/swagger-codegen.
We're still working on becoming OpenAPI 2.0 compliant, but we should add tests to make sure it doesn't regress once we get there.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1387063)
Need to wait for #1084 first.
@catlee suggested a way to test out rule changes before applying them. You'd have something that lets you modify one or more rules before they affect actual requests, and check them against a set of criteria.
I interpreted the criteria as specifying a release, with the option to restrict to a particular build target/locale/OS version/etc to simulate a query. That could save on the nitty gritty of build target strings and buildIDs in the URL, but we could leave the general case there too. Then specify which mapping should be used to serve the update, and verify that's what the rules deliver.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1141801, original comment from @nthomas-mozilla)
This idea came up today and shortly afterwards I was helping debug why nothing was being returned for a particular update URL. It turned out that a different rule was being matched than we thought. Had we had the release name somewhere in the response (like a header!), this would've been much more obvious.
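As a sketch, the public app could attach something like the following to each response (these header names are made up, not anything Balrog currently sends):

```python
# Hypothetical debug headers for update responses: which rule matched
# and which Release the XML was generated from. Header names are
# illustrative.

def debug_headers(rule_id, release_name):
    """Build headers to attach to an update response for debugging."""
    headers = {}
    if rule_id is not None:
        headers["X-Balrog-Rule-ID"] = str(rule_id)
    if release_name is not None:
        headers["X-Balrog-Release-Name"] = release_name
    return headers
```

With that in place, the debugging session described above would have been a single curl -I away.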
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1425594)
We started signing responses for FirefoxVPN, which has given us a model for how to do it for other product types. I won't paste it all here, but https://bugzilla.mozilla.org/show_bug.cgi?id=1304782 has a lot of additional discussion on this.
We discovered this while testing https://bugzilla.mozilla.org/show_bug.cgi?id=1310226. Some of the scheduled changes used in testing would generate errors, because signoffs hadn't been given. When looking at sc_id 2, an error was generated (in that case, because signoffs weren't done), and that caused sc_id 3 to never be processed. We probably need to enhance the error handling in https://github.com/mozilla/balrog/blob/dc79a6b06ae1f38fd2d3fb20d8df20e5a7481d35/agent/balrogagent/cmd.py to continue in the two innermost loops if any exceptions are hit.
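The proposed error handling could look roughly like this (the loop structure and function names are illustrative, not the agent's actual code):

```python
# Keep iterating over scheduled changes even when one of them raises,
# so one failing signoff check can't block later sc_ids.
import logging

def process_scheduled_changes(endpoints, fetch, enact):
    """fetch(endpoint) -> list of scheduled changes; enact(sc) applies one."""
    for endpoint in endpoints:
        try:
            changes = fetch(endpoint)
        except Exception:
            logging.exception("couldn't fetch %s", endpoint)
            continue
        for sc in changes:
            try:
                enact(sc)
            except Exception:
                # eg: signoff requirements not met; move on to the next one
                logging.exception("couldn't enact sc_id %s", sc.get("sc_id"))
                continue
```

The bare `except Exception` is deliberate here: any single bad scheduled change should be logged and skipped rather than abort the whole run.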
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1342191)
This is to help situations where a scheduled rule is forgotten and requires a sign off.
I'm thinking it could start via a channel poke, then escalate via email to the people within the associated group that can sign off.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1426221, filed by @lundjordan)
We keep track of when signoffs to scheduled changes happen in Balrog's database, but we don't expose them anywhere in the API.
The current state of signoffs gets returned as part of the list of scheduled changes, eg: in a GET to /scheduled_changes/rules.
We could do this in two ways:
The former would be simpler to implement, but the latter is more consistent with the existing scheduled changes api, where we've been treating each scheduled change as one object, despite the fact that they are stored across 3 tables (scheduled_changes, conditions, and signoffs). Going this route may mean we need to increase data_version in each of these tables whenever something from one of them changes.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1340170)
Currently we have an extremely simplistic change notification system. For certain tables, we send e-mail to a mailing list whenever a change to them is made. This has turned out to be extremely spammy, and likely goes unread most of the time.
I think a better change notification system has at least these two requirements:
There's probably other considerations that I haven't thought of.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1337892)
We should also consider just killing the system, as nobody looks at the notifications.
While reviewing #145 I found an issue with one of the yml files it was changing. To my surprise, the tests still passed. It looks like this is because we don't have any tests for .validate() except for some apprelease blobs, and the whitelist blob. We should add tests for all the remaining blobs that use .validate().
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1310000)
In #218, we added a new endpoint that allows someone to query for the permissions and roles of a named user. Nick correctly pointed out that we should restrict this to admins, and those users who are able to manipulate permissions. I implemented this for the new endpoint as part of that PR, but we should move this enforcement down to the database level to make sure that it is obeyed by all endpoints.
We'll need to modify the interface of AUSTable.select() to do this, because it requires knowing the current user. We already pass this as "changed_by" for insert/update/delete, so we should probably add an arg like that to select().
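A rough sketch of what that interface change might look like (the method names follow the existing insert/update/delete convention, but the hook and exception are assumptions):

```python
# Sketch: AUSTable.select() grows a changed_by argument, and
# permission-sensitive tables override a hook that can reject the
# read at the database level, so every endpoint gets the check.

class PermissionDeniedError(Exception):
    pass

class AUSTable:
    def _assertReadAllowed(self, changed_by):
        """Default: anyone may read."""

    def select(self, where=None, changed_by=None):
        self._assertReadAllowed(changed_by)
        return self._query(where)

    def _query(self, where):
        return []  # stand-in for the real SQLAlchemy query

class PermissionsTable(AUSTable):
    def __init__(self, admins):
        self.admins = admins

    def _assertReadAllowed(self, changed_by):
        if changed_by not in self.admins:
            raise PermissionDeniedError(changed_by)
```

The real check would need to consult the permissions table itself rather than a static admin set, which raises a bootstrapping question the sketch glosses over.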
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1340167)
Steps to reproduce:
While working on bug-1355477, I realized that it is possible to download the current version of a release and upload it as an update without making any changes to the file.
Actual results:
The update was accepted as a scheduled change for a release.
Expected results:
The system should have rejected scheduling of updates with no changes.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1417355)
Currently we send mail such as:
Row to be inserted:
{'base_alias': 'firefox-release',
'base_backgroundRate': 100,
'base_buildID': None,
'base_buildTarget': None,
'base_channel': 'release',
'base_comment': 'default release rule updated by buildbot, DO NOT DELETE',
'base_data_version': 147,
'base_distVersion': None,
'base_distribution': None,
'base_fallbackMapping': None,
'base_headerArchitecture': None,
'base_locale': None,
'base_mapping': 'Firefox-54.0-build3-whatsnew',
'base_osVersion': None,
'base_priority': 90,
'base_product': 'Firefox',
'base_rule_id': 145,
'base_systemCapabilities': None,
'base_update_type': 'minor',
'base_version': None,
'change_type': 'update',
'csrf_token': '1498007702##fe6927e9a30927fb3b0baebef3ea7f86bf04e314',
'data_version': 1,
'scheduled_by': '[email protected]'}
Unfortunately, scheduled changes are not terribly useful without context. The most useful thing to know is "what is this scheduled change going to change". In the above case, backgroundRate and fallbackMapping were different vs. the base, so that's what should've been highlighted.
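A hypothetical helper that computes exactly that, using the `base_` prefix convention from the mail above:

```python
# Diff a scheduled change row against the rule it's based on and
# report only the columns that will actually change, so the mail can
# highlight e.g. backgroundRate and fallbackMapping instead of
# dumping the whole row.

def changed_fields(scheduled_row, base_rule):
    """Map each changed column name to its (old, new) pair."""
    diff = {}
    for key, new in scheduled_row.items():
        if not key.startswith("base_"):
            continue  # skip bookkeeping columns like change_type
        column = key[len("base_"):]
        old = base_rule.get(column)
        if old != new:
            diff[column] = (old, new)
    return diff
```

The mail template could then render just the diff, with the full row available behind a link for anyone who wants it.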
Can be closed if we end up killing them (see #1071).
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1375010)
Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1374638
We've long wanted to log information about how each request to Balrog is served (see https://bugzilla.mozilla.org/show_bug.cgi?id=758373). At the very least: which version we served for the request, and whether the request got served the primary mapping or the fallback mapping.
At this point in time, the most sensible place to put it would be BigQuery.
It's worth noting that even if/when we start doing this in the Balrog app, we still have both an nginx and a Cloudfront cache in front of it, so the data won't actually contain information about all updates served, just those that make it to the app.
Balrog blobs have two main jobs:
#2 is something that generally only the server cares about, but #1 is something that would be extremely useful to clients as well as the server.
I think it would be a good idea to investigate if it would be possible to publish auslib.blobs as a separate package that the server and clients could depend on. This might be trickier now that we have multifile updates...we may need to look at refactoring the code such that #1 and #2 are isolated. It might not even be viable, but I think it's worth looking into further.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1312868)
We've been having some issues dealing with nightly release history cleanup (details of which can be found in bug 1283492). We were talking about them today and it got me thinking: rather than keep nightly history for such a short period of time (7-14 days), maybe we can just not keep history for nightly releases in the first place. We'd still need to do cleanup of the releases table to remove old nightlies, but that is quick and easy in comparison.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1294493)
Not hugely important, but it can catch some bad practices and potential issues.
https://github.com/openstack/bandit
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1335428)
fileUrls already supports a special "*" channel to eliminate the need to list our main release + cdn test channel separately, but in cases where we have multiple sets of channels that are the same we have to duplicate one set. Eg, for RCs we have:
"fileUrls": {
"beta": {
"partials": {
"Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0build2-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
},
"completes": {
"*": "http://download.mozilla.org/?product=firefox-34.0build2-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
}
},
"*": {
"partials": {
"Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
},
"completes": {
"*": "http://download.mozilla.org/?product=firefox-34.0-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
}
},
"beta-cdntest": {
"partials": {
"Firefox-33.1-build3": "http://download.mozilla.org/?product=firefox-34.0build2-partial-33.1&os=%OS_BOUNCER%&lang=%LOCALE%"
},
"completes": {
"*": "http://download.mozilla.org/?product=firefox-34.0build2-complete&os=%OS_BOUNCER%&lang=%LOCALE%"
}
},
"release-localtest": {
"partials": {
"Firefox-33.1-build3": "http://dev-stage01.srv.releng.scl3.mozilla.com/pub/mozilla.org/firefox/candidates/34.0-candidates/build2/update/%OS_FTP%/%LOCALE%/firefox-33.1-34.0.partial.mar"
},
"completes": {
"*": "http://dev-stage01.srv.releng.scl3.mozilla.com/pub/mozilla.org/firefox/candidates/34.0-candidates/build2/update/%OS_FTP%/%LOCALE%/firefox-34.0.complete.mar"
}
}
},
"*" handles release-cdntest and release. But we also have beta-cdntest and beta. Those two channels serve exactly the same content, but are different from their release counterparts. There should be a way to combine these together into a single entry.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1122557)
The -latest blobs that we currently use in Balrog are very long lived, and get continually updated with the latest nightly build information. Every other type of release in Balrog (dated nightly blobs, beta/release blobs, CDM blobs, etc.) just contain information about one "set" of things, and we create new release blobs whenever we generate a new set of things, so -latest blobs are a bit strange in comparison.
We're starting to bump into areas where this makes things harder. For example, when Varun implemented merge logic in https://bugzilla.mozilla.org/show_bug.cgi?id=1223872, we considered making conflicts between partial+complete lists mergeable, but couldn't because -latest blobs need to fully overwrite them at times. Other blob types are append-only in these sections.
I think it would be worthwhile trying to find an alternative to -latest blobs that would still allow us to get nightly updates out in a timely fashion. Some random ideas:
It's possible that -latest is already the best all around solution, so this might end up being WONTFIX.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1286842)
In #1042 we did a small change to allow FirefoxVPN to be used in all the ways that Guardian was. The latter name is now deprecated, and we should remove support for it and use FirefoxVPN in its place.
Presumably this is happening because of the call we make to auth0's /userinfo endpoint, which is rate limited to 5 requests per minute with bursts of up to 10 requests per user id (from https://auth0.com/docs/policies/rate-limits#authentication-api).
We do cache the results of these calls, but we make one request per username at roughly the same time when /users is loaded, and we have multiple admin webheads, so it can take a while for all the webheads to have cached results for all of the users.
Off the top of my head, the only way I can think of to fix this is to cache the results of the /userinfo queries somewhere persistent that can be shared between webheads. Right now, the only thing we have that persists is the MySQL database, but we've talked about adding memcache at some point.
There may also be a more clever fix that I haven't considered.
This should be a pretty simple change now that we're on .taskcluster.yml v1.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1363495)
After we enabled the change notifier we found at least one case where automation repeatedly makes the same change to the database many times. This is silly, and unnecessarily invalidates caches. We should try to make clients smarter about this where we can, but we can also do better on the backend and simply check if things will change prior to making a write.
We should watch out for potential performance penalties when it comes to Releases. Single locale updates already need to retrieve the current version of the blob before changing it, but updates that intend to replace a Release blob's full contents may not currently retrieve the entire blob first.
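A minimal sketch of the check-before-write idea, assuming we only compare the columns actually being written:

```python
# Compare the incoming values against the current row and skip the
# UPDATE entirely when nothing would change, so no-op writes stop
# creating history entries and invalidating caches.

def is_noop_update(current_row, new_values):
    """True if applying new_values would leave the row unchanged."""
    return all(current_row.get(col) == val for col, val in new_values.items())

def update_row(current_row, new_values, do_write):
    """Perform the write only if it changes something. Returns whether
    a write happened."""
    if is_noop_update(current_row, new_values):
        return False  # nothing to do; don't touch history or caches
    do_write(new_values)
    return True
```

For full-blob Release replacements, this is exactly where the extra read-before-write cost mentioned above would show up.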
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1313631)
Rail and Catlee have both mentioned recently that it would be useful to have a test suite that can validate that the current state of a set of Balrog rules returns all the right things for all the right inputs. This would be helpful both to give us reassurance when making changes to the Rules, and also if we ever need to rebuild them from scratch (eg: if we somehow lose the database).
A few other random thoughts:
--
Aki brought up the idea of "test driven development" for Balrog Rules recently. Boiled down, fixing this bug is the primary piece of work to make that possible.
(In reply to Ben Hearsum (:bhearsum) from comment #0)
- Rather than making requests and comparing XML output, it may be better to simply compare the name of the Release that the XML would be generated from. This avoids a lot of issues related to ordering of lines in the output and other differences that don't change behaviour. We'd still need to go through AUS.evaluateRules() to have all the rule matching logic run. Not sure how this would fit with multifile updates yet, as that logic is very closely tied to XML generation. Maybe return the name of the superblob and all response blobs?
I'm going to backtrack on this now that I'm seeing it with fresh eyes - if we really care about validating Balrog's state, we really need to run through the entire rule matching + XML generation logic. There's too many things we could miss if we only look at Mappings.
Given that, I think the test suite simply ends up being "given a list of update URLs and expected results, do the URLs return what is expected?" Where things get tricky is how we create that list. We obviously can't have humans writing thousands of update URLs, so we need a config that we generate them from. Our requirements for that are:
Other requirements:
Open questions:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1320373)
Now that it exists, fallbackMapping is something we almost always use when running a throttled rollout. I can't think of any case where we'd want users to go no updates instead of the previously released version during these times. We should consider enforcing this in the backend, and rejecting any requests to change backgroundRate to <100 unless fallbackMapping is set, or is getting set as part of the change.
We'll need to ensure that these would actually work for all different types of updates (GMP, SystemAddons, Firefox) before deciding to move forward.
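The backend check might look roughly like this (field names match the rules table; the validation itself is the proposal, not existing code):

```python
# Refuse a rule change that throttles (backgroundRate < 100) without
# a fallbackMapping, judged against the post-change state of the rule
# so the fallbackMapping can be set in the same request.

class ValidationError(Exception):
    pass

def validate_rule_change(current, changes):
    """current: the rule as stored; changes: columns being modified.
    Returns the merged rule, or raises ValidationError."""
    merged = {**current, **changes}
    rate = merged.get("backgroundRate", 100)
    if rate is not None and rate < 100 and not merged.get("fallbackMapping"):
        raise ValidationError("backgroundRate < 100 requires a fallbackMapping")
    return merged
```

Per the caveat above, this would probably need to be conditional on product type until we're sure GMP and SystemAddons handle fallbacks correctly.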
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1391013)
Discussed with :bhearsum while implementing [1]. In a few words, the service layer represents the whole business logic of a process, dividing the business process into smaller ones; the smallest ones are the table objects.
The idea for the back-end architecture would be the following: the web layer takes data out of the HTTP request and passes it to the service. The service is then in charge of delegating to all the tables, and making sure data is consistent across them.
Here are some call diagram examples:
[1] #151
[2] https://www.websequencediagrams.com/?lz=dGl0bGUgQWRkIGEgbmV3IHNjaGVkdWxlZCBjaGFuZ2UKClMACghSdWxlc0FQSVZpZXctPgAJDlNlcnZpY2U6IGFkZF9uZXcocnVsZV9pZCwgcnVsZXNfdmFsdWVzLCBjb25kaXRpb25zLCBhdXRob3IpAFIPAEMHAFEQVGFibGU6IGluc2VydABZBQBLCgBBCG5vdGUgcmlnaHQgb2YgACsVYWxzbyBzYXZlcyB0aW1lc3RhbXAAgVMPAGUFAIFHGQCCFwlfAIFbBwCCERAAgSoWQwCBbwkAgTsOAD4RAIIXDCk&s=rose
[3] https://www.websequencediagrams.com/?lz=dGl0bGUgVXBkYXRlIGNvbmRpdGlvbnMKClNjaGVkdWxlZFJ1bGVzQVBJVmlldy0-AAkOU2VydmljZTogdQA8BV8ANwoocwA4CF9ydWxlX2lkLABUCywgYXV0aG9yKQBYDwBJBwBYD0MAgQ4JVGFibABlCQBFHgA1JgBEDl9sYXN0XwCBFQYAgS0UAIEtCA&s=rose
[4] https://www.websequencediagrams.com/?lz=dGl0bGUgVXBkYXRlIHNjaGVkdWxlZCB0YXJnZXQgdmFsdWVzCgpTABEIUnVsZXNBUElWaWV3LT4ACQ5TZXJ2aWNlOiB1AEkFKABFCV9ydWxlX2lkLCBydWxlc18AUAYsIGF1dGhvcikATw8AQAcAThBUYWJsACkz&s=rose
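A minimal sketch of the layering from diagram [2], with illustrative names (the real service would also handle signoffs and validation):

```python
# The view parses the HTTP request; the service coordinates the
# tables (scheduled_changes + conditions here) and keeps them
# consistent, so no view talks to a table directly.

class ScheduledRuleService:
    def __init__(self, sc_table, conditions_table):
        self.sc_table = sc_table
        self.conditions_table = conditions_table

    def add_new(self, rule_id, rule_values, conditions, author):
        """One business operation spanning multiple tables."""
        sc_id = self.sc_table.insert(rule_id=rule_id, author=author, **rule_values)
        self.conditions_table.insert(sc_id=sc_id, **conditions)
        return sc_id
```

Wrapping both inserts in a single transaction is what makes the service the right place for the consistency guarantee.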
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1313742, original comment by @JohanLorenzo)
One of the rough edges to the new Multiple Signoffs system is that product-less permissions (eg: full admins) end up requiring signoff from all groups that are listed in any permissions required signoff.
For example, if we have the following Permissions Required Signoffs:
...then adding a full fledged admin requires signoff from 1 releng, 1 relman, 1 gofaster, and 1 tbird.
I can think of a few ways to improve this, but each has drawbacks:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1343904)
Currently in the backend, we have Blob classes with jsonschemas that have two main functions:
Separately, we have a bunch of functions in https://github.com/mozilla/build-tools/blob/cc9e80196f7c67abaa9acf3a3e434f6554fd0977/lib/python/balrog/submitter/cli.py whose job it is to turn raw data into valid Blobs.
Because these pieces of code are disconnected we sometimes run into issues where cli.py generates invalid Blobs, which often ends up breaking nightly, or even release, updates.
Over in https://bugzilla.mozilla.org/show_bug.cgi?id=1303106#c9, Rail suggested that if these pieces of code were in the same place, we could run integration tests on them. We could also consider making the turn-raw-data-into-valid-blobs piece an additional function of the blob classes (but it doesn't have to be).
Doing this would probably necessitate doing bug 1312868 as well, so that api clients would have an easy way to get/run the new code.
Related to #1063
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1320949)
We're still on 2.0.
Nick pointed out this series of tweets (https://twitter.com/tef_ebooks/status/949350236392181760?s=03) which explains that it's not safe to store future dates in UTC, because local timezone offsets shift around with DST, while UTC does not. This means that if someone in North America schedules a change 2 days before a DST change, and schedules it to take place 2 days AFTER the DST change (aka 4 days in the future), it will end up being off by an hour.
The thread recommends storing future dates as time + named timezone (that is, "US/Los_Angeles" - or something like that), because timezones ("PST") can change in the future too.
I think this has a couple of implications for Balrog:
There may be other necessary changes, too.
I think history tables are unaffected by this (I don't think that timezones or DST can retroactively be changed), but we may want to consider storing history in the same timezone for consistency. History UI can almost certainly be presented as user-local time regardless of what we do on the backend.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1431793)
For posterity, the aforementioned tweets:
psa: you can’t store future dates as UTC because local time offsets can and will change
and you can't store time zones as utc offsets because dst rules change too
if a user can pick the time, and it could be in the future, I'm sorry, you can't normalise to utc
what you want to store, isn’t timezone offset, or name, but location. or: why tzinfo uses ‘US/Los_Angeles’ as the key
normalising future dates to utc or offset, or even storing ‘PST’ means ignoring dst or local timezone changes. you need ‘US/Los_Angeles’
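The drift is easy to demonstrate with the stdlib zoneinfo module (the 2018 US DST change happened on March 11; the same 6am Los Angeles wall time maps to different UTC instants on either side of it):

```python
# Normalising a future local wall time to UTC pins the offset that is
# in effect *now*, not the one that will be in effect *then*. The
# same 6am in Los Angeles is 14:00 UTC before the 2018 DST switch and
# 13:00 UTC after it -- an hour of drift for anything scheduled
# across the boundary.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")

def stored_as_utc(year, month, day, hour):
    """Naive approach: convert the local wall time to UTC and keep that."""
    return datetime(year, month, day, hour, tzinfo=LA).astimezone(timezone.utc)

before = stored_as_utc(2018, 3, 9, 6)   # PST (UTC-8) applies -> 14:00 UTC
after = stored_as_utc(2018, 3, 12, 6)   # PDT (UTC-7) applies -> 13:00 UTC
offset_shift_hours = before.hour - after.hour
```

Storing the wall time plus "America/Los_Angeles" instead, and converting to UTC only at execution time, avoids the problem.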
This is something I've been musing on, and bhearsum told me to file a bug. (It may not belong in this exact component).
Background:
This tool (UI, or manual script, etc) should:
E.g. take the following arbitrary set:
#0 (Template): <product>:<channel>:<priority>
#1 firefox:beta:96
#2 firefox:beta:95
#3 firefox:beta-cdntest:94
#4 firefox:beta*:93
#5 firefox:beta:92
#6 fennec:beta:92
#7 fennec:beta*:91
#8 firefox:beta:90
#9 <no_product>:beta:89
#10 fennec:release:88
#11 firefox:beta:70
running this script, with the channel/product set to firefox/beta would explode to be like so (unchanged omitted):
#1 firefox:beta:148
#2 firefox:beta:138 # XXX: Should we do 139 to preserve the non-identical prior even though no conflict
#3 firefox:beta-cdntest:138
#4 firefox:beta*:128
#5 firefox:beta:118
#6 fennec:beta:111
#7 fennec:beta*:110
#8 firefox:beta:108
#9 <no_product>:beta:98
Yielding a proper, rule-order-preserving mapping.
Note, since this was "firefox/beta" we still bumped priority for fennec in #6 and #7, and beta-cdntest for firefox in #3 because other matching rules got bumped, and we preserved the existing gap...
The logic of this can be tweaked from my proposal of course.
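One possible core for such a tool, much simpler than the worked example above: insert a rule relative to another and respace priorities to preserve order (the names and the spacing of 10 are arbitrary choices):

```python
# Insert a new rule just above a target rule without hand-editing
# priorities: splice it into the ordered list, then reassign evenly
# spaced priorities top-down so relative order is preserved and gaps
# remain for future insertions. Mutates the rule dicts in place.

def insert_before(rules, new_rule, before_id, spacing=10):
    """rules: dicts with 'rule_id' and 'priority'. Returns a new list
    including new_rule just above the rule with rule_id == before_id."""
    out = []
    for rule in sorted(rules, key=lambda r: -r["priority"]):
        if rule["rule_id"] == before_id:
            out.append(new_rule)
        out.append(rule)
    top = spacing * len(out)
    for i, rule in enumerate(out):
        rule["priority"] = top - i * spacing
    return out
```

Unlike the gap-preserving proposal above, this flattens existing gaps; either policy could sit behind the same "insert before rule X" interface.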
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1301045)
We had a new Traceback show up in Sentry recently that showed a request trying to retrieve a Release that didn't exist:
IndexError: list index out of range
File "flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "flask/views.py", line 84, in view
return self.dispatch_request(*args, **kwargs)
File "flask/views.py", line 149, in dispatch_request
return meth(*args, **kwargs)
File "auslib/web/views/client.py", line 57, in get
release, update_type = AUS.evaluateRules(query)
File "auslib/AUS.py", line 99, in evaluateRules
release = dbo.releases.getReleases(name=rule['mapping'], limit=1)[0]
In this case, the rule in question was the main Firefox release rule, and the Release it mapped to was Firefox-50.1.0-build2-prod. At the time of the request, that release rule pointed at Firefox-50.1.0-build2, and Firefox-50.1.0-build2-prod didn't exist. After some digging with jlund I discovered that he changed the mapping of that Rule and deleted Firefox-50.1.0-build2-prod in short succession. Because Rules are cached, we ended up with a short period of time where requests were using the cached Rule (that pointed at Firefox-50.1.0-build2-prod), but didn't have that Release cached.
This is a pretty rare occurrence, but definitely possible to hit again. We only cache Rules for 30s, so that's the maximum amount of time we could stay in this state for.
There's no obvious easy fix for this. We can't prevent people from deleting Releases that are still pointed at by a cached Rule, because the admin app doesn't know anything about the caches on the public side.
One thing we might be able to try is to ensure that the mappings (aka Releases) of all cached Rules are always cached in the public app. This could be tricky though, and possibly cause a big performance penalty.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1325605)
Catlee suggested that this would be a good way to make such errors debuggable after the fact, without the need to be able to reproduce them. It's also useful because it lets us distinguish between an actual empty update and an error.
I don't think we can or should put the full traceback in, but something that hints at the error (perhaps the top of the traceback stack) would be useful. We can probably use it to find more information via logs or newrelic.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1191320)
Rather than setting the priority of a rule to determine where it sits in order, it would be neat if you could say which rule should come before it and have Balrog internally decide the priority based on that.
This would be similar to a linked list, where you insert at an indexed position and all subsequent nodes are re-ordered.
Put another way, the priority wouldn't be exposed for mutation.
Motivation:
Often as releaseduty, we have to re-order the priority of many rules to fit others in. This opens us up to a silly human mistake if you are scheduling many rule changes and also more churn than needed as both releng and relman must sign off on a no-op change.
This would require front end work too
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1426218 by @lundjordan)
One of the things we often struggle with is generating useful update URLs to test with. Balrog has enough data that it should be able to generate these URLs for us. Because the necessary information is stored in Release blobs, I think it would be best to integrate around them in some way. For example, there could be a button beside the Mapping field in the Rules UI called "Test URLs". When clicked, the user would be prompted for a small amount of information (see below), and then an update URL would be returned, or possibly opened in a new tab. Since we know the Release, we can pull most of the data we need for the update URL from it. We'll still need the user to choose an OS, locale, and possibly channel.
As an example, let's see how we could generate this URL: https://aus5.mozilla.org/update/6/Firefox/53.0.2/20170504105526/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml
We've got the following parts to deal with:
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1398202)
Apparently we shipped something with channel set to release-google-cck-realnetworks at some point (full url is https://aus5.mozilla.org/update/2/Firefox/2.0.0.11/2007112718/WINNT_x86-msvc/ja/release-google-cck-realnetworks/Windows_NT%205.1/update.xml). These builds currently get exceptions when trying to update, eg:
KeyError: 'release-google'
File "flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "connexion/decorators/decorator.py", line 66, in wrapper
response = function(request)
File "connexion/decorators/validation.py", line 293, in wrapper
return function(request)
File "connexion/decorators/produces.py", line 38, in wrapper
response = function(request)
File "connexion/decorators/response.py", line 85, in wrapper
response = function(request)
File "connexion/decorators/decorator.py", line 42, in wrapper
response = function(request)
File "connexion/decorators/parameter.py", line 195, in wrapper
return function(**kwargs)
File "auslib/web/public/client.py", line 126, in get_update_blob
app.config["SPECIAL_FORCE_HOSTS"]))
File "auslib/blobs/apprelease.py", line 166, in getInnerXML
patches = self._getPatchesXML(localeData, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 279, in _getPatchesXML
xml = self._getSpecificPatchXML(patchKey, patchKey, patch, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 97, in _getSpecificPatchXML
url = self._getUrl(updateQuery, patchKey, patch, specialForceHosts)
File "auslib/blobs/apprelease.py", line 248, in _getUrl
url = self['fileUrls'][getFallbackChannel(updateQuery['channel'])]
Probably the simplest thing to do is to add a "release-google" entry to whichever blob is serving updates for them.
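The KeyError falls out of the fallback-channel computation: before the fileUrls lookup, Balrog strips the "-cck-<partner>" suffix from the channel, so release-google-cck-realnetworks falls back to release-google, which has no fileUrls entry. A simplified reproduction of that lookup (get_fallback_channel here is a stand-in for auslib's getFallbackChannel, not the exact source):

```python
def get_fallback_channel(channel):
    # Balrog drops the partner ("-cck-<partner>") suffix before looking the
    # channel up in a blob's fileUrls; simplified reproduction.
    return channel.split("-cck-")[0]

# A blob with no "release-google" entry in fileUrls...
file_urls = {"release": "https://download.mozilla.org/..."}

fallback = get_fallback_channel("release-google-cck-realnetworks")
# file_urls[fallback] would raise KeyError: 'release-google',
# matching the traceback above.
```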
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1379281)
We should make sure that hashes submitted to Release blobs match the hashFunction used by the blob.
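One cheap way to enforce this is a format check on the submitted hex digest: the digest lengths below are standard for each algorithm, but the validation hook itself (hash_matches_function) is a hypothetical sketch, not existing Balrog code:

```python
import re

# Expected hex-digest lengths per hash function (assumption: these cover the
# values a Release blob's hashFunction field can name).
DIGEST_LENGTHS = {"md5": 32, "sha1": 40, "sha256": 64, "sha384": 96, "sha512": 128}

def hash_matches_function(hash_value, hash_function):
    """Consistency check: the submitted hash must be hex and have the digest
    length implied by the blob's hashFunction."""
    expected = DIGEST_LENGTHS.get(hash_function.lower())
    if expected is None:
        return False
    return re.fullmatch(r"[0-9a-fA-F]{%d}" % expected, hash_value) is not None
```

A sha256 digest submitted to a blob whose hashFunction is sha512, for example, would be rejected for having the wrong length.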
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1138418)
Balrog's design has evolved a bit over time, but some rough edges have crept in, particularly around having two applications share the same library. For example, it's very difficult to have global objects (such as a database), because the two applications live in different places (and for a while, had their own database objects). We can work around that with hacks like in https://github.com/mozilla/balrog/blob/master/auslib/__init__.py, but it's not ideal.
I'm also finding that making caching happen only on the non-admin application (as part of bug 671488) is more complicated because of this.
In any case, it seems like we should be looking towards some sort of structure that allows the common parts of Balrog (db.py, blobs/, maybe AUS.py) to be in an importable library, and the app-specific parts to live in their own place. We'll have to consider what this means for deployment (particularly in cases where we need a synced deployment for admin+non-admin), and there might be better options than this too.
This is low on the priority list given the feature work on the horizon, but it would be nice to do for future maintainability.
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1109295)
While thinking about bug 1309656, I realized that it's currently possible to do something very silly and point a Firefox update rule at a non-Firefox Release. Even in a multifile update world I can't see a scenario where this would be desired, and there may be some potential for privilege escalation or for creating confusion (e.g. creating a release with product=Thunderbird, name=Firefox-$version-we're-about-to-ship) that might lead to the wrong thing being served.
This probably needs a bit more thought about whether or not it's a good idea.
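If we did decide it's worthwhile, the guard could be as simple as comparing products when a rule's mapping is set. A sketch (validate_mapping is a hypothetical helper, not existing Balrog code):

```python
def validate_mapping(rule_product, release):
    """Refuse to point a rule at a Release whose product field disagrees
    with the rule's product. Hypothetical guard, not Balrog's current code."""
    if rule_product is not None and release["product"] != rule_product:
        raise ValueError(
            "Rule product %r does not match release product %r"
            % (rule_product, release["product"])
        )

# A Firefox rule mapped to a Firefox release passes silently;
# mapping it to a Thunderbird release would raise ValueError.
validate_mapping("Firefox", {"product": "Firefox", "name": "Firefox-53.0.2-build1"})
```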
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1309877)
I noticed today that we have a whole bunch of recent releases in the latest production db dump, most of which are not currently mapped to. I think this is because we grab any releases that are mapped to by any scheduled rule change (https://github.com/mozilla/balrog/blob/752d1d548840f1753a8592af405846c6612f7f3c/scripts/manage-db.py#L137), instead of only those mapped to by incomplete scheduled rule changes.
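The fix amounts to filtering on the scheduled change's completion flag before collecting mappings. A hedged sketch, assuming each scheduled change row exposes complete and mapping fields (the field names mirror the rules scheduled-changes table, but this is not the manage-db.py code itself):

```python
def releases_to_keep(scheduled_rule_changes):
    """Collect Release names only from scheduled rule changes that have not
    yet been enacted. Sketch: assumes each row is a dict with 'complete'
    and 'mapping' keys."""
    return {
        change["mapping"]
        for change in scheduled_rule_changes
        if not change["complete"] and change["mapping"]
    }
```

Enacted (complete) scheduled changes contribute nothing, so their old mappings no longer keep stale releases alive in the dump.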
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1377460)
While talking about ways to make account rotation easier, Catlee suggested that we should keep track of the last time an account is used in Balrog, which would avoid the need to grovel through logs like we're doing now.
I originally thought we might be able to query this from the existing history tables, but I've since realized that we'd want to include GETs here as well, so we'll probably need something different for this. Maybe just a table that is updated whenever a request is made to the admin interface, refreshing the timestamp for that user?
We'd also need an API + some UI for it.
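A minimal sketch of the table idea, using an in-memory SQLite database and a hypothetical touch_user() helper called on every admin request (the table and column names are assumptions, not Balrog's schema; the upsert syntax needs SQLite 3.24+):

```python
import sqlite3
import time

# Hypothetical one-row-per-user table tracking the last admin request,
# including GETs, which history tables never record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_last_active (username TEXT PRIMARY KEY, last_seen INTEGER)"
)

def touch_user(username, now=None):
    """Upsert the user's last-seen timestamp; intended to run on every
    request to the admin interface."""
    now = int(now if now is not None else time.time())
    conn.execute(
        "INSERT INTO user_last_active (username, last_seen) VALUES (?, ?) "
        "ON CONFLICT(username) DO UPDATE SET last_seen = excluded.last_seen",
        (username, now),
    )

# Two requests from the same account: only the newest timestamp is kept.
touch_user("balrogadmin", now=1000)
touch_user("balrogadmin", now=2000)
```

An API endpoint and UI column reading from this table would then replace the current log-grovelling during account rotation.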
(Imported from https://bugzilla.mozilla.org/show_bug.cgi?id=1372250)