jupyterhub / mybinder.org-deploy
Deployment config files for mybinder.org
Home Page: https://mybinder-sre.readthedocs.io/en/latest/index.html
License: BSD 3-Clause "New" or "Revised" License
If the ingress configuration has not fully loaded before launching the test, there is a chance that the ingress configuration will reload during the build test and drop the connection, resulting in a build failure. If we reliably waited until the ingress configuration was loaded, this wouldn't happen (#181).
Currently, our kubernetes master is not highly available, causing outages whenever we do an upgrade or change autoscaler settings.
GKE recently announced support for highly available masters: https://cloud.google.com/kubernetes-engine/docs/concepts/multi-zone-and-regional-clusters. We should switch to using that!
There may be a problem with the ingress configuration. I'm seeing "HTTP 413: Payload too large" when trying to save a large-ish notebook with figures. We fixed this with an annotation before, but ingress configuration has changed recently. The annotation probably just got lost somewhere.
These should exercise common pathways that real users take, and should be fast enough to be run automatically as part of each and every deployment.
Things it should test:
I think it could be useful when we want to announce new features, or highlight repos, etc. Maybe we could even write a bot that posts links to binders on twitter. :-)
If folks are cool w/ a twitter handle, what do we grab? mybinder and binder are taken... we could do binderdevs? Ideas?
We should have either a grafana or prometheus alertmanager alert (sent to gitter perhaps) when the cluster is approaching 80-90% capacity.
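As a rough sketch of what such an alert could look like (assuming we scrape kube-state-metrics; the metric names and the 85% threshold are assumptions, not what's deployed):

```yaml
# Sketch of a Prometheus alerting rule for cluster capacity.
# Assumes kube-state-metrics is scraped; metric names and threshold are assumptions.
groups:
  - name: capacity
    rules:
      - alert: ClusterMemoryNearCapacity
        expr: >
          sum(kube_pod_container_resource_requests_memory_bytes)
          / sum(kube_node_status_allocatable_memory_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Cluster memory requests are above 85% of allocatable capacity
```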
Pending jupyterhub/binderhub#346 we will have the ability to apply our firewall and throttling logic to builds, which is the remaining location where user code runs and has the opportunity to abuse the network. However, applying the same rules as user pods would result in throttling the pushes to our image registry, which wouldn't be nice. I'm not 100% sure how to whitelist the image registry, since gcr.io has loads of IP addresses on various subnets. I suppose we could run the push through a transparent proxy that isn't throttled, then the cluster ip whitelist CIDR would prevent throttling, but that's getting a bit complicated. Any ideas?
Currently we use that for a bunch of stuff, and we no longer need to.
Heya! As our team gets bigger, we should scale by making it easier and more painless to do deploys in quick and reliable ways, through a combination of automation & expectations. Here's a straw proposal!
Whenever a change is merged into binderhub, a PR is automatically made bumping the chart version in this repo. It is the responsibility of the person merging the PR in binderhub to also merge the PR in this repo. This has a few advantages:
This doesn't mean we force people who want to merge PRs in those two repos to know how to deploy. Rather, this means we build enough reliable automation that they would be willing to and find it pleasurable enough to do a deploy when they merge the PR. Lots of automation work, but that's the only way we can scale something like this to many many people. This issue is just a place to gather consensus that this is something we want to move towards, so we can focus on the automation bits more.
Note that this is also just a straw proposal, based on my personal opinions from the other large software community I've been part of that also had to do deploys consistently (Wikimedia). Feel free to bring up objections or alternative proposals!
We should log the full path of each repo that gets launched, and when. Right now you can kind of reconstruct the repo names with https://mybinder.org/v2/gh/choldgraf/binder-stats/master?filepath=prometheus_demo.ipynb but it requires a human to look at it and make an educated guess.
Should we have a "build" counter and use the repo URL as a label?
I'd like to be able to generate a time series of builds to then calculate popular repos over the last N minutes but also just show a stream of builds as they are happening (like a twitter timeline). Is a counter the way to go? Do we need a new service that isn't prometheus?
This might be an issue for the binderhub repo proper but unsure.
Deployments should be as automated as possible. This makes it easier for lots of people to do them quickly and without making arbitrary mistakes.
I've done a bunch of 'push to deploy' setups over the last few months, and here's what I think it should be like (assuming we've switched beta to mybinder.org at this point):
- Changes are merged to the `staging` branch. Travis runs some tests to make sure things look ok.
This is the model we've successfully followed in https://github.com/berkeley-dsep-infra/datahub/ for a while now. It has given us the following advantages:
- Most of the config is in `common.yaml`, with minimal staging / prod specific ones (such as hostnames and secrets) in their specific files. This prevents the biggest enemy of having useful staging environments - drift.
It also does have some negatives:
IMO the positives outweigh the negatives, and we should JDI. It will also free up good chunks of my time to focus on other issues :)
Should we add a space on the mybinder.org landing page for a logo (or two or three) of the current sponsor(s)?
Acknowledging them would be nice I think. Doesn't have to be flashy/distracting.
Binder is not accessible from Microsoft Edge but works perfectly fine in Mozilla Firefox:
https://mybinder.org/v2/gh/jvns/pandas-cookbook/master
In Edge the page stays at the following, and the 'launch' button just refreshes it:
Create a notebook with the cell:
from IPython.display import publish_display_data
kB = 1024
# publish ~1 MB of display data so the saved notebook is large
publish_display_data({
    'application/x-fake-size-test': 'x' * 1024 * kB
})
Run that cell, and saving will fail with HTTP 413 (too big!).
We have this annotation, which I assume is meant to address exactly this, but it doesn't appear to have the desired effect.
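For reference, the fix would be something along these lines on the relevant ingress (a sketch - the annotation prefix depends on the nginx-ingress version in use, and the 64m limit is an assumption):

```yaml
# Sketch: raise the nginx-ingress request body limit so large notebook saves
# don't get rejected with 413. Annotation prefix and size value are assumptions.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 64m
```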
The biggest problem with old binder was that it was quite unreliable. To be taken seriously in the long term, we need to be as rock solid reliable as possible. One important part of that is that we should never be alerted to 'binder is down' by a human - we should always first get alerted by a machine. This alert should be visible to us even if we aren't in front of a computer / on Gitter / email at that moment - usually this is called 'paging' and uses SMS / an app / phone calls, etc.
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit is a good read on a monitoring & alerting philosophy that I'm a fan of. Highly recommend reading - it's a short read!
For us, we have two aspects we want to cover:
I think we should write rock solid checks for the following:
These are all things that users will immediately notice, and could be from a variety of symptoms. We should be paged on all three of these.
We probably will have other checks & diagnostic tools to help, but only these three should page.
In general, I think everyone who does deployments should get paged - it is part of the responsibilities you take on when you get the power to deploy. However, different people have different constraints, some deploy far more often than others, etc. We can easily have different levels of paging for various people - SMS, email, Slack / Gitter, phone calls - according to each person's preferences. We should also work out some way of co-ordinating who responds to what.
The AGU is having their annual conference the week of Dec 11: https://fallmeeting.agu.org/2017/
We should scale up the cluster a bit in order to accommodate extra traffic. @tylere would appreciate it for their talk.
@yuvipanda wanna give instructions on how to do the scaling so you don't have to be the one that does this? Maybe commands + general guidelines for how much to scale it?
We have tests to verify that a deploy succeeded (yay!). But when the tests suggest that the deploy did not succeed, we leave it in place. This has been useful, because test failures seem to most often be problems in the test at this point, rather than problems in the deployment. But maybe we should be doing a rollback to the previous state when the test fails?
Is it
It's not clear from the readme. I had assumed 1., which would suggest that it shouldn't be part of people putting together their own deployments. But @choldgraf mentioned this repo when going through a binder deployment, so now I'm not sure.
cc @willingc
It should be possible to automate sending PRs to staging with a bot / Travis CI.
It would also be nice to standardise the commit message; I think that a `git log previous...current --oneline` may be better.
Without a GitHub API token we're limited to a 60 req/hour rate limit, which we will exhaust quickly. Each launch and build requires a request.
Temporarily, restarting the pod gets it a new IP and hence new limits. However, the real fix is to put a token in there.
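As an illustration of where such a token could live (a sketch, assuming BinderHub's GitHubRepoProvider exposes an access_token trait and that the chart passes config: through to traitlets; the actual token belongs in the git-crypted secret files, not in plain text):

```yaml
# Sketch: give BinderHub a GitHub API token so repo lookups use authenticated
# rate limits. Assumes GitHubRepoProvider.access_token is the right trait;
# the token value should come from the encrypted secrets, not be committed.
binderhub:
  config:
    GitHubRepoProvider:
      access_token: "<github-api-token>"
```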
A brief SRE document would be helpful in the future. Let's brainstorm what it would include. The Google SRE book would be a good reference.
Topics:
For #59 I needed an encryption key to set up git-crypt so I could modify the secret files.
We need to document the process of getting set up with git-crypt and how we should exchange the key to people who should have it.
For this case, we tested ssh-vault, which lets you encrypt things with people's github ssh keys, and it worked pretty smoothly.
In the move to put the beta on binder-prod, a separate binder-staging project was created. I can't find any reasoning behind this, and it makes management way more complicated because admins have to be added/removed twice, logs can't be combined, billing is separate, etc.
Shouldn't there be just one GCE project for Binder, with separate clusters for prod/staging?
We should have an automated process that:
It should churn over every few days, to avoid node-related docker problems.
I'm not sure if this is something wrong in our ingress config here, or in an upstream chart, but we are again (still?) logging only internal kubernetes ips associated with requests, rather than the true origin, which is important for diagnosing issues.
Hey all - @yuvipanda and I have been mulling this over for a while now, and we think that it's time to point mybinder.org to the beta deployment. I recently changed the page so that "beta" is up there in big letters, so it'll be clear that this isn't a finished release. However, at this point the beta deployment is far more stable than the old version of binder, and ready for a soft launch.
Just opening this issue here in case @willingc or @minrk have suggestions otherwise...I feel like this should be a group decision.
Unless somebody objects, the plan is to make this switch as soon as I can figure out billing details with the new Moore grant (which are currently awaiting a response from Berkeley campus shared services). That'll probably be settled by early next week.
Not sure why they're returning 404 when the site is working... Investigate, find out and fix!
@yuvipanda Adding this as a placeholder reminder. I wasn't sure when you wanted to remove this file.
Write up a small document that lays out the responsibilities of the various people involved in a deployment.
See https://phabricator.wikimedia.org/L3 for something a little similar (although not the same)
Currently they run on the same kubernetes cluster. This means we can't 'test' a lot of cluster related operations (such as kubernetes upgrades or nginx/prometheus upgrades) in a staging cluster before pushing it to beta.
We should instead have two separate clusters with their own copy of support charts for staging and beta/prod.
This is sort of blocked on #43, since that'll allow us to do the switchover in one smooth motion instead of having to do it twice.
Links are now broken for previous users.
This is an issue to keep track of progress on the billing front. The background is that we need to switch over the billing for mybinder.org to our accounts instead of Jeremy's. I've been going back and forth with BIDS admins about figuring this out and we're making progress. I've got an email out to Google Cloud sales so hopefully they can answer some questions for us. Right now we need to figure out:
Will update this issue as new information arises.
We should add a readiness probe that verifies that the deployment is responding to requests on the public URLs (i.e. that the servers are up and kube-lego has done its business). This ought to prevent failed deploys on Travis that occur when the tests start faster than the deployment becomes ready.
I put a band-aid on #180 in the meantime, but that should go away when we get a proper check.
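A sketch of the in-cluster half of this (the path and port are assumptions, and a probe alone doesn't verify the public HTTPS URL that kube-lego fronts - that part would still need a wait step in the deploy script):

```yaml
# Sketch: readiness probe on the binder deployment so `helm upgrade --wait`
# only returns once the service answers HTTP. Path and port are assumptions.
readinessProbe:
  httpGet:
    path: /health
    port: 8585
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```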
We should add a note somewhere on the front page linking to an explanation of what data is collected when you use the mybinder instance and which part of it is accessible as a public event stream.
My guess would be that not too many people expect the general public to be able to see how often each repo was built etc.
I don't think there is anything (too) questionable that is being logged and exposed which is why it would be good to be upfront about it.
One of the coolest things about Wikimedia is the large amount of usage data it makes available publicly: Page Views API & dumps, content dumps, client usage, live recent changes stream, etc. This makes it very useful for a number of purposes - fundraising, quantifying impact, etc. By just making the data available, it enables a wide variety of people to derive whatever meaning they want from the raw data, enabling creativity & removing itself as a bottleneck.
We should take a similar approach, both because we strive to be open & because we're a small team who cannot do all the cool things that would be possible with this approach.
The simple proposal here is:
This can be our primary information source. On top of this, multiple other things can be built:
And far more. This also prevents us from being a bottleneck, and provides space for a developer community that uses binder (rather than just one that develops binder) to open up. We also determine what kinda info is emitted, making sure we preserve our users' privacy.
This issue is primarily to talk about this approach, rather than technical details. Thoughts?
We're running a public service allowing arbitrary code execution, and so have responsibilities on how secure we have to be. Here's a baseline on getting started:
Eventually, I'd love for us to spend some money getting JupyterHub & BinderHub security audited by an actual security firm!
Since we use staging instead of master, let's add a note to the README and CONTRIBUTING.md to make contributors aware of that.
@betatim pointed me to https://about.hindawi.com/opinion/a-radically-open-approach-to-developing-infrastructure-for-open-science/ which has several good points about radically open open-science infrastructure. It's a great read, and I think we are already fairly radically open. I've been thinking of how we can be more radical, and one thing would be to start making public how much we are paying our cloud providers in a fairly automated (but still private way).
The way I'm thinking of this is:
Automating this is important, since otherwise we won't have time to do it. This has several advantages:
Note that this issue is only about financial transparency of operational cloud compute costs, and no other financial transactions.
As someone providing public computational services for random unauthenticated users, we have some legal & moral responsibilities to our users & the world. We won't get them right from the start, but we have a responsibility to think about and implement all of these things.
@betatim noticed that deleting the hub pod actually kills all the user pods. I am pretty sure this wasn't old behavior, but indeed, we never set c.JupyterHub.cleanup_servers = False in z2jh! I made a PR for it, but:
A mystery, and a ticking time bomb!
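For reference, a sketch of how this could be set via the chart (assuming the z2jh chart's hub.extraConfig hook is available through the binderhub chart's jupyterhub dependency; the exact path in our values files is an assumption):

```yaml
# Sketch: stop the hub from tearing down user servers when its own pod goes away.
# Assumes hub.extraConfig is exposed at this path in our values files.
jupyterhub:
  hub:
    extraConfig: |
      c.JupyterHub.cleanup_servers = False
```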
Sometimes there's good reason to ban a repository from launching on Binder, but we should make sure that our reasons for doing so are clear, to avoid making people unnecessarily unhappy. Let's include a section about our banning policies for the mybinder.org deployment. Not sure where this should go - maybe for now just a text file like guidelines.md that we can point people to in case we need to? For now this file could just have text like:
# Jupyter Usage Guidelines
This page details some guidelines and policies that we follow at mybinder.org.
## Temporary Banning
Temporary banning means that `mybinder.org` will stop building / serving Binder sessions for
a given repository. This usually happens because of some undesired behavior with the repository, such as a large, unexpected spike in traffic.
If you are temporarily banned, contact us on the [Gitter Channel](link-to-gitter) or [Open an Issue](issues-link) to discuss how to un-ban the repository.
https://www.eff.org/dnt-policy
The remaining steps on this one are:
- add the DNT policy text to the mybinder.org-deploy repo
- mention it on docs.mybinder.org and link to the EFF text file in the deploy repo.
https://mybinder.org/v2/gh/tylere/agu2017/master?urlpath=lab - via @tylere, there will be a mention of this at the AGU 10-15 December. Estimate 100-150 people maybe.
We should keep an eye on the utilisation of the cluster. Maybe also post the command(s) here to grow/shrink it.
We originally moved the DNS to my account because our first attempt at moving to the Jupyter DNS account from Jeremy's account wasn't working. We should try again so that I am not a bottleneck on DNS!
I think @Carreau and I could do this some day at BIDS soon.
Currently prod runs on one cluster. If this cluster runs into problems (as it did last week), we have to rush to bring it back up to speed - not conducive to operational health or our uptime! Also, GKE still makes the master unavailable when upgrading or modifying parameters (such as autoscaling limits).
We should instead operate with a pair of clusters - a 'cold spare'. In building it, we should meet the following criteria:
The easiest way to do this is:
With this in place, we can do upgrades with minimal hassle & deal with full cluster outages better.
We should be limiting outbound (and possibly inbound to a lesser extent) traffic in the user pods to limit what mischief they can get up to.
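One possible shape for the outbound half (a sketch, not what's deployed: it assumes the cluster's network plugin enforces egress NetworkPolicies and that user pods carry a component: singleuser-server label):

```yaml
# Sketch: only allow user pods outbound DNS, HTTP and HTTPS.
# Assumes egress NetworkPolicy enforcement and the singleuser-server label.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: user-pod-egress
spec:
  podSelector:
    matchLabels:
      component: singleuser-server
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443
```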
Can we redirect a small percentage of users to staging (without them noticing) in some kind of A/B testing like setup? The purpose of this would be to be able to look at failure rates to detect when a change has been deployed to staging that breaks something.
Related to #66
We should have a strong privacy policy that doesn't burden us too much, while still offering users a reasonable amount of privacy.
I consider https://wikimediafoundation.org/wiki/Privacy_policy/FAQ and https://wikimediafoundation.org/wiki/Privacy_policy to be somewhat of gold standards in this case. However, they're probably too burdensome for someone like us, without the resources of the Wikimedia Foundation. As one example, we'll probably have to use Google Analytics rather than roll our own. We can still try to protect users' privacy though - by opting out of GA's data sharing, respecting DNT headers, not storing IP info past a few days, etc.
They'll never be used again, and they cause problems leading to the 'no space left on disk' error.
Current README is a mix of multiple bits of docs:
We should split this up into multiple pages that can then grow independently.