jupyterhub / mybinder.org-deploy
Deployment config files for mybinder.org
Home Page: https://mybinder-sre.readthedocs.io/en/latest/index.html
License: BSD 3-Clause "New" or "Revised" License
If the ingress configuration has not fully loaded before launching the test, there is a chance that the ingress configuration will reload during the build test and drop the connection, resulting in a build failure. If we reliably waited until the ingress configuration was loaded, this wouldn't happen (#181).
Currently, our kubernetes master is not highly available, causing outages whenever we do an upgrade or change autoscaler settings.
GKE recently announced support for highly available masters: https://cloud.google.com/kubernetes-engine/docs/concepts/multi-zone-and-regional-clusters. We should switch to using that!
There may be a problem with the ingress configuration. I'm seeing "HTTP 413: Payload too large" when trying to save a large-ish notebook with figures. We fixed this with an annotation before, but ingress configuration has changed recently. The annotation probably just got lost somewhere.
These should exercise common pathways that real users take, and should be fast enough to be run automatically as part of each and every deployment.
Things it should test:
I think it could be useful when we want to announce new features, or highlight repos, etc. Maybe we could even write a bot that posts links to binders on twitter. :-)
If folks are cool w/ a twitter handle, what do we grab? mybinder and binder are taken... we could do binderdevs? Ideas?
We should have either a grafana or prometheus alertmanager alert (sent to gitter perhaps) when the cluster is approaching 80-90% capacity.
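As a rough sketch of what such an alert could look like (assuming we scrape kube-state-metrics; the metric names and the 85% threshold are assumptions, not what's deployed):

```yaml
# Sketch of a Prometheus alerting rule for cluster capacity.
# Assumes kube-state-metrics is scraped; metric names and threshold are assumptions.
groups:
  - name: capacity
    rules:
      - alert: ClusterMemoryNearCapacity
        expr: >
          sum(kube_pod_container_resource_requests_memory_bytes)
          / sum(kube_node_status_allocatable_memory_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Cluster memory requests are above 85% of allocatable capacity
```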
Pending jupyterhub/binderhub#346 we will have the ability to apply our firewall and throttling logic to builds, which is the remaining location where user code runs and has the opportunity to abuse the network. However, applying the same rules as user pods would result in throttling the pushes to our image registry, which wouldn't be nice. I'm not 100% sure how to whitelist the image registry, since gcr.io has loads of IP addresses on various subnets. I suppose we could run the push through a transparent proxy that isn't throttled, then the cluster ip whitelist CIDR would prevent throttling, but that's getting a bit complicated. Any ideas?
Currently we use that for a bunch of stuff, and we no longer need to.
Heya! As our team gets bigger, we should scale by making it easier and more painless to do deploys in quick and reliable ways, through a combination of automation & expectations. Here's a straw proposal!
Whenever a change is merged into binderhub, a PR is automatically made bumping the chart version in this repo. It is the responsibility of the person merging the PR in binderhub to also merge the PR in this repo. This has a few advantages:
This doesn't mean we force people who want to merge PRs in those two repos to know how to deploy. Rather, this means we build enough reliable automation that they would be willing to and find it pleasurable enough to do a deploy when they merge the PR. Lots of automation work, but that's the only way we can scale something like this to many many people. This issue is just a place to gather consensus that this is something we want to move towards, so we can focus on the automation bits more.
Note that this is also just a straw proposal, based on my personal opinions from the other large software community I've been part of that also had to do deploys consistently (Wikimedia). Feel free to bring up objections or alternative proposals!
We should log the full path of each repo that gets launched, and when. Right now you can kind of reconstruct the repo names with https://mybinder.org/v2/gh/choldgraf/binder-stats/master?filepath=prometheus_demo.ipynb but it requires a human to look at it and make an educated guess.
Should we have a "build" counter and use the repo URL as a label?
I'd like to be able to generate a time series of builds to then calculate popular repos over the last N minutes but also just show a stream of builds as they are happening (like a twitter timeline). Is a counter the way to go? Do we need a new service that isn't prometheus?
This might be an issue for the binderhub repo proper but unsure.
Deployments should be as automated as possible. This makes it easier for lots of people to do them quickly and without making arbitrary mistakes.
I've done a bunch of 'push to deploy' setups over the last few months, and here's what I think it should be like (assuming we've switched beta to mybinder.org at this point):
- Changes are merged to the `staging` branch. Travis runs some tests to make sure things look ok.
This is the model we've successfully followed in https://github.com/berkeley-dsep-infra/datahub/ for a while now. It has given us the following advantages:
- Most of the config is in `common.yaml`, with minimal staging / prod specific ones (such as hostnames and secrets) in their specific files. This prevents the biggest enemy of having useful staging environments - drift.
It also does have some negatives:
IMO the positives outweigh the negatives, and we should JDI. It will also free up good chunks of my time to focus on other issues :)
Should we add a space on the mybinder.org landing page for a logo (or two or three) of the current sponsor(s)?
Acknowledging them would be nice I think. Doesn't have to be flashy/distracting.
Binder is not accessible from Microsoft Edge but works perfectly fine in Mozilla Firefox:
https://mybinder.org/v2/gh/jvns/pandas-cookbook/master
In Edge the page stays at the following, and the 'launch' button just refreshes it:
Create a notebook with the cell:
from IPython.display import publish_display_data
kB = 1024
# publish ~1 MB of display data so the saved notebook is large
publish_display_data({
    'application/x-fake-size-test': 'x' * 1024 * kB
})
Run that cell, and saving will fail with HTTP 413 (too big!).
We have this annotation, which I assume is meant to address exactly this, but it doesn't appear to have the desired effect.
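For reference, the fix would be something along these lines on the relevant ingress (a sketch - the annotation prefix depends on the nginx-ingress version in use, and the 64m limit is an assumption):

```yaml
# Sketch: raise the nginx-ingress request body limit so large notebook saves
# don't get rejected with 413. Annotation prefix and size value are assumptions.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 64m
```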
The biggest problem with old binder was that it was quite unreliable. To be taken seriously in the long term, we need to be as rock solid reliable as possible. One important part of that is that we should never be alerted to 'binder is down' by a human - we should always first get alerted by a machine. This alert should be visible to us even if we aren't in front of a computer / on Gitter / email at that moment - usually this is called 'paging' and uses SMS / an app / phone calls, etc.
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit is a good read on a monitoring & alerting philosophy that I'm a fan of. Highly recommend reading - it's a short read!
For us, we have two aspects we want to cover:
I think we should write rock solid checks for the following:
These are all things that users will immediately notice, and could be from a variety of symptoms. We should be paged on all three of these.
We probably will have other checks & diagnostic tools to help, but only these three should page.
In general, I think everyone who does deployments should get paged - it is part of the responsibilities you take on when you get the power to deploy. However, different people have different constraints, some deploy far more often than others, etc. We can easily have different levels of paging for various people - SMS, email, Slack / Gitter, phone calls - according to each person's preferences. We should also work out some way of co-ordinating who responds to what.
The AGU is having their annual conference the week of Dec 11: https://fallmeeting.agu.org/2017/
We should scale up the cluster a bit in order to accommodate extra traffic. @tylere would appreciate it for their talk.
@yuvipanda wanna give instructions on how to do the scaling so you don't have to be the one that does this? Maybe commands + general guidelines for how much to scale it?
We have tests to verify that a deploy succeeded (yay!). But when the tests suggest that the deploy did not succeed, we leave it in place. This has been useful, because test failures seem to most often be problems in the test at this point, rather than problems in the deployment. But maybe we should be doing a rollback to the previous state when the test fails?
Is it
It's not clear from the readme. I had assumed 1., which would suggest that it shouldn't be part of people putting together their own deployments. But @choldgraf mentioned this repo when going through a binder deployment, so now I'm not sure.
cc @willingc
It should be possible to automate sending PRs to staging with a bot / Travis CI.
It would also be nice to standardise the commit message; I think that a `git log previous...current --oneline` may be better.
Without a GitHub API token we're limited to a 60 req/hour rate limit, which we will exhaust quickly. Each launch and build requires a request.
Temporarily, restarting the pod gets it a new IP and hence new limits. However, the real fix is to put a token in there.
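As an illustration of where such a token could live (a sketch, assuming BinderHub's GitHubRepoProvider exposes an access_token trait and that the chart passes config: through to traitlets; the actual token belongs in the git-crypted secret files, not in plain text):

```yaml
# Sketch: give BinderHub a GitHub API token so repo lookups use authenticated
# rate limits. Assumes GitHubRepoProvider.access_token is the right trait;
# the token value should come from the encrypted secrets, not be committed.
binderhub:
  config:
    GitHubRepoProvider:
      access_token: "<github-api-token>"
```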
A brief SRE document would be helpful in the future. Let's brainstorm what it would include. The Google SRE book would be a good reference.
Topics:
For #59 I needed an encryption key to set up git-crypt so I could modify the secret files.
We need to document the process of getting set up with git-crypt and how we should exchange the key to people who should have it.
For this case, we tested ssh-vault, which lets you encrypt things with people's github ssh keys, and it worked pretty smoothly.
In the move to put the beta on binder-prod, a separate binder-staging project was created. I can't find any reasoning behind this, and it makes management way more complicated because admins have to be added/removed twice, logs can't be combined, billing is separate, etc.
Shouldn't there be just one GCE project for Binder, with separate clusters for prod/staging?
We should have an automated process that:
It should churn over every few days, to avoid node-related docker problems.
I'm not sure if this is something wrong in our ingress config here, or in an upstream chart, but we are again (still?) logging only internal kubernetes ips associated with requests, rather than the true origin, which is important for diagnosing issues.
Hey all - @yuvipanda and I have been mulling this over for a while now, and we think that it's time to point mybinder.org to the beta deployment. I recently changed the page so that "beta" is up there in big letters, so it'll be clear that this isn't a finished release. However, at this point the beta deployment is far more stable than the old version of binder, and ready for a soft launch.
Just opening this issue here in case @willingc or @minrk have suggestions otherwise...I feel like this should be a group decision.
Unless somebody objects, the plan is to make this switch as soon as I can figure out billing details with the new Moore grant (which are currently awaiting a response from Berkeley campus shared services). That'll probably be settled by early next week.
Not sure why they're returning 404 when the site is working... Investigate, find out and fix!
@yuvipanda Adding this as a placeholder reminder. I wasn't sure when you wanted to remove this file.
Write up a small document that lays out the responsibilities of the various people involved in a deployment.
See https://phabricator.wikimedia.org/L3 for something a little similar (although not the same)
Currently they run on the same kubernetes cluster. This means we can't 'test' a lot of cluster related operations (such as kubernetes upgrades or nginx/prometheus upgrades) in a staging cluster before pushing it to beta.
We should instead have two separate clusters with their own copy of support charts for staging and beta/prod.
This is sort of blocked on #43, since that'll allow us to do the switchover in one smooth motion instead of having to do it twice.
Links are now broken for previous users.
This is an issue to keep track of progress on the billing front. The background is that we need to switch over the billing for mybinder.org to our accounts instead of Jeremy's. I've been going back and forth with BIDS admins about figuring this out and we're making progress. I've got an email out to Google Cloud sales so hopefully they can answer some questions for us. Right now we need to figure out:
Will update this issue as new information arises.
We should add a readiness probe that verifies that the deployment is responding to requests on the public URLs (i.e. that the servers are up and kube-lego has done its business). This ought to prevent failed deploys on Travis that occur when the tests start faster than the deployment becomes ready.
I put a band-aid on #180 in the meantime, but that should go away when we get a proper check.
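A sketch of the in-cluster half of this (the path and port are assumptions, and a probe alone doesn't verify the public HTTPS URL that kube-lego fronts - that part would still need a wait step in the deploy script):

```yaml
# Sketch: readiness probe on the binder deployment so `helm upgrade --wait`
# only returns once the service answers HTTP. Path and port are assumptions.
readinessProbe:
  httpGet:
    path: /health
    port: 8585
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```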
We should add a note somewhere on the front page linking to an explanation of what data is collected when you use the mybinder instance and which part of it is accessible as a public event stream.
My guess would be that not too many people expect the general public to be able to see how often each repo was built etc.
I don't think there is anything (too) questionable that is being logged and exposed which is why it would be good to be upfront about it.
One of the coolest things about Wikimedia is the large amount of usage data it makes available publicly: Page Views API & dumps, content dumps, client usage, live recent changes stream, etc. This makes it very useful for a number of purposes - fundraising, quantifying impact, etc. By just making the data available, it enables a wide variety of people to derive whatever meaning they want from the raw data, enabling creativity & removing itself as a bottleneck.
We should take a similar approach, both because we strive to be open & because we're a small team who cannot do all the cool things that would be possible with this approach.
The simple proposal here is:
This can be our primary information source. On top of this, multiple other things can be built:
And far more. This also prevents us from being a bottleneck, and provides space for a developer community that uses binder (rather than just one that develops binder) to open up. We also determine what kinda info is emitted, making sure we preserve our users' privacy.
This issue is primarily to talk about this approach, rather than technical details. Thoughts?
We're running a public service allowing arbitrary code execution, and so have responsibilities on how secure we have to be. Here's a baseline on getting started:
Eventually, I'd love for us to spend some money getting JupyterHub & BinderHub security audited by an actual security firm!
Since we use staging instead of master, let's add a note to the README and CONTRIBUTING.md to make contributors aware of that.
@betatim pointed me to https://about.hindawi.com/opinion/a-radically-open-approach-to-developing-infrastructure-for-open-science/ which has several good points about radically open open-science infrastructure. It's a great read, and I think we are already fairly radically open. I've been thinking of how we can be more radical, and one thing would be to start making public how much we are paying our cloud providers in a fairly automated (but still private way).
The way I'm thinking of this is:
Automating this is important, since otherwise we won't have time to do it. This has several advantages:
Note that this issue is only about financial transparency of operational cloud compute costs, and no other financial transactions.
As someone providing public computational services for random unauthenticated users, we have some legal & moral responsibilities to our users & the world. We won't get them right from the start, but we have a responsibility to think about and implement all of these things.
@betatim noticed that deleting the hub pod actually kills all the user pods. I am pretty sure this wasn't old behavior, but indeed, we never set c.JupyterHub.cleanup_servers = False in z2jh! I made a PR for it, but:
A mystery, and a ticking time bomb!
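For reference, a sketch of how this could be set via the chart (assuming the z2jh chart's hub.extraConfig hook is available through the binderhub chart's jupyterhub dependency; the exact path in our values files is an assumption):

```yaml
# Sketch: stop the hub from tearing down user servers when its own pod goes away.
# Assumes hub.extraConfig is exposed at this path in our values files.
jupyterhub:
  hub:
    extraConfig: |
      c.JupyterHub.cleanup_servers = False
```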
Sometimes there's good reason to ban a repository from launching on Binder, but we should make sure that our reasons for doing so are clear, to avoid making people unnecessarily unhappy. Let's include a section about our banning policies for the mybinder.org deployment. Not sure where this should go - maybe for now just a text file like guidelines.md that we can point people to in case we need to? For now this file could just have text like:
# Jupyter Usage Guidelines
This page details some guidelines and policies that we follow at mybinder.org.
## Temporary Banning
Temporary banning means that `mybinder.org` will stop building / serving Binder sessions for
a given repository. This usually happens because of some undesired behavior with the repository, such as a large, unexpected spike in traffic.
If you are temporarily banned, contact us on the [Gitter Channel](link-to-gitter) or [Open an Issue](issues-link) to discuss how to un-ban the repository.
https://www.eff.org/dnt-policy
The remaining steps on this one are:
- add the DNT policy text to the mybinder.org-deploy repo
- mention it on docs.mybinder.org and link to the EFF text file in the deploy repo.
https://mybinder.org/v2/gh/tylere/agu2017/master?urlpath=lab - via @tylere, there will be a mention of this at the AGU 10-15 December. Estimate 100-150 people maybe.
We should keep an eye on the utilisation of the cluster. Maybe also post the command(s) here to grow/shrink it.
We originally moved the DNS to my account because our first attempt at moving to the Jupyter DNS account from Jeremy's account wasn't working. We should try again so that I am not a bottleneck on DNS!
I think @Carreau and I could do this some day at BIDS soon.
Currently prod runs on one cluster. If this cluster runs into problems (as it did last week), we have to rush to bring it back up to speed - not conducive to operational health or our uptime! Also, GKE still makes the master unavailable when upgrading or modifying parameters (such as autoscaling limits).
We should instead operate with a pair of clusters - a 'cold spare'. In building it, we should meet the following criteria:
The easiest way to do this is:
With this in place, we can do upgrades with minimal hassle & deal with full cluster outages better.
We should be limiting outbound (and possibly inbound to a lesser extent) traffic in the user pods to limit what mischief they can get up to.
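One possible shape for the outbound half (a sketch, not what's deployed: it assumes the cluster's network plugin enforces egress NetworkPolicies and that user pods carry a component: singleuser-server label):

```yaml
# Sketch: only allow user pods outbound DNS, HTTP and HTTPS.
# Assumes egress NetworkPolicy enforcement and the singleuser-server label.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: user-pod-egress
spec:
  podSelector:
    matchLabels:
      component: singleuser-server
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443
```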
Can we redirect a small percentage of users to staging (without them noticing) in some kind of A/B testing like setup? The purpose of this would be to be able to look at failure rates to detect when a change has been deployed to staging that breaks something.
Related to #66
We should have a strong privacy policy that doesn't burden us too much, while still offering users a reasonable amount of privacy.
I consider https://wikimediafoundation.org/wiki/Privacy_policy/FAQ and https://wikimediafoundation.org/wiki/Privacy_policy to be somewhat of gold standards in this case. However, they're probably too burdensome for someone like us, without the resources of the Wikimedia Foundation. As one example, we'll probably have to use Google Analytics rather than roll our own. We can still try to protect users' privacy though - by opting out of GA's data sharing, respecting DNT headers, not storing IP info past a few days, etc.
They'll never be used again, and they cause problems leading to the 'no space left on disk' error.
Current README is a mix of multiple bits of docs:
We should split this up into multiple pages that can then grow independently.