jonhoo / tsunami
Rust crate for running one-off cloud jobs
License: Apache License 2.0
Amazon EC2 has a notion of a Virtual Private Cloud (see also the VPC intro). tsunami currently just spawns all instances in the default VPC, and uses security groups to ensure that the resulting instances can only talk to each other. Instead, we should set up a new VPC for our job, spawn the instances inside of it, and set up the firewall such that the only traffic allowed in is SSH.
Implementing this will resemble what we did for security groups; specifically, first create a VPC with some semi-random name, and then use the resulting VPC ID when launching the spot requests. Note that the VPC IP range will also need to be included in the security group setup, and that the VPC should be torn down after the job has finished.
I have something similar to the ping example where every region is continuously pinging every other region for 30 minutes (with ping -w 1800). This experiment failed with the error "the connection was terminated", and I can see this string in ./target/debug/deps/libopenssh-9b9063468e4fbdfc.rlib.
The Internet suggests that this was a timeout in ssh which could be prevented by setting the ServerAliveInterval ssh option to 0 (it would never time out in this case). I was going to open a PR in openssh to allow configuring this option (and maybe also ServerAliveCountMax) in openssh::SessionBuilder, but I believe this is not enough since the session is created internally in tsunami. Given that, I'm not sure how to proceed. Do you have ideas? Thanks.
Every once in a while, it takes AWS a little bit of time to set up an instance.
In this case (using #23 but looks like it'd apply at HEAD too), Tsunami hammers the AWS API, which eventually gives errors back once I exhaust my quota.
Tsunami should retry using some kind of backoff to prevent overusing the AWS API.
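A minimal sketch of what such a backoff could look like, using only the standard library. The names backoff_delay and retry_with_backoff are hypothetical and not part of tsunami, and the closure stands in for whatever AWS API call is being retried.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Calls `op` until it succeeds or `max_attempts` calls have been made
/// (always at least once), sleeping with exponential backoff in between.
/// `op` is a placeholder for the real API request.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a base of 100 ms and a cap of 5 s, the waits would be 100 ms, 200 ms, 400 ms and so on until they reach the cap. Adding random jitter to each delay is a common refinement, left out here to keep the sketch dependency-free.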
We currently keep trying to establish an SSH connection to each instance for up to 2 minutes:
Lines 17 to 28 in 379acb7
This is a pretty arbitrary limit, and EC2 has been known to occasionally have spawn times longer than this. The right thing to do instead is to keep retrying until the instance is no longer marked as running. Should that happen, there is no point in continuing to retry. Theoretically there could be an issue with the firewall rules that prevents connection, so we may still want a high eventual timeout (~5 minutes).
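The proposed policy could be sketched roughly as follows. Everything here is hypothetical: connect_while_running, ConnectError, and the two closures (one asking EC2 whether the instance is still running, one attempting the SSH connection) are stand-ins rather than existing tsunami APIs.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum ConnectError {
    /// The instance stopped being reported as running, so retrying is pointless.
    InstanceGone,
    /// The instance is still running, but the overall deadline has passed.
    TimedOut,
}

/// Retries `try_connect` for as long as `is_running` reports true,
/// giving up once `deadline` (e.g. ~5 minutes) has elapsed.
fn connect_while_running<S>(
    deadline: Duration,
    pause: Duration,
    mut is_running: impl FnMut() -> bool,
    mut try_connect: impl FnMut() -> Option<S>,
) -> Result<S, ConnectError> {
    let start = Instant::now();
    loop {
        if let Some(session) = try_connect() {
            return Ok(session);
        }
        if !is_running() {
            return Err(ConnectError::InstanceGone);
        }
        if start.elapsed() >= deadline {
            return Err(ConnectError::TimedOut);
        }
        sleep(pause);
    }
}
```

Separating the two failure cases lets the caller report a terminated instance differently from a likely firewall problem.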
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html#fixed-duration-spot-instances
Since the feature isn't available to new users (and will be turned off in December 2022 for all), aws::LaunchMode::DefinedDuration will eventually have to be removed.
Might be worth keeping it around through the grace period.
Because they tend to build up over time when not deleted. Especially since there's one per spot request, rather than just one per region...
The default type of spot instance request is one-time, which means that you shouldn't need to cancel it once it was created -- it will only be fulfilled once.
The ssh crate we're currently using, ssh2, does not have support for asynchronous sessions. This causes us to jump through some arguably unnecessary hoops such as using rayon to connect to instances in parallel:
Line 528 in 379acb7
We should instead switch to thrussh, a pure-Rust SSH client implementation, which does support asynchronous operation. That would let us connect to all the machines in parallel, and execute commands, without needing to spin up lots of threads (we could still use a threadpool for managing the connections, but it seems less important). This is now possible since we authenticate with a PEM private key file rather than through the SSH agent (which thrussh doesn't support).
Currently, the crate only exposes the stdout of a process through Session::cmd. However, it'd be nice to also have some other mechanism for easily getting both STDOUT and STDERR. The way to do this would probably be to have a builder of some sort for a command where the user can explicitly say what they want to happen to standard out and to standard error.
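For reference, the standard library's builder for local processes already has this shape, and might serve as a model for a remote equivalent. The sketch below runs a local sh command, not a remote one over SSH, and assumes a Unix-like host where sh is available. The helper name run_capturing_both is made up for illustration.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Runs `script` through a local `sh` and returns (stdout, stderr).
/// The caller states explicitly what happens to each stream via `Stdio`.
fn run_capturing_both(script: &str) -> io::Result<(String, String)> {
    let output = Command::new("sh")
        .arg("-c")
        .arg(script)
        .stdin(Stdio::null())    // no input is forwarded
        .stdout(Stdio::piped())  // collect standard out
        .stderr(Stdio::piped())  // collect standard error separately
        .output()?;
    Ok((
        String::from_utf8_lossy(&output.stdout).into_owned(),
        String::from_utf8_lossy(&output.stderr).into_owned(),
    ))
}
```

A remote builder could offer the same three choices per stream (discard, inherit, collect), with collecting only stdout as the default to match the current behavior.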
Azure now supports spot instances: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/spot-cli
Should be straightforward to add support for it.
This issue is for tracking adding a GCP Launcher to this crate.
As @znewman01 suggested in #23 it's probably possible to do using this: https://github.com/Byron/google-apis-rs
Code has been written for cleaning up the security group and keypair generated by tsunami for every run, but it is currently commented out:
Lines 445 to 458 in 379acb7
This is because it turns out that EC2 won't let you delete resources that are associated with instances that have not yet been terminated. We ran into this during the live-coding session at https://youtu.be/66INYb73yXo?t=5716. The problem is specifically that although we send a termination request for all the instances, they may not yet have terminated by the time we try to do cleanup (and the deletion of the security group thus fails). The way around this would be to keep checking on all the instances until they have all been terminated, and only then do cleanup.
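The "keep checking until everything has terminated" step could be sketched like this. The function name and the is_terminated closure (which would wrap an EC2 describe-instances call in practice) are assumptions for illustration, not existing tsunami code.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Polls until every instance id in `pending` is reported as terminated.
/// Returns `Ok(())` when cleanup is safe, or the ids still alive after
/// `max_polls` rounds. `is_terminated` is a placeholder for the real check.
fn wait_until_terminated(
    mut pending: Vec<String>,
    max_polls: usize,
    pause: Duration,
    mut is_terminated: impl FnMut(&str) -> bool,
) -> Result<(), Vec<String>> {
    for _ in 0..max_polls {
        // Drop every instance that has finished terminating.
        pending.retain(|id| !is_terminated(id.as_str()));
        if pending.is_empty() {
            return Ok(());
        }
        sleep(pause);
    }
    if pending.is_empty() { Ok(()) } else { Err(pending) }
}
```

Only on Ok(()) would the caller go on to delete the security group and keypair. On Err, the remaining ids could be logged so the user knows what was left behind.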
The upcoming rusoto 0.32.0 release brings full async support to rusoto. The API changes are outlined in the rusoto migration guide. It's probably a good idea to start investigating what changes we need to make to tsunami in light of those changes.
It also brings some much-desired fixes to hyper, so that we can (in theory) get rid of this hack:
Lines 435 to 439 in 379acb7
I'm just watching part 2 of the video where you add logging, and noticed you log a path in debug format because Path doesn't implement Display. The reason is that most OSes make no guarantee that a path contains valid UTF-8. But you can use path.display() to convert it to a displayable value. This is super-minor, but thought I'd mention it :)
The videos are great by the way and super informative.
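A small illustration of the suggestion above. The function name and the message text are made up for the example.

```rust
use std::path::Path;

/// Formats a log message containing a path. `Path::display()` returns a
/// helper that implements `Display`, replacing any non-UTF-8 bytes lossily.
fn describe_key_path(path: &Path) -> String {
    // With `{:?}` instead, the path would be printed quoted and escaped.
    format!("wrote private key to {}", path.display())
}
```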
How hard do you think it would be to add support for Windows instances, either by yourself or by someone else?
Currently, run_as is quite long, and has many relatively independent pieces to it (see also this comment), especially on the async branch. Splitting that function into many smaller ones would make the code much easier to read and manage! It shouldn't be too bad to take most of the closures and split them out into separate private &self functions on TsunamiBuilder, so we should probably do that!
I have a horrible hacky implementation on #23: znewman01@8c91e80, which I can clean up (at some point).
At the moment we hard-code the credentials provider and instance region:
Lines 249 to 250 in 379acb7
While that makes it easy to get up and running, some users may need to authenticate through more elaborate means (e.g., through sts::AssumeRole) as discussed here. Similarly, not all users will want to use the US-EAST-1 region for obvious reasons. We should provide a way for the user to specify these options in the builder, perhaps also including the connector to use.
The closure passed to TsunamiBuilder::run is currently handed a bunch of Machine items. Each one has an ssh field which provides an SSH connection to the host in question. However, the type of that connection is currently Option<Session>, even though it is always Some when run is invoked. This is because we use the same type to keep track of a host before we've connected to it. Ideally a different type should be exposed to run so that it doesn't need to unwrap unnecessarily.
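One way this could look is a pair of types, where only the connected one carries a session. This is a sketch under assumptions: PendingMachine, the simplified Session and Machine structs, and the connect method are illustrative stand-ins, not the crate's actual definitions.

```rust
/// Stand-in for the real SSH session type.
struct Session {
    host: String,
}

/// A host that has been launched but not yet connected to (internal use).
struct PendingMachine {
    public_dns: String,
}

/// What user code would receive: the session is always present,
/// so there is no `Option` to unwrap.
struct Machine {
    public_dns: String,
    ssh: Session,
}

impl PendingMachine {
    /// Consumes the pending host and yields a connected `Machine`.
    /// `open` is a placeholder for the real connection logic.
    fn connect(
        self,
        open: impl FnOnce(&str) -> Result<Session, String>,
    ) -> Result<Machine, String> {
        let ssh = open(&self.public_dns)?;
        Ok(Machine { public_dns: self.public_dns, ssh })
    }
}
```

Because connect takes self by value, a host cannot be both pending and connected at the same time, and the compiler enforces that the closure given to run only ever sees connected machines.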
We currently only have cmd (and cmd_raw), which take a single string and execute it on whatever shell is running on the remote host. While this works, it is pretty error-prone as users must manually do escaping, deal with whitespace argument splitting, etc. It'd be great if we could provide a higher level "command constructor" or builder that would allow the user to build a command from separate arguments, and that would take care of all the escaping business.
https://github.com/mit-pdos/distributary/blob/6bb5a8c5833237aa1f589aa4f2d83d5958431dfb/benchmarks/vote/orchestrator.rs#L571-L600 may be a good place to start for inspiration.
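A rough sketch of such a builder, assuming the remote shell is POSIX-compatible. CommandBuilder and shell_quote are hypothetical names, and the quoting rule used here (wrap in single quotes, rewrite each embedded single quote as '\'') is the usual POSIX idiom rather than anything tsunami currently does.

```rust
/// Quotes one argument for a POSIX shell. Plain words are passed through;
/// everything else is wrapped in single quotes, with embedded single
/// quotes rewritten as `'\''`.
fn shell_quote(arg: &str) -> String {
    let plain = !arg.is_empty()
        && arg
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || "-_./".contains(c));
    if plain {
        arg.to_string()
    } else {
        format!("'{}'", arg.replace('\'', r"'\''"))
    }
}

/// Builds a command line from separate arguments, escaping each one.
struct CommandBuilder {
    parts: Vec<String>,
}

impl CommandBuilder {
    fn new(program: &str) -> Self {
        CommandBuilder { parts: vec![shell_quote(program)] }
    }

    fn arg(mut self, arg: &str) -> Self {
        self.parts.push(shell_quote(arg));
        self
    }

    /// The final string, suitable for passing to something like `cmd`.
    fn build(self) -> String {
        self.parts.join(" ")
    }
}
```

For example, CommandBuilder::new("echo").arg("hello world").build() yields echo 'hello world', so the remote shell sees a single argument instead of two.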
Tsunami currently panics if requesting a spot instance fails. That isn't great because user code has limited ability to recover from this situation, and we end up leaving instances from spot requests that did go through running.
This should be fixed by breaking from the loop if creating a spot request fails, waiting for any spot requests that were created to start an instance, and then exiting (which would also terminate the instances) somewhere around here.
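The control flow described above might look something like the following. The function name request_all and the request closure are placeholders for the real spot request call. The point is only that a failure stops the loop while still handing back whatever was already created.

```rust
/// Issues one request per region, stopping at the first failure.
/// Returns every request that did succeed, plus the error (if any),
/// so the caller can still wait for and terminate what was created.
fn request_all<R, E>(
    regions: &[&str],
    mut request: impl FnMut(&str) -> Result<R, E>,
) -> (Vec<R>, Option<E>) {
    let mut created = Vec::new();
    for &region in regions {
        match request(region) {
            Ok(r) => created.push(r),
            // Break out instead of panicking; earlier requests are kept.
            Err(e) => return (created, Some(e)),
        }
    }
    (created, None)
}
```

The caller would then wait on and clean up everything in the returned list regardless of whether an error came back, and surface the error to user code afterwards.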