tsunami's People

Contributors

akshayknarayan, brennie, gkbrk, hexjelly, jonhoo, ms705, simonrw, vitorenesduarte, znewman01

tsunami's Issues

VPC Support

Amazon EC2 has a notion of a Virtual Private Cloud (see also the VPC intro). tsunami currently just spawns all instances in the default VPC, and uses security groups to ensure that the resulting instances can only talk to each other. Instead, we should set up a new VPC for each job, spawn the instances inside it, and set up the firewall so that the only traffic allowed in is SSH.

Implementing this will resemble what we did for security groups: first create a VPC with a semi-random name, then use the resulting VPC ID when launching the spot requests. Note that the VPC IP range will also need to be included in the security group setup, and that the VPC should be torn down after the job has finished.
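
In sketch form, assuming the same synchronous rusoto_ec2 calls used elsewhere in the crate (the CIDR block is an arbitrary illustrative choice):

let mut req = rusoto_ec2::CreateVpcRequest::default();
req.cidr_block = "10.0.0.0/16".to_string();
let res = ec2.create_vpc(&req).context("failed to create VPC")?;
let vpc_id = res
    .vpc
    .and_then(|v| v.vpc_id)
    .expect("created VPC should have an id");
// ... use `vpc_id` when launching the spot requests, and include
// 10.0.0.0/16 in the security group rules ...

// after the job has finished:
let mut req = rusoto_ec2::DeleteVpcRequest::default();
req.vpc_id = vpc_id;
ec2.delete_vpc(&req).context("failed to tear down VPC")?;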

Increase ssh session timeout

I have something similar to the ping example, where every region continuously pings every other region for 30 minutes (with ping -w 1800). This experiment failed with the error "the connection was terminated", and I can see this string in ./target/debug/deps/libopenssh-9b9063468e4fbdfc.rlib.

The internet suggests that this is a timeout in ssh, which can be prevented by setting the ServerAliveInterval ssh option to 0 (the client would then never time out the connection). I was going to open a PR in openssh to allow configuring this option (and maybe also ServerAliveCountMax) in openssh::SessionBuilder, but I believe this is not enough, since the session is created internally in tsunami. Given that, I'm not sure how to proceed. Do you have ideas? Thanks.
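
For reference, this is the underlying OpenSSH setting; in ssh_config(5) terms, a ServerAliveInterval of 0 means the client never sends keepalive probes, and so never tears the session down for lack of a response:

# ~/.ssh/config (or `-o ServerAliveInterval=0` on the command line)
Host *
    ServerAliveInterval 0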

Backoff when polling

Every once in a while, it takes AWS a little bit of time to set up an instance.

In this case (using #23, but it looks like it'd apply at HEAD too), Tsunami hammers the AWS API, which eventually starts returning errors once I exhaust my quota.

Tsunami should retry using some kind of backoff to prevent overusing the AWS API.
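
A minimal sketch of such a retry loop, assuming the crate's existing synchronous rusoto calls; the one-second floor, doubling factor, and two-minute cap are illustrative, and real code would probably only back off on throttling errors:

let mut wait = Duration::from_secs(1);
let result = loop {
    // `req` is the existing describe request for the pending spot instances
    match ec2.describe_spot_instance_requests(&req) {
        Ok(res) => break res,
        Err(_) => {
            // back off exponentially so we stop hammering the API
            thread::sleep(wait);
            wait = std::cmp::min(wait * 2, Duration::from_secs(120));
        }
    }
};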

Avoid arbitrary SSH timeout

We currently keep trying to establish an SSH connection to each instance for up to 2 minutes:

tsunami/src/ssh.rs

Lines 17 to 28 in 379acb7

// TODO: instead of max time, keep trying as long as instance is still active
let start = Instant::now();
let tcp = loop {
    match TcpStream::connect_timeout(&addr, Duration::from_secs(3)) {
        Ok(s) => break s,
        Err(_) if start.elapsed() <= Duration::from_secs(120) => {
            thread::sleep(Duration::from_secs(1));
        }
        Err(e) => Err(Error::from(e).context("failed to connect to ssh port"))?,
    }
};

This is a pretty arbitrary limit, and EC2 has been known to occasionally have spawn times longer than this. The right thing to do instead is to keep retrying until the instance is no longer marked as running. Should that happen, there is no point in continuing to retry. Theoretically there could be an issue with the firewall rules that prevents connection, so we may still want a high eventual timeout (~5 minutes).
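
Concretely, the loop above might become something like this, where `instance_is_running` is a hypothetical helper wrapping a rusoto describe_instances call, and the five-minute cap is the eventual timeout suggested above:

let start = Instant::now();
let tcp = loop {
    match TcpStream::connect_timeout(&addr, Duration::from_secs(3)) {
        Ok(s) => break s,
        Err(e) => {
            // stop retrying if the instance has died,
            // or after the firewall fallback timeout
            if !instance_is_running(&ec2, &instance_id)?
                || start.elapsed() > Duration::from_secs(300)
            {
                Err(Error::from(e).context("failed to connect to ssh port"))?;
            }
            thread::sleep(Duration::from_secs(1));
        }
    }
};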

Clean up AWS placement groups

Because they tend to build up over time when not deleted, especially since there's one per spot request rather than just one per region...
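
The deletion call itself would presumably mirror the existing security-group cleanup, along these lines (a sketch assuming rusoto_ec2's synchronous API; `group_name` is the name the placement group was created under):

let mut req = rusoto_ec2::DeletePlacementGroupRequest::default();
req.group_name = group_name;
ec2.delete_placement_group(&req)
    .context("failed to clean up placement group")?;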

Asynchronous SSH sessions

The ssh crate we're currently using, ssh2, does not have support for asynchronous sessions. This causes us to jump through some arguably unnecessary hoops, such as using rayon (the `.par_iter_mut()` call) to connect to instances in parallel.

We should instead switch to thrussh, a pure-Rust SSH client implementation, which does support asynchronous operation. That would let us connect to all the machines in parallel, and execute commands, without needing to spin up lots of threads (we could still use a threadpool for managing the connections, but it seems less important). This is now possible since we authenticate with a PEM private key file rather than through the SSH agent (which thrussh doesn't support).
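
The shape of the change might look something like this, using futures 0.1-style combinators; `async_connect` is a hypothetical wrapper around an asynchronous SSH client, not an existing function:

use futures::future;
use futures::Future;

// issue all connection attempts, then drive them to completion together
let pending: Vec<_> = machines
    .iter()
    .map(|m| async_connect(&m.public_ip, &private_key_path))
    .collect();
let sessions = future::join_all(pending).wait()?;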

Provide access to standard error output

Currently, the crate only exposes the stdout of a process through Session::cmd. However, it'd be nice to also have some mechanism for easily getting both stdout and stderr. The way to do this would probably be to have a builder of some sort for a command, where the user can explicitly say what they want to happen to standard output and standard error.
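
Something along these lines, perhaps; every name below is illustrative, and none of them exist in tsunami today:

let output = session
    .command("cargo build")
    .capture_stdout()
    .capture_stderr()
    .run()?;
println!("stdout:\n{}", output.stdout);
println!("stderr:\n{}", output.stderr);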

Bring back cleanup code

Code has been written for cleaning up the security group and keypair generated by tsunami for every run, but it is currently commented out:

tsunami/src/lib.rs

Lines 445 to 458 in 379acb7

/*
debug!(log, "cleaning up temporary resources");
trace!(log, "cleaning up temporary security group");
// clean up security groups and keys
let mut req = rusoto_ec2::DeleteSecurityGroupRequest::default();
req.group_id = Some(group_id);
ec2.delete_security_group(&req)
    .context("failed to clean up security group")?;
trace!(log, "cleaning up temporary keypair");
let mut req = rusoto_ec2::DeleteKeyPairRequest::default();
req.key_name = key_name;
ec2.delete_key_pair(&req)
    .context("failed to clean up key pair")?;
*/

This is because it turns out that EC2 won't let you delete resources that are associated with instances that have not yet been terminated. We ran into this during the live-coding session at https://youtu.be/66INYb73yXo?t=5716. The problem is specifically that although we send a termination request for all the instances, they may not yet have terminated by the time we try to do cleanup (and the deletion of the security group thus fails). The way around this would be to keep checking on all the instances until they have all been terminated, and only then do cleanup.
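
In sketch form, using the rusoto_ec2 types already in use (here `instance_ids` holds the instances we asked EC2 to terminate):

loop {
    let mut req = rusoto_ec2::DescribeInstancesRequest::default();
    req.instance_ids = Some(instance_ids.clone());
    let res = ec2
        .describe_instances(&req)
        .context("failed to check instance state")?;
    let all_terminated = res
        .reservations
        .iter()
        .flatten()
        .flat_map(|r| r.instances.iter().flatten())
        .all(|i| {
            i.state
                .as_ref()
                .and_then(|s| s.name.as_ref())
                .map(|n| n == "terminated")
                .unwrap_or(false)
        });
    if all_terminated {
        break;
    }
    thread::sleep(Duration::from_secs(1));
}
// now the security group and keypair can safely be deleted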

Sync up with new, async rusoto

The upcoming rusoto 0.32.0 release brings full async support to rusoto. The API changes are outlined in the rusoto migration guide. It's probably a good idea to start investigating what changes we need to make to tsunami in light of them.

It also brings some much-desired fixes to hyper, so that we can (in theory) get rid of this hack:

tsunami/src/lib.rs

Lines 435 to 439 in 379acb7

while let Err(e) = ec2.terminate_instances(&termination_req) {
    let msg = format!("{}", e);
    if msg.contains("Pooled stream disconnected") || msg.contains("broken pipe") {
        trace!(log, "retrying instance termination");
        continue;
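
After migrating, the call site might reduce to something like this sketch, assuming rusoto 0.32's by-value requests and a blocking .sync() helper on the returned future:

ec2.terminate_instances(termination_req)
    .sync()
    .context("failed to terminate instances")?;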

Use `Path::display()` to display a path

I'm just watching part 2 of the video where you add logging, and I noticed you log a path in debug format because Path doesn't implement Display. The reason is that most OSes make no guarantee that a path contains valid UTF-8. But you can use path.display() to convert it to a displayable value. This is super minor, but I thought I'd mention it :)
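
For example, with the slog macros the crate already uses (the message and key are made up):

// `%` tells slog to log the value with its Display implementation
debug!(log, "reading config"; "path" => %path.display());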

The videos are great by the way, and super informative 👍

Separate run_as into many smaller components

Currently, run_as is quite long, and has many relatively independent pieces to it (see also this comment), especially on the async branch. Splitting that function into many smaller ones would make the code much easier to read and manage! It shouldn't be too bad to take most of the closures and split them out into separate private &self functions on TsunamiBuilder, so we should probably do that!

Enable user to supply region and credentials provider

At the moment we hard-code the credentials provider and instance region:

tsunami/src/lib.rs

Lines 249 to 250 in 379acb7

    EnvironmentProvider,
    Region::UsEast1,

While that makes it easy to get up and running, some users may need to authenticate through more elaborate means (e.g., through sts::AssumeRole) as discussed here. Similarly, not all users will want to use the US-EAST-1 region for obvious reasons. We should provide a way for the user to specify these options in the builder, perhaps also including the connector to use.
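
A hypothetical builder surface for this; neither setter exists yet, and the names are made up:

use rusoto_core::{EnvironmentProvider, Region};

let mut b = TsunamiBuilder::default();
b.set_region(Region::EuWest1);
b.set_credential_provider(EnvironmentProvider);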

Provide a type to `run` closure that does not require unwrapping `ssh`

The closure passed to TsunamiBuilder::run is currently handed a bunch of Machine items. Each one has an ssh field which provides an SSH connection to the host in question. However, the type of that connection is currently Option<Session>, even though it is always Some by the time run is invoked. This is because we use the same type to keep track of a host before we've connected to it. Ideally, a different type should be exposed to run so that the closure doesn't need to unwrap unnecessarily.
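
One option is a second, post-connection type along these lines (a sketch; the name and exact field set are illustrative):

pub struct ConnectedMachine {
    /// Guaranteed to be connected by the time the `run` closure is invoked.
    pub ssh: Session,
    pub instance_type: String,
    pub private_ip: String,
    pub public_dns: String,
}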

Provide a nicer interface for executing remote commands

We currently only have cmd (and cmd_raw), which take a single string and execute it in whatever shell is running on the remote host. While this works, it is pretty error-prone, as users must manually handle escaping, whitespace argument splitting, etc. It'd be great if we could provide a higher-level "command constructor" or builder that would allow the user to build a command from separate arguments, and that would take care of all the escaping business.

https://github.com/mit-pdos/distributary/blob/6bb5a8c5833237aa1f589aa4f2d83d5958431dfb/benchmarks/vote/orchestrator.rs#L571-L600 may be a good place to start for inspiration.
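
For instance, something shaped like std::process::Command (all names hypothetical):

let out = session
    .command("grep")
    .arg("a pattern with spaces") // quoted/escaped automatically
    .arg("/var/log/app.log")
    .output()?;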

Handle issues that occur while requesting spot instances

Tsunami currently panics if requesting a spot instance fails. That isn't great, because user code has limited ability to recover from this situation, and we leave running any instances whose spot requests did go through.

This should be fixed by breaking from the loop if creating a spot request fails, waiting for any spot requests that were created to start an instance, and then exiting (which would also terminate the instances) somewhere around here.
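
In sketch form, where `launch_spot_request` and `wait_for_instances` are hypothetical stand-ins for the existing loop body and wait logic:

let mut spot_req_ids = Vec::new();
let mut launch_error = None;
for setup in &machine_setups {
    match launch_spot_request(&ec2, setup) {
        Ok(id) => spot_req_ids.push(id),
        Err(e) => {
            // stop issuing further requests as soon as one fails
            launch_error = Some(e);
            break;
        }
    }
}
if let Some(e) = launch_error {
    // let the requests that did go through start their instances, then bail;
    // the normal teardown path will terminate those instances.
    wait_for_instances(&ec2, &spot_req_ids)?;
    return Err(e.context("failed to request spot instances").into());
}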
