tsunami's People

Contributors

akshayknarayan, brennie, gkbrk, hexjelly, jonhoo, ms705, simonrw, vitorenesduarte, znewman01

tsunami's Issues

VPC Support

Amazon EC2 has a notion of a Virtual Private Cloud (see also the VPC intro). tsunami currently just spawns all instances in the default VPC, and uses security groups to ensure that the resulting instances can only talk to each other. Instead, we should set up a new VPC for each job, spawn the instances inside it, and set up the firewall so that the only traffic allowed in is SSH.

Implementing this will resemble what we did for security groups: first create a VPC with a semi-random name, then use the resulting VPC ID when launching the spot requests. Note that the VPC IP range will also need to be included in the security group setup, and that the VPC should be torn down after the job has finished.
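
In sketch form, assuming the same synchronous rusoto_ec2 calls used elsewhere in the crate (the CIDR block is an arbitrary illustrative choice):

let mut req = rusoto_ec2::CreateVpcRequest::default();
req.cidr_block = "10.0.0.0/16".to_string();
let res = ec2.create_vpc(&req).context("failed to create VPC")?;
let vpc_id = res
    .vpc
    .and_then(|v| v.vpc_id)
    .expect("created VPC should have an id");
// ... use `vpc_id` when launching the spot requests, and include
// 10.0.0.0/16 in the security group rules ...

// after the job has finished:
let mut req = rusoto_ec2::DeleteVpcRequest::default();
req.vpc_id = vpc_id;
ec2.delete_vpc(&req).context("failed to tear down VPC")?;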

Increase ssh session timeout

I have something similar to the ping example, where every region continuously pings every other region for 30 minutes (with ping -w 1800). This experiment failed with the error "the connection was terminated", and I can see this string in ./target/debug/deps/libopenssh-9b9063468e4fbdfc.rlib.

The internet suggests that this is a timeout in ssh, which can be prevented by setting the ServerAliveInterval ssh option to 0 (the client would then never time out the connection). I was going to open a PR in openssh to allow configuring this option (and maybe also ServerAliveCountMax) in openssh::SessionBuilder, but I believe this is not enough, since the session is created internally in tsunami. Given that, I'm not sure how to proceed. Do you have ideas? Thanks.
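
For reference, this is the underlying OpenSSH setting; in ssh_config(5) terms, a ServerAliveInterval of 0 means the client never sends keepalive probes, and so never tears the session down for lack of a response:

# ~/.ssh/config (or `-o ServerAliveInterval=0` on the command line)
Host *
    ServerAliveInterval 0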

Backoff when polling

Every once in a while, it takes AWS a little bit of time to set up an instance.

In this case (using #23, but it looks like it'd apply at HEAD too), Tsunami hammers the AWS API, which eventually starts returning errors once I exhaust my quota.

Tsunami should retry using some kind of backoff to prevent overusing the AWS API.
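
A minimal sketch of such a retry loop, assuming the crate's existing synchronous rusoto calls; the one-second floor, doubling factor, and two-minute cap are illustrative, and real code would probably only back off on throttling errors:

let mut wait = Duration::from_secs(1);
let result = loop {
    // `req` is the existing describe request for the pending spot instances
    match ec2.describe_spot_instance_requests(&req) {
        Ok(res) => break res,
        Err(_) => {
            // back off exponentially so we stop hammering the API
            thread::sleep(wait);
            wait = std::cmp::min(wait * 2, Duration::from_secs(120));
        }
    }
};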

Avoid arbitrary SSH timeout

We currently keep trying to establish an SSH connection to each instance for up to 2 minutes:

tsunami/src/ssh.rs

Lines 17 to 28 in 379acb7

// TODO: instead of max time, keep trying as long as instance is still active
let start = Instant::now();
let tcp = loop {
    match TcpStream::connect_timeout(&addr, Duration::from_secs(3)) {
        Ok(s) => break s,
        Err(_) if start.elapsed() <= Duration::from_secs(120) => {
            thread::sleep(Duration::from_secs(1));
        }
        Err(e) => Err(Error::from(e).context("failed to connect to ssh port"))?,
    }
};

This is a pretty arbitrary limit, and EC2 has been known to occasionally have spawn times longer than this. The right thing to do instead is to keep retrying until the instance is no longer marked as running. Should that happen, there is no point in continuing to retry. Theoretically there could be an issue with the firewall rules that prevents connection, so we may still want a high eventual timeout (~5 minutes).
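
Concretely, the loop above might become something like this, where `instance_is_running` is a hypothetical helper wrapping a rusoto describe_instances call, and the five-minute cap is the eventual timeout suggested above:

let start = Instant::now();
let tcp = loop {
    match TcpStream::connect_timeout(&addr, Duration::from_secs(3)) {
        Ok(s) => break s,
        Err(e) => {
            // stop retrying if the instance has died,
            // or after the firewall fallback timeout
            if !instance_is_running(&ec2, &instance_id)?
                || start.elapsed() > Duration::from_secs(300)
            {
                Err(Error::from(e).context("failed to connect to ssh port"))?;
            }
            thread::sleep(Duration::from_secs(1));
        }
    }
};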

Clean up AWS placement groups

Because they tend to build up over time when not deleted, especially since there's one per spot request rather than just one per region...
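
The deletion call itself would presumably mirror the existing security-group cleanup, along these lines (a sketch assuming rusoto_ec2's synchronous API; `group_name` is the name the placement group was created under):

let mut req = rusoto_ec2::DeletePlacementGroupRequest::default();
req.group_name = group_name;
ec2.delete_placement_group(&req)
    .context("failed to clean up placement group")?;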

Asynchronous SSH sessions

The ssh crate we're currently using, ssh2, does not have support for asynchronous sessions. This causes us to jump through some arguably unnecessary hoops, such as using rayon (the `.par_iter_mut()` call) to connect to instances in parallel.

We should instead switch to thrussh, a pure-Rust SSH client implementation, which does support asynchronous operation. That would let us connect to all the machines in parallel, and execute commands, without needing to spin up lots of threads (we could still use a threadpool for managing the connections, but it seems less important). This is now possible since we authenticate with a PEM private key file rather than through the SSH agent (which thrussh doesn't support).
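
The shape of the change might look something like this, using futures 0.1-style combinators; `async_connect` is a hypothetical wrapper around an asynchronous SSH client, not an existing function:

use futures::future;
use futures::Future;

// issue all connection attempts, then drive them to completion together
let pending: Vec<_> = machines
    .iter()
    .map(|m| async_connect(&m.public_ip, &private_key_path))
    .collect();
let sessions = future::join_all(pending).wait()?;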

Provide access to standard error output

Currently, the crate only exposes the stdout of a process through Session::cmd. However, it'd be nice to also have some mechanism for easily getting both stdout and stderr. The way to do this would probably be to have a builder of some sort for a command, where the user can explicitly say what they want to happen to standard output and standard error.
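
Something along these lines, perhaps; every name below is illustrative, and none of them exist in tsunami today:

let output = session
    .command("cargo build")
    .capture_stdout()
    .capture_stderr()
    .run()?;
println!("stdout:\n{}", output.stdout);
println!("stderr:\n{}", output.stderr);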

Bring back cleanup code

Code has been written for cleaning up the security group and keypair generated by tsunami for every run, but it is currently commented out:

tsunami/src/lib.rs

Lines 445 to 458 in 379acb7

/*
debug!(log, "cleaning up temporary resources");
trace!(log, "cleaning up temporary security group");
// clean up security groups and keys
let mut req = rusoto_ec2::DeleteSecurityGroupRequest::default();
req.group_id = Some(group_id);
ec2.delete_security_group(&req)
    .context("failed to clean up security group")?;
trace!(log, "cleaning up temporary keypair");
let mut req = rusoto_ec2::DeleteKeyPairRequest::default();
req.key_name = key_name;
ec2.delete_key_pair(&req)
    .context("failed to clean up key pair")?;
*/

This is because it turns out that EC2 won't let you delete resources that are associated with instances that have not yet been terminated. We ran into this during the live-coding session at https://youtu.be/66INYb73yXo?t=5716. The problem is specifically that although we send a termination request for all the instances, they may not yet have terminated by the time we try to do cleanup (and the deletion of the security group thus fails). The way around this would be to keep checking on all the instances until they have all been terminated, and only then do cleanup.
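
In sketch form, using the rusoto_ec2 types already in use (here `instance_ids` holds the instances we asked EC2 to terminate):

loop {
    let mut req = rusoto_ec2::DescribeInstancesRequest::default();
    req.instance_ids = Some(instance_ids.clone());
    let res = ec2
        .describe_instances(&req)
        .context("failed to check instance state")?;
    let all_terminated = res
        .reservations
        .iter()
        .flatten()
        .flat_map(|r| r.instances.iter().flatten())
        .all(|i| {
            i.state
                .as_ref()
                .and_then(|s| s.name.as_ref())
                .map(|n| n == "terminated")
                .unwrap_or(false)
        });
    if all_terminated {
        break;
    }
    thread::sleep(Duration::from_secs(1));
}
// now the security group and keypair can safely be deleted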

Sync up with new, async rusoto

The upcoming rusoto 0.32.0 release brings full async support to rusoto. The API changes are outlined in the rusoto migration guide. It's probably a good idea to start investigating what changes we need to make to tsunami in light of them.

It also brings some much-desired fixes to hyper, so that we can (in theory) get rid of this hack:

tsunami/src/lib.rs

Lines 435 to 439 in 379acb7

while let Err(e) = ec2.terminate_instances(&termination_req) {
    let msg = format!("{}", e);
    if msg.contains("Pooled stream disconnected") || msg.contains("broken pipe") {
        trace!(log, "retrying instance termination");
        continue;
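
After migrating, the call site might reduce to something like this sketch, assuming rusoto 0.32's by-value requests and a blocking .sync() helper on the returned future:

ec2.terminate_instances(termination_req)
    .sync()
    .context("failed to terminate instances")?;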

Use `Path::display()` to display a path

I'm just watching part 2 of the video where you add logging, and I noticed you log a path in debug format because Path doesn't implement Display. The reason is that most OSes make no guarantee that a path contains valid UTF-8. But you can use path.display() to convert it to a displayable value. This is super minor, but I thought I'd mention it :)
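
For example, with the slog macros the crate already uses (the message and key are made up):

// `%` tells slog to log the value with its Display implementation
debug!(log, "reading config"; "path" => %path.display());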

The videos are great by the way, and super informative 👍

Separate run_as into many smaller components

Currently, run_as is quite long, and has many relatively independent pieces to it (see also this comment), especially on the async branch. Splitting that function into many smaller ones would make the code much easier to read and manage! It shouldn't be too bad to take most of the closures and split them out into separate private &self functions on TsunamiBuilder, so we should probably do that!

Enable user to supply region and credentials provider

At the moment we hard-code the credentials provider and instance region:

tsunami/src/lib.rs

Lines 249 to 250 in 379acb7

    EnvironmentProvider,
    Region::UsEast1,

While that makes it easy to get up and running, some users may need to authenticate through more elaborate means (e.g., through sts::AssumeRole) as discussed here. Similarly, not all users will want to use the US-EAST-1 region for obvious reasons. We should provide a way for the user to specify these options in the builder, perhaps also including the connector to use.
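
A hypothetical builder surface for this; neither setter exists yet, and the names are made up:

use rusoto_core::{EnvironmentProvider, Region};

let mut b = TsunamiBuilder::default();
b.set_region(Region::EuWest1);
b.set_credential_provider(EnvironmentProvider);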

Provide a type to `run` closure that does not require unwrapping `ssh`

The closure passed to TsunamiBuilder::run is currently handed a bunch of Machine items. Each one has an ssh field which provides an SSH connection to the host in question. However, the type of that connection is currently Option<Session>, even though it is always Some by the time run is invoked. This is because we use the same type to keep track of a host before we've connected to it. Ideally, a different type should be exposed to run so that the closure doesn't need to unwrap unnecessarily.
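
One option is a second, post-connection type along these lines (a sketch; the name and exact field set are illustrative):

pub struct ConnectedMachine {
    /// Guaranteed to be connected by the time the `run` closure is invoked.
    pub ssh: Session,
    pub instance_type: String,
    pub private_ip: String,
    pub public_dns: String,
}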

Provide a nicer interface for executing remote commands

We currently only have cmd (and cmd_raw), which take a single string and execute it in whatever shell is running on the remote host. While this works, it is pretty error-prone, as users must manually handle escaping, whitespace argument splitting, etc. It'd be great if we could provide a higher-level "command constructor" or builder that would allow the user to build a command from separate arguments, and that would take care of all the escaping business.

https://github.com/mit-pdos/distributary/blob/6bb5a8c5833237aa1f589aa4f2d83d5958431dfb/benchmarks/vote/orchestrator.rs#L571-L600 may be a good place to start for inspiration.
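
For instance, something shaped like std::process::Command (all names hypothetical):

let out = session
    .command("grep")
    .arg("a pattern with spaces") // quoted/escaped automatically
    .arg("/var/log/app.log")
    .output()?;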

Handle issues that occur while requesting spot instances

Tsunami currently panics if requesting a spot instance fails. That isn't great, because user code has limited ability to recover from this situation, and we leave running any instances whose spot requests did go through.

This should be fixed by breaking from the loop if creating a spot request fails, waiting for any spot requests that were created to start an instance, and then exiting (which would also terminate the instances) somewhere around here.
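
In sketch form, where `launch_spot_request` and `wait_for_instances` are hypothetical stand-ins for the existing loop body and wait logic:

let mut spot_req_ids = Vec::new();
let mut launch_error = None;
for setup in &machine_setups {
    match launch_spot_request(&ec2, setup) {
        Ok(id) => spot_req_ids.push(id),
        Err(e) => {
            // stop issuing further requests as soon as one fails
            launch_error = Some(e);
            break;
        }
    }
}
if let Some(e) = launch_error {
    // let the requests that did go through start their instances, then bail;
    // the normal teardown path will terminate those instances.
    wait_for_instances(&ec2, &spot_req_ids)?;
    return Err(e.context("failed to request spot instances").into());
}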
