keikoproj / minion-manager

Intelligent use of Spot Instances in Kubernetes

License: Apache License 2.0

Makefile 1.24% Python 97.97% Dockerfile 0.56% Shell 0.24%

spot-instances kubernetes autoscaling-groups cost-effectiveness

minion-manager's Introduction

Spot Instances use and management for Kubernetes

The minion-manager enables the intelligent use of Spot Instances in Kubernetes.

What does it do?

The minion-manager operates on autoscaling groups (ASGs).
It queries AWS for all autoscaling groups that have the Kubernetes cluster tag and a special tag called "k8s-minion-manager". ASGs which have these tags are operated upon by the minion-manager.
It queries AWS to get the pricing information for spot-instances every 10 minutes.
It checks whether the given ASGs are using spot-instances or on-demand instances. If the spot-instance price < on-demand instance price, it switches the ASG to use spot-instances and terminates the on-demand instance.
If, at any point in time, the spot-instance price spikes and goes above on-demand instance price, it switches the ASG to use on-demand instances.
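
The switching rule described above can be sketched as a small pure function. This is a minimal illustration and not the project's actual code: the name `choose_market` and its return strings are assumptions, and the guard against a non-positive on-demand price is an extra safety check that the description above does not mention.

```python
def choose_market(spot_price, on_demand_price):
    """Return "spot" when spot is strictly cheaper, else "on-demand"."""
    # Assumption: a non-positive on-demand price is bad data, so stay on-demand.
    if on_demand_price <= 0:
        return "on-demand"
    return "spot" if spot_price < on_demand_price else "on-demand"


print(choose_market(0.0139, 0.0464))  # spot is cheaper than on-demand
print(choose_market(0.0500, 0.0464))  # spot has spiked above on-demand
```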

Prerequisites

It's best to run the minion-manager on an on-demand instance.

The IAM role of the node that runs the minion-manager should have the following policies.

{
    "Sid": "kopsK8sMinionManager",
    "Effect": "Allow",
    "Action": [
        "ec2:DescribeInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSpotInstanceRequests",
        "autoscaling:CreateLaunchConfiguration",
        "autoscaling:DeleteLaunchConfiguration",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "autoscaling:UpdateAutoScalingGroup",
        "autoscaling:DescribeScalingActivities",
        "iam:PassRole"
    ],
    "Resource": [
	"*"
    ]
}

Installing

Modify deploy/mm.yaml as follows:

Replace the cluster-name placeholder with the name of your cluster.
Change the namespace in which the minion-manager will run.

Then run kubectl apply -f deploy/mm.yaml.

Design:

Only ASGs which have the "k8s-minion-manager" tag are considered by the minion-manager. Other ASGs are left alone.
Minion-manager queries AWS for ASGs with these tags every "--refresh-interval". Default is 5 minutes.
The "k8s-minion-manager" tag can have two possible values:
- "use-spot": This will make the minion-manager intelligently use spot instances in the ASG
- "no-spot": This will make the minion-manager always use on-demand instances in the ASG. This is useful when someone wants to temporarily switch to on-demand instances and at a later point switch to "use-spot"
- Note that after changing the tag value, it may take up to 5 minutes for the minion-manager pod to see the change and act on it.
The "k8s-minion-manager/not-terminate" tag controls whether the minion-manager terminates ASG instances. If you want to decide yourself when ASG instances are terminated, set this tag to true. If the tag is not set, or is set to any other value, the minion-manager terminates instances as usual.
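
The tag rules above can be summarized in a short sketch. The helper name `asg_policy`, the plain-dict representation of the tags, and the exact string comparison against "true" are assumptions made for illustration, not the project's implementation.

```python
MODE_TAG = "k8s-minion-manager"
NOT_TERMINATE_TAG = "k8s-minion-manager/not-terminate"


def asg_policy(tags):
    """Map an ASG's tags (a plain dict) to (mode, may_terminate).

    mode is "use-spot", "no-spot", or None when the ASG is off-limits.
    """
    mode = tags.get(MODE_TAG)
    if mode not in ("use-spot", "no-spot"):
        # Untagged (or unknown value): leave the ASG alone.
        return None, False
    # Assumption: only the exact value "true" hands termination to the user.
    may_terminate = tags.get(NOT_TERMINATE_TAG) != "true"
    return mode, may_terminate


print(asg_policy({"k8s-minion-manager": "use-spot"}))
```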

What happens when:

1. User runs k8s-minion-manager without any ASG having the "k8s-minion-manager" tag?

k8s-minion-manager ignores all ASGs. It simply continues to keep polling AWS for the tags every "refresh-interval" seconds.

2. User runs k8s-minion-manager, adds the "k8s-minion-manager" tag and the "use-spot" value to start with. But later wants to not use spot instances.

The user should then change the tag value from "use-spot" to "no-spot". This indicates to the k8s-minion-manager that the ASG should have only on-demand instances, and it will make sure of that.

3. User runs k8s-minion-manager, adds the "k8s-minion-manager" key and the "use-spot" value to start with. But later simply removes the tag and the value.

Once the tag is removed, k8s-minion-manager simply considers the ASG to be off-limits and does not act upon it. The ASG will remain in whatever condition it is in.

4. User runs k8s-minion-manager, adds the "k8s-minion-manager" key and the "no-spot" value to start with. But later simply removes the tag and the value.

Same as above. The ASG will remain in whatever condition it is in.

5. User is running k8s-minion-manager and using spot instances. But now wants to stop using spot instances forever.

This will be a multi-step process:

Change the value of the "k8s-minion-manager" tag to "no-spot".
Wait for the minion-manager to react to this and switch the instances to on-demand. Look at the AWS console for verifying that all instances are on-demand.
After the above, remove the "k8s-minion-manager" tag.
Delete the "k8s-minion-manager" deployment.

How do I:

1. Run unit tests: Ensure that your AWS cli is set up correctly. Then simply run make docker-test

minion-manager's Issues

on-demand price is 0.0000

I installed kubernetes 1.10 and started minion-manager using the yaml file from the deploy folder. I tagged the ASG with "KubernetesCluster"="my-cluster-name" and "minion-manager"="on-spot". After some time the log shows

INFO aws.minion-manager.bid-advisor MainThread: Using spot_instance price 0.013900, on-demand price 0.000000 for instance type: t2.medium, zones: ['us-east-1a', 'us-east-1b'].

Why is that? There were no errors in the log.

minion manager should support events only mode

Minion Manager should support Kubernetes events only mode

Publish structured JSON with IG-name, bid price, aws-region, etc.
Minion Manager will only publish recommendation and will not take any action

Remove python2 compatibility code

Python2 was EOL'ed earlier this year and thus it may be worth switching to Python3 to avoid potential security issues

https://www.python.org/doc/sunset-python-2/

How to change log to debug?

I am trying to do a POC with this tool but can't make it work (I set the tags).

Move container image under argoproj

The current minion-manager docker image is under a personal docker-hub account. It should be moved under argoproj with the rest of the images.

Remove dependency on aws credentials for running unit tests

Currently, running make runs the unit tests which require valid aws credentials. This is because the unit tests invoke the boto apis that actually make AWS api calls. Instead, the appropriate calls should be mocked with mocker or moto. This will also reduce the time required for the tests to run.
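
One way to test without credentials is sketched below, using the standard library's `unittest.mock` rather than mocker or moto. The helper `discover_asgs` is hypothetical and ignores pagination; only the response shape (`AutoScalingGroups`, `AutoScalingGroupName`, `Tags`, `Key`, `Value`) follows the boto3 `describe_auto_scaling_groups` call.

```python
from unittest import mock


def discover_asgs(client, tag_key="k8s-minion-manager"):
    """Return names of ASGs that carry tag_key (hypothetical helper)."""
    response = client.describe_auto_scaling_groups()
    return [
        group["AutoScalingGroupName"]
        for group in response["AutoScalingGroups"]
        if any(tag["Key"] == tag_key for tag in group.get("Tags", []))
    ]


def test_discover_asgs_without_aws():
    # A MagicMock stands in for the boto3 autoscaling client, so no
    # network call is made and no credentials are needed.
    client = mock.MagicMock()
    client.describe_auto_scaling_groups.return_value = {
        "AutoScalingGroups": [
            {"AutoScalingGroupName": "nodes-a",
             "Tags": [{"Key": "k8s-minion-manager", "Value": "use-spot"}]},
            {"AutoScalingGroupName": "nodes-b", "Tags": []},
        ]
    }
    assert discover_asgs(client) == ["nodes-a"]
    client.describe_auto_scaling_groups.assert_called_once_with()


test_discover_asgs_without_aws()
```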

Will `schedule_instance_termination()` terminate instance just created?

While reading schedule_instance_termination() in aws_minion_manager.py, I found that it terminates instances that do not match the ASG's k8s-minion-manager value.
However, I could not find where update_scaling_group() updates the ASG's k8s-minion-manager value.
When the spot price rises above the on-demand price, mm updates the launch config to use on-demand instances, and updates lc_info and bid_info.
So I worry that it will keep terminating instances that were just launched after the price rose above the on-demand price.

MM fails to discover/populate ASG, using weird endpoint for AWS API

Minion-Manager seems to be using a peculiar endpoint when talking to AWS Autoscaling API.
I've got the following setup:

K8s cluster: dev.rnd.pw
K8s nodes ASG: nodes.dev.rnd.pw
Route 53 Zone Record: *.dev.rnd.pw
- api.dev.rnd.pw -> CNAME -> K8s API ELB
- *.dev.rnd.pw -> CNAME -> Ingress ELB
- other records, not relevant to this issue
Ingress ELB with rnd.pw SSL Cert with additional alternative names: *.rnd.pw, *.dev.rnd.pw

When launching Minion-Manager, it seems to attempt to talk to something in *.dev.rnd.pw, according to the error message mentioning the certificate. I have no idea how it would resolve autoscaling.us-east-2.amazonaws.com via the *.dev.rnd.pw wildcard CNAME record.

Using shrinand/k8s-minion-manager:v0.2-dev

$ kubectl logs -f minion-manager-695dd4596f-nl5wd
2018-07-25T10:20:04 INFO minion_manager MainThread: Starting minion-manager for cluster: dev.rnd.pw, in region us-east-2 for cloud provider aws
2018-07-25T10:20:05 INFO aws_minion_manager MainThread: Running AWS Minion Manager
Traceback (most recent call last):
  File "./minion_manager.py", line 61, in <module>
    run()
  File "./minion_manager.py", line 57, in run
    minion_manager.run()
  File "/cloud_provider/aws/aws_minion_manager.py", line 495, in run
    str(ex))
Exception: Failed to discover/populate current ASG info: hostname 'autoscaling.us-east-2.amazonaws.com' doesn't match either of 'rnd.pw', '*.rnd.pw', '*.dev.rnd.pw'

Or using argoproj/minion-manager

$ kubectl logs -f minion-manager-dep-575cb9d695-7vrhl
2018-07-25T10:30:05 INFO minion_manager MainThread: Starting ...
2018-07-25T10:30:05 INFO minion_manager MainThread: Using config from env: us-east-2
2018-07-25T10:30:05 INFO minion_manager MainThread: Using config from env: ['nodes.dev.rnd.pw']
2018-07-25T10:30:05 INFO minion_manager MainThread: Starting minion-manager for scaling groups: ['nodes.dev.rnd.pw'], in region us-east-2 for cloud provider aws
2018-07-25T10:30:05 INFO aws.minion-manager MainThread: Running AWS Minion Manager
Traceback (most recent call last):
  File "/ax/bin/minion_manager", line 89, in <module>
    run()
  File "/ax/bin/minion_manager", line 79, in run
    minion_manager.run()
  File "/ax/python/ax/platform/minion_manager/cloud_provider/aws/aws_minion_manager.py", line 550, in run
    self.start()
  File "/ax/python/ax/platform/minion_manager/cloud_provider/aws/aws_minion_manager.py", line 152, in start
    str(ex))
Exception: Failed to discover/populate current ASG info: hostname 'autoscaling.us-east-2.amazonaws.com' doesn't match either of 'rnd.pw', '*.rnd.pw', '*.dev.rnd.pw'

As soon as I delete *.dev.rnd.pw DNS record, the problem disappears, and Minion-Manager discovers ASG just fine.

Setup PR builds for minion-manager

Bid threshold should be a parameter

Hi, I've done some testing on minion-manager and was really satisfied with the result, thinking about implementing it to production 😄 , but I think a configurable threshold should be added, so we can determine how aggressive we want to be about the prices over OnDemand instances.

Something like:

parser.add_argument("--threshold", default=80, help="Max percentage to pay over OnDemand price")

Looking at the code, IMO it's a very simple change and I wonder if you think it's valid or not?
I could submit a PR if you're positive.

Thanks.
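
A rough sketch of how such a flag could feed the pricing decision follows. It reads the proposed threshold as the highest spot price accepted, expressed as a percentage of the on-demand price; that reading, and the helper names `build_parser` and `within_threshold`, are assumptions and not part of the minion-manager.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed meaning: highest spot price accepted, as a percentage
    # of the on-demand price.
    parser.add_argument("--threshold", type=float, default=80,
                        help="Max spot price as a percentage of the OnDemand price")
    return parser


def within_threshold(spot_price, on_demand_price, threshold_pct):
    """True when spot_price is at most threshold_pct percent of on-demand."""
    return spot_price <= on_demand_price * threshold_pct / 100.0


args = build_parser().parse_args(["--threshold", "80"])
print(within_threshold(0.0139, 0.0464, args.threshold))
```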

Handle AZ-isolated capacity issues

Occasionally, an AZ may run out of spot capacity. When this happens, an ASG will temporarily spin up instances in other AZs if possible - and later attempt to rebalance instances across all AZs. If an AZ is still out of capacity or close to being out, AWS will still attempt to spin up instances in this AZ. We've noticed a lot of node churn when this happens - nodes are spun up, before being yanked by AWS for instance-terminated-no-capacity - it would be nice if minion manager were able to suspend AzRebalance in these cases to avoid further churn.

k8s-minion-manager should show money spent/saved

Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST

What happened:
k8s-minion-manager can intelligently switch between spot and on-demand instances. However, it doesn't provide information about how much money has been saved because of it. It will be good if the addon can provide that information.

What you expected to happen:
There should be an easy way of seeing the money spent and saved on a per IG basis.

Switch to on-demand when capacity unavailable

Good Afternoon,

I was just wondering if it would be possible to add a feature to switch to on-demand instances when capacity is unavailable and the request could not be fulfilled?

AWS can return multiple values per instance-type per region

We recently found that the minion-manager was not switching from on-demand to spot-instances because it saw that the spot-instance price was greater than the on-demand price. The on-demand instance price was seen to be 0.000 :-).

Turns out that this was happening because the AWS on-demand pricing endpoint was returning multiple values for that instance-type for the same region. The current mechanism for gathering the ondemand instance price will simply take the price of the last entry for that instance type. But it seems that that price could be 0.00.

This is what we got in one case:

{'SKU': '2N2QH6UEJZ5GUPT8'	'OfferingClass': ''	'Group': ''	'Instance Capacity - xlarge': ''	'Instance Capacity - 16xlarge': ''	'PricePerUnit': '0.0464000000'	'PriceDescription': '$0.0464 per On Demand Linux t2.medium Instance Hour'	'Storage': 'EBS only'	'Pre Installed S/W': 'NA'	'Instance': ''	'Normalization Size Factor': '2'	'Location': 'US East (Ohio)'	'Memory': '4 GiB'	'Physical Processor': 'Intel Xeon Family'	'operation': 'RunInstances'	'Dedicated EBS Throughput': ''	'Instance Capacity - 10xlarge': ''	'Instance Capacity - 4xlarge': ''	'To Location': ''	'From Location': ''	'Operating System': 'Linux'	'Product Family': 'Compute Instance'	'GPU': ''	'Intel Turbo Available': ''	'Intel AVX Available': ''	'Max IOPS Burst Performance': ''	'Instance Capacity - 32xlarge': ''	'ECU': 'Variable'	'Tenancy': 'Shared'	'Instance Capacity - 18xlarge': ''	'OfferTermCode': 'JRTCKXETXF'	'Instance Capacity - 9xlarge': ''	'Instance Capacity - 8xlarge': ''	'Processor Architecture': '32-bit or 64-bit'	'EBS Optimized': ''	'Group Description': ''	'Provisioned': ''	'Location Type': 'AWS Region'	'EffectiveDate': '2018-10-01'	'License Model': 'No License required'	'vCPU': '2'	'TermType': 'OnDemand'	'instanceSKU': ''	'PurchaseOption': ''	'Instance Type': 't2.medium'	'Instance Capacity - 2xlarge': ''	'LeaseContractLength': ''	'Instance Capacity - large': ''	'StartingRange': '0'	'Max IOPS/volume': ''	'Max throughput/volume': ''	'To Location Type': ''	'Processor Features': 'Intel AVX; Intel Turbo'	'Intel AVX2 Available': ''	'GPU Memory': ''	'serviceName': 'Amazon Elastic Compute Cloud'	'Network Performance': 'Low to Moderate'	'Max Volume Size': ''	'CapacityStatus': 'Used'	'Instance Capacity - 12xlarge': ''	'Transfer Type': ''	'Elastic GPU Type': ''	'usageType': 'USE2-BoxUsage:t2.medium'	'RateCode': '2N2QH6UEJZ5GUPT8.JRTCKXETXF.6YS6EN2CT7'	'Instance Capacity - 24xlarge': ''	'Instance Family': 'General purpose'	'Currency': 'USD'	'Enhanced Networking Supported': ''	'serviceCode': 'AmazonEC2'	
'Physical Cores': ''	'Instance Capacity - medium': ''	'Volume Type': ''	'Storage Media': ''	'EndingRange': 'Inf'	'Clock Speed': 'Up to 3.3 GHz'	'From Location Type': ''	'Unit': 'Hrs'	'Current Generation': 'Yes'}
{'SKU': 'QT7848TA4YHDW5JE'	'OfferingClass': ''	'Group': ''	'Instance Capacity - xlarge': ''	'Instance Capacity - 16xlarge': ''	'PricePerUnit': '0.0464000000'	'PriceDescription': '$0.0464 per Unused Reservation Linux t2.medium Instance Hour'	'Storage': 'EBS only'	'Pre Installed S/W': 'NA'	'Instance': ''	'Normalization Size Factor': '2'	'Location': 'US East (Ohio)'	'Memory': '4 GiB'	'Physical Processor': 'Intel Xeon Family'	'operation': 'RunInstances'	'Dedicated EBS Throughput': ''	'Instance Capacity - 10xlarge': ''	'Instance Capacity - 4xlarge': ''	'To Location': ''	'From Location': ''	'Operating System': 'Linux'	'Product Family': 'Compute Instance'	'GPU': ''	'Intel Turbo Available': ''	'Intel AVX Available': ''	'Max IOPS Burst Performance': ''	'Instance Capacity - 32xlarge': ''	'ECU': 'Variable'	'Tenancy': 'Shared'	'Instance Capacity - 18xlarge': ''	'OfferTermCode': 'JRTCKXETXF'	'Instance Capacity - 9xlarge': ''	'Instance Capacity - 8xlarge': ''	'Processor Architecture': '32-bit or 64-bit'	'EBS Optimized': ''	'Group Description': ''	'Provisioned': ''	'Location Type': 'AWS Region'	'EffectiveDate': '2018-10-01'	'License Model': 'No License required'	'vCPU': '2'	'TermType': 'OnDemand'	'instanceSKU': '2N2QH6UEJZ5GUPT8'	'PurchaseOption': ''	'Instance Type': 't2.medium'	'Instance Capacity - 2xlarge': ''	'LeaseContractLength': ''	'Instance Capacity - large': ''	'StartingRange': '0'	'Max IOPS/volume': ''	'Max throughput/volume': ''	'To Location Type': ''	'Processor Features': 'Intel AVX; Intel Turbo'	'Intel AVX2 Available': ''	'GPU Memory': ''	'serviceName': 'Amazon Elastic Compute Cloud'	'Network Performance': 'Low to Moderate'	'Max Volume Size': ''	'CapacityStatus': 'UnusedCapacityReservation'	'Instance Capacity - 12xlarge': ''	'Transfer Type': ''	'Elastic GPU Type': ''	'usageType': 'USE2-UnusedBox:t2.medium'	'RateCode': 'QT7848TA4YHDW5JE.JRTCKXETXF.6YS6EN2CT7'	'Instance Capacity - 24xlarge': ''	'Instance Family': 'General purpose'	'Currency': 'USD'	'Enhanced Networking 
Supported': ''	'serviceCode': 'AmazonEC2'	'Physical Cores': ''	'Instance Capacity - medium': ''	'Volume Type': ''	'Storage Media': ''	'EndingRange': 'Inf'	'Clock Speed': 'Up to 3.3 GHz'	'From Location Type': ''	'Unit': 'Hrs'	'Current Generation': 'Yes'}
{'SKU': 'PRCADQFUQ6HZKBHK'	'OfferingClass': ''	'Group': ''	'Instance Capacity - xlarge': ''	'Instance Capacity - 16xlarge': ''	'PricePerUnit': '0.0000000000'	'PriceDescription': '$0.00 per Reservation Linux t2.medium Instance Hour'	'Storage': 'EBS only'	'Pre Installed S/W': 'NA'	'Instance': ''	'Normalization Size Factor': '2'	'Location': 'US East (Ohio)'	'Memory': '4 GiB'	'Physical Processor': 'Intel Xeon Family'	'operation': 'RunInstances'	'Dedicated EBS Throughput': ''	'Instance Capacity - 10xlarge': ''	'Instance Capacity - 4xlarge': ''	'To Location': ''	'From Location': ''	'Operating System': 'Linux'	'Product Family': 'Compute Instance'	'GPU': ''	'Intel Turbo Available': ''	'Intel AVX Available': ''	'Max IOPS Burst Performance': ''	'Instance Capacity - 32xlarge': ''	'ECU': 'Variable'	'Tenancy': 'Shared'	'Instance Capacity - 18xlarge': ''	'OfferTermCode': 'JRTCKXETXF'	'Instance Capacity - 9xlarge': ''	'Instance Capacity - 8xlarge': ''	'Processor Architecture': '32-bit or 64-bit'	'EBS Optimized': ''	'Group Description': ''	'Provisioned': ''	'Location Type': 'AWS Region'	'EffectiveDate': '2018-10-01'	'License Model': 'No License required'	'vCPU': '2'	'TermType': 'OnDemand'	'instanceSKU': '2N2QH6UEJZ5GUPT8'	'PurchaseOption': ''	'Instance Type': 't2.medium'	'Instance Capacity - 2xlarge': ''	'LeaseContractLength': ''	'Instance Capacity - large': ''	'StartingRange': '0'	'Max IOPS/volume': ''	'Max throughput/volume': ''	'To Location Type': ''	'Processor Features': 'Intel AVX; Intel Turbo'	'Intel AVX2 Available': ''	'GPU Memory': ''	'serviceName': 'Amazon Elastic Compute Cloud'	'Network Performance': 'Low to Moderate'	'Max Volume Size': ''	'CapacityStatus': 'AllocatedCapacityReservation'	'Instance Capacity - 12xlarge': ''	'Transfer Type': ''	'Elastic GPU Type': ''	'usageType': 'USE2-Reservation:t2.medium'	'RateCode': 'PRCADQFUQ6HZKBHK.JRTCKXETXF.6YS6EN2CT7'	'Instance Capacity - 24xlarge': ''	'Instance Family': 'General purpose'	'Currency': 'USD'	'Enhanced Networking 
Supported': ''	'serviceCode': 'AmazonEC2'	'Physical Cores': ''	'Instance Capacity - medium': ''	'Volume Type': ''	'Storage Media': ''	'EndingRange': 'Inf'	'Clock Speed': 'Up to 3.3 GHz'	'From Location Type': ''	'Unit': 'Hrs'	'Current Generation': 'Yes'}

The difference in the three prices is the price description.

'PriceDescription': '$0.0464 per On Demand Linux t2.medium Instance Hour'
'PriceDescription': '$0.0464 per Unused Reservation Linux t2.medium Instance Hour'
'PriceDescription': '$0.00 per Reservation Linux t2.medium Instance Hour'

Basically, minion-manager currently supports only on-demand instances (it does not support Reserved instances). Therefore, only the "On Demand" price description from the above is relevant. But the current implementation of the price querying API does not factor this in.

To start with:

it'll be good to specifically look for "On Demand " in the price description and only consider that price.
Add warnings if there are duplicates and if some price is being overwritten
Ensure that the price is not set to 0. If so... log LOUDLY!!
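
The three suggestions above could look roughly like the sketch below. The dictionary keys (`Instance Type`, `PriceDescription`, `PricePerUnit`) are taken from the sample rows in this issue; the function name `pick_on_demand_price` and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("bid-advisor-sketch")


def pick_on_demand_price(rows, instance_type):
    """Return the "On Demand" hourly price for instance_type, or None."""
    price = None
    for row in rows:
        if row.get("Instance Type") != instance_type:
            continue
        # Skip "Unused Reservation" and "Reservation" rows.
        if "On Demand " not in row.get("PriceDescription", ""):
            continue
        candidate = float(row["PricePerUnit"])
        if price is not None:
            # A second matching row would overwrite the first: warn.
            log.warning("Duplicate price for %s: %s replaces %s",
                        instance_type, candidate, price)
        price = candidate
    if not price:
        # Covers both "no row found" and "price is 0".
        log.error("ON-DEMAND PRICE FOR %s IS MISSING OR ZERO", instance_type)
        return None
    return price
```

Returning None instead of 0 forces the caller to handle the missing price explicitly rather than comparing a spot price against zero.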

Dockerfile Vulnerability

I don't like the unknown nature of the docker file. Why is this not based on https://hub.docker.com/_/python/?tab=tags ?

Use AWS tags to discover ASGs instead of command line arguments

The minion-manager currently uses the --scaling-groups command line argument to find the list of ASGs on which to operate. Every time the list has to be updated, the minion-manager deployment has to be updated and restarted. This is cumbersome and error-prone.

Instead, the minion-manager should take an AWS tag name and tag value pair as argument and "discover" the ASGs to operate upon. If the user wants to disable use of spot-instances, the user can simply modify the tags in AWS and the minion-manager pod should factor that in.

This is similar to the way the cluster-autoscaler pod runs.

Upgrade the `pyca / cryptography` library to newer version

Switch to on-demand pricing per region

Example: ( us-west-2 ) URL
https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-2/index.csv

Bug: LaunchTemplate causes controller to crash

Looks like if we tag an ASG using a LaunchTemplate with the minion manager tag, it causes the controller to crash.
Regardless of whether we would like to support launch templates in the future, we should probably avoid crashing if the LaunchConfigurationName field is nil.

Traceback (most recent call last):
  File "./minion_manager.py", line 62, in <module>
    run()
  File "./minion_manager.py", line 58, in run
    minion_manager.run()
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 727, in run
    str(ex))
Exception: Failed to discover/populate current ASG info: LaunchConfigurationName
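
A defensive check along these lines might avoid the crash. It is only a sketch: `split_by_launch_config` is a hypothetical helper, and it assumes each ASG is a dict in which ASGs backed by a LaunchTemplate simply lack the `LaunchConfigurationName` key, as the exception message above suggests.

```python
def split_by_launch_config(asgs):
    """Separate ASG dicts into (supported, skipped) lists of names."""
    supported, skipped = [], []
    for asg in asgs:
        # .get() avoids the KeyError when the ASG uses a LaunchTemplate.
        if asg.get("LaunchConfigurationName"):
            supported.append(asg["AutoScalingGroupName"])
        else:
            skipped.append(asg["AutoScalingGroupName"])
    return supported, skipped


asgs = [
    {"AutoScalingGroupName": "nodes-lc", "LaunchConfigurationName": "lc-1"},
    {"AutoScalingGroupName": "nodes-lt",
     "LaunchTemplate": {"LaunchTemplateName": "lt-1"}},
]
print(split_by_launch_config(asgs))
```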

Config-file with config options

The minion-manager has a few configuration options and could use a few more. E.g.

Name of the cluster
Region
Number of instances to terminate in parallel
Time to sleep between terminating instances

It will be better to have these options in a config file and make the minion-manager use that file instead of some command line args.
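
As one possible shape for such a file, the sketch below loads the four options listed above from JSON. The field names, the default values, and the choice of JSON are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class MinionManagerConfig:
    cluster_name: str
    region: str
    # Hypothetical defaults: terminate one instance at a time, 30s apart.
    parallel_terminations: int = 1
    sleep_between_terminations_sec: int = 30


def load_config(text):
    """Parse a JSON document into a MinionManagerConfig."""
    return MinionManagerConfig(**json.loads(text))


example = '{"cluster_name": "my-cluster", "region": "us-east-2", "parallel_terminations": 2}'
print(load_config(example))
```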

Cordon and drain nodes before termination

When the minion-manager switches from on-demand to spot instances, it currently simply terminates the nodes. It would be good if the termination were preceded by cordoning and draining the node so that the pods on that node can move to a different node. This might also reduce the downtime (if any) that the apps face because of the switch.

Master branch broken because of incorrect use of variable

Switching to on-demand instances from spot-instances is currently broken because of the following:

2019-03-26T05:38:29 ERROR aws_minion_manager MainThread: Failed while checking instances in ASG: global name 'spot_price' is not defined
Traceback (most recent call last):
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 593, in minion_manager_work
    self.update_scaling_group(asg_meta, new_bid_info)
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 330, in update_scaling_group
    self.create_lc_on_demand(new_lc_name, launch_config)
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 270, in create_lc_on_demand
    SpotPrice=spot_price,
NameError: global name 'spot_price' is not defined

Handle Cloudwatch Event for Spot instance termination

Minion manager should react to the spot instance termination event and switch to On Demand to catch very rapid bursts in Spot Price.

Does minion-manager support mixed instance?

I have a cluster that uses an ASG with mixed instances, on-demand + spot instances (reference). I want to use minion-manager only to autoscale the spot instances, without touching the on-demand ones. Is this possible with the current minion-manager?

On-demand instances get terminated all together

Currently, all on-demand instances in an ASG get terminated together when the minion-manager decides to use spot instances. This has its pros and cons. The benefit is that the new instances all come up in parallel, shortening the time for which on-demand instances run (and therefore keeping costs low). However, this can lead to service disruption.

Ideally, it should be possible to choose which termination strategy is used.

Maybe, add another tag?
k8s-minion-manager/num-simultaneous-terminations: 1 will terminate one instance at a time.
k8s-minion-manager/num-simultaneous-terminations: all will terminate all instances together.
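
The proposed tag could drive termination batches roughly as follows. This is a sketch of the idea only; `termination_batches` is a hypothetical helper and the tag itself is a proposal in this issue, not an existing feature.

```python
def termination_batches(instance_ids, tag_value):
    """Split instance_ids into batches according to the tag value.

    tag_value is "all" or a positive integer given as a string.
    """
    ids = list(instance_ids)
    if not ids:
        return []
    if tag_value == "all":
        size = len(ids)
    else:
        size = max(1, int(tag_value))
    # Each inner list would be terminated together, one list at a time.
    return [ids[i:i + size] for i in range(0, len(ids), size)]


print(termination_batches(["i-1", "i-2", "i-3"], "1"))
print(termination_batches(["i-1", "i-2", "i-3"], "all"))
```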

Spot instances can be terminated without price change (and new ones gotten after long time)

We noticed that spot instances were terminated without any bid or spot price changes. It seems AWS can terminate them and not give new ones immediately. We may need to switch to on-demand when this is happening. This will be a little hard to decide, but we need some mechanism.

Lots of DescribeSpotPriceHistory calls when there is an Exception

When an Exception occurs during DescribeSpotPriceHistory there is no back-off, so minion-manager makes a lot of AWS calls.
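
A generic exponential back-off wrapper, sketched below with only the standard library, is one way to space out the retries. The helper name `call_with_backoff` and its default delays are assumptions; the `sleep` parameter exists so the behavior can be tested without waiting.

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), sleeping exponentially longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller see the last error.
                raise
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The AWS call would be passed in as `fn`, for example as a lambda wrapping the spot price query.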

Certain instance types have an on-demand price of 0.0000

I saw that these two issues below are similar to mine and were closed.
#15
#10

The bug still seems to be in the code and also affects m5a.2xlarge instances

I have a fix. What is the best way to share it? I am getting a 403 when I try to push up my branch.

Thanks!

BUG: Spot price is not updated based on LaunchConfig

When a LaunchConfig is changed, spot price will stay the same and based on the size of instance-type being switched, it can prevent instances from launching.

Spot price should be updated based on instance-type specified in LaunchConfig, you might also need to maintain the previous spot pricing until all new instances have joined Asg.

SpotRecommendation events should include IG name

Current:

28m         Normal   SpotRecommendationGiven   SpotPriceInfo   {"apiVersion":"v1alpha1","spotPrice":"", "useSpot": false}

Expected:

28m         Normal   SpotRecommendationGiven   SpotPriceInfo   {"apiVersion":"v1alpha1","spotPrice":"0.90", "useSpot": false, "instanceGroup": "node123"}
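
The expected payload could be built with a helper like the one below. The function name `spot_recommendation_message` is hypothetical; the field names come from the expected event shown above.

```python
import json


def spot_recommendation_message(spot_price, use_spot, instance_group):
    """Build the JSON payload for a SpotRecommendationGiven event."""
    return json.dumps({
        "apiVersion": "v1alpha1",
        "spotPrice": spot_price,
        "useSpot": use_spot,
        "instanceGroup": instance_group,
    })


print(spot_recommendation_message("0.90", False, "node123"))
```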