
aws-samples / aws-health-aware

AHA is an incident management and communication framework that provides real-time alerts to customers when there are active AWS events. Customers using AWS Organizations can get aggregated account-level events for all accounts in the Organization. Customers not using AWS Organizations still benefit from alerting at the account level.

License: MIT No Attribution

Python 69.09% HCL 30.91%
health-check health serverless incident-response-tooling incident-management alerts

aws-health-aware's Issues

Multi region deployment issue

I deployed AHA with the multi-region option and found that my Lambda function in the secondary region failed to run. The cause is a misconfigured Lambda environment variable: the ORG_STATUS value should be “No” but was “false”, even though the primary region Lambda's ORG_STATUS value is “No”. Has anyone else faced the same issue?
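Until the template is fixed, one defensive workaround is to normalize ORG_STATUS in the Lambda instead of comparing against a single literal; a minimal sketch (the helper name is hypothetical, not AHA's actual code):

```python
import os

def org_mode_enabled() -> bool:
    # Accept "Yes"/"yes"/"true" etc. for enabled, so a mis-populated
    # template value like "false" still parses as the operator intended.
    return os.environ.get("ORG_STATUS", "No").strip().lower() in ("yes", "true")
```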

Terraform deploy to single region fails on creation of AWS Secret Manager secret for AssumeRoleArn

The current state of the main branch of this repo is not deployable using terraform provided in this repo.

When deploying to an Organisation member account, in a single region, using Terraform, there are two errors:

Error 1:

Error: error creating Secrets Manager Secret: InvalidParameterException: Invalid replica region.

   with aws_secretsmanager_secret.AssumeRoleArn[0],
   on Terraform_DEPLOY_AHA.tf line 416, in resource "aws_secretsmanager_secret" "AssumeRoleArn":
  416: resource "aws_secretsmanager_secret" "AssumeRoleArn" {

Error 2:

 Error: Invalid index
 
   on Terraform_DEPLOY_AHA.tf line 207, in resource "aws_s3_bucket_acl" "AHA-S3Bucket-PrimaryRegion":
  207:     bucket = aws_s3_bucket.AHA-S3Bucket-PrimaryRegion[0].id
     ├────────────────
     │ aws_s3_bucket.AHA-S3Bucket-PrimaryRegion is empty tuple
 
 The given key does not identify an element in this collection value: the
 collection has no elements.

Please note: a fix for Error 1 was already attempted in PR-32 of this project.

SHD updates beyond initial posting are not shown by AHA

When updates are made to the Service Health Dashboard after an initial event has been opened (e.g. CloudFront goes from green to blue), they are not picked up by AHA and sent to the Slack webhook. Similarly, when a service/event goes back to green, that is not reflected in AHA. Only the initial event notification sends an alert via AHA, so the user must continue to monitor the SHD/PHD for further updates.

Conversely, an RSS feed for Slack picks up all updates to the SHD (including after a service is not green).
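A sketch of the kind of per-event bookkeeping that would let AHA alert on later SHD updates as well. The function and its return values are hypothetical, not the actual AHA implementation (which persists state in DynamoDB rather than a dict):

```python
def classify_event(seen, arn, status, last_updated):
    """Decide whether a Health event needs a (re-)notification.

    seen: dict mapping event ARN -> last update time already alerted on.
    Returns "create", "update", "resolve", or None when nothing new.
    """
    prev = seen.get(arn)
    if prev is None:
        seen[arn] = last_updated
        return "create"
    if status == "closed":
        seen.pop(arn, None)
        return "resolve"
    if last_updated > prev:
        seen[arn] = last_updated
        return "update"
    return None
```

Re-sending on every newer lastUpdatedTime is what the RSS feed effectively does, and what the issue is asking AHA to match.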

Lambda issue - 413 Request Entity Too Large

Due to a recent AWS Health event spanning the past two days, on the second day the Lambda started failing to post messages to Slack and Chime with a 413 Request Entity Too Large error. The SNS notification worked just fine.
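One possible mitigation, assuming the 413 comes from an oversized message body, is to truncate the event description before posting to the webhook; a sketch (the 3000-character ceiling is an assumption, check your webhook's actual limit):

```python
MAX_BODY = 3000  # assumed ceiling; verify against the target webhook's limit

def truncate_description(text, limit=MAX_BODY, marker="… [truncated]"):
    # Keep the payload under the webhook's size limit rather than
    # letting the POST fail with 413 Request Entity Too Large.
    if len(text) <= limit:
        return text
    return text[: limit - len(marker)] + marker
```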

Typo in readme instructions for multi-account deployment

README.md

Line 255 : 9. In the Outputs tab, there will be a value for AWSHealthAwareRoleForPHDEventsArn (e.g. arn:aws:iam::000123456789:role/aha-org-role-AWSHealthAwareRoleForPHDEvents-ABCSDE12201), copy that down as you will need it for step 16.

Should reference step 14, not 16.

Question: What is the reason for building an S3 bucket for each region?

Dear Folks,

I have a question regarding the following Terraform resources:

resource "aws_s3_bucket" "AHA-S3Bucket-PrimaryRegion" {
  count  = var.ExcludeAccountIDs != "" ? 1 : 0
  bucket = "aha-bucket-${var.aha_primary_region}-${random_string.resource_code.result}"
  tags = {
    Name = "aha-bucket"
  }
}

resource "aws_s3_bucket" "AHA-S3Bucket-SecondaryRegion" {
  count    = var.aha_secondary_region != "" && var.ExcludeAccountIDs != "" ? 1 : 0
  provider = aws.secondary_region
  bucket   = "aha-bucket-${var.aha_secondary_region}-${random_string.resource_code.result}"
  tags = {
    Name = "aha-bucket"
  }
}

I was not able to figure out what they are used for.

I think they are used for a CSV file holding data about excluded accounts. If so, I do not see a reason to create these buckets: I could just pass the excluded accounts as a list in Terraform, which is interpreted in Python as a string and parsed.

If someone could tell me what these buckets are used for that would be great.

Many thanks.

Lambda cold/warm start issue when DNS changes region

There is an issue in the code when the DNS is switching from one region to another.

We have noticed that when the load balancer fails over from one region to another, the health_active_region changes, but because this code block is not wrapped in a function (or in the function handler), it only runs on a cold start and stale results are returned. The initialisation of the config is currently done in the cold state. As a result, when the Lambda runs every x minutes and a failover occurs, it returns an error.

In the function logs we notice an error message like:

Client is configured with the deprecated endpoint: us-east-2

health_dns = socket.gethostbyname_ex('global.health.amazonaws.com')
(current_endpoint, global_endpoint, ip_endpoint) = health_dns
health_active_list = current_endpoint.split('.')
health_active_region = health_active_list[1]
print("current health region: ", health_active_region)
# create a boto3 health client w/ backoff/retry
config = Config(
    region_name=health_active_region,
    retries=dict(
        max_attempts=10  # org view apis have a lower tps than the single
                         # account apis so we need to use larger
                         # backoff/retry values than the boto defaults
    )
)

A solution for this issue could be to wrap the initialisations inside a function.
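A sketch of that refactor: move the DNS resolution into a helper called from the handler on every invocation, so a failover is picked up even on a warm container (helper names are hypothetical; the real code would then build its botocore Config from the returned region):

```python
import socket

def region_from_endpoint(resolved_name):
    # e.g. "health.us-east-2.amazonaws.com" -> "us-east-2"
    return resolved_name.split(".")[1]

def active_health_region(hostname="global.health.amazonaws.com"):
    """Resolve the active AWS Health endpoint and return its region.

    Called from the handler, not at module import, so every invocation
    re-resolves DNS instead of reusing the cold-start result.
    """
    resolved_name, _aliases, _ips = socket.gethostbyname_ex(hostname)
    return region_from_endpoint(resolved_name)
```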

Trigger lambda on an event rather than on a schedule

It strikes me as quite inefficient that this lambda is triggered on a schedule of every minute. Is there a reason this is the case, rather than executing on every Health event received that matches an event pattern?

Add SourceArn parameter to ses send_email

Hi,

The SES send_email call should support the SourceArn parameter.
In an organizational context, SES identities are often defined in a dedicated AWS account.

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ses/client/send_email.html

aws-health-aware/handler.py

Lines 267 to 283 in eae99cc

response = client.send_email(
    Source=SENDER,
    Destination={
        'ToAddresses': RECIPIENT
    },
    Message={
        'Body': {
            'Html': {
                'Data': BODY_HTML
            },
        },
        'Subject': {
            'Charset': 'UTF-8',
            'Data': SUBJECT,
        },
    },
)
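A sketch of the requested change: build the send_email kwargs once and attach SourceArn only when a cross-account sending identity is configured. The helper is hypothetical; SourceArn itself is a real parameter of the SES send_email API:

```python
def build_send_email_kwargs(sender, recipients, subject, body_html, source_arn=None):
    """Assemble kwargs for ses_client.send_email(**kwargs); SourceArn is
    added only when a cross-account SES identity ARN is supplied."""
    kwargs = {
        "Source": sender,
        "Destination": {"ToAddresses": recipients},
        "Message": {
            "Body": {"Html": {"Data": body_html}},
            "Subject": {"Charset": "UTF-8", "Data": subject},
        },
    }
    if source_arn:
        kwargs["SourceArn"] = source_arn
    return kwargs
```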

Duplicate notifications for events

Hello

I have AHA deployed in my organisation master account and several member accounts. They have all been deployed with teams notifications enabled.

Early this morning there was an event for a direct connect issue in a member account. In the AWS console, there is one event listed (same if i query the API directly).

However I received 4 teams notifications, all seemingly identical. When the issue was resolved, I again received 4 notifications.

Logs for the function only show 'Sending the alert to Teams' occurring once. What is causing this notification to be spammed to Teams? I cannot see anything in the function that might explain it, so perhaps this is a teams issue?


Thanks

Eventbridge message structure

At the moment the data sent to Eventbridge contains the account information as one field, in this format:

account-name (012345678) - That is account_name (account_id)

References:

https://github.com/aws-samples/aws-health-aware/blob/main/handler.py#L480-L481
https://github.com/aws-samples/aws-health-aware/blob/main/messagegenerator.py#L136

In order to action these events downstream using EventBridge, we're having to pull the account ID out of that field. It would be nicer to keep these separate, for example as two fields: "Account name" and "Account ID".

This comment suggests some upcoming changes to eventbus structure: #29 (comment)

Are there any updates?
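Until the event structure changes, consumers can split the combined field themselves; a sketch (hypothetical helper, assuming the "name (id)" format quoted above):

```python
import re

def split_account_field(field):
    """Split "account-name (012345678912)" into (name, id).

    Falls back to (field, None) when the pattern doesn't match, so an
    event that only carries an account ID still passes through.
    """
    m = re.fullmatch(r"(.*) \((\d+)\)", field)
    if m:
        return m.group(1), m.group(2)
    return field, None
```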

Organization Account Name Lookup Too Frequent

entity['awsAccountName'] = get_account_name(entity['awsAccountId'])

The call to get_account_name in get_affected_entities is currently inside the loop over the entities returned for the Health event. As this is only used when Org Mode is turned on, and the call to describe_affected_entities_for_organization uses an awsAccountId filter, the account name is looked up repeatedly for the same account, once per returned entity.

It would be more efficient to move the get_account_name call out to the account loop, so that it is done only once for each account in the affected_accounts list when running in Org Mode.
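Alternatively, a memoizing wrapper gives the same saving without restructuring the loops; a sketch using functools.lru_cache (the lookup function here is a stand-in for AHA's get_account_name):

```python
from functools import lru_cache

def cached(lookup):
    """Wrap an account-name lookup so each distinct account ID triggers
    at most one Organizations API call per container lifetime."""
    return lru_cache(maxsize=None)(lookup)
```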

Multi-Region deployment fails to create all lambda environment variables

When deploying AHA in multi-region mode, I am getting the error:

[ERROR] KeyError: 'ACCOUNT_IDS'
Traceback (most recent call last):
  File "/var/task/handler.py", line 872, in main
    describe_org_events(health_client)
  File "/var/task/handler.py", line 734, in describe_org_events
    if os.environ['ACCOUNT_IDS'] == "None" or os.environ['ACCOUNT_IDS'] == "":
  File "/var/lang/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None

Looking at the CloudFormation, it appears the section starting below doesn't create ACCOUNT_IDS as an environment variable for the secondary region:

Environment:
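Independent of the template fix, the handler could tolerate the missing variable with os.environ.get; a sketch (hypothetical helper, mirroring the check quoted in the traceback):

```python
import os

def account_ids_filter():
    # Tolerate the variable being absent (as in the secondary-region
    # deployment) as well as set to "None" or empty.
    raw = os.environ.get("ACCOUNT_IDS", "")
    return None if raw in ("", "None") else raw
```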

Terraform Lambda doesn't include env var for channels (Slack, Teams or Chime)

In Lambda handler.py, lines 702 to 709 call get_secret when environment variables are set for the channels to be notified (e.g. get_secret(secret_teams_name, client) if "Teams" in os.environ else "None").

The Terraform code, however, doesn't populate those variables dynamically, so the environment variables never exist and the notifications won't be sent.

I corrected it in my code by defining the Lambda variables as a local and merging the notification channels into it when they are populated.

Hope this helps.

Multi-Region Deployment failing with "Stack set operation [xxx...] was unexpectedly stopped or failed"

I am unable to deploy in multiple regions. While I have successfully deployed to us-east-1 and tested, I have since tried to add us-west-1 or us-west-2 (on separate attempts) as alternate regions and they fail with the same error:

Resource handler returned message: "Stack set operation [55c...] was unexpectedly stopped or failed" (RequestToken: a820..., HandlerErrorCode: InternalFailure)

The following resource(s) failed to create: [AHASecondaryRegionStackSet]. The following resource(s) failed to update: [LambdaFunction].

Unfortunately, that isn't very much to go on. Any ideas or suggestions? Has anyone else experienced this issue?

/AHA-LambdaFunction calls GetSecretValue even though no MS Channel is defined

Our CloudTrail alarming reports this error when deploying the Lambda with only a Slack URL:
arn:aws:sts::xxx:assumed-role/AHA-LambdaExecutionRole-ejo5owz1/AHA-LambdaFunction-ejo5owz1 called GetSecretValue but failed due to AccessDenied

Cause: the IAM policy is only created when the string is not empty, but the code cannot know whether the channel ID was empty. Therefore it tries to fetch it and fails.

SubscriptionRequiredException Error

Getting this error when I go to deploy the solution in a member account that belongs to an AWS Organization. The member account does have a Business support plan.

[ERROR] ClientError: An error occurred (SubscriptionRequiredException) when calling the DescribeEventsForOrganization operation:
Traceback (most recent call last):
  File "/var/task/handler.py", line 872, in main
    describe_org_events(health_client)
  File "/var/task/handler.py", line 720, in describe_org_events
    for response in org_event_page_iterator:
  File "/var/runtime/botocore/paginate.py", line 255, in iter
    response = self._make_request(current_kwargs)
  File "/var/runtime/botocore/paginate.py", line 334, in _make_request
    return self._method(**current_kwargs)
  File "/var/runtime/botocore/client.py", line 391, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 719, in _make_api_call
    raise error_class(parsed_response, operation_name)

This can be closed. The issue was that we don't have a Business support plan on the AWS root account where the AWS Organization lives.

Thanks

Clarify docs a bit: Enabling HEALTH API on Org Level

Could you please update the docs a bit just to clarify the following, I think it might be useful for other people:

  1. When enabling the Health API, does the "management" account refer to the payer account?
  2. When making use of the Health API (enabled at the organizational level), should one access the API from the payer account itself, or can you use an account below the payer account in the account hierarchy, or perhaps even one of the member accounts?

Name caching

Is there an issue with the way the name lookup is set as a global variable in the Lambda? I might be lacking knowledge of how Lambda functions work, but my understanding is that once the Lambda has been invoked, a global variable is set and not initialised again. As the function is generally kept warm for longer than the invocation period (1 minute, looking at the Terraform example), I don't see how a change in the DNS record would be picked up.

I will do a PR with a proposed fix.

LambdaExecutionRole creation error

Policy statement must contain resources. (Service: AmazonIdentityManagement; Status Code: 400; Error Code: MalformedPolicyDocument; Request ID:redacted ; Proxy: null)

Question: Is it possible to simulate Health events in AWS?

This is an AWS question, but I figured you might know due to the nature of this project.

I have recently installed AHA for our organization. I was able to test some events by setting the hours back to 4000 for one lambda run. The problem is that only issues showed up for the timeframe before I had enabled Organizational View for Health. I would like to test other event types proactively instead of waiting for them to happen in the wild.

Is there a mechanism in AWS that allows test health events to be created that come in through the API's like a normal event? If not, how do you test the code for this project?

aws-cdk support

It would be awesome if there were out-of-the-box aws-cdk support for this. I saw that there's beta support for terraform, but CDK support would be awesome!

Route messages to (Slack) channels based on account name (or number) & CDK?

Just to ensure we aren't about to reinvent a wheel, is anyone aware of an existing method to direct different messages to different webhooks based on (ideally) account name?

We have a slack channel per account and we'd like to send messages accordingly. Slack have made mention that having one app across multiple channels is on their radar but without ETA.

In the meantime we found the slack Echo Bot which matches a keyword and will echo a message to another channel which is very handy, no code approach for our Ops team but it quotes the message rather than sending it a new which kind of messes up the formatting.

I had a look but no one has so far forked or submitted a PR for this have they?
Also no one has plans to reproduce this in CDK have they? It's our go to so might see about the effort involved.

Error when sending email

[ERROR] TypeError: send_email() takes 2 positional arguments but 4 were given
Traceback (most recent call last):
  File "/var/task/handler.py", line 836, in main
    describe_events(health_client)
  File "/var/task/handler.py", line 686, in describe_events
    update_ddb(event_arn, str_update, status_code, event_details, affected_accounts, affected_entities)
  File "/var/task/handler.py", line 484, in update_ddb
    send_alert(event_details, affected_accounts_details, affected_entities, event_type="create")
  File "/var/task/handler.py", line 95, in send_alert
    send_email(event_details, event_type, affected_accounts, affected_entities)

To fix: change the send_email function signature

def send_email(event_details, eventType):

to match the similar send_org_email:

def send_email(event_details, eventType, affected_accounts, affected_entities):

get_message_for_email(event_details, eventType, affected_accounts, affected_entities) also needs updating.

@gmridula I would create a fork but hope you can just fix from this

add "Action Required" indication to action required events

When AHA sends emails to stakeholders, the important events, like AWS_RDS_PLANNED_LIFECYCLE_EVENT, get an "[Action Required]" prefix in the email subject (see pic).

Is it possible to add this indication to each event's JSON (e.g. action_required: true/false)? We would like to promote this kind of event, as they are critical.


Slack Workflow

Has anyone run into issues getting this to work with Slack Workflows?

I've created it as per the docs, but the workflow just reports Received a webhook request that was missing a required field, with no indication of what it received. If I call the workflow via its webhook with curl and the following input, it works fine.

curl -X POST https://hooks.slack.com/workflows/<redacted> -H 'Content-Type: application/json' -d '{"text":"test","account":"1234","resources":"my_resource","service":"my_service","region":"my_region","status":"my_status","start_time":"start_time","event_arn":"my_arn","updates":"none"}'

Has anyone run into this before?

Ignore Accounts File - Syntax Unclear

Calling the "ignored accounts" file a CSV is a bit confusing. I put all the accounts on one line, separated by commas, and it didn't work. I looked at the code and saw that they're actually separated by newlines.

You might want to clarify this in the docs.
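A parser that accepts either layout would also sidestep the confusion; a sketch (hypothetical helper, not AHA's actual parsing code):

```python
def parse_excluded_accounts(text):
    """Accept account IDs separated by newlines and/or commas, and
    ignore stray whitespace, so either file layout works."""
    return [
        tok.strip()
        for line in text.splitlines()
        for tok in line.split(",")
        if tok.strip()
    ]
```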

How to get notification from event bus ARN

HI,
We want this notification by email, where the only options are SES, or the event bus ARN followed by SNS. SES requires email verification and an Exchange policy allowing delivery, which is difficult.

We want to explore the event bus option. After an event is received on the event bus, it should trigger SNS, SQS, or another service, which requires an event pattern to be created. What would a sample event pattern (or parameters) look like to route all events from the event bus to SNS or SQS? It should be in JSON format; does the Lambda send events in JSON format?

Please assist

Adnan
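A sketch of the rule plumbing in Python: an EventBridge pattern matching Health events, serialized the way events.put_rule expects its EventPattern. Note "aws.health" is the source of native Health events; if you are matching events AHA forwards to a custom bus, replace it with the source value you see on a delivered event:

```python
import json

# Match every Health event on the bus; add "detail-type" or "detail"
# keys to narrow the match further.
EVENT_PATTERN = {"source": ["aws.health"]}

def rule_pattern_json():
    # EventBridge's put_rule takes the pattern as a JSON string, which a
    # rule with an SNS or SQS target then uses to filter events.
    return json.dumps(EVENT_PATTERN)
```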

Some major problems with setting up project in member account with AWS Organizations enabled

I want to setup this project's resources in a member aws account and then have aws organizations enabled. This would mean that I create a role to access the personal health dashboard info on the payer account and that all the other resources like Lambda, DDB table are on the member account.

The project's README describes that you can setup resources(Lambda, Dynamodb table etc) in a member acccount and then access the Personal Health Dashboard info via a role on the payer/top level account.

That is a false assumption, because in order for that to be possible, you will need to assume a role on the payer account and while doing so also access the Dynamodb table in the member account. That is impossible because its not possible to assume a role in another account and access resources in your current account, atleast not at the same time/moment.

Another issue I found is that the docs mention that for this type of setup you will need to make use of the variable (and give it a value with the ARN of the role you would assume from the member account in the payer account), the variable is MANAGEMENT_ROLE_ARN in either your cloudformation or terraform code.

Just do a quick search in the code for https://github.com/aws-samples/aws-health-aware/search?q=MANAGEMENT_ROLE_ARN
and you will see it is only referenced in the cloudformation and terraform code in the environment variable section. It is not referenced anywhere in the code section. So I'm not sure how the code is supposed to make use of this variable. Maybe I'm missing something here? Please correct me if I'm wrong.

Throttling exception when calling "describe_affected_entities_for_organization"

@jordanaroth @gmridula The Lambda function is throwing a throttling exception. I think the issue is at L829:

health_client = get_sts_token('health')

We could use:

config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    }
)

but this is not supported with get_sts_token.

{
  "errorMessage": "An error occurred (ThrottlingException) when calling the DescribeAffectedAccountsForOrganization operation: Rate exceeded",
  "errorType": "ClientError",
  "stackTrace": [
    "  File \"/var/task/handler.py\", line 849, in main\n    describe_org_events(health_client)\n",
    "  File \"/var/task/handler.py\", line 742, in describe_org_events\n    affected_org_accounts = get_health_org_accounts(health_client, event, event_arn)\n",
    "  File \"/var/task/handler.py\", line 323, in get_health_org_accounts\n    for event_accounts_page in event_accounts_page_iterator:\n",
    "  File \"/var/runtime/botocore/paginate.py\", line 255, in __iter__\n    response = self._make_request(current_kwargs)\n",
    "  File \"/var/runtime/botocore/paginate.py\", line 332, in _make_request\n    return self._method(**current_kwargs)\n",
    "  File \"/var/runtime/botocore/client.py\", line 386, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n",
    "  File \"/var/runtime/botocore/client.py\", line 705, in _make_api_call\n    raise error_class(parsed_response, operation_name)\n"
  ]
}
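Until retries can be threaded through get_sts_token, a generic backoff wrapper around the throttled call is one workaround; a sketch (names hypothetical; the real handler should catch botocore's ClientError for ThrottlingException rather than every exception):

```python
import random
import time

def with_backoff(fn, *args, max_attempts=10, base=0.5, cap=10.0, **kwargs):
    """Retry fn with jittered exponential backoff.

    In handler.py this would wrap the paginated Health API call; the
    bare `except Exception` below is only for the sketch.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random fraction of the capped delay
            time.sleep(min(cap, base * 2 ** attempt) * random.random())
```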

Notifications not associated to a region

Hello Jordan. The solutions is working great in ms-teams. Thanks!

I've configured a couple of regions of interest, but I see that some notifications don't appear when they are not associated with a region. Do I need to select global for this? What happens if I select "us-east-1,us-east-2,global": will I receive all events worldwide, or only us-east-1/us-east-2 plus those not associated with a region?

thanks

Further Customizing Delivery Subscriptions

Would it be possible to have a way to easily customize the delivery based on custom metadata or perhaps AWS Organizations Account Tags to send emails to custom addresses on a per account basis?

Use case is to deliver appropriate notifications to: Custom Account Owners, Tenant Application Owners or specific resource owners based on metadata / tags not really supported by AWS today.

I suspect it would be possible to custom code something off EventBridge and Dynamo by accountId, but I get concerned about the API limits for querying Account Tags in organizations. It would be better to have a solution to register/subscribe custom destinations in the DynamoDB.

Cannot consume terraform as a module

Because the *.py scripts are in the root folder, I cannot use the Terraform code as a module:
source = "github.com/aws-samples/aws-health-aware//terraform/Terraform_DEPLOY_AHA?ref=v2.01"
so I can't store the configuration in my source code; I need to fork/download and update the code manually.

If you could change this, that would be great.

Reminders of AWS_EC2_INSTANCE_STOP_SCHEDULED

Could this monitor be extended to repeat notifications of upcoming AWS_EC2_INSTANCE_STOP_SCHEDULED alerts?
Possibly in a reverse Fibonacci sequence: 13, 8, 5, 3, 2, 1.

It would also be nice if the instance name were pulled from the Name tag.

Health Events With No End Date

Some health events never provide an end date or closure event. We are having a difficult time determining when an event has ended or closed. What logic does the AWS Health Dashboard use for events without a provided end date, closure event, or status? It looks like it may use Last update time.

Example Events
RDS operational notification
OpenSearchService service software update available
VPN redundancy loss

*I see a new EventBridge schema for [aha-2.1-beta] and it looks like this may be solvable in that release.

CloudTrail AccessDenied entry, for any unconfigured secret

Another user reported to me that they receive many "AccessDenied" entries in CloudTrail.
Upon testing this in my account environment, I was able to replicate this.

Essentially, for any endpoint not configured, I receive a CloudTrail entry with AccessDenied, stating that the AHA role does not have access to the "null" secret:
" is not authorized to perform: secretsmanager:GetSecretValue on resource: ChimeChannelID because no identity-based policy allows the secretsmanager:GetSecretValue action"

For them it's an issue because their SIEM monitors for these kinds of events.

I want to know whether this is by design, since I see that handler.py, when checking whether an endpoint is configured, looks for an AccessDenied error from the client.
