azure-monitor-baseline-alerts's Introduction

Azure Monitor Baseline Alerts (AMBA)

NOTE: Please check out the AMBA GitHub Pages site for more interactive access to the content in this repo.

Welcome to the Azure Monitor Baseline Alerts (AMBA) repo! The purpose of this site is to provide best practice guidance around key alert metrics and their thresholds.

This site is broken down into two main sections:

  1. Services: This section provides guidance for individual Azure services. For each service, there is a list of key alert metrics and the recommended thresholds.

  2. Patterns / Scenarios: This section provides guidance for common patterns / scenarios (like Azure Landing Zones), as well as policy definitions and initiatives for deploying the alerts in your environment.

Why is configuring alerts important?

When deploying Azure resources, it is crucial to configure alerts to ensure the health, performance, and security of your resources. By setting up alerts, you can proactively monitor your resources and take timely actions to address any issues that may arise.

Here are the key reasons why configuring alerts is important:

  1. Early detection of issues: Alerts enable you to identify potential problems or anomalies in your Azure resources at an early stage. By monitoring key metrics and logs, you can detect issues such as high CPU usage, low memory, network connectivity problems, or security breaches. This allows you to take immediate action and prevent any negative impact on your applications or services.

  2. Reduced downtime: By configuring alerts, you can minimize downtime by being notified of critical events or failures in real-time. This allows you to quickly investigate and resolve issues before they escalate, ensuring the availability and reliability of your applications.

  3. Optimized resource utilization: Alerts help you optimize resource utilization by providing insights into resource usage patterns and trends. By monitoring metrics such as CPU utilization, memory consumption, or storage capacity, you can identify opportunities for optimization and cost savings.

  4. Compliance and security: Configuring alerts is essential for maintaining compliance with regulatory requirements and ensuring the security of your Azure resources. By monitoring security logs and detecting suspicious activities or unauthorized access attempts, you can take immediate action to mitigate potential risks and protect your data.

  5. Proactive capacity planning: Alerts provide valuable information for capacity planning and scaling your resources. By monitoring resource utilization trends over time, you can identify patterns and forecast future resource requirements. This helps you avoid performance bottlenecks and ensure a smooth user experience.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

azure-monitor-baseline-alerts's People

Contributors

alboroni, anwather, arjenhuitema, brunoga-ms, bzabber, cassiekays, colin-domansky, dependabot[bot], dost2010, emanuel-metzenthin, github-actions[bot], heoelri, humblejay, janegilring, jcorems, jhajduk-microsoft, joeybarnes, josefehse, judyer28, lukemurraynz, mahesh-msft, mbrat2005, microsoftopensource, pla5ma, ppascan, sihbher, steph409, tagolovina, tgolovina, ymehdimsft

azure-monitor-baseline-alerts's Issues

[Question/Feedback]: Stop deploying email action if the parameter is empty

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi!

If I omit ALZMonitorActionGroupEmail within policyAssignmentParametersNotificationAssets in alzArm.param.json (because I don't want notifications by email), then an email notification to [email protected] is still configured in the action group.

Is it possible to stop this from happening?

Thanks!

[Question/Feedback]: Strategy for avoiding configuration drift?

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi all,

I have a challenge with configuration drift and wonder if you guys have any suggestions or thoughts regarding this topic.

Background

Basically, Azure Policy deploys the alerts with ARM deployments right. This ARM deployment is sort of a "deploy and forget" deployment to my understanding.

Policy Compliance, more or less, only cares if the alert resource itself is deployed, not the content of the alert. If the alert is not yet deployed, a remediation task can be executed to fix the alert.

This means that if I edit a deployed alert in any way, for example a threshold, Azure Policy will not pick this up and revert it to its original value, which is usually what you want when working with IaC (and, to some degree, Azure Policy).

This can cause some issues

  1. After AMBA is deployed, changes made in the codebase to the alerts are never deployed to the alert resources. The Policy Definition is naturally updated to reflect the changes, but as the policy assignment is compliant it never deploys the update to the alert resource.

    • Result: Configuration Drift and loss of control for IT Administrators
  2. If going with the decentralized approach where alerts are deployed to the landing zones themselves, end users can modify the alerts with undesired values causing alarm-storms or even non-working alerts.

    • Result: Configuration drift, and changing the alerts goes unnoticed and can potentially create a breach of SLA with the customer because an alert didn't fire when it should have. (Likewise this would be a problem in the centralized approach, if an IT Admin changes the alert instead)

Potential solutions:

  • Is it possible to modify policy definitions to also look for content such as thresholds? Not sure if the existenceCondition field can be this complex
  • Somehow force re-deployments on every pipeline run. Not sure if this is possible with the way Azure Policy works.
  • Make a full swap to Terraform/Bicep where the alerts are fully managed instead of with Policy. At least with Terraform, this would require a continuous pipeline that scans for subscriptions (data source) and deploys alerts for each subscription.
    • This obviously requires a complete re-write, it's more of a solution to my specific case
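The drift problem described in this issue can be pictured with a small comparison routine. This is only a sketch of the idea, not something AMBA or Azure Policy provides: the function name `find_drift` and the two sample dictionaries are hypothetical, and a real check would first have to fetch the deployed alert properties from Azure.

```python
def find_drift(baseline, deployed, path=""):
    """Return dotted paths whose values differ between a baseline and a deployed alert.

    Both arguments are plain dictionaries (hypothetical stand-ins for the
    alert properties defined in code and the ones found in Azure).
    """
    drifted = []
    for key, expected in baseline.items():
        current_path = f"{path}.{key}" if path else key
        if key not in deployed:
            # Property defined in the baseline but missing on the deployed alert
            drifted.append(current_path)
        elif isinstance(expected, dict) and isinstance(deployed[key], dict):
            # Recurse into nested settings such as the alert criteria
            drifted.extend(find_drift(expected, deployed[key], current_path))
        elif deployed[key] != expected:
            drifted.append(current_path)
    return drifted


# Illustrative values only: someone lowered the threshold in the portal
baseline_alert = {"severity": 2, "criteria": {"threshold": 90, "operator": "GreaterThan"}}
deployed_alert = {"severity": 2, "criteria": {"threshold": 50, "operator": "GreaterThan"}}

drift = find_drift(baseline_alert, deployed_alert)
print(drift)
```

A scheduled pipeline could run a comparison like this and either report the drifted paths or trigger a redeployment, which is roughly what the Terraform/Bicep option above would do implicitly.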

[Question/ Feedback]: After deployment of the assignments the managed identity does not have the assigned role it needs to remediate

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

After we have deployed the assignments of the initiatives we noticed we cannot start remediation right away. The assignment does not have the rights yet. Only after editing and saving the assignment via the portal, the rights are granted.

Is this a known issue?

Syntax error in the 6 policies for Service Health Alerts

Hi,

I picked up the six policies for resource health alerts at Services>Resources>Subscriptions. They fail to deploy with the below.

Location:
https://github.com/Azure/azure-monitor-baseline-alerts/tree/main/services/Resources/subscriptions

Error:
When you try to deploy any of the above you get the following error:

The deployment failed with error(s). Showing 1 out of 1 error(s). Status Message: The policy | 'Deploy_activitylog_ResourceHealth_Unhealthy_Alert' has defined parameters 'enabled,alertResourceGroupName,alertResourceGroupTags,alertResourceGroupLocation,effect' which are not used in | the policy rule. Please either remove these parameters from the definition or ensure that they are used in the policy rule. (Code:UnusedPolicyParameters) CorrelationId: | 1b935cef-1cd2-4162-a022-f8c44191d7a2

Cause:
It looks like all the files need a find/replace of '[[' with '['. The parameter declarations have a double opening square bracket, and that appears to be what causes the error.

Incorrect Example:
"location": "[[parameters('alertResourceGroupLocation')]",

Corrected Example:
"location": "[parameters('alertResourceGroupLocation')]",

All files in the directory need a full find/replace as it occurs all over the place. Once I corrected this all templates deployed successfully.
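The find/replace described above could be scripted roughly as follows. This is a minimal sketch under two assumptions: the doubled bracket only needs fixing where it opens a JSON string value, and the files are being deployed as standalone policy definitions. In ARM templates, `[[` is the escape for a literal `[`, so the doubled form may be intentional when the same definitions are embedded in a template. The function name `unescape_expressions` is hypothetical.

```python
import re

# Matches a JSON string value that opens with the doubled bracket, e.g. "[[parameters(
DOUBLE_BRACKET = re.compile(r'"\[\[')


def unescape_expressions(template_text):
    """Replace a leading '[[' inside JSON string values with a single '['."""
    return DOUBLE_BRACKET.sub('"[', template_text)


# The incorrect example quoted in the issue
sample = '"location": "[[parameters(\'alertResourceGroupLocation\')]",'
fixed = unescape_expressions(sample)
print(fixed)
```

Applying the function to the text of each file in the directory and writing the result to a copy (rather than in place) keeps the original, template-embedded versions intact.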

Thanks!

MissingSubscription

Hi,

I'm getting an error message after running the command from the step below.
I'm a User Access Administrator and Owner on the subscription.

  1. Create a new role assignment by subscription and object. By default, the role assignment will be tied to your default subscription. Replace $subscriptionId with your subscription ID, $resourceGroupName with your resource group name, and $assigneeObjectId with generated assignee-object-id (the newly created service principal object id).

$ az role assignment create --role contributor --subscription xxxxxxxxxxxxxxxxx --assignee-object-id xxxxxxxxxxxxxxxxxxxxxxxxx --assignee-principal-type ServicePrincipal --scope /subscriptions/xxxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/rg-tst-mgt-mon-amba-01

(MissingSubscription) The request did not have a subscription or a valid tenant level resource provider.
Code: MissingSubscription
Message: The request did not have a subscription or a valid tenant level resource provider.

[Question/Feedback]: Modifying policy definitions clarification

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hello,

The wiki page for customizing-the-amba-policies doesn't go into detail on how this is accomplished.

The reason I'm asking is that it seems to be more complicated than just editing the individual policy definitions (which is simple enough) inside the services directory. All policy definitions are, to my understanding, converted into a single big json file policies.json (which is referenced by alzArm.json used by pipeline). This is done automatically I assume, by the policies.bicep file?

So modifying the individual policies won't apply any actual changes when running the pipeline, because policies.json is not updated to reflect these changes without first running policies.bicep locally.

Am I understanding this correctly? If so, there should be documentation on how to use the policies.bicep file.

Surely we're not supposed to edit the policies.json file directly? Especially when a big portion of it looks like this:

image

image

To me it seems like the policy definitions inside the services directory are the source of truth and the rest is automated.

[Question/Feedback]: Using AMBA without a Github Repo

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi All,

I am trying to deploy AMBA using PowerShell from a locally cloned repo set up with Azure DevOps. When I try to deploy using the command below

New-AzManagementGroupDeployment -ManagementGroupId $pseudoRootManagementGroup -Location $location -TemplateFile ./patterns/alz/alzArm.json -TemplateParameterFile ./patterns/alz/alzArm.param.json

I am getting the error below. Are there limitations with deploying AMBA without using a GitHub repo, deploying instead from either Azure DevOps or locally from a cloned repo?

New-AzManagementGroupDeployment: 16:49:22 - Error: Code=InvalidTemplate; Message=Deployment template validation failed: 'The template variable 'deploymentUris' is not valid: The language expression property 'templateLink' doesn't exist, available properties are 'template, templateHash, parameters, mode, provisioningState'.. Please see https://aka.ms/arm-functions for usage details.'.
New-AzManagementGroupDeployment: The deployment validation failed

See attachment for further info.

AMBA

[Question/Feedback]: No action group deployed when remediating policies.

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

I have deployed AMBA from your GitHub Repo with the PowerShell pattern. When I remediate policies, it creates the alert rules BUT it doesn't create an action group for these. I'm a bit confused as I have followed the instructions from your GitHub pages & updated the parameter file & then deploying AMBA with:

New-AzManagementGroupDeployment -ManagementGroupId $pseudoRootManagementGroup -Location $location -TemplateUri "https://raw.githubusercontent.com/Azure/azure-monitor-baseline-alerts/main/patterns/alz/alzArm.json" -TemplateParameterFile ".\patterns\alz\alzArm.param.json"

The Policies/Initiatives are deployed & assigned. When remediating the RG/tags & alert rules are created BUT no action group deployed & linked to alert rules. I have added my email address to the ALZMonitorActionGroupEmail parameter.

Is there another step that I'm missing?

[Question/Feedback]: Exclude some resources from being monitored

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

I have a customer requirement where several VMs are being deployed, but not all VMs require alerting. This is normally the case with DevOps runners that are switched off every day at a certain time. Once this happens, a heartbeat alert is fired, which is not required for those VMs.
For more clarity, consider the following example:

There are 3 VMs. 2 VMs require alerting, but the third VM needs to be excluded.

I also looked into the Monitor Disable parameter in the JSON; however, this might have some limitations:

  1. The parameter targets scopes, meaning all VMs will be affected, and in my case I only want some VMs to have alerting disabled.
  2. In the JSON itself, Monitor Disable isn't included within the VM alerting.

[Question/Feedback]: Customizing baseline for multiple landing zones

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hello team.

I would like to understand whether AMBA is designed to be a one-time solution that is primarily used to jump start baseline monitoring for a project or is it more like ALZ pattern where users are encouraged to keep their deployment in-sync with the main GitHub repo through the use of provided scripts.

If it is the latter, then what is the thinking on multi-team environments, with multiple landing zones, where there may be a need to customize baseline alerts? For example, I have two landing zones for two dev teams, and team #2 needs to update some alert thresholds because for their use case the baseline threshold is too low and they get a lot of false positives. Do they go ahead and modify their alerts directly in the portal? That would mean that they very quickly deviate from main and risk losing their changes if the baseline is reapplied. Or is there another pattern they can use?

Thank you :)

[Question/Feedback]: The template language expression literal limit exceeded. Limit: '131072' and actual: '131072'.

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi,

We have been adding additional alerts to the solution, specifically ~13 new json policies for compute/virtualMachines.
Seems like we have reached the limit for the final processPolicySetDefinitionsAll template/JSON content.

To reproduce we added a new alert (the 14th), as well as new parameters for the new alert in Deploy-LandingZone-Alerts.json

Full error:

ERROR: InvalidTemplate - Deployment template validation failed: 'The template variable 'processPolicySetDefinitionsAll' is not valid: The template language expression literal limit exceeded. Limit: '131072' and actual: '131072'. Please see https://aka.ms/arm-functions for usage details.. Please see https://aka.ms/arm-functions for usage details.'.

This is a limitation in ARM/Bicep and is discussed here as well Azure/bicep#4293 ("Increase MaxLiteralCharacterLimit for loadTextContent (specifically for non-JSON content)")

If this is an actual issue and not an error on my end I suspect this repo will reach this limit soon as well as new alerts are added.
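A pre-deployment check along these lines might help spot the problem before the template is submitted. This is a rough sketch: the limit value is taken from the error message above, while the function name `over_limit`, the sample inputs, and the choice to treat "at the limit" as failing are assumptions made for illustration.

```python
# Limit quoted in the error message: "Limit: '131072' and actual: '131072'"
LITERAL_LIMIT = 131072


def over_limit(contents, limit=LITERAL_LIMIT):
    """Return the names of text blobs whose length is at or above the limit.

    `contents` maps a name (for example a file name) to the text that would
    be loaded into a single template literal.
    """
    return sorted(name for name, text in contents.items() if len(text) >= limit)


# Synthetic data standing in for policy set definition files
candidates = {
    "small-initiative.json": "x" * 10_000,
    "large-initiative.json": "x" * 131_072,
}
flagged = over_limit(candidates)
print(flagged)
```

In a pipeline, the dictionary would be built by reading the generated policy set definition files, and a non-empty result would fail the build early with a clearer message than the deployment error.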

[Question/Feedback]: Unable to deploy via pipeline using ubuntu-latest

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi,

I'm having an issue deploying AMBA as part of our ADO pipeline. When using image version 'Ubuntu-Latest', I get a script failed message saying the parameter file could not be parsed. However, when using 'Windows-Latest' the deployment works just fine. There are no errors in the param file and the pipeline stage is basically identical to the example provided in the repo. Any advice on this would be greatly appreciated, cheers!

Screenshot 2023-11-02 142413

[Question/Feedback]: Syntax defaultValue vs defaultvalue

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi,

I am currently writing the Terraform implementation for AMBA. When I deploy the policies via the script and then manage them via Terraform, I see configuration drift for 12 policies. As an example, I put one here, where you can see the changes in pink: defaultvalue vs defaultValue and allowedValues vs allowedvalues.

When you look at the definition file and search for e.g. defaultvalue, you can see that it is inconsistent across the JSON files. Is there a reason for it?
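The inconsistency can be located mechanically. The following is a minimal sketch, assuming the definitions are available as parsed JSON; the function names and the sample document are hypothetical and only mirror the two key pairs mentioned above.

```python
import json
from collections import defaultdict


def collect_key_casings(node, seen=None):
    """Walk parsed JSON and group every object key by its lowercase form."""
    if seen is None:
        seen = defaultdict(set)
    if isinstance(node, dict):
        for key, value in node.items():
            seen[key.lower()].add(key)
            collect_key_casings(value, seen)
    elif isinstance(node, list):
        for item in node:
            collect_key_casings(item, seen)
    return seen


def inconsistent_keys(node):
    """Return only the keys that appear with more than one casing."""
    return {
        lowered: sorted(variants)
        for lowered, variants in collect_key_casings(node).items()
        if len(variants) > 1
    }


# Made-up fragment with the two casings described in the issue
sample = json.loads("""
{
  "parameters": {
    "severity": {"defaultValue": "2", "allowedValues": ["1", "2"]},
    "enabled": {"defaultvalue": "true", "allowedvalues": ["true", "false"]}
  }
}
""")
result = inconsistent_keys(sample)
print(result)
```

Running the same walk over every definition file would give a list of keys to normalize before comparing against the Terraform state.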

image

[Question/Feedback]: Action groups with other than email for distribution

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi,
first of all I like the whole concept here. I am currently working on implementing it in a bigger scale for my company.
You stated in the documentation:

We may investigate and implement alternative or additional actions in the future (e.g. configure alternate email distribution groups depending on the subscription/service or workload owners/etc.).

Are you still considering that?
I just wanted to highlight it as a very sought-after feature on my side ;)

[Question/Feedback]: Ability to customise/filter Service Health Alerts without config drift?

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hey team,

We have a lot of customers and want to adopt AMBA on a wider scale as a baseline rollout for our ALZ alerts. However, we are seeing a potential issue with a lot of noise on the service health alerts as we'll get notifications for all types even if they are not relevant for that customer.

I've checked the param file and it doesn't seem like there is currently a way to filter the service health alerts to specific regions, and services which would be beneficial for this.

I can only find the custom deployments information which may have some use for us but we wanted to try and avoid config drift from the project if possible (and amending the policies seems the only way?). https://azure.github.io/azure-monitor-baseline-alerts/patterns/alz/deploy/Deploy-only-Service-Health-Alerts/#custom-deployment

If there is no easy way to limit the service health alerts to a specific region(s) and/or Azure services for the individual deployments via params, then it would be nice to have some easy mechanism to do so without having major config drift.

If anyone has insight for how they're doing this with AMBA, that would be welcomed to learn about also! Thank you!

[Question/Feedback]: Ability to update the Action Group displayname/shortname

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

In the scenario where we receive alerts from several different customers and tenants, we would like to be able to assign a different display name to the Action Group that is deployed with AMBA. This lets us know which tenant and customer the alert belongs to, which is helpful since we are monitoring several different environments.

Is there a way to modify this today?

The way I have solved it so far is to run the Azure CLI on a schedule, which updates the display name in case a new deployment has happened and a new action group has been created with the default ActGrp display name we have today.

ID=$(az monitor action-group list --query "[?name.contains(@,'AMBA')].id" --output tsv)
az monitor action-group update --ids $ID --group-short-name "<customer shortname>"

[Question/Feedback]: Use a single centralized action group?

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Seeing as it's possible to use a single action group across multiple subscriptions' alerts, would it be possible to build this deployment to take advantage of that? I work with a lot of SMBs that tend to have smaller LZ deployments and smaller internal teams, so there's not always appetite for the same action group repeated over a dozen subscriptions. I can manually modify the policies to use a single centralized action group but the build process handling this could be nice.

[Question/Feedback]: Multiple AMBA deployments in the same tenant.

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Is it possible to deploy multiple AMBA instances in the same tenant?
We have one PROD management group, containing application specific management groups and a NON-PROD management group containing application specific non-prod management groups.

Is it possible to deploy 1 AMBA instance into the PROD management group and one in the NON-PROD management group?
So two separate "alzArm.param.json" files.

Thanks

[Feedback]: Deploy PIP VIP Availability Alert - Only for Standard SKU Public IP Addresses

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

https://github.com/Azure/azure-monitor-baseline-alerts/blob/main/services/Network/publicIPAddresses/Deploy-PIP-VIPAvailability-Alert.json

"VipAvailability" metric is only available on the Standard SKU. The policyRule should be:

   "policyRule": {
      "if": {
        "allOf": [
          {
            "field": "type",
            "equals": "Microsoft.Network/publicIPAddresses"
          },
          {
            "field": "Microsoft.Network/publicIPAddresses/sku.name",
            "equals": "Standard"
          },
          {
            "field": "[[concat('tags[', parameters('MonitorDisable'), ']')]",
            "notEquals": "true"
          }
        ]
     }
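To illustrate which resources the proposed rule would select, the three conditions can be mirrored in a few lines of Python. This is a sketch of the logic only, not how Azure Policy evaluates rules: the function name, the resource dictionaries, and the default tag name `MonitorDisable` (in the policy it is the value of the `MonitorDisable` parameter) are assumptions, and the comparisons are made case-insensitive here as a simplification.

```python
def pip_alert_applies(resource, monitor_disable_tag="MonitorDisable"):
    """Mirror the 'allOf' block: type matches, SKU is Standard, disable tag is not 'true'."""
    tags = resource.get("tags") or {}
    sku_name = (resource.get("sku") or {}).get("name", "")
    return (
        resource.get("type", "").lower() == "microsoft.network/publicipaddresses"
        and sku_name.lower() == "standard"
        and str(tags.get(monitor_disable_tag, "")).lower() != "true"
    )


# Hypothetical resources for illustration
standard_pip = {"type": "Microsoft.Network/publicIPAddresses", "sku": {"name": "Standard"}}
basic_pip = {"type": "Microsoft.Network/publicIPAddresses", "sku": {"name": "Basic"}}
disabled_pip = {
    "type": "Microsoft.Network/publicIPAddresses",
    "sku": {"name": "Standard"},
    "tags": {"MonitorDisable": "true"},
}

print(pip_alert_applies(standard_pip), pip_alert_applies(basic_pip), pip_alert_applies(disabled_pip))
```

With the extra SKU condition, a Basic SKU address is skipped instead of receiving an alert on a metric it does not emit.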

[Question/Feedback]: Alert rules for VMs from Azure Monitor Metrics

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

The AMA agent for Windows VMs now has (Preview) support for sending VM guest performance data to Azure Monitor Metrics instead of the InsightsMetrics table in Log Analytics.

However, the VM alert rules (and Grafana templates) in AMBA currently only look in the InsightsMetrics table. Will we see support for alert rules based on Metrics in AMBA in the near future, or would you recommend sticking with Log Analytics based performance logs/metrics for new deployments for some time?

[Question/Feedback]: Log based "catch all" alerts vs Metric based alerts scoped to resource

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi,

Really great work in this repo, it looks interesting.
I just looked through the code for Compute alerts today, and from what I can see you are taking an approach where you are using log based alerts with a broad scope, basically having the same configuration (thresholds etc.) for all resources in the scope. Is there a reason for doing this, instead of using Metric alert rules and doing an alert rule per resource, which would allow for custom settings per resource?

I'm not saying one or the other is the right way, just curious as this is one of the challenges I'm facing in similar projects. For compute for example I'm still using log based alerts for disk monitoring, but CPU/memory is metric based and scoped to a single resource, and can then be adjusted to that resource.

Happy to test and provide feedback!

Best regards,
Jesper Bing

Can AMBA respect existing single user defined Managed Identity

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Deployment of AMBA Policies includes what seem to be system assigned managed identities for each of the ALZ areas as shown here (my deployment is AMBA Unaligned so you see them all in one place under the MG with that same name):
image

So that's 6 MIs that are Contributors in the existing customer management group.
The customer has a security policy to instead use their own, single User Defined Managed Identity for all Policy contributor roles.

The customer switched to their own MI after deployment and it was working fine, but now every subsequent deployment overwrites their MI and requires cleanup again (not acceptable).

image

Is there not a way we can parameterize this value in the parameters template and have it use the customer's User Defined MI instead?

[Feedback]: Failed to build site locally using Hugo Server

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

When trying to run hugo server with latest changes, the following error is occurring:
$ hugo server -D
Start building sites …
hugo v0.111.3-5d4eb5154e1fed125ca8e9b5a0315c4180dab192+extended windows/amd64 BuildDate=2023-03-12T11:40:50Z VendorInfo=gohugoio
Error: Error building site: failed to render shortcode: "C:\Users\judyer\source\repos\amba\azure-monitor-baseline-alerts\docs\content\patterns\alz\deploy\Deploy-with-Azure-Pipelines.md:6:1": failed to render shortcode "include": failed to process shortcode: "C:\Users\judyer\source\repos\amba\azure-monitor-baseline-alerts\docs\layouts\shortcodes\include.html:2:5": execute of template failed: template: shortcodes/include.html:2:5: executing "shortcodes/include.html" at <$p.RenderShortcodes>: can't evaluate field RenderShortcodes in type page.Page
Built in 6416 ms

Can we add a Parameter to adjust "Number of Violations"

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

We noticed that we cannot adjust options which appear as "Advanced" in the portal, using the AMBA json.
While we are extremely appreciative of the effort the AMBA team invested in picking the best practice alerts, we found some cases where we would still like to use the AMBA repo and deploy from it, but we don't have the flexibility in the template parameters file to make the adjustments we need.

This is an example. There are probably more, but this one is the one we need to adjust now. We get too many false positives with the value as provided.

VMHeartbeat
So in the portal we see that there is an adjustment allowed to set the # of violations from 1 - 6, with the default being 2 and AMBA hardcodes this as 2 (since there is no parameter to change it).
image

Here is the JSON template parameter file. Notice there is no option (parameter) to adjust that.

    "VMHeartBeatRGAlertSeverity": {
      "value": "0"
    },
    "VMHeartBeatRGWindowSize": {
      "value": "PT15M"
    },
    "VMHeartBeatRGEvaluationFrequency": {
      "value": "PT5M"
    },
    "VMHeartBeatRGAutoMitigate": {
      "value": "true"
    },
    "VMHeartBeatRGAutoResolve": {
      "value": "true"
    },
    "VMHeartBeatRGAutoResolveTime": {
      "value": "00:10:00"
    },
    "VMHeartBeatRGPolicyEffect": {
      "value": "deployIfNotExists"
    },
    "VMHeartBeatRGAlertState": {
      "value": "true"
    },
    "VMHeartBeatRGThreshold": {
      "value": "10"
    },
    "VMHeartBeatRGOperator": {
      "value": "GreaterThan"
    },
    "VMHeartBeatRGTimeAggregation": {
      "value": "Average"
    },

Might it be possible to expose that as a parameter and include it in the next build?
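For readers unfamiliar with the setting, the effect of "number of violations" can be sketched as a count over recent evaluation periods. This is a simplified illustration with hypothetical names; the actual evaluation is performed by Azure Monitor and may differ in detail.

```python
def should_fire(period_results, evaluation_periods, min_failing_periods):
    """Fire when at least `min_failing_periods` of the last `evaluation_periods` violated the threshold.

    `period_results` is a chronological list of booleans, one per evaluation
    (True means the threshold was breached in that period).
    """
    recent = period_results[-evaluation_periods:]
    return sum(recent) >= min_failing_periods


# A single missed heartbeat window (for example a short reboot)
one_blip = [False, False, True]
# Two breached windows among the last three evaluations
sustained = [True, False, True]

print(should_fire(one_blip, evaluation_periods=3, min_failing_periods=2))
print(should_fire(sustained, evaluation_periods=3, min_failing_periods=2))
```

Raising the minimum number of failing periods is what reduces false positives from short outages, which is why exposing it as a parameter would help in the case described above.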

[Question/Feedback]: Is it possible to exclude subscriptions (remove amba from them) after remediation ?

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

https://azure.github.io/azure-monitor-baseline-alerts/patterns/alz/deploy/Remediate-Policies/

I have remediated the policy initiatives but noticed that remediation also occurs on subscriptions outside my landing zones management group (but inside the pseudo-root management group). I want to turn this off for these 'out of scope' subscriptions because this is not intended.
Can I just add an exclusion scope on the initiative assignments and remove the resources that were deployed to the out-of-scope subscriptions, or is there another way to achieve this?

[Question/Feedback]: AMBA & EPAC

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

To start with, awesome job, I only came across this from watching the most recent ALZ community call recording and have been very impressed with what has been setup!

Recently within our organisation we began to implement the Enterprise Policy as Code (EPAC) project, and the first thing I did was import the ALZ policies. Are there plans to also add this to EPAC, so I could easily use that as my source of ALZ policy truth, with the ease of pulling any updates directly into EPAC now and then?

Again, happy to find this project, immediately bookmarked / starred and will be keeping a close eye on it!

[Question/Feedback]: No VMs resources are listed as compliant.

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

I am working on deploying AMBA. I tried deploying to different tenants, using aligned and unaligned deployments, either through a pipeline or the CLI. Everything is working great except for the VM policies, which mark all VMs as non-compliant. The VMs were created an hour after the AMBA deployment, and the remediation task gives nothing except the errors below.

image image

[Question/Feedback]: Automatically delete alert rules

Description

Hello,
I have tried this project out and it seems to be working well. However, when I delete a resource, the alert rules stay. This is presumably working as intended, because the policy only deploys alert rules and nothing else. I want the corresponding alert rules to be deleted automatically when a resource is deleted. What's the best way to approach this?

Unaligned environments will get duplicate alerts

Description

I found that many of the policy definitions exist in more than one initiative. I think I understand that in a CAF-aligned subscription deployment the services run in different scopes, so it works fine there.

But for "Unaligned" environments where the scope is the same, we get duplicate alerts provisioned.

image

image

image

image

image
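The overlap described in this issue can be checked mechanically by listing which initiatives contain each policy definition at a given assignment scope. The sketch below is illustrative only: the initiative names are the ones used elsewhere in this thread, while the scope name `mg-unaligned`, the policy definition names, and the `find_duplicates` helper are hypothetical and not taken from the AMBA repo.

```python
from collections import defaultdict

# Hypothetical data: initiative -> (assignment scope, member policy definitions).
assignments = {
    "Alerting-Identity": ("mg-unaligned", ["vm-cpu-alert", "kv-latency-alert"]),
    "Alerting-Management": ("mg-unaligned", ["vm-cpu-alert", "aa-job-alert"]),
    "Alerting-LandingZone": ("mg-unaligned", ["vm-cpu-alert", "storage-availability-alert"]),
}


def find_duplicates(assignments):
    """Map (scope, definition) -> initiatives, keeping only pairs covered by more than one initiative."""
    owners = defaultdict(list)
    for initiative, (scope, definitions) in assignments.items():
        for definition in definitions:
            owners[(scope, definition)].append(initiative)
    return {key: names for key, names in owners.items() if len(names) > 1}


duplicates = find_duplicates(assignments)
for (scope, definition), initiatives in duplicates.items():
    print(f"{definition} at {scope}: {', '.join(initiatives)}")
```

In an aligned deployment each initiative would carry a different scope, so the same definition would produce distinct `(scope, definition)` keys and nothing would be reported.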

Migration to GA AMBA blocked by remediation due to conflict

Description

I had been running pre-GA "ALZ-monitor" in an unaligned environment (one MG and one subscription) and yesterday decided to migrate to the azure-monitor-baseline-alerts GA version.
I saw the documentation indicating that a cleanup was required and that there was no 'migration', so I executed the cleanup scripts as guided in the docs:

  1. PS /home/steve/AMBAGA/azure-monitor-baseline-alerts/patterns/alz/scripts> ./Start-ALZMonitorCleanup.ps1 -WhatIf
  2. PS /home/steve/AMBAGA/azure-monitor-baseline-alerts/patterns/alz/scripts> ./Start-ALZMonitorCleanup.ps1 -Force
    I got good output:

Found '0' resource groups with tag '_deployed_by_alz_monitor=True' to be deleted.
Found '5' policy assignments with metadata '_deployed_by_alz_monitor=True' to be deleted.
Found '7' policy set definitions with metadata '_deployed_by_alz_monitor=True' to be deleted.
Found '134' policy definitions with metadata '_deployed_by_alz_monitor=True' to be deleted.
Found '5' role assignment with description '_deployed_by_alz_monitor' to be deleted.
WARNING: This script will delete the resources discovered above.
Deleting alert resources...
Several pages of "True" followed by
True
Deleting role assignments...
Cleanup complete.

Then I forked the repository, updated the JSON parameters file, and ran the deployment against the same AMBA-Unaligned MG, which contains the one subscription with all my resources. I waited until all the remnants of alerts and policies were gone before starting.

Then I ran the deployment which completed without errors:
PS /home/steve/AMBAGA/azure-monitor-baseline-alerts/patterns/alz> New-AzManagementGroupDeployment -Name "amba-GeneralDeployment" -ManagementGroupId $pseudoRootManagementGroup -Location $location -TemplateUri https://raw.githubusercontent.com/stevedistef/azure-monitor-baseline-alerts/main/patterns/alz/alzArm.json -TemplateParameterFile "./alzArm.param.json"

snip: Outputs :
Name Type Value
=============== ========================= ==========
deployment String "amba-GeneralDeployment has successfully deployed."

OK, so now it is on the REMEDIATION. It was 4PM EST for me so I ran remediation fully expecting it would require a couple hours....

I ran these one line at a time and waited for each to complete before proceeding.

$pseudoRootManagementGroup = "AMBA-Unaligned"
$identityManagementGroup = "AMBA-Unaligned"
$managementManagementGroup = "AMBA-Unaligned"
$connectivityManagementGroup = "AMBA-Unaligned"
$LZManagementGroup= "AMBA-Unaligned"
#Run the following commands to initiate remediation (one at a time).
.\patterns\alz\scripts\Start-AMBARemediation.ps1 -managementGroupName $managementManagementGroup -policyName Alerting-Management
.\patterns\alz\scripts\Start-AMBARemediation.ps1 -managementGroupName $connectivityManagementGroup -policyName Alerting-Connectivity
.\patterns\alz\scripts\Start-AMBARemediation.ps1 -managementGroupName $identityManagementGroup -policyName Alerting-Identity
.\patterns\alz\scripts\Start-AMBARemediation.ps1 -managementGroupName $LZManagementGroup -policyName Alerting-LandingZone
.\patterns\alz\scripts\Start-AMBARemediation.ps1 -managementGroupName $pseudoRootManagementGroup -policyName Alerting-ServiceHealth
.\patterns\alz\scripts\Start-AMBARemediation.ps1 -managementGroupName $pseudoRootManagementGroup -policyName Notification-Assets

I saw the new Policy initiative go green and went home.
image

This morning, I arrived and saw only a small improvement:
image

Looking closer at Connectivity:
image

I looked inside the connectivity initiative and saw that the resources are still showing they require remediation:
image

And look at the error message from 17 hours ago:
image

It says there is a conflict with data that was there from 3/14 (yesterday). Is this the older AMBA which I cleaned up???

So I am looking across all remediation tasks from yesterday afternoon:
image

Notice the one from this morning at ~9AM, which was me attempting a manual remediation against just the DNS Zones....
It now (this morning manual remediation) says completed:
image

Yet it's still showing non-compliant, but look at the reason here!!!!????
image

This might require a shared desktop session with someone before I mess up the environment with multiple remediation retries....

[Question/Feedback]: Linked template validation failed

Description

Hi All,

I get the following error on the alzArm.json file having taken a forked copy of the code into VS code. The linked template in question is policies.json.

Thanks.

Template validation failed: The template variable 'policySetDefinitionsAll' is not valid: The template function 'managementGroup' is not valid. Please see https://aka.ms/arm-template-expressions for usage details.. Please see https://aka.ms/arm-template-expressions for usage details.

VM Alerts deploy is failing

Description

Deploying VM Alerts is failing.
For example: Deploy VM OS Disk Write Latency Alert
Deploy Error: Status: Conflict[(Error details)]
Details:
{
"code": "DeploymentFailed",
"message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
"details": [
{
"code": "BadRequest",
"message": "{\r\n "error": {\r\n "code": "DraftClientException",\r\n "message": "The request had some invalid properties Activity ID: 3332f9c0-b4d4-464b-8ec4-44a670ba745b."\r\n }\r\n}"
}
]
}

I can't see action 'Deploy AMBA' in GitHub Actions

Description

Hi,

In my (forked) GitHub repo I cannot see the action "Deploy AMBA" in GitHub Actions.

Instruction/Article:
Page: https://azure.github.io/azure-monitor-baseline-alerts/patterns/alz/deploy/Deploy-with-GitHub-Actions/

image

In the above picture it says to 'Go to GitHub actions and run the action Deploy AMBA'
When I go to 'Actions' in GitHub, I don't see the action Deploy AMBA.
Instead I'm seeing this:

image

Am I doing something wrong or looking in the wrong place?
Does the action 'Deploy AMBA' refer to a workflow which has to be run?

[Question/Feedback]: Several Policies became Non Compliant since 10/3 with reason - NoRegisteredProviderFound

Description

I initially deployed AMBA from the now-archived repository on 10/3/2023. I deployed unaligned, with all initiatives, into an existing environment. All policy definitions were successfully remediated, all initiatives showed 100% compliant, and all alert rules were created.

My customer and I have both since experienced the following. Several initiatives now have definitions which appear as non-compliant, but the alert rules are still in place, and they still work (e.g., deleting an NSG raises the alert). However, the dashboard shows non-compliant for the definitions shown below. And the two environments are identical! And remediation actually succeeds, yet the definitions still show as non-compliant.

I am thinking that since we moved AMBA to a new GitHub repo, maybe some of the Bicep templates have changed, but that doesn't explain why an existing deployed environment that has had no Azure services added or deleted in the last month would suddenly change.

I don't want to just clean up and start over because this happened to a customer in his PROD environment, and I want to be certain of the reason why we might suggest that.

Is anyone else seeing this?

image

image

image

image

image

image

This Compliance Reason is the same for all of them across all initiatives....
image

This shows I remediated the NSG policy definitions SUCCESSFULLY, yet an hour later they are still showing non-compliant.
image

image

image

[Bug]: Start-AMBACleanup is not removing role assignment

Description

I think that the Start-AMBACleanup script isn't removing the role assignments - it is leaving these orphaned roles.
image
At the management group levels underneath this there is the same thing as well - one for each of identity/management/connectivity management groups e.g.
image

I noticed this due to attempting to deploy the solution multiple times and it not being able to deploy the role assignments again (said it was unable to update them)

error message Start-AMBACleanup.ps1

Hi,

I'm getting the below error message after running the AMBA cleanup script Start-AMBACleanup.ps1

Any idea what could be going wrong or where to look?

Line |
210 | … ch-Object { Remove-AzResource -ResourceId $_ -Force:$force -Confirm:( …
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| NoRegisteredProviderFound : No registered resource provider found for location 'global' and API
| version '2015-01-01' for type 'actionRules'. The supported api-versions are
| '2018-11-02-privatepreview, 2019-05-05-preview, 2021-08-08-preview, 2021-08-08, 2023-05-01-preview'.
| The supported locations are 'global'. CorrelationId: b25a3211-e260-486

208: # delete alert processing rules
209: Write-Host "Deleting alert processing rule(s)..."
210: $alertProcessingRuleIds | Foreach-Object { Remove-AzResource -ResourceId $_ -Force:$force -Confirm:(!$force) }
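The error text itself lists the API versions the provider accepts for `actionRules`, and the version that was requested (`2015-01-01`) is not among them. As a rough illustration of reading that list, the sketch below picks the newest non-preview entry. The version strings are copied from the message above; the `latest_stable` helper is hypothetical, and whether the cleanup script should pin a version explicitly (for example through the `-ApiVersion` parameter of `Remove-AzResource`) is an assumption, not something confirmed in this thread.

```python
# API versions copied from the NoRegisteredProviderFound message above.
supported = [
    "2018-11-02-privatepreview",
    "2019-05-05-preview",
    "2021-08-08-preview",
    "2021-08-08",
    "2023-05-01-preview",
]
requested = "2015-01-01"


def latest_stable(versions):
    """Return the newest version without a preview suffix, or None if there is none."""
    stable = [v for v in versions if "preview" not in v]
    # ISO-style date strings sort correctly as plain text.
    return max(stable) if stable else None


print(requested in supported)
print(latest_stable(supported))
```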

[Question/Feedback]: Deploy AMBA Notification Assets policy default parameter for ALZMonitorActionGroupEmail cannot be changed.

Description

We are testing an assignment for AMBA and noticed that all the parameters we changed work fine and show the user-defined value, except ALZMonitorActionGroupEmail, which stays at the default even though we changed it as well. The code we have for it is below:

"policyAssignmentParametersServiceHealth": {
"value": {
"ALZMonitorActionGroupEmail": {
"value": "[email protected]"
},

[Question/Feedback]: Multiple action group for incident category

Description

Hello everyone,
Currently, when we deploy the Service Health initiative, it creates one action group for all the different Service Health incident categories.
This is limiting for some use cases, where a customer has dedicated teams that should only receive security-type service incidents while other teams receive other types of Service Health incidents. Is there a way we can modify the script so that it creates one action group per Service Health alert type?

RV Backup Policy Definition not automatically remediating

Description

I have tried this in my own test environment and could not reproduce it. However, a customer reported that this happened on 5 RSVs in his sandbox environment. The issue occurs after we deploy this policy:
image
https://github.com/Azure/alz-monitor/blob/main/src/resources/Microsoft.Authorization/policyDefinitions/deploy-rv_backuphealth_monitor.bicep

It shows the customer's Recovery Vaults as non-compliant, as expected.
image
image

and this seems accurate
image

After the customer tried remediation, it didn't bring any of them into compliance.
I have manually checked that box, and the next time the policy runs it shows the RV as compliant (manual workaround).

image

Had some discussion internally about this and was told:
The non-compliance detail you shared is correct; any RSV that has not been remediated will show empty current values. By default this configuration doesn't exist on the RSV, therefore the value is empty. (You can ignore MonitorDisable; it is only used when you want to exclude certain resources from being monitored.)

We were asked to investigate these things:
Can you review the following in the customer environment:

  • Get the details of the remediation task. In particular the related events.
  • Does the RSV have any resource locks? A read-only lock will cause the remediation to fail.
  • Can you review the Managed Identity of the Policy Assignment. For the remediation to work, we require at least sufficient permissions to modify the RSV. (By default Contributor is assigned)

[Question/Feedback]: Add severity as a additional parameter of Alert name

Description

After some testing of this solution, I can't see a way to run the same type of alert (e.g. CPU usage) with two severities for the same resource, e.g. 70% generates a Warning alert and 90% generates a Critical one.

Is it possible to modify the policy definitions so that severity is included in the concat function that builds the name attribute?
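To make the naming problem concrete: if the alert rule name is built only from the resource name and the alert type, two rules for the same resource that differ only in threshold and severity resolve to the same name, so only one can exist. The sketch below is a minimal illustration under that assumption. The `alert_name` helper, the `vm01` resource, and the `CPUAlert` suffix are hypothetical and do not reproduce the actual concat expression in the AMBA policy definitions. Severity values follow Azure Monitor's numbering, where 0 is Critical and 2 is Warning.

```python
def alert_name(resource, alert_type, severity=None):
    """Build an alert rule name; optionally append the severity to keep names unique."""
    parts = [resource, alert_type]
    if severity is not None:
        parts.append(f"Sev{severity}")
    return "-".join(parts)


# Two desired CPU rules for the same (hypothetical) VM.
rules = [
    {"threshold": 70, "severity": 2},  # Warning
    {"threshold": 90, "severity": 0},  # Critical
]

# Keyed by name without severity: the second rule overwrites the first.
without_severity = {alert_name("vm01", "CPUAlert"): r for r in rules}

# Keyed by name with severity: both rules survive.
with_severity = {alert_name("vm01", "CPUAlert", r["severity"]): r for r in rules}

print(len(without_severity), len(with_severity))
```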

[Question/Feedback]: Missing "Deploy-*" json files for some services

Description

Why are the "Deploy-*" JSON files missing from the repo for some services?

For example, for KeyVault, there are 5 alert deploy files at this location: https://github.com/Azure/azure-monitor-baseline-alerts/blob/main/services/KeyVault/vaults/

But for services like ContainerApps, APIM, LogicApps, etc., the deploy file is not provided.
For example

Thanks

[Question/Feedback]: Inconsistency in `Microsoft.Network/azureFirewalls`

Description

SNAT Port Utilization in

- name: SNAT Port Utilization
description: Percentage of outbound SNAT ports currently in use

has on one hand a threshold of

operator: LessThan
threshold: 80

and a few lines further down below in

- name: SNATPortUtilization
description: Percentage of outbound SNAT ports currently in use

it has a different threshold and operator

timeAggregation: Average
operator: GreaterThan
criterionType: StaticThresholdCriterion
threshold: 99.0

I'd like to understand why that's the case. Is this an inconsistency or is there a reason for that?
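A mismatch like this can be surfaced mechanically by grouping alert definitions under a normalized metric name and flagging any group whose operator or threshold differs. The sketch below is illustrative only: the two entries are reduced from the snippets quoted above, and the `normalize` and `find_conflicts` helpers are hypothetical rather than part of the AMBA repo. It only detects the difference; it does not say which setting is the intended one.

```python
from collections import defaultdict

# The two definitions quoted above, reduced to the fields that differ.
alerts = [
    {"name": "SNAT Port Utilization", "operator": "LessThan", "threshold": 80},
    {"name": "SNATPortUtilization", "operator": "GreaterThan", "threshold": 99.0},
]


def normalize(name):
    """Collapse spacing and case so both spellings map to one key."""
    return "".join(name.split()).lower()


def find_conflicts(definitions):
    """Return normalized names that carry more than one distinct (operator, threshold) pair."""
    seen = defaultdict(set)
    for d in definitions:
        seen[normalize(d["name"])].add((d["operator"], float(d["threshold"])))
    return {name: settings for name, settings in seen.items() if len(settings) > 1}


conflicts = find_conflicts(alerts)
for name, settings in conflicts.items():
    print(name, sorted(settings))
```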
