aem-stack-manager-cloud's Introduction

aem-stack-manager-cloud

AEM Stack Manager Cloud Native Implementation

This is part of Shine Solutions' open source AEM solution offerings.

What is AEM Stack Manager

AEM Stack Manager provides the ability to do the following:

  • deploy-artifact: deploy individual AEM package
  • deploy-artifacts: deploy AEM packages based on a descriptor file
  • export-package: export an AEM package based on a set of filter rules
  • import-package: import a previously exported package
  • offline-snapshot-full-set: take an EBS snapshot of AEM repository volume after stopping AEM service
  • offline-compaction-snapshot-full-set: take an EBS snapshot after stopping AEM service and compacting the repository
  • promote-author: promote a standby Author instance to be the primary
  • enable-crxde: enable CRXDE on selected instances
  • run-adhoc-puppet: run ad hoc Puppet code provided in a tarball

In addition, a scheduled AEM Snapshots Purge function is provided as a separate Lambda function, which uses AWS CloudWatch Events to trigger its execution. It ships with sensible defaults to start with.

For more information, please refer to: aem-stack-manager

What is AEM Stack Manager Cloud

Shine Solutions has a Java implementation of the AEM Stack Manager. This cloud implementation uses cloud native technologies to do the same things. The AWS services used in this implementation include Lambda, EC2 Run Command, DynamoDB, and AWS CloudWatch. Python is used as the language for the Lambda functions.

To maintain compatibility with the Java version, this cloud version uses the same SNS interface to invoke the functions. There is a separate repo: aem-stack-manager-messenger for sending the SNS messages that trigger the tasks.

The sequence of events is: SNS -> Lambda -> EC2 Run Command -> Scripts/Puppet Manifests on instances. DynamoDB is used to keep the state of the tasks.
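As a rough illustration of the first hop in that chain, the sketch below unpacks the SNS records a Lambda function receives. The `Records`/`Sns`/`Message`/`MessageId` layout is the standard SNS event shape; the `task` field name and the sample values are assumptions made for illustration and are not taken from this repository.

```python
import json


def parse_sns_event(event):
    """Return one (message_id, message) pair per SNS record in a Lambda event."""
    tasks = []
    for record in event.get('Records', []):
        sns = record['Sns']
        # The SNS payload arrives as a JSON string in the 'Message' field.
        message = json.loads(sns['Message'])
        tasks.append((sns['MessageId'], message))
    return tasks


# Hypothetical event, shaped like the one Lambda receives from SNS.
sample_event = {
    'Records': [{
        'Sns': {
            'MessageId': 'example-message-id',
            'Message': json.dumps({
                'task': 'deploy-artifacts',  # assumed field name
                'stack_prefix': 'example-stack',
                'details': {'component': 'author-primary'},
            }),
        }
    }]
}

for message_id, message in parse_sns_event(sample_event):
    print(message_id, message['task'], message['stack_prefix'])
```

From here, a handler would presumably pick the SSM document for the task and pass the details on to EC2 Run Command.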

Snapshots Purge does not rely on this SNS interface.

How to Get Started

Under cloudformation is the CloudFormation template used to create the resources: the SSM Documents, Lambda Functions, SNS Topics, DynamoDB table, and necessary IAM Roles. Take note of the Stack Manager Topic name and the Backup Topic name, as those will be used with AEM Stack Manager Messenger; they have the form AemStackMangerversion and AemOfflineBackupversion. If you plan to pass in an identifier when invoking a function and use it to query the status of the task, also take note of the task status query Lambda Function name, which is usually in the form AemTaskQueryversion. version is a parameter in the CloudFormation template.

Similarly, the CloudFormation template for the Snapshots Purge resources can also be found under cloudformation.

Ansible is used to orchestrate the creation of the stack: it zips up the Python code, uploads it to S3, and provides the parameters used in the CloudFormation template.

Under scripts, generate.sh creates the CloudFormation template for the SSM Documents from a set of include files, manage-stack.sh enlists Ansible to create the CloudFormation stacks, and task_status_query.sh queries the task status using the AWS CLI. manage_document_permisson.py helps share the SSM documents with other accounts, while output_task_doc_mapping.py generates a mapping from AEM Stack Manager tasks to SSM document names, which is used with AEM Stack Manager Messenger and to configure the Lambda Functions.

Installation

Usage

  • To create the Lambda functions, DynamoDB, and other resources:

    make create-stack-manager-cloud [config_path=path]

    A sample yaml config file can be found under ansible/group_vars/aem-stack-manager-cloud.yaml

  • To create Snapshots Purge related resources:

    make create-snapshots-purge-cloud [config_path=path]

  • To invoke the individual tasks, please refer to aem-stack-manager-messenger. It is usually just like the following:

    make deploy-artifacts

Dependencies

The EC2 instances are assumed to have EC2 System Manager Agent installed and properly configured. Please refer to amazon_ssm_agent for a simple, easy-to-use Puppet module that supports using a proxy.

Going Forward

Lambda functions are stateless, while two of the tasks, offline-snapshot-full-set and offline-compaction-snapshot-full-set, require several steps to happen in the right order and to share state between them. A combination of Lambda and DynamoDB works, but it is a less than optimal choice; it was made because AWS Step Functions was not available in the Sydney region when this work started.

A Step Functions implementation is planned, and this cloud implementation will switch to it once the service is available in the Sydney region.

aem-stack-manager-cloud's People

Contributors

cliffano, mbloch1986, ovlords

aem-stack-manager-cloud's Issues

Put a failed state to DynamoDB on error

When the Lambda function offline_snapshot or stack_manager raises an error, it doesn't update or put a state into DynamoDB. It only sends a message to syslog, e.g.

aem_offline_snapshot.py Line 650:

Unhealthy Stack: RuntimeError
Traceback (most recent call last):
  File "/var/task/aem_offline_snapshot.py", line 658, in sns_message_processor
    raise RuntimeError('Unhealthy Stack')
RuntimeError: Unhealthy Stack

If we call the Lambda function via stack-manager-messenger and it raises an error, we are not informed, nor can we query the state of the call we made. The messenger returns an error only after the timeout of 1h.

Instead of raising an error, the Lambda function should update the DynamoDB item for the call we made. It might be a good idea to use the message_id as a unique identifier, since we already use it to query the state of a command.
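One way the proposal could look is sketched below. This is not the project's code: `put_state` is a hypothetical callable standing in for the DynamoDB write, and the state names `Failed` and `Success` are assumptions.

```python
import traceback


def run_and_record(message_id, task, put_state):
    """Run task() and always record the outcome under message_id.

    put_state is a hypothetical callable (message_id, state, detail) that
    would write an item to the DynamoDB table; it is injected so the sketch
    runs without AWS access.
    """
    try:
        result = task()
    except Exception as err:
        # Record the failure so a status query finds it, instead of
        # leaving the caller to wait for a timeout.
        put_state(message_id, 'Failed', '{}: {}'.format(type(err).__name__, err))
        traceback.print_exc()
        return None
    put_state(message_id, 'Success', '')
    return result


def unhealthy_task():
    raise RuntimeError('Unhealthy Stack')


states = {}
run_and_record('example-message-id', unhealthy_task,
               lambda mid, state, detail: states.update({mid: (state, detail)}))
print(states)
```

The traceback still goes to the log, so nothing is lost for debugging, while the messenger can find the failed state by message ID.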

aem_stack_manager.py line 475
aem_offline_snapshot.py line ~633

Allow nonexistent author-standby on full-set offline snapshot / compaction

Currently offline snapshot and offline compaction snapshot on AEM Full-Set require author-standby to exist.

However, we also need to consider the scenario where author-standby has been promoted to author-primary following a termination of the original author-primary, which leaves the environment with a single author-primary and no author-standby.

Offline snapshot and offline compaction snapshot events should identify a nonexistent author-standby, log it, and then move on to the next step.
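A minimal sketch of that "log it and move on" behaviour follows. The function name `stop_aem` and the callables `find_instance_ids` and `stop` are hypothetical stand-ins for the EC2 lookup and the SSM command, not names from the Lambda function.

```python
import logging

logger = logging.getLogger(__name__)


def stop_aem(component, find_instance_ids, stop, optional=('author-standby',)):
    """Stop AEM on a component, skipping optional components that are absent.

    find_instance_ids and stop are hypothetical callables standing in for
    the EC2 lookup and the SSM command.
    """
    instance_ids = find_instance_ids(component)
    if not instance_ids:
        if component in optional:
            logger.warning('No %s instance found, skipping', component)
            return []
        raise RuntimeError('No {} instance found'.format(component))
    stop(instance_ids)
    return instance_ids


# Environment where author-standby has already been promoted.
instances = {'author-primary': ['i-0example1'], 'author-standby': []}
stopped = []
for component in ('author-standby', 'author-primary'):
    stop_aem(component, lambda c: instances[c], stopped.extend)
print(stopped)
```

A missing author-primary still raises, since the snapshot cannot proceed without it.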

Offline-Snapshot process does not wait until EC2 instance is in standby

At the beginning of the offline-snapshot process, the EC2 instances of the Author-Dispatcher and Publish-Dispatcher are moved into standby. Depending on the Classic Load Balancer draining policy or the ALB Target Group deregistration_delay configuration, it can take up to 5 minutes until those EC2 instances are in standby.

The offline-snapshot process fails when moving the EC2 instances into standby takes up to 5 minutes but the rest of the process finishes within that time: at the end, the process tries to move the EC2 instances from standby back into service while they are not in standby yet.

Lambda Error message

ClientError: An error occurred (ValidationError) when calling the ExitStandby operation: The instance i-... is not in Standby.
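A possible fix is to poll the lifecycle state before continuing. The sketch below assumes a hypothetical `get_states` callable; with boto3 it could wrap the Auto Scaling `describe_auto_scaling_instances` call and read each entry's `LifecycleState`. The timeout and interval values are illustrative only.

```python
import time


def wait_for_standby(instance_ids, get_states, timeout=600, interval=15,
                     sleep=time.sleep):
    """Poll until every instance reports the 'Standby' lifecycle state.

    get_states is a hypothetical callable returning {instance_id: state}.
    Returns True once all instances are in Standby, False on timeout.
    """
    waited = 0
    while True:
        states = get_states(instance_ids)
        pending = [i for i in instance_ids if states.get(i) != 'Standby']
        if not pending:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Fake lookup: instances report EnteringStandby twice, then Standby.
calls = {'count': 0}


def fake_states(instance_ids):
    calls['count'] += 1
    state = 'Standby' if calls['count'] >= 3 else 'EnteringStandby'
    return {i: state for i in instance_ids}


print(wait_for_standby(['i-a', 'i-b'], fake_states, sleep=lambda s: None))
```

In a Lambda function the wait has to fit within the function timeout, so the check might instead be repeated across invocations using the state kept in DynamoDB.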

Fix Bug in ec2 filter for getting promoted author-standby instance

We need to check the EC2 filter used to get the instance details of a promoted Author Standby instance. At the moment the filter always returns the author-standby instance, independent of the state of the standby instance.

https://github.com/shinesolutions/aem-stack-manager-cloud/blame/master/lambda/aem_offline_snapshot.py#L112-L113

At the moment the filter looks like this:

    filters = [
        {
            'Name': 'tag:StackPrefix',
            'Values': [stack_prefix]
        }, {
            'Name': 'instance-state-name',
            'Values': ['running']
        }, {
            'Name': 'tag:aws:cloudformation:logical-id',
            'Values': ['AuthorStandbyInstance']
        }
    ]

If I'm not wrong we should update it to something like:

    filters = [
        {
            'Name': 'tag:StackPrefix',
            'Values': [stack_prefix]
        }, {
            'Name': 'instance-state-name',
            'Values': ['running']
        }, {
            'Name': 'tag:Name',
            'Values': ['AEM Author - Primary - Promoted from Standby']
        }
    ]

Since the EC2 tag name is the only EC2 tag we update while promoting author-standby to author-primary, we can only use this atm to filter for the promoted author-standby instance.

Offline-snapshot unlock dynamodb if error

To check whether an offline-snapshot is already in progress, the Lambda function aem_offline_snapshot.py writes a lock state to DynamoDB, e.g.
command_id | String | : | michaelb-aem63_backup_lock

This lock doesn't get cleared after the Lambda function raises an error, even when the error occurs before an offline-snapshot could be taken.
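A minimal sketch of releasing the lock on error is shown below. A plain Python set stands in for the DynamoDB table so that the example runs locally, and the key format is inferred from the sample lock value above; both are assumptions.

```python
from contextlib import contextmanager


@contextmanager
def backup_lock(locks, stack_prefix):
    """Hold the backup lock for stack_prefix and release it on any exit.

    locks is a plain set standing in for the DynamoDB table (an assumption
    made so the sketch runs locally).
    """
    key = '{}_backup_lock'.format(stack_prefix)
    if key in locks:
        raise RuntimeError('Backup already in progress: {}'.format(key))
    locks.add(key)
    try:
        yield key
    finally:
        # Runs on success and on error, so a failed run cannot leave a
        # stale lock behind.
        locks.discard(key)


locks = set()
try:
    with backup_lock(locks, 'example-stack'):
        raise RuntimeError('Unhealthy Stack')
except RuntimeError as err:
    print('task failed:', err)
print('remaining locks:', locks)
```

Because the offline snapshot spans several Lambda invocations, a real fix would more likely delete the lock item in each step's error handler; the context manager only illustrates the "always release" idea within one invocation.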

Introduce enable configuration property for each action

To allow users to configure Stack Manager with different risk profiles, e.g. set up Stack Manager in non-prod with enable CRXDE action allowed, but not in prod, we need to introduce enable configuration property at action level.
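As a sketch of what an action-level property could look like, the check below reads a hypothetical `actions` section from the configuration. The key names and the default (enabled unless explicitly switched off) are assumptions, not an existing configuration format.

```python
def is_action_enabled(config, action):
    """Return True unless the action is explicitly disabled in config.

    The 'actions' and 'enable' keys are hypothetical names chosen for this
    sketch.
    """
    return bool(config.get('actions', {}).get(action, {}).get('enable', True))


nonprod_config = {'actions': {'enable-crxde': {'enable': True}}}
prod_config = {'actions': {'enable-crxde': {'enable': False}}}

print(is_action_enabled(nonprod_config, 'enable-crxde'))
print(is_action_enabled(prod_config, 'enable-crxde'))
```

Defaulting to enabled keeps existing setups working; a stricter risk profile might prefer the opposite default.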

Migrate task mapping from the stack manager config file to the stack manager cloud lambda function

The Stack Manager Lambda function currently uses a configuration file to map each Stack Manager task to its SSM Document. This makes the function inflexible: whenever a new command is added, the Ansible module that creates the configuration file has to be updated. To make the function more dynamic, the task mapping could move into the Lambda function, which would look up the related SSM Document when it is executed. This would let users create their own custom SSM Document stack and still use the AEM Stack Manager Cloud Lambda functions.

Suspend ASG Balancing when doing offline snapshots.

The offline snapshot moves one of the publish-dispatcher instances to standby and takes a snapshot from it at the same time.
A problem can occur when the snapshot is taken in the AZ with the fewest launched instances, as in the example below:
AZa = 2 instances
AZb = 2 instances
AZc = 1 instance
In this scenario, if we take a snapshot from the instance in AZc, the Auto Scaling group applies AZ balancing, so it removes one instance from AZa or AZb and adds one instance to AZc.
To fix this issue, we may need to suspend AZ balancing while the offline snapshot is running.
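The suspension could be sketched as a context manager around the snapshot steps. The `suspend_processes` and `resume_processes` calls and the `AZRebalance` process name belong to the boto3 Auto Scaling client; the recording stub and the group name are assumptions that let the example run without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def az_rebalance_suspended(autoscaling, group_name):
    """Suspend the AZRebalance process for the duration of the block."""
    autoscaling.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=['AZRebalance'],
    )
    try:
        yield
    finally:
        # Resume even if the snapshot fails, so the group is not left
        # with rebalancing switched off.
        autoscaling.resume_processes(
            AutoScalingGroupName=group_name,
            ScalingProcesses=['AZRebalance'],
        )


class RecordingClient:
    """Stand-in for boto3.client('autoscaling') that records calls."""

    def __init__(self):
        self.calls = []

    def suspend_processes(self, **kwargs):
        self.calls.append(('suspend', kwargs))

    def resume_processes(self, **kwargs):
        self.calls.append(('resume', kwargs))


client = RecordingClient()
with az_rebalance_suspended(client, 'example-publish-dispatcher-asg'):
    pass  # enter standby, take the snapshot, exit standby
print([name for name, _ in client.calls])
```

As with other multi-step tasks here, the suspend and resume calls may end up in different Lambda invocations in practice.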

Add stack prefix wildcard support for snapshot purging

Currently Stack Manager snapshots purging supports a StackPrefix parameter, but this is not good enough when a team wants to manage a number of environments from a single Stack Manager. Another problem is that the stack prefix is often not known when the Stack Manager is created.

We need to introduce stack prefix wildcard support. This way, users can:

  1. define a stack prefix wildcard, e.g. myorg-* , which will then only purge snapshots taken from environments with stack prefix which fits the wildcard regex
  2. leave the wildcard as empty, which will then purge the snapshots across the account

This new stack prefix wildcard support will provide a boundary support for Stack Manager, essentially allowing multiple teams to have one Stack Manager each, with the wildcard as the boundary.
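The matching rule described in the two options above can be sketched with shell-style wildcards from the standard library. The function names and the simplified snapshot dictionaries are assumptions for illustration.

```python
from fnmatch import fnmatchcase


def matches_stack_prefix(stack_prefix, wildcard):
    """Return True if stack_prefix falls inside the wildcard boundary.

    An empty wildcard means no boundary, so every prefix matches.
    """
    if not wildcard:
        return True
    return fnmatchcase(stack_prefix, wildcard)


def snapshots_in_scope(snapshots, wildcard):
    """Keep only snapshots whose StackPrefix matches the wildcard."""
    return [s for s in snapshots
            if matches_stack_prefix(s.get('StackPrefix', ''), wildcard)]


# Simplified, hypothetical snapshot records.
snapshots = [
    {'SnapshotId': 'snap-1', 'StackPrefix': 'myorg-dev'},
    {'SnapshotId': 'snap-2', 'StackPrefix': 'myorg-prod'},
    {'SnapshotId': 'snap-3', 'StackPrefix': 'otherteam-dev'},
]

print([s['SnapshotId'] for s in snapshots_in_scope(snapshots, 'myorg-*')])
print([s['SnapshotId'] for s in snapshots_in_scope(snapshots, '')])
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive on every platform.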

Add new event to check environment readiness

Given a stack prefix, Stack Manager should provide an event to check if the AEM environment is ready.

For Full-Set architecture, the check should be done on the Orchestrator instance.
For Consolidated architecture, the check should be done on the AuthorPublishDispatcher instance.

Yes, it's possible to run the check from Lambda, and yes, we can look at running Ruby AEM clients via Travelling Ruby, but the current design is to place the event logic within the architectures, and the Lambda function is currently used as a thin orchestration layer of the events. Moving the event logic should be applied to all events, and not just one of them (this will be a platform v3.x discussion).

Failure in Stack manager Lambda function export-packages

The Stack Manager Lambda function contains an error in the export-packages command. The following lines need to be removed:

encoded = json.dumps(message['details']['package_filter'])
logger.debug('encoded filter: {}'.format(encoded))
logger.debug('escaped filter: {}'.format(json.dumps(encoded)))

Add stack prefix to cloudwatch logstream log s3 path

Currently cloudwatch logstream log uses s3_bucket/s3_bucket_path/datestamp/file as s3 location.

Because the s3_bucket and s3_bucket_path values are configured at the Stack Manager layer, there is one such mapping per Stack Manager, which means multiple AEM environments share the same Stack Manager configuration.

To avoid conflicting S3 locations across multiple AEM environments, we should introduce the stack prefix into the S3 location, so that it becomes s3_bucket/s3_bucket_path/stack_prefix/datestamp/file.
The Orchestrator, as the trigger of the scheduling, should ideally inject the stack_prefix value.
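A small sketch of building the proposed key is shown below. The function name and the sample values are hypothetical; only the path layout comes from the description above.

```python
def log_s3_key(s3_bucket_path, stack_prefix, datestamp, filename):
    """Build the object key s3_bucket_path/stack_prefix/datestamp/file."""
    parts = [s3_bucket_path.strip('/'), stack_prefix, datestamp, filename]
    # Skip empty parts so a missing bucket path does not produce '//'.
    return '/'.join(part for part in parts if part)


print(log_s3_key('logs/cloudwatch', 'example-stack', '2018-01-01', 'stream.log'))
```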

AEM Stack Manager Silent Failures

except botocore.exceptions.ClientError:
    succeeded = False

Hi all, when running the Stack Manager, our clients have occasionally experienced issues where IAM roles had been modified or VPC networking had been updated, causing the DynamoDB client to fail. The logs usually only report that a concurrent backup was attempted, which hides the actual issue. We've updated our copy of the Stack Manager to log the actual error message here, and it may be useful for others as well.

except botocore.exceptions.ClientError as err:
    print("Error Updating Dynamo: {}".format(err))
    succeeded = False

Update offline snapshot lambda function

The offline snapshot Lambda function needs to be updated, similar to the Stack Manager Lambda function, to store the SNS Message ID in the DynamoDB table, so that we can query for the command status.

Flush Dispatcher Cache action

We need to add a new flush-dispatcher-cache action, which deletes all files under the httpd docroot directory (e.g. /var/www/html) on publish-dispatcher instances.

Stack health check fails

The stack health check is failing because a value of the wrong type is passed to the method call. It expects an integer but receives a string.

Need for higher flexibility for lambda function to run new SSM commands

To give the Stack Manager more flexibility in executing new or unknown commands, we need to redesign the handling of commands within the Lambda function.

Instead of one method per command executed on the remote host, there should be a single method for all commands. This makes it easier to add new SSM commands.

At the moment we have to update the Stack Manager aem-stack-manager.py Lambda function every time we add a new SSM document. This in turn forces us to update the Stack Manager CloudFormation stack (deleting and recreating the Stack Manager).

At the moment a command method looks like this:

def deploy_artifact(message, ssm_common_params):
    target_filter = [
        {
            'Name': 'tag:StackPrefix',
            'Values': [message['stack_prefix']]
        }, {
            'Name': 'instance-state-name',
            'Values': ['running']
        }, {
            'Name': 'tag:Component',
            'Values': [message['details']['component']]
        }
    ]
    # boto3 ssm client does not accept multiple filter for Targets
    details = {
        'InstanceIds': instance_ids_by_tags(target_filter),
        'Comment': 'deploy an AEM artifact',
        'Parameters': {
            'source': [message['details']['source']],
            'group': [message['details']['group']],
            'name': [message['details']['name']],
            'version': [message['details']['version']],
            'replicate': [message['details']['replicate']],
            'activate': [message['details']['activate']],
            'force': [message['details']['force']]

        }
    }
    params = ssm_common_params.copy()
    params.update(details)
    return send_ssm_cmd(params)

It could look like this instead:

def command_params(message, ssm_common_params):
    target_filter = [
        {
            'Name': 'tag:StackPrefix',
            'Values': [message['stack_prefix']]
        }, {
            'Name': 'instance-state-name',
            'Values': ['running']
        }, {
            'Name': 'tag:Component',
            'Values': [message['details']['component']]
        }
    ]
    # boto3 ssm client does not accept multiple filter for Targets
    details = {
        'InstanceIds': instance_ids_by_tags(target_filter),
        'Comment': 'Execute SSM command',
        'Parameters': [message['details']['message']]
    }
    params = ssm_common_params.copy()
    params.update(details)
    return send_ssm_cmd(params)

Restrict purge snapshots to just AEM-generated snapshots.

Currently AEM Stack Manager filters snapshots by the snapshot type tag, which has one of the known values live, offline, or orchestration.

The problem with this filtering is that other snapshots, unrelated to any AEM project and not generated by AEM OpenCloud, could carry the same tag and get deleted accidentally.

To avoid this problem, we should filter on additional tags such as AemId and Component.
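The stricter check could be sketched as a client-side test on each snapshot's tags. The `Tags` list of `Key`/`Value` pairs follows the shape returned by the EC2 `describe_snapshots` call; the tag key `SnapshotType` is an assumed name for the snapshot type tag, and the sample snapshots are made up.

```python
KNOWN_SNAPSHOT_TYPES = {'live', 'offline', 'orchestration'}
REQUIRED_TAG_KEYS = {'AemId', 'Component'}


def is_aem_snapshot(snapshot):
    """Return True only for snapshots that carry the AEM-specific tags."""
    tags = {tag['Key']: tag['Value'] for tag in snapshot.get('Tags', [])}
    # 'SnapshotType' is an assumed tag key for the snapshot type.
    if tags.get('SnapshotType') not in KNOWN_SNAPSHOT_TYPES:
        return False
    # Require every additional tag key, whatever its value.
    return REQUIRED_TAG_KEYS.issubset(tags)


aem_snapshot = {'SnapshotId': 'snap-1', 'Tags': [
    {'Key': 'SnapshotType', 'Value': 'offline'},
    {'Key': 'AemId', 'Value': 'author-primary'},
    {'Key': 'Component', 'Value': 'author-primary'},
]}
unrelated_snapshot = {'SnapshotId': 'snap-2', 'Tags': [
    {'Key': 'SnapshotType', 'Value': 'offline'},
]}

print(is_aem_snapshot(aem_snapshot), is_aem_snapshot(unrelated_snapshot))
```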
