
Ansible Lifecycle Driver

Lifecycle driver implementation that uses Ansible to execute operations.

Please read the following guides to get started with the Ansible Lifecycle Driver:

Developer

  • Developer Docs - docs for developers to install the driver from source or run the unit tests

User


ansible-lifecycle-driver's Issues

Include resource requests and limits configuration in Helm chart

It is common when creating a Pod to define the resources it will use (CPU and Memory) on a Kubernetes node.

It is also common for most helm charts to allow these parameters to be configurable.

We should consider adding Pod resource limits and requests values to our Helm charts, making them configurable through the Helm chart values.

https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

The majority of Helm charts available in public repositories allow the resources to be configured through values like the following:

resources:
  limits:
    cpu: "1"
    memory: "2048Mi"
  requests:
    cpu: "25m"
    memory: "1536Mi"

ALM 2.1 Lifecycle Driver - no longer compatible with the Openstack os_server_action module

In ALM 2.0 we used the Ansible module linked below for stopping/starting OpenStack VMs:
https://docs.ansible.com/ansible/latest/modules/os_server_action_module.html

This worked for us with ALM 2.0 and the original Ansible Resource Manager: the Ansible task and role would run on localhost (the Ansible RM) and would stop/start the OpenStack instance using the os_server_action module.

However, with the 2.1 Brent driver and the Ansible lifecycle driver we can no longer do this. When we attempt to run a task and role as shown below:

# playbook
- name: Start VM in Openstack
  hosts: localhost
  gather_facts: False
  roles:
    - { role: start_vm }

# roles/start_vm/tasks/main.yml
- name: start_vm
  os_server_action:
    auth:
      auth_url: "{{ dl_properties.os_api_url }}"
      username: "{{ dl_properties.os_auth_username }}"
      password: "{{ dl_properties.os_auth_password }}"
      project_name: "{{ dl_properties.os_auth_project_name }}"
    validate_certs: no
    server: "{{ properties.hostname }}"
    action: start
    timeout: 180
    wait: yes
  ignore_errors: no

We get the below error in the ALM lifecycle driver logs (note the underlying failure: 'openstacksdk is required for this module'):
{"@timestamp": "2020-04-15T14:40:28.621Z", "@version": "1", "message": "Delivering envelope to lm_vnfc_lifecycle_execution_events with message content: b'{\"requestId\": \"5e88546da03546ccbb1463913158a181\", \"status\": \"FAILED\", \"failureDetails\": {\"failureCode\": \"INFRASTRUCTURE_ERROR\", \"description\": \"task start_vm : start_vm failed: {\\'msg\\': \\'openstacksdk is required for this module\\', \\'invocation\\': {\\'module_args\\': {\\'auth\\': {\\'auth_url\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'username\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'password\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'project_name\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\'}, \\'validate_certs\\': False, \\'server\\': \\'MCC-mcm\\', \\'action\\': \\'start\\', \\'timeout\\': 180, \\'wait\\': True, \\'verify\\': False, \\'interface\\': \\'public\\', \\'cloud\\': None, \\'auth_type\\': None, \\'region_name\\': None, \\'availability_zone\\': None, \\'cacert\\': None, \\'cert\\': None, \\'key\\': None, \\'api_timeout\\': None, \\'image\\': None}}, \\'_ansible_parsed\\': True, \\'_ansible_no_log\\': False, \\'changed\\': False}\"}, \"outputs\": {}}'", "host": "ansible-lifecycle-driver-57c94bf67f-9g9x2", "path": "/home/ald/.local/lib/python3.7/site-packages/ignition/service/messaging.py", "tags": [], "type": "logstash", "thread_name": "Thread-1", "level": "DEBUG", "logger_name": "ignition.service.messaging"}

Deployment location properties are being removed before the request is handled

Describe the bug
When logging the Deployment Location the driver attempts to obfuscate the properties by removing them from the message. By doing so it actually removes them from the Deployment Location object directly, meaning the properties are gone when they are later required to handle the request.

To Reproduce
Attempt any resource transition using the v1.0.0 or v1.0.1 driver

Environment: (please complete the following information where applicable):

  • Version 1.0.0
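The underlying fix is to redact a deep copy of the Deployment Location for logging, leaving the original object intact. A minimal sketch (the function name and the set of hidden keys are illustrative, not the driver's actual API):

```python
import copy

# Hypothetical set of keys to hide when logging a Deployment Location
HIDDEN_KEYS = {"properties"}

def obfuscated_view(deployment_location, hidden_keys=HIDDEN_KEYS):
    """Return a redacted copy for logging; the original is left untouched."""
    view = copy.deepcopy(deployment_location)
    for key in hidden_keys:
        if key in view:
            view[key] = "***obfuscated***"
    return view

dl = {"name": "core", "type": "Kubernetes", "properties": {"token": "secret"}}
log_safe = obfuscated_view(dl)
```

Because only the copy is mutated, the properties remain available when the request is handled later.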

Allow delegate_to localhost

Some Ansible modules (such as https://docs.ansible.com/ansible/latest/modules/wait_for_module.html) require running on the Ansible controller (in this case, the Ansible driver itself). The driver currently prevents this, probably due to a lack of write permission when running in a Kubernetes pod. When fixing this, we should consider the implications of allowing arbitrary Ansible code to run within the driver itself, i.e. security issues, and locally running code putting undue load on the driver. It may be possible to allow modules to run on a whitelist basis.
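A minimal sketch of the whitelist idea (the module names and the check itself are hypothetical, not the driver's current behaviour):

```python
# Hypothetical whitelist of modules permitted to run on the driver host
ALLOWED_LOCAL_MODULES = {"wait_for", "set_fact", "debug"}

def local_execution_permitted(module_name):
    """Return True if the module may be delegated to localhost (the driver)."""
    return module_name in ALLOWED_LOCAL_MODULES
```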

Expand the documentation for the driver

We need to create a complete user guide for this driver, explaining the Resource requirements and limitations specific to it.

A couple of suggested sections:

  • expected structure of the "ansible" directory in a Resource package
  • properties available to ansible playbooks and inventories
  • details on how to return properties to LM through facts
  • example of using a jumphost in an inventory file

dl_properties should include the name and infrastructure type

Describe the bug
The deployment location properties need to include the name of the deployment location and its infrastructure type, so that these can be accessed in any underlying Ansible playbooks.

Environment: (please complete the following information where applicable):

  • OS: [e.g. Ubuntu 18.04]
  • Version 1.1.0
  • Stratoss LM 2.1

Autoscaling on CPU usage

Update the Helm chart to include configurable options allowing the driver to scale up and down based on CPU usage.

Use common templating and kubernetes deployment location

Ignition framework 2.0.0 includes common tools for templating and a schema for Kubernetes based deployment locations. We should update this driver to make use of these tools.

We should make sure we maintain backwards compatibility, so in the case of templating, still support the use of {{ properties.someProp }} to refer to a property (in the common templating syntax this becomes just {{ someProp }}).
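To keep both syntaxes working, the driver could expose each property twice in the template context: once at the top level (new style) and once under a properties namespace (old style). A sketch under those assumptions, not the actual Ignition API:

```python
class _Namespace:
    """Simple attribute-access wrapper so {{ properties.someProp }} resolves."""
    def __init__(self, values):
        self.__dict__.update(values)

def build_template_context(properties):
    # New style: {{ someProp }} resolves against the top-level key
    context = dict(properties)
    # Old style: {{ properties.someProp }} resolves as an attribute lookup
    context["properties"] = _Namespace(properties)
    return context
```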

Update API handling with Ignition Resource Driver API changes

Ignition 1.2 will feature a single Resource Driver API instead of separate Infrastructure/Lifecycle APIs, to conform with Brent 2.2.

Before this driver can upgrade to Ignition 1.2 it must update its handling of API requests to match the new API specification.

Intermittent test failure

"test_max_queue_size" intermittently fails (perhaps due to a timeout on an asynchronous call):

======================================================================
FAIL: test_max_queue_size (tests.unit.service.test_process.TestProcess)

Traceback (most recent call last):
  File "/home/travis/build/accanto-systems/ansible-lifecycle-driver/tests/unit/service/test_process.py", line 249, in test_max_queue_size
    lifecycle_execution_1])
  File "/home/travis/build/accanto-systems/ansible-lifecycle-driver/tests/unit/service/test_process.py", line 98, in check_responses
    assert self.mock_messaging_service.send_lifecycle_execution.call_args_list == list(map(lambda lifecycle_execution: call(LifecycleExecutionMatcher(lifecycle_execution)), lifecycle_executions))
AssertionError

Driver will not run in OCP Restricted SCC

Describe the bug

Bootstrap errors prevent the Ansible driver pod from starting when running in an OCP cluster with a restricted SCC:

oc logs ansible-lifecycle-driver-697c46456-4j2r7
bash: /home/ald/.local/bin/ald: Permission denied

key_property_processor is not defined

Describe the bug
key_property_processor is being used when it is None in a finally block

To Reproduce
Run a transition with a deployment location that is not a valid dict object

Expected behavior
An error about the deployment location being invalid

Environment: (please complete the following information where applicable):

  • Version 1.0.0
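The fix is to initialise the variable before the try block and guard the finally clause. A sketch with illustrative class and method names:

```python
class KeyPropertyProcessor:
    """Stand-in for the real processor; names are illustrative."""
    def clear_key_files(self):
        pass  # would delete any key files written for this request

class AnsibleRequestHandler:
    def handle(self, deployment_location):
        key_property_processor = None  # defined before anything can raise
        try:
            if not isinstance(deployment_location, dict):
                raise ValueError("Deployment location must be a valid object")
            key_property_processor = KeyPropertyProcessor()
            return "OK"
        finally:
            # Guard against the processor never having been created
            if key_property_processor is not None:
                key_property_processor.clear_key_files()
```

With this guard in place, an invalid deployment location surfaces as the expected validation error rather than a secondary failure in the finally block.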

Add support for the Find Driver API

Brent now supports the Find API for any type of resource driver (i.e. Infrastructure and Lifecycle drivers). The Driver API supports this; the Ansible driver needs to be updated to handle these Find API requests and map them to e.g. an Ansible playbook.

Add support for returning associatedTopology when running a lifecycle transition playbook

The resource driver API allows the driver to return the ID, name and type of 0 or more associated topology instances (for a resource instance).

It must be possible for a lifecycle transition playbook to return associatedTopology, similar to how properties are returned i.e. using Ansible Facts. An appropriate mechanism (based on setting Ansible Facts representing associatedTopology, for example) should be constructed, implemented in the driver and documented.
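One possible mechanism (purely illustrative; the fact-name prefix is an assumption, not an agreed convention) would be to reserve a prefix for facts that describe associated topology and map them in the driver:

```python
# Hypothetical convention: facts named "associated_topology_<name>" carry a
# dict with the topology instance's id and type
def extract_associated_topology(facts):
    prefix = "associated_topology_"
    entries = []
    for fact_name, value in facts.items():
        if fact_name.startswith(prefix) and isinstance(value, dict):
            entries.append({
                "id": value.get("id"),
                "name": fact_name[len(prefix):],
                "type": value.get("type"),
            })
    return entries
```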

Inventory Variables are Overwritten

The driver uses Jinja2 syntax "{{ ... }}" to substitute LM properties in inventory files. Unfortunately, this prevents the substitution of other properties in the inventory because the templated property value is overwritten by a blank value (the LM property does not exist). Instead, LM property substitution should use a different template syntax e.g. "{{{ ... }}}".
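A sketch of the proposed triple-brace substitution, which leaves ordinary double-brace expressions for Ansible to resolve later (the placeholder syntax matches the suggestion above; the function itself is illustrative):

```python
import re

# Matches {{{ propName }}} but not Jinja2's {{ propName }}
_LM_PLACEHOLDER = re.compile(r"\{\{\{\s*([A-Za-z_][\w.]*)\s*\}\}\}")

def render_inventory(template, lm_properties):
    """Substitute {{{ ... }}} LM placeholders; unknown ones are left intact."""
    def replace(match):
        name = match.group(1)
        return str(lm_properties[name]) if name in lm_properties else match.group(0)
    return _LM_PLACEHOLDER.sub(replace, template)

inventory = "host1 ansible_host={{{ mgmt_ip }}} user={{ ansible_user }}"
rendered = render_inventory(inventory, {"mgmt_ip": "10.0.0.5"})
```

The {{ ansible_user }} expression survives the LM pass, so Ansible's own templating can still resolve it.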

K8s connection plugin type name is incorrect

The Ansible connection type for Kubernetes deployment location types is "k8s", which is incorrect (it should be "kubectl"). The workaround is to add "ansible_connection: kubectl" to your resource package inventory file to override it.

Ansible throwing errors not considered "unreachable" when the target host is unreachable

Ansible appears to cache connections to hosts it has used in previous playbook executions. This can lead to strange errors when the target host has disappeared and we want to trigger a heal in LM (e.g. we've deleted the stack in OpenStack and then attempt a Heal on the Resource).

Stop apache2 failed: {'msg': 'Timeout (32s) waiting for privilege escalation prompt: ', '_ansible_no_log': False} outputs: {}

After some time the error appears to change to:

Stop apache2 failed: {'_ansible_parsed': False, 'module_stdout': '', 'module_stderr': 'ssh: connect to host x.y.z.a port 22: Host is unreachable\r\n', 'msg': 'MODULE FAILURE\nSee stdout/stderr for the exact error', 'rc': 0, '_ansible_no_log': False, 'changed': False} outputs: {}

These errors are not seen as "unreachable" so the driver returns the execution response as FAILED with an INFRASTRUCTURE_ERROR code.

If we restart the driver container, then try again, we get the expected "unreachable" error, which the driver handles with retry attempts before returning RESOURCE_NOT_FOUND.

In the short term, we should catch these two known errors and treat them the same as "unreachable" (returning RESOURCE_NOT_FOUND).
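In the driver, this could be a simple message classifier covering the two known failure payloads (the marker strings are taken from the logs above; the status names follow the driver's existing failure codes; the function is a sketch, not the driver's actual implementation):

```python
# Marker strings taken from the two observed failure payloads
UNREACHABLE_MARKERS = (
    "Host is unreachable",
    "waiting for privilege escalation prompt",
)

def classify_failure(message):
    """Treat the two known connection-cache errors as unreachable hosts."""
    if any(marker in message for marker in UNREACHABLE_MARKERS):
        return "RESOURCE_NOT_FOUND"
    return "INFRASTRUCTURE_ERROR"
```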

Tighten restrictions on dependency versions

Add upper bounds to dependency versions in setup.py.

We recently ran into an issue where Gunicorn 20.0 was installed, which has breaking changes compared to our tested version (19.9).
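As a sketch, the pin for the Gunicorn case above might look like this in setup.py (the exact bounds are illustrative):

```python
# Fragment of setup.py: each dependency gets a tested lower bound and a
# defensive upper bound (versions here are illustrative)
install_requires = [
    'gunicorn>=19.9.0,<20.0',  # 20.0 introduced breaking changes for us
]
```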

Issue while installing ansible-lifecycle-driver

  1. Installed ansible-lifecycle-driver using the below command:
    helm install ansiblelifecycledriver-0.5.1.tgz --name ansible-lifecycle-driver --set app.config.override.messaging.connection_address=alm-kafka:9092 --namespace default --tls

  2. As we don't have foundation-kafka, we have alm-kafka as a ClusterIP service in the k8s cluster.
    root@eli4-master:/ansible-lifecycle-driver-0.5.1# kubectl get svc | grep -i ansible
    ansible-lifecycle-driver NodePort 10.0.241.201 8293:31680/TCP 36s
    root@eli4-master:/ansible-lifecycle-driver-0.5.1# kubectl get svc | grep -i kafka
    alm-kafka ClusterIP None 9092/TCP,9093/TCP,8080/TCP,8443/TCP 2d2h

  3. The ansible-lifecycle-driver deployment completed successfully.

root@eli4-master:~/ansible-lifecycle-driver-0.5.1# kubectl describe pods ansible-lifecycle-driver-6649dc8dc8-dr9xt
Name:               ansible-lifecycle-driver-6649dc8dc8-dr9xt
Namespace:          default
Priority:           0
PriorityClassName:
Node:               9.46.74.117/9.46.74.117
Start Time:         Fri, 21 Feb 2020 01:41:45 -0800
Labels:             app=ansible-lifecycle-driver
                    part-of=lm
                    pod-template-hash=6649dc8dc8
Annotations:        kubernetes.io/psp: ibm-privileged-psp
Status:             Running
IP:                 10.1.49.162
Controlled By:      ReplicaSet/ansible-lifecycle-driver-6649dc8dc8
Containers:
  ansible-lifecycle-driver:
    Container ID:  docker://3dde75f18eedff3817e5a9b5ab804778fa0a764003c6ef97de979491b20a79c2
    Image:         accanto/ansible-lifecycle-driver:0.5.1
    Image ID:      docker-pullable://accanto/ansible-lifecycle-driver@sha256:bf93fbc8aebeb7834f5be4671f1d8adfa11764ce24bddbbc793d1bd22b09dda2
    Port:          8293/TCP
    Host Port:     0/TCP
    State:         Running
      Started:     Fri, 21 Feb 2020 01:41:48 -0800
    Ready:         True
    Restart Count: 0
    Environment Variables from:
      ansible-lifecycle-driver-env  ConfigMap  Optional: false
    Environment:
    Mounts:
      /var/ald/ald_config.yml from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mvd2r (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            True
  ContainersReady  True
  PodScheduled     True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ansible-lifecycle-driver
    Optional:  false
  default-token-mvd2r:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mvd2r
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age  From                  Message
  Normal  Scheduled  55s  default-scheduler     Successfully assigned default/ansible-lifecycle-driver-6649dc8dc8-dr9xt to 9.46.74.117
  Normal  Pulled     53s  kubelet, 9.46.74.117  Container image "accanto/ansible-lifecycle-driver:0.5.1" already present on machine
  Normal  Created    53s  kubelet, 9.46.74.117  Created container
  Normal  Started    52s  kubelet, 9.46.74.117  Started container

  4. Post installation, accessing the UI at http://9.46.65.22:31680/api/lifecycle/ui
    gives the below error: This page isn’t working 9.46.65.22 didn’t send any data.
    ERR_EMPTY_RESPONSE

Log level causing high disk usage

Describe the bug
The log level is set to DEBUG and it is causing a high amount of logging, filling up Elasticsearch when using Filebeat.

To Reproduce
Install the driver and let it run for a few days, you will find high disk usage in elasticsearch

Expected behavior
The default log level should be less verbose (e.g. INFO), thereby reducing the rate at which the disk fills up.

Remove lifecycle scripts when task completed

Ignition writes the lifecycle scripts to disk for each request, which means there will be a build-up that will eventually fill the disk. We should clear the lifecycle scripts directory from disk after the process has finished (or failed).
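A sketch of the intended behaviour, with a per-request directory removed in a finally block (the function names and directory prefix are illustrative, not Ignition's actual API):

```python
import os
import shutil
import tempfile

def run_lifecycle(write_scripts, execute):
    """Write lifecycle scripts to a per-request directory and always remove it."""
    scripts_dir = tempfile.mkdtemp(prefix="ald-scripts-")
    try:
        write_scripts(scripts_dir)
        return execute(scripts_dir)
    finally:
        # Clear the scripts whether the process finished or failed
        shutil.rmtree(scripts_dir, ignore_errors=True)
```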

Add support for properties of type "key"

LM and Brent will support properties of type "key" in v2.1. The driver should handle properties of this type by persisting the key to a file so that it can be used in Ansible inventory to communicate securely with VMs. An additional property with the same name and a suffix of "-path" should be added to the properties that holds the path of the created key file. The key file should be cleaned up after the request has been handled.
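A sketch of the key-file handling described above (the in-memory representation of a "key" property, and the function names, are assumptions for illustration):

```python
import os
import tempfile

def materialise_key_properties(properties):
    """Write each 'key'-typed property to a file and expose '<name>-path'.

    Returns the updated properties plus the created paths for later cleanup.
    """
    out = dict(properties)
    created = []
    for name, value in properties.items():
        # Assumed shape: {"type": "key", "keyValue": "<private key text>"}
        if isinstance(value, dict) and value.get("type") == "key":
            fd, path = tempfile.mkstemp(prefix=name + "-", suffix=".pem")
            with os.fdopen(fd, "w") as key_file:
                key_file.write(value["keyValue"])
            out[name + "-path"] = path
            created.append(path)
    return out, created

def cleanup_key_files(paths):
    # Remove the key files once the request has been handled
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
```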

Helm delete to cleanup secrets

Is your feature request related to a problem? Please describe.
When I reinstall a new version of the driver I get the following error on the helm install: 'Error: secrets "ald-tls" already exists'

Describe the solution you'd like
When I run a helm delete on the driver I would like the secrets to be deleted.

Add support for certificate based K8s deployment locations

Currently, the driver supports token-based authentication for K8s deployment locations. Enhance this to support certificate-based authentication i.e. "certificate-authority-data" for the cluster kubeconfig, "client-certificate-data" and "client-key-data" for the user kubeconfig.

AttributeError thrown when enabling use_pool

When the process.use_pool configuration property is enabled, the driver fails on startup because the queue_thread attribute is not defined on the AnsibleProcessorService.

We need to initialise this attribute to None so we can perform an if check on it later in the service.
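The fix is to define the attribute in the constructor. A minimal sketch (the class and attribute names mirror the issue; the thread body is illustrative):

```python
import threading

class AnsibleProcessorService:
    def __init__(self, use_pool=False):
        # Always define the attribute so later checks cannot raise AttributeError
        self.queue_thread = None
        if use_pool:
            self.queue_thread = threading.Thread(target=lambda: None, daemon=True)
            self.queue_thread.start()

    def shutdown(self):
        # Safe whether or not the queue thread was ever started
        if self.queue_thread is not None:
            self.queue_thread.join()
```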

Improve Fault Tolerance by using an Ignition Request Queue

In order to improve driver fault tolerance, the handling of LM requests should be driven through a persistent queue rather than by REST calls: a REST LM request will be pushed onto the queue and the driver will then return a response with the infrastructure and request IDs. The proposal is to use Kafka for this. The advantage of this approach is more robust handling of requests in the event that the driver goes down (e.g. the Pod dies): another driver in the replica set can pick up the request and re-run it. It is recognised that the Ansible scripts must be idempotent for this to work; this is desirable for Ansible scripts anyway. In future, the driver could handle picking up Ansible scripts from where they left off - this is a feature of Ansible but would require some work in the driver.

Note: the fix for this should use the Ignition request queue (see IBM/ignition#46)
