
Ansible Lifecycle Driver

Lifecycle driver implementation that uses Ansible to execute operations.

Please read the following guides to get started with the Ansible Lifecycle Driver:

Developer

  • Developer Docs - docs for developers to install the driver from source or run the unit tests

User


ansible-lifecycle-driver's Issues

Include resource requests and limits configuration in Helm chart

It is common when creating a Pod to define the resources it will use (CPU and Memory) on a Kubernetes node.

It is also common for most helm charts to allow these parameters to be configurable.

We should consider adding Pod resource limits and requests values to our Helm charts, making them configurable through the Helm chart values.

https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

The majority of Helm charts available in public repositories allow the resources to be configured through values like the following:

resources:
  limits:
    cpu: "1"
    memory: "2048Mi"
  requests:
    cpu: "25m"
    memory: "1536Mi"

ALM 2.1 Lifecycle Driver - no longer compatible with the Openstack os_server_action module

In ALM 2.0 we used the Ansible module linked below for stopping/starting OpenStack VMs:
https://docs.ansible.com/ansible/latest/modules/os_server_action_module.html

This worked for us with ALM 2.0 and the original Ansible Resource Manager: the Ansible task and role would run on localhost (the Ansible RM) and would stop/start the OpenStack instance using the os_server_action module.

However, with the 2.1 Brent driver and the Ansible lifecycle driver we can no longer do this. When we attempt to run a task and role as shown below:

# playbook
- name: Start VM in Openstack
  hosts: localhost
  gather_facts: False
  roles:
    - { role: start_vm }

# roles/start_vm/tasks/main.yml
- name: start_vm
  os_server_action:
    auth:
      auth_url: "{{ dl_properties.os_api_url }}"
      username: "{{ dl_properties.os_auth_username }}"
      password: "{{ dl_properties.os_auth_password }}"
      project_name: "{{ dl_properties.os_auth_project_name }}"
    validate_certs: no
    server: "{{ properties.hostname }}"
    action: start
    timeout: 180
    wait: yes
  ignore_errors: no

We get the below error in the ALM lifecycle driver logs (note the underlying failure: 'openstacksdk is required for this module'):
{"@timestamp": "2020-04-15T14:40:28.621Z", "@version": "1", "message": "Delivering envelope to lm_vnfc_lifecycle_execution_events with message content: b'{\"requestId\": \"5e88546da03546ccbb1463913158a181\", \"status\": \"FAILED\", \"failureDetails\": {\"failureCode\": \"INFRASTRUCTURE_ERROR\", \"description\": \"task start_vm : start_vm failed: {\\'msg\\': \\'openstacksdk is required for this module\\', \\'invocation\\': {\\'module_args\\': {\\'auth\\': {\\'auth_url\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'username\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'password\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'project_name\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\'}, \\'validate_certs\\': False, \\'server\\': \\'MCC-mcm\\', \\'action\\': \\'start\\', \\'timeout\\': 180, \\'wait\\': True, \\'verify\\': False, \\'interface\\': \\'public\\', \\'cloud\\': None, \\'auth_type\\': None, \\'region_name\\': None, \\'availability_zone\\': None, \\'cacert\\': None, \\'cert\\': None, \\'key\\': None, \\'api_timeout\\': None, \\'image\\': None}}, \\'_ansible_parsed\\': True, \\'_ansible_no_log\\': False, \\'changed\\': False}\"}, \"outputs\": {}}'", "host": "ansible-lifecycle-driver-57c94bf67f-9g9x2", "path": "/home/ald/.local/lib/python3.7/site-packages/ignition/service/messaging.py", "tags": [], "type": "logstash", "thread_name": "Thread-1", "level": "DEBUG", "logger_name": "ignition.service.messaging"}

Deployment location properties are being removed before the request is handled

Describe the bug
When logging the Deployment Location the driver attempts to obfuscate the properties by removing them from the message. By doing so it actually removes them from the Deployment Location object directly, meaning the properties are gone when they are later required to handle the request.

To Reproduce
Attempt any resource transition using the v1.0.0 or v1.0.1 driver

Environment: (please complete the following information where applicable):

  • Version 1.0.0
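The underlying fix is to redact a deep copy of the Deployment Location for logging, leaving the original object intact. A minimal sketch (the function name and the set of hidden keys are illustrative, not the driver's actual API):

```python
import copy

# Hypothetical set of keys to hide when logging a Deployment Location
HIDDEN_KEYS = {"properties"}

def obfuscated_view(deployment_location, hidden_keys=HIDDEN_KEYS):
    """Return a redacted copy for logging; the original is left untouched."""
    view = copy.deepcopy(deployment_location)
    for key in hidden_keys:
        if key in view:
            view[key] = "***obfuscated***"
    return view

dl = {"name": "core", "type": "Kubernetes", "properties": {"token": "secret"}}
log_safe = obfuscated_view(dl)
```

Because only the copy is mutated, the properties remain available when the request is handled later.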

Allow delegate_to localhost

Some Ansible modules (such as https://docs.ansible.com/ansible/latest/modules/wait_for_module.html) require running on the Ansible controller (in this case, the Ansible driver itself). The driver currently prevents this, probably due to a lack of write permission when running in a Kubernetes pod. When fixing this, we should consider the implications of allowing arbitrary Ansible code to run within the driver itself, i.e. security issues, and locally running code putting undue load on the driver. It may be possible to allow modules to run on a whitelist basis.
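A minimal sketch of the whitelist idea (the module names and the check itself are hypothetical, not the driver's current behaviour):

```python
# Hypothetical whitelist of modules permitted to run on the driver host
ALLOWED_LOCAL_MODULES = {"wait_for", "set_fact", "debug"}

def local_execution_permitted(module_name):
    """Return True if the module may be delegated to localhost (the driver)."""
    return module_name in ALLOWED_LOCAL_MODULES
```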

Expand the documentation for the driver

We need to create a complete user guide for this driver, explaining the Resource requirements and limitations specific to it.

A couple of suggested sections:

  • expected structure of the "ansible" directory in a Resource package
  • properties available to ansible playbooks and inventories
  • details on how to return properties to LM through facts
  • example of using a jumphost in an inventory file

dl_properties should include the name and infrastructure type

Describe the bug
The deployment location properties need to include the name of the deployment location and its infrastructure type, so that these can be accessed in any underlying Ansible playbooks.

Environment: (please complete the following information where applicable):

  • OS: [e.g. Ubuntu 18.04]
  • Version 1.1.0
  • Stratoss LM 2.1

Autoscaling on CPU usage

Update the Helm chart to include configurable options allowing the driver to scale up and down based on CPU usage.

Use common templating and kubernetes deployment location

Ignition framework 2.0.0 includes common tools for templating and a schema for Kubernetes based deployment locations. We should update this driver to make use of these tools.

We should make sure we maintain backwards compatibility, so in the case of templating, still support the use of {{ properties.someProp }} to refer to a property (in the common templating syntax this becomes just {{ someProp }}).
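To keep both syntaxes working, the driver could expose each property twice in the template context: once at the top level (new style) and once under a properties namespace (old style). A sketch under those assumptions, not the actual Ignition API:

```python
class _Namespace:
    """Simple attribute-access wrapper so {{ properties.someProp }} resolves."""
    def __init__(self, values):
        self.__dict__.update(values)

def build_template_context(properties):
    # New style: {{ someProp }} resolves against the top-level key
    context = dict(properties)
    # Old style: {{ properties.someProp }} resolves as an attribute lookup
    context["properties"] = _Namespace(properties)
    return context
```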

Update API handling with Ignition Resource Driver API changes

Ignition 1.2 will feature a single Resource Driver API instead of separate Infrastructure/Lifecycle APIs, to conform with Brent 2.2.

Before this driver can upgrade to Ignition 1.2 it must update its handling of API requests to match the new API specification.

Intermittent test failure

"test_max_queue_size" intermittently fails (perhaps due to a timeout on an asynchronous call):

======================================================================
FAIL: test_max_queue_size (tests.unit.service.test_process.TestProcess)

Traceback (most recent call last):
  File "/home/travis/build/accanto-systems/ansible-lifecycle-driver/tests/unit/service/test_process.py", line 249, in test_max_queue_size
    lifecycle_execution_1])
  File "/home/travis/build/accanto-systems/ansible-lifecycle-driver/tests/unit/service/test_process.py", line 98, in check_responses
    assert self.mock_messaging_service.send_lifecycle_execution.call_args_list == list(map(lambda lifecycle_execution: call(LifecycleExecutionMatcher(lifecycle_execution)), lifecycle_executions))
AssertionError

Driver will not run in OCP Restricted SCC

Describe the bug

Bootstrap errors prevent the Ansible driver pod from starting when running in an OCP cluster with a restricted SCC:

oc logs ansible-lifecycle-driver-697c46456-4j2r7
bash: /home/ald/.local/bin/ald: Permission denied

key_property_processor is not defined

Describe the bug
key_property_processor is being used when it is None in a finally block

To Reproduce
Run a transition with a deployment location that is not a valid dict object

Expected behavior
An error about the deployment location being invalid

Environment: (please complete the following information where applicable):

  • Version 1.0.0
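The fix is to initialise the variable before the try block and guard the finally clause. A sketch with illustrative class and method names:

```python
class KeyPropertyProcessor:
    """Stand-in for the real processor; names are illustrative."""
    def clear_key_files(self):
        pass  # would delete any key files written for this request

class AnsibleRequestHandler:
    def handle(self, deployment_location):
        key_property_processor = None  # defined before anything can raise
        try:
            if not isinstance(deployment_location, dict):
                raise ValueError("Deployment location must be a valid object")
            key_property_processor = KeyPropertyProcessor()
            return "OK"
        finally:
            # Guard against the processor never having been created
            if key_property_processor is not None:
                key_property_processor.clear_key_files()
```

With this guard in place, an invalid deployment location surfaces as the expected validation error rather than a secondary failure in the finally block.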

Add support for the Find Driver API

Brent now supports the Find API for any type of resource driver (i.e. Infrastructure and Lifecycle drivers). The Driver API supports this; the Ansible driver needs to be updated to handle these Find API requests and map them to e.g. an Ansible playbook.

Add support for returning associatedTopology when running a lifecycle transition playbook

The resource driver API allows the driver to return the ID, name and type of 0 or more associated topology instances (for a resource instance).

It must be possible for a lifecycle transition playbook to return associatedTopology, similar to how properties are returned i.e. using Ansible Facts. An appropriate mechanism (based on setting Ansible Facts representing associatedTopology, for example) should be constructed, implemented in the driver and documented.
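One possible mechanism (purely illustrative; the fact-name prefix is an assumption, not an agreed convention) would be to reserve a prefix for facts that describe associated topology and map them in the driver:

```python
# Hypothetical convention: facts named "associated_topology_<name>" carry a
# dict with the topology instance's id and type
def extract_associated_topology(facts):
    prefix = "associated_topology_"
    entries = []
    for fact_name, value in facts.items():
        if fact_name.startswith(prefix) and isinstance(value, dict):
            entries.append({
                "id": value.get("id"),
                "name": fact_name[len(prefix):],
                "type": value.get("type"),
            })
    return entries
```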

Inventory Variables are Overwritten

The driver uses Jinja2 syntax "{{ ... }}" to substitute LM properties in inventory files. Unfortunately, this prevents the substitution of other properties in the inventory because the templated property value is overwritten by a blank value (the LM property does not exist). Instead, LM property substitution should use a different template syntax e.g. "{{{ ... }}}".
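A sketch of the proposed triple-brace substitution, which leaves ordinary double-brace expressions for Ansible to resolve later (the placeholder syntax matches the suggestion above; the function itself is illustrative):

```python
import re

# Matches {{{ propName }}} but not Jinja2's {{ propName }}
_LM_PLACEHOLDER = re.compile(r"\{\{\{\s*([A-Za-z_][\w.]*)\s*\}\}\}")

def render_inventory(template, lm_properties):
    """Substitute {{{ ... }}} LM placeholders; unknown ones are left intact."""
    def replace(match):
        name = match.group(1)
        return str(lm_properties[name]) if name in lm_properties else match.group(0)
    return _LM_PLACEHOLDER.sub(replace, template)

inventory = "host1 ansible_host={{{ mgmt_ip }}} user={{ ansible_user }}"
rendered = render_inventory(inventory, {"mgmt_ip": "10.0.0.5"})
```

The {{ ansible_user }} expression survives the LM pass, so Ansible's own templating can still resolve it.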

K8s connection plugin type name is incorrect

The Ansible connection type for Kubernetes deployment location types is "k8s", which is incorrect (it should be "kubectl"). The workaround is to add "ansible_connection: kubectl" to your resource package inventory file to override it.

Ansible throwing errors not considered "unreachable" when the target host is unreachable

Ansible appears to cache connections to hosts it has used in previous playbook executions. This can lead to strange errors when the target host has disappeared and we want to trigger a heal in LM (e.g. we've deleted the stack in OpenStack and then attempt a Heal on the Resource).

Stop apache2 failed: {'msg': 'Timeout (32s) waiting for privilege escalation prompt: ', '_ansible_no_log': False} outputs: {}

After some time the error appears to change to:

Stop apache2 failed: {'_ansible_parsed': False, 'module_stdout': '', 'module_stderr': 'ssh: connect to host x.y.z.a port 22: Host is unreachable\r\n', 'msg': 'MODULE FAILURE\nSee stdout/stderr for the exact error', 'rc': 0, '_ansible_no_log': False, 'changed': False} outputs: {}

These errors are not seen as "unreachable" so the driver returns the execution response as FAILED with an INFRASTRUCTURE_ERROR code.

If we restart the driver container, then try again, we get the expected "unreachable" error, which the driver handles with retry attempts before returning RESOURCE_NOT_FOUND.

In the short term, we should catch these two known errors and treat them the same as "unreachable" (returning RESOURCE_NOT_FOUND).
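In the driver, this could be a simple message classifier covering the two known failure payloads (the marker strings are taken from the logs above; the status names follow the driver's existing failure codes; the function is a sketch, not the driver's actual implementation):

```python
# Marker strings taken from the two observed failure payloads
UNREACHABLE_MARKERS = (
    "Host is unreachable",
    "waiting for privilege escalation prompt",
)

def classify_failure(message):
    """Treat the two known connection-cache errors as unreachable hosts."""
    if any(marker in message for marker in UNREACHABLE_MARKERS):
        return "RESOURCE_NOT_FOUND"
    return "INFRASTRUCTURE_ERROR"
```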

Tighten restrictions on dependency versions

Add upper bounds to dependency versions in setup.py.

We recently ran into an issue where Gunicorn 20.0 was installed, which has breaking changes compared to our tested version (19.9).
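As a sketch, the pin for the Gunicorn case above might look like this in setup.py (the exact bounds are illustrative):

```python
# Fragment of setup.py: each dependency gets a tested lower bound and a
# defensive upper bound (versions here are illustrative)
install_requires = [
    'gunicorn>=19.9.0,<20.0',  # 20.0 introduced breaking changes for us
]
```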

Issue while installing ansible-lifecycle-driver

  1. Installed ansible-lifecycle-driver using the below command:
    helm install ansiblelifecycledriver-0.5.1.tgz --name ansible-lifecycle-driver --set app.config.override.messaging.connection_address=alm-kafka:9092 --namespace default --tls

  2. As we don't have foundation-kafka, we have alm-kafka as a ClusterIP service in the k8s cluster.
    root@eli4-master:/ansible-lifecycle-driver-0.5.1# kubectl get svc | grep -i ansible
    ansible-lifecycle-driver NodePort 10.0.241.201 8293:31680/TCP 36s
    root@eli4-master:/ansible-lifecycle-driver-0.5.1# kubectl get svc | grep -i kafka
    alm-kafka ClusterIP None 9092/TCP,9093/TCP,8080/TCP,8443/TCP 2d2h

  3. The ansible-lifecycle-driver deployment completed successfully.

root@eli4-master:~/ansible-lifecycle-driver-0.5.1# kubectl describe pods ansible-lifecycle-driver-6649dc8dc8-dr9xt
Name:               ansible-lifecycle-driver-6649dc8dc8-dr9xt
Namespace:          default
Priority:           0
PriorityClassName:
Node:               9.46.74.117/9.46.74.117
Start Time:         Fri, 21 Feb 2020 01:41:45 -0800
Labels:             app=ansible-lifecycle-driver
                    part-of=lm
                    pod-template-hash=6649dc8dc8
Annotations:        kubernetes.io/psp: ibm-privileged-psp
Status:             Running
IP:                 10.1.49.162
Controlled By:      ReplicaSet/ansible-lifecycle-driver-6649dc8dc8
Containers:
  ansible-lifecycle-driver:
    Container ID:  docker://3dde75f18eedff3817e5a9b5ab804778fa0a764003c6ef97de979491b20a79c2
    Image:         accanto/ansible-lifecycle-driver:0.5.1
    Image ID:      docker-pullable://accanto/ansible-lifecycle-driver@sha256:bf93fbc8aebeb7834f5be4671f1d8adfa11764ce24bddbbc793d1bd22b09dda2
    Port:          8293/TCP
    Host Port:     0/TCP
    State:         Running
      Started:     Fri, 21 Feb 2020 01:41:48 -0800
    Ready:         True
    Restart Count: 0
    Environment Variables from:
      ansible-lifecycle-driver-env  ConfigMap  Optional: false
    Environment:
    Mounts:
      /var/ald/ald_config.yml from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mvd2r (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            True
  ContainersReady  True
  PodScheduled     True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ansible-lifecycle-driver
    Optional:  false
  default-token-mvd2r:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mvd2r
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age  From                  Message
  Normal  Scheduled  55s  default-scheduler     Successfully assigned default/ansible-lifecycle-driver-6649dc8dc8-dr9xt to 9.46.74.117
  Normal  Pulled     53s  kubelet, 9.46.74.117  Container image "accanto/ansible-lifecycle-driver:0.5.1" already present on machine
  Normal  Created    53s  kubelet, 9.46.74.117  Created container
  Normal  Started    52s  kubelet, 9.46.74.117  Started container

  4. Post installation, accessing the UI at http://9.46.65.22:31680/api/lifecycle/ui
    gives the below error: This page isn’t working 9.46.65.22 didn’t send any data.
    ERR_EMPTY_RESPONSE

Log level causing high disk usage

Describe the bug
The log level is set to DEBUG and it is causing a high amount of logging, filling up Elasticsearch when using Filebeat.

To Reproduce
Install the driver and let it run for a few days, you will find high disk usage in elasticsearch

Expected behavior
The default log level should be less verbose (e.g. INFO), thereby reducing the rate at which the disk fills up.

Remove lifecycle scripts when task completed

Ignition writes the lifecycle scripts to disk for each request, which means there will be a build-up that will eventually fill the disk. We should clear the lifecycle scripts directory from disk after the process has finished (or failed).
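A sketch of the intended behaviour, with a per-request directory removed in a finally block (the function names and directory prefix are illustrative, not Ignition's actual API):

```python
import os
import shutil
import tempfile

def run_lifecycle(write_scripts, execute):
    """Write lifecycle scripts to a per-request directory and always remove it."""
    scripts_dir = tempfile.mkdtemp(prefix="ald-scripts-")
    try:
        write_scripts(scripts_dir)
        return execute(scripts_dir)
    finally:
        # Clear the scripts whether the process finished or failed
        shutil.rmtree(scripts_dir, ignore_errors=True)
```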

Add support for properties of type "key"

LM and Brent will support properties of type "key" in v2.1. The driver should handle properties of this type by persisting the key to a file so that it can be used in Ansible inventory to communicate securely with VMs. An additional property with the same name and a suffix of "-path" should be added to the properties that holds the path of the created key file. The key file should be cleaned up after the request has been handled.
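A sketch of the key-file handling described above (the in-memory representation of a "key" property, and the function names, are assumptions for illustration):

```python
import os
import tempfile

def materialise_key_properties(properties):
    """Write each 'key'-typed property to a file and expose '<name>-path'.

    Returns the updated properties plus the created paths for later cleanup.
    """
    out = dict(properties)
    created = []
    for name, value in properties.items():
        # Assumed shape: {"type": "key", "keyValue": "<private key text>"}
        if isinstance(value, dict) and value.get("type") == "key":
            fd, path = tempfile.mkstemp(prefix=name + "-", suffix=".pem")
            with os.fdopen(fd, "w") as key_file:
                key_file.write(value["keyValue"])
            out[name + "-path"] = path
            created.append(path)
    return out, created

def cleanup_key_files(paths):
    # Remove the key files once the request has been handled
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
```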

Helm delete to cleanup secrets

Is your feature request related to a problem? Please describe.
When I reinstall a new version of the driver I get the following error on the helm install: 'Error: secrets "ald-tls" already exists'

Describe the solution you'd like
When I run a helm delete on the driver I would like the secrets to be deleted.

Add support for certificate based K8s deployment locations

Currently, the driver supports token-based authentication for K8s deployment locations. Enhance this to support certificate-based authentication i.e. "certificate-authority-data" for the cluster kubeconfig, "client-certificate-data" and "client-key-data" for the user kubeconfig.

AttributeError thrown when enabling use_pool

When the process.use_pool configuration property is enabled, the driver fails on startup because the queue_thread attribute is not defined on the AnsibleProcessorService.

We need to initialise this attribute to None so we can perform an if check on it later in the service.
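The fix is to define the attribute in the constructor. A minimal sketch (the class and attribute names mirror the issue; the thread body is illustrative):

```python
import threading

class AnsibleProcessorService:
    def __init__(self, use_pool=False):
        # Always define the attribute so later checks cannot raise AttributeError
        self.queue_thread = None
        if use_pool:
            self.queue_thread = threading.Thread(target=lambda: None, daemon=True)
            self.queue_thread.start()

    def shutdown(self):
        # Safe whether or not the queue thread was ever started
        if self.queue_thread is not None:
            self.queue_thread.join()
```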

Improve Fault Tolerance by using an Ignition Request Queue

In order to improve driver fault tolerance, the handling of LM requests should be driven through a persistent queue rather than by REST calls: a REST LM request will be pushed onto the queue and the driver will then return a response with the infrastructure and request IDs. The proposal is to use Kafka for this. The advantage of this approach is more robust handling of requests in the event that the driver goes down (e.g. the Pod dies): another driver in the replica set can pick up the request and re-run it. It is recognised that the Ansible scripts must be idempotent for this to work; this is desirable for Ansible scripts anyway. In future, the driver could handle picking up Ansible scripts from where they left off - this is a feature of Ansible but would require some work in the driver.

Note: the fix for this should use the Ignition request queue (see IBM/ignition#46)
