Lifecycle driver implementation that uses Ansible to execute operations.
Please read the following guides to get started with the Lifecycle driver:
- Developer Docs - docs for developers to install the driver from source or run the unit tests
License: Apache License 2.0
It is common when creating a Pod to define the resources it will use (CPU and Memory) on a Kubernetes node.
It is also common for most helm charts to allow these parameters to be configurable.
We should consider adding Pod resource limits and requests values to our Helm charts, making them configurable through the Helm chart values.
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
The majority of helm charts available on the public repository allow the resources to be configured through the following values:
resources:
  limits:
    cpu: "1"
    memory: "2048Mi"
  requests:
    cpu: "25m"
    memory: "1536Mi"
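In a chart, such values are typically wired into the Deployment's container spec with a toYaml include. A minimal sketch, assuming a conventional chart layout (the template path and surrounding structure are illustrative, not the driver's actual chart):

```yaml
# templates/deployment.yaml (fragment, illustrative)
      containers:
        - name: ansible-lifecycle-driver
          resources:
{{ toYaml .Values.resources | indent 12 }}
```

Leaving the default resources value empty keeps the current behaviour (no limits) for users who do not set it.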
The Dockerfile should allow us to build an Ignition whl before the driver whl, by adding it to the docker/whls directory. If no Ignition whl is present, the image build should continue with just the driver whl, assuming a "released" version of Ignition is in use instead.
In ALM 2.0 we used the below linked Ansible module for stopping/starting OpenStack VMs:
https://docs.ansible.com/ansible/latest/modules/os_server_action_module.html
This worked for us with ALM 2.0 and the original Ansible Resource Manager.
The Ansible task & role would run on the localhost (the Ansible RM) and would stop/start the OpenStack instance using the os_server_action module.
However, with the 2.1 Brent driver and the Ansible lifecycle driver we can no longer use this.
When we attempt to run a task & role as shown below -

- name: Start VM in Openstack
  hosts: localhost
  gather_facts: False
  roles:
    - { role: start_vm }

- name: start_vm
  os_server_action:
    auth:
      auth_url: "{{ dl_properties.os_api_url }}"
      username: "{{ dl_properties.os_auth_username }}"
      password: "{{ dl_properties.os_auth_password }}"
      project_name: "{{ dl_properties.os_auth_project_name }}"
    validate_certs: no
    server: "{{ properties.hostname }}"
    action: start
    timeout: 180
    wait: yes
  ignore_errors: no
We get the below error in the Ansible lifecycle driver logs:
{"@timestamp": "2020-04-15T14:40:28.621Z", "@version": "1", "message": "Delivering envelope to lm_vnfc_lifecycle_execution_events with message content: b'{\"requestId\": \"5e88546da03546ccbb1463913158a181\", \"status\": \"FAILED\", \"failureDetails\": {\"failureCode\": \"INFRASTRUCTURE_ERROR\", \"description\": \"task start_vm : start_vm failed: {\\'msg\\': \\'openstacksdk is required for this module\\', \\'invocation\\': {\\'module_args\\': {\\'auth\\': {\\'auth_url\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'username\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'password\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\', \\'project_name\\': \\'VALUE_SPECIFIED_IN_NO_LOG_PARAMETER\\'}, \\'validate_certs\\': False, \\'server\\': \\'MCC-mcm\\', \\'action\\': \\'start\\', \\'timeout\\': 180, \\'wait\\': True, \\'verify\\': False, \\'interface\\': \\'public\\', \\'cloud\\': None, \\'auth_type\\': None, \\'region_name\\': None, \\'availability_zone\\': None, \\'cacert\\': None, \\'cert\\': None, \\'key\\': None, \\'api_timeout\\': None, \\'image\\': None}}, \\'_ansible_parsed\\': True, \\'_ansible_no_log\\': False, \\'changed\\': False}\"}, \"outputs\": {}}'", "host": "ansible-lifecycle-driver-57c94bf67f-9g9x2", "path": "/home/ald/.local/lib/python3.7/site-packages/ignition/service/messaging.py", "tags": [], "type": "logstash", "thread_name": "Thread-1", "level": "DEBUG", "logger_name": "ignition.service.messaging"}
Describe the bug
When logging the Deployment Location the driver attempts to obfuscate the properties by removing them from the message. By doing so it actually removes them from the Deployment Location object directly, meaning the properties are gone when they are later required to handle the request.
To Reproduce
Attempt any resource transition using the v1.0.0 or v1.0.1 driver
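The fix is to redact a copy of the Deployment Location for logging instead of mutating the object itself. A minimal sketch of the pattern (function and key names are illustrative, not the driver's actual code):

```python
import copy

def obfuscated_view(deployment_location):
    """Return a redacted deep copy for logging; the original object,
    which is still needed to handle the request, is left untouched."""
    redacted = copy.deepcopy(deployment_location)
    for key in redacted.get('properties', {}):
        redacted['properties'][key] = '***'
    return redacted
```

The log statement then passes obfuscated_view(dl) rather than stripping properties from dl in place.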
Ignition-based drivers running in an OCP (Openshift) environment sometimes generate a NoBrokersAvailable exception.
Some Ansible modules (such as https://docs.ansible.com/ansible/latest/modules/wait_for_module.html) require running on the Ansible controller (in this case, the Ansible driver itself). The driver currently prevents this, probably due to lack of write permission when running in a Kubernetes pod. When fixing this, we should consider the implications of allowing arbitrary Ansible code to run within the driver itself, i.e. security issues and locally running code putting undue load on the driver. It may be possible to allow modules to run on a white-list basis.
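Once controller-side execution is permitted, the usual Ansible pattern for this is task delegation. An illustrative task (variable names are assumptions):

```yaml
# Run wait_for on the Ansible controller (the driver) rather than
# the managed host, by delegating the task to localhost.
- name: Wait for SSH to come up on the target
  wait_for:
    host: "{{ ansible_host }}"
    port: 22
    timeout: 300
  delegate_to: localhost
```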
The Ansible Driver currently supports accessing the private key portion of LM infrastructure (SSH) keys (http://servicelifecyclemanager.com/2.1.0/user-guides/operations/infrastructure-key-management/) for use in Ansible inventory. This should be enhanced to allow the public key portion of an infrastructure key to be accessed also (use case: creating Openstack or Kubernetes infrastructure that uses an SSH key pair for access).
The driver does not currently know whether properties are secure, so it should not log any properties.
We need to create a complete user guide for this driver, explaining the Resource requirements/limitations specific to this driver.
A couple of suggested sections:
Describe the bug
The deployment location properties need to include the name of the deployment location and its infrastructure type, so that these can be accessed in any underlying Ansible scripts.
Update the Helm chart to include configurable options to allow the driver to scale up and down based on CPU usage.
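A sketch of what the configurable values could look like, assuming a standard HorizontalPodAutoscaler is added to the chart (all names and defaults here are illustrative):

```yaml
# values.yaml additions (illustrative)
autoscaling:
  enabled: false                      # off by default to preserve current behaviour
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80  # scale up when average CPU exceeds this
```

Note that CPU-based autoscaling requires resource requests to be set on the container, so this ties in with making resources configurable.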
Ignition framework 2.0.0 includes common tools for templating and a schema for Kubernetes based deployment locations. We should update this driver to make use of these tools.
We should make sure we maintain backwards compatibility, so in the case of templating, still support the use of {{ properties.someProp }} to refer to a property (in the common templating syntax this becomes just {{ someProp }}).
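One way to keep both syntaxes working is to expose the property set in the template context twice: at the top level and under a 'properties' namespace. A minimal sketch with a toy renderer standing in for the real templating tools (the render function is illustrative, not Ignition's API):

```python
import re

def render(template, context):
    """Toy stand-in for the templating engine: resolve {{ expr }}
    tokens by walking dotted names through the context dict."""
    def lookup(match):
        value = context
        for part in match.group(1).strip().split('.'):
            value = value[part]
        return str(value)
    return re.sub(r'\{\{\s*(.*?)\s*\}\}', lookup, template)

props = {'someProp': 'hello'}
# Expose properties both at the top level (common Ignition syntax) and
# under a 'properties' namespace (legacy syntax) so both forms resolve.
context = {**props, 'properties': props}
```

With this context, both {{ properties.someProp }} and {{ someProp }} render to the same value.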
Consider using Ansible Runner as a replacement for custom Ansible runner code.
Ignition 1.2 will feature a single Resource Driver API instead of separate Infrastructure/Lifecycle APIs, to conform with Brent 2.2.
Before this driver can upgrade to 1.2 it must update its handling of API requests to match the new API specification.
"test_max_queue_size" intermittently fails (perhaps due to a timeout on an asynchronous call):
Traceback (most recent call last):
File "/home/travis/build/accanto-systems/ansible-lifecycle-driver/tests/unit/service/test_process.py", line 249, in test_max_queue_size
lifecycle_execution_1])
File "/home/travis/build/accanto-systems/ansible-lifecycle-driver/tests/unit/service/test_process.py", line 98, in check_responses
assert self.mock_messaging_service.send_lifecycle_execution.call_args_list == list(map(lambda lifecycle_execution: call(LifecycleExecutionMatcher(lifecycle_execution)), lifecycle_executions))
AssertionError
Describe the bug
Bootstrap errors prevent the Ansible driver pod from starting when running in an OCP cluster with a restricted SCC:
oc logs ansible-lifecycle-driver-697c46456-4j2r7
bash: /home/ald/.local/bin/ald: Permission denied
Describe the bug
key_property_processor is referenced in a finally block when it is None
To Reproduce
Run a transition with a deployment location that is not a valid dict object
Expected behavior
An error about the deployment location being invalid
Brent now supports the Find API for any type of resource driver (i.e. Infrastructure and Lifecycle drivers). The Driver API supports this; the Ansible driver needs to be updated to handle these Find API requests and map them to e.g. an Ansible playbook.
There is an issue with some versions of Kubernetes whereby the driver container will run as root rather than the ("ald") user configured in the Docker image - see kubernetes/kubernetes#78308. The proposal is to fix the Dockerfile to explicitly set the uid of the "ald" user, and update the Helm chart to run the container as that user.
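A minimal sketch of the Dockerfile side of the proposal (the uid value is illustrative, not the image's actual uid):

```dockerfile
# Create the "ald" user with an explicit, fixed uid and switch to it
# numerically, so Kubernetes can verify the non-root user and the Helm
# chart can pin the pod securityContext to the same uid.
RUN useradd -u 1000 -m ald
USER 1000
```

The Helm chart would then set securityContext runAsUser to the same numeric uid on the container.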
Add a label "part-of: alm" to the K8s deployment so that the LM Filebeat logging will correctly process the driver logs.
Upgrade the Ansible library used by the driver to the latest version
The resource driver API allows the driver to return the ID, name and type of 0 or more associated topology instances (for a resource instance).
It must be possible for a lifecycle transition playbook to return associatedTopology, similar to how properties are returned i.e. using Ansible Facts. An appropriate mechanism (based on setting Ansible Facts representing associatedTopology, for example) should be constructed, implemented in the driver and documented.
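One possible convention, sketched here as an assumption (the driver does not yet define the fact name or structure):

```yaml
# Illustrative: a lifecycle transition playbook reports associated
# topology back to the driver via an Ansible fact, mirroring how
# output properties are returned.
- name: Report associated topology back to the driver
  set_fact:
    associated_topology:
      - id: "{{ stack_id }}"   # variable and field names are assumptions
        name: "apache-stack"
        type: "Openstack"
```

The driver would then read this fact after the playbook completes and map it onto the associatedTopology element of the response.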
The driver uses Jinja2 syntax "{{ ... }}" to substitute LM properties in inventory files. Unfortunately, this prevents the substitution of other properties in the inventory because the templated property value is overwritten by a blank value (the LM property does not exist). Instead, LM property substitution should use a different template syntax e.g. "{{{ ... }}}".
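The idea is a first substitution pass that only recognises the triple-brace markers, leaving ordinary Jinja2 expressions in the inventory untouched for Ansible to resolve later. A minimal sketch (function name and behaviour for missing properties are assumptions):

```python
import re

def substitute_lm_properties(text, properties):
    """First-pass substitution of LM properties using {{{ ... }}} markers.
    Ordinary {{ ... }} expressions are left intact for Ansible; unknown
    LM property names are left as-is rather than blanked out."""
    def lookup(match):
        name = match.group(1).strip()
        return str(properties[name]) if name in properties else match.group(0)
    return re.sub(r'\{\{\{\s*(.*?)\s*\}\}\}', lookup, text)
```

For example, an inventory line mixing both syntaxes has only the LM marker replaced, while the Ansible variable survives the pass.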
The Ansible connection type for Kubernetes deployment location types is "k8s", which is incorrect (it should be "kubectl"). The workaround is to add "ansible_connection: kubectl" to your resource package inventory file to override it.
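An illustrative sketch of the workaround in a resource package's YAML inventory (the host name is a placeholder):

```yaml
# Force the kubectl connection plugin for Kubernetes deployment
# locations, overriding the driver's incorrect "k8s" default.
all:
  hosts:
    my-pod:                          # illustrative host entry
      ansible_connection: kubectl
```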
The driver currently runs Ansible in sub-processes. Any logging context set during requests should be propagated to these sub-processes so that it appears in log statements.
Add configuration to the helm chart so a user may specify affinity/anti-affinity rules and tolerations
Ansible appears to cache connections to hosts it has used in previous playbook executions. This can lead to strange errors when the target host has disappeared and we want to trigger a heal in LM (e.g. we've deleted the Stack in Openstack then attempt a Heal on a Resource).
Stop apache2 failed: {'msg': 'Timeout (32s) waiting for privilege escalation prompt: ', '_ansible_no_log': False} outputs: {}
After some time the error appears to change to:
Stop apache2 failed: {'_ansible_parsed': False, 'module_stdout': '', 'module_stderr': 'ssh: connect to host x.y.z.a port 22: Host is unreachable\r\n', 'msg': 'MODULE FAILURE\nSee stdout/stderr for the exact error', 'rc': 0, '_ansible_no_log': False, 'changed': False} outputs: {}
These errors are not seen as "unreachable" so the driver returns the execution response as FAILED with an INFRASTRUCTURE_ERROR code.
If we restart the driver container, then try again, we get the expected "unreachable" error, which the driver handles with retry attempts before returning RESOURCE_NOT_FOUND.
In the short term, we should catch these 2 known issues and treat them the same as unreachable (return a RESOURCE_NOT_FOUND).
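A minimal sketch of the short-term check (not the driver's actual code): match the two known failure messages and treat them like an unreachable host.

```python
# Substrings of the two known stale-connection failures quoted above.
STALE_CONNECTION_MARKERS = (
    'waiting for privilege escalation prompt',
    'Host is unreachable',
)

def is_effectively_unreachable(failure_message):
    """Heuristic: map the known stale-connection errors to the same
    handling as an Ansible 'unreachable' result, so the driver retries
    and ultimately returns RESOURCE_NOT_FOUND instead of FAILED."""
    return any(marker in failure_message for marker in STALE_CONNECTION_MARKERS)
```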
Add upper bounds to dependency versions in setup.py.
We recently ran into an issue where Gunicorn 20.0 was installed, which has breaking changes compared to our tested version (19.9).
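An illustrative shape for the pinned dependency list in setup.py (the second pin is a hypothetical example, not the driver's actual dependency set):

```python
# Each dependency gets an upper bound so an untested major release
# (e.g. Gunicorn 20.x) cannot be pulled in by a fresh install.
install_requires = [
    'gunicorn>=19.9.0,<20.0',   # 20.x breaks against our tested 19.9
    'ansible>=2.8.0,<2.9.0',    # hypothetical pin, versions illustrative
]
```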
Installed ansible-lifecycle-driver using the below command:
helm install ansiblelifecycledriver-0.5.1.tgz --name ansible-lifecycle-driver --set app.config.override.messaging.connection_address=alm-kafka:9092 --namespace default --tls
As we don't have foundation-kafka, we have alm-kafka as a ClusterIP service in the k8s cluster.
root@eli4-master:~/ansible-lifecycle-driver-0.5.1# kubectl get svc | grep -i ansible
ansible-lifecycle-driver   NodePort   10.0.241.201   8293:31680/TCP   36s
root@eli4-master:~/ansible-lifecycle-driver-0.5.1# kubectl get svc | grep -i kafka
alm-kafka   ClusterIP   None   9092/TCP,9093/TCP,8080/TCP,8443/TCP   2d2h
The ansible-lifecycle-driver deployment completed successfully.
root@eli4-master:~/ansible-lifecycle-driver-0.5.1# kubectl describe pods ansible-lifecycle-driver-6649dc8dc8-dr9xt
Name: ansible-lifecycle-driver-6649dc8dc8-dr9xt
Namespace: default
Priority: 0
PriorityClassName:
Node: 9.46.74.117/9.46.74.117
Start Time: Fri, 21 Feb 2020 01:41:45 -0800
Labels: app=ansible-lifecycle-driver
part-of=lm
pod-template-hash=6649dc8dc8
Annotations: kubernetes.io/psp: ibm-privileged-psp
Status: Running
IP: 10.1.49.162
Controlled By: ReplicaSet/ansible-lifecycle-driver-6649dc8dc8
Containers:
ansible-lifecycle-driver:
Container ID: docker://3dde75f18eedff3817e5a9b5ab804778fa0a764003c6ef97de979491b20a79c2
Image: accanto/ansible-lifecycle-driver:0.5.1
Image ID: docker-pullable://accanto/ansible-lifecycle-driver@sha256:bf93fbc8aebeb7834f5be4671f1d8adfa11764ce24bddbbc793d1bd22b09dda2
Port: 8293/TCP
Host Port: 0/TCP
State: Running
Started: Fri, 21 Feb 2020 01:41:48 -0800
Ready: True
Restart Count: 0
Environment Variables from:
ansible-lifecycle-driver-env ConfigMap Optional: false
Environment:
Mounts:
/var/ald/ald_config.yml from config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mvd2r (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: ansible-lifecycle-driver
Optional: false
default-token-mvd2r:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mvd2r
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Normal Scheduled 55s default-scheduler Successfully assigned default/ansible-lifecycle-driver-6649dc8dc8-dr9xt to 9.46.74.117
Normal Pulled 53s kubelet, 9.46.74.117 Container image "accanto/ansible-lifecycle-driver:0.5.1" already present on machine
Normal Created 53s kubelet, 9.46.74.117 Created container
Normal Started 52s kubelet, 9.46.74.117 Started container
Any Ansible playbook with more than 1 task seems to hang inside the Ansible process.
The (Kafka-based) Request Queue should allow "max.poll.interval.ms" to be configured on a per-deployment basis (with a conservative default, to allow for long-running Ansible lifecycle requests)
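A sketch of how such an override might look in the Helm values, modelled on the documented app.config.override.messaging path used for connection_address (the nested consumer-config key is an assumption, not the driver's actual schema):

```yaml
app:
  config:
    override:
      messaging:
        connection_address: alm-kafka:9092
        consumer:
          max.poll.interval.ms: 1800000   # 30 minutes; generous default for
                                          # long-running Ansible lifecycle requests
```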
Describe the bug
The log level is set to DEBUG, causing a high volume of logging and filling up Elasticsearch when using Filebeat.
To Reproduce
Install the driver and let it run for a few days; you will find high disk usage in Elasticsearch.
Expected behavior
The log level should be lowered, thereby reducing the rate at which the disk fills up.
Ignition writes the lifecycle scripts to disk for each request, which means there will be a build up that will lead to a full disk. We should clear the lifecycle scripts directory from disk after the process has finished (or failed).
LM and Brent will support properties of type "key" in v2.1. The driver should handle properties of this type by persisting the key to a file so that it can be used in Ansible inventory to communicate securely with VMs. An additional property with the same name and a suffix of "-path" should be added to the properties that holds the path of the created key file. The key file should be cleaned up after the request has been handled.
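A minimal sketch of the key-handling step, assuming key-typed properties can be identified (here by an assumed "_key" name suffix; the real driver would use the property's declared type):

```python
import os
import tempfile

def write_key_files(properties):
    """Persist each 'key'-typed property to a file and add a companion
    '<name>-path' property pointing at it, so inventory templates can
    reference the file. Returns the created paths so the driver can
    delete them once the request has been handled."""
    created = []
    for name, value in list(properties.items()):
        if name.endswith('_key'):            # assumed detection convention
            fd, path = tempfile.mkstemp(prefix=name + '-')
            with os.fdopen(fd, 'w') as handle:
                handle.write(value)
            properties[name + '-path'] = path
            created.append(path)
    return created
```

After the request completes, the driver iterates the returned paths and removes each file.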
Is your feature request related to a problem? Please describe.
When I reinstall a new version of the driver I get the following error on the helm install: 'Error: secrets "ald-tls" already exists'
Describe the solution you'd like
When I run a helm delete on the driver I would like the secrets to be deleted.
Currently, the driver supports token-based authentication for K8s deployment locations. Enhance this to support certificate-based authentication i.e. "certificate-authority-data" for the cluster kubeconfig, "client-certificate-data" and "client-key-data" for the user kubeconfig.
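A sketch of the kubeconfig fields the deployment location properties would need to carry (field names follow the standard kubeconfig format; the server URL and placeholders are illustrative):

```yaml
clusters:
- cluster:
    server: https://k8s.example.com:6443
    certificate-authority-data: <base64-encoded CA certificate>
users:
- user:
    client-certificate-data: <base64-encoded client certificate>
    client-key-data: <base64-encoded client key>
```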
When enabling the process.use_pool configuration property the driver will fail on startup because the queue_thread attribute is not defined on the AnsibleProcessorService. We need to initialise this attribute to None, so we can perform an if statement on it later in the service.
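A minimal sketch of the fix (the class body is illustrative, not the driver's actual implementation):

```python
class AnsibleProcessorService:
    """Define queue_thread unconditionally so later checks such as
    'if self.queue_thread is not None' are safe whether or not
    process.use_pool caused a queue thread to be started."""

    def __init__(self, use_pool=False):
        self.queue_thread = None          # always initialised
        if use_pool:
            self.queue_thread = self._start_queue_thread()

    def _start_queue_thread(self):
        return object()                   # stand-in for the real thread start

    def shutdown(self):
        if self.queue_thread is not None:
            pass                          # join/stop the thread here
```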
Add support for SSL REST APIs.
When running the Docker image with the WSGI_CONTAINER env var set to gunicorn, a "libc not found" error is thrown.
We need to add libc and binutils to the container image and keep them there.
In order to improve driver fault tolerance the handling of LM requests should be driven through a persistent queue, rather than by REST calls (a REST LM request will be pushed on to the queue and the driver will then return a response with the infrastructure and request ids). The proposal is to use Kafka for this. The advantage of this approach is more robust handling of requests in the event that the driver goes down e.g. the Pod dies; another driver in the replica set can pick up the request and re-run it. It is recognised that the Ansible scripts must be idempotent for this to work; this is desirable for Ansible scripts anyway (in future, the driver could handle picking up Ansible scripts from where they left off - this is a feature of Ansible but would require some work in the driver).
Note: the fix for this should use the Ignition request queue (see IBM/ignition#46)
Requires an update to 0.8.0 version of Ignition