Comments (21)
I would like to mention that this issue occurs only on GitHub Actions CI, not on Travis.
Maybe it is related to Molecule running on ubuntu-latest.
https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/.github/workflows/default.yml#L6
The failing task calls the Ansible service module and reads its output.
https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/tasks/setup_cluster.yml#L172-L184
It may be caused by the auto-detection of system-specific modules. If it falls back to the legacy service module instead of systemd, it may fail. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/service_module.html
As we only support systemd, I will change the use parameter from auto to systemd or replace the service module with the systemd module.
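The two options above could look roughly like this. This is a minimal sketch, not the role's actual task: the task names, the service name mariadb, and state: started are assumptions for illustration, while use is the documented parameter of the service module that controls backend auto-detection.

```yaml
# Option 1: keep the service module but skip auto-detection.
# Task name and service name are assumptions, not copied from the role.
- name: joining galera cluster (sketch, forcing the systemd backend)
  ansible.builtin.service:
    name: mariadb
    state: started
    use: systemd          # default is "auto"
  register: _mariadb_galera_cluster_joined

# Option 2: call the systemd module directly.
- name: joining galera cluster (sketch, using the systemd module)
  ansible.builtin.systemd:
    name: mariadb
    state: started
  register: _mariadb_galera_cluster_joined
```

Either variant should make sure the registered result comes from the systemd backend, which is the one that returns a status dictionary with ActiveState.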
from ansible-mariadb-galera-cluster.
I implemented a fix to check if _mariadb_galera_cluster_joined.status is defined.
Then I implemented a task to pull out logs from the service.
TASK [ansible-mariadb-galera-cluster : command] ********************************
fatal: [node2]: FAILED! => {"changed": true, "cmd": ["systemctl", "status", "mysql"], "delta": "0:00:00.006400", "end": "2021-05-28 10:42:25.784635", "msg": "non-zero return code", "rc": 3, "start": "2021-05-28 10:42:25.778235", "stderr": "Failed to dump process list for 'mariadb.service', ignoring: Input/output error", "stderr_lines": ["Failed to dump process list for 'mariadb.service', ignoring: Input/output error"], "stdout": "* mariadb.service - MariaDB 10.5.10 database server\n Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)\n Drop-In: /etc/systemd/system/mariadb.service.d\n -migrated-from-my.cnf-settings.conf\n Active: failed (Result: resources)\n Docs: man:mariadbd(8)\n https://mariadb.com/kb/en/library/systemd/\n CGroup: /system.slice/containerd.service/system.slice/mariadb.service\n\nMay 28 10:42:12 node2 systemd[1]: mariadb.service: Failed with result 'resources'.\nMay 28 10:42:12 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.\nMay 28 10:42:18 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist\nMay 28 10:42:18 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy\nMay 28 10:42:18 node2 systemd[1]: mariadb.service: Failed with result 'resources'.\nMay 28 10:42:18 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.\nMay 28 10:42:24 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist\nMay 28 10:42:24 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy\nMay 28 10:42:24 node2 systemd[1]: mariadb.service: Failed with result 'resources'.\nMay 28 10:42:24 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.", "stdout_lines": ["* mariadb.service - MariaDB 10.5.10 database server", " Loaded: loaded (/lib/systemd/system/mariadb.service; 
enabled; vendor preset: enabled)", " Drop-In: /etc/systemd/system/mariadb.service.d", "
-migrated-from-my.cnf-settings.conf", " Active: failed (Result: resources)", " Docs: man:mariadbd(8)", " https://mariadb.com/kb/en/library/systemd/", " CGroup: /system.slice/containerd.service/system.slice/mariadb.service", "", "May 28 10:42:12 node2 systemd[1]: mariadb.service: Failed with result 'resources'.", "May 28 10:42:12 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.", "May 28 10:42:18 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist", "May 28 10:42:18 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy", "May 28 10:42:18 node2 systemd[1]: mariadb.service: Failed with result 'resources'.", "May 28 10:42:18 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.", "May 28 10:42:24 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist", "May 28 10:42:24 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy", "May 28 10:42:24 node2 systemd[1]: mariadb.service: Failed with result 'resources'.", "May 28 10:42:24 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server."]}
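For reference, the guard and the log-collection task described above could be sketched as follows. The task names, retries, delay, and the _mariadb_service_status variable are hypothetical; only the registered variable _mariadb_galera_cluster_joined and the systemctl status mysql command appear in this thread.

```yaml
# Sketch: retry until the module returns a status dict AND the unit is active.
# The "is defined" check avoids the "'dict object' has no attribute 'status'" error.
- name: joining galera cluster (sketch with a defined-check)
  ansible.builtin.systemd:
    name: mariadb          # assumed service name
    state: started
  register: _mariadb_galera_cluster_joined
  until:
    - _mariadb_galera_cluster_joined.status is defined
    - _mariadb_galera_cluster_joined.status.ActiveState == "active"
  retries: 10              # assumed value
  delay: 5                 # assumed value

# Sketch: collect the service status for debugging without failing the play.
# systemctl status exits non-zero for a failed unit, hence failed_when: false.
- name: collect service status (hypothetical task)
  ansible.builtin.command: systemctl status mysql
  register: _mariadb_service_status   # hypothetical variable name
  changed_when: false
  failed_when: false
```

With the guard in place, a failing start should surface the module's own error message after the retries run out instead of the undefined-attribute error.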
from ansible-mariadb-galera-cluster.
Looks similar to this issue https://jira.mariadb.org/browse/MDEV-23050?attachmentOrder=desc
from ansible-mariadb-galera-cluster.
This can be patched by setting SendSIGKILL=yes on mariadb.service.
However, I would prefer to wait until https://docs.docker.com/engine/release-notes/#20100 is rolled into the GH ubuntu-latest image, which we use for testing. Hopefully this will fix the issue.
from ansible-mariadb-galera-cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from ansible-mariadb-galera-cluster.
needs more testing
from ansible-mariadb-galera-cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from ansible-mariadb-galera-cluster.
ping
from ansible-mariadb-galera-cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from ansible-mariadb-galera-cluster.
I need to check this again.
from ansible-mariadb-galera-cluster.
I just ran into this issue while setting up a new cluster.
TASK [ansible-mariadb-galera-cluster : setup_cluster | killing lingering mysql processes to ensure mysql is stopped] ***
fatal: [SFM-VPS-1]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.022065", "end": "2021-12-22 01:57:29.530429", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.508364", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [SFM-VPS-4]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.011336", "end": "2021-12-22 01:57:29.754751", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.743415", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [SFM-VPS-5]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.013999", "end": "2021-12-22 01:57:29.824234", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.810235", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [SFM-VPS-6]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.015467", "end": "2021-12-22 01:57:29.923334", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.907867", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
TASK [ansible-mariadb-galera-cluster : setup_cluster | configuring temp galera config for first node] ***
skipping: [SFM-VPS-4] => (item=etc/my.cnf.d/server.cnf)
skipping: [SFM-VPS-5] => (item=etc/my.cnf.d/server.cnf)
skipping: [SFM-VPS-6] => (item=etc/my.cnf.d/server.cnf)
[WARNING]: Collection ansible.netcommon does not support Ansible version 2.12.1
changed: [SFM-VPS-1] => (item=etc/my.cnf.d/server.cnf)
TASK [ansible-mariadb-galera-cluster : setup_cluster | bootstrapping galera cluster] ***
skipping: [SFM-VPS-4]
skipping: [SFM-VPS-5]
skipping: [SFM-VPS-6]
changed: [SFM-VPS-1]
TASK [ansible-mariadb-galera-cluster : setup_cluster | ensure first node is fully started before joining other nodes] ***
skipping: [SFM-VPS-1]
TASK [ansible-mariadb-galera-cluster : setup_cluster | sleep for 15 seconds to wait for node WSREP prepared state] ***
ok: [SFM-VPS-4 -> localhost]
ok: [SFM-VPS-1 -> localhost]
ok: [SFM-VPS-5 -> localhost]
ok: [SFM-VPS-6 -> localhost]
TASK [ansible-mariadb-galera-cluster : setup_cluster | joining galera cluster] ***
skipping: [SFM-VPS-1]
fatal: [SFM-VPS-4]: FAILED! => {"msg": "The conditional check '_mariadb_galera_cluster_joined.status.ActiveState == \"active\"' failed. The error was: error while evaluating conditional (_mariadb_galera_cluster_joined.status.ActiveState == \"active\"): 'dict object' has no attribute 'status'"}
fatal: [SFM-VPS-5]: FAILED! => {"msg": "The conditional check '_mariadb_galera_cluster_joined.status.ActiveState == \"active\"' failed. The error was: error while evaluating conditional (_mariadb_galera_cluster_joined.status.ActiveState == \"active\"): 'dict object' has no attribute 'status'"}
fatal: [SFM-VPS-6]: FAILED! => {"msg": "The conditional check '_mariadb_galera_cluster_joined.status.ActiveState == \"active\"' failed. The error was: error while evaluating conditional (_mariadb_galera_cluster_joined.status.ActiveState == \"active\"): 'dict object' has no attribute 'status'"}
We're running CentOS 8, latest commit of this role.
> This can be patched by setting SendSIGKILL=yes on mariadb.service.
@elcomtik Do you know how I can patch mariadb.service as you said? Thanks.
from ansible-mariadb-galera-cluster.
@BirkhoffLee This should not happen outside of a Docker container. Where do you run your MariaDB servers? I read that this also occurred on Proxmox 5.4, see https://jira.mariadb.org/browse/MDEV-23050?attachmentOrder=desc.
Personally, I haven't encountered this issue on CentOS 8, which I run myself with Ansible 2.10.16. I should test it on a newer Ansible version soon.
What Ansible version do you use?
> @elcomtik Do you know how do I patch mariadb.service as you said? Thanks
mariadb.service can be modified with a systemd override file like this one https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/templates/etc/systemd/system/mariadb.service.d/timeout-start-sec.conf.j2, which is added by these tasks https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/tasks/timeout-start-sec.yml
I didn't test a SendSIGKILL override because I thought the original issue would be fixed in the Ubuntu Docker image. I wouldn't recommend using it in production, because it may cause cluster instability.
You may try it and give me feedback.
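An untested sketch of such an override, written as Ansible tasks in the same style as the linked timeout-start-sec tasks. The file name send-sigkill.conf and the variable _mariadb_sendsigkill_override are hypothetical; the drop-in directory is the one shown in the systemctl status output earlier in this thread.

```yaml
# Sketch only: not tested, and not recommended for production (see above).
- name: ensure the mariadb.service drop-in directory exists
  ansible.builtin.file:
    path: /etc/systemd/system/mariadb.service.d
    state: directory
    mode: "0755"

- name: add a SendSIGKILL override (hypothetical file name)
  ansible.builtin.copy:
    dest: /etc/systemd/system/mariadb.service.d/send-sigkill.conf
    content: |
      [Service]
      SendSIGKILL=yes
    mode: "0644"
  register: _mariadb_sendsigkill_override   # hypothetical variable name

- name: reload systemd so the override is picked up
  ansible.builtin.systemd:
    daemon_reload: true
  when: _mariadb_sendsigkill_override is changed
```

Running systemctl cat mariadb.service on the node afterwards should list the drop-in if it was applied.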
from ansible-mariadb-galera-cluster.
> where do you run your MariaDB servers?
OVH VPS, most likely not Proxmox.
> What ansible version do you use?
I was using 2.12 in the last comment. I switched to 2.10.16, and the problem persists.
python version = 3.9.9 (main, Nov 21 2021, 03:23:42) [Clang 13.0.0 (clang-1300.0.29.3)]
@elcomtik I just dug into the issue more and found out that _mariadb_galera_cluster_joined will only be registered if the service task finishes. Unfortunately, the MariaDB service stays in the failed state, with the log below:
● mariadb.service - MariaDB 10.6.5 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2021-12-22 09:01:54 CST; 4min 57s ago
Docs: man:mariadbd(8)
https://mariadb.com/kb/en/library/systemd/
Process: 126847 ExecStart=/usr/sbin/mariadbd $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=1/FAILURE)
Process: 126794 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`cd /usr/bin/..; /usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR>
Process: 126792 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
Main PID: 126847 (code=exited, status=1/FAILURE)
Status: "MariaDB server is down"
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22 9:01:54 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: at /home/buildbot/buildbot/build/gcomm/src/pc.cpp:connect():160
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22 9:01:54 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection ti>
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22 9:01:54 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs.cpp:gcs_open():1633: Failed to open channel 'sfm-galera-1' at 'gcomm://100.70.129.98>
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22 9:01:54 0 [ERROR] WSREP: gcs connect failed: Connection timed out
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22 9:01:54 0 [ERROR] WSREP: wsrep::connect(gcomm://100.70.129.98,100.103.157.59,100.93.158.54,100.85.108.84) failed: 7
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22 9:01:54 0 [ERROR] Aborting
Dec 22 09:01:54 SFM-VPS-5 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 09:01:54 SFM-VPS-5 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Dec 22 09:01:54 SFM-VPS-5 systemd[1]: Failed to start MariaDB 10.6.5 database server.
The IP addresses are within an internal network; connectivity is fine, with latency <10ms. To get this far, I had to change the until conditional of the task setup_cluster | stopping mysql to (re)configure cluster (other nodes) to until: _mariadb_galera_cluster_node.status.ActiveState != "active", since the service stays in the failed state.
I think the root cause is in the log above. Do you have an idea how this could be fixed? Again, thank you for your fast response.
Edit: Confirmed that the other nodes cannot connect to the first node because the MariaDB instance on the first node is not listening on the WSREP port.
2nd edit: A configuration inconsistency left mariadbd in the failed state, so the condition _mariadb_galera_cluster_node.status.ActiveState == "inactive" would never evaluate to true. Thus, when reconfiguring a failed cluster, the role never gets the chance to apply the correct config. In my case, the first node tried to connect to the other nodes while the role was reconfiguring, and mariadbd on the other nodes was in the failed state. So mariadbd on the first node did not launch and ended up in the failed state as well (running galera_new_cluster would be enough to fix this first node). Therefore the tasks stopping mysql to (re)configure cluster (other nodes) and stopping mysql to (re)configure cluster (first node) will always fail, and the role cannot reconfigure the cluster without human intervention.
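One possible shape for a stop task that tolerates a failed unit is sketched below. It is untested and is not the role's current task: the task name, module choice, retries, and delay are assumptions, and only the variable _mariadb_galera_cluster_node and the ActiveState values come from the discussion above.

```yaml
# Sketch: treat both "inactive" and "failed" as "stopped enough to reconfigure".
- name: stopping mysql to (re)configure cluster (sketch)
  ansible.builtin.systemd:
    name: mariadb          # assumed service name
    state: stopped
  register: _mariadb_galera_cluster_node
  until:
    - _mariadb_galera_cluster_node.status is defined
    - _mariadb_galera_cluster_node.status.ActiveState in ["inactive", "failed"]
  retries: 10              # assumed value
  delay: 5                 # assumed value
```

This is narrower than the != "active" workaround, because transitional states such as activating or deactivating would still trigger a retry.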
from ansible-mariadb-galera-cluster.
I thought it might be caused by firewall issues, where the first node's WSREP port is not reachable from the joining node. It makes sense that if the first node has failed, it cannot be joined.
This role can only be executed on clean nodes or a healthy cluster. If a node has failed, it is better to fix it or clean it up and then execute the role. This is a known issue (at least to me), which should be addressed in a new GH issue.
from ansible-mariadb-galera-cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from ansible-mariadb-galera-cluster.
> This can be patched by setting SendSIGKILL=yes on mariadb.service.
> However, I would prefer to wait until https://docs.docker.com/engine/release-notes/#20100 is rolled into GH ubuntu-latest image, which we use for testing. This hopefully will fix this issue.
Still not fixed. If we need tests for Ubuntu 20.04, then more work is needed.
from ansible-mariadb-galera-cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from ansible-mariadb-galera-cluster.
Ping for removal of stale wontfix
from ansible-mariadb-galera-cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from ansible-mariadb-galera-cluster.
Ping 2.
from ansible-mariadb-galera-cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
from ansible-mariadb-galera-cluster.