Comments (21)

elcomtik avatar elcomtik commented on June 10, 2024

I would like to mention that this issue occurs only on GH Actions CI, not on Travis.

Maybe it is related to Molecule running on ubuntu-latest there.
https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/.github/workflows/default.yml#L6

It calls the Ansible service module and reads the output.
https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/tasks/setup_cluster.yml#L172-L184

It may be caused by the auto-detection of system-specific modules. If it falls back to the legacy service module instead of systemd, it may fail. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/service_module.html

As we support only systemd, I will change auto to systemd or replace the service module with the systemd module.
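
A minimal sketch of the two options, assuming the task simply starts the `mariadb` unit. The task names and the unit name are illustrative and not copied from the role:

```yaml
# Option 1: keep the service module but force the systemd backend
# instead of relying on auto-detection ("use" defaults to "auto").
- name: start mariadb via service, forcing systemd (illustrative)
  ansible.builtin.service:
    name: mariadb
    state: started
    use: systemd
  register: _mariadb_galera_cluster_joined

# Option 2: call the systemd module directly.
- name: start mariadb via systemd (illustrative)
  ansible.builtin.systemd:
    name: mariadb
    state: started
  register: _mariadb_galera_cluster_joined
```

Either way the registered result should come from the systemd module, which is what returns the `status` dictionary the role reads afterwards.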

from ansible-mariadb-galera-cluster.

elcomtik avatar elcomtik commented on June 10, 2024

I implemented a fix to check if _mariadb_galera_cluster_joined.status is defined.
Then I implemented a task to pull out logs from the service.
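
Roughly, the two changes could look like the sketch below. This is not the exact code from the role: the task names, the `retries`/`delay` values, and the `_mariadb_status` variable are assumptions made for illustration.

```yaml
- name: join galera cluster (illustrative)
  ansible.builtin.systemd:
    name: mariadb
    state: started
  register: _mariadb_galera_cluster_joined
  # Only read ActiveState when the result actually carries a "status" key.
  until: >-
    _mariadb_galera_cluster_joined.status is defined
    and _mariadb_galera_cluster_joined.status.ActiveState == "active"
  retries: 60
  delay: 5
  ignore_errors: true

- name: pull the service status when joining failed (illustrative)
  ansible.builtin.command: systemctl status mysql
  register: _mariadb_status
  changed_when: false
  # systemctl status exits non-zero for a failed unit; we only want its output.
  failed_when: false
  when: _mariadb_galera_cluster_joined is failed

- name: show the collected status (illustrative)
  ansible.builtin.debug:
    var: _mariadb_status.stdout_lines
  when: _mariadb_status.stdout_lines is defined
```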

TASK [ansible-mariadb-galera-cluster : command] ********************************
fatal: [node2]: FAILED! => {"changed": true, "cmd": ["systemctl", "status", "mysql"], "delta": "0:00:00.006400", "end": "2021-05-28 10:42:25.784635", "msg": "non-zero return code", "rc": 3, "start": "2021-05-28 10:42:25.778235", "stderr": "Failed to dump process list for 'mariadb.service', ignoring: Input/output error", "stderr_lines": ["Failed to dump process list for 'mariadb.service', ignoring: Input/output error"], "stdout": "* mariadb.service - MariaDB 10.5.10 database server\n Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)\n Drop-In: /etc/systemd/system/mariadb.service.d\n -migrated-from-my.cnf-settings.conf\n Active: failed (Result: resources)\n Docs: man:mariadbd(8)\n https://mariadb.com/kb/en/library/systemd/\n CGroup: /system.slice/containerd.service/system.slice/mariadb.service\n\nMay 28 10:42:12 node2 systemd[1]: mariadb.service: Failed with result 'resources'.\nMay 28 10:42:12 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.\nMay 28 10:42:18 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist\nMay 28 10:42:18 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy\nMay 28 10:42:18 node2 systemd[1]: mariadb.service: Failed with result 'resources'.\nMay 28 10:42:18 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.\nMay 28 10:42:24 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist\nMay 28 10:42:24 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy\nMay 28 10:42:24 node2 systemd[1]: mariadb.service: Failed with result 'resources'.\nMay 28 10:42:24 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.", "stdout_lines": ["* mariadb.service - MariaDB 10.5.10 database server", " Loaded: loaded (/lib/systemd/system/mariadb.service; 
enabled; vendor preset: enabled)", " Drop-In: /etc/systemd/system/mariadb.service.d", " -migrated-from-my.cnf-settings.conf", " Active: failed (Result: resources)", " Docs: man:mariadbd(8)", " https://mariadb.com/kb/en/library/systemd/", " CGroup: /system.slice/containerd.service/system.slice/mariadb.service", "", "May 28 10:42:12 node2 systemd[1]: mariadb.service: Failed with result 'resources'.", "May 28 10:42:12 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.", "May 28 10:42:18 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist", "May 28 10:42:18 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy", "May 28 10:42:18 node2 systemd[1]: mariadb.service: Failed with result 'resources'.", "May 28 10:42:18 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server.", "May 28 10:42:24 node2 systemd[1]: mariadb.service: Will not start SendSIGKILL=no service of type KillMode=control-group or mixed while processes exist", "May 28 10:42:24 node2 systemd[1]: mariadb.service: Failed to run 'start-pre' task: Device or resource busy", "May 28 10:42:24 node2 systemd[1]: mariadb.service: Failed with result 'resources'.", "May 28 10:42:24 node2 systemd[1]: Failed to start MariaDB 10.5.10 database server."]}

from ansible-mariadb-galera-cluster.

elcomtik avatar elcomtik commented on June 10, 2024

This looks similar to the issue at https://jira.mariadb.org/browse/MDEV-23050?attachmentOrder=desc

elcomtik avatar elcomtik commented on June 10, 2024

This can be patched by setting SendSIGKILL=yes on mariadb.service.

However, I would prefer to wait until https://docs.docker.com/engine/release-notes/#20100 is rolled into GH ubuntu-latest image, which we use for testing. This hopefully will fix this issue.

stale avatar stale commented on June 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

elcomtik avatar elcomtik commented on June 10, 2024

needs more testing

stale avatar stale commented on June 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

elcomtik avatar elcomtik commented on June 10, 2024

ping

stale avatar stale commented on June 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

elcomtik avatar elcomtik commented on June 10, 2024

I need to check this again.

BirkhoffLee avatar BirkhoffLee commented on June 10, 2024

I just ran into this issue while setting up a new cluster.

TASK [ansible-mariadb-galera-cluster : setup_cluster | killing lingering mysql processes to ensure mysql is stopped] ***
fatal: [SFM-VPS-1]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.022065", "end": "2021-12-22 01:57:29.530429", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.508364", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [SFM-VPS-4]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.011336", "end": "2021-12-22 01:57:29.754751", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.743415", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [SFM-VPS-5]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.013999", "end": "2021-12-22 01:57:29.824234", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.810235", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [SFM-VPS-6]: FAILED! => {"changed": true, "cmd": ["pkill", "mariadb"], "delta": "0:00:00.015467", "end": "2021-12-22 01:57:29.923334", "msg": "non-zero return code", "rc": 1, "start": "2021-12-22 01:57:29.907867", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [ansible-mariadb-galera-cluster : setup_cluster | configuring temp galera config for first node] ***
skipping: [SFM-VPS-4] => (item=etc/my.cnf.d/server.cnf) 
skipping: [SFM-VPS-5] => (item=etc/my.cnf.d/server.cnf) 
skipping: [SFM-VPS-6] => (item=etc/my.cnf.d/server.cnf) 
[WARNING]: Collection ansible.netcommon does not support Ansible version 2.12.1
changed: [SFM-VPS-1] => (item=etc/my.cnf.d/server.cnf)

TASK [ansible-mariadb-galera-cluster : setup_cluster | bootstrapping galera cluster] ***
skipping: [SFM-VPS-4]
skipping: [SFM-VPS-5]
skipping: [SFM-VPS-6]
changed: [SFM-VPS-1]

TASK [ansible-mariadb-galera-cluster : setup_cluster | ensure first node is fully started before joining other nodes] ***
skipping: [SFM-VPS-1]

TASK [ansible-mariadb-galera-cluster : setup_cluster | sleep for 15 seconds to wait for node WSREP prepared state] ***
ok: [SFM-VPS-4 -> localhost]
ok: [SFM-VPS-1 -> localhost]
ok: [SFM-VPS-5 -> localhost]
ok: [SFM-VPS-6 -> localhost]

TASK [ansible-mariadb-galera-cluster : setup_cluster | joining galera cluster] ***
skipping: [SFM-VPS-1]
fatal: [SFM-VPS-4]: FAILED! => {"msg": "The conditional check '_mariadb_galera_cluster_joined.status.ActiveState == \"active\"' failed. The error was: error while evaluating conditional (_mariadb_galera_cluster_joined.status.ActiveState == \"active\"): 'dict object' has no attribute 'status'"}
fatal: [SFM-VPS-5]: FAILED! => {"msg": "The conditional check '_mariadb_galera_cluster_joined.status.ActiveState == \"active\"' failed. The error was: error while evaluating conditional (_mariadb_galera_cluster_joined.status.ActiveState == \"active\"): 'dict object' has no attribute 'status'"}

fatal: [SFM-VPS-6]: FAILED! => {"msg": "The conditional check '_mariadb_galera_cluster_joined.status.ActiveState == \"active\"' failed. The error was: error while evaluating conditional (_mariadb_galera_cluster_joined.status.ActiveState == \"active\"): 'dict object' has no attribute 'status'"}

We're running CentOS 8, latest commit of this role.

> This can be patched by setting SendSIGKILL=yes on mariadb.service.

@elcomtik Do you know how I can patch mariadb.service as you said? Thanks

elcomtik avatar elcomtik commented on June 10, 2024

@BirkhoffLee This should not happen outside of a Docker container. Where do you run your MariaDB servers? I read that this also occurred on Proxmox 5.4; see https://jira.mariadb.org/browse/MDEV-23050?attachmentOrder=desc.

Personally, I haven't encountered this issue on CentOS 8, which I run myself. I run it with Ansible 2.10.16. I should test it on a newer Ansible version soon.

What ansible version do you use?

> @elcomtik Do you know how do I patch mariadb.service as you said? Thanks

mariadb.service can be modified with a systemd override file like this one: https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/templates/etc/systemd/system/mariadb.service.d/timeout-start-sec.conf.j2, which is added by these tasks: https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/tasks/timeout-start-sec.yml

I didn't test the SendSIGKILL override mentioned above, because I thought the original issue would be fixed in the Ubuntu Docker image. I wouldn't recommend using it in production, because it may cause cluster instabilities.
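
An untested sketch of such an override, modelled on the drop-in approach linked above. The file name `send-sigkill.conf` and the registered variable are hypothetical, and the caveat about production use still applies:

```yaml
- name: ensure the drop-in directory exists
  ansible.builtin.file:
    path: /etc/systemd/system/mariadb.service.d
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: add SendSIGKILL override for mariadb.service (untested sketch)
  ansible.builtin.copy:
    # Hypothetical file name; any *.conf in the drop-in directory is read.
    dest: /etc/systemd/system/mariadb.service.d/send-sigkill.conf
    content: |
      [Service]
      SendSIGKILL=yes
    owner: root
    group: root
    mode: "0644"
  register: _send_sigkill_override

- name: reload systemd so the override takes effect
  ansible.builtin.systemd:
    daemon_reload: true
  when: _send_sigkill_override is changed
```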

You may try and give me feedback.

BirkhoffLee avatar BirkhoffLee commented on June 10, 2024

> where do you run your MariaDB servers?

OVH VPS, most likely not Proxmox.

> What ansible version do you use?

I was using 2.12 in the last comment. I switched to 2.10.16, and the problem persists.
python version = 3.9.9 (main, Nov 21 2021, 03:23:42) [Clang 13.0.0 (clang-1300.0.29.3)]

@elcomtik I just dug into the issue more and found that _mariadb_galera_cluster_joined is only registered once the service task finishes. Unfortunately, the MariaDB service stays in the failed state, with the log below:

● mariadb.service - MariaDB 10.6.5 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2021-12-22 09:01:54 CST; 4min 57s ago
     Docs: man:mariadbd(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 126847 ExecStart=/usr/sbin/mariadbd $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=1/FAILURE)
  Process: 126794 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`cd /usr/bin/..; /usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR>
  Process: 126792 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 126847 (code=exited, status=1/FAILURE)
   Status: "MariaDB server is down"

Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22  9:01:54 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]:          at /home/buildbot/buildbot/build/gcomm/src/pc.cpp:connect():160
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22  9:01:54 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection ti>
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22  9:01:54 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs.cpp:gcs_open():1633: Failed to open channel 'sfm-galera-1' at 'gcomm://100.70.129.98>
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22  9:01:54 0 [ERROR] WSREP: gcs connect failed: Connection timed out
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22  9:01:54 0 [ERROR] WSREP: wsrep::connect(gcomm://100.70.129.98,100.103.157.59,100.93.158.54,100.85.108.84) failed: 7
Dec 22 09:01:54 SFM-VPS-5 mariadbd[126847]: 2021-12-22  9:01:54 0 [ERROR] Aborting
Dec 22 09:01:54 SFM-VPS-5 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 09:01:54 SFM-VPS-5 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Dec 22 09:01:54 SFM-VPS-5 systemd[1]: Failed to start MariaDB 10.6.5 database server.

The IP addresses are within an internal network; connectivity is fine and latency is <10ms. To get this far, I changed the until conditional of the task setup_cluster | stopping mysql to (re)configure cluster (other nodes) to until: _mariadb_galera_cluster_node.status.ActiveState != "active", since the service stays in the failed state.

I think the root cause is in the log above. Do you have an idea how this could be fixed? Again, thank you for your fast response.

Edit: Confirmed that the other nodes cannot connect to the first node because the MariaDB instance on the first node is not listening on the WSREP port.
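
A check like this could be run on the joining nodes to catch that situation early. It is only a sketch: `galera_first_node_address` is a hypothetical variable, and 4567 is Galera's default replication port, which may differ if the configuration overrides it.

```yaml
- name: check that the first node accepts connections on the Galera port
  ansible.builtin.wait_for:
    host: "{{ galera_first_node_address }}"  # hypothetical variable
    port: 4567                               # Galera default; adjust if overridden
    timeout: 30
```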

2nd edit: A configuration inconsistency was causing mariadbd to be in the failed state, so _mariadb_galera_cluster_node.status.ActiveState == "inactive" would never evaluate to true. Thus, when reconfiguring a failed cluster, the role never gets the chance to actually apply the correct config. In my case, the first node tried to connect to the other nodes while the role was reconfiguring, and mariadbd on the other nodes was in the failed state. So mariadbd on the first node would not launch and ended up in the failed state as well (it would just take galera_new_cluster to fix this first node). Therefore the tasks stopping mysql to (re)configure cluster (other nodes) and stopping mysql to (re)configure cluster (first node) will always fail, and the role cannot reconfigure the cluster without human intervention.
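
One possible way around this, shown only as a sketch and not as the role's actual task, is to treat a unit that is already in the failed state as stopped. The task name and the `retries`/`delay` values are assumptions:

```yaml
- name: stop mariadb to (re)configure cluster (illustrative)
  ansible.builtin.systemd:
    name: mariadb
    state: stopped
  register: _mariadb_galera_cluster_node
  # A failed unit stays "failed" after a stop, so accept both states.
  until: >-
    _mariadb_galera_cluster_node.status is defined
    and _mariadb_galera_cluster_node.status.ActiveState in ["inactive", "failed"]
  retries: 60
  delay: 5
```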

elcomtik avatar elcomtik commented on June 10, 2024

I thought it might be caused by firewall issues, where the first node's WSREP port is not reachable by the joining node. It makes sense that if the first node has failed, it cannot be joined.

This role can be executed only on clean nodes or on a healthy cluster. If a node has failed, it is better to fix it or clean it up and then execute the role. This is a known issue (at least to me), which should be addressed in a new GH issue.

stale avatar stale commented on June 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

elcomtik avatar elcomtik commented on June 10, 2024

> This can be patched by setting SendSIGKILL=yes on mariadb.service.
>
> However, I would prefer to wait until https://docs.docker.com/engine/release-notes/#20100 is rolled into GH ubuntu-latest image, which we use for testing. This hopefully will fix this issue.

Still not fixed. If we need tests for Ubuntu 20.04, then more work is needed.

stale avatar stale commented on June 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

elcomtik avatar elcomtik commented on June 10, 2024

Ping for removal of stale wontfix

stale avatar stale commented on June 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

eRadical avatar eRadical commented on June 10, 2024

Ping 2.

stale avatar stale commented on June 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
