Galera agent doesn't work when grastate.dat contains safe_to_bootstrap

Bug #1789527 reported by Aymen Frikha
Affects                    Status         Importance  Assigned to  Milestone
resource-agents (Ubuntu)   Fix Released   Undecided   Unassigned
  Trusty                   Won't Fix      Undecided   Unassigned
  Xenial                   Won't Fix      Undecided   Unassigned
  Bionic                   Fix Released   Undecided   Unassigned

Bug Description

The Galera resource agent is not able to bring MySQL up and promote it to master, even if the safe_to_bootstrap flag in grastate.dat is set to 1.

* res_percona_promote_0 on 09fde2-2 'unknown error' (1): call=1373, status=complete, exitreason='MySQL server failed to start (pid=2432) (rc=0), please check your installation',

The resource agent does not handle the safe_to_bootstrap feature in Galera: http://galeracluster.com/2016/11/introducing-the-safe-to-bootstrap-feature-in-galera-cluster/
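
For reference, a grastate.dat with the flag set looks roughly like this (the uuid and seqno values are placeholders, not taken from this deployment):

# GALERA saved state
version: 2.1
uuid:    <cluster-state-uuid>
seqno:   <last-committed-seqno>
safe_to_bootstrap: 1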

I use the Percona XtraDB Cluster database, which uses the same Galera mechanism for clustering.

Packages I use in Xenial:

resource-agents 3.9.7-1
percona-xtradb-cluster-server-5.6 5.6.37-26.21-0ubuntu0.16.04.2
pacemaker 1.1.14-2ubuntu1.4
corosync 2.3.5-3ubuntu2.1

A workaround exists at: https://github.com/ClusterLabs/resource-agents/issues/915
A fix also exists, but it was not applied to the Xenial package: https://github.com/ClusterLabs/resource-agents/pull/1022

Is it possible to add this fix to the current resource-agents package in Xenial?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Thanks for bringing this up together with an upstream patch.

Changed in resource-agents (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Christian Reis (kiko) wrote :
tags: added: server-next
Changed in resource-agents (Ubuntu):
importance: Medium → High
Changed in resource-agents (Ubuntu):
assignee: nobody → Andreas Hasenack (ahasenack)
status: Triaged → In Progress
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I have patched packages in this ppa if someone wants to try them: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3506/
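
For anyone wanting to try them, the usual PPA steps should be enough (the short name below is taken from that URL):

sudo add-apt-repository ppa:ci-train-ppa-service/3506
sudo apt-get update
sudo apt-get install resource-agents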

I couldn't reproduce the bug so far. I deployed the percona-cluster charm on xenial with -n 3 on lxd, and then used lxc stop -f on all units to force-kill them. They came back up just fine. Then I used the charm from https://review.openstack.org/#/c/597969/, with the same result. That charm doesn't get resource-agents installed for me, which I thought odd, so maybe I'm doing something wrong.

Revision history for this message
Aymen Frikha (aym-frikha) wrote :

Hi Andreas,

Thank you very much for the package provided. It works fine for me.
You can reproduce the bug when you deploy the hacluster charm as a subordinate service for percona-cluster. It will deploy Pacemaker and set up the VIP and the resource agent to manage the Percona cluster.
It uses the Galera resource agent to control the Percona database because they share the same clustering mechanism.
The hacluster charm uses packages from the Xenial main repository. Is it possible to backport the patch there?

Thanks

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Sure, I just need to get the test case in order. I'll try that today.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This wasn't enough to trigger the bug:

juju bootstrap lxd
juju deploy -n 3 cs:xenial/percona-cluster
juju config percona-cluster vip=<someipIhave> min-cluster-size=3
juju deploy hacluster
juju add-relation percona-cluster hacluster
(wait)
do a lxc stop -f <all percona units>
(wait for juju status to notice)
do a lxc start <all percona units>
(wait for juju status to become green)

It was my understanding that this should trigger the original bug, but maybe it's racy, or it needs active database writes to be happening at the moment of the shutdown.

I'll try again now with your charm, the one that uses resource-agents and fails without an update to resource-agents. But if you have another easy way to reproduce the bug, please do tell.

Revision history for this message
Aymen Frikha (aym-frikha) wrote :

Yes, you are right Andreas, you need active database writes when you shut them down. The resource agent automatically detects which instance has the last commit, starts it as master, and resumes the replication to the other instances.
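
One rough way to keep writes flowing while the units are force-stopped, for anyone trying to reproduce this (database name, table and credentials below are only placeholders):

# create a throwaway table once
mysql -uroot -p<password> -e "CREATE DATABASE IF NOT EXISTS bugtest; CREATE TABLE IF NOT EXISTS bugtest.t (id INT AUTO_INCREMENT PRIMARY KEY, v VARCHAR(32));"
# keep inserting until the shell is killed
while true; do mysql -uroot -p<password> -e "INSERT INTO bugtest.t (v) VALUES ('x');"; done &
# with the loop running, force-kill the percona units as above:
# lxc stop -f <all percona units>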

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

By doing graceful shutdowns I can get into a state where the last node to die has "safe_to_bootstrap: 1" in its grastate.dat file. But I couldn't get that node running again, which was odd, as it should be the *only* one that can be started. I had to use one of the other initscript targets, restart-bootstrap, instead of just restart, or else it would time out trying to reach the "juju_cluster" channel:

2018-11-09 18:54:58 14147 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1478: Failed to open channel 'juju_cluster' at 'gcomm://10.0.100.131,10.0.100.191': -110 (Connection timed out)

I see two options here (at least):
a) we backport just what was called the workaround bit, since you say this is what you have been using for a long time now. That is the bit that handles the case where all nodes crashed, and thus "safe_to_bootstrap" is set to zero in all of them. Without the fix, in that case no node will be able to start up. The fix uses the same logic that was always used to determine the right node to start before "safe_to_bootstrap" existed, and once it finds that node, it just flips that flag to 1 to allow the service to be started (see the sketch after this list).
b) we backport the full patch, which consists of part (a) above, plus skipping the logic to find the right node to start if it finds "safe_to_bootstrap" set to 1. This one will need more testing.
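
For illustration, the flag flip described in (a) boils down to something like this on the node the agent selects (a sketch only; the datadir path is the one from the Pacemaker configuration further down, and the actual patch does this inside the resource agent):

# mark the chosen node as safe so mysqld will agree to bootstrap
sed -i 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' /var/lib/percona-xtradb-cluster/grastate.dat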

Revision history for this message
Aymen Frikha (aym-frikha) wrote :

Can you test using this Pacemaker configuration?

primitive p_percona ocf:heartbeat:galera \
        params wsrep_cluster_address="gcomm://controller-1,controller-2,controller-3" \
        params config="/etc/mysql/my.cnf" \
        params datadir="/var/lib/percona-xtradb-cluster" \
        params socket="/var/run/mysqld/mysqld.sock" pid="/var/run/mysqld/mysqld.pid" \
        params check_user=root check_passwd=****** \
        params binary="/usr/bin/mysqld_safe" \
        op monitor timeout=120 interval=20 depth=0 \
        op monitor role=Master timeout=120 interval=10 depth=0 \
        op monitor role=Slave timeout=120 interval=30 depth=0
ms ms_percona p_percona \
        meta notify=true interleave=true \
        meta master-max=3 \
        meta ordered=true target-role=Started
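
Once a configuration like that is loaded, a one-shot status check should show whether ms_percona got promoted on the expected node, e.g.:

crm_mon -1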

Robie Basak (racb)
tags: removed: server-next
Revision history for this message
Robie Basak (racb) wrote :

Workaround -> Importance: Medium, and back to Triaged since this isn't being actively worked on (it's in our backlog).

Changed in resource-agents (Ubuntu):
status: In Progress → Triaged
assignee: Andreas Hasenack (ahasenack) → nobody
importance: High → Medium
Revision history for this message
Robie Basak (racb) wrote :

(if anyone can contribute a fix in terms of a debdiff that is ready to apply, please do and I'll make sure it gets looked at promptly)

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I'm assigning myself to check this next time I touch the cleanup for resource-agents.

Looks like this bug is blocking the following change:

https://review.opendev.org/#/c/597969/

to the percona-cluster charm.

For the start ordering issue:

https://bugs.launchpad.net/charm-percona-cluster/+bug/1744393

There is the merge:

https://review.opendev.org/#/c/670675/

https://opendev.org/openstack/charm-percona-cluster/commit/b8c2213dfbd4ae417be95a8ce1b1c973eee9e55c

And that bug is not "Fix Released" yet.

Changed in resource-agents (Ubuntu):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

commit 16dee87e24ee1a0d6e37b5fa7b91c303f7c912db
Author: Ralf Haferkamp <email address hidden>
Date: Tue Aug 22 15:47:47 2017 +0200

    galera: Honor "safe_to_bootstrap" flag in grastate.dat

    With version 3.19 galera introduced the "safe_to_bootstrap" flag to the
    grastate.dat file [1]. When all nodes of a cluster are shut down cleanly,
    the last node shutting down gets this flag set to 1. (All others get a
    0).

    This commit enhances the galera resource agent to make use of that flag
    when selecting an appropriate node for bootstrapping the cluster. When
    any of the cluster nodes has the "safe_to_bootstrap" flag set to 1, that
    node is immediately selected as the bootstrap node of the cluster.

    When all nodes have safe_to_bootstrap=0 or the flag is not present, the
    current bootstrap behaviour is mostly unchanged. We just set
    "safe_to_bootstrap" to 1 in grastate.dat on the selected bootstrap node
    to allow galera to start, as outlined in the galera documentation
    [2].

    Fixes: #915

    [1] http://galeracluster.com/2016/11/introducing-the-safe-to-bootstrap-feature-in-galera-cluster
    [2] http://galeracluster.com/documentation-webpages/restartingcluster.html#safe-to-bootstrap-protection

$ git describe --tags 16dee87e24ee1a0d6e37b5fa7b91c303f7c912db
v4.0.1-107-g16dee87e
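
In shell terms, the detection side of that commit is roughly equivalent to checking the local grastate.dat like this (a sketch, not the actual agent code; the path depends on the configured datadir):

# prints 1 when this node is marked safe to bootstrap, 0 otherwise
grep -q '^safe_to_bootstrap:[[:space:]]*1' /var/lib/mysql/grastate.dat && echo 1 || echo 0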

affected:

resource-agents | 1:3.9.2-5ubuntu4 | precise |
resource-agents | 1:3.9.2-5ubuntu4.1 | precise-updates |
resource-agents | 1:3.9.3+git20121009-3ubuntu2 | trusty |
resource-agents | 1:3.9.7-1 | xenial |
resource-agents | 1:3.9.7-1ubuntu1.1 | xenial-updates |

not affected:

resource-agents | 1:4.1.0~rc1-1ubuntu1 | bionic |
resource-agents | 1:4.1.0~rc1-1ubuntu1.2 | bionic-updates |
resource-agents | 1:4.2.0-1ubuntu1 | disco |
resource-agents | 1:4.2.0-1ubuntu1.1 | disco-updates |
resource-agents | 1:4.2.0-1ubuntu2 | eoan |
resource-agents | 1:4.4.0-3ubuntu1 | focal |

Changed in resource-agents (Ubuntu Trusty):
status: New → Won't Fix
Changed in resource-agents (Ubuntu Xenial):
status: New → Confirmed
Changed in resource-agents (Ubuntu Bionic):
status: New → Fix Released
Changed in resource-agents (Ubuntu):
status: Triaged → Fix Released
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
importance: Medium → Undecided
tags: added: block-proposed-xenial
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I have added the block-proposed-xenial tag so I can do multiple fixes in a single SRU for xenial.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Xenial has reached its end of standard support.

Changed in resource-agents (Ubuntu Xenial):
status: Confirmed → Won't Fix