pacemaker fails to release clustered filesystem dlm locks on failover

Bug #1915828 reported by Jim Dickie
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
pacemaker (Ubuntu)
Fix Released
Undecided
Unassigned
Focal
Fix Released
Medium
Dariusz Gadomski
Groovy
Fix Released
Medium
Dariusz Gadomski

Bug Description

[impact]

programs using libqb logging exit due to failed assertion on qb log init

[test case]

test program:

#include <qb/qblog.h>

QB_LOG_INIT_DATA(test);

int main(int argc, char* argv[])
{
  return 0;
}

compile and run:

$ gcc -flto -D_GNU_SOURCE -o test test.c -lqb -ldl
/usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?

$ ./test
test: test.c:4: test: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
Aborted (core dumped)

Note the error is slightly different when compiling without lto:

$ gcc -D_GNU_SOURCE -o test test.c -lqb -ldl
/usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?

$ ./test
test: test.c:4: test: Assertion `"implicit callsite section is populated, otherwise target's build is at fault, preventing reliable logging" && QB_ATTR_SECTION_START != QB_ATTR_SECTION_STOP' failed.
Aborted (core dumped)

[regression potential]

any regression would likely involve problems during logging using the libqb logging functions, which could include failure to log or even program exit and/or crash.

additionally, altering of build flags (namely -DQB_KILL_ATTRIBUTE_SECTION) removes some symbols from pacemaker libraries (please see the debdiffs for the full list of them). Those seem to be previously defined by macros (resolved in the end to QB_LOG_INIT_DATA) and used internally by libqb for logging purposes. If there was anything using those symbols build time or runtime missing symbols may be reported.

[scope]

this appears to be needed only for focal; the issue seems to be an interaction between the focal version of binutils and some linker "magic" that libqb used in the focal version.

The upstream libqb removed/replaced that linker "magic" after the version in focal, so this should not affect groovy or later. However, the fix changes the ABI and thus isn't appropriate for SRUing.
https://github.com/ClusterLabs/libqb/pull/322

The libqb code in bionic does not include the linker "magic" and so does not have this problem.

[other info]

related debian binutils bug report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923246

related gcc bug report:
https://sourceware.org/bugzilla/show_bug.cgi?id=24276

however, those appear to only have changed binutils to ignore the issue to allow the build to stop failing.

The libqb docs do contain two suggestions to possibly work around this bug, specifically using either -l:libqb.so.0 or -DQB_KILL_ATTRIBUTE_SECTION, or both. Either or both approaches do help with the simple test case, but more testing is needed that actually exercises the log functionality to make sure nothing else breaks.

$ gcc -flto -D_GNU_SOURCE -o test test.c -lqb -ldl
/usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?
$ ./test
test: test.c:4: test: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
Aborted (core dumped)

$ gcc -flto -D_GNU_SOURCE -o test test.c -l:libqb.so.0 -ldl
$ ./test

$ gcc -flto -DQB_KILL_ATTRIBUTE_SECTION -D_GNU_SOURCE -o test test.c -lqb -ldl
/usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?
$ ./test

[original description]

When a clustered node is detected as failed the remaining node tries to fence the resources. When using pacemaker with gfs2 on an lvm2 logical volume dlm_controld calls out to dlm_stonith to release any locks held.

Due to a build issue with the version of libqb that pacemaker is compiled against, the call to QB_LOG_INIT_DATA which is #defined to CRM_TRACE_INIT_DATA, fails with an assertion. This prevents the lock manager from releasing any held locks on the failed node.

At this point the gfs2 filesystem cannot be accessed and after any resource timeouts are met, the resource is marked as failed.

Calling dlm_stonith by hand with the data that is passed to it by dlm_controld shows the assertion.

root@u2004-1:~# /usr/sbin/dlm_stonith -n 2 -t 1612361398
dlm_stonith: utils.c:57: common: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.

It would appear that the code in libqb is over aggressive on the sanity checking, or assumes that QB_LOG_INIT_DATA will only be called by the library. External programs such as pacemaker that end up calling CRM_TRACE_INIT_DATA will suffer the same assertion.

This patch from clusterlabs is an attempt to resolve the assertion, but is still not sufficient. https://lists.clusterlabs.org/pipermail/users/2018-February/023614.html

Taking out the assertion in <qb/qblog.h> and recompiling pacemaker appears to be the only way to allow dlm_stonith to work.

journalctl shows dlm_controld keeps trying to get a successful response from dlm_stonith

Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence result 2 pid 26568 result -1 term signal 6
Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence status 2 receive -1 from 1 walltime 1613481117 local 4389
Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence request 2 pid 26607 nodedown time 1613481102 fence_all dlm_stonith
Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence result 2 pid 26607 result -1 term signal 6
Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence status 2 receive -1 from 1 walltime 1613481118 local 4391
Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence request 2 pid 26637 nodedown time 1613481102 fence_all dlm_stonith
Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence result 2 pid 26637 result -1 term signal 6
Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence status 2 receive -1 from 1 walltime 1613481120 local 4392
Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence request 2 pid 26693 nodedown time 1613481102 fence_all dlm_stonith
....

Calling 'dlm_tool fence_ack 2' by hand immediately releases the dlm resource locks.

root@u2004-1:~# lsb_release -rd
Description: Ubuntu 20.04 LTS
Release: 20.04

root@u2004-1:~# apt-cache policy pacemaker
pacemaker:
  Installed: 2.0.3-3ubuntu4.1
  Candidate: 2.0.3-3ubuntu4.1
  Version table:
 *** 2.0.3-3ubuntu4.1 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        500 http://gb.archive.ubuntu.com/ubuntu focal-security/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.3-3ubuntu3 500
        500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages

tags: added: sts
tags: added: server-next
Changed in pacemaker (Ubuntu):
assignee: nobody → Dariusz Gadomski (dgadomski)
importance: Undecided → Medium
Dan Streetman (ddstreet)
description: updated
Dan Streetman (ddstreet)
Changed in pacemaker (Ubuntu):
status: New → Fix Released
Changed in pacemaker (Ubuntu Focal):
assignee: nobody → Dariusz Gadomski (dgadomski)
importance: Undecided → Medium
status: New → In Progress
Changed in pacemaker (Ubuntu):
assignee: Dariusz Gadomski (dgadomski) → nobody
importance: Medium → Undecided
Dan Streetman (ddstreet)
description: updated
Revision history for this message
Dan Streetman (ddstreet) wrote :

It appears the linker is eliding the section that libqb's "linker magic" expects when the linker detects that the compiled program doesn't actually use that section, which seems completely reasonable for the linker to do. This seems to be a case of libqb trying to be "clever" by relying on an implementation detail of the linker, which is bad.

A slightly adjusted test program, to actually use the libqb log functions, does not reproduce the problem, e.g.:

#include <qb/qblog.h>

QB_LOG_INIT_DATA(test);

int main(int argc, char* argv[])
{
  qb_log_init("test", LOG_USER, LOG_INFO);

  qb_log(LOG_ERR, "test\n");

  qb_log_fini();

  return 0;
}

Compiling that with or without -DQB_KILL_ATTRIBUTE_SECTION results in the message "test" logged to syslog when run, so it appears safe to compile pacemaker (and any other programs using libqb that show this problem) with that define, to work around this issue.

Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

I have checked it again on Groovy and looks like change from https://github.com/ClusterLabs/libqb/pull/322 did not make it to Groovy version of libqb. Also in a test the behavior was identical to Focal, so I have targetted the bug to the series.

Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

Adding -DQB_KILL_ATTRIBUTE_SECTION to CFLAGS seems to result in some symbols disappearing during the build:
https://paste.ubuntu.com/p/hmBpMXGjqy/

Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

The list of symbols missing seem to be consistent with the onces defined with the CRM_TRACE_INIT_DATA macro:
lib/lrmd/lrmd_client.c
46:CRM_TRACE_INIT_DATA(lrmd);

lib/pacemaker/pcmk_trans_unpack.c
20:CRM_TRACE_INIT_DATA(transitioner);

lib/fencing/st_client.c
37:CRM_TRACE_INIT_DATA(stonith);

lib/pacemaker/pcmk_sched_allocate.c
24:CRM_TRACE_INIT_DATA(pe_allocate);

lib/cluster/cluster.c
25:CRM_TRACE_INIT_DATA(cluster);

lib/common/utils.c
57:CRM_TRACE_INIT_DATA(common);

lib/pengine/unpack.c
28:CRM_TRACE_INIT_DATA(pe_status);

lib/pengine/rules.c
25:CRM_TRACE_INIT_DATA(pe_rules);

The macro itself is defined in the following way:
include/crm/common/logging.h
112: #define CRM_TRACE_INIT_DATA(name) QB_LOG_INIT_DATA(name)

On the other hand QB_LOG_INIT_DATA is defined in libqb as follows:
#if defined(QB_KILL_ATTRIBUTE_SECTION) || defined(S_SPLINT_S)
#undef QB_HAVE_ATTRIBUTE_SECTION
#endif /* defined(QB_KILL_ATTRIBUTE_SECTION) || defined(S_SPLINT_S) */

#ifdef QB_HAVE_ATTRIBUTE_SECTION
// ...
#else
#define QB_LOG_INIT_DATA(name)
#endif /* QB_HAVE_ATTRIBUTE_SECTION */

So in the end with the QB_KILL_ATTRIBUTE_SECTION defined the macro QB_LOG_INIT_DATA is being left empty. Hence the missing symbols.

Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

The symbols defined with CRM_TRACE_INIT_DATA doesn't seem to be used anywhere inside pacemaker and it's less than likely those are used anywhere outside of it.

The definitions seem to be strictly logging related without any other functionality declared.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (3.4 KiB)

I'm pasting an email answer here so we keep track of all that has been discussed for this issue:

----

> Me and Dan have been working recently on a customer case reported as LP: #1915828. Turns out that some linker "magic" used inside libqb broke pacemaker (and potentially any package using QB_LOG_INIT_DATA from the library or even every packaged linked against libqb).

I have checked libqb commits and: https://paste.ubuntu.com/p/vTdYJdC4Hc/ . I see that upstream added some options in configure.ac related to the linking issue, even quoting debian/ubuntu option to override libtool variable by force. There is also an existing test for the __attribute__((section("__verbose"))) breakage detection now (gcc_has_attribute_section_visible variable in configure.ac.

> Dan found an ABI-compatible way of mitigating the problem by rebuilding pacemaker with the QB_KILL_ATTRIBUTE_SECTION defined. This strips the magic and makes pacemaker useful again.

I see the mitigation (QB_KILL_ATTRIBUTE_SECTION) comes from:

--
32555d8 tests: add a script to generate callsite-heavy logging client…
 ...so as to evaluate use of resources. In particular, the intention
 here is to uncover the observable differences between the same logging
 code built with callsite section (default when available) and
 purposefully (overriding that default by force) without it.
...
--

And they even added “tests/functional/log_callsite_bench_gen.py” to measure the impact of this mitigation.

I’m particularly worried with:


    Based on the above, we can conclude that leveraging the callsite
    section for logging as facilitated by the toolchain intrinsics is
    beneficial, especially for performance-critical applications (corosync
    being the showcase here). Therefore it's desired to struggle for
    retaining this nifty trick despite some troubles emerged with recent
    binutils releases (starting with 2.29) and the changed behaviour we
    relied on so far in respective ld.bfd linkers (as mentioned in
    preceding commits). That motive is immediately followed -- well,
    judging the impact fairly, actually outclassed -- with the intention
    to preserve binary compatibility (incl. continuous library support for
    callsite section offloading spread in the existing client space widely
    for quite some years already) to the utmost extent possible.
--

I believe this will be accepted by the SRU team but this, for sure, has to be mentioned in the public bug. I would add to [regression potential] the fact that the logging mechanism would stir heap more often (commit log has even a time execution delta).

Shouldn’t this bug also affect all the libqb0 rdepends ? I can see pacemaker, sbd, corosync and usbguard source packages.

> The problem is (more details in comments #3-#5) some symbols disappear from the package. Those symbols doesn't seem to be used anywhere explicitly, but we were wondering if it's ok to just drop those symbols or maybe to implement a change in libqb to create dummy constructors (e.g. https://pastebin.canonical.com/p/Y4fk747YfK/) to ensure the symbols are available just in case.

For the pacemaker fix I’ll let the SRU team to discuss whether they would ...

Read more...

Changed in pacemaker (Ubuntu Groovy):
assignee: nobody → Dariusz Gadomski (dgadomski)
Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

SRU proposal for focal

Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

SRU proposal for groovy

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Jim, or anyone else affected,

Accepted pacemaker into groovy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/2.0.4-2ubuntu3.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-groovy to verification-done-groovy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-groovy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in pacemaker (Ubuntu Groovy):
status: New → Fix Committed
tags: added: verification-needed verification-needed-groovy
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Jim, or anyone else affected,

Accepted pacemaker into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/2.0.3-3ubuntu4.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in pacemaker (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed-focal
description: updated
Changed in pacemaker (Ubuntu Groovy):
importance: Undecided → Medium
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (pacemaker/2.0.3-3ubuntu4.2)

All autopkgtests for the newly accepted pacemaker (2.0.3-3ubuntu4.2) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

pacemaker/2.0.3-3ubuntu4.2 (ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#pacemaker

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

# verification focal

$ apt-cache policy libcrmcommon34 | grep Installed
  Installed: 2.0.3-3ubuntu4.1

# dlm_stonith -t 5 -n 1
dlm_stonith: utils.c:57: common: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
Aborted (core dumped)

$ apt-cache policy libcrmcommon34 | grep Installed
  Installed: 2.0.3-3ubuntu4.2

# dlm_stonith -t 5 -n 1
kick_helper error -79 nodeid 1

tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Dariusz Gadomski (dgadomski) wrote :

# verification groovy

$ apt-cache policy libcrmcommon34 | grep Installed
  Installed: 2.0.4-2ubuntu3.1

# dlm_stonith -t 5 -n 1089
dlm_stonith: utils.c:48: common: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
Aborted (core dumped)

$ apt-cache policy libcrmcommon34 | grep Installed
  Installed: 2.0.4-2ubuntu3.2

# dlm_stonith -t 5 -n 1089
kick_helper error -107 nodeid 1089

tags: added: verification-done-groovy
removed: verification-needed-groovy
tags: added: verification-done
removed: verification-needed
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for pacemaker has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 2.0.4-2ubuntu3.2

---------------
pacemaker (2.0.4-2ubuntu3.2) groovy; urgency=medium

  * d/rules: Rebuild with QB_KILL_ATTRIBUTE_SECTION to overcome a problem in
    libqb (LP: #1915828)

 -- Dariusz Gadomski <email address hidden> Thu, 04 Mar 2021 10:52:40 +0100

Changed in pacemaker (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 2.0.3-3ubuntu4.2

---------------
pacemaker (2.0.3-3ubuntu4.2) focal; urgency=medium

  * d/rules: Rebuild with QB_KILL_ATTRIBUTE_SECTION to overcome a problem in
    libqb (LP: #1915828)

 -- Dariusz Gadomski <email address hidden> Thu, 04 Mar 2021 10:29:48 +0100

Changed in pacemaker (Ubuntu Focal):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.