pacemaker fails to release clustered filesystem dlm locks on failover
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pacemaker (Ubuntu) | Fix Released | Undecided | Unassigned |
Focal | Fix Released | Medium | Dariusz Gadomski |
Groovy | Fix Released | Medium | Dariusz Gadomski |
Bug Description
[impact]
programs using libqb logging exit due to failed assertion on qb log init
[test case]
test program:
#include <qb/qblog.h>
QB_LOG_
int main(int argc, char* argv[])
{
return 0;
}
compile and run:
$ gcc -flto -D_GNU_SOURCE -o test test.c -lqb -ldl
/usr/bin/ld: warning: /usr/lib/
$ ./test
test: test.c:4: test: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
Aborted (core dumped)
Note the error is slightly different when compiling without lto:
$ gcc -D_GNU_SOURCE -o test test.c -lqb -ldl
/usr/bin/ld: warning: /usr/lib/
$ ./test
test: test.c:4: test: Assertion `"implicit callsite section is populated, otherwise target's build is at fault, preventing reliable logging" && QB_ATTR_
Aborted (core dumped)
[regression potential]
any regression would likely involve problems during logging using the libqb logging functions, which could include failure to log or even program exit and/or crash.
additionally, altering of build flags (namely -DQB_KILL_
[scope]
this appears to be needed only for focal; the issue seems to be an interaction between the focal version of binutils and some linker "magic" that libqb used in the focal version.
The upstream libqb removed/replaced that linker "magic" after the version in focal, so this should not affect groovy or later. However, the fix changes the ABI and thus isn't appropriate for SRUing.
https:/
The libqb code in bionic does not include the linker "magic" and so does not have this problem.
[other info]
related debian binutils bug report:
https:/
related gcc bug report:
https:/
however, those appear to only have changed binutils to ignore the issue to allow the build to stop failing.
The libqb docs do contain two suggestions to possibly work around this bug, specifically using either -l:libqb.so.0 or -DQB_KILL_
$ gcc -flto -D_GNU_SOURCE -o test test.c -lqb -ldl
/usr/bin/ld: warning: /usr/lib/
$ ./test
test: test.c:4: test: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
Aborted (core dumped)
$ gcc -flto -D_GNU_SOURCE -o test test.c -l:libqb.so.0 -ldl
$ ./test
$ gcc -flto -DQB_KILL_
/usr/bin/ld: warning: /usr/lib/
$ ./test
[original description]
When a clustered node is detected as failed, the remaining node tries to fence the resources. When using pacemaker with gfs2 on an lvm2 logical volume, dlm_controld calls out to dlm_stonith to release any locks held.
Due to a build issue with the version of libqb that pacemaker is compiled against, the call to QB_LOG_INIT_DATA which is #defined to CRM_TRACE_
At this point the gfs2 filesystem cannot be accessed, and once any resource timeouts expire, the resource is marked as failed.
Calling dlm_stonith by hand with the data that is passed to it by dlm_controld shows the assertion.
root@u2004-1:~# /usr/sbin/
dlm_stonith: utils.c:57: common: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
It would appear that the code in libqb is overly aggressive in its sanity checking, or assumes that QB_LOG_INIT_DATA will only be called by the library. External programs such as pacemaker that end up calling CRM_TRACE_INIT_DATA will hit the same assertion.
This patch from clusterlabs is an attempt to resolve the assertion, but is still not sufficient. https:/
Taking out the assertion in <qb/qblog.h> and recompiling pacemaker appears to be the only way to allow dlm_stonith to work.
journalctl shows that dlm_controld keeps retrying to get a successful response from dlm_stonith:
Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence result 2 pid 26568 result -1 term signal 6
Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence status 2 receive -1 from 1 walltime 1613481117 local 4389
Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence request 2 pid 26607 nodedown time 1613481102 fence_all dlm_stonith
Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence result 2 pid 26607 result -1 term signal 6
Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence status 2 receive -1 from 1 walltime 1613481118 local 4391
Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence request 2 pid 26637 nodedown time 1613481102 fence_all dlm_stonith
Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence result 2 pid 26637 result -1 term signal 6
Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence status 2 receive -1 from 1 walltime 1613481120 local 4392
Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence request 2 pid 26693 nodedown time 1613481102 fence_all dlm_stonith
....
Calling 'dlm_tool fence_ack 2' by hand immediately releases the dlm resource locks.
root@u2004-1:~# lsb_release -rd
Description: Ubuntu 20.04 LTS
Release: 20.04
root@u2004-1:~# apt-cache policy pacemaker
pacemaker:
Installed: 2.0.3-3ubuntu4.1
Candidate: 2.0.3-3ubuntu4.1
Version table:
*** 2.0.3-3ubuntu4.1 500
500 http://
500 http://
100 /var/lib/
2.0.3-3ubuntu3 500
500 http://
tags: | added: sts |
tags: | added: server-next |
Changed in pacemaker (Ubuntu): | |
assignee: | nobody → Dariusz Gadomski (dgadomski) |
importance: | Undecided → Medium |
description: | updated |
Changed in pacemaker (Ubuntu): | |
status: | New → Fix Released |
Changed in pacemaker (Ubuntu Focal): | |
assignee: | nobody → Dariusz Gadomski (dgadomski) |
importance: | Undecided → Medium |
status: | New → In Progress |
Changed in pacemaker (Ubuntu): | |
assignee: | Dariusz Gadomski (dgadomski) → nobody |
importance: | Medium → Undecided |
description: | updated |
Changed in pacemaker (Ubuntu Groovy): | |
assignee: | nobody → Dariusz Gadomski (dgadomski) |
description: | updated |
Changed in pacemaker (Ubuntu Groovy): | |
importance: | Undecided → Medium |
It appears the linker is eliding the section that libqb's "linker magic" expects when the linker detects that the compiled program doesn't actually use that section, which seems completely reasonable for the linker to do. This seems to be a case of libqb trying to be "clever" by relying on an implementation detail of the linker, which is bad.
A slightly adjusted test program, to actually use the libqb log functions, does not reproduce the problem, e.g.:
#include <qb/qblog.h>
QB_LOG_INIT_DATA(test);
int main(int argc, char* argv[])
{
    qb_log_init("test", LOG_USER, LOG_INFO);
    qb_log(LOG_ERR, "test\n");
    qb_log_fini();
    return 0;
}
Compiling that with or without -DQB_KILL_ATTRIBUTE_SECTION results in the message "test" being logged to syslog when run, so it appears safe to compile pacemaker (and any other programs using libqb that show this problem) with that define to work around this issue.