Spurious Pacemaker errors: couldn't create file for mmap

Bug #1978955 reported by René Højbjerg Larsen
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
libqb (Ubuntu)   Fix Released   Undecided    Unassigned
Jammy            Fix Released   Undecided    Lena Voytek

Bug Description

[Impact]

While running Pacemaker, a system will occasionally run into the error:
error: couldn't create file for mmap
which can result in a crash.

This has been fixed upstream, in Kinetic, and in other distributions by retrying the posix_fallocate call up to five times when it is interrupted (EINTR).

Adding this fix will bring the same retry code to Jammy, significantly lowering the rate of crashes on a system running Pacemaker.
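
As a rough illustration of the bounded retry described above, the sketch below wraps posix_fallocate in a loop. It is not the upstream patch: the function name fallocate_retry and the constant MAX_FALLOCATE_ATTEMPTS are hypothetical, and only the cap of five attempts comes from this report. Note that posix_fallocate returns the error number directly rather than setting errno.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_fallocate under strict -std=c11 */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical name; the report only says "up to five times". */
#define MAX_FALLOCATE_ATTEMPTS 5

/*
 * Reserve len bytes for fd, retrying only when the call is interrupted.
 * Returns 0 on success or an error number (still EINTR if every attempt
 * was interrupted).
 */
int fallocate_retry(int fd, off_t len)
{
    int res = EINTR;

    assert(len > 0); /* posix_fallocate rejects a zero or negative length */

    for (int attempt = 0;
         attempt < MAX_FALLOCATE_ATTEMPTS && res == EINTR;
         attempt++) {
        res = posix_fallocate(fd, 0, len);
    }
    return res;
}
```

Any error other than EINTR ends the loop on the first attempt, so unrelated failures such as a bad descriptor are still reported immediately.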

[Test Plan]

Although the error is inconsistent, testing can be done by running Pacemaker over long periods of time.

Without the fix, the error is likely to show up multiple times a day. So to test the fix, use a system or cluster experiencing the error at around this rate, update the package with the fix, and run Pacemaker for at least a day, confirming there are no mmap errors/crashes.
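
One possible way to check a day's worth of logs for the error is a small counter like the sketch below. The function name count_mmap_errors is hypothetical, and it assumes the log is plain text with one message per line, as in the syslog excerpt in this report.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose getline under strict -std=c11 */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Count the lines in an already-open log stream that contain the error. */
size_t count_mmap_errors(FILE *log)
{
    static const char needle[] = "couldn't create file for mmap";
    char *line = NULL;   /* buffer grown by getline as needed */
    size_t cap = 0;
    size_t hits = 0;

    assert(log != NULL);

    while (getline(&line, &cap, log) != -1) {
        if (strstr(line, needle) != NULL) {
            hits++;
        }
    }
    free(line);
    return hits;
}
```

It could be called on a stream opened from the node's syslog file (the exact path depends on the logging setup); a result of zero over the test period would be consistent with the fix working.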

[Where problems could occur]

Since posix_fallocate is retried up to 5 times, this fix may cover up underlying errors due to race conditions, which could lead to ambiguous issues in the future.

[Other Info]

Upstream fix located here: https://github.com/ClusterLabs/libqb/pull/453

[Original Description]

We recently built a new cluster based on Pacemaker 2.1 / Corosync 3 on Ubuntu 22.04 LTS.

It mostly works fine, but we frequently (multiple times per day) experience spurious restarts or failures of a service on a single node.

Syslog on the affected node reports something like this leading up to the failure:

Jun 16 14:35:06 pgdb5 pacemaker-based[925993]: error: couldn't allocate file /dev/shm/qb-925993-1649317-14-b5YWHF/qb-event-cib_rw-data: Interrupted system call (4)
Jun 16 14:35:06 pgdb5 pacemaker-based[925993]: error: couldn't create file for mmap
Jun 16 14:35:06 pgdb5 pacemaker-based[925993]: error: qb_rb_open:/dev/shm/qb-925993-1649317-14-b5YWHF/qb-event-cib_rw: Interrupted system call (4)
Jun 16 14:35:06 pgdb5 pacemaker-based[925993]: error: shm connection FAILED: Interrupted system call (4)
Jun 16 14:35:06 pgdb5 pacemaker-based[925993]: error: Error in connection setup (/dev/shm/qb-925993-1649317-14-b5YWHF/qb): Interrupted system call (4)

Our symptoms are very similar to this SUSE bug, which was fixed upstream recently: https://www.suse.com/support/kb/doc/?id=000020566

Lena Voytek (lvoytek) wrote :

Hello René, thank you for submitting this report. Based on the information you provided, it seems the upstream fix you pointed to would resolve the issue. We've added the fix to our upcoming release, but it is not currently in 22.04. As such, I created a PPA with the fix for Jammy, located at:

https://launchpad.net/~lvoytek/+archive/ubuntu/libqb-retry-posix-fallocate-jammy

The code for the fix is here:

https://git.launchpad.net/~lvoytek/ubuntu/+source/libqb/commit/?id=cc8029c9ad0865bed61df1741ad8ad156fc7afac

If you would like to try it on your system to make sure it works you can run the following commands:

sudo add-apt-repository ppa:lvoytek/libqb-retry-posix-fallocate-jammy
sudo apt update
sudo apt upgrade

René Højbjerg Larsen (rhljfm) wrote :

Thank you,

We've been using a local build of libqb with the above patch since last week, and it does indeed resolve the problem.

Lena Voytek (lvoytek) wrote :

That's great to hear! I will work to get this fix into 22.04 then so that the ppa will no longer be needed. Thanks!

Changed in libqb (Ubuntu):
status: New → Fix Released
Changed in libqb (Ubuntu Jammy):
status: New → In Progress
assignee: nobody → Lena Voytek (lvoytek)
Lena Voytek (lvoytek)
description: updated
Lena Voytek (lvoytek)
tags: added: server-todo
Robie Basak (racb) wrote :

It seems odd to me to retry only up to five times. The usual way to handle EINTR is to retry indefinitely, and if the process were to keep receiving signals such that it couldn't make any progress at all, then that would be expected behaviour here, and a bug in whatever is keeping up such a signal rate. This is what the (GNU-specific) TEMP_FAILURE_RETRY does, for example: https://www.gnu.org/software/libc/manual/html_node/Interrupted-Primitives.html

So I think the upstream fix could do with improving here.

But in the meantime I think it's fine to grab the upstream fix as-is, as it's unlikely to make a difference in this case.
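
For comparison, an unbounded retry of the kind described in this comment might look like the sketch below. The name fallocate_retry_unbounded is hypothetical and this is not code from libqb. TEMP_FAILURE_RETRY itself tests for a -1 return with errno set to EINTR, whereas posix_fallocate returns the error number directly, so the loop is written by hand here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_fallocate under strict -std=c11 */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Reserve len bytes for fd, retrying for as long as the call keeps being
 * interrupted by a signal. Returns 0 on success or a non-EINTR error number.
 */
int fallocate_retry_unbounded(int fd, off_t len)
{
    int res;

    assert(len > 0); /* posix_fallocate rejects a zero or negative length */

    do {
        res = posix_fallocate(fd, 0, len);
    } while (res == EINTR);

    return res;
}
```

With this shape the loop only fails to terminate if signals keep arriving faster than the call can complete, which is the situation the comment argues should be treated as a bug elsewhere.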

I also noticed that the commit that actually landed upstream (https://github.com/ClusterLabs/libqb/commit/176eae8f13278a5a3dab3699b84e1dc9a8d4ae11) slightly mismatches the patch uploaded here (https://github.com/ClusterLabs/libqb/commit/2e259c9d6e13968665678745f1e774bc4ccf8806). But they look functionally equivalent in this case, so I think it's probably not worth fixing now.

There's a minor change in the EINTR failure case here. Before, the code would unwind with an error. Now, it continues regardless, even when the retry count is exceeded. But I think that's also fine in this case because posix_fallocate is advisory in nature. There's a minor difference in that a disk full error might then happen later, but that seems unlikely to result in a regression to me.

Changed in libqb (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Robie Basak (racb) wrote : Please test proposed package

Hello René, or anyone else affected,

Accepted libqb into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libqb/2.0.4-1ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

René Højbjerg Larsen (rhljfm) wrote :

For the record, the package in jammy-proposed resolves the issue for us.

Lena Voytek (lvoytek) wrote :

Thanks for the update, René. I'll mark the verification as complete so this can get into the release pocket.

tags: added: verification-done-jammy
removed: verification-needed-jammy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libqb - 2.0.4-1ubuntu0.1

---------------
libqb (2.0.4-1ubuntu0.1) jammy; urgency=medium

  * d/p/retry-if-posix-fallocate-interrupted-eintr.patch: Retry posix_fallocate
    when the error - couldn't create file for mmap - occurs (LP: #1978955)

 -- Lena Voytek <email address hidden> Wed, 22 Jun 2022 11:52:46 -0700

Changed in libqb (Ubuntu Jammy):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for libqb has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
