Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment

Bug #1783203 reported by Charles Dunbar
This bug affects 7 people
Affects                          Status     Importance  Assigned to  Milestone
OpenStack RabbitMQ Server Charm  Triaged    Undecided   Unassigned
erlang (Ubuntu)                  Triaged    Undecided   Unassigned
rabbitmq-server (Ubuntu)         Confirmed  Undecided   Unassigned

Bug Description

While performing an OpenStack release upgrade from Pike to Queens following the charmers' guide, we upgraded Ceph-* and MySQL. After setting source=cloud:xenial-queens on the RabbitMQ-Server charm and waiting for the cluster to re-stabilize, the rabbitmq beam processes lock up on one cluster node, causing a complete denial of service on the openstack vhost across all 3 members of the cluster. Killing the beam process on that node causes another node to lock up within a short timeframe.
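
For reference, the charm upgrade step described above amounts to roughly the following (a sketch, assuming the application is named rabbitmq-server in the Juju model):

$ juju config rabbitmq-server source=cloud:xenial-queens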

We have reproduced this twice in the same environment by re-deploying a fresh pike rabbitmq cluster and upgrading to queens. The issue is not reproducible with generic workloads such as creating/deleting nova instances and creating/attaching/detaching cinder volumes; however, when running a full heat stack, we can reproduce it.

This is happening on two of the three clouds on this site when RMQ is upgraded to Queens. The third cloud was able to upgrade to Queens without issue but was upgraded on 18.02 charms. Heat templates forthcoming.

Revision history for this message
Xav Paice (xavpaice) wrote :

Attaching web.yaml, a heat template known to trigger this event.

tags: added: canonical-bootstack
Revision history for this message
James Page (james-page) wrote :

Diff between 18.02 and 18.05 rmq charm:

http://paste.ubuntu.com/p/K2RT3QvGkQ/

Revision history for this message
James Page (james-page) wrote :

Logs from /var/log/rabbitmq might be useful here
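
For example, a quick way to collect them for attaching here (a sketch; adjust the destination path as needed):

$ sudo tar czf ~/rabbitmq-logs-$(hostname).tar.gz /var/log/rabbitmq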

summary: - Upgrade to Queens causing crashing
+ Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in rabbitmq-server (Ubuntu):
status: New → Confirmed
Revision history for this message
Charles Dunbar (ccdunbar) wrote :

Adding that setting the config option "management_plugin=true" causes the crashing with the web.yaml stack, while setting it to false does not.
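
As a workaround sketch (assuming the charm application is named rabbitmq-server), the plugin can be toggled off via the charm config:

$ juju config rabbitmq-server management_plugin=false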

Revision history for this message
James Page (james-page) wrote :

Now that is a useful nugget of information.

Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Felipe Reyes (freyes) wrote :

For people being hit by this bug, it would be helpful if you could generate a core file while rabbit is still stuck; this may give us some insight into where the process is looping.

$ sudo gcore -o lp1783203.core $PID
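
If it helps, a sketch for locating $PID first (assuming the Erlang VM shows up as beam.smp in the process list):

$ PID=$(pgrep -f beam.smp)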

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

We are also seeing this issue after upgrading OpenStack from Pike to Queens. It only seems to affect our larger setups; it wasn't seen during testing on our staging setup. The good thing is that I can confirm that disabling the management plugin seems to avoid the issue. We also have a core dump, but since it is from a production environment, I cannot share the content. However, the trace from thread 1 looks like:

#0 0x00007fb09cd565d3 in select () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000563c00 in erts_sys_main_thread ()
#2 0x0000000000469860 in erl_start ()
#3 0x000000000042f389 in main ()

and the other threads all seem to be at

#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00000000005c15ed in ethr_event_wait ()
#2 0x000000000051e0a5 in ?? ()
#3 0x00000000005c0da5 in ?? ()
#4 0x00007fb09d2326ba in start_thread (arg=0x7fb098ce6700) at pthread_create.c:333
#5 0x00007fb09cd6041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Binary is /usr/lib/erlang/erts-7.3/bin/beam.smp from erlang-base 1:18.3-dfsg-1ubuntu3.1.
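
For anyone repeating this inspection, per-thread backtraces like the above can be pulled from such a core with gdb along these lines (file names follow the examples in this report; bt shows the current thread, thread apply all bt shows every thread, and debug symbols improve the output):

$ gdb /usr/lib/erlang/erts-7.3/bin/beam.smp lp1783203.core
(gdb) bt
(gdb) thread apply all bt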

Please let me know if you need further information, you can also reach me as "frickler" in #ubuntu-server.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Forgot to mention: the log in /<email address hidden> just stops when the server hangs; there is no output until after it is restarted.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The latest release in the 3.6.x series seems to be 3.6.16. Almost all releases since 3.6.10 list changes to the management plugin, but I didn't spot anything obvious. Here are the release notes for each:

https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_16
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_15
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_14
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_13
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_12
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_11

Maybe this in 3.6.12? https://github.com/rabbitmq/rabbitmq-server/issues/1346

Can anybody more familiar with the problem take a quick peek at the release notes to see if something about this bug jumps out?

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

The commit message for https://github.com/rabbitmq/rabbitmq-server/pull/1431, "Avoid infinite loop when dropping entries in the GM" (in 3.6.15), sounds interesting, as we do seem to see some kind of infinite loop here. I don't have an explanation, though, for why GM would be interacting with the management plugin in the way we see.

Revision history for this message
Michael Klishin (michaelklishin) wrote :

There isn't a whole lot of detail about node state here, but GM (a multicast module) has nothing to do with the management plugin.

According to the above comments, this is on Erlang 18.3, which has known bugs that stop any activity on a node that has accepted any TCP connections (including HTTP requests) [1][2]. They were reported by team RabbitMQ in mid-2017 and fixed shortly after. Erlang 19.3.6.4 is the minimum supported version for RabbitMQ 3.6.16 primarily because of those issues.

Somewhat related: RabbitMQ 3.6.x is out of support [3][4] and since January 2018 [4], Erlang 19.3.6.4 is the minimum supported version even for 3.6.x.

1. https://bugs.erlang.org/browse/ERL-430
2. https://bugs.erlang.org/browse/ERL-448
3. https://www.rabbitmq.com/which-erlang.html#old-timers
4. http://www.rabbitmq.com/changelog.html
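
A quick way to confirm which Erlang/OTP release a node is actually running (a sketch; either the package metadata or the runtime's own report will do):

$ dpkg -s erlang-base | grep '^Version'
$ erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'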

Revision history for this message
Michael Klishin (michaelklishin) wrote :

I mostly failed to explain the impact of the above Erlang bugs.

Due to ERL-430 and ERL-448 above, nodes that had any TCP connections open (possibly from HTTP requests) could fail to stop (shut down), fail to accept new TCP connections (including from HTTP clients and RabbitMQ CLI tools), and fail to respond on already open TCP connections.

All of those behaviors are particularly problematic during upgrades. In fact, the issue was discovered as part of an upgrade of the RabbitMQ BOSH release [1].

1. https://github.com/pivotal-cf/cf-rabbitmq-release
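
Consistent with that description, a rough way to spot a node in this state (a sketch using the coreutils timeout wrapper; a healthy node answers within a few seconds, an affected node just hangs):

$ sudo timeout 30 rabbitmqctl status || echo "no answer from the node within 30s"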

Revision history for this message
James Page (james-page) wrote :

Raising an erlang bug task; I think we need to address the two ERL-* issues in the xenial erlang version, as identified in comment #13.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Triaged the charm task until we can validate the fixes against the erlang/rabbit bits.

Changed in charm-rabbitmq-server:
status: New → Triaged
Changed in erlang (Ubuntu):
status: New → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

We should revisit if this is still an issue with the Focal version of RabbitMQ.
