Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment

Bug #1783203 reported by Charles Dunbar
This bug affects 7 people
Affects                          Status     Importance  Assigned to  Milestone
OpenStack RabbitMQ Server Charm  Triaged    Undecided   Unassigned
erlang (Ubuntu)                  Triaged    Undecided   Unassigned
rabbitmq-server (Ubuntu)         Confirmed  Undecided   Unassigned

Bug Description

While performing an OpenStack release upgrade from Pike to Queens following the charmers' guide, we upgraded Ceph-* and MySQL. After setting source=cloud:xenial-queens on the RabbitMQ-Server charm and waiting for the cluster to re-stabilize, the rabbitmq beam processes lock up on one cluster node, causing a complete denial of service on the openstack vhost across all 3 members of the cluster. Killing the beam process on that node causes another node to lock up within a short timeframe.
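
For reference, the charm upgrade step described above amounts to roughly the following (a sketch, assuming the application is named rabbitmq-server in the Juju model):

$ juju config rabbitmq-server source=cloud:xenial-queens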

We have reproduced this twice in the same environment by re-deploying a fresh pike rabbitmq cluster and upgrading to queens. The issue is not reproducible with generic workloads such as creating/deleting nova instances and creating/attaching/detaching cinder volumes; however, when running a full heat stack, we can reproduce it.

This is happening on two of the three clouds on this site when RMQ is upgraded to Queens. The third cloud was able to upgrade to Queens without issue but was upgraded on 18.02 charms. Heat templates forthcoming.

Revision history for this message
Xav Paice (xavpaice) wrote :

Attaching web.yaml, a heat template known to trigger this event.

tags: added: canonical-bootstack
Revision history for this message
James Page (james-page) wrote :

Diff between 18.02 and 18.05 rmq charm:

http://paste.ubuntu.com/p/K2RT3QvGkQ/

Revision history for this message
James Page (james-page) wrote :

Logs from /var/log/rabbitmq might be useful here
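
For example, a quick way to collect them for attaching here (a sketch; adjust the destination path as needed):

$ sudo tar czf ~/rabbitmq-logs-$(hostname).tar.gz /var/log/rabbitmq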

summary: - Upgrade to Queens causing crashing
+ Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in rabbitmq-server (Ubuntu):
status: New → Confirmed
Revision history for this message
Charles Dunbar (ccdunbar) wrote :

Adding that setting the config option "management_plugin=true" causes the crashing with the web.yaml stack, while setting it to false does not.
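
As a workaround sketch (assuming the charm application is named rabbitmq-server), the plugin can be toggled off via the charm config:

$ juju config rabbitmq-server management_plugin=false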

Revision history for this message
James Page (james-page) wrote :

Now that is a useful nugget of information.

Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Felipe Reyes (freyes) wrote :

For people being hit by this bug, it would be helpful if you could generate a core file while rabbit is still stuck; this may give us some insight into where the process is looping.

$ sudo gcore -o lp1783203.core $PID
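
If it helps, a sketch for locating $PID first (assuming the Erlang VM shows up as beam.smp in the process list):

$ PID=$(pgrep -f beam.smp)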

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

We are also seeing this issue after upgrading OpenStack from Pike to Queens. It only seems to affect our larger setups; it wasn't seen during testing on our staging setup. The good thing is that I can confirm that disabling the management plugin seems to avoid the issue. We also have a core dump, but since it is from a production environment, I cannot share the content. However, the trace from thread 1 looks like:

#0 0x00007fb09cd565d3 in select () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000563c00 in erts_sys_main_thread ()
#2 0x0000000000469860 in erl_start ()
#3 0x000000000042f389 in main ()

and the other threads all seem to be at

#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00000000005c15ed in ethr_event_wait ()
#2 0x000000000051e0a5 in ?? ()
#3 0x00000000005c0da5 in ?? ()
#4 0x00007fb09d2326ba in start_thread (arg=0x7fb098ce6700) at pthread_create.c:333
#5 0x00007fb09cd6041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Binary is /usr/lib/erlang/erts-7.3/bin/beam.smp from erlang-base 1:18.3-dfsg-1ubuntu3.1.
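
For anyone repeating this inspection, per-thread backtraces like the above can be pulled from such a core with gdb along these lines (file names follow the examples in this report; bt shows the current thread, thread apply all bt shows every thread, and debug symbols improve the output):

$ gdb /usr/lib/erlang/erts-7.3/bin/beam.smp lp1783203.core
(gdb) bt
(gdb) thread apply all bt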

Please let me know if you need further information, you can also reach me as "frickler" in #ubuntu-server.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Forgot to mention: the log in /<email address hidden> just stops when the server hangs; there is no output until after it is restarted.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The latest release in the 3.6.x series seems to be 3.6.16. Almost all releases since 3.6.10 list changes to the management plugin, but I didn't spot anything obvious. Here are the release notes for each:

https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_16
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_15
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_14
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_13
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_12
https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_11

Maybe this in 3.6.12? https://github.com/rabbitmq/rabbitmq-server/issues/1346

Can anybody more familiar with the problem take a quick peek at the release notes to see if something about this bug jumps out?

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

The commit message for https://github.com/rabbitmq/rabbitmq-server/pull/1431, "Avoid infinite loop when dropping entries in the GM" (in 3.6.15), sounds interesting, as we do seem to see some kind of infinite loop here. I don't have an explanation, though, for why GM would be interacting with the management plugin in the way we see.

Revision history for this message
Michael Klishin (michaelklishin) wrote :

There isn't a whole lot of detail about node state here, but GM (a multicast module) has nothing to do with the management plugin.

According to the above comments, this is on Erlang 18.3, which has known bugs that stop any activity on a node that has accepted any TCP connections (including HTTP requests) [1][2]. They were reported by team RabbitMQ in mid-2017 and fixed shortly after. Erlang 19.3.6.4 is the minimum supported version for RabbitMQ 3.6.16 primarily because of those issues.

Somewhat related: RabbitMQ 3.6.x is out of support [3][4] and since January 2018 [4], Erlang 19.3.6.4 is the minimum supported version even for 3.6.x.

1. https://bugs.erlang.org/browse/ERL-430
2. https://bugs.erlang.org/browse/ERL-448
3. https://www.rabbitmq.com/which-erlang.html#old-timers
4. http://www.rabbitmq.com/changelog.html
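
A quick way to confirm which Erlang/OTP release a node is actually running (a sketch; either the package metadata or the runtime's own report will do):

$ dpkg -s erlang-base | grep '^Version'
$ erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'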

Revision history for this message
Michael Klishin (michaelklishin) wrote :

I mostly failed to explain the impact of the above Erlang bugs.

Due to ERL-430 and ERL-448 above, nodes that had any TCP connections open (possibly from HTTP requests) could fail to stop (shut down), fail to accept new TCP connections (including from HTTP clients and RabbitMQ CLI tools), and fail to respond on already open TCP connections.

All of those behaviors are particularly problematic during upgrades. In fact, the issue was discovered as part of an upgrade of the RabbitMQ BOSH release [1].

1. https://github.com/pivotal-cf/cf-rabbitmq-release
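
Consistent with that description, a rough way to spot a node in this state (a sketch using the coreutils timeout wrapper; a healthy node answers within a few seconds, an affected node just hangs):

$ sudo timeout 30 rabbitmqctl status || echo "no answer from the node within 30s"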

Revision history for this message
James Page (james-page) wrote :

Raising an erlang bug task; I think we need to address the two ERL-* issues in the xenial erlang version, as identified in comment #13.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Triaged the charm task until we can validate the fixes against the erlang/rabbit bits.

Changed in charm-rabbitmq-server:
status: New → Triaged
Changed in erlang (Ubuntu):
status: New → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

We should revisit if this is still an issue with the Focal version of RabbitMQ.
