Bug #1890491 “A pacemaker node fails monitor (probe) and stop /s...” : Bugs : pacemaker package : Ubuntu

Jorge Niedbalski (niedbalski) on 2020-08-05

Changed in pacemaker (Ubuntu Groovy):
status:	New → Fix Released
Changed in pacemaker (Ubuntu Focal):
status:	New → Fix Released

Jorge Niedbalski (niedbalski) wrote on 2020-08-13:

#1

I am able to reproduce a similar issue with the following bundle: https://paste.ubuntu.com/p/VJ3m7nMN79/

Resource created with
sudo pcs resource create test2 ocf:pacemaker:Dummy op_sleep=10 op monitor interval=30s timeout=30s op start timeout=30s op stop timeout=30s

juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-10.cloud.sts"
juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-11.cloud.sts"
juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-12.cloud.sts"

Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-10.cloud.sts juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]

Full list of resources:

Resource Group: grp_nova_vips
res_nova_bf9661e_vip (ocf::heartbeat:IPaddr2): Started juju-acda3d-pacemaker-remote-7
Clone Set: cl_nova_haproxy [res_nova_haproxy]
Started: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
juju-acda3d-pacemaker-remote-10.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
juju-acda3d-pacemaker-remote-12.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
juju-acda3d-pacemaker-remote-11.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-7

test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-10.cloud.sts

## After running the following commands on juju-acda3d-pacemaker-remote-10.cloud.sts

1) sudo systemctl stop pacemaker_remote
2) forcefully shut down the node (openstack server stop xxxx) less than 10 seconds after the pacemaker_remote stop is issued.
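
The two steps above could be scripted roughly as follows. This is only a sketch under assumptions not stated in the report: the `run` and `reproduce` helpers, the `DRY_RUN` switch, reaching the remote over plain `ssh`, the `--no-block` flag (so the stop request returns immediately), and the 5-second default delay are all illustrative. The server name stays a placeholder, as in the report.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN and the ssh transport are assumptions.

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# $1: remote node running test2
# $2: OpenStack server name or ID (elided as "xxxx" in the report)
# $3: seconds to wait before the forced stop; must stay below 10 (default 5)
reproduce() {
    remote=$1
    server=$2
    delay=${3:-5}
    # Step 1: request the pacemaker_remote stop without waiting for completion.
    run ssh "$remote" sudo systemctl stop --no-block pacemaker_remote
    # Step 2: power the instance off before the 10 seconds have passed.
    run sleep "$delay"
    run openstack server stop "$server"
}
```

Calling `reproduce juju-acda3d-pacemaker-remote-10.cloud.sts xxxx` only prints the three commands; setting `DRY_RUN=0` would execute them.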

The remote is shut down:

RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]

The resource remains Stopped on all 3 machines and does not recover.

$ juju run --application nova-cloud-controller "sudo pcs resource show | grep -i test2"
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
UnitId: nova-cloud-controller/0
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
UnitId: nova-cloud-controller/1
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
UnitId: nova-cloud-controller/2
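
To check the result without reading each unit's output, the `juju run` output above can be reduced to a count. A minimal sketch: `count_stopped` is a hypothetical helper, and it assumes the output format shown above, with one `Stdout:` line per unit.

```shell
#!/bin/sh
# Hypothetical helper: count the units whose Stdout line reports the
# resource named in $1 as Stopped. Reads `juju run` output on stdin.
count_stopped() {
    grep -c "Stdout:.*$1.*Stopped"
}

# Usage, assuming the same application name as in the report:
#   juju run --application nova-cloud-controller "sudo pcs resource show" | count_stopped test2
```

A count equal to the number of units (3 here) matches the failure described; a count of 0 means no unit reports the resource as Stopped.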

However, if I do a clean shutdown (without interrupting the pacemaker_remote fence), the resource
is migrated correctly to another node.

6 nodes configured
9 resources configured

Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]
RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]

Full list of resources:

[...]
test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-12.cloud.sts

I will keep investigating this behavior and determine whether it is linked to the reported bug.

Jorge Niedbalski (niedbalski) wrote on 2020-08-14:

#2

Hello,

I am testing a couple of patches (both imported from master), through this PPA: https://launchpad.net/~niedbalski/+archive/ubuntu/fix-1890491

c20f8920 - don't order implied stops relative to a remote connection
938e99f2 - remote state is failed if node is shutting down with connection failure

I'll report back here if these patches fix the behavior described in my previous
comment.

Jorge Niedbalski (niedbalski) on 2020-08-18

Changed in pacemaker (Ubuntu Bionic):
status:	New → In Progress
assignee:	nobody → Jorge Niedbalski (niedbalski)

Rafael David Tinoco (rafaeldtinoco) wrote on 2020-08-20:

#3

Thanks for taking care of this Jorge. Let me know whenever you have a fix ready for this.

Cheers!

Jorge Niedbalski (niedbalski) wrote on 2020-09-08:

#4

lp1890491-bionic.debdiff (16.2 KiB, text/plain)

Mathew Hodson (mhodson) on 2020-10-03

Changed in pacemaker (Ubuntu Bionic):
importance:	Undecided → Medium
Changed in pacemaker (Ubuntu Focal):
importance:	Undecided → Medium
Changed in pacemaker (Ubuntu Groovy):
importance:	Undecided → Medium

Christian Ehrhardt (paelzer) wrote on 2021-04-07:

#5

I think this has fallen through the cracks because Rafael was unavailable.
@Lucas - is this something you could look into?
@Jorge - was this resolved some other way, given that things just went silent?

Lucas Kanashiro (lucaskanashiro) wrote on 2021-04-07:

#6

Sorry for taking too long to get to this bug. I have some comments about the proposed debdiff:

1- The version needs to be updated to 1.1.18-0ubuntu1.4. The .3 version was already released to bionic-updates.

2- The patches need some DEP-3 headers. I see you are just backporting the upstream patches but it would be good to also add some headers after the original commit message, such as Origin, Bug-Ubuntu, Reviewed-By.

3- The patches 0001-Fix-libpe_status-don-t-order-implied-stops-relative-.patch and 0002-Fix-scheduler-remote-state-is-failed-if-node-is-shut.patch are in the debdiff but they are not mentioned in debian/patches/series nor debian/changelog. Should they be removed? Or added to d/p/series and d/changelog?
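
For point 2, the headers could be generated along these lines. This is a sketch, not the headers of the actual upload: `dep3_header` is a hypothetical helper, and the `backport` origin type, the commit reference passed in, and the Launchpad URL form are assumptions to check against DEP-3.

```shell
#!/bin/sh
# Hypothetical helper: print the DEP-3 fields named above for one patch.
# $1: upstream commit reference (URL or hash)
# $2: Launchpad bug number
# $3: reviewer, optional; the field is omitted when empty
dep3_header() {
    origin=$1
    bug=$2
    reviewer=${3:-}
    printf 'Origin: backport, %s\n' "$origin"
    printf 'Bug-Ubuntu: https://bugs.launchpad.net/bugs/%s\n' "$bug"
    if [ -n "$reviewer" ]; then
        printf 'Reviewed-By: %s\n' "$reviewer"
    fi
}

# Example: header lines for the first backported commit.
dep3_header c20f8920 1890491
```

The output would go after the original commit message in each file under debian/patches.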

The proposed debdiff as-is built fine for me locally. We need to address the comments above to be able to upload this package. In parallel, we can update the bug description to add the SRU template (impact, test plan, where problems could occur). Are you willing to do that, @Jorge?

Thanks for the work you have done so far!

Jorge Niedbalski (niedbalski) wrote on 2021-04-09:

#7

Hello Lucas,

I'll reformat the patch accordingly and re-submit. Thanks.

Dan Bungert (dbungert) wrote on 2021-10-14:

#9

Hi @niedbalski,

Were you still interested in SRUing this?

Ubuntu
pacemaker package

A pacemaker node fails monitor (probe) and stop /start operations on a resource because it returns "rc=189

Affects             Status        Importance  Assigned to
pacemaker (Ubuntu)  Fix Released  Medium      Unassigned
  Bionic            In Progress   Medium      Jorge Niedbalski
  Focal             Fix Released  Medium      Unassigned
  Groovy            Fix Released  Medium      Unassigned
