Ubuntu
multipath-tools package

System Lockup during failover

Bug #1944005 reported by jarred wilson on 2021-09-17

This bug report is a duplicate of: Bug #1944586: kernel bug found when disconnecting one fiber channel interface on Cisco Chassis with fnic DRV_VERSION "1.6.0.47".

This bug affects 2 people

Affects                   Status      Importance  Assigned to    Milestone
linux (Ubuntu)            Incomplete  Undecided   Unassigned
multipath-tools (Ubuntu)  Confirmed   Undecided   Athos Ribeiro

Bug Description

When testing failover for configured multipath devices, I am seeing some nodes lock up and become unresponsive. These nodes have frozen consoles but still respond to ping, so part of the network stack is functional, but SSH connections to them fail.

1) Release: 20.04 focal
2) Package version: 0.8.3-1ubuntu2
3) Expected: multipath failover succeeds across all nodes and the nodes remain responsive
4) Actual: some of the nodes lock up when attempting to fail over and cannot be reached via ssh or console

Here is the ubuntu-bug output: https://pastebin.canonical.com/p/6MyWBggtS2/

Here is the kern.log: https://pastebin.canonical.com/p/rG42gzXG9X/

Here is the syslog: https://pastebin.canonical.com/p/C99ZprphZn/

Lucas Kanashiro (lucaskanashiro) wrote on 2021-09-20:

Thank you for taking the time to file a bug report.

I went through the logs but I was not able to spot anything that could help us understand the cause of your issue. Have you applied any changes to the default multipath config file? Could you describe your scenario in more detail? I'd like to gather all the information available before spending time trying to find a reproducer in a virtual environment, which involves setting up a bunch of things.

I am setting the Status of the bug to Incomplete. Once you provide more information, please set it back to New and our team will take a look at it again.

Changed in multipath-tools (Ubuntu):
status:	New → Incomplete

Camille Rodriguez (camille.rodriguez) on 2021-09-21

tags:	added: cpe-onsite

Camille Rodriguez (camille.rodriguez) wrote on 2021-09-21 (last edit on 2021-09-21):

Hi Lucas,

FYI, this is affecting a customer deployment and I am working with Jarred on this issue. We have tried a few variations of the multipath configuration. The current config is the following:

defaults {
  user_friendly_names yes
  find_multipaths yes
  polling_interval 10
}
devices {
  device {
    vendor "PURE"
    product "FlashArray"
    path_selector "queue-length 0"
    path_grouping_policy "group_by_prio"
    rr_min_io 1
    path_checker tur
    fast_io_fail_tmo 1
    dev_loss_tmo infinity
    no_path_retry 5
    failback immediate
    prio alua
    hardware_handler "1 alua"
    max_sectors_kb 4096
  }
}
multipaths {
  multipath {
    wwid "<WWID>"
    alias data1
  }
}

This configuration is used successfully on other systems (Ubuntu 18.04, Debian, CentOS) in the customer environment and is the recommended configuration from their storage team.
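For reference, the timing settings in this configuration imply a rough bound on how long I/O stays queued once every path to the device is down. The sketch below is illustrative only. It assumes multipathd counts no_path_retry in path-checker intervals of polling_interval seconds, and the helper name queue_window_seconds is hypothetical.

```python
# Rough timing implied by the multipath.conf values quoted in this comment.
# Assumption: no_path_retry is counted in checker intervals, so queueing with
# no usable path lasts roughly no_path_retry * polling_interval seconds.

def queue_window_seconds(polling_interval: int, no_path_retry: int) -> int:
    """Approximate time I/O stays queued after the last path is lost."""
    return polling_interval * no_path_retry


# Values copied from the configuration above.
config = {"polling_interval": 10, "no_path_retry": 5, "fast_io_fail_tmo": 1}

window = queue_window_seconds(config["polling_interval"], config["no_path_retry"])

print(f"fast_io_fail_tmo: {config['fast_io_fail_tmo']} s before I/O on a lost FC remote port is failed")
print(f"queueing window with all paths down: about {window} s")
```

Under that assumption, a node that loses every path should start seeing I/O errors after about 50 seconds rather than queueing indefinitely, so a hang that never clears would point away from these settings.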

For context, the Ubuntu nodes that are affected are part of an OpenStack deployment. The multipath configuration is in place to set up a Fibre Channel connection for the device used as Ceph-OSD. The failover was tested in 2 different ways, both of which led to nodes becoming unresponsive. First, we turned off one of the F5 switches to simulate a power failure. Second, we reset the IO module in UCS Manager for the chassis where the nodes are located.

As you can see, there's very little information in the logs about what is actually causing the node to lock up. Do you have guidance on how to gather more detailed logs? Jarred and I can provide live access to the environment if needed.
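One low-tech way to capture state leading up to a hang, offered here as a sketch and not as guidance from this thread, is to record the Fibre Channel port state once per second and force each line to disk so the last entries survive a lockup. It assumes the FC hosts expose port_state under /sys/class/fc_host and that the log file lives on a local disk, not on the multipath device. The function names and the log path are hypothetical.

```python
import glob
import os
import time


def read_fc_port_states(sysfs_root="/sys/class/fc_host"):
    """Return {host_name: port_state} for every FC host under sysfs_root."""
    states = {}
    for path in sorted(glob.glob(os.path.join(sysfs_root, "host*", "port_state"))):
        host = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                states[host] = f.read().strip()
        except OSError as exc:
            states[host] = f"unreadable ({exc.strerror})"
    return states


def log_states(log_path, sysfs_root="/sys/class/fc_host"):
    """Append one timestamped line of port states and fsync it to disk."""
    states = read_fc_port_states(sysfs_root)
    line = time.strftime("%Y-%m-%dT%H:%M:%S") + " " + repr(states) + "\n"
    with open(log_path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # make sure the line is on disk before a hang
    return line


def run(log_path, interval=1.0, iterations=None):
    """Log port states every `interval` seconds; loop forever if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        log_states(log_path)
        time.sleep(interval)
        count += 1


# Example (hypothetical log path): run("/var/tmp/fc_port_state.log")
```

Started before a failover test, the final lines of the file show what the host last saw on each port before it stopped responding.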

Camille Rodriguez (camille.rodriguez) wrote on 2021-09-21 (last edit on 2021-09-21):

Subscribed field-high as this issue is blocking the environment from going live.

Camille Rodriguez (camille.rodriguez) on 2021-09-21

Changed in multipath-tools (Ubuntu):
status:	Incomplete → New

Launchpad Janitor (janitor) wrote on 2021-09-21:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in multipath-tools (Ubuntu):
status:	New → Confirmed

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2021-09-21: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1944005

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Christian Ehrhardt (paelzer) on 2021-09-22

tags:	added: server-next
Changed in multipath-tools (Ubuntu):
assignee:	nobody → Athos Ribeiro (athos-ribeiro)

Steven Parker (sbparke) wrote on 2021-12-13:

Duplicate of a bug which is now fixed in stable:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1944586/comments/8
