System Lockup during failover

Bug #1944005 reported by jarred wilson
This bug affects 2 people
Affects                   Status      Importance  Assigned to    Milestone
linux (Ubuntu)            Incomplete  Undecided   Unassigned
multipath-tools (Ubuntu)  Confirmed   Undecided   Athos Ribeiro

Bug Description

When testing failover for configured multipath devices, I am seeing some nodes lock up and become unresponsive. These nodes have frozen consoles but still respond to ping, so part of the network stack is functional, yet they cannot be reached over ssh.

1) Ubuntu release: 20.04 focal
2) Package version: 0.8.3-1ubuntu2
3) Expected: multipath failover succeeds on all nodes and the nodes remain responsive
4) Actual: some of the nodes lock up during failover and cannot be reached via ssh or console

Here is the ubuntu-bug output: https://pastebin.canonical.com/p/6MyWBggtS2/

Here is the kern.log: https://pastebin.canonical.com/p/rG42gzXG9X/

Here is the syslog: https://pastebin.canonical.com/p/C99ZprphZn/

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thank you for taking the time to file a bug report.

I went through the logs but was not able to spot anything that could help us understand the cause of your issue. Have you applied any changes to the default multipath config file? Could you describe your scenario in more detail? I'd like to gather all the information available before spending time trying to find a reproducer in a virtual environment, which involves setting up a bunch of things.

I am setting the Status of the bug to Incomplete. Once you provide more information, please set it back to New and our team will take a look at it again.

Changed in multipath-tools (Ubuntu):
status: New → Incomplete
tags: added: cpe-onsite
Camille Rodriguez (camille.rodriguez) wrote :

Hi Lucas,

FYI, this is affecting a customer deployment and I am working with Jarred on this issue. We have tried a few variations of the multipath configuration. The current config is the following:

defaults {
  user_friendly_names yes
  find_multipaths yes
  polling_interval 10
}
devices {
  device {
    vendor "PURE"
    product "FlashArray"
    path_selector "queue-length 0"
    path_grouping_policy "group_by_prio"
    rr_min_io 1
    path_checker tur
    fast_io_fail_tmo 1
    dev_loss_tmo infinity
    no_path_retry 5
    failback immediate
    prio alua
    hardware_handler "1 alua"
    max_sectors_kb 4096
  }
}
multipaths {
  multipath {
    wwid "<WWID>"
    alias data1
  }
}

This configuration is used successfully on other systems (Ubuntu 18.04, Debian, CentOS) in the customer environment and is the configuration recommended by their storage team.
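
As a rough aid for reading the timing settings in the configuration above, here is a minimal Python sketch of how long I/O would stay queued once every path to a device is down. It assumes that each `no_path_retry` attempt lasts about one `polling_interval`, which is only an approximation of multipathd's behaviour. The `settings` dictionary and the `queue_window_seconds` helper are hypothetical names introduced for illustration and are not part of multipath-tools.

```python
# Values copied from the configuration quoted above.
settings = {
    "polling_interval": 10,  # seconds between path checks (defaults section)
    "no_path_retry": 5,      # retries while no path is usable (device section)
}


def queue_window_seconds(polling_interval: int, no_path_retry: int) -> int:
    """Approximate time I/O stays queued after all paths are lost.

    Assumption: each retry lasts roughly one polling interval.
    """
    return polling_interval * no_path_retry


window = queue_window_seconds(
    settings["polling_interval"], settings["no_path_retry"]
)
print(f"approximate all-paths-down queue window: {window} s")
```

Under that assumption, the quoted settings give a window of about 50 seconds before queued I/O is failed back to the caller, so a failover that takes longer than this could surface as I/O errors on the multipath device.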

For context, the affected Ubuntu nodes are part of an OpenStack deployment. The multipath configuration is in place to set up a Fibre Channel connection for the device used as a Ceph OSD. Failover was tested in two different ways, both of which led to unresponsive nodes. First, we turned off one of the F5 switches to simulate a power failure. Second, we reset the IO module in UCS Manager for the chassis where the nodes are located.

As you can see, there's very little information in the logs about what is actually causing the nodes to lock up. Do you have guidance on how to gather more detailed logs? Jarred and I can provide live access to the environment if needed.

Camille Rodriguez (camille.rodriguez) wrote :

Subscribed field-high as this issue is blocking the environment from going live.

Changed in multipath-tools (Ubuntu):
status: Incomplete → New
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in multipath-tools (Ubuntu):
status: New → Confirmed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1944005

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: server-next
Changed in multipath-tools (Ubuntu):
assignee: nobody → Athos Ribeiro (athos-ribeiro)
Steven Parker (sbparke) wrote :