Installer crash in s390x environments w/o OSA network adapters

Bug #1987236 reported by Frank Heimes
This bug affects 2 people
Affects                  Status        Importance  Assigned to            Milestone
Ubuntu on IBM z Systems  Fix Released  High        Skipper Bug Screeners
subiquity                Fix Released  Undecided   Unassigned

Bug Description

The subiquity installer should support installing s390x environments that have no OSA (CCW) devices in place, but instead a Mellanox Connect-X (RoCE Express) PCIe-based card that acts like a regular Ethernet device.

But on s390x the installer seems to expect OSA devices (or maybe any CCW devices) as well; at least it always calls chzdev by default, which ends in a crash if no CCW / OSA devices are in place.

A fix is needed so that installations using only PCIe RoCE networking are properly supported.

(initial report)

In the Ubuntu installer there is a section that lists zdevs, and there are none.
With OSA network cards you would see qeth devices to include.
Further into the install, a network card is detected and the IP address we assign is shown.

Looking at the debug output, the errors occur around the chzdev command.
I am wondering if this is because there are no zdevs and we are hitting a bug.
I think the kernel is correctly finding the NIC as a Mellanox card via the mlx5 module (these IBM cards are indeed Mellanox hardware):

 [ 109.848474] mlx5_core 0001:00:00.0 ens1: Link up

Here's the full debug output <see attachment>

And the failing output around the kernel install and chzdev below:

 Running command ['chzdev', '--quiet', '--active', '--online', '--export', '-'] with allowed return codes [0] (capture=True)
 finish: cmd-install/stage-curthooks/builtin/cmd-curthooks/installing-kernel: FAIL: installing kernel
 finish: cmd-install/stage-curthooks/builtin/cmd-curthooks: FAIL: curtin command curthooks
 Traceback (most recent call last):
   File "/snap/subiquity/3699/lib/python3.8/site-packages/curtin/commands/main.py", line 202, in main
     ret = args.func(args)
   File "/snap/subiquity/3699/lib/python3.8/site-packages/curtin/commands/curthooks.py", line 1886, in curthooks
     builtin_curthooks(cfg, target, state)
   File "/snap/subiquity/3699/lib/python3.8/site-packages/curtin/commands/curthooks.py", line 1731, in builtin_curthooks
     chzdev_persist_active_online(cfg, target)
   File "/snap/subiquity/3699/lib/python3.8/site-packages/curtin/commands/curthooks.py", line 221, in chzdev_persist_active_online
     (chzdev_conf, _) = chzdev_export(active=True, online=True)
   File "/snap/subiquity/3699/lib/python3.8/site-packages/curtin/commands/curthooks.py", line 243, in chzdev_export
     return util.subp(cmd, capture=True)
   File "/snap/subiquity/3699/lib/python3.8/site-packages/curtin/util.py", line 275, in subp
     return _subp(*args, **kwargs)
   File "/snap/subiquity/3699/lib/python3.8/site-packages/curtin/util.py", line 139, in _subp
     raise ProcessExecutionError(stdout=out, stderr=err,
 curtin.util.ProcessExecutionError: Unexpected error while running command.
 Command: ['chzdev', '--quiet', '--active', '--online', '--export', '-']
 Exit code: 8
 Reason: -
 Stdout: ''
 Stderr: chzdev: No settings found to export

Background info:

Initially, 'RoCE Express' (Mellanox Connect-X based) non-OSA network adapters on s390x were intended to improve network connectivity between Linux on s390x and z/OS, and an OSA interface was needed in addition to RoCE for the initial handshake between those two systems.
But with the increasing popularity of LinuxONE systems (and a trend away from CCW devices, like OSA, towards PCIe devices, like RoCE Express and also NVMe), there are more and more cases where installations are needed in s390x environments that have no (CCW-based) OSA network adapters at all, but RoCE (Mellanox) adapters in Ethernet mode instead.

Frank Heimes (fheimes) wrote :

Logs from an attempt to install 22.04 on a (PCIe-based) Mellanox network only LPAR system (with NVMe disk storage only).
(So virtually no CCW-devices at all in that system).

information type: Public → Private
tags: added: rls-kk-incoming
Changed in ubuntu-z-systems:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
importance: Undecided → High
Frank Heimes (fheimes) wrote :

Further investigation showed that it is sufficient to have just "some" kind of CCW device to get past this problem.
For example, there are so-called "hipersocket" devices, which can be seen as virtual CCW OSA devices. If such a hipersocket device is created (which, by the way, can be quite cumbersome, because it may require changing the underlying ioconfig of the overall system, and hence cannot be seen as a simple workaround), one is able to bypass this issue.

With this in mind, I think a fix should be relatively straightforward: it obviously shows that the case where no CCW devices are in place, and lszdev/chzdev fail today, should be caught, and that the installation can well succeed if other network devices or interfaces are in place (like 'ens*', which is the interface representation of the RoCE/Mellanox devices).

System details about OSA (device) / qeth (driver/interface) are here:
/sys/bus/ccwgroup/drivers/qeth/
whereas details about RoCE Express / Mellanox (Ethernet mode) are here:
/sys/bus/pci/drivers/mlx?_core/

Hope this makes sense ...
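
To illustrate the two sysfs locations named above, here is a minimal Python sketch that lists which devices are bound to the qeth driver and to any mlx?_core driver. It is not taken from curtin or subiquity. The function names, the `sysfs_root` parameter, and the assumption that bound devices appear as symlinks in the driver directory (with the `module` symlink skipped) are illustrative only.

```python
import glob
import os


def bound_devices(driver_glob, sysfs_root="/sys"):
    """Return names of devices bound to the driver(s) matching driver_glob.

    Assumption: in a sysfs driver directory, bound devices show up as
    symlinks, while control files (bind, unbind, ...) are regular files.
    The 'module' symlink points at the kernel module, not a device.
    """
    devices = []
    pattern = os.path.join(sysfs_root, driver_glob, "*")
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)
        if os.path.islink(path) and name != "module":
            devices.append(name)
    return devices


def network_inventory(sysfs_root="/sys"):
    """Summarise OSA/qeth (CCW) and RoCE/Mellanox (PCI) devices."""
    return {
        "qeth": bound_devices("bus/ccwgroup/drivers/qeth", sysfs_root),
        "mlx": bound_devices("bus/pci/drivers/mlx?_core", sysfs_root),
    }


if __name__ == "__main__":
    inventory = network_inventory()
    if not inventory["qeth"] and inventory["mlx"]:
        print("No qeth devices, PCI Mellanox only:", inventory["mlx"])
    else:
        print(inventory)
```

On a system like the one in this report, one would expect the "qeth" list to be empty and the "mlx" list to contain the PCI address of the RoCE card.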

Dan Bungert (dbungert) wrote :

https://github.com/canonical/curtin/blob/master/curtin/commands/curthooks.py#L213

This seems to be the relevant code. As I don't know the first thing about s390x networking, I'm reluctant to make changes.

Are you saying that chzdev_persist_active_online() should be an optional step, where the install attempts to proceed even if it fails?

I can prepare a snap with that change if you help test.

Frank Heimes (fheimes)
information type: Private → Public
Frank Heimes (fheimes) wrote :

Hi Dan, yes, that's what I think.
On the vast majority of s390x systems there are so-called "ccw" (zSystems-specific) devices, and we definitely need to find them and offer to configure them.
But on some installations that might not be the case, and this should not break or stop the installation, since it can be a system without any ccw devices, with just zPCI-based devices (NVMe for disk and Mellanox for networking).
So I think the case where chzdev fails because no ccw devices are there (btw. maybe 'lszdev' is also used somewhere in the code) just needs to be properly caught, and the installation should proceed.

The workaround /kind of/ proves that already (creating a virtual, so-called hipersocket 'ccw' networking device that is not used at all and just sits there to make chzdev happy).
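
A minimal Python sketch of the "catch it and proceed" idea described above. This is not the change that landed in curtin. It assumes that exit code 8 is how chzdev signals "No settings found to export", which is only what the log in this report shows. The names `export_zdev_config` and `CHZDEV_NO_SETTINGS` and the `cmd` parameter are hypothetical.

```python
import subprocess

# Assumption: exit code 8 is what chzdev returned in the log above
# together with "No settings found to export".
CHZDEV_NO_SETTINGS = 8

CHZDEV_EXPORT_CMD = (
    "chzdev", "--quiet", "--active", "--online", "--export", "-",
)


def export_zdev_config(cmd=CHZDEV_EXPORT_CMD):
    """Return the exported zdev configuration, or None if there is none.

    Any exit code other than 0 or CHZDEV_NO_SETTINGS is still treated
    as a real error and raised to the caller.
    """
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout
    if result.returncode == CHZDEV_NO_SETTINGS:
        # No CCW devices to persist: nothing to write, keep installing.
        return None
    raise subprocess.CalledProcessError(
        result.returncode, list(cmd),
        output=result.stdout, stderr=result.stderr,
    )
```

A caller would then skip writing the zdev configuration into the target when the function returns None, and continue with the rest of the install.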

Dan Bungert (dbungert) wrote :

I have made a test snap available on channel edge/test-lp-1987236.
It contains a speculative fix from the linked merge proposal that I think should work, but am not able to test personally. Testing assistance would be appreciated.

Frank Heimes (fheimes) wrote :

I just wanted to test the manual update to 'edge/test-lp-1987236', so I
kicked off a 22.04.1 installation and hit F2 to enter the installer shell:
root@ubuntu-server:/# snap list
Name       Version        Rev    Tracking         Publisher    Notes
core20     20220719       1589   latest/stable    canonical**  base
lxd        5.0.0-b0287c1  22924  5.0/stable/…     canonical**  -
snapd      2.56.2         16295  latest/stable    canonical**  snapd
subiquity  22.07.2        3699   latest/stable/…  canonical**  classic
root@ubuntu-server:/#
But when I now try to refresh to 'edge/test-lp-1987236' I get:
root@ubuntu-server:/# snap refresh subiquity --channel=edge/test-lp-1987236
error: cannot refresh "subiquity": snap "subiquity" has running apps
       (subiquity), pids: 1348,1583,2251,2267,2281,2304
root@ubuntu-server:/#
I think I was always able to update subiquity like this in the past, no?

Dan Bungert (dbungert) wrote :

There's some new argument, --ignore-running I think?

Dan Bungert (dbungert) wrote :

snap refresh subiquity --channel=edge/test-lp-1987236 --ignore-running

try that please

Frank Heimes (fheimes) wrote :

Ha, that argument is new to me - thx for the hint - works like a charm!

Changed in subiquity:
status: New → In Progress
Changed in ubuntu-z-systems:
status: New → In Progress
Dan Bungert (dbungert) wrote :

Yes, the ignore-running bit is fairly new, part of the refresh awareness work there.

Michael Hudson-Doyle (mwhudson) wrote :

Tangent, but I think there is work ongoing to allow subiquity as a snap to opt out of the "running apps block refresh by default" behaviour.

Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit f174ed73 to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=f174ed73

Frank Heimes (fheimes)
Changed in subiquity:
status: In Progress → Fix Committed
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Dan Bungert (dbungert) wrote :

We believe a fix for this can be found in Subiquity 22.10.1. On
install you will be offered to update to the new version of the
installer if network is available, or you can perform a manual update
by running the following in a terminal:
sudo snap refresh subiquity

Changed in subiquity:
status: Fix Committed → Fix Released
Frank Heimes (fheimes) wrote :

Many thanks - also closing this on the 'ubuntu-z-systems' project level as Fix Released.

Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released