crash trying to install jammy final on a ppc64le

Bug #1969393 reported by Patricia Domingues
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| The Ubuntu-power-systems project | Fix Released | Critical | Unassigned | |
| subiquity | Fix Released | Undecided | Unassigned | |

Bug Description

Unable to install the jammy final image `20220418.1` on ppc64el POWER9 systems.
The installer crashes with the following message:

`An error occurred during installation`

"subiquity 22.04.1 3316 latest/stable"

```
2022-04-18 20:53:02,292 DEBUG root:39 finish: subiquity/Filesystem/guided_POST: SUCCESS: 200 {"status": "DONE", "error_report": null, "bootloader": "PREP", "orig_config":...
2022-04-18 20:53:02,292 INFO aiohttp.access:233 [18/Apr/2022:20:53:02 +0000] "POST /storage/guided HTTP/1.1" 200 17864 "-" "Python/3.8 aiohttp/3.6.2"
2022-04-18 20:53:02,379 ERROR subiquity.server.server:416 top level error
Traceback (most recent call last):
  File "/snap/subiquity/3316/usr/lib/python3/dist-packages/aiohttp/connector.py", line 829, in _resolve_host
    addrs = await \
  File "/snap/subiquity/3316/usr/lib/python3/dist-packages/aiohttp/resolver.py", line 29, in resolve
    infos = await self._loop.getaddrinfo(
  File "/snap/subiquity/3316/usr/lib/python3.8/asyncio/base_events.py", line 825, in getaddrinfo
    return await self.run_in_executor(
  File "/snap/subiquity/3316/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/snap/subiquity/3316/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
```

Tags: iso-testing
Patricia Domingues (patriciasd) wrote :

This has the same error as reported on https://bugs.launchpad.net/ubuntu-power-systems/+bug/1967324, but as it's a new image and subiquity version I've filed a new bug. Let me know if you need any info or tests.

Patricia Domingues (patriciasd) wrote :

log from POWER9 Boston (hostname baltar)

Patricia Domingues (patriciasd) wrote :

log from Power9 Boston (hostname tiselius)

Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/1969393

tags: added: iso-testing
Łukasz Zemczak (sil2100) wrote :

I poked Michael and Dan about this and it might be networking-related. Let's keep an eye out for this one.

Andrew Cloke (andrew-cloke) wrote :

Out of 4 installation attempts, every attempt has failed. And the same results have been seen across 2 different Power9 systems.
So, the issue can be reliably reproduced.

Patricia Domingues (patriciasd) wrote :

OK, thanks. The same happens with the 20220418.2 image:
┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│    Sorry, an unknown error occurred.                                   │
│                                                                        │
│                          [ View full report ]                          │
│                                                                        │
│    If you want to help improve the installer, you can send an error    │
│    report.                                                             │
│                                                                        │
│                         [ Send to Canonical ]                          │
│                                                                        │
│                            [ Close report ]                            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

                             [ View error report ]
                             [ Reboot Now ]

┌──────────────────────────────────────────────────────────────────────────┐
│subiquity/Early/apply_autoinstall_config                                  │
│subiquity/Reporting/apply_autoinstall_config                              │
│subiquity/Error/apply_autoinstall_config                                  │
│subiquity/Userdata/apply_autoinstall_config                               │
│subiquity/Package/apply_autoinstall_config                                │
│subiquity/Debconf/apply_autoinstall_config                                │
│subiquity/Kernel/apply_autoinstall_config                                 │
│subiquity/Zdev/apply_autoinstall_config                                   │
│subiquity/Late/apply_autoinstall_config                                   │
│                                                                          │
│                                                                          │
│                                                                          │
│                                                                          │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

                             [ View full log ]
                             [ View error report ]
                             [ Reboot Now ]

Patricia Domingues (patriciasd) wrote :

I kept retrying: after the installer crashes, selecting `Close report` and retrying lets me continue the installation, but it crashes again at `installing kernel`. Closing the report again lets it finish with `Install complete!`

patricia@tiselius:~$ cat /var/log/installer/media-info
Ubuntu-Server 22.04 LTS "Jammy Jellyfish" - Release ppc64el (20220418.2)

Alexandre Erwin Ittner (aittner) wrote :

Added a few crash logs from iso 20220418.2 for reference:

1650376052.513536215-2nd-attempt.unknown.crash
1650378076.097841978-3rd-attempt.unknown.crash
1650378566.696431160-4th-attempt.unknown.crash

Now testing 20220419.1.

Alexandre Erwin Ittner (aittner) wrote :

The same problem still happens with ISO 20220419.1. When installing from the serial console, it is not always possible to continue the install by simply ignoring the error: sometimes it triggers before the user has a chance to confirm the disk partitioning, so no installation process runs in the background. A crash report is attached in file 1650385824.373705626.unknown.crash

I'm also sending three asciinema screencasts to demonstrate the failure while capturing all steps, log messages, etc.:

- File install-error-iso-from-vlan.cast (also in https://asciinema.org/a/vys71zoYYccDGzcDHdIwXOc3X ) shows how the problem happens when booting from an ISO image downloaded from a machine in the same VLAN (mostly for speed reasons, I first download the ISO to an LPAR in another machine, in the same network, and serve it to the target system from there).

- File install-error-iso-from-inet.cast (also in https://asciinema.org/a/TAjjylBvz4KbC1S4eGpGgMvpO ) shows the same failure from the same image, but downloaded from the internet.

- File install-error-basic-mode-ssh.cast (also available in https://asciinema.org/a/nnq6ABfVFlCGx800OQEFTDhAM ) shows the same problem, but with the installer running in basic mode, and how to continue the install by reconnecting to the installer session over ssh and ignoring the error messages.

Olivier Gayot (ogayot) wrote :

Responding to:

https://bugs.launchpad.net/ubuntu-power-systems/+bug/1967324/comments/8

I would be surprised if geoip was the cause of the crash because the following line is present in all the .crash files I have looked at:

 2022-04-19 13:46:23,649 DEBUG subiquity.common.geoip:130 no CountryCode found in '<?xml version="1.0" encoding="UTF-8"?><Response><Ip>10.245.71.188</Ip><Status>IP NOT FOUND</Status></Response>\n'

This indicates that the geoip service was contacted successfully. Yes, it fails with IP NOT FOUND but it seems to be what happens when you contact the geoip service from within the Canonical network (i.e., traffic is not NAT-ed?).
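
For reference, here is a minimal sketch of how such a response can be checked for a country code, using only the Python standard library. The function name `country_code_from_response` and the `<CountryCode>` success payload are assumptions made for illustration; only the IP NOT FOUND body comes from the log line above, and this is not Subiquity's actual parsing code.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Body copied from the log line above.
NOT_FOUND_BODY = (
    '<?xml version="1.0" encoding="UTF-8"?><Response><Ip>10.245.71.188</Ip>'
    '<Status>IP NOT FOUND</Status></Response>\n'
)

# Hypothetical successful body; the element layout is an assumption.
ASSUMED_OK_BODY = (
    '<?xml version="1.0" encoding="UTF-8"?><Response><Ip>192.0.2.1</Ip>'
    '<Status>OK</Status><CountryCode>BR</CountryCode></Response>\n'
)


def country_code_from_response(body: str) -> Optional[str]:
    """Return the CountryCode text, or None if it is missing or unparseable."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None
    code = root.findtext("CountryCode")
    if code is None or not code.strip():
        return None
    return code.strip()


if __name__ == "__main__":
    print(country_code_from_response(NOT_FOUND_BODY))
    print(country_code_from_response(ASSUMED_OK_BODY))
```

A reply without a country code is a valid answer from the service, which is different from a name resolution failure: the request has to reach the server for this body to come back at all.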

Having said that, we switched from python requests to python aiohttp for querying the geoip service in Jammy Jellyfish (JJ), so it would be a suspicious coincidence if this error was not occurring before.

This thread [1] seems to suggest that sometimes, socket.gaierror exceptions can leave the aiohttp code without being encapsulated in aiohttp.ClientConnectorError but I have not been able to successfully reproduce it.

[1] https://github.com/aio-libs/aiohttp/issues/674

Changed in ubuntu-power-systems:
importance: Undecided → Critical
Olivier Gayot (ogayot) wrote :

I did end up crashing the installer by configuring a broken network with the same stack using an amd64 VM.

What's funny is that we have in the logs:
1. the socket.gaierror wrapped inside an aiohttp.ClientConnectorError -> Subiquity caught the exception and showed that the geoip service lookup failed: OK
2. another socket.gaierror that didn't get caught, crashing the installer

Steps that I used to reproduce:
* download the jammy-live-server-amd64.iso
* run:
    $ LIVEFS_EDITOR=dummy scripts/kvm-test.py -m 2G --install --overwrite jammy-live-server-amd64.iso
* next, next, next until the network configuration
* change ens3 from DHCPv4 to manual and input the following configuration (192.168.16.0 is a non-existent network):
    * Subnet: 192.168.16.0/24
    * Address: 192.168.16.1
    * Gateway: 192.168.16.254
    * DNS: 1.2.3.4,5.6.7.8
    * Search domains: <empty>
* Click on Save
* Wait for ~30 seconds (no need to move to the next screen)

Alexandre Erwin Ittner (aittner) wrote :

@ogayot

Just finished a few extra tests here: It passed after setting a proxy (in subiquity "Configure proxy" screen) that allowed it to access the internet. Removing this proxy while keeping the same values for all other options failed with the error I posted in comment #17.

So, this happens on Intel too? For this, I only tested with full Internet access for now... will redo a few tests.

Patricia Domingues (patriciasd) wrote :

IBM was able to install image `20220419.1` on a POWER9 baremetal (AC922) without any installer issues.

Dan Bungert (dbungert)
Changed in subiquity:
status: New → Fix Committed
Patricia Domingues (patriciasd) wrote :

As you have asked for, I've tested with subiquity from edge (3348):

`installed: 22.02.2+git232.6f5a0fdb (3348) 27MB classic`

I ran 6 tests on 2 different POWER9 servers and did not see the reported crash; the system booted to a prompt successfully every time.

Let me know if you need any other test.

Olivier Gayot (ogayot) wrote :

A patch moving from python3-requests to python3-aiohttp for querying geoip services was reverted. This seems to fix the issue.

Root cause analysis:
-------------------

We wrap the geoip lookup in a SingleInstanceTask. Therefore, upon starting a lookup, we cancel any unfinished lookup in progress.
When the lookup was done with python3-requests, cancelling the lookup would cancel the thread properly.
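
A minimal sketch of that cancel-and-restart behaviour with plain asyncio follows. The class `SingleInstance` and the `lookup` coroutine are hypothetical stand-ins written for this illustration; they are not Subiquity's SingleInstanceTask implementation.

```python
import asyncio


class SingleInstance:
    """Hypothetical stand-in: at most one run of `func` is in flight."""

    def __init__(self, func):
        self.func = func
        self.task = None

    async def start(self, *args):
        old = self.task
        if old is not None and not old.done():
            old.cancel()                 # cancel the unfinished run...
            await asyncio.wait([old])    # ...and wait until it has stopped
        self.task = asyncio.ensure_future(self.func(*args))
        return self.task


async def demo():
    events = []

    async def lookup(name):
        try:
            await asyncio.sleep(0.05)    # pretend to query the service
            events.append(f"{name} finished")
        except asyncio.CancelledError:
            events.append(f"{name} cancelled")
            raise

    runner = SingleInstance(lookup)
    await runner.start("first")
    await asyncio.sleep(0)               # let "first" actually start running
    second = await runner.start("second")
    await second
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Starting the second lookup cancels the first one, so only the second runs to completion.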

However, there is a bug in aiohttp 3.6 (the version on focal) that leads to a DNS resolution task becoming headless if the underlying task is cancelled. When that happens, any exception raised by the DNS resolution task goes directly to the exception handler.

This bug is fixed in aiohttp 3.7 with this patch:

https://github.com/aio-libs/aiohttp/commit/9b918a3fc6d87a97c02d7f96ff81a6b3502ba693

But the fix is not present in aiohttp 3.6 (the one we have in the Subiquity snap currently).
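
To illustrate the mechanism in general asyncio terms, here is a stdlib-only sketch. It is not aiohttp's code and not the patch linked above; `resolve`, `lookup`, `drop_exception`, and the host name `geoip.invalid` are all hypothetical. It shows a resolver task that keeps running after the coroutine waiting on it is cancelled. If nothing ever retrieves the exception of such a detached task, asyncio reports it through the event loop's exception handler when the task object is garbage-collected ("Task exception was never retrieved"). One way to avoid that, assumed here, is a done callback that retrieves the exception explicitly.

```python
import asyncio
import functools
import socket


async def resolve(host):
    """Stand-in for a DNS lookup that fails after a short delay."""
    await asyncio.sleep(0.01)
    raise socket.gaierror(socket.EAI_AGAIN, "Temporary failure in name resolution")


def drop_exception(fut, handled):
    """Retrieve the detached task's outcome so it is not left unretrieved."""
    if fut.cancelled():
        return
    exc = fut.exception()
    if exc is not None:
        handled.append(type(exc).__name__)


async def lookup(host, handled):
    resolver = asyncio.ensure_future(resolve(host))
    try:
        # Cancelling lookup() cancels this wait, but not the resolver task.
        await asyncio.wait([resolver])
    except asyncio.CancelledError:
        # The resolver is now detached: consume whatever it raises later.
        resolver.add_done_callback(
            functools.partial(drop_exception, handled=handled))
        raise
    return resolver.result()


async def demo():
    handled = []
    outer = asyncio.ensure_future(lookup("geoip.invalid", handled))
    await asyncio.sleep(0)       # let lookup() start its resolver task
    outer.cancel()               # what restarting the lookup would do
    await asyncio.wait([outer])
    await asyncio.sleep(0.05)    # give the detached resolver time to fail
    return outer.cancelled(), handled


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The lookup itself ends up cancelled, while the resolver's `socket.gaierror` surfaces afterwards and is only seen by the callback.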

The options that we have when staying on aiohttp 3.6 are:
* preventing the lookups from being cancelled. This means potentially more traffic.
* requesting that the aiohttp patch be back-ported to 3.6 (quite unlikely to be accepted IMO).
* SRU-ing the patch in Ubuntu on Focal. It seems to apply cleanly. No need to forward to Debian since they don't have aiohttp 3.6 in any release.
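
As a rough sketch of what the first option could look like (an assumption for illustration, not a change that was made in Subiquity), a runner can start a new lookup without cancelling the one already in flight. The class `KeepRunning` and the `lookup` coroutine are hypothetical.

```python
import asyncio


class KeepRunning:
    """Hypothetical runner that never cancels a lookup already in flight."""

    def __init__(self, func):
        self.func = func
        self.in_flight = set()

    def start(self, *args):
        task = asyncio.ensure_future(self.func(*args))
        self.in_flight.add(task)          # keep a reference until it is done
        task.add_done_callback(self._finished)
        return task

    def _finished(self, task):
        self.in_flight.discard(task)
        if not task.cancelled():
            task.exception()              # mark any failure as retrieved


async def demo():
    finished = []

    async def lookup(name):
        await asyncio.sleep(0.01)         # pretend to query the service
        finished.append(name)

    runner = KeepRunning(lookup)
    first = runner.start("first")
    second = runner.start("second")       # "first" keeps running
    await asyncio.wait([first, second])
    await asyncio.sleep(0)                # let the done callbacks run
    return sorted(finished), len(runner.in_flight)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both lookups run to completion, which is where the extra traffic mentioned in that option comes from.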

Changed in ubuntu-power-systems:
status: New → Fix Committed
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Changed in subiquity:
status: Fix Committed → Fix Released