QEMU coroutines fail with LTO on non-x86_64 architectures

Bug #1921664 reported by Tommy Thorn
This bug affects 2 people
Affects              Status        Importance  Assigned to
qemu (Fedora)        Confirmed     Medium
qemu (Ubuntu)        Fix Released  Medium      Paride Legovini
qemu (Ubuntu) Jammy  Fix Released  Undecided   Michał Małoszewski

Bug Description

[Impact]

- QEMU on Jammy (22.04) is affected.

- Emulation of riscv64 on arm64 fails.

- Emulation of arm64/armhf on arm64, ppc64el, and s390x fails.

- Installing the Ubuntu arm64 ISO image in a VM fails.

[Fix]

- debian/rules has no entry that disables LTO when the Debian host architecture is not amd64; on those architectures LTO causes QEMU coroutines to fail.

- The fix adds DEB_BUILD_MAINT_OPTIONS=optimize=-lto for non-amd64 hosts and exports it, so the setting reaches all child processes created from that shell.
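As a sketch, the debian/rules hunk could look like the following (hypothetical rendering, assuming dpkg-dev's standard architecture.mk; the actual upload may differ in detail):

```make
# Hypothetical sketch of the debian/rules change, not the verbatim upload.
include /usr/share/dpkg/architecture.mk

ifneq ($(DEB_HOST_ARCH),amd64)
# LTO breaks QEMU's coroutines on non-amd64 hosts, so have
# dpkg-buildflags drop -flto; "export" makes the variable visible to
# every child process spawned during the build.
export DEB_BUILD_MAINT_OPTIONS = optimize=-lto
endif
```

With this in place, dpkg-buildflags omits -flto from CFLAGS/LDFLAGS on the affected architectures while amd64 builds keep LTO enabled.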

[Test Plan]

** Reproduction **

- Detailed test steps on AWS are in comment #87.

arm64 on arm64, ubuntu cloud image

 wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-arm64.img

 sudo apt install --yes --no-install-recommends qemu-system-arm qemu-efi-aarch64 ipxe-qemu

 cp /usr/share/AAVMF/AAVMF_CODE.fd flash0.img
 cp /usr/share/AAVMF/AAVMF_VARS.fd flash1.img

 qemu-system-aarch64 \
   -machine virt -nographic \
   -smp 4 -m 4G \
   -cpu cortex-a57 \
   -pflash flash0.img -pflash flash1.img \
   -drive file=jammy-server-cloudimg-arm64.img,format=qcow2,id=drive0,if=none \
   -device virtio-blk-device,drive=drive0
 ...

 BdsDxe: failed to load Boot0001 "UEFI Misc Device" from VenHw(93E34C7E-B50E-11DF-9223-2443DFD72085,00): Not Found
 qemu-system-aarch64: GLib: g_source_ref: assertion 'source != NULL' failed
 Segmentation fault (core dumped)

As mentioned, this only happens on specific systems. Paride has a reliable reproducer (see comment #39); he will perform the SRU validation for arm64.
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921664/comments/39

[Where problems could occur]

Any code change might change the behavior of the package in a specific situation and cause other errors.
A possible, but rather unlikely, regression source is that qemu will be rebuilt against newer versions of its build dependencies on Jammy. There might also be other warnings in the code, fixed in later versions, that we have not identified. It is unlikely, but there might be an architecture where the test plan fails.
Later updates could also break the functionality of the introduced fix.
In any case, the fix is not complex and problems can be detected easily.

[Other Info]

This change in itself does not touch QEMU source code (i.e., no functional change), but it does change the object code, since the compiler build options are now different (in addition to build-dependency versions).

This change has been in Kinetic since September 2022 (~6 months).

[Original Bug Description]

Note: this could as well be "riscv64 on arm64" (slow@slow) and may affect
other architectures as well.

The following case triggers on a Raspberry Pi 4 running arm64
Ubuntu 21.04 [1][2]. It might trigger in other environments as well,
but that is where we have seen it so far.

   $ wget https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/UbuntuFocal-riscv64-QemuVM.tar.gz
   $ tar xzf UbuntuFocal-riscv64-QemuVM.tar.gz
   $ ./run_riscvVM.sh
(wait ~2 minutes)
   [ OK ] Reached target Local File Systems (Pre).
   [ OK ] Reached target Local File Systems.
            Starting udev Kernel Device Manager...
qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.

This is often, but not 100%, reproducible, and the cases differ slightly; we
see either of:
- qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
- qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

Rebuilding working cases has been shown to make them fail, and rebuilding
(or even reinstalling) bad cases has made them work. Also, the same builds
behave differently on different arm64 CPUs. TL;DR: the full list of conditions
influencing the good/bad cases here is not yet known.

[1]: https://ubuntu.com/tutorials/how-to-install-ubuntu-on-your-raspberry-pi#1-overview
[2]: http://cdimage.ubuntu.com/daily-preinstalled/pending/hirsute-preinstalled-desktop-arm64+raspi.img.xz

--- --- original report --- ---

I regularly run a RISC-V (RV64GC) QEMU VM, but an update a few days ago broke it. Now when I launch it, it hits an assertion:

OpenSBI v0.6
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
618 bytes read in 2 ms (301.8 KiB/s)
RISC-V Qemu Boot Options
1: Linux kernel-5.5.0-dirty
2: Linux kernel-5.5.0-dirty (recovery mode)
Enter choice: 1: Linux kernel-5.5.0-dirty
Retrieving file: /boot/initrd.img-5.5.0-dirty
qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.
./run.sh: line 31: 1604 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 8 -m 8G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports

Interestingly this doesn't happen on the AMD64 version of Ubuntu 21.04 (fully updated).

I think you have everything already, but just in case:

$ lsb_release -rd
Description: Ubuntu Hirsute Hippo (development branch)
Release: 21.04

$ uname -a
Linux minimacvm 5.11.0-11-generic #12-Ubuntu SMP Mon Mar 1 19:27:36 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
(note this is a VM running on macOS/M1)

$ apt-cache policy qemu
qemu:
  Installed: 1:5.2+dfsg-9ubuntu1
  Candidate: 1:5.2+dfsg-9ubuntu1
  Version table:
 *** 1:5.2+dfsg-9ubuntu1 500
        500 http://ports.ubuntu.com/ubuntu-ports hirsute/universe arm64 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
DistroRelease: Ubuntu 21.04
Package: qemu 1:5.2+dfsg-9ubuntu1
ProcVersionSignature: Ubuntu 5.11.0-11.12-generic 5.11.0
Uname: Linux 5.11.0-11-generic aarch64
ApportVersion: 2.20.11-0ubuntu61
Architecture: arm64

CasperMD5CheckResult: unknown
CurrentDmesg:
 Error: command ['pkexec', 'dmesg'] failed with exit code 127: polkit-agent-helper-1: error response to PolicyKit daemon: GDBus.Error:org.freedesktop.PolicyKit1.Error.Failed: No session for cookie
 Error executing command as another user: Not authorized

 This incident has been reported.
Date: Mon Mar 29 02:33:25 2021
Dependencies:

KvmCmdLine: COMMAND STAT EUID RUID PID PPID %CPU COMMAND
Lspci-vt:
 -[0000:00]-+-00.0 Apple Inc. Device f020
            +-01.0 Red Hat, Inc. Virtio network device
            +-05.0 Red Hat, Inc. Virtio console
            +-06.0 Red Hat, Inc. Virtio block device
            \-07.0 Red Hat, Inc. Virtio RNG
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t:

Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: console=hvc0 root=/dev/vda
SourcePackage: qemu
UpgradeStatus: Upgraded to hirsute on 2020-12-30 (88 days ago)
acpidump:
 Error: command ['pkexec', '/usr/share/apport/dump_acpi_tables.py'] failed with exit code 127: polkit-agent-helper-1: error response to PolicyKit daemon: GDBus.Error:org.freedesktop.PolicyKit1.Error.Failed: No session for cookie
 Error executing command as another user: Not authorized

 This incident has been reported.

Related branches

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

FWIW, I just now built qemu-system-riscv64 from git ToT and that works fine.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Tommy,
you reported that against "1:5.2+dfsg-9ubuntu1" which is odd.
The only recent change was around
a) package dependencies
b) CVEs not touching your use-case IMHO

Was the formerly working version 1:5.2+dfsg-6ubuntu2 as I'm assuming or did you upgrade from a different one?

Could you also add the full commandline you use to start your qemu test case?
If there are any images or such involved as far as you can share where one could fetch them please.

And to be clear on your report - with the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you.
Just the emulation of riscv64 on arm64 HW is what now fails for you correct?

It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?
If you built v5.2.0 it might be something in the Ubuntu Delta that I have to look for.
If you've built the latest HEAD of qemu git then most likely the solution is a commit since v5.2.0 - in that case, would you be willing and able to maybe bisect v5.2.0..HEAD to find what the fix was?

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

0. Repro:

   $ wget https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/UbuntuFocal-riscv64-QemuVM.tar.gz
   $ tar xzf UbuntuFocal-riscv64-QemuVM.tar.gz
   $ ./run_riscvVM.sh
(wait ~ 20 s)
   [ OK ] Reached target Local File Systems (Pre).
   [ OK ] Reached target Local File Systems.
            Starting udev Kernel Device Manager...
   qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

  (root password is "riscv" fwiw)

1. "Was the formerly working version 1:5.2+dfsg-6ubuntu2?"

   I'm afraid I don't know, but I update a few times a week.

   If you can tell me how to try individual versions, I'll do that

2. "full commandline you use to start your qemu test case?"

   Probably the repo above is more useful, but FWIW:

   qemu-system-riscv64 \
    -machine virt \
    -nographic \
    -smp 4 \
    -m 4G \
    -bios fw_payload.bin \
    -device virtio-blk-device,drive=hd0 \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-device,rng=rng0 \
    -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 \
    -device virtio-net-device,netdev=usernet \
    -netdev user,id=usernet,$ports

3. "the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you? Just the emulation of riscv64 on arm64 HW is what now fails for you correct?"

   Yes x 2, confirmed with the above repro.

   $ apt-cache policy qemu
qemu:
  Installed: 1:5.2+dfsg-9ubuntu1
  Candidate: 1:5.2+dfsg-9ubuntu1
  Version table:
 *** 1:5.2+dfsg-9ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu hirsute/universe amd64 Packages
        100 /var/lib/dpkg/status

4. "It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?"

  latest.

  Rebuilding from the commit tagged with v5.2.0 ...

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Self-built v5.2.0 qemu-system-riscv64 does _not_ produce the bug.

Changed in qemu (Ubuntu):
status: Incomplete → New
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

0. Repro:

> ...
> $ ./run_riscvVM.sh
> ...

Thanks, I was not able to reproduce with that using the most recent
qemu 1:5.2+dfsg-9ubuntu1 on amd64 (just like you)

Trying the same on armhf was slower and a bit odd.
- I first got:
  qemu-system-riscv64: at most 2047 MB RAM can be simulated
  Reducing the memory to 2047M started up the system.
- then I have let it boot, which took quite a while and eventually
  hung at
[ 13.017716] mousedev: PS/2 mouse device common for all mice
[ 13.065889] usbcore: registered new interface driver usbhid
[ 13.070209] usbhid: USB HID core driver
[ 13.092671] NET: Registered protocol family 10

So it hung on armhf, while working on an amd64 host. That isn't good, but there was no crash to be seen :-/

Maybe it depends on what arm platform (as there are often subtle differences) or which storage (as the assert is about storage) you run on.
My CPU is an X-Gene and my Storage is a ZFS (that backs my container running hirsute and Hirsute's qemu).
What is it for you?

I've waited more, but no failure other than the hang was showing up.
Is this failing 100% of the times for you, or just sometimes and maybe racy?

---

1. "Was the formerly working version 1:5.2+dfsg-6ubuntu2?"

> I'm afraid I don't know, but I update a few times a week.

A hint which versions to look at can be derived from
  $ grep -- qemu-system-misc /var/log/dpkg.log

> If you can tell me how to try individual versions, I'll do that

You can go to https://launchpad.net/ubuntu/+source/qemu/+publishinghistory
There you'll see every version of the package that existed. If you click on a version
it allows you to download the debs which you can install with "dpkg -i ....deb"

---

2. "full commandline you use to start your qemu test case?"

> Probably the repo above is more useful, but FWIW:

Indeed, thanks!

3. "the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you? Just the emulation of riscv64 on arm64 HW is what now fails for you correct?"

> Yes x 2, confirmed with the above repro.

Thanks for the confirmation

---

4. "It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?"

> Rebuilding from the "commit" tagged with v5.2.0 ...

Very interesting, this short after a release this is mostly a few CVEs and integration of e.g. Ubuntu/Debian specific paths. Still chances are that you used a different toolchain than the packaging builds.
Could you rebuild what you get with "apt source qemu". That will be 5.2 plus the Delta we have...
If that doesn't fail then your build-env differs from our builds, and therein is the solution.
If it fails we need to check which delta it is.

Furthermore if indeed that fails while v5.2.0 worked I've pushed all our delta as one commit at a time to https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+ref/hirsute-delta-as-commits-lp1921664 so you could maybe bisect that. But to be sure build from the first commit in there and verify that it works. If this fails as well we have to look what differs in those builds.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI my qemu is still busy
   1913 root 20 0 2833396 237768 7640 S 100.7 5.9 25:54.13 qemu-system-ris

And after about 1000 seconds the guest moved a bit forward now reaching
[ 13.070209] usbhid: USB HID core driver
[ 13.092671] NET: Registered protocol family 10
[ 1003.282387] Segment Routing with IPv6
[ 1004.790268] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[ 1009.002716] NET: Registered protocol family 17
[ 1012.612965] 9pnet: Installing 9P2000 support
[ 1012.915223] Key type dns_resolver registered
[ 1015.022864] registered taskstats version 1
[ 1015.324660] Loading compiled-in X.509 certificates
[ 1036.408956] Freeing unused kernel memory: 264K
[ 1036.410322] This architecture does not have kernel memory protection.
[ 1036.710012] Run /init as init process
Loading, please wait...

I'll keep it running to check if I'll hit the assert later ....

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

> Maybe it depends on what arm platform (as there are often subtle differences) or which storage (as the assert is about storage) you run on.
> My CPU is an X-Gene and my Storage is a ZFS (that backs my container running hirsute and Hirsute's qemu).
> What is it for you?

Sorry, I thought I had already reported that, but it's not clear. My setup is special in a couple of ways:
- I'm running Ubuntu/Arm64 (21.04 beta, fully up-to-date except kernel), but ...
- it's a virtual machine on a macOS/Mac Mini M1 (fully up-to-date)
- It's running the 5.8.0-36-generic which isn't the latest (for complicated reasons)

I'll try to bring my Raspberry Pi 4 back up on Ubuntu and see if I can reproduce it there.

> Is this failing 100% of the times for you, or just sometimes and maybe racy?

100% consistently reproducible with the official packages. 0% reproducible with my own build

> A hint which versions to look at can be derived from
> $ grep -- qemu-system-misc /var/log/dpkg.log

Alas, I had critical space issues and /var/log was among the casualties

> Could you rebuild what you get with "apt source qemu". That will be 5.2 plus the Delta we have...

TIL. I tried `apt source --compile qemu` but it complains

  dpkg-checkbuilddeps: error: Unmet build dependencies: gcc-alpha-linux-gnu gcc-powerpc64-linux-gnu

but these packages are not available [anymore?]. I don't currently have the time to figure this out.

> FYI my qemu is still busy

It's hung. The boot takes ~20 seconds on my host. Multi-minutes is not normal.

If I can reproduce this on a Raspberry Pi 4, then I'll proceed with your suggestions above, otherwise I'll pause this until I can run Ubuntu natively on the Mac Mini.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok,
thanks for all the further details.

Let us chase this further down once you get to that test & bisect.
I'll set the state to incomplete until then.

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

On my 4 GB Raspberry Pi 4

  QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-3ubuntu1)

worked as expected, but

  QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-9ubuntu1)

*did* reproduce the issue, but it took slightly longer to hit it (a few minutes):

```
...
[ OK ] Started Serial Getty on ttyS0.
[ OK ] Reached target Login Prompts.

Ubuntu 20.04 LTS Ubuntu-riscv64 ttyS0

Ubuntu-riscv64 login: qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
./run_riscvVM.sh: line 31: 2304 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 4 -m 3G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports
```

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Christian, I think I need some help. Like I said I couldn't build with apt source --compile qemu.
I proceeded with

  $ git clone -b hirsute-delta-as-commits-lp1921664 git+ssh://<email address hidden>/~paelzer/ubuntu/+source/qemu

  (git submodule update --init did nothing)

but the configure step failed with

  $ ../configure
  warn: ignoring non-existent submodule meson
  warn: ignoring non-existent submodule dtc
  warn: ignoring non-existent submodule capstone
  warn: ignoring non-existent submodule slirp
  cross containers no

  NOTE: guest cross-compilers enabled: cc s390x-linux-gnu-gcc cc s390x-linux-gnu-gcc
  /usr/bin/python3: can't open file '/home/tommy/qemu/meson/meson.py': [Errno 2] No such file or directory
  /usr/bin/python3: can't open file '/home/tommy/qemu/meson/meson.py': [Errno 2] No such file or directory

I had no problem building the master branch so I'm not sure what's going on with the submodules in your repo.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

I'm not sure how I was _supposed_ to do this, but I checked out the official release and then switched to the hirsute-delta-as-commits-lp1921664 (6c7e3708580ac50f78261a82b2fcdc2f288d6cea) branch, which kept the directories around. I configured with "--target-list=riscv64-softmmu" to save time, and the resulting binary did *not* reproduce the bug.

So in summary:
- Debian 1:5.2+dfsg-9ubuntu1 reproduces the issue on both the RPi4 and my M1 VM.
- So far no version I have built has reproduced the issue.
Definitely makes either _how_ I built it or the _build tools_ I used sus.

I'm not sure what to do next. I assume I'm supposed to set the bug back to "new"?

Changed in qemu (Ubuntu):
status: Incomplete → New
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

FWIW: I went full inception and ran QEMU/RISC-V under QEMU/RISC-V but I couldn't reproduce the issue here (that is, everything worked, but very slowly).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thank you for all your work and these confirmations Tommy!

I was bringing my RPi4 up as well...
Note: My RPi4 is installed as aarch64
I ran userspaces with arm64 and armhf (via LXD).

In the arm64 userspace case I was able to trigger the bug reliably in 3/3 tries under a minute each time
In the armhf userspace case it worked just fine.

So to summarize (on my RPi4)
- RPi4 riscv emulation on arm64 userspace on arm64 kernel - fails (local system)
- RPi4 riscv emulation on armhf userspace on arm64 kernel - TODO (local system)
- XGene riscv emulation on armhf userspace on arm64 kernel - works (Canonistac)
- M1 riscv emulation on armhf userspace on armhf kernel - fails (Tommy)

But I've found a way to recreate this, which is all I needed for now \o/

...
[ OK ] Finished Load/Save Random Seed.
[ OK ] Started udev Kernel Device Manager.
qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
./run_riscvVM.sh: line 31: 8302 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 2 -m 1G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports

I need to build & rebuild the different qemu options (git, ubuntu, ubuntu without delta, former ubuntu version) to compare those. And a lot of other tasks fight for having higher prio ... that will take a while ...

Changed in qemu (Ubuntu):
status: New → Confirmed
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Small correction: everything I've done has been 64-bit. I don't use armhf.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, thanks Tommy - then my Repro hits exactly what you had.
Good to have that sorted out as well.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since it was reported to have worked with former builds in Ubuntu I was testing the former builds that were published in Hirsute.

https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu2 - failing
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1 - failing
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-3ubuntu1 - working
https://launchpad.net/ubuntu/+source/qemu/1:5.1+dfsg-4ubuntu3 - working
I had also prepped 1:5.0-5ubuntu11 but didn't go further after the above results.

This absolutely confirms your initial report (changed in a recent version) and gladly leaves us much less to churn through.
OTOH the remaining changes that could be related are mostly CVEs, which most of the time are not very debatable.

That was a rebase without changes in Ubuntu, but picking up Debian changes between
1:5.2+dfsg-3 -> 1:5.2+dfsg-9.

Those are:
- virtiofsd changes - not used here
- package dependency changes - not relevant here
- deprecate qemu-debootstrap - not used here
- security fixes
  - arm_gic-fix-interrupt-ID-in-GICD_SGIR-CVE-2021-20221.patch - not used (arm virt)
  - 9pfs-Fully-restart-unreclaim-loop-CVE-2021-20181.patch - not used (9pfs)
  - CVE-2021-20263 - again virtiofsd (not used)
  - CVE-2021-20257 - network for e1000 (not related to the error and nic none works)
  - I'll still unapply these for a test just to be sure
- there also is the chance that this is due to libs/build-toolchain - I'll rebuild a former working version for a re-test

I was trying to further limit the scope, but here things got a bit crazy:

- 1:5.2+dfsg-9ubuntu1 - tried 3 more times as-is - 2 failed 1 worked
So it isn't 100% reproducible :-/

This made me re-recheck the older builds (maybe some race window got bigger/smaller).

Then I had 3 more tries with "-nic none"
All three failed - so it is unlikely the e1000 fix that could have crept in via a default config.

I have created two PPAs which just started to build:
https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-secrevertpatches
https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold

Once these are complete I can further chase this down ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From my test PPAs, the version "1:5.2+dfsg-9ubuntu2~hirsuteppa3", a no-change rebuild of the formerly working "1:5.2+dfsg-9ubuntu1", failed in three out of three tries.

So we are not looking at anything in the qemu source or the Ubuntu/Debian delta applied to it, but at something in the build environment that now creates misbehaving binaries - the same source built on 2021-03-23 worked fine.
Since I have no idea yet where exactly to look, I'll add "the usual suspects" of glibc, gcc-10 and binutils - also Doko/Rbalint (who look after those packages) have seen a lot and might have an idea about what is going on here.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

That would explain why I could reproduce with personal builds. Glibc looks very relevant here.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

couldN’T, grr

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Before going into any rebuild mania I wanted to further reduce how many builds I'll need. I've swapped "qemu-system-misc" but kept the others like "qemu-block-extra" and "qemu-system-common" - that mostly means /usr/bin/qemu-system-riscv64 is replaced but all the ROMs and modules are not (it can't load them).
Reminder: all these are the same effective source
I've done this two ways:

All good pkg (1:5.2+dfsg-9ubuntu1) + emu bad (1:5.2+dfsg-9ubuntu2~hirsuteppa3):
All bad pkg (1:5.2+dfsg-9ubuntu2~hirsuteppa3) + emu good (1:5.2+dfsg-9ubuntu1): 3/3 fails
That made me wonder and I also got:
All good pkg (1:5.2+dfsg-9ubuntu1): 5/5 fails (formerly this was known good)

Sadly - the formerly seen non-distinct results continued. For example I did at one point end up with all packages of version "1:5.2+dfsg-9ubuntu1" (that is known good) failing in 5/5 tests repeatedly.

So I'm not sure how much the results are worth anymore :-/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Furthermore I've built (again the very same source) in groovy as 5.2+dfsg-9ubuntu2~groovyppa1 in the same PPA.
This build works as well in my tries.

So I have the same code as in "1:5.2+dfsg-9ubuntu1" three times now:
1. [1] => built 2021-03-23 in Hirsute => works
2. [2] => built 2021-04-12 in Hirsute => fails
3. [3] => built 2021-04-13 in Groovy => works

[1]: https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1/+build/21196422
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21392458
[3]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21394457

With the two results above, my expected next step was to spin up a (git-based)
Groovy and a Hirsute build environment.
I'd do a build from git (and optimize a bit for build speed).
If these builds confirm the above results of [2] and [3], then I should be able
to upgrade the components in the Groovy build environment one by one to Hirsute
to identify which one is causing the breakage...

But unfortunately I have to start to question the reproducibility, and that is
the straw breaking the camel's back here. Without a reliable reproducer I can't
make good progress, and as sad as it is (it is a real issue), riscv64 emulation
on an arm64 host really isn't the most common use case. So I'm unsure how much
time I can spend on this.

Maybe I have looked at this from the wrong angle, let me try something else before I give up ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I've continued on one of the former approaches and started a full Ubuntu style
package build of the full source on arm64 in Groovy and Hirsute.
But it fell apart running out of space, and I'm slowly getting hesitant to spend
more hardware and time on this without
a) at least asking upstream whether this is a known issue
b) seeing it in a less edge-case scenario than RISC-V emulation on arm64

But I think by now we can drop the former "usual suspects" again, as I have
had plenty of fails with the previously good builds. It is just racy, and a
yet-unknown set of conditions seems to influence this race.

If we are later on finding some evidence we can add them back ...

no longer affects: glibc (Ubuntu)
no longer affects: binutils (Ubuntu)
no longer affects: gcc-10 (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From the error message this seems to be about concurrency:

qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
Aborted (core dumped)

 42 void coroutine_fn qemu_co_queue_wait_impl(CoQueue *queue, QemuLockable *lock)
 43 {
 44     Coroutine *self = qemu_coroutine_self();
 45     QSIMPLEQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
 46
 47     if (lock) {
 48         qemu_lockable_unlock(lock);
 49     }
 50
 51     /* There is no race condition here. Other threads will call
 52      * aio_co_schedule on our AioContext, which can reenter this
 53      * coroutine but only after this yield and after the main loop
 54      * has gone through the next iteration.
 55      */
 56     qemu_coroutine_yield();
 57     assert(qemu_in_coroutine());
 58
 59     /* TODO: OSv implements wait morphing here, where the wakeup
 60      * primitive automatically places the woken coroutine on the
 61      * mutex's queue. This avoids the thundering herd effect.
 62      * This could be implemented for CoMutexes, but not really for
 63      * other cases of QemuLockable.
 64      */
 65     if (lock) {
 66         qemu_lockable_lock(lock);
 67     }
 68 }

I wondered if I can stop this from happening by reducing the SMP count and/or
the real CPUs that are usable.

- Running with -smp 1 - 3/3 fails

Arm cpus are not so easily hot-pluggable so I wasn't able to run with just
one cpu yet - but then the #host cpus won't change the threads/processes that are executed - just their concurrency.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

There are two follow on changes to this code (in the not yet released qemu 6.0):
 050de36b13 coroutine-lock: Reimplement CoRwlock to fix downgrade bug
 2f6ef0393b coroutine-lock: Store the coroutine in the CoWaitRecord only once

They change how things are done, but are no known fixes to the current issue.

We might gather more data and report it upstream - it could ring a bell for
someone there.

Attaching gdb to the live qemu ran into further issues:
# Cannot find user-level thread for LWP 29341: generic error
Which on qemu led to
# [ 172.294630] watchdog: BUG: soft lockup - CPU#0 stuck for 78s! [systemd-udevd:173]

I'm not sorting this out now, so post mortem debugging it will be :-/

I've taken a crash dump of the most recent 1:5.2+dfsg-9ubuntu2, which
has debug symbols in Ubuntu that one can fetch even later from
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu2

(gdb) info threads
  Id Target Id Frame
* 1 Thread 0xffffa98f9010 (LWP 29397) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
  2 Thread 0xffffa904f8b0 (LWP 29398) syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
  3 Thread 0xffffa3ffe8b0 (LWP 29399) 0x0000ffffab022d14 in __GI___sigtimedwait (set=set@entry=0xaaaac2fed320, info=info@entry=0xffffa3ffdd88, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:54
  4 Thread 0xffff237ee8b0 (LWP 29407) __futex_abstimed_wait_common64 (cancel=true, private=-1022925096, abstime=0xffff237ede48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766d8)
    at ../sysdeps/nptl/futex-internal.c:74
  5 Thread 0xffff22fde8b0 (LWP 29408) __futex_abstimed_wait_common64 (cancel=true, private=-1022925096, abstime=0xffff22fdde48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766d8)
    at ../sysdeps/nptl/futex-internal.c:74
  6 Thread 0xffff2bee18b0 (LWP 29405) __futex_abstimed_wait_common64 (cancel=true, private=-1022925092, abstime=0xffff2bee0e48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766dc)
    at ../sysdeps/nptl/futex-internal.c:74
  7 Thread 0xffffa27ce8b0 (LWP 29402) futex_wait (private=0, expected=2, futex_word=0xaaaab912d640 <qemu_global_mutex.lto_priv>) at ../sysdeps/nptl/futex-internal.h:146
  8 Thread 0xffffa2fde8b0 (LWP 29401) futex_wait (private=0, expected=2, futex_word=0xaaaab912d640 <qemu_global_mutex.lto_priv>) at ../sysdeps/nptl/futex-internal.h:146
  9 Thread 0xffff23ffe8b0 (LWP 29406) 0x0000ffffab0b9024 in __GI_pwritev64 (fd=<optimized out>, vector=0xaaaac3559fd0, count=2, offset=668794880)
    at ../sysdeps/unix/sysv/linux/pwritev64.c:26
  10 Thread 0xffffa37ee8b0 (LWP 29404) 0x0000ffffab0b9d3c in fdatasync (fd=<optimized out>) at ../sysdeps/unix/sysv/linux/fdatasync.c:28

(gdb) thread apply all bt

Thread 10 (Thread 0xffffa37ee8b0 (LWP 29404)):
#0 0x0000ffffab0b9d3c in fdatasync (fd=<optimized out>) at ../sysdeps/unix/sysv/linux/fdatasync.c:28
#1 0x0000aaaab8b8d3a8 in qemu_fdatasync (fd=<optimized out>) at ../../util/cutils.c:161
#2 handle_aiocb_flush (opaque=<optimized out>) at ../../block/file-posix.c:1350
#3 0x0000aaaab8c57314 in worker_thread (opaque=opaque@entry=0xaaaac307660...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Also I've rebuilt the most recent master c1e90def01, about ~55 commits newer than 6.0-rc2.
As in Tommy's experiments, I was unable to reproduce it there.
But given the data from the earlier tests, this is more likely an accident of
slightly different timing than an actual fix (to be clear, I'd appreciate it if
there is a fix; I'm just unable to derive anything I could e.g. bisect on from
this build being good).

export CFLAGS="-O0 -g -fPIC"
../configure --enable-system --disable-xen --disable-werror --disable-docs --disable-libudev --disable-guest-agent --disable-sdl --disable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi --disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir --disable-seccomp --disable-glusterfs --disable-tpm --disable-numa --disable-opengl --disable-virglrenderer --disable-xfsctl --disable-slirp --disable-blobs --disable-rdma --disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock --disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user --disable-spice --disable-qom-cast-debug --disable-bochs --disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat --disable-qed --disable-parallels --disable-sheepdog --disable-avx2 --disable-nettle --disable-gnutls --disable-capstone --enable-tools --disable-libssh --disable-libpmem --disable-cap-ng --disable-vte --disable-iconv --disable-curses --disable-linux-aio --disable-linux-io-uring --disable-kvm --disable-replication --audio-drv-list="" --disable-vhost-kernel --disable-vhost-vdpa --disable-live-block-migration --disable-keyring --disable-auth-pam --disable-curl --disable-strip --enable-fdt --target-list="riscv64-softmmu"
make -j10

Just like the package build that configures as
   coroutine backend: ucontext
   coroutine pool: YES

5/5 runs with that were ok
But since we know it is racy I'm unsure if that implies much :-/

P.S. I have not yet gone into a build-option bisect, but chances are it could be
related. That is too much stabbing in the dark, though; maybe someone experienced
in the coroutines code can already make sense of all the info we have gathered so
far.
I'll update the bug description and add an upstream task so that all the info we have gets mirrored to the qemu mailing lists.

summary: - Recent update broke qemu-system-riscv64
+ Coroutines are racy for risc64 emu on arm64 - crash on Assertion
description: updated
Changed in qemu (Ubuntu):
importance: Undecided → Low
Revision history for this message
Thomas Huth (th-huth) wrote : Re: Coroutines are racy for risc64 emu on arm64 - crash on Assertion

@Christian & Tommy : Could you please check whether the problematic binaries were built with link-time optimization, i.e. with -flto ? If so, does the problem go away when you rebuild the package without LTO?

Changed in qemu:
status: New → Incomplete
Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hmm, thanks for the hint Thomas.

Of the two formerly referenced same-source different result builds:

[1] => built 2021-03-23 in Hirsute => works
[2] => built 2021-04-12 in Hirsute => fails

[1]: https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1/+build/21196422
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21392458

The default flags changed in
  https://launchpad.net/ubuntu/+source/dpkg/1.20.7.1ubuntu4
and according to the build logs both ran with that.
Copy-Pasta from the log:
  dpkg (= 1.20.7.1ubuntu4),
=> In between those we did not switch the LTO default flags

For clarification: LTO is the default nowadays and we are not disabling it generally in qemu. So yes, the builds are with LTO - but both the good and the bad one are.

Although looking at the gcc versions I see we have:
- good case 10.2.1-23ubuntu2
- bad case 10.3.0-1ubuntu1

So maybe, while it isn't LTO itself, something in gcc 10.3 (perhaps even LTO as of 10.3) is what is broken?

@Tommy - I don't have any of the test systems around anymore. If I built you a no-LTO qemu for testing, which release would you need these days - Hirsute, Impish, ...?

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

(In reply to serge_sans_paille from comment #30)
> The following diff
>
> ```
> --- qemu-6.1.0.orig/util/async.c 2021-08-24 13:35:41.000000000 -0400
> +++ qemu-6.1.0/util/async.c 2021-09-20 17:48:15.404681749 -0400
> @@ -673,6 +673,10 @@
>
> AioContext *qemu_get_current_aio_context(void)
> {
> + if (qemu_in_coroutine()) {

This uses the `current` TLS variable. Are you sure this works? It seems like the same problem :).

Revision history for this message
In , sguelton (sguelton-redhat-bugs) wrote :

The patch above fixes the LTO issue, and once applied, I've been successfully building qemu with LTO with GCC: https://koji.fedoraproject.org/koji/taskinfo?taskID=76803353 (all archs) and with Clang : https://koji.fedoraproject.org/koji/taskinfo?taskID=76802978 (s390x only).

It's a compiler-agnostic patch: it works with any compiler that honors __attribute__((noinline)), as long as the compiler doesn't try to do interprocedural optimization across non-inlinable functions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for QEMU because there has been no activity for 60 days.]

Changed in qemu:
status: Incomplete → Expired
Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

RFC patch posted upstream based on the patch Serge attached to this BZ:
https://<email address hidden>/

Revision history for this message
Dana Goyette (danagoyette) wrote :

I've been having crashes with the same assertion message, when trying to run Windows 10 ARM under a VM. But I finally figured out that what's actually crashing it is not the fact that it's Windows, it's the fact that I was attaching the virtual drive via virtual USB.

If I do the same thing to an Ubuntu ARM64 guest, it *also* crashes.

qemu-system-aarch64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed

With the RISC-V guest, does your crash change if you change the type of attachment that's used for the virtual disk?

Also, I tried enabling core dumps in libvirt, but it didn't seem to dump cores to apport. Enabling core dumps would be useful for issues like this.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

No, as I described in great detail, it has nothing to do with the attached devices.
I just noticed that the bug was excused away as being due to the "slow" RPi 4.
I'll share that I originally hit it on Apple's M1, but as I expected my environment
might be too unusual, I replicated it on a RPi 4. I have since switched to building
qemu from source, so I don't know if it still happens.

Revision history for this message
In , kkiwi (kkiwi-redhat-bugs) wrote :

Based on recent discussions with Stefan/Thomas and others, I'm moving this to ITR 9.1.0 as a "FutureFeature" since we don't yet enable LTO downstream on non-x86 architectures. We do have an RFC patch upstream, so hopefully this can be added soon.

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

The following have been merged:
d5d2b15ecf cpus: use coroutine TLS macros for iothread_locked
17c78154b0 rcu: use coroutine TLS macros
47b7446456 util/async: replace __thread with QEMU TLS macros
7d29c341c9 tls: add macros for coroutine-safe TLS variables

I sent another 3 patches as a follow-up series.

Revision history for this message
In , mdeng (mdeng-redhat-bugs) wrote :

(In reply to Miroslav Rezanina from comment #0)
> When running build for qemu-kvm for RHEL 9, test-block-iothread during "make
> check " fails on aarch64, ppc64le and s390x architecture for
> /attach/blockjob (pass on x86_64):
  FYI, qemu-kvm isn't supported on RHEL 9 on Power.
Thanks

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

The following patches were merged upstream:
c1fe694357 coroutine-win32: use QEMU_DEFINE_STATIC_CO_TLS()
ac387a08a9 coroutine: use QEMU_DEFINE_STATIC_CO_TLS()
34145a307d coroutine-ucontext: use QEMU_DEFINE_STATIC_CO_TLS()

Revision history for this message
In , lijin (lijin-redhat-bugs) wrote :

Hi Yihuang and Boqiao,

Could you do the pre-verify on aarch64 and s390x with the fixed version?

Thanks.

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

I analyzed the build log; "-flto" is still not in the configure settings. Is this expected? The full configure is here: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/aarch64/build.log

I can see x86 enabled -flto: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on aarch64; it passed. The steps follow Eric's bug 2000479.

# ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64 --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -flto' '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg -fstack-protector-strong -fasynchronous-unwind-tables ' --target-list=aarch64-softmmu --enable-kvm --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls --enable-trace-backends=log --enable-seccomp --enable-cap-ng --disable-werror --without-default-devices --disable-capstone --target-list='aarch64-softmmu'

# make check-unit -j16
......
......
22/92 qemu:unit / test-block-iothread OK 0.64s 16 subtests passed
......
......
Ok: 92
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 0
Timeout: 0

So in my opinion, maybe we can also enable -flto on other architectures?

Anyway, the test result on the official build is passed.

Result: PASS as no Critical Regression or TestBlocker found

Test Environment:
Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
Host Kernel: kernel-5.14.0-119.el9.aarch64
QEMU: qemu-kvm-7.0.0-7.el9.aarch64
edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
Guest: RHEL.9.1.0

Results Analysis:
From 85 tests executed, 84 passed and 0 warned - success rate of 98.82% (excluding SKIP and CANCEL)
1 test case failed with an automation issue but the retest passed

New bugs(0):
Existing bugs(0):

Job link:
http://10.0.136.47/6759356/results.html

Revision history for this message
In , yfu (yfu-redhat-bugs) wrote :

QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Revision history for this message
In , eric.auger (eric.auger-redhat-bugs) wrote :

(In reply to Yihuang Yu from comment #43)
> Analyzed the build log, "-flto" is still not in the configure setting, is
> this expected? The full configure from here:
> http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> 0/7.el9/data/logs/aarch64/build.log
>
> I can see x86 enabled -flto:
> http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> 0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on
> aarch64, it passed. Steps refer to Eric's bug 2000479
>
> # ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64
> --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M
> --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec
> '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -flto'
> '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall
> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
> --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg
> -fstack-protector-strong -fasynchronous-unwind-tables '
> --target-list=aarch64-softmmu --enable-kvm
> --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls
> --enable-trace-backends=log --enable-seccomp --enable-cap-ng
> --disable-werror --without-default-devices --disable-capstone
> --target-list='aarch64-softmmu'
>
> # make check-unit -j16
> ......
> ......
> 22/92 qemu:unit / test-block-iothread OK 0.64s
> 16 subtests passed
> ......
> ......
> Ok: 92
> Expected Fail: 0
> Fail: 0
> Unexpected Pass: 0
> Skipped: 0
> Timeout: 0
>
> So in my opinion, maybe we can also enable -flto on other architectures?
>
> Anyway, the test result on the official build is passed.
>
> Result: PASS as no Critical Regression or TestBlocker found
>
> Test Environment:
> Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
> Host Kernel: kernel-5.14.0-119.el9.aarch64
> QEMU: qemu-kvm-7.0.0-7.el9.aarch64
> edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
> Guest: RHEL.9.1.0
>
> Results Analysis:
> From 85 tests executed, 84 passed and 0 warned - success rate of 98.82%
> (excluding SKIP and CANCEL)
> 1 test case failed with an auto issue but retes passed
>
> New bugs(0):
> Existing bugs(0):
>
> Job link:
> http://10.0.136.47/6759356/results.html

While at it, would you have cycles to test with Safestack enabled (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same symptoms and maybe Stefan's series also fixes that other BZ. Thank you in advance!

Revision history for this message
In , thuth (thuth-redhat-bugs) wrote :

(In reply to Yihuang Yu from comment #43)
> Analyzed the build log, "-flto" is still not in the configure setting, is
> this expected? The full configure from here:

I think we also need a change to the qemu-kvm.spec file to enable LTO on non-x86 again. There's a hack at the top of the file that looks like this:

%ifnarch x86_64
     %global _lto_cflags %%{nil}
%endif

Without removing that, we don't get LTO on s390x and aarch64, so I think this cannot be properly verified. @stefanha, could you add such a patch on top, please?

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

(In reply to Eric Auger from comment #45)
> (In reply to Yihuang Yu from comment #43)
> > Analyzed the build log, "-flto" is still not in the configure setting, is
> > this expected? The full configure from here:
> > http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> > 0/7.el9/data/logs/aarch64/build.log
> >
> > I can see x86 enabled -flto:
> > http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> > 0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on
> > aarch64, it passed. Steps refer to Eric's bug 2000479
> >
> > # ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64
> > --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M
> > --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec
> > '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -flto'
> > '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall
> > -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
> > --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg
> > -fstack-protector-strong -fasynchronous-unwind-tables '
> > --target-list=aarch64-softmmu --enable-kvm
> > --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls
> > --enable-trace-backends=log --enable-seccomp --enable-cap-ng
> > --disable-werror --without-default-devices --disable-capstone
> > --target-list='aarch64-softmmu'
> >
> > # make check-unit -j16
> > ......
> > ......
> > 22/92 qemu:unit / test-block-iothread OK 0.64s
> > 16 subtests passed
> > ......
> > ......
> > Ok: 92
> > Expected Fail: 0
> > Fail: 0
> > Unexpected Pass: 0
> > Skipped: 0
> > Timeout: 0
> >
> > So in my opinion, maybe we can also enable -flto on other architectures?
> >
> > Anyway, the test result on the official build is passed.
> >
> > Result: PASS as no Critical Regression or TestBlocker found
> >
> > Test Environment:
> > Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
> > Host Kernel: kernel-5.14.0-119.el9.aarch64
> > QEMU: qemu-kvm-7.0.0-7.el9.aarch64
> > edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
> > Guest: RHEL.9.1.0
> >
> > Results Analysis:
> > From 85 tests executed, 84 passed and 0 warned - success rate of 98.82%
> > (excluding SKIP and CANCEL)
> > 1 test case failed with an auto issue but retes passed
> >
> > New bugs(0):
> > Existing bugs(0):
> >
> > Job link:
> > http://10.0.136.47/6759356/results.html
>
> While at it, would you have cycles to test with Safestack enabled
> (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same
> symptoms and maybe Stefan's series also fixes that other BZ. Thank you in
> advance!

OK Eric, I will enable both flto and safe-stack, and then trigger tier1 testing. I will update the test result later.

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

Unfortunately, I cannot rebuild the qemu-kvm rpm package from src.rpm if I have both flto and safe-stack enabled. So, Eric, I don't think now is the right time to enable safe-stack. Maybe we need to tweak some CFLAGS?

# diff /root/rpmbuild/SPECS/qemu-kvm.spec /home/qemu-kvm.spec.backup
7a8,13
> # LTO does not work with the coroutines of QEMU on non-x86 architectures
> # (see BZ 1952483 and 1950192 for more information)
> %ifnarch x86_64
> %global _lto_cflags %%{nil}
> %endif
>
18c24
< %global have_safe_stack 1
---
> %global have_safe_stack 0
22a29,31
> %ifarch x86_64
> %global have_safe_stack 1
> %endif

flto + safe-stack:

 27/128 qemu:unit / test-bdrv-drain ERROR 1.01s killed by signal 11 SIGSEGV
―――――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――――――
stderr:

TAP parsing error: Too few tests run (expected 42, got 20)
(test program exited with status code -11)

 33/128 qemu:unit / test-block-iothread ERROR 1.66s killed by signal 6 SIGABRT
―――――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――――――
stderr:
qemu_aio_coroutine_enter: Co-routine was already scheduled in ''

TAP parsing error: Too few tests run (expected 16, got 10)
(test program exited with status code -6)

Summary of Failures:

 27/128 qemu:unit / test-bdrv-drain ERROR 0.94s killed by signal 11 SIGSEGV
 33/128 qemu:unit / test-block-iothread ERROR 1.54s killed by signal 6 SIGABRT

Ok: 123
Expected Fail: 0
Fail: 2
Unexpected Pass: 0
Skipped: 3
Timeout: 0

Revision history for this message
In , bfu (bfu-redhat-bugs) wrote :

(In reply to lijin from comment #42)
> Hi Yihuang and Boqiao,
>
> Could you do the pre-verify on aarch64 and s390x with the fixed version?
>
> Thanks.

[root@l42 build]# tests/unit/test-block-iothread
# random seed: R02Sdf2c11a84ebf6fa4a3bf33e5f4ba9f5c
1..16
# Start of sync-op tests
ok 1 /sync-op/pread
ok 2 /sync-op/pwrite
ok 3 /sync-op/load_vmstate
ok 4 /sync-op/save_vmstate
ok 5 /sync-op/pdiscard
ok 6 /sync-op/truncate
ok 7 /sync-op/block_status
ok 8 /sync-op/flush
ok 9 /sync-op/check
ok 10 /sync-op/activate
# End of sync-op tests
# Start of attach tests
ok 11 /attach/blockjob
ok 12 /attach/second_node
ok 13 /attach/preserve_blk_ctx
# End of attach tests
# Start of propagate tests
ok 14 /propagate/basic
ok 15 /propagate/diamond
ok 16 /propagate/mirror
# End of propagate tests

I didn't see an error on s390x

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

Based on comment 48 there are still issues, probably related to coroutines, that need to be debugged if we want to enable LTO + SafeStack on non-x86 architectures.

The coroutine TLS patches were already merged in 7.0.0-7 for this BZ.

I am on PTO until August. At that time I can investigate the root cause. Let's keep LTO disabled until the root cause is understood.

If someone else wants to take over this BZ while I'm away, feel free.

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

OK, Stefan.

Then let me move the ITM a bit later until we decide in which release to fix the compile issue. Thanks for understanding.

Revision history for this message
Paride Legovini (paride) wrote :

I am consistently hitting this when trying to install the Ubuntu arm64 ISO image in a VM. A minimal command line that reproduces the problem is (host system is jammy arm64):

qemu-system-aarch64 -enable-kvm -m 2048 -M virt -cpu host -nographic -drive file=flash0.img,if=pflash,format=raw -drive file=flash1.img,if=pflash,format=raw -drive file=image2.qcow2,if=virtio -cdrom jammy-live-server-arm64.iso

The installation never completes; it always crashes.

Changed in qemu:
status: Expired → Incomplete
Changed in qemu (Ubuntu):
status: Expired → Incomplete
Revision history for this message
Thomas Huth (th-huth) wrote :

Upstream QEMU bugs are now tracked on https://gitlab.com/qemu-project/qemu/-/issues - so if you can reproduce it with the latest version from upstream QEMU, please report it there.

no longer affects: qemu
Revision history for this message
Paride Legovini (paride) wrote :

I tried the qemu package from Kinetic on a Jammy system

$ qemu-system-aarch64 --version
QEMU emulator version 7.0.0 (Debian 1:7.0+dfsg-7ubuntu1)

and it fails in the same way:

qemu-system-aarch64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
Aborted (core dumped)

Revision history for this message
Paride Legovini (paride) wrote :

In the end it looks like it's LTO. I rebuilt Jammy's qemu (1:6.2+dfsg-2ubuntu6.3) with

  DEB_BUILD_MAINT_OPTIONS = optimize=-lto

and it doesn't crash anymore. I can't really tell if the issue is with Qemu's code or is due to a compiler bug. The rebuilt package is available in a PPA:

  https://launchpad.net/~paride/+archive/ubuntu/qemu-bpo

which despite the name doesn't actually contain backports.
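For reference, the arch-conditional form described in the bug summary could look roughly like this in debian/rules (a sketch under assumptions; the actual upload may differ):

```make
# Sketch only: disable LTO on every architecture except amd64.
# /usr/share/dpkg/architecture.mk provides DEB_HOST_ARCH;
# "export" propagates the option to all child processes of the build.
include /usr/share/dpkg/architecture.mk

ifneq ($(DEB_HOST_ARCH),amd64)
export DEB_BUILD_MAINT_OPTIONS = optimize=-lto
endif
```

The unconditional `DEB_BUILD_MAINT_OPTIONS = optimize=-lto` above is what was used for the PPA rebuild; the conditional variant keeps LTO enabled where it is known to work.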

FWIW Fedora disables LTO on aarch64 (arm64) because of this issue, see:

  https://bugzilla.redhat.com/show_bug.cgi?id=1952483
  https://src.fedoraproject.org/rpms/qemu/c/38b1a6c732bee90f75345c4d07

This is also discussed in this short Fedora mailing list thread:

https://<email address hidden>/msg159665.html

Changed in qemu (Ubuntu):
status: Incomplete → Confirmed
Paride Legovini (paride)
Changed in qemu (Ubuntu):
importance: Low → Medium
Paride Legovini (paride)
tags: added: lto server-todo
Revision history for this message
Paride Legovini (paride) wrote :

@Christian if we agree the path forward here is "disable LTO on non-amd64" I can prepare MPs and uploads for Kinetic and Jammy. I have a reproducer handy which will help with the SRU.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

We have recently looked at some coroutine raciness in older versions, but all of those issues I know of would be fixed in 7.0.

If you even see this in 7.0 (as stated above) and you have a reproducer we can use - then I'd be absolutely happy if you could prep this change.

The upstream bug discussion seems to indicate !x86 is at fault, so I'm just curious whether you found more than riscv.
@Paride, have you had a chance to check and confirm this on !riscv64 and/or !7.0 qemu?

Changed in qemu (Ubuntu):
assignee: nobody → Paride Legovini (paride)
Revision history for this message
Paride Legovini (paride) wrote :

Hi, all my findings above are based on testing on arm64, not riscv64. I do confirm seeing the coroutine raciness with 7.0, but I tested it on Jammy, not Kinetic, so another round of tests is needed to confirm Kinetic is affected by this (I think it is).

In any case Jammy needs to be fixed. The machine where I can reliably reproduce the issue is the same we use to run the Ubuntu ISO tests, and given that this is point release week I have to be careful with it, as I don't want to interfere with the ISO testing. After the point release I'll be away from keyboard for a couple of weeks, so the ETA for the fix is end of August.

Revision history for this message
Paride Legovini (paride) wrote :

Confirmed happening on arm64 using a clean Kinetic host system (qemu 1:7.0+dfsg-7ubuntu1).

Changed in qemu (Ubuntu):
status: Confirmed → Triaged
Changed in qemu (Ubuntu Jammy):
status: New → Triaged
assignee: nobody → Paride Legovini (paride)
Changed in qemu (Ubuntu):
status: Triaged → Fix Released
Changed in qemu (Fedora):
importance: Unknown → Medium
status: Unknown → In Progress
Changed in qemu (Fedora):
status: In Progress → Confirmed
Revision history for this message
Paride Legovini (paride) wrote (last edit ): Re: Coroutines are racy for risc64 emu on arm64 - crash on Assertion

I'm removing myself from the Jammy task as I didn't manage to find the time to work on this, and I don't see this changing in the short term. I'm moving the bug back to the team backlog.

This is a bitesize Jammy-only SRU requiring applying this patch to the package:

https://salsa.debian.org/qemu-team/qemu/-/merge_requests/35

I'm available for performing the SRU verification, I have a system which always hits the bug.

Changed in qemu (Ubuntu Jammy):
assignee: Paride Legovini (paride) → nobody
Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

Assigning to me the Jammy task.

Changed in qemu (Ubuntu Jammy):
assignee: nobody → Michał Małoszewski (michal-maloszewski99)
Paride Legovini (paride)
summary: - Coroutines are racy for risc64 emu on arm64 - crash on Assertion
+ QEMU coroutines fail with LTO on non-x86_64 architectures
description: updated
description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Triggered the QEMU autopkgtests against a PPA build for Jammy on all archs (except riscv64, not enabled on PPAs).

description: updated
description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The issues reported in this bug are reproducible on an AWS c6g.2xlarge instance.

The test packages with the backported patch to disable LTO addressed the issues.

...

$ lsb_release -cs
jammy

$ dpkg --print-architecture
arm64

$ sudo dmesg | grep DMI:
[ 0.033370] DMI: Amazon EC2 c6g.2xlarge/, BIOS 1.0 11/1/2018

Test 1) riscv64 on arm64, additional bits (comment #4)

 sudo apt install --yes --no-install-recommends qemu-system-riscv64

 wget https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/UbuntuFocal-riscv64-QemuVM.tar.gz

 tar xzf UbuntuFocal-riscv64-QemuVM.tar.gz
 ./run_riscvVM.sh
 <... wait ...>
 [ OK ] Reached target Local File Systems.
   Starting udev Kernel Device Manager...
 qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.
 ./run_riscvVM.sh: line 31: 2572 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 4 -m 4G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports

 (fails 5/5 times)

Test 2) riscv64 on arm64, ubuntu cloud image (based on comment #4)

 wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-riscv64.img

 sudo modprobe nbd
 sudo qemu-nbd -c /dev/nbd0 --read-only jammy-server-cloudimg-riscv64.img
 sudo mount /dev/nbd0p1 /mnt -o ro

 sudo cp /mnt/boot/vmlinuz-* /mnt/boot/initrd.img-* .
 sudo chown $USER vmlinuz-* initrd.img-*

 sudo umount /mnt
 sudo qemu-nbd -d /dev/nbd0

 qemu-system-riscv64 \
   -machine virt -nographic \
   -smp 4 -m 4G \
   -bios default \
   -kernel vmlinuz-* -initrd initrd.img-* \
   -append 'root=/dev/vda1 ro console=ttyS0' \
   -device virtio-blk-device,drive=drive0 \
   -drive file=jammy-server-cloudimg-riscv64.img,format=qcow2,id=drive0

 <... wait ~30-60 seconds ...>

 [ 36.089758] systemd[1]: Starting Load Kernel Module fuse...
   Starting Load Kernel Module fuse...
 [ 36.134443] systemd[1]: Starting Load Kernel Module pstore_blk...
   Starting Load Kernel Module pstore_blk...
 qemu-system-riscv64: GLib: g_source_ref: assertion 'source != NULL' failed
 Segmentation fault (core dumped)

 [fails 5/5 times]

Test 3) arm64 on arm64, ubuntu cloud image

 wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-arm64.img

 sudo apt install --yes --no-install-recommends qemu-system-arm qemu-efi-aarch64 ipxe-qemu

 cp /usr/share/AAVMF/AAVMF_CODE.fd flash0.img
 cp /usr/share/AAVMF/AAVMF_VARS.fd flash1.img

 qemu-system-aarch64 \
   -machine virt -nographic \
   -smp 4 -m 4G \
   -cpu cortex-a57 \
   -pflash flash0.img -pflash flash1.img \
   -drive file=jammy-server-cloudimg-arm64.img,format=qcow2,id=drive0,if=none \
   -device virtio-blk-device,drive=drive0
 ...

 BdsDxe: failed to load Boot0001 "UEFI Misc Device" from VenHw(93E34C7E-B50E-11DF-9223-2443DFD72085,00): Not Found
 qemu-system-aarch64: GLib: g_source_ref: assertion 'source != NULL' failed
 Segmentation fault (core dumped)

 [fails 5/5 t...


description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Uploaded to Jammy.

Changed in qemu (Ubuntu Jammy):
status: Triaged → In Progress
tags: added: verification-needed
tags: added: verification-needed-jammy
tags: removed: verification-needed verification-needed-jammy
Revision history for this message
Steve Langasek (vorlon) wrote :

The impact of this bug is reported as being that emulation of riscv64 on arm64 hardware fails. But the change that's been introduced affects the build flags for all binaries, not just qemu-system-riscv64; and affects the build flags for all non-amd64 *architectures*, not just arm64. Since LTO is an optimization, please provide further justification for the widespread reversion of this build flag instead of a targeted fix, or provide evidence that it does not significantly reduce performance.
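For context on the flag in question, the effect of the proposed maintainer option on the build flags can be inspected with dpkg-buildflags (a sketch; the exact output depends on the Ubuntu release and architecture, since Ubuntu enables LTO by default on most architectures):

```shell
# Default flags include LTO (-flto=auto -ffat-lto-objects) where the
# distribution enables it; setting optimize=-lto in
# DEB_BUILD_MAINT_OPTIONS strips the LTO flags again.
dpkg-buildflags --get CFLAGS
DEB_BUILD_MAINT_OPTIONS=optimize=-lto dpkg-buildflags --get CFLAGS
```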

Changed in qemu (Ubuntu Jammy):
status: In Progress → Incomplete
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Steve,

Thanks for your review; appreciate the prudent considerations.

In order to address the concerns about binaries and architectures,
I did some more testing on arm64/armhf/ppc64el/riscv64/s390x,
and it seems we should disable LTO on more architectures, indeed.

I'll summarize and provide the results and justification soon,
and also touch on the performance bit, since I sponsored this.

(BTW, I forgot to add 'arm64 on arm64', comment #87, test 3;
but other binaries and architectures are also affected; most
noticeably with qemu-system-arm binaries.)

P.S.: of course, others involved, please feel free to provide
additional insight/answers based on your experience with this
change for Kinetic (initial target for inclusion).

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Steve,

Back to the couple of points you brought up.

The answers are unfortunately a bit long, but hopefully
capture the details/thought process sufficiently for you
to make an assessment of the data and rationale I had.

I'll split them into a few comments.

Thanks!
Mauricio

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

On the 'further justification for widespread reversion'
(data is provided in the next comments/attachment).

...

The assertion failures in qemu due to LTO optimization
are more likely observed with arm64 and armhf binaries,
at least on the arm64, ppc64el and s390x architectures
(archs available as openstack vms on canonistack bos01).

This apparently indicates: the arm instruction code in
qemu is more _likely_ to be affected or trigger issues
(note: possibly not the _only_ one affected/triggering).

Perhaps it's due to arch-specific assembly/bugs in LTO
_or_ characteristics like register allocation, kernel
interaction, coroutine/userspace thread switching and/
or ABI specifics, that hit the arm-based binaries more.

However, it may be that other arch _binaries_ simply
aren't hitting this _often_ enough (or during boot,
which is the duration of our test), and might hit it
some time later... or worse, without an observable
impact (i.e., silent corruption).

...

We have the other example of the riscv64 (misc) binary
on the arm64 architecture, which hits the issue; it is
indeed different code, ABI, etc.

Other binaries didn't hit issues on arm64 arch, e.g.,
amd64, ppc64el, and s390x (not LTO issues, at least).

Interestingly, armhf binaries failed on arm64 but not
on armhf (actually running on an arm64-capable processor,
but in armhf compat mode), so there are apparently timing,
compiler, and/or environment factors at play as well.

...

So, although the results point to arm-based binaries
across different architectures, the issues with riscv64
on arm64 pose questions as raised above (likelihood).

I guess it's a matter of deciding whether we'd like
to selectively disable LTO on archs with _reported_
bugs (binaries: armhf, arm64, misc; on arm64, s390x,
ppc64el), or generally disable LTO with the rationale
this might be a toolchain issue with unknown impact.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

On 'evidence that it does not significantly reduce performance'.

This will be more theoretical, as all the systems I
have access to for testing the changes are VMs themselves,
and didn't run QEMU with KVM acceleration enabled.

(I'm happy to be proven wrong or get contrary opinions.)

I would think the performance differences from LTO
disabled should not significantly affect performance.

The main reason is, such optimizations (or lack of)
affect QEMU code, which is somewhat bypassed on
most performance-related scenarios / hot paths.
Some examples that come to mind:
- kvm
- vhost
- vfio

Now, in the cases where such bypasses are not used,
I'd hazard a guess that, being on the I/O path, the CPU
instruction/optimization gains are of lower magnitude
/less observable when compared to I/O delays.

The CPU optimization gains would be most noticeable
on CPU-bound paths, but in that case, the workload
is likely running qemu with KVM, in which the qemu
code is ideally not running as much as the workload.

And for CPU cases on non-KVM (i.e., TCG), performance
is not the point in general (e.g., usually enablement
or cross-arch testing, in which no native-instruction
performance is expected/required).

Again, this is more theoretical and hopefully educated
guesses; other feedback is welcome.

Perhaps if someone has access to bare-metal arm64,
ppc64el, s390x (or as possible), we could get some
tests/numbers -- but these are always workload and
qemu-config specific, so might not catch regressions
on some functionality/area.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test Summary:
------------

architecture: arm64
qemu-system binaries:
- arm64: FAIL 5/5 (PASS 5/5 with LTO disabled)
- armhf: FAIL 3/3 (PASS 3/3 with LTO disabled)
- s390x: FAIL 2/5 (unrelated to LTO)
- amd64: PASS 5/5
- ppc64le: PASS 5/5
- riscv64: FAIL 5/5

architecture: s390x
qemu-system binaries:
- arm64: FAIL 5/5 (PASS 3/3 with LTO disabled)
- armhf: FAIL 3/3 (PASS 3/3 with LTO disabled)
- s390x: FAIL 2/10 (FAIL 2/30 with LTO disabled; unrelated)
- riscv64: FAIL 3/3 (unrelated to LTO)

architecture: ppc64el
qemu-system binaries:
- arm64: FAIL 3/3 (PASS: 3/3 with LTO disabled)
- armhf: FAIL 3/3 (PASS: 3/3 with LTO disabled)
- riscv64: PASS 3/3

architecture: armhf
qemu-system binaries:
armhf: PASS 3/3

(more details/snippets in the attachment.)
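A small hypothetical helper can make pass/fail counts like the ones above repeatable (the `count_failures` name, the use of `timeout`, and the qemu invocation in the comment are assumptions for illustration, not part of the original test procedure):

```shell
# Hypothetical helper: run a command N times and report how many runs
# exit non-zero (a segfaulting qemu run exits non-zero).
count_failures() {
    n=$1; shift
    fails=0
    for _ in $(seq 1 "$n"); do
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    done
    echo "$fails/$n failed"
}

# e.g. (assumed invocation, bounded to 5 minutes per boot attempt):
#   count_failures 5 timeout --foreground 300 qemu-system-aarch64 ...
```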

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Changed in qemu (Ubuntu Jammy):
status: Incomplete → Confirmed
description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

TL;DR
+1 to Mauricio's suggestion to disable it on all but x86 right now with this upload

> I guess it's a matter of deciding whether we'd like
> to selectively disable LTO on archs with _reported_
> bugs (binaries: armhf, arm64, misc; on arm64, s390x,
> ppc64el), or generally disable LTO with the rationale
> this might be a toolchain issue with unknown impact.

Yeah, while we only have examples and reports on arm and riscv so far, IIRC when Paride tracked this down back in the day the upstream statement (sadly I can't find a link anymore to where this happened) was more or less "if at all, it only works on x86". Hence we decided it was better to disable it on all non-x86, based on the reports we already have and that statement.

I'll ping Paride to see if he still has a reference to where this came up, but it was quite some time ago, so it might be lost to time.

Revision history for this message
Paride Legovini (paride) wrote :

Hi, what Christian recalls is likely this sentence:

> "make check " fails on aarch64, ppc64le and s390x architecture

from the description of the linked bug [1].

Based on that I submitted the (now merged) Debian change that disables LTO on all the non-amd64 architectures, and I still believe it's the right thing to do. The TLDR is that I agree with Mauricio's rationale, but I'll try to elaborate how I see it.

We have to deal with incomplete information here (with perfect information we'd fix the underlying bug!). What we know is:

- the issue doesn't show up on amd64
- some binary package on non-amd64 are broken if LTO is used
- even on the affected configurations the crash doesn't always happen
- we don't know exactly which arch/binarypkg combinations are affected

I think that drawing a line between amd64 and everything else is justified by the disproportionately higher usage share (= real world testing) that the amd64 packages have. We can be quite confident in saying that amd64 is not affected (no bug reports); we can't be as confident in the other cases, because the testing we can do will always be narrow compared to the "on the field" testing that the amd64 packages are getting. It's a grey area. It is worth noting that the arm64-on-arm64 crash has been reported only months after the Jammy release, and by an Ubuntu developer (me).

Speaking of performance, Mauricio already brought a good point. On top of that it is true that we didn't measure the performance loss due to disabling LTO, but we also don't know the potential performance gain. In this case I think that the specific argument (qemu definitely crashes in some cases, making qemu unusable) wins over the general one (lto may bring some performance gains). Moreover we know that LTO didn't solve any specific performance issue (we're not regressing a bug, if you want).

This to say that in my opinion the risk (as in severity*likelihood of event) of qemu crashing on non-amd64 (high severity, medium likelihood) outweighs the risk of causing a *significant* performance regression by disabling LTO (medium severity, low likelihood).

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1952483

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Hi all,

I appreciate the effort that went into troubleshooting this bug, quite the journey.

The [test plan] section, however, as it currently stands, seems to be a bunch of copied & pasted command lines and text from other comments, and does not stand on its own. None of the commands shown there work as described.

Please use one of the test cases from comment #87, which are pretty good and direct to the point. I suggest case 3: arm64 on arm64, as that might be the most common one, and it also replicates what Paride was trying in one of the comments.

I was able to use it on my raspberry pi4 and got qemu to crash (just had to lower the memory from 4G to 1G):

$ qemu-system-aarch64 -machine virt -nographic -smp 4 -m 1G -cpu cortex-a57 -pflash flash0.img -pflash flash1.img -drive file=jammy-server-cloudimg-arm64.img,format=qcow2,id=drive0,if=none -device virtio-blk-device,drive=drive0
WARNING: Image format was not specified for 'flash0.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
WARNING: Image format was not specified for 'flash1.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
BdsDxe: failed to load Boot0001 "UEFI Misc Device" from VenHw(93E34C7E-B50E-11DF-9223-2443DFD72085,00): Not Found
qemu-system-aarch64: GLib: g_source_ref: assertion 'source != NULL' failed
Segmentation fault (core dumped)
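As an aside, those format-probing warnings can be avoided by passing the flash images with an explicit raw format via `-drive if=pflash` instead of the `-pflash` shorthand (an equivalent invocation, sketched under the same setup as above):

```shell
qemu-system-aarch64 -machine virt -nographic -smp 4 -m 1G -cpu cortex-a57 \
    -drive if=pflash,format=raw,file=flash0.img \
    -drive if=pflash,format=raw,file=flash1.img \
    -drive file=jammy-server-cloudimg-arm64.img,format=qcow2,id=drive0,if=none \
    -device virtio-blk-device,drive=drive0
```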

Revision history for this message
Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello Tommy, or anyone else affected,

Accepted qemu into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:6.2+dfsg-2ubuntu6.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Jammy):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (qemu/1:6.2+dfsg-2ubuntu6.7)

All autopkgtests for the newly accepted qemu (1:6.2+dfsg-2ubuntu6.7) for jammy have finished running.
The following regressions have been reported in tests triggered by the package:

initramfs-tools/0.140ubuntu13.1 (amd64)
ubuntu-image/2.2+22.04ubuntu3 (amd64, arm64, ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/jammy/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

Updated the [Test Plan] section.

description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Verification done on jammy.

Setup:
---

$ openstack server create --image ubuntu/ubuntu-jammy-22.04-arm64-server-20230302-disk1.img --flavor cpu4-ram4-disk10 --key-name mfo_canonistack-bos01 mfo-jammy-arm64

$ ssh ...

$ uname -m
aarch64

$ lsb_release -cs
jammy

...

wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-arm64.img

sudo apt install --yes --no-install-recommends qemu-system-arm qemu-efi-aarch64 ipxe-qemu

cp /usr/share/AAVMF/AAVMF_CODE.fd flash0.img
cp /usr/share/AAVMF/AAVMF_VARS.fd flash1.img

qemu-system-aarch64 \
   -machine virt -nographic \
   -smp 4 -m 2G \
   -cpu cortex-a57 \
   -pflash flash0.img -pflash flash1.img \
   -drive file=jammy-server-cloudimg-arm64.img,format=qcow2,id=drive0,if=none \
   -device virtio-blk-device,drive=drive0

Before:
---

$ dpkg -s qemu-system-arm | grep Version:
Version: 1:6.2+dfsg-2ubuntu6.6

$ qemu-system-aarch64 \
...
BdsDxe: failed to load Boot0001 "UEFI Misc Device" from VenHw(93E34C7E-B50E-11DF-9223-2443DFD72085,00): Not Found
qemu-system-aarch64: GLib: g_source_ref: assertion 'source != NULL' failed

After:
---

$ sudo add-apt-repository -yp proposed
$ sudo apt install -y qemu-system-arm

$ dpkg -s qemu-system-arm | grep Version:
Version: 1:6.2+dfsg-2ubuntu6.7

$ qemu-system-aarch64 \
...
BdsDxe: failed to load Boot0001 "UEFI Misc Device" from VenHw(93E34C7E-B50E-11DF-9223-2443DFD72085,00): Not Found
BdsDxe: loading Boot0002 "UEFI Misc Device 2" from VenHw(837DCA9E-E874-4D82-B29A-23FE0E23D1E2,003E000A00000000)
BdsDxe: starting Boot0002 "UEFI Misc Device 2" from VenHw(837DCA9E-E874-4D82-B29A-23FE0E23D1E2,003E000A00000000)
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x411fd070]
[ 0.000000] Linux version 5.15.0-67-generic (buildd@bos02-arm64-073) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 (Ubuntu 5.15.0-67.74-generic 5.15.85)
...
[ 150.121404] systemd[1]: Detected virtualization qemu.
[ 150.125509] systemd[1]: Detected architecture arm64.
...

tags: added: verification-done-jammy
removed: verification-needed-jammy
tags: added: verification-done
removed: verification-needed
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

autopkgtests regressions cleared; unrelated to these changes.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I verified the test results and am satisfied that they show the executed planned test case, and that the results are correct.

The package built correctly in all architectures and Ubuntu releases it was meant for.

There are no DEP8 regressions at the moment.

There is no SRU freeze ongoing at the moment.

There is no halted phasing on the previous update.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:6.2+dfsg-2ubuntu6.7

---------------
qemu (1:6.2+dfsg-2ubuntu6.7) jammy; urgency=medium

  [ Brett Milford ]
  * d/p/u/lp1994002-migration-Read-state-once.patch: Fix for libvirt
    error 'migration was active, but no RAM info was set' (LP: #1994002)

  [ Mauricio Faria de Oliveira ]
  * d/p/u/lp2009048-vfio_map_dma_einval_amd_iommu_1tb.patch: Add hint
    to VFIO_MAP_DMA error on AMD IOMMU for VMs with ~1TB+ RAM (LP: #2009048)
  * d/rules: move "Disable LTO on non-amd64" before buildflags.mk on Jammy.

  [ Michal Maloszewski ]
  * d/rules: Disable LTO on non-amd64 architectures to prevent QEMU
    coroutines from failing (LP: #1921664)

 -- Mauricio Faria de Oliveira <email address hidden> Mon, 06 Mar 2023 17:00:46 -0300
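The d/rules change described in the changelog can be sketched roughly as follows (a sketch reconstructed from the changelog entries; the exact variable handling and placement in the real packaging may differ):

```makefile
# Disable LTO on non-amd64 architectures (LP: #1921664). Must be set
# (and exported, so it reaches all child processes of the build)
# before buildflags.mk is included.
ifneq ($(DEB_HOST_ARCH),amd64)
export DEB_BUILD_MAINT_OPTIONS += optimize=-lto
endif
```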

Changed in qemu (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
