qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images

Bug #1805256 reported by dann frazier
This bug affects 2 people
Affects            Status        Importance  Assigned to  Milestone
QEMU               Fix Released  Undecided   Unassigned
kunpeng920         Fix Released  Undecided   Unassigned
Ubuntu-18.04       Fix Released  Undecided   Unassigned
Ubuntu-19.10       Fix Released  Undecided   Unassigned
Ubuntu-20.04       Fix Released  Undecided   Unassigned
Upstream-kernel    Invalid       Undecided   Unassigned
qemu (Ubuntu)      Fix Released  Medium      Unassigned
  Bionic           Fix Released  Medium      Unassigned
  Eoan             Fix Released  Medium      Unassigned
  Focal            Fix Released  Medium      Unassigned

Bug Description

SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to revert the change. Since then there has been another update, and I am now retrying the SRU.

After discussing extensively with @paelzer (and @dannf as a reviewer), Christian and I agreed that we should scope this SRU as aarch64-only AND I was much, much more conservative about what is being changed in the QEMU AIO code.

The new code has been tested against the initial test case and against the new one that regressed on Bionic. More information (about tests and discussion) can be found in the MR at ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

BIONIC REGRESSION BUG:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

[Impact]

* QEMU locking primitives might face a race condition in QEMU Async I/O bottom halves scheduling. This leads to a deadlock, making either QEMU or one of its tools hang indefinitely.

[Test Case]

INITIAL

* qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

Hangs indefinitely in approximately 30% of the runs on aarch64.

[Regression Potential]

* This is a change to a core part of QEMU: the AIO scheduling. It works like a "kernel" scheduler: whereas the kernel schedules OS tasks, the QEMU AIO code is responsible for scheduling QEMU coroutines and event-listener callbacks.

* There was a long discussion upstream about primitives and aarch64. After quite some time Paolo released this patch and it solves the issue. Tested platforms were amd64 and aarch64, based on his commit log.

* Christian suggests that this fix stay a little longer in -proposed to make sure it won't cause any regressions.

* dannf suggests we also check for performance regressions, e.g. how long it takes to convert a cloud image on high-core-count systems.

BIONIC REGRESSED ISSUE

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

[Other Info]

 * Original description below:

Command:

qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

Hangs indefinitely in approximately 30% of the runs.

----

Workaround:

qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

Run "qemu-img convert" with "a single coroutine" to avoid this issue.

----

(gdb) thread 1
...
(gdb) bt
#0 0x0000ffffbf1ad81c in __GI_ppoll
#1 0x0000aaaaaabcf73c in ppoll
#2 qemu_poll_ns
#3 0x0000aaaaaabd0764 in os_host_main_loop_wait
#4 main_loop_wait
...

(gdb) thread 2
...
(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()

(gdb) thread 3
...
(gdb) bt
#0 0x0000ffffbf11aa20 in __GI___sigtimedwait
#1 0x0000ffffbf2671b4 in __sigwait
#2 0x0000aaaaaabd1ddc in sigwait_compat
#3 0x0000aaaaaabd34c8 in qemu_thread_start
#4 0x0000ffffbf25c880 in start_thread
#5 0x0000ffffbf1b6b9c in thread_start

----

(gdb) run
Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
./disk01.ext4.qcow2 ./output.qcow2

[New Thread 0xffffbec5ad90 (LWP 72839)]
[New Thread 0xffffbe459d90 (LWP 72840)]
[New Thread 0xffffbdb57d90 (LWP 72841)]
[New Thread 0xffffacac9d90 (LWP 72859)]
[New Thread 0xffffa7ffed90 (LWP 72860)]
[New Thread 0xffffa77fdd90 (LWP 72861)]
[New Thread 0xffffa6ffcd90 (LWP 72862)]
[New Thread 0xffffa67fbd90 (LWP 72863)]
[New Thread 0xffffa5ffad90 (LWP 72864)]

[Thread 0xffffa5ffad90 (LWP 72864) exited]
[Thread 0xffffa6ffcd90 (LWP 72862) exited]
[Thread 0xffffa77fdd90 (LWP 72861) exited]
[Thread 0xffffbdb57d90 (LWP 72841) exited]
[Thread 0xffffa67fbd90 (LWP 72863) exited]
[Thread 0xffffacac9d90 (LWP 72859) exited]
[Thread 0xffffa7ffed90 (LWP 72860) exited]

<HUNG w/ 3 threads in the stack traces shown before>
"""

All the remaining tasks are blocked in a system call, so no task is left to call
qemu_futex_wake() to unblock thread #2 (in futex()), which would in turn unblock
thread #1 (doing poll() on a pipe shared with thread #2).

Those 7 threads exit before disk conversion is complete (sometimes in
the beginning, sometimes at the end).

----

On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img frequently hangs (~50% of the time) with this command:

qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This qcow2->qcow2 conversion happens to be something uvtool does every time it fetches images.

Once hung, attaching gdb gives the following backtrace:

(gdb) bt
#0 0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
    timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
    __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
    timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
    at util/main-loop.c:233
#4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5 0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
#6 img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
#7 0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

Reproduced w/ latest QEMU git (@ 53744e0a182)


Revision history for this message
John Snow (jnsnow) wrote :

Hi, can you do a `thread apply all bt` instead? If I were to bet, we're probably waiting for some slow call like lseek to return in another thread.

What filesystem/blockdevice is involved here?

Revision history for this message
dann frazier (dannf) wrote :

ext4 filesystem, SATA drive:

(gdb) thread apply all bt

Thread 3 (Thread 0xffff9bffc9a0 (LWP 9015)):
#0 0x0000ffffaaa462cc in __GI___sigtimedwait (set=<optimized out>,
    set@entry=0xaaaae725c070, info=info@entry=0xffff9bffbf18,
    timeout=0x3ff0000000000001, timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:42
#1 0x0000ffffaab7dfac in __sigwait (set=set@entry=0xaaaae725c070,
    sig=sig@entry=0xffff9bffbff4) at ../sysdeps/unix/sysv/linux/sigwait.c:28
#2 0x0000aaaad998a628 in sigwait_compat (opaque=0xaaaae725c070)
    at util/compatfd.c:36
#3 0x0000aaaad998bce0 in qemu_thread_start (args=<optimized out>)
    at util/qemu-thread-posix.c:498
#4 0x0000ffffaab73088 in start_thread (arg=0xffffc528531f)
    at pthread_create.c:463
#5 0x0000ffffaaae34ec in thread_start ()
    at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 (Thread 0xffffa0e779a0 (LWP 9014)):
#0 syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1 0x0000aaaad998c9e8 in qemu_futex_wait (val=<optimized out>, f=<optimized out>)
    at /home/ubuntu/qemu/include/qemu/futex.h:29
#2 qemu_event_wait (ev=ev@entry=0xaaaad9a091c0 <rcu_call_ready_event>)
    at util/qemu-thread-posix.c:442
#3 0x0000aaaad99a6834 in call_rcu_thread (opaque=<optimized out>)
    at util/rcu.c:261
#4 0x0000aaaad998bce0 in qemu_thread_start (args=<optimized out>)
    at util/qemu-thread-posix.c:498
#5 0x0000ffffaab73088 in start_thread (arg=0xffffc528542f)
    at pthread_create.c:463
#6 0x0000ffffaaae34ec in thread_start ()
    at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0xffffa0fa8010 (LWP 9013)):
#0 0x0000ffffaaada154 in __GI_ppoll (fds=0xaaaae7291dc0, nfds=187650771816320,
    timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc52852e0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x0000aaaad9987f00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
    __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
    timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3 0x0000aaaad9988f80 in os_host_main_loop_wait (timeout=-1)
    at util/main-loop.c:233
#4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5 0x0000aaaad98b7a30 in convert_do_copy (s=0xffffc52854e8) at qemu-img.c:1980
#6 img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
#7 0x0000aaaad98b033c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

Alex Bennée (ajbennee)
tags: added: qemu-img
Revision history for this message
贞贵李 (lizhengui) wrote :

Hi, I also found a problem that qemu-img convert hands in ARM.

The convert command line is "qemu-img convert -f qcow2 -O raw disk.qcow2 disk.raw ".

The bt is below:

Thread 2 (Thread 0x40000b776e50 (LWP 27215)):
#0 0x000040000a3f2994 in sigtimedwait () from /lib64/libc.so.6
#1 0x000040000a39c60c in sigwait () from /lib64/libpthread.so.0
#2 0x0000aaaaaae82610 in sigwait_compat (opaque=0xaaaac5163b00) at util/compatfd.c:37
#3 0x0000aaaaaae85038 in qemu_thread_start (args=args@entry=0xaaaac5163b90) at util/qemu_thread_posix.c:496
#4 0x000040000a3918bc in start_thread () from /lib64/libpthread.so.0
#5 0x000040000a492b2c in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x40000b573370 (LWP 27214)):
#0 0x000040000a489020 in ppoll () from /lib64/libc.so.6
#1 0x0000aaaaaadaefc0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at qemu_timer.c:391
#3 0x0000aaaaaadae014 in os_host_main_loop_wait (timeout=<optimized out>) at main_loop.c:272
#4 0x0000aaaaaadae190 in main_loop_wait (nonblocking=<optimized out>) at main_loop.c:534
#5 0x0000aaaaaad97be0 in convert_do_copy (s=0xffffdc32eb48) at qemu-img.c:1923
#6 0x0000aaaaaada2d70 in img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2414
#7 0x0000aaaaaad99ac4 in main (argc=7, argv=<optimized out>) at qemu-img.c:5305

Did you find the cause of the problem and fix it? Thanks for your reply!

Revision history for this message
贞贵李 (lizhengui) wrote :

Sorry, I made a spelling mistake here ("Hi, I also found a problem that qemu-img convert hands in ARM."). It should read "I also found a problem that qemu-img convert hangs in ARM".

dann frazier (dannf)
Changed in qemu:
status: New → Confirmed
Revision history for this message
dann frazier (dannf) wrote :

No, sorry - this bug still persists w/ latest upstream (@ afccfc0). I found a report of similar symptoms:

  https://patchwork.kernel.org/patch/10047341/
  https://bugzilla.redhat.com/show_bug.cgi?id=1524770#c13

To be clear, ^ is already fixed upstream, so it is not the *same* issue - but perhaps related.

Revision history for this message
贞贵李 (lizhengui) wrote :

Do you have any good ideas about it? Maybe a missing memory barrier somewhere is causing it?

Revision history for this message
贞贵李 (lizhengui) wrote :

frazier, did you find the conditions that reliably make this problem appear?

Revision history for this message
贞贵李 (lizhengui) wrote :

I can reproduce this problem with qemu.git/master. It still exists in qemu.git/master. I found that when an I/O request completes in a worker thread and it wants to call aio_notify to wake up main_loop, it finds that ctx->notify_me has already been cleared to 0 by main_loop in aio_ctx_check, via atomic_and(&ctx->notify_me, ~1). So the worker thread won't write the eventfd to notify main_loop. If such a scene happens, the main_loop will hang:

    main loop                            worker thread1                           worker thread2
    ---------------------------------------------------------------------------------------------
    qemu_poll_ns                         aio_worker
                                         qemu_bh_schedule(pool->completion_bh)
    glib_pollfds_poll
    g_main_context_check
    aio_ctx_check
    atomic_and(&ctx->notify_me, ~1)                                               aio_worker
                                                                                  qemu_bh_schedule(pool->completion_bh)
    /* do something for event */
    qemu_poll_ns
    /* hangs !!! */

As we know, ctx->notify_me is accessed by both the worker threads and the main loop. I think we should add lock protection for ctx->notify_me to avoid this happening. What do you think?
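
For reference, here are simplified excerpts of the three code paths being described (the full versions are quoted later in this thread); the worker's wake-up is skipped when the main loop has already cleared notify_me:

    /* main loop, before polling (aio_ctx_prepare): */
    atomic_or(&ctx->notify_me, 1);     /* "notify me via the eventfd while I poll" */

    /* main loop, after polling (aio_ctx_check): */
    atomic_and(&ctx->notify_me, ~1);   /* "I am dispatching now, no notification needed" */

    /* worker thread (aio_notify): */
    smp_mb();
    if (ctx->notify_me) {              /* if this reads 0, the wake-up below is skipped */
        event_notifier_set(&ctx->notifier);
    }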

dann frazier (dannf)
Changed in qemu (Ubuntu):
status: New → Confirmed
Changed in qemu (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
importance: Undecided → Medium
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Liz,

I'll try to reproduce this issue in a Cortex-A53 aarch64 real environment (w/ 24 HW threads) AND in a virtual environment w/ lots of vCPUs... but, if it's a barrier missing - or the lack of atomicity and/or ordering in a primitive - then, I'm afraid the context switch in between vCPUs might not be the same as in real CPUs (IPIs are sent and handled differently and host kernel delays IPI delivery because of its own callbacks, before scheduling, etc...) and I could need a qemu dump from your environment.

Would that be feasible? Can you reproduce this nowadays? This bug has aged a little, so I'm not sure!

Could you provide me the dump generated with the latest package available for your Ubuntu version? That way I have the debug symbols to work with.

Meanwhile, I'll be trying to reproduce on my side.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Oh, never mind about the virtual environment test; I just remembered we don't have 2nd-level KVM for aarch64 yet (at least on ARMv8 implementing the virt extension). I'll try to reproduce in the real environment only.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Alright, I couldn't reproduce this yet. I'm running the same test case on a 24-core box and causing lots of context switches and CPU migrations in parallel (trying to exhaust the logic).

I will let this run for some time to check.

Unfortunately this could be related to QEMU AIO BH locking/primitives and cache coherency in the HW in question (specs: https://en.wikichip.org/wiki/hisilicon/kunpeng/hi1616):

l1$ size 8 MiB
l1d$ size 4 MiB
l1i$ size 4 MiB
l2$ size 32 MiB
l3$ size 64 MiB

like for example when having 2 threads in different NUMA domains, or some other situation.

I can't simulate the same since I have a SOC with:

Cortex-A53 MPCore 24cores,

L1 I/D=32KB/32KB
L2 =256KB
L3 =4MB

and I'm not even close to L1/L2/L3 cache numbers from D06 =o).

Just got a note that I'll be able to reproduce this in the real HW, will get back soon with real gdb debugging.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (4.5 KiB)

Alright, with a D06 aarch64 machine I was able to reproduce it after 8 attempts. I'll debug it today and provide feedback on my findings.

(gdb) bt full
#0 0x0000ffffb0b2181c in __GI_ppoll (fds=0xaaaace5ab770, nfds=4, timeout=<optimized out>, timeout@entry=0x0,
    sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        _x3tmp = 0
        _x0tmp = 187650583213936
        _x0 = 187650583213936
        _x3 = 0
        _x4tmp = 8
        _x1tmp = 4
        _x1 = 4
        _x4 = 8
        _x2tmp = <optimized out>
        _x2 = 0
        _x8 = 73
        _sys_result = <optimized out>
        _sys_result = <optimized out>
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 0, tv_nsec = 187650583137792}
#1 0x0000aaaacd2a773c in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
No locals.
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at ./util/qemu-timer.c:322
No locals.
#3 0x0000aaaacd2a8764 in os_host_main_loop_wait (timeout=-1) at ./util/main-loop.c:233
        context = 0xaaaace599d90
        ret = <optimized out>
        context = <optimized out>
        ret = <optimized out>
#4 main_loop_wait (nonblocking=<optimized out>) at ./util/main-loop.c:497
        ret = <optimized out>
        timeout = 4294967295
        timeout_ns = <optimized out>
#5 0x0000aaaacd1df454 in convert_do_copy (s=0xfffff9b2b1d8) at ./qemu-img.c:1981
        ret = <optimized out>
        i = <optimized out>
        n = <optimized out>
        sector_num = <optimized out>
        ret = <optimized out>
        i = <optimized out>
        n = <optimized out>
        sector_num = <optimized out>
#6 img_convert (argc=<optimized out>, argv=<optimized out>) at ./qemu-img.c:2457
        c = <optimized out>
        bs_i = <optimized out>
        flags = 16898
        src_flags = 0
        fmt = 0xfffff9b2bad1 "qcow2"
        out_fmt = <optimized out>
        cache = 0xaaaacd2cb1c8 "unsafe"
        src_cache = 0xaaaacd2ca9c0 "writeback"
        out_baseimg = <optimized out>
        out_filename = <optimized out>
        out_baseimg_param = <optimized out>
        snapshot_name = 0x0
        drv = <optimized out>
        proto_drv = <optimized out>
        bdi = {cluster_size = 65536, vm_state_offset = 32212254720, is_dirty = false, unallocated_blocks_are_zero = true,
          needs_compressed_writes = false}
        out_bs = <optimized out>
        opts = 0xaaaace5ab390
        sn_opts = 0x0
        create_opts = 0xaaaace5ab0c0
        open_opts = <optimized out>
        options = 0x0
        local_err = 0x0
        writethrough = false
        src_writethrough = false
        quiet = <optimized out>
        image_opts = false
        skip_create = false
        progress = <optimized out>
        tgt_image_opts = false
        ret = <optimized out>
        force_share = false
        explict_min_sparse = false
        s = {src = 0xaaaace577240, src_sectors = 0xaaaace577300, src_num = 1, total_sectors = 62914560,allocated_sectors = 9572096, allocated_done = 6541440, sector_num = 8863744, wr_offs = 885...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (10.7 KiB)

Alright,

I'm still investigating this but wanted to share some findings... I haven't got a kernel dump yet after the task is frozen, I have analyzed only the userland part of it (although I have checked if code was running inside kernel with perf cycles:u/cycles:k at some point).

The big picture is this: Whenever qemu-img hangs, we have 3 hung tasks basically with these stacks:

----

THREAD #1
__GI_ppoll (../sysdeps/unix/sysv/linux/ppoll.c:39)
ppoll (/usr/include/aarch64-linux-gnu/bits/poll2.h:77)
qemu_poll_ns (./util/qemu-timer.c:322)
os_host_main_loop_wait (./util/main-loop.c:233)
main_loop_wait (./util/main-loop.c:497)
convert_do_copy (./qemu-img.c:1981)
img_convert (./qemu-img.c:2457)
main (./qemu-img.c:4976)

got stack traces:

./33293/stack ./33293/stack
[<0>] __switch_to+0xc0/0x218 [<0>] __switch_to+0xc0/0x218
[<0>] ptrace_stop+0x148/0x2b0 [<0>] do_sys_poll+0x508/0x5c0
[<0>] get_signal+0x5a4/0x730 [<0>] __arm64_sys_ppoll+0xc0/0x118
[<0>] do_notify_resume+0x158/0x358 [<0>] el0_svc_common+0xa0/0x168
[<0>] work_pending+0x8/0x10 [<0>] el0_svc_handler+0x38/0x78
                                       [<0>] el0_svc+0x8/0xc

root@d06-1:~$ perf record -F 9999 -e cycles:u -p 33293 -- sleep 10
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.871 MB perf.data (48730 samples) ]

root@d06-1:~$ perf report --stdio
# Overhead Command Shared Object Symbol
# ........ ........ .................. ......................
#
    37.82% qemu-img libc-2.29.so [.] 0x00000000000df710
    21.81% qemu-img [unknown] [k] 0xffff000010099504
    14.23% qemu-img [unknown] [k] 0xffff000010085dc0
     9.13% qemu-img [unknown] [k] 0xffff00001008fff8
     6.47% qemu-img libc-2.29.so [.] 0x00000000000df708
     5.69% qemu-img qemu-img [.] qemu_event_reset
     2.57% qemu-img libc-2.29.so [.] 0x00000000000df678
     0.63% qemu-img libc-2.29.so [.] 0x00000000000df700
     0.49% qemu-img libc-2.29.so [.] __sigtimedwait
     0.42% qemu-img libpthread-2.29.so [.] __libc_sigwait

----

THREAD #3
__GI___sigtimedwait (../sysdeps/unix/sysv/linux/sigtimedwait.c:29)
__sigwait (linux/sigwait.c:28)
qemu_thread_start (./util/qemu-thread-posix.c:498)
start_thread (pthread_create.c:486)
thread_start (linux/aarch64/clone.S:78)

./33303/stack ./33303/stack
[<0>] __switch_to+0xc0/0x218 [<0>] __switch_to+0xc0/0x218
[<0>] ptrace_stop+0x148/0x2b0 [<0>] do_sigtimedwait.isra.9+0x194/0x288
[<0>] get_signal+0x5a4/0x730 [<0>] __arm64_sys_rt_sigtimedwait+0xac/0x110
[<0>] do_notify_resume+0x158/0x358 [<0>] el0_svc_common+0xa0/0x168
[<0>] work_pending+0x8/0x10 [<0>] el0_svc_handler+0x38/0x78
                                       [<0>] el0_svc+0x8/0xc

root@d06-1:~$ perf record -F 9999 -e cycles:u -p 33303 -- sleep 10
[ perf record: Woken up 6 times to write data ]
[ perf record:...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (8.9 KiB)

Alright, here is what is happening:

Whenever the program is stuck, thread #2's backtrace is this:

(gdb) bt
#0 syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1 0x0000aaaaaabd41b0 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at ./util/qemu-thread-posix.c:438
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:442
#3 0x0000aaaaaabee03c in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:261
#4 0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#5 0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#6 0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Meaning that code is waiting for a futex inside kernel.

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

The QemuEvent "rcu_call_ready_event->value" is set to INT_MAX and I don't know why yet.

rcu_call_ready_event->value is only touched by:

qemu_event_init() -> bool init ? EV_SET : EV_FREE
qemu_event_reset() -> atomic_or(&ev->value, EV_FREE)
qemu_event_set() -> atomic_xchg(&ev->value, EV_SET)
qemu_event_wait() -> atomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY)

And there should be no 0x7fff value for "ev->value".
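
For context, a condensed sketch of how these primitives work on the Linux futex path (paraphrased and simplified from util/qemu-thread-posix.c of roughly this era; barriers and assertions omitted, so details may differ):

    #define EV_SET   0
    #define EV_FREE  1
    #define EV_BUSY -1

    void qemu_event_set(QemuEvent *ev)
    {
        /* Publish the event; if a waiter announced itself (EV_BUSY), wake it up. */
        if (atomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
            qemu_futex_wake(ev, INT_MAX);
        }
    }

    void qemu_event_wait(QemuEvent *ev)
    {
        unsigned value = atomic_read(&ev->value);
        if (value != EV_SET) {
            if (value == EV_FREE) {
                /* Tell qemu_event_set() that there is now a waiter to wake. */
                if (atomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
                    return;
                }
            }
            /* Sleep in futex() until value changes away from EV_BUSY. */
            qemu_futex_wait(ev, EV_BUSY);
        }
    }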

qemu_event_init() is the one initializing the global:

    static QemuEvent rcu_call_ready_event;

and it is called by "rcu_init_complete()" which is called by "rcu_init()":

    static void __attribute__((__constructor__)) rcu_init(void)

a constructor function.

So, "fixing" this issue by:

    (gdb) print rcu_call_ready_event
    $8 = {value = 4294967295, initialized = true}

    (gdb) watch rcu_call_ready_event
    Hardware watchpoint 1: rcu_call_ready_event

    (gdb) set rcu_call_ready_event.initialized = 1

    (gdb) set rcu_call_ready_event.value = 0

and note that I added a watchpoint to rcu_call_ready_event global:

<HANG>

Thread 1 "qemu-img" received signal SIGINT, Interrupt.
(gdb) thread 2
[Switching to thread 2 (Thread 0xffffbec61d90 (LWP 33625))]

(gdb) bt
#0 0x0000aaaaaabd4110 in qemu_event_reset (ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>)
#1 0x0000aaaaaabedff8 in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:255
#2 0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#3 0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#4 0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) print rcu_call_ready_event
$9 = {value = 0, initialized = true}

You can see I advanced in the qemu_event_{reset,set,wait} logic.

    (gdb) disassemble /m 0x0000aaaaaabd4110
    Dump of assembler code for function qemu_event_reset:
    408 in ./util/qemu-thread-posix.c

    409 in ./util/qemu-thread-posix.c

    410 in ./util/qemu-thread-posix.c
    411 in ./util/qemu-thread-posix.c
       0x0000aaaaaabd40f0 <+0>: ldrb w1, [x0, #4]
       0x0000aaaaaabd40f4 <+4>: cbz w1, 0xaaaaaabd411c <qemu_event_reset+44>
       0x0000aaaaaabd411c <+44>: stp x29, x30, [sp, #-16]!
       0x0000aaaaaabd...

Read more...

summary: - qemu-img hangs on high core count ARM system
+ qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting
+ images
Changed in qemu:
status: Confirmed → In Progress
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

QEMU BUG: #1

Alright, one of the issues is (according to comment #14):

"""
Meaning that code is waiting for a futex inside kernel.

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

The QemuEvent "rcu_call_ready_event->value" is set to INT_MAX and I don't know why yet.

rcu_call_ready_event->value is only touched by:

qemu_event_init() -> bool init ? EV_SET : EV_FREE
qemu_event_reset() -> atomic_or(&ev->value, EV_FREE)
qemu_event_set() -> atomic_xchg(&ev->value, EV_SET)
qemu_event_wait() -> atomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY)'
"""

Now I know why rcu_call_ready_event->value is set to INT_MAX. That is because in the following declaration:

struct QemuEvent {
#ifndef __linux__
    pthread_mutex_t lock;
    pthread_cond_t cond;
#endif
    unsigned value;
    bool initialized;
};

#define EV_SET 0
#define EV_FREE 1
#define EV_BUSY -1

"value" is declared as unsigned, but EV_BUSY sets it to -1, and, according to the Two's Complement Operation (https://en.wikipedia.org/wiki/Two%27s_complement), it will be INT_MAX (4294967295).

So this is the "first bug" found AND it is definitely funny that this hasn't been seen in other architectures at all... I can reproduce it at will.
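
As a standalone illustration (not QEMU code) of the conversion described above:

    #include <stdio.h>

    int main(void)
    {
        unsigned value = -1;    /* same conversion as storing EV_BUSY (-1) into the unsigned field */
        printf("%u\n", value);  /* prints 4294967295 */
        return 0;
    }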

With that said, it seems that there is still another issue causing (less frequently):

(gdb) thread 2
[Switching to thread 2 (Thread 0xffffbec5ad90 (LWP 17459))]
#0 syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
38 ../sysdeps/unix/sysv/linux/aarch64/syscall.S: No such file or directory.
(gdb) bt
#0 syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1 0x0000aaaaaabd41cc in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at ./util/qemu-thread-posix.c:438
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:442
#3 0x0000aaaaaabed05c in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:261
#4 0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#5 0x0000ffffbf25c880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#6 0x0000ffffbf1b6b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 appears to be stuck at the "futex()" kernel syscall (as if the FUTEX_WAKE never happened and/or wasn't atomic for this arch/binary). Need to investigate this also.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote : qemu_futex_wait() lockups in ARM64: 2 possible issues

Paolo,

While debugging hangs on ARM64 when doing a simple:

qemu-img convert -f qcow2 -O qcow2 file.qcow2 output.qcow2

I might have found 2 issues which I'd like you to review, if possible.

ISSUE #1
========

I've caught the following stack trace after a hang in qemu-img convert:

(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():

- EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
unsigned), is a two's complement, making argument into a INT_MAX when
that's not what is expected (unless I missed something).

*** If that is the case, unsure if you, Paolo, prefer declaring
*(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
here ***

BUG: description:
https://bugs.launchpad.net/qemu/+bug/1805256/comments/15

========
ISSUE #2
========

I found this when debugging lockups while in futex() in a specific ARM64
server - https://bugs.launchpad.net/qemu/+bug/1805256 - which I'm still
investigating.

After fixing the issue above, I'm still getting stuck into:

qemu_event_wait() -> qemu_futex_wait()

***
As if qemu_event_set() had run before qemu_futex_wait() ever started running
***

The Other threads are waiting for poll() on a PIPE coming from this
stuck thread (thread #1), and in sigwait():

(gdb) thread 1
...
(gdb) bt
#0 0x0000ffffbf1ad81c in __GI_ppoll
#1 0x0000aaaaaabcf73c in ppoll
#2 qemu_poll_ns
#3 0x0000aaaaaabd0764 in os_host_main_loop_wait
#4 main_loop_wait
...

(gdb) thread 2
...
(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()

(gdb) thread 3
...
(gdb) bt
#0 0x0000ffffbf11aa20 in __GI___sigtimedwait
#1 0x0000ffffbf2671b4 in __sigwait
#2 0x0000aaaaaabd1ddc in sigwait_compat
#3 0x0000aaaaaabd34c8 in qemu_thread_start
#4 0x0000ffffbf25c880 in start_thread
#5 0x0000ffffbf1b6b9c in thread_start

QUESTION:

- Should qemu_event_set() check return code from
qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

Tks in advance,

Rafael D. Tinoco

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

In comment #14, please disregard the second half of the issue, related to:

       0x0000aaaaaabd4100 <+16>: cbz w1, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4104 <+20>: ret
       0x0000aaaaaabd4108 <+24>: ldaxr w1, [x0]
       0x0000aaaaaabd410c <+28>: orr w1, w1, #0x1
    => 0x0000aaaaaabd4110 <+32>: stlxr w2, w1, [x0]
       0x0000aaaaaabd4114 <+36>: cbnz w2, 0xaaaaaabd4108

Duh! This is just the regular load/or/store logic for atomic_or() inside qemu_event_reset().
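
(For reference, that ldaxr/orr/stlxr sequence is the load-linked/store-conditional retry loop one would expect the compiler to emit on ARMv8.0 for a sequentially consistent fetch-or, i.e. roughly the following, assuming atomic_or() maps to __atomic_fetch_or with __ATOMIC_SEQ_CST the same way the atomic_add/atomic_sub definitions quoted later in this thread do:)

    __atomic_fetch_or(&ev->value, EV_FREE, __ATOMIC_SEQ_CST);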

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Quick update...

> value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():
>
> - EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
> unsigned), is a two's complement, making argument into a INT_MAX when
> that's not what is expected (unless I missed something).
>
> *** If that is the case, unsure if you, Paolo, prefer declaring
> *(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
> here ***
>
> BUG: description:
> https://bugs.launchpad.net/qemu/+bug/1805256/comments/15

I realized this might be intentional, but, still, I tried:

    https://pastebin.ubuntu.com/p/6rkkY6fJdm/

looking for anything that could have misbehaved on arm64 (especially
concerned about casting and type conversions between the functions).

> QUESTION:
>
> - Should qemu_event_set() check return code from
> qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
> waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

And I also tried:

- qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
+ while(qemu_futex(pval, FUTEX_WAKE, val, NULL, NULL, 0) == 0)
+ continue;

and it made little difference (took way more time for me to reproduce
the issue though):

"""
(gdb) run
Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
./disk01.ext4.qcow2 ./output.qcow2

[New Thread 0xffffbec5ad90 (LWP 72839)]
[New Thread 0xffffbe459d90 (LWP 72840)]
[New Thread 0xffffbdb57d90 (LWP 72841)]
[New Thread 0xffffacac9d90 (LWP 72859)]
[New Thread 0xffffa7ffed90 (LWP 72860)]
[New Thread 0xffffa77fdd90 (LWP 72861)]
[New Thread 0xffffa6ffcd90 (LWP 72862)]
[New Thread 0xffffa67fbd90 (LWP 72863)]
[New Thread 0xffffa5ffad90 (LWP 72864)]

[Thread 0xffffa5ffad90 (LWP 72864) exited]
[Thread 0xffffa6ffcd90 (LWP 72862) exited]
[Thread 0xffffa77fdd90 (LWP 72861) exited]
[Thread 0xffffbdb57d90 (LWP 72841) exited]
[Thread 0xffffa67fbd90 (LWP 72863) exited]
[Thread 0xffffacac9d90 (LWP 72859) exited]
[Thread 0xffffa7ffed90 (LWP 72860) exited]

<HUNG w/ 3 threads in the stack trace showed before>
"""

All the remaining tasks are blocked in a system call, so no task is left to call
qemu_futex_wake() to unblock thread #2 (in futex()), which would in turn unblock
thread #1 (doing poll() on a pipe shared with thread #2).

Those 7 threads exit before disk conversion is complete (sometimes in
the beginning, sometimes at the end).

I'll try to check why those tasks exited.

Any thoughts ?

Tks

description: updated
description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (5.3 KiB)

> Zhengui's theory that notify_me doesn't work properly on ARM is more
> promising, but he couldn't provide a clear explanation of why he thought
> notify_me is involved. In particular, I would have expected notify_me to
> be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
>
>
> glib_pollfds_fill
> g_main_context_prepare
> aio_ctx_prepare
> atomic_or(&ctx->notify_me, 1)
> qemu_poll_ns
> glib_pollfds_poll
> g_main_context_check
> aio_ctx_check
> atomic_and(&ctx->notify_me, ~1)
> g_main_context_dispatch
> aio_ctx_dispatch
> /* do something for event */
> qemu_poll_ns
>

Paolo,

I tried confining execution to a single NUMA domain (cpu & mem) and
still faced the issue. Then I added a mutex "ctx->notify_me_lcktest"
into the context to protect "ctx->notify_me", as shown below, and it
seems to have either fixed or mitigated it.

I was able to cause the hang once every 3 or 4 runs. I have already run
qemu-img convert more than 30 times now and couldn't reproduce it again.

Next step is to play with the barriers and check why the existing ones
aren't enough for ordering access to ctx->notify_me ... or should I
try/do something else in your opinion?

This arch/machine (Huawei D06):

$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: 0x48
Model: 0
Stepping: 0x0
CPU max MHz: 2000.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
cpuid asimdrdm dcpop

----

diff --git a/include/block/aio.h b/include/block/aio.h
index 0ca25dfec6..0724086d91 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -84,6 +84,7 @@ struct AioContext {
      * dispatch phase, hence a simple counter is enough for them.
      */
     uint32_t notify_me;
+ QemuMutex notify_me_lcktest;

     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 51c41ed3c9..031d6e2997 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx,
int64_t max_ns, int64_t *timeout)
     bool progress;
     int64_t start_time, elapsed_time;

+ qemu_mutex_lock(&ctx->notify_me_lcktest);
     assert(ctx->notify_me);
+ qemu_mutex_unlock(&ctx->notify_me_lcktest);
     assert(qemu_lockcnt_count(&ctx->list_lock) > 0);

     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
@@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * so disable the optimization now.
      */
     if (blocking) {
+ qemu_mutex_lock(&ctx->notify_me_lcktest);
       ...

Read more...

Revision history for this message
dann frazier (dannf) wrote : Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
Download full text (6.3 KiB)

On Wed, Sep 11, 2019 at 04:09:25PM -0300, Rafael David Tinoco wrote:
> > Zhengui's theory that notify_me doesn't work properly on ARM is more
> > promising, but he couldn't provide a clear explanation of why he thought
> > notify_me is involved. In particular, I would have expected notify_me to
> > be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> >
> >
> > glib_pollfds_fill
> > g_main_context_prepare
> > aio_ctx_prepare
> > atomic_or(&ctx->notify_me, 1)
> > qemu_poll_ns
> > glib_pollfds_poll
> > g_main_context_check
> > aio_ctx_check
> > atomic_and(&ctx->notify_me, ~1)
> > g_main_context_dispatch
> > aio_ctx_dispatch
> > /* do something for event */
> > qemu_poll_ns
> >
>
> Paolo,
>
> I tried confining execution in a single NUMA domain (cpu & mem) and
> still faced the issue, then, I added a mutex "ctx->notify_me_lcktest"
> into context to protect "ctx->notify_me", like showed bellow, and it
> seems to have either fixed or mitigated it.
>
> I was able to cause the hung once every 3 or 4 runs. I have already ran
> qemu-img convert more than 30 times now and couldn't reproduce it again.
>
> Next step is to play with the barriers and check why existing ones
> aren't enough for ordering access to ctx->notify_me ... or should I
> try/do something else in your opinion ?
>
> This arch/machine (Huawei D06):
>
> $ lscpu
> Architecture: aarch64
> Byte Order: Little Endian
> CPU(s): 96
> On-line CPU(s) list: 0-95
> Thread(s) per core: 1
> Core(s) per socket: 48
> Socket(s): 2
> NUMA node(s): 4
> Vendor ID: 0x48
> Model: 0
> Stepping: 0x0
> CPU max MHz: 2000.0000
> CPU min MHz: 200.0000
> BogoMIPS: 200.00
> L1d cache: 64K
> L1i cache: 64K
> L2 cache: 512K
> L3 cache: 32768K
> NUMA node0 CPU(s): 0-23
> NUMA node1 CPU(s): 24-47
> NUMA node2 CPU(s): 48-71
> NUMA node3 CPU(s): 72-95
> Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
> cpuid asimdrdm dcpop

Note that I'm also seeing this on a ThunderX2 (same calltrace):

$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0-223
Thread(s) per core: 4
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: Cavium
Model: 1
Model name: ThunderX2 99xx
Stepping: 0x1
BogoMIPS: 400.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 32768K
NUMA node0 CPU(s): 0-111
NUMA node1 CPU(s): 112-223
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

  -dann

> ----
>
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 0ca25dfec6..0724086d91 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -84,6 +84,7 @@ struct AioContext {
> * dispatch phase, hence a simple counter is enough for them.
> */
> uint32_t notify_me;
> + QemuMutex notif...

Read more...

Revision history for this message
Jan Glauber (jan-glauber-i) wrote :

I've looked into this on ThunderX2. The arm64 code generated for the
atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
memory barriers. It is just plain ldaxr/stlxr.

From my understanding this is not sufficient for SMP sync.

If I read this comment correct:

    void aio_notify(AioContext *ctx)
    {
        /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
         * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
         */
        smp_mb();
        if (ctx->notify_me) {

it points out that the smp_mb() should be paired. But as
I said the used atomics don't generate any barriers at all.

I've tried to verify me theory with this patch and didn't run into the
issue for ~500 iterations (usually I would trigger the issue ~20 iterations).

--Jan

diff --git a/util/aio-posix.c b/util/aio-posix.c
index d8f0cb4af8dd..d07dcd4e9993 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -591,6 +591,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
      */
     if (blocking) {
         atomic_add(&ctx->notify_me, 2);
+ smp_mb();
     }

     qemu_lockcnt_inc(&ctx->list_lock);
@@ -632,6 +633,7 @@ bool aio_poll(AioContext *ctx, bool blocking)

     if (blocking) {
         atomic_sub(&ctx->notify_me, 2);
+ smp_mb();
     }

     /* Adjust polling time */
diff --git a/util/async.c b/util/async.c
index 4dd9d95a9e73..92ac209c4615 100644
--- a/util/async.c
+++ b/util/async.c
@@ -222,6 +222,7 @@ aio_ctx_prepare(GSource *source, gint *timeout)
     AioContext *ctx = (AioContext *) source;

     atomic_or(&ctx->notify_me, 1);
+ smp_mb();

     /* We assume there is no timeout already supplied */
     *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
@@ -240,6 +241,7 @@ aio_ctx_check(GSource *source)
     QEMUBH *bh;

     atomic_and(&ctx->notify_me, ~1);
+ smp_mb();
     aio_notify_accept(ctx);

     for (bh = ctx->first_bh; bh; bh = bh->next) {

Revision history for this message
Jan Glauber (jan-glauber-i) wrote :

Debug files for aio-posix generated on 18.04 on ThunderX2.

Compiler:
gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

Distro:
Ubuntu 18.04.3 LTS

Revision history for this message
Jan Glauber (jan-glauber-i) wrote :

On Wed, Oct 02, 2019 at 11:45:19AM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
>
> Awesome! That would be a compiler bug though, as atomic_add and atomic_sub
> are defined as sequentially consistent:
>
> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))

Compiler bug sounds kind of unlikely...

> What compiler are you using and what distro? Can you compile util/aio-posix.c
> with "-fdump-rtl-all -fdump-tree-all", zip the boatload of debugging files and
> send them my way?

This is on Ubuntu 18.04.3,
gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

I've uploaded the debug files to:
https://bugs.launchpad.net/qemu/+bug/1805256/+attachment/5293619/+files/aio-posix.tar.xz

Thanks,
Jan

> Thanks,
>
> Paolo

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (5.2 KiB)

Documenting this here as bug# was dropped from the mail thread:

On 02/10/19 13:05, Jan Glauber wrote:
> The arm64 code generated for the
> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> memory barriers. It is just plain ldaxr/stlxr.
>
> From my understanding this is not sufficient for SMP sync.
>
>>> If I read this comment correct:
>>>
>>> void aio_notify(AioContext *ctx)
>>> {
>>> /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
>>> * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>> */
>>> smp_mb();
>>> if (ctx->notify_me) {
>>>
>>> it points out that the smp_mb() should be paired. But as
>>> I said the used atomics don't generate any barriers at all.
>>
>> Awesome! That would be a compiler bug though, as atomic_add and atomic_sub
>> are defined as sequentially consistent:
>>
>> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
>> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
>
> Compiler bug sounds kind of unlikely...
Indeed the assembly produced by the compiler matches for example the
mappings at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. A
small testcase is as follows:

  int ctx_notify_me;
  int bh_scheduled;

  int x()
  {
      int one = 1;
      int ret;
      __atomic_store(&bh_scheduled, &one, __ATOMIC_RELEASE); // x1
      __atomic_thread_fence(__ATOMIC_SEQ_CST); // x2
      __atomic_load(&ctx_notify_me, &ret, __ATOMIC_RELAXED); // x3
      return ret;
  }

  int y()
  {
      int ret;
      __atomic_fetch_add(&ctx_notify_me, 2, __ATOMIC_SEQ_CST); // y1
      __atomic_load(&bh_scheduled, &ret, __ATOMIC_RELAXED); // y2
      return ret;
  }

Here y (which is aio_poll) wants to order the write to ctx->notify_me
before reads of bh->scheduled. However, the processor can speculate the
load of bh->scheduled between the load-acquire and store-release of
ctx->notify_me. So you can have something like:

 thread 0 (y)                           thread 1 (x)
 -----------------------------------    -----------------------------
 y1: load-acq ctx->notify_me
 y2: load-rlx bh->scheduled
                                       x1: store-rel bh->scheduled <-- 1
                                       x2: memory barrier
                                       x3: load-rlx ctx->notify_me
 y1: store-rel ctx->notify_me <-- 2

Being very puzzled, I tried to put this into cppmem:

  int main() {
    atomic_int ctx_notify_me = 0;
    atomic_int bh_scheduled = 0;
    {{{ {
          bh_scheduled.store(1, mo_release);
          atomic_thread_fence(mo_seq_cst);
          // must be zero since the bug report shows no notification
          ctx_notify_me.load(mo_relaxed).readsvalue(0);
        }
    ||| {
          ctx_notify_me.store(2, mo_seq_cst);
          r2=bh_scheduled.load(mo_relaxed);
        }
    }}};
    return 0;
  }

and much to my surprise, the tool said r2 *can* be 0. Same if I put a
CAS like

        cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
                            mo_seq_cst, mo_seq_cst);

which resembles the code in ...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (6.3 KiB)

On Wed, 2019-10-02 at 15:20 +0200, Paolo Bonzini wrote:
> On 02/10/19 13:05, Jan Glauber wrote:
>> The arm64 code generated for the
>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
>> memory barriers. It is just plain ldaxr/stlxr.
>>
>> From my understanding this is not sufficient for SMP sync.
>>
>>>> If I read this comment correct:
>>>>
>>>> void aio_notify(AioContext *ctx)
>>>> {
>>>> /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
>>>> * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>> */
>>>> smp_mb();
>>>> if (ctx->notify_me) {
>>>>
>>>> it points out that the smp_mb() should be paired. But as
>>>> I said the used atomics don't generate any barriers at all.
>>>
>>> Awesome! That would be a compiler bug though, as atomic_add and atomic_sub
>>> are defined as sequentially consistent:
>>>
>>> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
>>> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
>>
>> Compiler bug sounds kind of unlikely...
>
> Indeed the assembly produced by the compiler matches for example the
> mappings at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. A
> small testcase is as follows:
>
> int ctx_notify_me;
> int bh_scheduled;
>
> int x()
> {
> int one = 1;
> int ret;
> __atomic_store(&bh_scheduled, &one, __ATOMIC_RELEASE); // x1
> __atomic_thread_fence(__ATOMIC_SEQ_CST); // x2
> __atomic_load(&ctx_notify_me, &ret, __ATOMIC_RELAXED); // x3
> return ret;
> }
>
> int y()
> {
> int ret;
> __atomic_fetch_add(&ctx_notify_me, 2, __ATOMIC_SEQ_CST); // y1
> __atomic_load(&bh_scheduled, &ret, __ATOMIC_RELAXED); // y2
> return ret;
> }
>
> Here y (which is aio_poll) wants to order the write to ctx->notify_me
> before reads of bh->scheduled. However, the processor can speculate the
> load of bh->scheduled between the load-acquire and store-release of
> ctx->notify_me. So you can have something like:
>
> thread 0 (y) thread 1 (x)
> ----------------------------------- -----------------------------
> y1: load-acq ctx->notify_me
> y2: load-rlx bh->scheduled
> x1: store-rel bh->scheduled <-- 1
> x2: memory barrier
> x3: load-rlx ctx->notify_me
> y1: store-rel ctx->notify_me <-- 2
>
> Being very puzzled, I tried to put this into cppmem:
>
> int main() {
> atomic_int ctx_notify_me = 0;
> atomic_int bh_scheduled = 0;
> {{{ {
> bh_scheduled.store(1, mo_release);
> atomic_thread_fence(mo_seq_cst);
> // must be zero since the bug report shows no notification
> ctx_notify_me.load(mo_relaxed).readsvalue(0);
> }
> ||| {
> ctx_notify_me.store(2, mo_seq_cst);
> r2=bh_scheduled.load(mo_relaxed);
> }
> }}};
> return 0;
> }
>
> and much to my surprise, the tool said r2 *can* be 0. Same if I put a
> CAS like
>
> ...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

On 02/10/19 16:58, Torvald Riegel wrote:
> This example looks like Dekker synchronization (if I get the intent right).

It is the same pattern. However, one of the two synchronized variables
is a counter rather than just a flag.

> Two possible implementations of this are either (1) with all memory
> accesses having seq-cst MO, or (2) with relaxed-MO accesses and seq-cst
> fences on between the store and load on both ends. It's possible to mix
> both, but that get's trickier I think. I'd prefer the one with just
> fences, just because it's easiest, conceptually.

Got it.

I'd also prefer the one with just fences, because we only really control
one side of the synchronization primitive (ctx_notify_me in my litmus
test) and I don't like the idea of forcing seq-cst MO on the other side
(bh_scheduled). The performance issue that I mentioned is that x86
doesn't have relaxed fetch and add, so you'd have a redundant fence like
this:

 lock xaddl $2, mem1
 mfence
 ...
 movl mem1, %r8

(Gory QEMU details however allow us to use relaxed load and store here,
because there's only one writer).
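
For reference, the "relaxed accesses plus seq-cst fences" variant being discussed would look roughly like this in C11 (an illustrative sketch mirroring the x()/y() litmus test above, not the actual QEMU patch):

    #include <stdatomic.h>

    atomic_int ctx_notify_me;
    atomic_int bh_scheduled;

    /* aio_notify()-like side: publish the work item, then decide whether to wake the poller. */
    int x(void)
    {
        atomic_store_explicit(&bh_scheduled, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_load_explicit(&ctx_notify_me, memory_order_relaxed);
    }

    /* aio_poll()-like side: announce the intent to sleep, then re-check for pending work. */
    int y(void)
    {
        atomic_store_explicit(&ctx_notify_me, 2, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_load_explicit(&bh_scheduled, memory_order_relaxed);
    }

With the seq-cst fences sitting between the store and the load on both sides, at least one of the two functions must observe the other's store, which is exactly the property the lost wake-up violates.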

> It works if you use (1) or (2) consistently. cppmem and the Batty et al.
> tech report should give you the gory details.
>
>> 1) understand why ATOMIC_SEQ_CST is not enough in this case. QEMU code
>> seems to be making the same assumptions as Linux about the memory model,
>> and this is wrong because QEMU uses C11 atomics if available.
>> Fortunately, this kind of synchronization in QEMU is relatively rare and
>> only this particular bit seems affected. If there is a fix which stays
>> within the C11 memory model, and does not pessimize code on x86, we can
>> use it[1] and document the pitfall.
>
> Using the fences between the store/load pairs in Dekker-like
> synchronization should do that, right? It's also relatively easy to deal
> with.
>
>> 2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
>> __sync_fetch_and_add and friends. And again, in this case the
>> difference between the C11 and Linux/QEMU memory models must be documented.
>
> I surely not aware of all the constraints here, but I'd be surprised if the
> C11 memory model isn't good enough for portable synchronization code (with
> the exception of the consume MO minefield, perhaps).

This helps a lot already; I'll work on a documentation and code patch.
Thanks very much.

Paolo

>> int main() {
>> atomic_int ctx_notify_me = 0;
>> atomic_int bh_scheduled = 0;
>> {{{ {
>> bh_scheduled.store(1, mo_release);
>> atomic_thread_fence(mo_seq_cst);
>> // must be zero since the bug report shows no notification
>> ctx_notify_me.load(mo_relaxed).readsvalue(0);
>> }
>> ||| {
>> ctx_notify_me.store(2, mo_seq_cst);
>> r2=bh_scheduled.load(mo_relaxed);
>> }
>> }}};
>> return 0;
>> }

Changed in qemu (Ubuntu Disco):
importance: Undecided → Medium
Changed in qemu (Ubuntu Bionic):
importance: Undecided → Medium
Revision history for this message
Jan Glauber (jan-glauber-i) wrote :

On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've looked into this on ThunderX2. The arm64 code generated for the
> > atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> > memory barriers. It is just plain ldaxr/stlxr.
> >
> > From my understanding this is not sufficient for SMP sync.
> >
> > If I read this comment correct:
> >
> > void aio_notify(AioContext *ctx)
> > {
> > /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
> > * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> > */
> > smp_mb();
> > if (ctx->notify_me) {
> >
> > it points out that the smp_mb() should be paired. But as
> > I said the used atomics don't generate any barriers at all.
>
> Based on the rest of the thread, this patch should also fix the bug:
>
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
> aio_notify_accept(ctx);
>
> for (bh = ctx->first_bh; bh; bh = bh->next) {
> - if (bh->scheduled) {
> + if (atomic_mb_read(&bh->scheduled)) {
> return true;
> }
> }
>
>
> And also the memory barrier in aio_notify can actually be replaced
> with a SEQ_CST load:
>
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>
> void aio_notify(AioContext *ctx)
> {
> - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
> - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or
> + * atomic_add in aio_poll.
> */
> - smp_mb();
> - if (ctx->notify_me) {
> + if (atomic_mb_read(&ctx->notify_me)) {
> event_notifier_set(&ctx->notifier);
> atomic_mb_set(&ctx->notified, true);
> }
>
>
> Would you be able to test these (one by one possibly)?

Sure.

> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
>
> Sorry for asking the obvious---500 iterations of what?

The testcase mentioned in the Canonical issue:
https://bugs.launchpad.net/qemu/+bug/1805256

It's a simple image convert:
qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

Usually it got stuck after 3-20 iterations.

--Jan

Revision history for this message
dann frazier (dannf) wrote :
Download full text (5.1 KiB)

On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've looked into this on ThunderX2. The arm64 code generated for the
> > atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> > memory barriers. It is just plain ldaxr/stlxr.
> >
> > From my understanding this is not sufficient for SMP sync.
> >
> > If I read this comment correct:
> >
> > void aio_notify(AioContext *ctx)
> > {
> > /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
> > * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> > */
> > smp_mb();
> > if (ctx->notify_me) {
> >
> > it points out that the smp_mb() should be paired. But as
> > I said the used atomics don't generate any barriers at all.
>
> Based on the rest of the thread, this patch should also fix the bug:
>
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
> aio_notify_accept(ctx);
>
> for (bh = ctx->first_bh; bh; bh = bh->next) {
> - if (bh->scheduled) {
> + if (atomic_mb_read(&bh->scheduled)) {
> return true;
> }
> }
>
>
> And also the memory barrier in aio_notify can actually be replaced
> with a SEQ_CST load:
>
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>
> void aio_notify(AioContext *ctx)
> {
> - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
> - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or
> + * atomic_add in aio_poll.
> */
> - smp_mb();
> - if (ctx->notify_me) {
> + if (atomic_mb_read(&ctx->notify_me)) {
> event_notifier_set(&ctx->notifier);
> atomic_mb_set(&ctx->notified, true);
> }
>
>
> Would you be able to test these (one by one possibly)?

Paolo,
  I tried them both separately and together on a Hi1620 system, each
time it hung in the first iteration. Here's a backtrace of a run with
both patches applied:

(gdb) thread apply all bt

Thread 3 (Thread 0xffff8154b820 (LWP 63900)):
#0 0x0000ffff8b9402cc in __GI___sigtimedwait (set=<optimized out>, set@entry=0xaaaaf1e08070,
    info=info@entry=0xffff8154ad98, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:42
#1 0x0000ffff8ba77fac in __sigwait (set=set@entry=0xaaaaf1e08070, sig=sig@entry=0xffff8154ae74)
    at ../sysdeps/unix/sysv/linux/sigwait.c:28
#2 0x0000aaaab7dc1610 in sigwait_compat (opaque=0xaaaaf1e08070) at util/compatfd.c:35
#3 0x0000aaaab7dc3e80 in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:519
#4 0x0000ffff8ba6d088 in start_thread (arg=0xffffceefbf4f) at pthread_create.c:463
#5 0x0000ffff8b9dd4ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 (Thread 0x...

Read more...

Revision history for this message
Jan Glauber (jan-glauber-i) wrote :
Download full text (4.0 KiB)

On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote:
> On 07/10/19 16:44, dann frazier wrote:
> > On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> >> On 02/10/19 11:23, Jan Glauber wrote:
> >>> I've looked into this on ThunderX2. The arm64 code generated for the
> >>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> >>> memory barriers. It is just plain ldaxr/stlxr.
> >>>
> >>> From my understanding this is not sufficient for SMP sync.
> >>>
> >>> If I read this comment correct:
> >>>
> >>> void aio_notify(AioContext *ctx)
> >>> {
> >>> /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
> >>> * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >>> */
> >>> smp_mb();
> >>> if (ctx->notify_me) {
> >>>
> >>> it points out that the smp_mb() should be paired. But as
> >>> I said the used atomics don't generate any barriers at all.
> >>
> >> Based on the rest of the thread, this patch should also fix the bug:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
> >> aio_notify_accept(ctx);
> >>
> >> for (bh = ctx->first_bh; bh; bh = bh->next) {
> >> - if (bh->scheduled) {
> >> + if (atomic_mb_read(&bh->scheduled)) {
> >> return true;
> >> }
> >> }
> >>
> >>
> >> And also the memory barrier in aio_notify can actually be replaced
> >> with a SEQ_CST load:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
> >>
> >> void aio_notify(AioContext *ctx)
> >> {
> >> - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
> >> - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >> + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> >> + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or
> >> + * atomic_add in aio_poll.
> >> */
> >> - smp_mb();
> >> - if (ctx->notify_me) {
> >> + if (atomic_mb_read(&ctx->notify_me)) {
> >> event_notifier_set(&ctx->notifier);
> >> atomic_mb_set(&ctx->notified, true);
> >> }
> >>
> >>
> >> Would you be able to test these (one by one possibly)?
> >
> > Paolo,
> > I tried them both separately and together on a Hi1620 system, each
> > time it hung in the first iteration. Here's a backtrace of a run with
> > both patches applied:
>
> Ok, not a great start... I'll find myself an aarch64 machine and look
> at it more closely. I'd like the patch to be something we can
> understand and document, since this is probably the second most-used
> memory barrier idiom in QEMU.
>
> Paolo

I'm still not sure what the actual issue is here, but could it be some bad
interaction between notify_me and the list_lock? They are both 4 bytes
and side-by-side:

address notify_me: 0xaaaadb528aa0 sizeof notify_me: 4
address list_lock: 0xaaaadb528aa4 size...

Read more...
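
As a quick aside, the shared-cacheline hypothesis can be sanity-checked directly from the two addresses printed above; this is a hypothetical standalone snippet (assuming the usual 64-byte cache line, which is an assumption, not something stated in the thread):

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

static int same_cache_line(uint64_t a, uint64_t b)
{
    return (a / CACHE_LINE) == (b / CACHE_LINE);
}

int main(void)
{
    uint64_t notify_me = 0xaaaadb528aa0;   /* addresses as reported above */
    uint64_t list_lock = 0xaaaadb528aa4;

    printf("notify_me and list_lock share a cache line: %s\n",
           same_cache_line(notify_me, list_lock) ? "yes" : "no");
    return 0;
}

With these values the answer is "yes" for a 64-byte line.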

Revision history for this message
Jan Glauber (jan-glauber-i) wrote :

On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> On 09/10/19 10:02, Jan Glauber wrote:

> > I'm still not sure what the actual issue is here, but could it be some bad
> > interaction between the notify_me and the list_lock? The are both 4 byte
> > and side-by-side:
> >
> > address notify_me: 0xaaaadb528aa0 sizeof notify_me: 4
> > address list_lock: 0xaaaadb528aa4 sizeof list_lock: 4
> >
> > AFAICS the generated code looks OK (all load/store exclusive done
> > with 32 bit size):
> >
> > e6c: 885ffc01 ldaxr w1, [x0]
> > e70: 11000821 add w1, w1, #0x2
> > e74: 8802fc01 stlxr w2, w1, [x0]
> >
> > ...but if I bump notify_me size to uint64_t the issue goes away.
>
> Ouch. :) Is this with or without my patch(es)?
>
> Also, what if you just add a dummy uint32_t after notify_me?

With the dummy the testcase also runs fine for 500 iterations.

Dann, can you try if this works on the Hi1620 too?

--Jan

Revision history for this message
Jan Glauber (jan-glauber-i) wrote :

On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> On 11/10/19 08:05, Jan Glauber wrote:
> > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> >>
> >> Ouch. :) Is this with or without my patch(es)?
>
> You didn't answer this question.

Oh, sorry... I did but the mail probably didn't make it out.
I have both of your changes applied (as I think they make sense).

> >> Also, what if you just add a dummy uint32_t after notify_me?
> >
> > With the dummy the testcase also runs fine for 500 iterations.
>
> You might be lucky and causing list_lock to be in another cache line.
> What if you add __attribute__((aligned(16)) to notify_me (and keep the
> dummy)?

Good point. I'll try to force both into the same cacheline.

--Jan

> Paolo
>
> > Dann, can you try if this works on the Hi1620 too?

Revision history for this message
dann frazier (dannf) wrote :

On Fri, Oct 11, 2019 at 06:05:25AM +0000, Jan Glauber wrote:
> On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > On 09/10/19 10:02, Jan Glauber wrote:
>
> > > I'm still not sure what the actual issue is here, but could it be some bad
> > > interaction between the notify_me and the list_lock? The are both 4 byte
> > > and side-by-side:
> > >
> > > address notify_me: 0xaaaadb528aa0 sizeof notify_me: 4
> > > address list_lock: 0xaaaadb528aa4 sizeof list_lock: 4
> > >
> > > AFAICS the generated code looks OK (all load/store exclusive done
> > > with 32 bit size):
> > >
> > > e6c: 885ffc01 ldaxr w1, [x0]
> > > e70: 11000821 add w1, w1, #0x2
> > > e74: 8802fc01 stlxr w2, w1, [x0]
> > >
> > > ...but if I bump notify_me size to uint64_t the issue goes away.
> >
> > Ouch. :) Is this with or without my patch(es)?
> >
> > Also, what if you just add a dummy uint32_t after notify_me?
>
> With the dummy the testcase also runs fine for 500 iterations.
>
> Dann, can you try if this works on the Hi1620 too?

On Hi1620, it hung on the first iteration. Here's the complete patch
I'm running with:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..e6fd6f1a1a 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
- uint32_t notify_me;
+ uint64_t notify_me;

     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);

     for (bh = ctx->first_bh; bh; bh = bh->next) {
- if (bh->scheduled) {
+ if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)

 void aio_notify(AioContext *ctx)
 {
- /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
- * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+ /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+ * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or
+ * atomic_add in aio_poll.
      */
- smp_mb();
- if (ctx->notify_me) {
- event_notifier_set(&ctx->notifier);
+ if (atomic_mb_read(&ctx->notify_me)) {
+ event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }

Revision history for this message
dann frazier (dannf) wrote :

On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote:
> On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> > On 11/10/19 08:05, Jan Glauber wrote:
> > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> > >>
> > >> Ouch. :) Is this with or without my patch(es)?
> >
> > You didn't answer this question.
>
> Oh, sorry... I did but the mail probably didn't make it out.
> I have both of your changes applied (as I think they make sense).
>
> > >> Also, what if you just add a dummy uint32_t after notify_me?
> > >
> > > With the dummy the testcase also runs fine for 500 iterations.
> >
> > You might be lucky and causing list_lock to be in another cache line.
> > What if you add __attribute__((aligned(16)) to notify_me (and keep the
> > dummy)?
>
> Good point. I'll try to force both into the same cacheline.

On the Hi1620, this still hangs in the first iteration:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..00e56a5412 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
- uint32_t notify_me;
+ __attribute__((aligned(16))) uint64_t notify_me;

     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);

     for (bh = ctx->first_bh; bh; bh = bh->next) {
- if (bh->scheduled) {
+ if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)

 void aio_notify(AioContext *ctx)
 {
- /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
- * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+ /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+ * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or
+ * atomic_add in aio_poll.
      */
- smp_mb();
- if (ctx->notify_me) {
- event_notifier_set(&ctx->notifier);
+ if (atomic_mb_read(&ctx->notify_me)) {
+ event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }

dann frazier (dannf)
Changed in kunpeng920:
status: New → Confirmed
Changed in qemu (Ubuntu Bionic):
status: New → Confirmed
Changed in qemu (Ubuntu Disco):
status: New → Confirmed
Changed in qemu (Ubuntu Focal):
status: New → Confirmed
Revision history for this message
Fred Kimmy (kongzizaixian) wrote :

 include/block/aio.h | 3 +++
 qemu-img.c | 4 ++++
 util/async.c | 5 +----
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index e9bc04c..9153d87 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -89,6 +89,9 @@ struct AioContext {
      */
     uint32_t notify_me;

+ /* force to notify for qemu-img convert */
+ bool notify_for_convert;
+
     /* lock to protect between bh's adders and deleter */
     QemuMutex bh_lock;

diff --git a/qemu-img.c b/qemu-img.c
index 60a2be3..cf037aa 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2411,6 +2411,10 @@ static int img_convert(int argc, char **argv)
         .wr_in_order = wr_in_order,
         .num_coroutines = num_coroutines,
     };
+
+ AioContext *ctx = qemu_get_aio_context();
+ ctx->notify_for_convert = 1;
+
     ret = convert_do_copy(&state);

 out:
diff --git a/util/async.c b/util/async.c
index 042bf8a..af235fc 100644
--- a/util/async.c
+++ b/util/async.c
@@ -336,12 +336,9 @@ void aio_notify(AioContext *ctx)
      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
      */
     smp_mb();
- if (ctx->notify_me) {
+ if (ctx->notify_me || ctx->notify_for_convert) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
-#if defined(__aarch64__)
- kill(getpid(), SIGIO);
-#endif
     }
 }

Can you try the above patchset to solve it?

Revision history for this message
dann frazier (dannf) wrote :

I tested the patch in Comment #34, and it was able to pass 500 iterations.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Fred,

Based on Dann's feedback on testing, I'm failing to see where your patch fixes the "root" cause (despite being able to mitigate the issue by changing the aio notification mechanism).

I think the root cause is best described in these 2 emails from the thread:

https://lore.kernel.org/qemu-devel/20191009080220.GA2905@hc/

and

https://<email address hidden>/

So, by adding ctx->notify_for_convert, it is very likely you worked around the issue by doing what Jan already described: moving the two variables (ctx->list_lock and, in the old case, ctx->notify_me; in your case, ctx->notify_for_convert) out of the same cacheline, making the issue "disappear" (as we would eventually do in a workaround patch).

What about the aarch64 issue with both ctx->list_lock and ctx->notify_for_convert being synchronized by the primitives QEMU uses while sharing the same cache line?

Any "workaround" here would just try to dodge the shared-cacheline situation, but, for upstream, I suppose Paolo wants something else regarding aarch64 ATOMIC_SEQ_CST,

as described in this part of the discussion:

https://<email address hidden>/

Unless I'm missing something, am I ?

Thank you!

Revision history for this message
iveskim (iveskim) wrote :

I tested the patch in Comment #34, and it also failed to pass 5 iterations.
This GDB was configured as "aarch64-linux-gnu".
Attaching to process 3987
[New LWP 3988]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffffbd2b3154 in __GI_ppoll (fds=0xaaaae80ef080, nfds=187650381360708,
    timeout=<optimized out>, sigmask=0xffffc31815f0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
39 ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
(gdb)

Revision history for this message
dann frazier (dannf) wrote :

fyi, what I tested in Comment #35 was upstream QEMU (@ aceeaa69d2) with a port of the patch in Comment #34 applied. I've attached that patch here. While it did avoid the issue in my testing, I agree with Rafael's Comment #36 that it does not appear to address the root cause (as I understand it), and is therefore unlikely something we'd ship in Ubuntu.

tags: added: patch
Ike Panhc (ikepanhc)
tags: added: ikeradar
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Could HiSilicon respond to Dann & Rafael's comments #36 and #38?
Is there an upstream acceptable patch that addresses this issue?

Changed in kunpeng920:
status: Confirmed → Incomplete
Changed in qemu (Ubuntu Bionic):
status: Confirmed → Incomplete
Changed in qemu (Ubuntu Disco):
status: Confirmed → Incomplete
Changed in qemu (Ubuntu Eoan):
status: In Progress → Incomplete
Changed in qemu (Ubuntu Focal):
status: Confirmed → Incomplete
Revision history for this message
Fred Kimmy (kongzizaixian) wrote :

> Could HiSilicon respond to Dann & Rafael's comments #36 and #38?
> Is there an upstream acceptable patch that addresses this issue?

There is no upstream patchset; I only provided a private solution and do not know the root cause.

Changed in qemu (Ubuntu Eoan):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in qemu:
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Revision history for this message
Ike Panhc (ikepanhc) wrote :

PPA created with the temporary workaround from comment #34.

https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256

This PPA works around the issue temporarily but is not acceptable for an official release.

Revision history for this message
Ike Panhc (ikepanhc) wrote :

I took several CPUs offline and re-tested. Even with only 32 threads left, I can still reproduce this issue easily.

ubuntu@kreiken:~$ lscpu | grep list;for i in `seq 1 10`;do echo ;rm -f out.img;timeout 30 qemu-img convert -f qcow2 -O qcow2 ./bionic-server-cloudimg-arm64.img out.img -p; done
On-line CPU(s) list: 0-31
Off-line CPU(s) list: 32-127

    (100.00/100%)

    (43.20/100%)
    (0.00/100%)
    (1.00/100%)

Revision history for this message
Ying Fang (ying-fang) wrote :

Hi, Ike.

I think this tricky bug was fixed by Paolo last month.
Please try patch https://git.qemu.org/?p=qemu.git;a=commitdiff;h=5710a3e09f9b85801e5ce70797a4a511e5fc9e2c.

Revision history for this message
Ike Panhc (ikepanhc) wrote :

Thanks. I will test it.

Revision history for this message
Ike Panhc (ikepanhc) wrote :

The test deb has been pushed to https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256

40 runs with the patch mentioned in #43, and all passed.

Thanks.

Changed in kunpeng920:
status: Incomplete → Triaged
assignee: nobody → Ike Panhc (ikepanhc)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Ike,

Please, let me know if you want me to go after the needed SRUs for this fix or if you will.

I'll wait for the final feedback from tests with your PPA.

Cheers!

Revision history for this message
dann frazier (dannf) wrote :

fyi, I also backported that fix to focal/groovy and eoan and tested with those builds. On my test systems the hang reliably occurs within 20 iterations without the fix. With the fix, they have survived > 500 iterations thus far. I'll leave them running overnight just to be sure.

Changed in qemu (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in qemu:
status: In Progress → Fix Released
Changed in qemu (Ubuntu Focal):
status: Incomplete → In Progress
Changed in qemu (Ubuntu Eoan):
status: Incomplete → In Progress
Changed in qemu (Ubuntu Disco):
status: Incomplete → In Progress
Changed in qemu (Ubuntu Bionic):
status: Incomplete → In Progress
Changed in qemu (Ubuntu):
status: Incomplete → In Progress
description: updated
dann frazier (dannf)
description: updated
Revision history for this message
Philippe Mathieu-Daudé (philmd) wrote :

Isn't this fixed by commit 5710a3e09f9?

commit 5710a3e09f9b85801e5ce70797a4a511e5fc9e2c
Author: Paolo Bonzini <email address hidden>
Date: Tue Apr 7 10:07:46 2020 -0400

    async: use explicit memory barriers

    When using C11 atomics, non-seqcst reads and writes do not participate
    in the total order of seqcst operations. In util/async.c and util/aio-posix.c,
    in particular, the pattern that we use

              write ctx->notify_me                 write bh->scheduled
              read bh->scheduled                   read ctx->notify_me
              if !bh->scheduled, sleep             if ctx->notify_me, notify

    needs to use seqcst operations for both the write and the read. In
    general this is something that we do not want, because there can be
    many sources that are polled in addition to bottom halves. The
    alternative is to place a seqcst memory barrier between the write
    and the read. This also comes with a disadvantage, in that the
    memory barrier is implicit on strongly-ordered architectures and
    it wastes a few dozen clock cycles.

    Fortunately, ctx->notify_me is never written concurrently by two
    threads, so we can assert that and relax the writes to ctx->notify_me.
    The resulting solution works and performs well on both aarch64 and x86.

    Note that the atomic_set/atomic_read combination is not an atomic
    read-modify-write, and therefore it is even weaker than C11 ATOMIC_RELAXED;
    on x86, ATOMIC_RELAXED compiles to a locked operation.
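
For reference, here is roughly what that fix boils down to when written with plain C11 atomics; a minimal sketch with simplified names, not the actual QEMU patch (which uses the wrappers from include/qemu/atomic.h):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int  notify_me;   /* only ever written by the polling thread */
static atomic_bool scheduled;

/* Polling side: write notify_me, full barrier, then read scheduled. */
static bool poll_prepare(void)
{
    int v = atomic_load_explicit(&notify_me, memory_order_relaxed);
    atomic_store_explicit(&notify_me, v + 2, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* explicit smp_mb() */
    return !atomic_load_explicit(&scheduled, memory_order_relaxed);
}

/* Notifying side: write scheduled, full barrier, then read notify_me. */
static void schedule_and_notify(void)
{
    atomic_store_explicit(&scheduled, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* explicit smp_mb() */
    if (atomic_load_explicit(&notify_me, memory_order_relaxed)) {
        /* event_notifier_set(&ctx->notifier) would go here */
    }
}

With a seq_cst fence on both sides, the interleaving where the poller misses scheduled and the notifier simultaneously misses notify_me is forbidden, so the wakeup cannot be lost; relaxing the notify_me write itself is fine because, as the commit message says, it is never written concurrently by two threads.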

Revision history for this message
dann frazier (dannf) wrote : Re: [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images

On Wed, May 6, 2020 at 1:20 PM Philippe Mathieu-Daudé
<email address hidden> wrote:
>
> Isn't this fixed by commit 5710a3e09f9?

See comment #43. The discussions hence are about testing/integration
of that fix.

  -dann

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

FYI, from now on all the "merge" work will be done in the merge requests linked to this BUG (at the top). @paelzer will be verifying those.

Revision history for this message
Ike Panhc (ikepanhc) wrote :

Tested the debs in ppa:rafaeldtinoco/lp1805256 for focal and eoan; 1000 qemu-img convert runs passed.

Revision history for this message
dann frazier (dannf) wrote :

Ike's backport in https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256 tests well for me on Cavium Sabre. One minor note is that the function in_aio_context_home_thread() is being called in aio-win32.c, but that function didn't exist in 2.11. We probably want to change that to aio_context_in_iothread(). It was renamed in https://git.qemu.org/?p=qemu.git;a=commitdiff;h=d2b63ba8dd20c1091b3f1033e6a95ef95b18149d

Changed in kunpeng920:
status: Triaged → In Progress
no longer affects: qemu (Ubuntu Disco)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI: sponsored into groovy

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.2-3ubuntu8

---------------
qemu (1:4.2-3ubuntu8) groovy; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held

 -- Rafael David Tinoco <email address hidden> Wed, 27 May 2020 21:47:21 +0000

Changed in qemu (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Migrated right now, sponsoring the related SRU portions into B/E/F ... for consideration by the SRU Team.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello dann, or anyone else affected,

Accepted qemu into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:4.2-3ubuntu6.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

tags: added: block-proposed-focal
Changed in qemu (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Changed in qemu (Ubuntu Eoan):
status: In Progress → Fix Committed
tags: added: verification-needed-eoan
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello dann, or anyone else affected,

Accepted qemu into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:4.0+dfsg-0ubuntu9.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

tags: added: block-proposed-eoan
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello dann, or anyone else affected,

Accepted qemu into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.11+dfsg-1ubuntu7.27 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
tags: added: block-proposed-bionic
removed: verification-needed-bionic
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.7)

All autopkgtests for the newly accepted qemu (1:4.0+dfsg-0ubuntu9.7) for eoan have finished running.
The following regressions have been reported in tests triggered by the package:

edk2/0~20190606.20d2e5a1-2ubuntu1.1 (amd64, armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/eoan/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Ike Panhc (ikepanhc) wrote :

100 runs of `qemu-img convert` with bionic/eoan/focal -proposed were all successful. No hang occurred. Thanks a lot.

tags: added: verification-done-bionic verification-done-eoan verification-done-focal
removed: verification-needed verification-needed-eoan verification-needed-focal
Changed in kunpeng920:
assignee: Ike Panhc (ikepanhc) → nobody
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (qemu/1:4.2-3ubuntu6.2)

All autopkgtests for the newly accepted qemu (1:4.2-3ubuntu6.2) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/245.4-4ubuntu3.1 (arm64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I've looked and retried the tests - all green now.
Let us give it a few extra days in proposed as planned, but other than that it looks ok to be released.

Changed in kunpeng920:
status: In Progress → Fix Committed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

We had 14 (instead of 7) days in -proposed for some extended maturing. Nothing came up in regard to this and all validations were good.
Dropping block-proposed to be released once the SRU Team gets to it.

tags: removed: block-proposed-bionic block-proposed-eoan block-proposed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.2-3ubuntu6.2

---------------
qemu (1:4.2-3ubuntu6.2) focal; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held

 -- Rafael David Tinoco <email address hidden> Wed, 27 May 2020 21:19:20 +0000

Changed in qemu (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.0+dfsg-0ubuntu9.7

---------------
qemu (1:4.0+dfsg-0ubuntu9.7) eoan; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held

 -- Rafael David Tinoco <email address hidden> Wed, 27 May 2020 20:07:57 +0000

Changed in qemu (Ubuntu Eoan):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.11+dfsg-1ubuntu7.27

---------------
qemu (1:2.11+dfsg-1ubuntu7.27) bionic; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - aio: rename aio_context_in_iothread() to in_aio_context_home_thread()
    - aio: Do aio_notify_accept only during blocking aio_poll
    - aio-posix: Assert that aio_poll() is always called in home thread
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held
    - aio-posix: Don't count ctx->notifier as progress when polling

 -- Rafael David Tinoco <email address hidden> Tue, 26 May 2020 17:39:21 +0000

Changed in qemu (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in kunpeng920:
status: Fix Committed → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This will re-open again for Bionic due to bug 1885419 forcing a revert of the former backports.
After a deeper evaluation of whether the assert is wrong in the backport or is just flagging a problem that already existed in Bionic, this will be re-fixed.

Changed in qemu (Ubuntu Bionic):
status: Fix Released → Triaged
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Revision history for this message
Ike Panhc (ikepanhc) wrote :

Re-open for bionic due to regression found

Changed in kunpeng920:
status: Fix Released → Triaged
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Started working on this again...

Changed in qemu (Ubuntu Bionic):
status: Triaged → In Progress
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Work being done for the Bionic SRU:

BUG: https://bugs.launchpad.net/qemu/+bug/1805256
(fix for the bionic regression demonstrated at LP: #1885419)
PPA: https://launchpad.net/~rafaeldtinoco/+archive/ubuntu/lp1805256-bionic
MERGE: https://tinyurl.com/y8sucs6x

Merge proposal currently going under review, tests and discussions.

Revision history for this message
dann frazier (dannf) wrote :

I ran the new PPA build (1:2.11+dfsg-1ubuntu7.29~ppa01) on both a ThunderX2 system and a Hi1620 system overnight, and both survived (6574 & 12919 iterations, respectively).

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Thanks @dannf! I spoke to Christian, and he and I agreed to confine this change to ARM builds only (as the SRU for Bionic). Preparing it...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I've hidden the last post as it was posted in the wrong bug.

description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I just pushed/uploaded a SRU for bionic from:

https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/387269

Waiting for SRU on it.

Changed in qemu (Ubuntu Bionic):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello dann, or anyone else affected,

Accepted qemu into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.11+dfsg-1ubuntu7.30 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
removed: verification-done-bionic
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.30)

All autopkgtests for the newly accepted qemu (1:2.11+dfsg-1ubuntu7.30) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/237-3ubuntu10.42 (i386)
ubuntu-image/1.9+18.04ubuntu1 (arm64, armhf, i386, s390x, ppc64el, amd64)
cinder/2:12.0.9-0ubuntu1.2 (s390x)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
dann frazier (dannf) wrote :

Verified w/ over 500 successful iterations on an m6g.metal instance, and over 300 in an armhf chroot on the same.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
dann frazier (dannf)
Changed in kunpeng920:
status: Triaged → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (4.0 KiB)

This bug was fixed in the package qemu - 1:2.11+dfsg-1ubuntu7.31

---------------
qemu (1:2.11+dfsg-1ubuntu7.31) bionic-security; urgency=medium

  * SECURITY UPDATE: out-of-bounds read in slirp networking
    - debian/patches/CVE-2020-10756.patch: drop bogus IPv6 messages in
      slirp/src/ip6_input.c.
    - CVE-2020-10756
  * SECURITY UPDATE: out-of-bounds read and write in sm501
    - debian/patches/CVE-2020-12829-pre1.patch: use values from the pitch
      register for 2D operations.
    - debian/patches/CVE-2020-12829-pre2.patch: implement negated
      destination raster operation mode.
    - debian/patches/CVE-2020-12829-pre3.patch: log unimplemented raster
      operation modes.
    - debian/patches/CVE-2020-12829-pre4.patch: fix support for non-zero
      frame buffer start address.
    - debian/patches/CVE-2020-12829-pre5.patch: set updated region dirty
      after 2D operation.
    - debian/patches/CVE-2020-12829-pre6.patch: adjust endianness of pixel
      value in rectangle fill.
    - debian/patches/CVE-2020-12829-pre7.patch: convert printf +
      abort to qemu_log_mask.
    - debian/patches/CVE-2020-12829-pre8.patch: shorten long
      variable names in sm501_2d_operation.
    - debian/patches/CVE-2020-12829-pre9.patch: use BIT(x) macro to
      shorten constant.
    - debian/patches/CVE-2020-12829-pre10.patch: clean up local
      variables in sm501_2d_operation.
    - debian/patches/CVE-2020-12829.patch: replace hand written
      implementation with pixman where possible.
    - debian/patches/CVE-2020-12829-2.patch: optimize small overlapping
      blits.
    - debian/patches/CVE-2020-12829-3.patch: fix bounds checks.
    - debian/patches/CVE-2020-12829-4.patch: drop unneded variable.
    - debian/patches/CVE-2020-12829-5.patch: do not allow guest to set
      invalid format.
    - debian/patches/CVE-2020-12829-6.patch: introduce variable for
      commonly used value for better readability.
    - debian/patches/CVE-2020-12829-7.patch: fix and optimize overlap
      check.
    - CVE-2020-12829
  * SECURITY UPDATE: out-of-bounds read during sdhci_write() operations
    - debian/patches/CVE-2020-13253.patch: do not switch to ReceivingData
      if address is invalid in hw/sd/sd.c.
    - CVE-2020-13253
  * SECURITY UPDATE: out-of-bounds access during es1370_write() operation
    - debian/patches/CVE-2020-13361.patch: check total frame count against
      current frame in hw/audio/es1370.c.
    - CVE-2020-13361
  * SECURITY UPDATE: out-of-bounds read via crafted reply_queue_head
    - debian/patches/CVE-2020-13362-1.patch: use unsigned type for
      reply_queue_head and check index in hw/scsi/megasas.c.
    - debian/patches/CVE-2020-13362-2.patch: avoid NULL pointer dereference
      in hw/scsi/megasas.c.
    - debian/patches/CVE-2020-13362-3.patch: use unsigned type for positive
      numeric fields in hw/scsi/megasas.c.
    - CVE-2020-13362
  * SECURITY UPDATE: NULL pointer dereference related to BounceBuffer
    - debian/patches/CVE-2020-13659.patch: set map length to zero when
      returning NULL in exec.c, include/exec/memory.h.
    - CVE-2020-13659
  * SECURITY UPDATE: out-of-bounds access via msi-x mmio operation
    ...

Read more...

Changed in qemu (Ubuntu Bionic):
status: Fix Committed → Fix Released
no longer affects: kunpeng920/ubuntu-18.04-hwe
Changed in kunpeng920:
status: Fix Committed → Fix Released