KVM on 17.10 crashes the machine

Bug #1725350 reported by bugproxy
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
The Ubuntu-power-systems project | Fix Released | Critical | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | Critical | Joseph Salisbury |
Artful | Fix Released | Critical | Joseph Salisbury |

Bug Description

When you start qemu on a 17.10 machine, the whole machine goes down and crashes:

[ 90.689627] Unable to handle kernel paging request for data at address 0xf000000002d3bda0
[ 90.689705] Faulting instruction address: 0xc000000000361224
[ 90.689840] Oops: Kernel access of bad area, sig: 11 [#1]
[ 90.689911] SMP NR_CPUS=2048
[ 90.689912] NUMA
[ 90.690053] PowerNV
[ 90.690092] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack_netlink nf_conntrack nfnetlink idt_89hpesx snd_hda_codec_hdmi xfs joydev input_leds mac_hid snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore ofpart opal_prd cmdlinepart powernv_flash mtd at24 ipmi_powernv ipmi_devintf ipmi_msghandler powernv_rng uio_pdrv_genirq vmx_crypto ibmpowernv uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables
[ 90.690724] autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage ast crct10dif_vpmsum i2c_algo_bit crc32c_vpmsum ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm tg3 ahci libahci
[ 90.690937] CPU: 48 PID: 3986 Comm: qemu-system-ppc Not tainted 4.13.0-12-generic #13-Ubuntu
[ 90.691001] task: c000000b122d8700 task.stack: c000000b431cc000
[ 90.691167] NIP: c000000000361224 LR: c000000000998960 CTR: c0000000009a19b0
[ 90.691223] REGS: c000000bff61b800 TRAP: 0300 Not tainted (4.13.0-12-generic)
[ 90.691277] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 90.691282] CR: 88002844 XER: 00000000
[ 90.691347] CFAR: c00000000099895c DAR: f000000002d3bda0 DSISR: 40000000 SOFTE: 0
[ 90.691347] GPR00: c000000000998960 c000000bff61ba80 c0000000015e3000 c000000b4ef61f20
[ 90.691347] GPR04: c000000b44c61680 0000000000000000 000000000000001f 000000000000001f
[ 90.691347] GPR08: 000000000000001f 0000000002d3bd80 c00000000178e8e8 c000000b5a0c26f0
[ 90.691347] GPR12: 0000000028002842 c00000000fadf800 c000000b52d07880 c000000b44c61680
[ 90.691347] GPR16: 0000000000000000 000000000000001f 000000000000001f c00000000553a560
[ 90.691347] GPR20: 0000000000000001 0000000000000002 080000000553a560 c000000b5c62a228
[ 90.691347] GPR24: c000000005531110 c000000b5c632238 0000000000000210 0000000000000000
[ 90.691347] GPR28: c000000000998960 c000000bff61bc20 c000000b4ef61f20 f000000002d3bd80
[ 90.692089] NIP [c000000000361224] kfree+0x54/0x270
[ 90.692133] LR [c000000000998960] xhci_urb_free_priv+0x20/0x40
[ 90.692325] Call Trace:
[ 90.692345] [c000000bff61ba80] [c000000bff61bad0] 0xc000000bff61bad0 (unreliable)
[ 90.692402] [c000000bff61bac0] [c000000000998960] xhci_urb_free_priv+0x20/0x40
[ 90.692459] [c000000bff61bae0] [c00000000099bfc8] xhci_giveback_urb_in_irq.isra.22+0x78/0x190
[ 90.692645] [c000000bff61bb40] [c00000000099c350] xhci_td_cleanup+0x130/0x200
[ 90.692702] [c000000bff61bbc0] [c0000000009a175c] handle_tx_event+0x74c/0x1380
[ 90.692759] [c000000bff61bcc0] [c0000000009a2894] xhci_irq+0x504/0xf20
[ 90.692808] [c000000bff61bde0] [c00000000017b110] __handle_irq_event_percpu+0x90/0x300
[ 90.692977] [c000000bff61bea0] [c00000000017b3b8] handle_irq_event_percpu+0x38/0x90
[ 90.693038] [c000000bff61bee0] [c00000000017b474] handle_irq_event+0x64/0xb0
[ 90.693094] [c000000bff61bf10] [c000000000180da0] handle_fasteoi_irq+0xc0/0x230
[ 90.693155] [c000000bff61bf40] [c00000000017972c] generic_handle_irq+0x4c/0x70
[ 90.693332] [c000000bff61bf60] [c00000000001767c] __do_irq+0x7c/0x1c0
[ 90.693383] [c000000bff61bf90] [c00000000002ab70] call_do_irq+0x14/0x24
[ 90.693431] [c000000b431cf9d0] [c00000000001785c] do_IRQ+0x9c/0x130
[ 90.693478] [c000000b431cfa20] [c000000000008ac4] hardware_interrupt_common+0x114/0x120
[ 90.693663] --- interrupt: 501 at __copy_tofrom_user_power7+0x1f4/0x7cc
[ 90.693663] LR = _copy_to_user+0x3c/0x60
[ 90.693736] [c000000b431cfd10] [c000000b431cfdc0] 0xc000000b431cfdc0 (unreliable)
[ 90.693797] [c000000b431cfd30] [c0000000003bfa90] poll_select_copy_remaining+0x180/0x1b0
[ 90.693853] [c000000b431cfda0] [c0000000003c1934] SyS_ppoll+0x104/0x1e0
[ 90.694018] [c000000b431cfe30] [c00000000000b184] system_call+0x58/0x6c
[ 90.694064] Instruction dump:
[ 90.694094] Unable to handle kernel paging request for data at address 0xf000000002ffd860
[ 90.694153] Faulting instruction address: 0xc000000000399624
[ 90.694198] Oops: Kernel access of bad area, sig: 11 [#2]
[ 90.694351] SMP NR_CPUS=2048
[ 90.694351] NUMA
[ 90.694381] PowerNV

I am currently using the latest kernel, version 4.13.0-12.

I just reproduced it with a different stack this time:

[ 2764.725547] Severe Machine check interrupt [Recovered]
[ 2764.725676] NIP [c000000000089268]: __copy_tofrom_user_power7+0x1f4/0x7cc
[ 2764.725743] Initiator: CPU
[ 2764.725764] Error type: SLB [Multihit]
[ 2764.725786] Effective address: 00007fffd16e82c8
[ 2796.015384] Severe Machine check interrupt [Recovered]
[ 2796.015509] NIP [c000000000089268]: __copy_tofrom_user_power7+0x1f4/0x7cc
[ 2796.015586] Initiator: CPU
[ 2796.015701] Error type: SLB [Parity]
[ 2796.015723] Effective address: 00007fffddabe278
[ 2796.073775] Unable to handle kernel paging request for data at address 0xf000000002378020
[ 2796.073949] Faulting instruction address: 0xc000000000309a18
[ 2796.074075] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2796.074104] SMP NR_CPUS=2048
[ 2796.074104] NUMA
[ 2796.074126] PowerNV
[ 2796.074156] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack_netlink nf_conntrack nfnetlink xfs idt_89hpesx snd_hda_codec_hdmi joydev input_leds mac_hid snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore ipmi_powernv at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash ipmi_devintf powernv_rng mtd ipmi_msghandler opal_prd uio ibmpowernv vmx_crypto sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables
[ 2796.074643] autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage ast i2c_algo_bit crct10dif_vpmsum ttm crc32c_vpmsum drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm tg3 ahci libahci
[ 2796.074902] CPU: 40 PID: 21964 Comm: CPU 0/KVM Tainted: G M 4.13.0-15-generic #16-Ubuntu
[ 2796.074955] task: c000000a0b255900 task.stack: c000000a0bf9c000
[ 2796.074990] NIP: c000000000309a18 LR: c000000000309a14 CTR: c00000000030a280
[ 2796.075031] REGS: c000000a0bf9f560 TRAP: 0300 Tainted: G M (4.13.0-15-generic)
[ 2796.075080] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 2796.075083] CR: 48024244 XER: 20000000
[ 2796.075133] CFAR: c00000000006c508 DAR: f000000002378020 DSISR: 40000000 SOFTE: 0
[ 2796.075133] GPR00: c000000000309a14 c000000a0bf9f7e0 c0000000015f3400 f000000002378000
[ 2796.075133] GPR04: 00000000d9458000 0000000000000012 00000000834c0000 0000000000000008
[ 2796.075133] GPR08: f000000000000000 0000000000000001 0000000002378000 c00000000179e958
[ 2796.075133] GPR12: 0000000028004248 c00000000fada400 000072882e440000 000072882e440000
[ 2796.075133] GPR16: 0000000000010000 000074882e430000 c000000ad9458000 0000000000000001
[ 2796.075133] GPR20: 4000000000002000 c00000000179e968 000072882e43ffff 000072882e440000
[ 2796.075133] GPR24: c000000a0bf9f988 0008000000000040 07000000000000c0 0000000000000001
[ 2796.075133] GPR28: c0800008de002386 862300de080080c0 c0000009834c0170 0000000000000004
[ 2796.075513] NIP [c000000000309a18] __get_user_pages_fast+0x798/0xfd0
[ 2796.075549] LR [c000000000309a14] __get_user_pages_fast+0x794/0xfd0
[ 2796.075652] Call Trace:
[ 2796.075699] [c000000a0bf9f7e0] [d0000000070f89e4] kvmppc_run_core+0xeec/0x1370 [kvm_hv] (unreliable)
[ 2796.075749] [c000000a0bf9f900] [c00000000030a390] get_user_pages_fast+0x110/0x160
[ 2796.075793] [c000000a0bf9f950] [d0000000070fe21c] kvmppc_book3s_hv_page_fault+0x384/0xc60 [kvm_hv]
[ 2796.075844] [c000000a0bf9fa40] [d0000000070fa94c] kvmppc_vcpu_run_hv+0x314/0x790 [kvm_hv]
[ 2796.075891] [c000000a0bf9fb10] [d000000006f759ec] kvmppc_vcpu_run+0x34/0x48 [kvm]
[ 2796.075941] [c000000a0bf9fb30] [d000000006f71aa0] kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[ 2796.076100] [c000000a0bf9fbd0] [d000000006f65018] kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[ 2796.076144] [c000000a0bf9fd40] [c0000000003bd6a4] do_vfs_ioctl+0xd4/0xa00
[ 2796.076181] [c000000a0bf9fde0] [c0000000003be094] SyS_ioctl+0xc4/0x130
[ 2796.076217] [c000000a0bf9fe30] [c00000000000b184] system_call+0x58/0x6c
[ 2796.076252] Instruction dump:
[ 2796.076275] Unable to handle kernel paging request for data at address 0xf00000000282fe60
[ 2796.076339] Faulting instruction address: 0xc0000000003995c4
[ 2796.076444] Oops: Kernel access of bad area, sig: 11 [#2]
[ 2796.076473] SMP NR_CPUS=2048
[ 2796.076473] NUMA
[ 2796.076494] PowerNV
[ 2796.076523] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack_netlink nf_conntrack nfnetlink xfs idt_89hpesx snd_hda_codec_hdmi joydev input_leds mac_hid snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore ipmi_powernv at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash ipmi_devintf powernv_rng mtd ipmi_msghandler opal_prd uio ibmpowernv vmx_crypto sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables
[ 2796.078461] autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage ast i2c_algo_bit crct10dif_vpmsum ttm crc32c_vpmsum drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm tg3 ahci libahci
[ 2796.080130] CPU: 40 PID: 21964 Comm: CPU 0/KVM Tainted: G M 4.13.0-15-generic #16-Ubuntu
[ 2796.080797] task: c000000a0b255900 task.stack: c000000a0bf9c000
[ 2796.081128] NIP: c0000000003995c4 LR: c0000000002bf778 CTR: 00000000300303f0
[ 2796.081474] REGS: c000000a0bf9efc0 TRAP: 0300 Tainted: G M (4.13.0-15-generic)
[ 2796.081819] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE>
[ 2796.081822] CR: 48024228 XER: 20000000
[ 2796.082458] CFAR: c0000000002bf774 DAR: f00000000282fe60 DSISR: 40000000 SOFTE: 0
[ 2796.082458] GPR00: c0000000002bf778 c000000a0bf9f240 c0000000015f3400 c000000a0bf9f360
[ 2796.082458] GPR04: 0000000000000004 f00000000282fe40 9000000000001033 0000000000000060
[ 2796.082458] GPR08: 000000000000a0b0 000000000282fe40 c00000000179e8e8 9000000000001003
[ 2796.082458] GPR12: 0000000000004400 c00000000fada400 000072882e440000 000072882e440000
[ 2796.082458] GPR16: 0000000000010000 000074882e430000 c000000ad9458000 0000000000000001
[ 2796.082458] GPR20: 4000000000002000 c00000000179e968 000072882e43ffff 000072882e440000
[ 2796.082458] GPR24: c000000a0bf9f988 c000000000e98308 c000000000e98318 c000000a0bf9f560
[ 2796.082458] GPR28: c000000a0bf9f364 0000000000000000 0000000000000004 c000000a0bf9f360
[ 2796.088348] NIP [c0000000003995c4] __check_object_size+0xc4/0x250
[ 2796.088427] LR [c0000000002bf778] __probe_kernel_read+0x68/0xd0
[ 2796.088750] Call Trace:
[ 2796.089060] [c000000a0bf9f240] [c000000a0bf9f2c0] 0xc000000a0bf9f2c0 (unreliable)
[ 2796.089405] [c000000a0bf9f2c0] [c0000000002bf778] __probe_kernel_read+0x68/0xd0
[ 2796.090048] [c000000a0bf9f300] [c00000000001e010] show_regs+0x300/0x430
[ 2796.090394] [c000000a0bf9f3c0] [c00000000002647c] __die+0xec/0x130
[ 2796.090732] [c000000a0bf9f440] [c000000000026524] die+0x64/0xe0
[ 2796.091091] [c000000a0bf9f480] [c000000000069fb0] bad_page_fault+0xe0/0x14c
[ 2796.091404] [c000000a0bf9f4f0] [c00000000000a4b8] handle_page_fault+0x34/0x38
[ 2796.091745] --- interrupt: 300 at __get_user_pages_fast+0x798/0xfd0
[ 2796.091745] LR = __get_user_pages_fast+0x794/0xfd0
[ 2796.092403] [c000000a0bf9f7e0] [d0000000070f89e4] kvmppc_run_core+0xeec/0x1370 [kvm_hv] (unreliable)
[ 2796.093083] [c000000a0bf9f900] [c00000000030a390] get_user_pages_fast+0x110/0x160
[ 2796.093418] [c000000a0bf9f950] [d0000000070fe21c] kvmppc_book3s_hv_page_fault+0x384/0xc60 [kvm_hv]
[ 2796.094073] [c000000a0bf9fa40] [d0000000070fa94c] kvmppc_vcpu_run_hv+0x314/0x790 [kvm_hv]
[ 2796.094423] [c000000a0bf9fb10] [d000000006f759ec] kvmppc_vcpu_run+0x34/0x48 [kvm]
[ 2796.094777] [c000000a0bf9fb30] [d000000006f71aa0] kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[ 2796.096433] [c000000a0bf9fbd0] [d000000006f65018] kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[ 2796.096785] [c000000a0bf9fd40] [c0000000003bd6a4] do_vfs_ioctl+0xd4/0xa00
[ 2796.097121] [c000000a0bf9fde0] [c0000000003be094] SyS_ioctl+0xc4/0x130
[ 2796.097467] [c000000a0bf9fe30] [c00000000000b184] system_call+0x58/0x6c
[ 2796.098127] Instruction dump:
...

It repeats the above.

Breno got some information that the problem is most likely related to SLB multi-hit.
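
To check a captured dmesg for the machine check signature quoted above, something like the following sketch could be used. The regular expression is derived only from the "Error type: SLB [Multihit]" lines in this report; the function name and the sample text are illustrative, and other machine check formats are not handled.

```python
import re

# Matches lines such as "Error type: SLB [Multihit]" from the log above.
MCE_ERROR_RE = re.compile(r"Error type:\s*(\w+)\s*\[(\w+)\]")


def machine_check_errors(dmesg_text):
    """Return (unit, kind) pairs for every 'Error type' line found."""
    return MCE_ERROR_RE.findall(dmesg_text)


# Two lines copied from the second reproduction in this report.
sample = """\
[ 2764.725786] Error type: SLB [Multihit]
[ 2796.015701] Error type: SLB [Parity]
"""

errors = machine_check_errors(sample)
slb_errors = [e for e in errors if e[0] == "SLB"]
print(slb_errors)
```

Any SLB entry that shows up shortly before an "Unable to handle kernel paging request" oops would match the pattern seen here.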

Mirroring to Launchpad to advise Canonical of this KVM issue...

bugproxy (bugproxy) wrote : log of the error

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-159844 severity-critical targetmilestone-inin1710
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Joseph Salisbury (jsalisbury) wrote :

It would be good to know if this bug is already fixed in the mainline kernel.

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.14 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc5

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Changed in linux (Ubuntu Artful):
importance: High → Critical
Breno Leitão (breno-leitao) wrote :

Hi Joseph,

I tested in the mainline kernel, and the problem does not happen. This is a problem we are only seeing in 17.10 at this moment.

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after applying an update? Was there a prior 17.10 kernel version that did not exhibit this bug?

Could you test the following two kernels:

v4.12 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12/
v4.13 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13/

Anders Hall (a.hall) wrote :

Hi, does this KVM bug affect all users or only "Ubuntu on IBM Power Systems"? I use a Lenovo X1 (latest gen) and also KVM (on Intel hardware). I'm currently holding back the upgrade from 17.04.

Thanks in advance.

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-10-26 10:06 EDT-------
(In reply to comment #20)
> Did this issue start happening after applying an update? Was there a prior
> 17.10 kernel version that did not exhibit this bug?
>
> Could you test the following two kernels:
>
> v4.12 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12/
> v4.13 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13/

The link http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13/ does not contain all the packages for ppc64le. Where can I find them?

Seth Forshee (sforshee) wrote :

The mainline ppc64el builds have been failing due to not having this patch, which has not yet been pushed out to Linus' tree.

https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=186b8f1587c79c2fa04bfa392fdf084443e398c1

Joseph Salisbury (jsalisbury) wrote :

Thanks for the info, Seth. I'll manually build the mainline kernels requested in comment #4 and post a link to them.

Joseph Salisbury (jsalisbury) wrote :

Can you see if this bug also happens with the following two kernels:

v4.13 Upstream:
http://kernel.ubuntu.com/~jsalisbury/lp1725350/4.13/

Ubuntu 17.04 -proposed:
https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/13563578

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-29 16:09 EDT-------
(In reply to comment #25)
> Can you see if this bug also happens with the following two kernels:
>
> v4.13 Upstream:
> http://kernel.ubuntu.com/~jsalisbury/lp1725350/4.13/
>
> Ubuntu 17.04 -proposed:
> https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/13563578

The bug only happens on the v4.13 kernel. It is not reproducible on the Ubuntu 17.04 kernel.

Changed in linux (Ubuntu Artful):
status: New → In Progress
Changed in linux (Ubuntu):
status: New → In Progress
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Artful):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → In Progress
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. We should be able to bisect this issue. Before starting a bisect, can you test the 4.14-rc7 kernel to see if this bug is already fixed there? If it is, we can perform a "reverse" bisect to identify the fix. The kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1725350/4.14
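
A "reverse" bisect searches for the first commit where the bug is gone, rather than the first commit where it appears. The idea can be sketched as a binary search. The commit names and the `is_fixed` callback below are hypothetical stand-ins for building and booting each kernel, and the sketch assumes that the oldest commit is broken, the newest is fixed, and the bug stays fixed once it is fixed.

```python
def first_fixed(commits, is_fixed):
    """Return the first commit (oldest to newest) for which is_fixed() is True.

    Assumes commits[0] is known broken and commits[-1] is known fixed.
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # the fix is at mid or earlier
        else:
            lo = mid  # the fix is after mid
    return commits[hi]


# Hypothetical history of ten commits where the fix landed in "c6".
history = ["c%d" % i for i in range(10)]
fixed_from = history.index("c6")
result = first_fixed(history, lambda c: history.index(c) >= fixed_from)
print(result)
```

Each probe halves the remaining range, so a range of N commits needs about log2(N) test kernels.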

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-31 07:11 EDT-------
Leonardo,

Who can help with this KVM issue on 17.10?

Manoj Iyer (manjo)
tags: added: triage-g
Changed in ubuntu-power-systems:
status: In Progress → Incomplete
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Incomplete → In Progress
status: In Progress → Incomplete
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-10 12:18 EDT-------
So far I've only been able to reproduce this 2 ways:
a) booting up a Debian Jessie guest (kernel 3.16). Generally the crash
happens some time after boot, but in some situations it needs some
"help", like running "useradd <newuser>".
b) booting up an Ubuntu 16.04 guest, which doesn't seem to ever trigger
the issue itself, but then chrooting into that same Debian Jessie
image (attached as a 2nd virtio disk), and then running that same
"useradd <newuser>".

Using these test cases, the crash appears to be during the first
instance of compound_head(page) within mm/gup.c:gup_pte_range()

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c?h=v4.13#n1312

The compound_head(page) call results in a pointer dereference of a
struct page *, via page->compound_head, and that generates a page
fault which leads to the crash. The address of *page in one instance
of the crash was 0xf00000000783af20:

[95667.639406] Unable to handle kernel paging request for data at address 0xf00000000783af20
[95667.639518] Faulting instruction address: 0xc000000000309714
90:mon> t
[c000001e3c4db900] c00000000030a3d0 get_user_pages_fast+0x110/0x160
[c000001e3c4db950] d0000000181be21c kvmppc_book3s_hv_page_fault+0x384/0xc60 [kvm_hv]
[c000001e3c4dba40] d0000000181ba94c kvmppc_vcpu_run_hv+0x314/0x790 [kvm_hv]
[c000001e3c4dbb10] d0000000181059ec kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000001e3c4dbb30] d000000018101aa0 kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[c000001e3c4dbbd0] d0000000180f5018 kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[c000001e3c4dbd40] c0000000003bd6e4 do_vfs_ioctl+0xd4/0xa00
[c000001e3c4dbde0] c0000000003be0d4 SyS_ioctl+0xc4/0x130
[c000001e3c4dbe30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 000079d53a595550
SP (79d5354ede40) is in userspace

The 0xf address corresponds to the vmemmap area, where page structs are
allocated sequentially for all PFNs in the system, so it isn't obviously
a bad address. Some of our kernel folks took a look at this and worked out
that with a 64-byte sizeof(struct page), 0xf00000000783af20 corresponds to
PFN 0x783af20 / 64 = 1969852 (rounded down). For a 64K page size this
corresponds to 1969852*64K, an address at around 120GB, which
is in the range of physical memory on the system (0-128GB in this case).
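
The arithmetic above can be checked with a short sketch. It assumes the vmemmap array starts at 0xf000000000000000 (inferred here from the 0xf prefix, not stated in the comment), and uses the 64-byte struct page and 64K page size given above. The function and constant names are illustrative.

```python
VMEMMAP_BASE = 0xF000000000000000  # assumed start of the vmemmap region
STRUCT_PAGE_SIZE = 64              # sizeof(struct page), as stated above
PAGE_SIZE = 64 * 1024              # 64K host page size, as stated above


def vmemmap_addr_to_pfn(addr):
    """Return (pfn, byte offset within that struct page) for a vmemmap address."""
    return divmod(addr - VMEMMAP_BASE, STRUCT_PAGE_SIZE)


pfn, offset = vmemmap_addr_to_pfn(0xF00000000783AF20)
phys = pfn * PAGE_SIZE             # physical address of the page frame
print(pfn, hex(offset), round(phys / 2**30, 2))
```

The division leaves a remainder of 0x20, which would be consistent with the faulting access landing on a field inside the struct page rather than on its first byte. That reading is an inference from the numbers, not something stated in the report.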

Since the *page address appeared valid, it was suggested that the issue
was with the vmemmap area being "unbolted" by KVM, leading to a page
fault for an address that should always be pinned/bolted within the
host, and the following fix was suggested:

commit 67f8a8c1151c9ef3d1285905d1e66ebb769ecdf7
Author: Paul Mackerras <email address hidden>
Date: Tue Sep 12 13:47:23 2017 +1000
KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly

I've tested this patch against kernel 4.13.0-16-generic, and at least for
test cases a) and b) above, this does appear to resolve the issue.

So it looks like we need kernel commit 67f8a8c115 pulled into 17.10 to resolve
this bug.

Joseph Salisbury (jsalisbury) wrote :

I built a 17.10(Artful) test kernel with a pick of the following commit:
67f8a8c1151c ("KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly")

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1725350/

Can you test this kernel and see if it resolves this bug?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-14 18:07 EDT-------
(In reply to comment #34)
> I built a 17.10(Artful) test kernel with a pick of the following commit:
> 67f8a8c1151c ("KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored
> incorrectly")
>
> The test kernel can be downloaded from:
> http://kernel.ubuntu.com/~jsalisbury/lp1725350/
>
> Can you test this kernel and see if it resolves this bug?

Thanks! I've retried test cases a) and b) above using this kernel, and it does appear to resolve the issue.

bugproxy (bugproxy)
tags: removed: bugnameltc-159844 kernel-key severity-critical triage-g
Joseph Salisbury (jsalisbury) wrote :

Commit 67f8a8c1151c is now in the -proposed 4.13.0-17.20 Artful kernel.

Would it be possible for you to test the proposed kernel and post back if it resolves this bug?
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.

Thank you in advance!
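
To confirm that a host is running a kernel at or above the version named above, the release string from `uname -r` can be compared against 4.13.0-17. This is a minimal sketch that only makes sense for Artful 4.13 kernels; the function name is hypothetical, and it looks only at the ABI number, not the upload number (.20).

```python
import re

# 4.13.0-17 is the ABI of the -proposed Artful kernel named above (4.13.0-17.20).
FIXED_RELEASE = (4, 13, 0, 17)


def artful_kernel_has_fix(release):
    """Compare a `uname -r` style string such as '4.13.0-18-generic'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in m.groups()) >= FIXED_RELEASE


print(artful_kernel_has_fix("4.13.0-16-generic"))
print(artful_kernel_has_fix("4.13.0-18-generic"))
```

On a live system the release string could be taken from `platform.release()` in the Python standard library.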

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: Incomplete → Fix Committed
tags: added: triage-g
bugproxy (bugproxy)
tags: added: bugnameltc-159844 severity-critical
Breno Leitão (breno-leitao) wrote :

I tested kernel 4.13.0-18 and I do not see this problem anymore. Marking it as verification-done.

I am also no longer seeing the problem reported at LP#1733864. I wonder if the two were related.

➜ ~ uname -a
Linux 1710 4.13.0-18-generic #21-Ubuntu SMP Tue Nov 21 17:00:07 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

Manoj Iyer (manjo)
tags: added: verification-done-artful
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released