Bug #1736390 “openvswitch: kernel oops destroying interfaces on ...” : Bugs : linux package : Ubuntu

Revision history for this message

James Page (james-page) wrote on 2017-12-05:

#1

Only seen on i386; all other archs pass OK.

description:

updated

James Page (james-page) wrote on 2017-12-05:

#2

See the same without proposed (i.e. I don't think it's the version of ovs in proposed that's causing this problem).

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2017-12-05: Missing required logs.

#3

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1736390

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete
tags:	added: artful

James Page (james-page) on 2017-12-05

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Joseph Salisbury (jsalisbury) wrote on 2017-12-05: Re: openvswitch: kernel opps destroying interfaces on i386

#4

Is this issue new in bionic? Do you happen to know of a prior kernel version that does not exhibit the bug?

Also, could you see if this bug happens with the latest mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc2

Changed in linux (Ubuntu):
importance:	Undecided → Medium
Changed in linux (Ubuntu Bionic):
importance:	Medium → High
tags:	added: kernel-key
Changed in linux (Ubuntu Bionic):
status:	Confirmed → Triaged

Joseph Salisbury (jsalisbury) on 2017-12-11

tags:

added: kernel-da-key
removed: kernel-key

James Page (james-page) wrote on 2017-12-13:

#5

From autopkgtest histories:

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-artful/artful/i386/o/openvswitch/20170904_183010_82746@/log.gz

linux-generic is already the newest version (4.12.0.12.13).

That was for the 2.8.0-0ubuntu1 upload during artful development.

That said, subsequent tests were using:

linux-generic is already the newest version (4.12.0.12.13).

James Page (james-page) wrote on 2017-12-13:

#6

Testing with the mainline kernel I don't even get as far as the original error - the instance I'm testing with locks up as soon as the performance test is executed.

James Page (james-page) wrote on 2017-12-13:

#7

OK so to summarize what's concrete here:

Only impacts i386
Impacts both ovs 2.8.0 and 2.8.1
Impacts artful and bionic

James Page (james-page) wrote on 2017-12-13:

#8

Stacktrace from artful:

Dec 13 12:06:09 artful-i386-testing kernel: [  160.913030] BUG: unable to handle kernel NULL pointer dereference at   (null)
Dec 13 12:06:09 artful-i386-testing kernel: [  160.915102] IP: igmp_group_dropped+0x21/0x220
Dec 13 12:06:09 artful-i386-testing kernel: [  160.916317] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
Dec 13 12:06:09 artful-i386-testing kernel: [  160.916329]
Dec 13 12:06:09 artful-i386-testing kernel: [  160.917728] Oops: 0000 [#1] SMP
Dec 13 12:06:09 artful-i386-testing kernel: [  160.918345] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack ppdev kvm_intel kvm irqbypass crc32_pclmul input_leds serio_raw joydev parport_pc parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid aesni_intel hid aes_i586 crypto_simd cryptd virtio_blk psmouse virtio_net floppy
Dec 13 12:06:09 artful-i386-testing kernel: [  160.931511] CPU: 0 PID: 29 Comm: kworker/u2:1 Tainted: G        W       4.13.0-16-generic #19-Ubuntu
Dec 13 12:06:09 artful-i386-testing kernel: [  160.933321] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
Dec 13 12:06:09 artful-i386-testing kernel: [  160.934970] Workqueue: netns cleanup_net
Dec 13 12:06:09 artful-i386-testing kernel: [  160.937847] task: f577c200 task.stack: f4c5a000
Dec 13 12:06:09 artful-i386-testing kernel: [  160.938694] EIP: igmp_group_dropped+0x21/0x220
Dec 13 12:06:09 artful-i386-testing kernel: [  160.940625] EFLAGS: 00010202 CPU: 0
Dec 13 12:06:09 artful-i386-testing kernel: [  160.941432] EAX: 00000000 EBX: f2b32c60 ECX: f4c5be68 EDX: 00000002
Dec 13 12:06:09 artful-i386-testing kernel: [  160.942712] ESI: 00000000 EDI: f29c1700 EBP: f4c5bdc8 ESP: f4c5bd90
Dec 13 12:06:09 artful-i386-testing kernel: [  160.944471]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Dec 13 12:06:09 artful-i386-testing kernel: [  160.945612] CR0: 80050033 CR2: 00000000 CR3: 1ad8a000 CR4: 000406f0
Dec 13 12:06:09 artful-i386-testing kernel: [  160.946879] Call Trace:
Dec 13 12:06:09 artful-i386-testing kernel: [  160.949247]  ? __wake_up+0x36/0x40
Dec 13 12:06:09 artful-i386-testing kernel: [  160.950074]  ip_mc_down+0x27/0x90
Dec 13 12:06:09 artful-i386-testing kernel: [  160.951557]  inetdev_event+0x398/0x4e0
Dec 13 12:06:09 artful-i386-testing kernel: [  160.953391]  ? skb_dequeue+0x5b/0x70
Dec 13 12:06:09 artful-i386-testing kernel: [  160.954246]  ? wireless_nlevent_flush+0x4c/0x90
Dec 13 12:06:09 artful-i386-testing kernel: [  160.955656]  notifier_call_chain+0x4e/0x70
Dec 13 12:06:09 artful-i386-testing kernel: [  160.956814]  raw_notifier_call_chain+0x11/0x20
Dec 13 12:06:09 artful-i386-testing kernel: [  160.957803]  call_netdevice_notifiers_info+0x2a/0x60
Dec 13 12:06:09 artful-i386-testing kernel: [  160.958886]  dev_close_many+0x9d/0xe0
Dec 13 12:06:09 artful-i386-testing kernel: [  160.962228]  rollback_registered_many+0xd7/0x380
Dec 13 12:06:09 artful-i386-testing kernel: [  160.963690]  unregister_netdevice_many.part.102+0x10/0x80
Dec 13 12:06:09 artful-i386-testing kernel: [  160.964981]  default_device_exit_batch+0x134/0x160
Dec 13 12:06:09 artful-i386-testing kernel: [  160.966026]  ? do_wait_intr_irq+0x80/0x80
Dec 13 12:06:09 artful-i386-testing kernel: [  160.966937]  ops_exit_list.isra.8+0x4d/0x60
Dec 13 12:06:09 artful-i386-testing kernel: [  160.969310]  cleanup_net+0x18e/0x260
Dec 13 12:06:09 artful-i386-testing kernel: [  160.973290]  process_one_work+0x1a0/0x390
Dec 13 12:06:09 artful-i386-testing kernel: [  160.975236]  worker_thread+0x37/0x440
Dec 13 12:06:09 artful-i386-testing kernel: [  160.976998]  kthread+0xf3/0x110
Dec 13 12:06:09 artful-i386-testing kernel: [  160.977853]  ? process_one_work+0x390/0x390
Dec 13 12:06:09 artful-i386-testing kernel: [  160.978803]  ? kthread_create_on_node+0x20/0x20
Dec 13 12:06:09 artful-i386-testing kernel: [  160.980521]  ret_from_fork+0x19/0x24
Dec 13 12:06:09 artful-i386-testing kernel: [  160.981350] Code: 90 90 90 90 90 90 90 90 90 90 66 66 66 66 90 55 89 e5 57 56 53 89 c3 83 ec 2c 8b 33 65 a1 14 00 00 00 89 45 f0 31 c0 80 7b 4b 00 <8b> 06 8b b8 20 03 00 00 8b 43 04 0f 85 5e 01 00 00 3d e0 00 00
Dec 13 12:06:09 artful-i386-testing kernel: [  160.987207] EIP: igmp_group_dropped+0x21/0x220 SS:ESP: 0068:f4c5bd90
Dec 13 12:06:09 artful-i386-testing kernel: [  160.988793] CR2: 0000000000000000
Dec 13 12:06:09 artful-i386-testing kernel: [  160.989603] ---[ end trace 9fe78986aa6abc43 ]---
Dec 13 12:12:22 artful-i386-testing kernel: [    0.000000] random: get_random_bytes called from start_kernel+0x35/0x41e with crng_init=0
Dec 13 12:12:22 artful-i386-testing kernel: [    0.000000] Linux version 4.13.0-16-generic (buildd@lgw01-amd64-011) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:33:49 UTC 2017 (Ubuntu 4.13.0-16.19-generic 4.13.4)

James Page (james-page) wrote on 2017-12-13:

#9

And some more testing

xenial: ovs 2.5.2 + 4.4/4.10/4.13 - OK
xenial: ovs 2.8.1 + 4.13 - FAIL

so maybe this is something that ovs is doing that's breaking the kernel.

James Page (james-page) wrote on 2017-12-13:

#10

[  118.059308] BUG: unable to handle kernel NULL pointer dereference at   (null)
[  118.065034] IP: rtmsg_ifa+0x2d/0xe0
[  118.065744] *pdpt = 000000003434e001 *pde = 0000000000000000

[  118.067166] Oops: 0000 [#1] SMP
[  118.067863] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack ppdev kvm_intel kvm irqbypass parport_pc input_leds parport joydev serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crc32_pclmul pcbc hid_generic usbhid aesni_intel hid aes_i586 psmouse crypto_simd virtio_net cryptd virtio_blk floppy
[  118.081568] CPU: 0 PID: 29 Comm: kworker/u2:1 Tainted: G        W       4.13.0-19-generic #22~16.04.1-Ubuntu
[  118.088041] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[  118.090334] Workqueue: netns cleanup_net
[  118.091146] task: f537e300 task.stack: f485a000
[  118.092077] EIP: rtmsg_ifa+0x2d/0xe0
[  118.092778] EFLAGS: 00010246 CPU: 0
[  118.093654] EAX: 00000000 EBX: f3674e00 ECX: 00000000 EDX: 014000c0
[  118.095002] ESI: 00000000 EDI: f34dad80 EBP: f485bd94 ESP: f485bd7c
[  118.101126]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  118.102531] CR0: 80050033 CR2: 00000000 CR3: 33df3860 CR4: 000406f0
[  118.104276] Call Trace:
[  118.105159]  __inet_del_ifa+0x129/0x270
[  118.106192]  ? igmpv3_clear_delrec+0x28/0xb0
[  118.107392]  inetdev_event+0x1ff/0x4e0
[  118.113963]  ? __schedule+0x41e/0x8d0
[  118.115418]  notifier_call_chain+0x4e/0x70
[  118.117049]  ? notifier_call_chain+0x4e/0x70
[  118.118081]  raw_notifier_call_chain+0x11/0x20
[  118.119125]  call_netdevice_notifiers_info+0x2a/0x60
[  118.125600]  rollback_registered_many+0x268/0x370
[  118.127652]  unregister_netdevice_many+0x16/0x80
[  118.129980]  ? unregister_netdevice_many+0x16/0x80
[  118.136510]  default_device_exit_batch+0x126/0x150
[  118.138577]  ? do_wait_intr_irq+0x80/0x80
[  118.140326]  ops_exit_list.isra.8+0x4d/0x60
[  118.142126]  cleanup_net+0x18e/0x270
[  118.143731]  process_one_work+0x118/0x390
[  118.149200]  worker_thread+0x37/0x410
[  118.150881]  kthread+0xdb/0x110
[  118.152617]  ? process_one_work+0x390/0x390
[  118.154530]  ? kthread_create_on_node+0x20/0x20
[  118.161804]  ret_from_fork+0x19/0x24
[  118.163523] Code: 66 66 90 55 89 e5 57 56 53 89 d7 89 ce 83 ec 0c 85 c9 89 45 ec 0f 84 93 00 00 00 8b 41 08 89 45 f0 8b 47 0c 31 c9 ba c0 00 40 01 <8b> 00 8b 80 20 03 00 00 6a ff 89 45 e8 b8 60 00 00 00 e8 7c 3e
[  118.175426] EIP: rtmsg_ifa+0x2d/0xe0 SS:ESP: 0068:f485bd7c
[  118.177188] CR2: 0000000000000000
[  118.178305] ---[ end trace d1d2a116a66e2f9d ]---
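For comparing traces like the ones above (they fault in different functions but on the same workqueue), here is a minimal sketch that pulls the faulting symbol and the workqueue out of a saved oops log. It assumes the i386 oops format shown in this report; the function name `oops_summary` is made up for illustration.

```shell
# Print the faulting symbol and the workqueue from a kernel oops read on stdin.
# Assumes lines of the form "... EIP: symbol+0xoff/0xlen" and "... Workqueue: ...",
# as in the traces in this report.
oops_summary() {
    awk '
        / EIP: [A-Za-z_]/ && eip == "" { s = $0; sub(/.* EIP: /, "", s); sub(/[+].*/, "", s); eip = s }
        / Workqueue: /                 { s = $0; sub(/.* Workqueue: /, "", s); wq = s }
        END { printf "fault=%s workqueue=%s\n", eip, wq }
    '
}

# Usage (the file name is a placeholder):
#   oops_summary < oops.log
```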

James Page (james-page) wrote on 2017-12-13:

#11

And some more testing:

xenial: ovs 2.8.1 + 4.4 - OK
xenial: ovs 2.8.1 + 4.10 - OK

so the issue appears to be the combination of ovs 2.8.1 (and 2.8.0) with the 4.13 kernel.

James Page (james-page) wrote on 2017-12-13:

#12

More log output from ovs:

James Page (james-page) wrote on 2017-12-13:

#13

root 16037 1 0 12:39 ? 00:00:00 ovsdb-server: monitoring pid 16038 (healthy)
root 16038 16037 0 12:39 ? 00:00:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch
root 16051 1 0 12:39 ? 00:00:00 ovs-vswitchd: monitoring pid 16052 (healthy)
root 16052 16051 0 12:39 ? 00:00:00 [ovs-vswitchd] <defunct>

and

root 17517 16216 0 12:40 pts/0 00:00:00 ovs-vsctl --if-exists del-br s1 -- --if-exists del-br s2

Joseph Salisbury (jsalisbury) wrote on 2017-12-13:

#14

Can you see if using earlier kernel versions makes the bug go away?

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/

James Page (james-page) wrote on 2017-12-13:

#15

I don't have a data point for 4.11, but at 4.10 (as shipped in zesty) ovs 2.8.1 tests are OK.

James Page (james-page) wrote on 2017-12-13:

#16

Looking at the test history - artful passed tests with 4.11 and 4.12 kernels with 2.8.0, but appears to have started failing sporadically when 4.13 entered the archive.

Joseph Salisbury (jsalisbury) wrote on 2017-12-13:

#17

Would you be able to test some test kernels? If so, we can try to bisect down to which commit introduced the regression.

Greg Rose (gvrose) wrote on 2017-12-15:

#18

James Page asked me to post some findings here:

Here’s the trace I’m getting (same as the one in comment #10):

[ 5152.142936] device s1 left promiscuous mode
[ 5152.427823] BUG: unable to handle kernel NULL pointer dereference at   (null)
[ 5152.428422] IP: rtmsg_ifa+0x30/0xd0
[ 5152.428816] *pdpt = 0000000033f65001 *pde = 0000000000000000
[ 5152.428820]
[ 5152.429682] Oops: 0000 [#1] SMP
[ 5152.430046] Modules linked in: veth netconsole openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack ppdev snd_hda_codec_generic snd_hda_intel snd_hda_codec joydev snd_hda_core snd_hwdep snd_pcm input_leds serio_raw snd_timer snd pvpanic parport_pc i2c_piix4 soundcore mac_hid parport sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm crc32_pclmul drm_kms_helper pcbc aesni_intel syscopyarea aes_i586 sysfillrect crypto_simd sysimgblt fb_sys_fops psmouse cryptd virtio_net virtio_blk drm pata_acpi floppy
[ 5152.433348] CPU: 1 PID: 90 Comm: kworker/u4:3 Tainted: G        W       4.13.0-16-generic #19-Ubuntu
[ 5152.433852] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5152.434346] Workqueue: netns cleanup_net
[ 5152.434816] task: f17aa100 task.stack: f4ef0000
[ 5152.435302] EIP: rtmsg_ifa+0x30/0xd0
[ 5152.435780] EFLAGS: 00010246 CPU: 1
[ 5152.436254] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 014000c0
[ 5152.436764] ESI: 00000000 EDI: f063a6c0 EBP: f4ef1dcc ESP: f4ef1db4
[ 5152.437267]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 5152.437780] CR0: 80050033 CR2: 00000000 CR3: 33c3f4a0 CR4: 001406f0
[ 5152.438311] Call Trace:
[ 5152.438816]  __inet_del_ifa+0xbb/0x260
[ 5152.439344]  ? igmpv3_clear_delrec+0x28/0xa0
[ 5152.439868]  inetdev_event+0x22f/0x4e0
[ 5152.440401]  ? skb_dequeue+0x5b/0x70
[ 5152.440934]  ? wireless_nlevent_flush+0x4c/0x90
[ 5152.441487]  notifier_call_chain+0x4e/0x70
[ 5152.442016]  raw_notifier_call_chain+0x11/0x20
[ 5152.442554]  call_netdevice_notifiers_info+0x2a/0x60
[ 5152.443097]  rollback_registered_many+0x21c/0x380
[ 5152.443646]  unregister_netdevice_many.part.102+0x10/0x80
[ 5152.444180]  default_device_exit_batch+0x134/0x160
[ 5152.444709]  ? do_wait_intr_irq+0x80/0x80
[ 5152.445223]  ops_exit_list.isra.8+0x4d/0x60
[ 5152.445744]  cleanup_net+0x18e/0x260
[ 5152.446264]  process_one_work+0x1a0/0x390
[ 5152.446790]  worker_thread+0x37/0x440
[ 5152.447321]  kthread+0xf3/0x110
[ 5152.447843]  ? process_one_work+0x390/0x390
[ 5152.448380]  ? kthread_create_on_node+0x20/0x20
[ 5152.448919]  ret_from_fork+0x19/0x24
[ 5152.449462] Code: 55 89 e5 57 56 53 89 d7 89 ce 83 ec 0c 85 c9 89 45 e8 c7 45 f0 00 00 00 00 74 06 8b 41 08 89 45 f0 8b 47 0c 31 c9 ba c0 00 40 01 <8b> 00 8b 80 20 03 00 00 6a ff 89 45 ec b8 60 00 00 00 e8 19 46
[ 5152.450719] EIP: rtmsg_ifa+0x30/0xd0 SS:ESP: 0068:f4ef1db4
[ 5152.451308] CR2: 0000000000000000
[ 5152.451885] ---[ end trace 5cdfc95a5b343f5c ]---

Looks to me like 4.13 is missing this set of patches from Cong Wang:

commit 623859ae06b85cabba79ce78f0d49e67783d4c34
Merge: 8f56246 35c55fc
Author: David S. Miller <davem@davemloft.net>
Date:   Thu Nov 9 10:03:10 2017 +0900

    Merge branch 'net-sched-race-fix'

    Cong Wang says:

    ====================
    net_sched: close the race between call_rcu() and cleanup_net()

    This patchset tries to fix the race between call_rcu() and
    cleanup_net() again. Without holding the netns refcnt the
    tc_action_net_exit() in netns workqueue could be called before
    filter destroy works in tc filter workqueue. This patchset
    moves the netns refcnt from tc actions to tcf_exts, without
    breaking per-netns tc actions.

    Patch 1 reverts the previous fix, patch 2 introduces two new
    API's to help to address the bug and the rest patches switch
    to the new API's. Please see each patch for details.

    I was not able to reproduce this bug, but now after adding
    some delay in filter destroy work I manage to trigger the
    crash. After this patchset, the crash is not reproducible
    any more and the debugging printk's show the order is expected
    too.
    ====================

    Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc action
    Reported-by: Lucas Bates <lucasb@mojatatu.com>
    Cc: Lucas Bates <lucasb@mojatatu.com>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Note the comment he makes about “filter destroy work” and how the final function in the trace is __inet_del_ifa().  As you can see from the trace the machine is executing the netns cleanup_net() function when the panic occurs.  This series of patches has not been backported to the 4.13.16 kernel.

James Page (james-page) wrote on 2017-12-18:

#19

Thanks Greg

Joseph - would it be possible to get a 4.13 kernel prepared with the identified patch set cherry-picked?

Joseph Salisbury (jsalisbury) on 2017-12-18

Changed in linux (Ubuntu Artful):
status:	New → In Progress
importance:	Undecided → High
assignee:	nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
status:	Triaged → Fix Committed
assignee:	nobody → Joseph Salisbury (jsalisbury)
tags:	added: bionic

Joseph Salisbury (jsalisbury) wrote on 2017-12-20:

#20

@Greg, do you happen to know which commit in that merge is the actual fix for this bug?
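One way to see which individual commits the merge from comment #18 brought in is to list everything reachable from its second parent but not its first. The sketch below assumes a mainline kernel git checkout that contains that merge; `merge_commits` is a hypothetical helper name, and the path in the usage line is a placeholder.

```shell
# List the non-merge commits a merge commit pulled in from its second parent.
# $1 = path to a kernel git tree, $2 = merge commit (both supplied by the caller)
merge_commits() {
    git -C "$1" log --oneline --no-merges "$2^1..$2^2"
}

# Example:
#   merge_commits ~/linux 623859ae06b85cabba79ce78f0d49e67783d4c34
```

This only enumerates the candidates; it does not say which of them is the actual fix.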

Launchpad Janitor (janitor) wrote on 2018-01-03:

#21

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openvswitch (Ubuntu Artful):
status:	New → Confirmed
Changed in openvswitch (Ubuntu):
status:	New → Confirmed

Christian Ehrhardt  (paelzer) wrote on 2018-01-03:

#23

From the Dup - FYI
To reproduce do:
$ autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G
$ pull-lp-source openvswitch
$ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.8.0~git20170809.7aa47a19d-0ubuntu1.dsc -- qemu ~/work/autopkgtest-artful-i386.img
# This guest currently will crash after a while of testing

Really easy to reproduce (and close to 100% fail rate)

Steve Langasek (vorlon) on 2018-01-03

summary:

- openvswitch: kernel opps destroying interfaces on i386
+ openvswitch: kernel oops destroying interfaces on i386

Joseph Salisbury (jsalisbury) wrote on 2018-01-04:

#24

I tried to build a test kernel with the patch set from Cong mentioned in comment #18. However, there was a build failure due to some missing prereq commits. I also had to backport many of the patches, since they were not clean cherry picks.

Per comment 16, the bug did not start happening until the 4.13 kernel was introduced to Artful. We may want to look for a commit to revert instead of Cong's patch set, since it may not pass all the SRU requirements.

I'll create a VM and see if I can reproduce the bug. If I can, I'll bisect down to the offending commit in 4.13.

Just to confirm, this is only happening on the i386 server release?
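The bisect mentioned above could be started roughly as follows. This is a sketch under assumptions: a mainline kernel git tree with tags `v4.12` (passing, per comment 16) and `v4.13` (failing), and a hypothetical helper name `start_bisect`. Each step still means building and booting the kernel that git checks out and rerunning the openvswitch test by hand.

```shell
# Start a bisect between a known-good and a known-bad revision.
# $1 = path to a kernel git tree, $2 = good revision, $3 = bad revision
start_bisect() {
    git -C "$1" bisect start "$3" "$2"   # git expects the bad revision first
}

# Example (path and tags are assumptions):
#   start_bisect ~/linux v4.12 v4.13
# After testing the checked-out revision, mark it and let git pick the next one:
#   git -C ~/linux bisect good    # tests passed
#   git -C ~/linux bisect bad     # guest oopsed
```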

Joseph Salisbury (jsalisbury) wrote on 2018-01-04:

#25

Attachment: hang.jpg (5.5 MiB, image/jpeg)

I'm attempting to reproduce the bug. However, I get a hang when I run the following:
$ autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G

Attached is a screen shot of the hang. It seems to be trying to boot the image.

Joseph Salisbury (jsalisbury) wrote on 2018-01-04:

#26

Also, I'm trying to run this on a VM instead of bare hardware. Could that be an issue?

Christian Ehrhardt  (paelzer) wrote on 2018-01-05:

#27

@Joseph - if you mean 2nd-level (nested) virtualization, like so:
Machine
->KVM
-> $ autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G
That could be an issue, since 2nd-level virtualization is famous for only mostly working.

But why would you do so - since the tests are in VMs they are already more or less host release agnostic. You can run that on any host system - although in general having "autopkgtest" from $RELEASE-backports is often a good idea.

Just ran a check if buildvm works for me atm, and it did with the same cmdline you had (Xenial + autopkgtest backport) - it "appears" to hang at that stage, but is done after ~1-2 minutes. In 2nd level that could as well be just way more inefficient and take much longer?

Joseph Salisbury (jsalisbury) wrote on 2018-01-05:

#28

@Christian, Thanks for the info. I am now able to reproduce the bug using your steps. Thanks again for that!

I'm going to dig deeper into this and see if I can bisect down to the commit that caused this. I next need to figure out how to swap in different test kernels. I think you explained how to do that in the duplicate bug, so I'll take a look there.

I'll post an update shortly.

Joseph Salisbury (jsalisbury) wrote on 2018-01-05:

#29

I'm running into an issue mounting the image to get a kernel onto it. For some reason, I cannot mount /dev/nbd0p1:

sudo modprobe nbd

sudo qemu-nbd --connect=/dev/nbd0 ./autopkgtest-artful-i386.img

sudo mount /dev/nbd0p1 /mnt/test/
mount: /mnt/test: special device /dev/nbd0p1 does not exist.

sudo fdisk /dev/nbd0

Welcome to fdisk (util-linux 2.30.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): p
Disk /dev/nbd0: 12.2 GiB, 13098811392 bytes, 25583616 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xae40d318

Device Boot Start End Sectors Size Id Type
/dev/nbd0p1 * 2048 25583582 25581535 12.2G 83 Linux

fdisk can see that the nbd device has one partition, but I can't seem to mount it. Any suggestions?

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-05:

#30

After running qemu-nbd:

ls /dev/nbd*
nbd0 nbd1 nbd10 nbd11 nbd12 nbd13 nbd14 nbd15 nbd2 nbd3 nbd4 nbd5 nbd6 nbd7 nbd8 nbd9
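
One possible cause, not confirmed anywhere in this thread: when the nbd module is loaded without its max_part parameter, some kernels create no partition nodes such as /dev/nbd0p1, even though fdisk can read the partition table. A minimal sketch of a workaround, assuming root access, that nothing else is using nbd, and that partprobe (from parted) is installed. The helper names nbd_part and mount_nbd_image are made up for illustration:

```shell
#!/bin/sh
# Sketch only: expose the partitions of an image attached through qemu-nbd.
# Assumption: nbd was loaded without max_part, so no /dev/nbd0p1 node exists.

# Build the partition node name, e.g. /dev/nbd0 and 1 give /dev/nbd0p1.
nbd_part() {
    printf '%sp%s\n' "$1" "$2"
}

# Attach an image and mount its first partition (needs root; not run here).
# $1 = image file, $2 = mount point, $3 = nbd device (defaults to /dev/nbd0)
mount_nbd_image() {
    img=$1; mnt=$2; dev=${3:-/dev/nbd0}
    modprobe -r nbd 2>/dev/null || true    # unload so the parameter takes effect
    modprobe nbd max_part=16               # allow up to 16 partitions per device
    qemu-nbd --connect="$dev" "$img"
    partprobe "$dev" 2>/dev/null || true   # ask the kernel to re-read the table
    mount "$(nbd_part "$dev" 1)" "$mnt"
}
```

Usage would be something like `mount_nbd_image ./autopkgtest-artful-i386.img /mnt/test` as root, followed by `umount /mnt/test` and `qemu-nbd --disconnect /dev/nbd0` when done.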

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-05:

#31

I'm able to install a kernel using kvm:
kvm -m 512 -hda ./autopkgtest-artful-i386.img

Is there a way to modify the autopkgtest command line to tell it which kernel to boot? When I have multiple test kernels installed, I usually just select them from the GRUB menu. What would be the equivalent here?

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-01-08: Re: [Bug 1736390] Re: openvswitch: kernel oops destroying interfaces on i386

#32

On Fri, Jan 5, 2018 at 9:27 PM, Joseph Salisbury
<email address hidden> wrote:
> I'm able to install a kernel using kvm:
> kvm -m 512 -hda ./autopkgtest-artful-i386.img

If that works for you, fine.

> Is there a way to modify the autopkgtest command line to tell it which
> kernel to boot? When I have multiple test kernels installed, I usually
> just select them from the GRUB menu. What would be the equivalent here?

I must admit, for automated tests I set the one to test as the default
booted kernel.

If you really want to bisect you might want to go a step further.
The tool to drive tests via qemu is "autopkgtest-virt-qemu" and it has
"--qemu-options=".
That said (I never tried) you should be able to build (e.g. by bisect)
a valid kernel outside (make sure that it has all modules in a way to
work without having the kernel actually installed).
Via that you could use qemu options:
-kernel
-initrd
-append
as needed.
Once set up in a way to work you could really bisect through that.
OTOH that might be just as much work as installing them manually a few
times - so your choice what you prefer.
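
The direct-boot idea above could be sketched like this. It is untested, as noted above. The kernel and initrd paths and the root= value are placeholders, the helper name direct_boot_opts is made up, and how the qemu runner splits the --qemu-options string (and so whether a kernel command line containing spaces survives) is an assumption to check against its man page:

```shell
#!/bin/sh
# Sketch only: build a --qemu-options value that boots an externally built kernel.
# -kernel, -initrd and -append are standard QEMU options.

# $1 = kernel image, $2 = initrd, $3 = kernel command line (a single token here)
direct_boot_opts() {
    printf '%s %s %s %s %s %s\n' -kernel "$1" -initrd "$2" -append "$3"
}

# Hypothetical paths from a bisect build; adjust to your own tree.
opts=$(direct_boot_opts /tmp/bisect/vmlinuz /tmp/bisect/initrd.img root=/dev/vda1)

# Only print the resulting command line; nothing is started here.
echo "autopkgtest ... -- qemu --qemu-options='$opts' ./autopkgtest-artful-i386.img"
```

Printing the command first makes it easy to check the quoting before each bisect step.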

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-10:

#33

Thanks for the feedback, Christian. I would much rather install them manually. I am able to do that without a problem. However, I am unable to access the GRUB menu in the usual way to select a specific kernel.

I tried all the usual ways (holding shift, modifying the /etc/default/grub settings), but none seem to work.

@James Page, how did you tell your image to test the mainline kernel in comment #6? Do you identify the position of the new kernel in /boot/grub/grub.cfg and then set GRUB_DEFAULT in /etc/default/grub to this number?

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-01-11:

#34

On Wed, Jan 10, 2018 at 5:45 PM, Joseph Salisbury
<email address hidden> wrote:
> Thanks for the feedback, Christian. I would much rather install them
> manually. I am able to do that without a problem. However, I am unable
> to access the GRUB menu in the usual way to select a specific kernel.

If you test through autotest, then there isn't a good way to manually
intercept "while" running.

> I tried all the usual way, holding shift, modifying /etc/default/grub
> setting, but none seem to work.

Other than just making sure that grub picks the right default by
making sure the to-be-tested kernel is the latest, I worked by modifying
grub.

Checking again if this still works ...
I had a zesty testbed to remove anyway, so I could kill it if needed.
By default it does boot "4.8.0-26-generic"

Note: working in guest image via:
$ qemu-system-x86_64 -m 1024 -smp 1 -nographic -net nic,model=virtio \
    -net user -enable-kvm -cpu kvm64,+vmx,+lahf_lm \
    ~/work/autopkgtest-zesty-amd64.img

I installed 4.7.10-040710_4.7.10-040710.201610220847 from mainline
builds as it is older and therefore would not be selected by grub
automatically.
After install I checked autopkgtest output...

autopkgtest [07:54:48]: testbed running kernel: Linux 4.8.0-26-generic #28-Ubuntu SMP Tue Oct 18 14:39:52 UTC 2016

Ok, now let's modify grub to boot the older kernel:
I found that (at least in this case) the BIOS boot partition kind of
breaks update-grub.
/dev/sda1 227328 25583582 25356255 12.1G Linux filesystem
/dev/sda14 2048 10239 8192 4M BIOS boot
/dev/sda15 10240 227327 217088 106M EFI System

The middle one is the odd one - that is the non efi compat grub img
storage area.
Anyway - to get around that I did the following:

$ apt-get install --reinstall grub-efi
$ echo "GRUB_DISABLE_OS_PROBER=true" | sudo tee -a /etc/default/grub
Then, as usual, e.g.:
$ echo 'GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.7.10-040710-generic"' | sudo tee -a /etc/default/grub
$ sudo update-grub

And voilà:
autopkgtest [08:20:29]: testbed running kernel: Linux 4.7.10-040710-generic #201610220847 SMP Sat Oct 22 12:50:14 UTC 2016

Other than the extra hoop I had to jump through for the BIOS boot
partition, there was nothing special in my attempt.
And I'd assume the same could just as well appear on real HW.

I hope that helps to drive your test kernels.
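
The GRUB_DEFAULT value used above is the submenu title and the entry title joined with ">". A small sketch for finding the exact titles and building that string. The helper names are made up, the "Advanced options for Ubuntu" submenu title is taken from the example above and assumed to match your grub.cfg, and titles are assumed to be single-quoted as in Ubuntu's generated config:

```shell
#!/bin/sh
# Sketch only: helpers for picking a non-default kernel via GRUB_DEFAULT.

# List the menuentry and submenu titles in a grub.cfg.
# $1 = path to grub.cfg (defaults to /boot/grub/grub.cfg)
grub_titles() {
    awk -F"'" '/^[ \t]*(menuentry|submenu) /{print $2}' "${1:-/boot/grub/grub.cfg}"
}

# Build the GRUB_DEFAULT value for a kernel release such as 4.7.10-040710-generic.
grub_default_for() {
    printf 'Advanced options for Ubuntu>Ubuntu, with Linux %s\n' "$1"
}
```

Running `grub_titles` inside the guest shows which titles exist. The string from `grub_default_for` then goes into /etc/default/grub, followed by update-grub.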

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-12:

#35

Thanks for the feedback, Christian! This gives me enough to be able to try and bisect this issue down now. I'll post an update shortly.

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-31:

#36

I'm swapping the bug back in and working on the bisect again. For some reason, all of the prior kernels that worked are now freezing in the test while "Stopping links":

*** Configuring hosts
h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14
*** Starting controller
c1
*** Starting 2 switches
s1 s2 ...
INFO:root:Running TCP performance test
*** Iperf: testing TCP bandwidth between h1 and h14
*** Results: ['20.9 Gbits/sec', '20.9 Gbits/sec']
INFO:root:Stopping network
*** Stopping 1 controllers
c1
*** Stopping 15 links
............

Has anyone else seen this?

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-31:

#37

The 4.10-based kernel does not experience this hang, but I am seeing it with 4.12. Maybe the 4.12-based kernels do in fact have the bug. I'll test further to find out.

Joseph Salisbury (jsalisbury) on 2018-01-31

Changed in linux (Ubuntu Bionic):
status:	Fix Committed → In Progress

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-31:

#38

After some more testing, it appears v4.11 final does not have the bug. Version 4.12-rc1 appears to be the first kernel version to hit the bug. I'll run a few more tests to confirm this. Once confirmed, I'll start a bisect between v4.11 and v4.12-rc1.

If others want to try these two versions, they can be downloaded from:

v4.11: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/
v4.12-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/
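
For anyone following along, a bisect between these two tags could be started roughly like this. A sketch assuming a clone of the mainline tree; start_bisect is a made-up helper and ~/linux is a placeholder path. Each step still needs a kernel build, an install into the test image and an autopkgtest run by hand:

```shell
#!/bin/sh
# Sketch only: start a bisect between a known-good and a known-bad revision.

# $1 = path to the kernel git tree, $2 = good revision, $3 = bad revision
start_bisect() {
    git -C "$1" bisect start "$3" "$2"   # git expects the bad revision first
}

# Example (not run here):
#   start_bisect ~/linux v4.11 v4.12-rc1
# After testing the kernel built from each checked-out commit:
#   git -C ~/linux bisect good     # test passed
#   git -C ~/linux bisect bad      # test hung or oopsed
```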

Revision history for this message

James Page (james-page) wrote on 2018-02-05:

#39

@Joseph

Any update on your bisecting?

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-02-05:

#40

I'm still working on bisecting the issue down. I should have another update in a day or two.

Revision history for this message

James Page (james-page) wrote on 2018-02-06:

#41

Marking OVS tasks as invalid as we think this issue is in the kernel.

Changed in openvswitch (Ubuntu Artful):
status:	Confirmed → Invalid
Changed in openvswitch (Ubuntu Bionic):
status:	Confirmed → Invalid

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-02-07:

#42

The first bisect reported the following commit as the first bad commit:
2f34c1231bfc ("Merge tag 'drm-for-v4.12' of git://people.freedesktop.org/~airlied/linux")

However, because this commit is a merge, I have to perform another round of bisecting between this commit's parents. I'll have another update shortly.

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-02-12:

#43

I'm still in the process of bisecting this. Some of the kernels that did not exhibit the bug in earlier testing now sometimes do. I need to rerun the prior tests to confirm whether those kernels were really good or bad.

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-02-23:

#44

I'm still working on the bisect. It looks like two commits may have introduced the bug, which is slowing the bisect process.

I found that the following commit causes a kernel trace with rtmsg_ifa as the EIP:

120645513f55 ("openvswitch: Add eventmask support to CT action.")

With this commit reverted, the tests work just fine on v4.12; without the revert they freeze. This is allowing me to bisect into newer kernel versions now.

I'm bisecting further now to identify the commit that is causing the trace with igmp_group_dropped as the EIP.
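
A revert like the one described above could be prepared as follows. A sketch only: revert_in_tree is a made-up helper, ~/linux is a placeholder path, and a clean revert on any given kernel version is assumed rather than guaranteed:

```shell
#!/bin/sh
# Sketch only: revert one commit in a kernel tree before building a test kernel.

# $1 = path to the git tree, $2 = commit to revert
revert_in_tree() {
    git -C "$1" revert --no-edit "$2"   # keep the default revert commit message
}

# Example (not run here):
#   revert_in_tree ~/linux 120645513f55
```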

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-03-19:

#45

It looks like only commit 120645513f55 would need to be reverted in v4.13.0-38. Can you test the following test kernel:

http://kernel.ubuntu.com/~jsalisbury/lp1736390/

Be sure to install both the linux-image and linux-image-extra .deb packages.

Revision history for this message

Andy Whitcroft (apw) wrote on 2018-07-24: Closing unsupported series nomination.

#46

This bug was nominated against a series that is no longer supported, ie artful. The bug task representing the artful nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Artful):
status:	In Progress → Won't Fix

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-09-11:

#47

Hi Joseph,
neither James nor I had realized that this was waiting for a retest on our side.
The old kernel that you had linked is gone by now (together with Artful I assume) :-/

Would you mind prepping a new test kernel of your choice (This still is an issue in cosmic, so whatever works best for you is ok) so that we can check and verify if it helps?

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-09-12:

#48

I built a Bionic test kernel with a revert of commit 120645513f55. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
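
The rule above can be written down as a small helper. A sketch only: test_kernel_packages is a made-up name, and the version comparison relies on GNU sort -V:

```shell
#!/bin/sh
# Sketch only: print the .deb package names to install for a test kernel version.

# $1 = kernel version, e.g. 4.13.0-38 or 4.15.0-34
test_kernel_packages() {
    # The older of the two versions sorts first with sort -V.
    oldest=$(printf '%s\n%s\n' "$1" 4.15 | sort -V | head -n 1)
    if [ "$oldest" = "4.15" ]; then
        # 4.15 or newer: kernel image and modules are split into three packages
        echo "linux-modules linux-modules-extra linux-image-unsigned"
    else
        # older than 4.15
        echo "linux-image linux-image-extra"
    fi
}
```

For example, `test_kernel_packages "$(uname -r)"` prints the set that applies to the running kernel's version.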

Thanks in advance!

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-09-13:

#49

While the tests that run on autopkgtest infra suggest it still is an issue, I first tried to reproduce as-is to be sure the trigger is good (new release, new kernel, new OVS):

$ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.9.0-0ubuntu1.dsc -- qemu --qemu-command=qemu-system-i386 --cpus 4 --ram-size=4096 ~/autopkgtest-bionic-i386.img
$ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.9.0-0ubuntu1.dsc -- qemu --cpus 4 --ram-size=4096 ~/autopkgtest-bionic-i386.img

The crash comes faster with more CPUs, but I eventually reduced to 1 CPU to get clearer stack traces.

It hangs (from the test's POV) and crashes (as seen on the main console running dmesg -w).

[   56.320025] BUG: unable to handle kernel NULL pointer dereference at 00000000
[   56.320760] IP: add_grec+0x28/0x450
[   56.321137] *pdpt = 000000001ebe7001 *pde = 0000000000000000 
[   56.321699] Oops: 0000 [#1] SMP
[   56.322009] Modules linked in: veth openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache kvm_intel ppdev kvm irqbypass joydev 9pnet_virtio input_leds parport_pc serio_raw 9pnet parport qemu_fw_cfg mac_hid sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq psmouse virtio_blk virtio_net i2c_piix4 pata_acpi floppy
[   56.325571] CPU: 0 PID: 240 Comm: systemd-journal Tainted: G        W        4.15.0-34-generic #37-Ubuntu
[   56.326485] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   56.327356] EIP: add_grec+0x28/0x450
[   56.327712] EFLAGS: 00010202 CPU: 0
[   56.328052] EAX: 00000000 EBX: dda65420 ECX: 00000006 EDX: dda65420
[   56.328651] ESI: dc489a00 EDI: dc489a00 EBP: d94c9f34 ESP: d94c9ef4
[   56.329259]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   56.329774] CR0: 80050033 CR2: 00000000 CR3: 1e9adba0 CR4: 000006f0
[   56.330379] Call Trace:
[   56.330623]  <SOFTIRQ>
[   56.330864]  mld_ifc_timer_expire+0x10e/0x260
[   56.331285]  ? igmp6_timer_handler+0x60/0x60
[   56.331699]  call_timer_fn+0x2f/0x120
[   56.332066]  ? igmp6_timer_handler+0x60/0x60
[   56.332489]  run_timer_softirq+0x3b5/0x410
[   56.332899]  ? rcu_process_callbacks+0xc8/0x470
[   56.333353]  ? __softirqentry_text_start+0x8/0x8
[   56.333808]  __do_softirq+0xae/0x255
[   56.334163]  ? __softirqentry_text_start+0x8/0x8
[   56.334617]  call_on_stack+0x45/0x50
[   56.334971]  </SOFTIRQ>
[   56.335219]  ? irq_exit+0xb5/0xc0
[   56.335549]  ? smp_apic_timer_interrupt+0x6c/0x120
[   56.336022]  ? apic_timer_interrupt+0x3c/0x44
[   56.336451] Code: 74 26 00 3e 8d 74 26 00 55 89 e5 57 56 53 89 c6 83 ec 34 89 4d e8 65 a1 14 00 00 00 89 45 f0 31 c0 f6 42 44 08 8b 42 10 89 45 cc <8b> 00 c7 45 ec 00 00 00 00 0f 85 f1 01 00 00 8b 80 54 01 00 00
[   56.338295] EIP: add_grec+0x28/0x450 SS:ESP: 0068:d94c9ef4
[   56.338832] CR2: 0000000000000000
[   56.339163] ---[ end trace 6b06ace1457ab251 ]---
[   56.339616] Kernel panic - not syncing: Fatal exception in interrupt
[   56.340448] Kernel Offset: 0x9000000 from 0xc1000000 (relocation range: 0xc0000000-0xdf7fdfff)
[   56.341293] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

With that, try the new kernel.

Note: console after starting the test
$ sudo nc -U /tmp/autopkgtest-qemu*/ttyS0

Umm, I was stopped in my tracks realizing this is an amd64 kernel.
@Jsalisbury - I'll need i386 kernels to do this.
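
When comparing several oops logs like the one above, it can help to pull out just the faulting function. A sketch only, with a made-up helper name; it assumes the "EIP: function+offset/size" line format seen in these i386 traces:

```shell
#!/bin/sh
# Sketch only: print the function named on the first "EIP:" line of a kernel log.

# Reads the log on stdin; prints e.g. "add_grec" for "EIP: add_grec+0x28/0x450".
oops_function() {
    sed -n 's/^.*EIP: \([A-Za-z0-9_.]*\)+0x.*$/\1/p' | head -n 1
}
```

It can be fed from a saved console log, for example `oops_function < console.log`, which makes it quick to tell whether a given run hit add_grec or one of the other functions mentioned earlier in this bug.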

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-09-13:

#50

There is a 32bit kernel now posted to:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-09-14:

#51

Thanks,
installed that in the test env, after a manual reboot I got:

$ uname -a
Linux autopkgtest 4.15.0-34-generic #38~lp1736390Commit12064551Reverted SMP Thu Sep 13 13:28:33 UTC i686 i686 i686 GNU/Linux

The change is persistent into the autopkgtest:
autopkgtest [05:11:17]: testbed running kernel: Linux 4.15.0-34-generic #38~lp1736390Commit12064551Reverted SMP Thu Sep 13 13:28:33 UTC

The test kernel works fine where the other one failed.
To be sure I ran it multiple times and with different cpu options enabled in KVM (e.g. to also run the DPDK tests which need sse3).

But they all worked, no crash.
That said - yes, reverting that change seems to be the solution.
But what was it needed for, and what would break if it were reverted?

commit 120645513f55a4ac5543120d9e79925d30a0156f
Author: Jarno Rajahalme <email address hidden>
Date: Fri Apr 21 16:48:06 2017 -0700

    openvswitch: Add eventmask support to CT action.

    Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
    which can be used in conjunction with the commit flag
    (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
    conntrack events (IPCT_*) should be delivered via the Netfilter
    netlink multicast groups. Default behavior depends on the system
    configuration, but typically a lot of events are delivered. This can be
    very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
    types of events are of interest.

    Netfilter core init_conntrack() adds the event cache extension, so we
    only need to set the ctmask value. However, if the system is
    configured without support for events, the setting will be skipped due
    to extension not being found.

That is odd, I thought in the past we had identified an Ubuntu-sauce patch, but that is a normal upstream change.
I'd hope that others are affected as well and this is fixed, or could it be that we are affected by 120645513f55 due to some Ubuntu-sauce?

For the sake of checking if latest upstream (no sauce and 4.19-rc3) might be better I ran the latest mainline kernel.

autopkgtest [05:21:50]: testbed running kernel: Linux 4.19.0-041900rc3-generic #201809120832 SMP Wed Sep 12 12:47:16 UTC 2018

But that is crashing still.

@James: can you estimate what we lose on non-i386 when reverting that change for now?
@Joseph: what do we do now, report upstream? If so, what exactly: a description and link sent to the author and the ML, as we don't have a fix yet?

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-09-17:

#52

Could you give the latest mainline kernel a test before I ping upstream? It is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc4

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-09-19:

#53

Hi Joseph, due to some maas accident I got my test system destroyed by a coworker.
I tested v4.19-rc3 as I wrote in comment #51 - do you mind accepting that as a valid "test latest mainline" even though it was not -rc4 as it would be now?

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-09-20:

#54

That should be good. I just like to have the latest mainline already tested in case upstream asks for it. I'll ping upstream and see what the next steps should be.

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-09-21: [Regression] openvswitch: Add eventmask support to CT action.

#55

Hi Jarno,

A kernel bug report was opened against Ubuntu [0]. This bug is a
regression introduced in v4.12-rc1. The latest mainline kernel was
tested and still exhibits the bug. The following commit was identified
as the cause of the regression:

120645513f55 ("openvswitch: Add eventmask support to CT action.")

I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue?

Thanks,

Joe

http://pad.lv/1736390

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-10-01:

#56

I built a Bionic test kernel with a patch from upstream. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Can you test this kernel and see if it resolves this bug?

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-10-09:

#57

Hi Joseph, I'm back from my PTO, but have to ask you again for an update - as in the past, I'll need 32bit kernels for this test.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-10-09:

#58

Note to myself: test instructions carried over from my original bug.

Update Kernel:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1712831/comments/12
and
sudo qemu-system-i386 -hda autopkgtest-bionic-i386.img -enable-kvm -nographic -curses -m 4096

Test:
sudo autopkgtest --shell-fail --apt-upgrade --no-built-binaries openvswitch_2.10.0-0ubuntu2.dsc -- qemu --qemu-options='-cpu host' --cpus 8 --ram-size=4096 ~/autopkgtest-bionic-i386.img

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-10-12:

#59

I built an i386 version of the Bionic test kernel with a patch from upstream. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-10-15:

#60

Reproduced the crash with the test case - it is still triggering.

Installed 32bit Test kernel

It boots this one:
Linux 4.15.0-36-generic #40 SMP Fri Oct 12 00:17:54 UTC 2018

Seems to have no "special" version suffix to identify it other than #40 and build time.
But #40 and the build time indicate this is the provided test kernel.

With that kernel it still fails.
Here is the updated BUG output from that kernel:

[   74.352331] IP: add_grec+0x28/0x450
[   74.353422] *pdpt = 000000001df53001 *pde = 0000000000000000 
[   74.355527] Oops: 0000 [#1] SMP
[   74.356517] Modules linked in: veth openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache kvm_intel kvm irqbypass crc32_pclmul pcbc aesni_intel aes_i586 crypto_simd ppdev cryptd joydev input_leds 9pnet_virtio 9pnet parport_pc parport mac_hid serio_raw qemu_fw_cfg sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq psmouse virtio_blk virtio_net i2c_piix4 pata_acpi floppy
[   74.367244] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G        W        4.15.0-36-generic #40
[   74.368932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   74.370719] EIP: add_grec+0x28/0x450
[   74.371319] EFLAGS: 00010202 CPU: 2
[   74.372213] EAX: 00000000 EBX: dd92c360 ECX: 00000006 EDX: dd92c360
[   74.373451] ESI: d7406600 EDI: d7406600 EBP: d8db7f34 ESP: d8db7ef4
[   74.374648]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   74.375540] CR0: 80050033 CR2: 00000000 CR3: 1e3e1220 CR4: 001406f0
[   74.376881] Call Trace:
[   74.377301]  <SOFTIRQ>
[   74.377708]  ? pcpu_chunk_relocate+0x14/0x70
[   74.378426]  mld_ifc_timer_expire+0x10e/0x260
[   74.379328]  ? igmp6_timer_handler+0x60/0x60
[   74.380047]  call_timer_fn+0x2f/0x120
[   74.380654]  ? igmp6_timer_handler+0x60/0x60
[   74.381367]  run_timer_softirq+0x3b5/0x410
[   74.382519]  ? rcu_process_callbacks+0xc8/0x470
[   74.383287]  ? __softirqentry_text_start+0x8/0x8
[   74.384396]  __do_softirq+0xae/0x255
[   74.385000]  ? __softirqentry_text_start+0x8/0x8
[   74.385769]  call_on_stack+0x45/0x50
[   74.386367]  </SOFTIRQ>
[   74.386800]  ? irq_exit+0xb5/0xc0
[   74.387377]  ? smp_apic_timer_interrupt+0x6c/0x120
[   74.388355]  ? apic_timer_interrupt+0x3c/0x44
[   74.389085]  ? __sched_text_end+0x3/0x3
[   74.389728]  ? native_safe_halt+0x5/0x10
[   74.390851]  ? default_idle+0x1c/0x100
[   74.391621]  ? arch_cpu_idle+0x12/0x20
[   74.392388]  ? default_idle_call+0x1e/0x30
[   74.393390]  ? do_idle+0x145/0x1c0
[   74.394410]  ? cpu_startup_entry+0x65/0x70
[   74.395432]  ? start_secondary+0x18a/0x1d0
[   74.396275]  ? startup_32_smp+0x164/0x168
[   74.397098] Code: 74 26 00 3e 8d 74 26 00 55 89 e5 57 56 53 89 c6 83 ec 34 89 4d e8 65 a1 14 00 00 00 89 45 f0 31 c0 f6 42 44 08 8b 42 10 89 45 cc <8b> 00 c7 45 ec 00 00 00 00 0f 85 f1 01 00 00 8b 80 54 01 00 00
[   74.401207] EIP: add_grec+0x28/0x450 SS:ESP: 0068:d8db7ef4
[   74.402470] CR2: 0000000000000000
[   74.403158] ---[ end trace b2832e49d4542abf ]---
[   74.404247] Kernel panic - not syncing: Fatal exception in interrupt
[   74.405513] Kernel Offset: 0x9000000 from 0xc1000000 (relocation range: 0xc0000000-0xdf7fdfff)
[   74.406968] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
[   74.408309] ------------[ cut here ]------------
[   74.409079] sched: Unexpected reschedule of offline CPU#0!
[   74.410748] WARNING: CPU: 2 PID: 0 at /home/jsalisbury/bugs/lp1736390/bionic/ubuntu-bionic/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x3b/0x50
[   74.413690] Modules linked in: veth openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache kvm_intel kvm irqbypass crc32_pclmul pcbc aesni_intel aes_i586 crypto_simd ppdev cryptd joydev input_leds 9pnet_virtio 9pnet parport_pc parport mac_hid serio_raw qemu_fw_cfg sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq psmouse virtio_blk virtio_net i2c_piix4 pata_acpi floppy
[   74.423253] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D W        4.15.0-36-generic #40
[   74.424752] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   74.426392] EIP: native_smp_send_reschedule+0x3b/0x50
[   74.427240] EFLAGS: 00010096 CPU: 2
[   74.427817] EAX: 0000002e EBX: d9912fc0 ECX: d9945630 EDX: 00000007
[   74.428854] ESI: 00000000 EDI: d994efc0 EBP: d8db7c70 ESP: d8db7c68
[   74.430058]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   74.431608] CR0: 80050033 CR2: 00000000 CR3: 1e3e1220 CR4: 001406f0
[   74.432626] Call Trace:
[   74.433034]  <SOFTIRQ>
[   74.433610]  trigger_load_balance+0x10e/0x210
[   74.434335]  ? put_prev_task_idle+0x10/0x10
[   74.435294]  scheduler_tick+0x9e/0xd0
[   74.436057]  update_process_times+0x3f/0x50
[   74.436787]  tick_sched_handle+0x32/0x80
[   74.437458]  tick_sched_timer+0x38/0x90
[   74.438113]  __hrtimer_run_queues+0xb3/0x230
[   74.438845]  ? tick_sched_do_timer+0x60/0x60
[   74.439577]  hrtimer_interrupt+0x8c/0x190
[   74.440434]  smp_apic_timer_interrupt+0x62/0x120
[   74.441527]  apic_timer_interrupt+0x3c/0x44
[   74.442627] EIP: panic+0x195/0x1e6
[   74.443516] EFLAGS: 00000296 CPU: 2
[   74.444222] EAX: 00000041 EBX: 00000000 ECX: d9945630 EDX: 00000007
[   74.445649] ESI: 00000000 EDI: 00000000 EBP: d8db7db8 ESP: d8db7da0
[   74.446909]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   74.447993]  ? snapshot_ioctl+0xa8/0x490
[   74.448789]  oops_end+0xb6/0xc0
[   74.449429]  no_context+0x101/0x290
[   74.450314]  __bad_area_nosemaphore+0xa4/0x130
[   74.451733]  ? kvm_async_pf_task_wait+0x1b0/0x1b0
[   74.452522]  bad_area_nosemaphore+0x12/0x20
[   74.453229]  __do_page_fault+0xcc/0x510
[   74.454209]  ? ip6_mc_hdr.constprop.39+0x47/0xe0
[   74.454989]  ? kvm_async_pf_task_wait+0x1b0/0x1b0
[   74.455958]  do_page_fault+0x27/0xf0
[   74.456566]  ? kvm_async_pf_task_wait+0x1b0/0x1b0
[   74.457353]  do_async_page_fault+0x55/0x90
[   74.458044]  common_exception+0x84/0x8a
[   74.458691] EIP: add_grec+0x28/0x450
[   74.459299] EFLAGS: 00010202 CPU: 2
[   74.460066] EAX: 00000000 EBX: dd92c360 ECX: 00000006 EDX: dd92c360
[   74.461109] ESI: d7406600 EDI: d7406600 EBP: d8db7f34 ESP: d8db7ef4
[   74.462632]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   74.464042]  ? fib6_add+0x54b/0xac0
[   74.464785]  ? pcpu_chunk_relocate+0x14/0x70
[   74.465499]  mld_ifc_timer_expire+0x10e/0x260
[   74.466214]  ? igmp6_timer_handler+0x60/0x60
[   74.466920]  call_timer_fn+0x2f/0x120
[   74.467524]  ? igmp6_timer_handler+0x60/0x60
[   74.468224]  run_timer_softirq+0x3b5/0x410
[   74.468894]  ? rcu_process_callbacks+0xc8/0x470
[   74.469636]  ? __softirqentry_text_start+0x8/0x8
[   74.470389]  __do_softirq+0xae/0x255
[   74.471644]  ? __softirqentry_text_start+0x8/0x8
[   74.472824]  call_on_stack+0x45/0x50
[   74.473547]  </SOFTIRQ>
[   74.474050]  ? irq_exit+0xb5/0xc0
[   74.474743]  ? smp_apic_timer_interrupt+0x6c/0x120
[   74.475706]  ? apic_timer_interrupt+0x3c/0x44
[   74.476582]  ? __sched_text_end+0x3/0x3
[   74.477357]  ? native_safe_halt+0x5/0x10
[   74.478150]  ? default_idle+0x1c/0x100
[   74.479268]  ? arch_cpu_idle+0x12/0x20
[   74.480221]  ? default_idle_call+0x1e/0x30
[   74.481235]  ? do_idle+0x145/0x1c0
[   74.481920]  ? cpu_startup_entry+0x65/0x70
[   74.482943]  ? start_secondary+0x18a/0x1d0
[   74.483761]  ? startup_32_smp+0x164/0x168
[   74.484564] Code: 1f 8b 15 20 b6 bb ca 8b 4a 18 ba fd 00 00 00 e8 f4 ef 84 00 c9 c3 8d 76 00 8d bc 27 00 00 00 00 50 68 68 3d ae ca e8 65 56 02 00 <0f> 0b 58 5a c9 c3 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 90
[   74.488603] ---[ end trace b2832e49d4542ac0 ]---
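
A note on reading the first oops above: the byte marked with `<..>` in the `Code:` line is where the faulting instruction starts. As far as I can tell, on 32-bit x86 the bytes `8b 00` decode to `mov (%eax),%eax`, and with EAX and CR2 both zero that is consistent with a NULL pointer dereference in add_grec. The sketch below only splits the `Code:` line and collects the register values so the two can be compared. The helper names `split_code_line` and `parse_registers` are hypothetical, and it does not disassemble anything.

```python
import re

# Matches 3-character register names followed by 8 hex digits, e.g. "EAX: 00000000".
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{2}): ([0-9a-f]{8})\b")


def split_code_line(line):
    """Split an oops 'Code:' line into (bytes_before_trap, bytes_from_trap)."""
    tokens = line.split("Code:", 1)[1].split()
    trap = next((i for i, tok in enumerate(tokens) if tok.startswith("<")), None)
    if trap is None:
        raise ValueError("no <..> marker in Code: line")
    values = [int(tok.strip("<>"), 16) for tok in tokens]
    return bytes(values[:trap]), bytes(values[trap:])


def parse_registers(lines):
    """Collect register values (as integers) from oops register dump lines."""
    regs = {}
    for line in lines:
        for name, value in REG_RE.findall(line):
            regs[name] = int(value, 16)
    return regs


# Lines copied from the first oops in this comment.
code = ("[   74.397098] Code: 74 26 00 3e 8d 74 26 00 55 89 e5 57 56 53 89 c6 "
        "83 ec 34 89 4d e8 65 a1 14 00 00 00 89 45 f0 31 c0 f6 42 44 08 8b 42 "
        "10 89 45 cc <8b> 00 c7 45 ec 00 00 00 00 0f 85 f1 01 00 00 8b 80 54 "
        "01 00 00")
dump = [
    "[   74.372213] EAX: 00000000 EBX: dd92c360 ECX: 00000006 EDX: dd92c360",
    "[   74.375540] CR0: 80050033 CR2: 00000000 CR3: 1e3e1220 CR4: 001406f0",
]

before, trap = split_code_line(code)
regs = parse_registers(dump)
print("trapping bytes:", trap[:2].hex(" "))
print("EAX=%#x CR2=%#x" % (regs["EAX"], regs["CR2"]))
```

For an actual disassembly, the kernel tree's scripts/decodecode is the usual tool. This sketch is only meant to make the register and code-byte comparison explicit.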

Joseph Salisbury (jsalisbury) on 2019-01-19

Changed in linux (Ubuntu Bionic):
assignee:	Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Cosmic):
assignee:	Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Artful):
assignee:	Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu):
assignee:	Joseph Salisbury (jsalisbury) → nobody

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-01-22:

#61

Hmm, so are we giving up on this?

Revision history for this message

Juerg Haefliger (juergh) wrote on 2019-03-11:

#62

https://lore.kernel.org/lkml/20190305104010.6342e9b9@gollum/

Juerg Haefliger (juergh) on 2019-03-11

description:

updated

Revision history for this message

Po-Hsu Lin (cypressyew) wrote on 2019-03-28:

#63

There is an openvswitch-related issue, bug 1813244. Perhaps these two are identical?

Revision history for this message

Andrea Righi (arighi) wrote on 2019-04-05:

#64

I've done a test with the fix from bug #1813244 and the problem doesn't seem to happen. Probably a duplicate bug.

Brad Figg (brad-figg) on 2019-07-24

tags:

added: cscc

Ubuntu
linux package

openvswitch: kernel oops destroying interfaces on i386

Affects	Status	Importance	Assigned to
linux (Ubuntu)	In Progress	High	Unassigned
linux (Ubuntu Artful)	Won't Fix	High	Unassigned
linux (Ubuntu Bionic)	In Progress	High	Unassigned
linux (Ubuntu Cosmic)	In Progress	High	Unassigned
linux (Ubuntu Disco)	In Progress	High	Unassigned
openvswitch (Ubuntu)	Invalid	Undecided	Unassigned
openvswitch (Ubuntu Artful)	Invalid	Undecided	Unassigned
openvswitch (Ubuntu Bionic)	Invalid	Undecided	Unassigned
openvswitch (Ubuntu Cosmic)	Invalid	Undecided	Unassigned
openvswitch (Ubuntu Disco)	Invalid	Undecided	Unassigned
