openvswitch: kernel oops destroying interfaces on i386

Bug #1736390 reported by James Page
Affects               Status       Importance  Assigned to  Milestone
linux (Ubuntu)        In Progress  High        Unassigned
  Artful              Won't Fix    High        Unassigned
  Bionic              In Progress  High        Unassigned
  Cosmic              In Progress  High        Unassigned
  Disco               In Progress  High        Unassigned
openvswitch (Ubuntu)  Invalid      Undecided   Unassigned
  Artful              Invalid      Undecided   Unassigned
  Bionic              Invalid      Undecided   Unassigned
  Cosmic              Invalid      Undecided   Unassigned
  Disco               Invalid      Undecided   Unassigned

Bug Description

== SRU Justification ==

Commit 120645513f55 ("openvswitch: Add eventmask support to CT action.") introduced a regression on i386. Simply running the following commands in a loop will trigger a crash rather quickly:

ovs-vsctl add-br test
ovs-vsctl del-br test
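
As an illustration, the loop could be scripted roughly as follows. This is a minimal sketch, not part of the original report: it assumes a POSIX shell, root privileges and a running Open vSwitch on an i386 host. The OVS_VSCTL variable, the repro_loop name and the default iteration count are made up for the example.

```shell
# Repeatedly create and delete an OVS bridge, as in the SRU justification.
# OVS_VSCTL and the default of 100 iterations are illustrative assumptions.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

repro_loop() {
    iterations="${1:-100}"
    i=0
    while [ "$i" -lt "$iterations" ]; do
        # Stop at the first failing command so a hang or error is visible.
        "$OVS_VSCTL" add-br test || return 1
        "$OVS_VSCTL" del-br test || return 1
        i=$((i + 1))
    done
}
```

Calling `repro_loop 1000` as root while following `dmesg -w` in a second terminal should show whether the oops appears; the number of iterations needed is not stated in the report.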

== Fix ==

Disable the logic introduced by the above commit on i386.

== Regression Potential ==

Low. The above commit introduced a new feature; per upstream, not having this feature results in higher CPU usage and potential buffering issues in user space.

== Test Case ==

See SRU justification above.

Original bug description:

Reproducible on bionic using the autopkgtests from openvswitch on i386:

[ 41.420568] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 41.421000] IP: igmp_group_dropped+0x21/0x220
[ 41.421246] *pdpt = 000000001d62c001 *pde = 0000000000000000

[ 41.421659] Oops: 0000 [#1] SMP
[ 41.421852] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache ppdev kvm_intel kvm 9pnet_virtio irqbypass input_leds joydev 9pnet parport_pc serio_raw parport i2c_piix4 qemu_fw_cfg mac_hid sch_fq_codel ip_tables x_tables autofs4 btrfs xor raid6_pq psmouse virtio_blk virtio_net pata_acpi floppy
[ 41.423855] CPU: 0 PID: 5 Comm: kworker/u2:0 Tainted: G W 4.13.0-18-generic #21-Ubuntu
[ 41.424355] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 41.424849] Workqueue: netns cleanup_net
[ 41.425071] task: db8fba80 task.stack: dba10000
[ 41.425346] EIP: igmp_group_dropped+0x21/0x220
[ 41.425656] EFLAGS: 00010202 CPU: 0
[ 41.425864] EAX: 00000000 EBX: dd726360 ECX: dba11e6c EDX: 00000002
[ 41.426335] ESI: 00000000 EDI: dd4db500 EBP: dba11dcc ESP: dba11d94
[ 41.426687] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 41.426990] CR0: 80050033 CR2: 00000000 CR3: 1e6d6d60 CR4: 000006f0
[ 41.427340] Call Trace:
[ 41.427485] ? __wake_up+0x36/0x40
[ 41.427680] ip_mc_down+0x27/0x90
[ 41.427869] inetdev_event+0x398/0x4e0
[ 41.428082] ? skb_dequeue+0x5b/0x70
[ 41.428286] ? wireless_nlevent_flush+0x4c/0x90
[ 41.428541] notifier_call_chain+0x4e/0x70
[ 41.428772] raw_notifier_call_chain+0x11/0x20
[ 41.429023] call_netdevice_notifiers_info+0x2a/0x60
[ 41.429301] dev_close_many+0x9d/0xe0
[ 41.429509] rollback_registered_many+0xd7/0x380
[ 41.429768] unregister_netdevice_many.part.102+0x10/0x80
[ 41.430075] default_device_exit_batch+0x134/0x160
[ 41.430344] ? do_wait_intr_irq+0x80/0x80
[ 41.430650] ops_exit_list.isra.8+0x4d/0x60
[ 41.430886] cleanup_net+0x18e/0x260
[ 41.431090] process_one_work+0x1a0/0x390
[ 41.431317] worker_thread+0x37/0x450
[ 41.431525] kthread+0xf3/0x110
[ 41.431714] ? process_one_work+0x390/0x390
[ 41.431941] ? kthread_create_on_node+0x20/0x20
[ 41.432187] ret_from_fork+0x19/0x24
[ 41.432382] Code: 90 90 90 90 90 90 90 90 90 90 3e 8d 74 26 00 55 89 e5 57 56 53 89 c3 83 ec 2c 8b 33 65 a1 14 00 00 00 89 45 f0 31 c0 80 7b 4b 00 <8b> 06 8b b8 20 03 00 00 8b 43 04 0f 85 5e 01 00 00 3d e0 00 00
[ 41.433405] EIP: igmp_group_dropped+0x21/0x220 SS:ESP: 0068:dba11d94
[ 41.433750] CR2: 0000000000000000
[ 41.433961] ---[ end trace 595db54cab84070c ]---

The system then becomes unresponsive; no further interfaces can be created.

James Page (james-page) wrote :

Only seen on i386; all other archs pass OK.

description: updated
James Page (james-page) wrote :

I see the same without proposed (i.e. I don't think it's the version of ovs in proposed that's causing this problem).

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1736390

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: artful
James Page (james-page)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote : Re: openvswitch: kernel opps destroying interfaces on i386

Is this issue new in bionic? Do you happen to know of a prior kernel version that does not exhibit the bug?

Also, could you see if this bug happens with the latest mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
importance: Medium → High
tags: added: kernel-key
Changed in linux (Ubuntu Bionic):
status: Confirmed → Triaged
tags: added: kernel-da-key
removed: kernel-key
James Page (james-page) wrote :

From autopkgtest histories:

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-artful/artful/i386/o/openvswitch/20170904_183010_82746@/log.gz

linux-generic is already the newest version (4.12.0.12.13).

That was for the 2.8.0-0ubuntu1 upload during artful development.

That said, subsequent tests using:

linux-generic is already the newest version (4.12.0.12.13).

James Page (james-page) wrote :

Testing with the mainline kernel I don't even get as far as the original error - the instance I'm testing with locks up as soon as the performance test is executed.

James Page (james-page) wrote :

OK so to summarize what's concrete here:

Only impacts i386
Impacts both ovs 2.8.0 and 2.8.1
Impacts artful and bionic

James Page (james-page) wrote :

Stacktrace from artful:

Dec 13 12:06:09 artful-i386-testing kernel: [ 160.913030] BUG: unable to handle kernel NULL pointer dereference at (null)
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.915102] IP: igmp_group_dropped+0x21/0x220
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.916317] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.916329]
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.917728] Oops: 0000 [#1] SMP
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.918345] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack ppdev kvm_intel kvm irqbypass crc32_pclmul input_leds serio_raw joydev parport_pc parport ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid aesni_intel hid aes_i586 crypto_simd cryptd virtio_blk psmouse virtio_net floppy
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.931511] CPU: 0 PID: 29 Comm: kworker/u2:1 Tainted: G W 4.13.0-16-generic #19-Ubuntu
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.933321] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.934970] Workqueue: netns cleanup_net
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.937847] task: f577c200 task.stack: f4c5a000
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.938694] EIP: igmp_group_dropped+0x21/0x220
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.940625] EFLAGS: 00010202 CPU: 0
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.941432] EAX: 00000000 EBX: f2b32c60 ECX: f4c5be68 EDX: 00000002
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.942712] ESI: 00000000 EDI: f29c1700 EBP: f4c5bdc8 ESP: f4c5bd90
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.944471] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.945612] CR0: 80050033 CR2: 00000000 CR3: 1ad8a000 CR4: 000406f0
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.946879] Call Trace:
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.949247] ? __wake_up+0x36/0x40
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.950074] ip_mc_down+0x27/0x90
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.951557] inetdev_event+0x398/0x4e0
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.953391] ? skb_dequeue+0x5b/0x70
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.954246] ? wireless_nlevent_flush+0x4c/0x90
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.955656] notifier_call_chain+0x4e/0x70
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.956814] raw_notifier_call_chain+0x11/0x20
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.957803] call_netdevice_notifiers_info+0x2a/0x60
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.958886] dev_close_many+0x9d/0xe0
Dec 13 12:06:09 artful-i386-testing kernel: [ 160.962228] rollback_registered_many+0xd7/0x...


James Page (james-page) wrote :

And some more testing

xenial: ovs 2.5.2 + 4.4/4.10/4.13 - OK
xenial: ovs 2.8.1 + 4.13 - FAIL

so maybe this is something that ovs is doing that's breaking the kernel.

James Page (james-page) wrote :

[ 118.059308] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 118.065034] IP: rtmsg_ifa+0x2d/0xe0
[ 118.065744] *pdpt = 000000003434e001 *pde = 0000000000000000

[ 118.067166] Oops: 0000 [#1] SMP
[ 118.067863] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack ppdev kvm_intel kvm irqbypass parport_pc input_leds parport joydev serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crc32_pclmul pcbc hid_generic usbhid aesni_intel hid aes_i586 psmouse crypto_simd virtio_net cryptd virtio_blk floppy
[ 118.081568] CPU: 0 PID: 29 Comm: kworker/u2:1 Tainted: G W 4.13.0-19-generic #22~16.04.1-Ubuntu
[ 118.088041] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[ 118.090334] Workqueue: netns cleanup_net
[ 118.091146] task: f537e300 task.stack: f485a000
[ 118.092077] EIP: rtmsg_ifa+0x2d/0xe0
[ 118.092778] EFLAGS: 00010246 CPU: 0
[ 118.093654] EAX: 00000000 EBX: f3674e00 ECX: 00000000 EDX: 014000c0
[ 118.095002] ESI: 00000000 EDI: f34dad80 EBP: f485bd94 ESP: f485bd7c
[ 118.101126] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 118.102531] CR0: 80050033 CR2: 00000000 CR3: 33df3860 CR4: 000406f0
[ 118.104276] Call Trace:
[ 118.105159] __inet_del_ifa+0x129/0x270
[ 118.106192] ? igmpv3_clear_delrec+0x28/0xb0
[ 118.107392] inetdev_event+0x1ff/0x4e0
[ 118.113963] ? __schedule+0x41e/0x8d0
[ 118.115418] notifier_call_chain+0x4e/0x70
[ 118.117049] ? notifier_call_chain+0x4e/0x70
[ 118.118081] raw_notifier_call_chain+0x11/0x20
[ 118.119125] call_netdevice_notifiers_info+0x2a/0x60
[ 118.125600] rollback_registered_many+0x268/0x370
[ 118.127652] unregister_netdevice_many+0x16/0x80
[ 118.129980] ? unregister_netdevice_many+0x16/0x80
[ 118.136510] default_device_exit_batch+0x126/0x150
[ 118.138577] ? do_wait_intr_irq+0x80/0x80
[ 118.140326] ops_exit_list.isra.8+0x4d/0x60
[ 118.142126] cleanup_net+0x18e/0x270
[ 118.143731] process_one_work+0x118/0x390
[ 118.149200] worker_thread+0x37/0x410
[ 118.150881] kthread+0xdb/0x110
[ 118.152617] ? process_one_work+0x390/0x390
[ 118.154530] ? kthread_create_on_node+0x20/0x20
[ 118.161804] ret_from_fork+0x19/0x24
[ 118.163523] Code: 66 66 90 55 89 e5 57 56 53 89 d7 89 ce 83 ec 0c 85 c9 89 45 ec 0f 84 93 00 00 00 8b 41 08 89 45 f0 8b 47 0c 31 c9 ba c0 00 40 01 <8b> 00 8b 80 20 03 00 00 6a ff 89 45 e8 b8 60 00 00 00 e8 7c 3e
[ 118.175426] EIP: rtmsg_ifa+0x2d/0xe0 SS:ESP: 0068:f485bd7c
[ 118.177188] CR2: 0000000000000000
[ 118.178305] ---[ end trace d1d2a116a66e2f9d ]---

James Page (james-page) wrote :

And some more testing:

xenial: ovs 2.8.1 + 4.4 - OK
xenial: ovs 2.8.1 + 4.10 - OK

so the issue appears to be the combination of ovs 2.8.1 (and 2.8.0) with the 4.13 kernel.

James Page (james-page) wrote :

More log output from ovs:

2017-12-13T12:40:23.502Z|00005|dpif(revalidator20)|WARN|system@ovs-system: failed to put[modify] (Invalid argument) ufid:684991e8-860e-44eb-ab44-bd8bd41520e1 recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=76:bd:b5:1d:36:0f,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=fe80::74bd:b5ff:fe1d:360f/::,dst=ff02::2/::,label=0/0,proto=58/0,tclass=0/0,hlimit=255/0,frag=no),icmpv6(type=133/0,code=0/0), actions:userspace(pid=0,slow_path(controller))
2017-12-13T12:40:23.502Z|00006|dpif(revalidator20)|WARN|system@ovs-system: failed to put[modify] (Invalid argument) ufid:526e564d-5dba-42da-b433-dbb77cddef23 recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(10),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=42:f1:94:53:48:5b,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=fe80::40f1:94ff:fe53:485b/::,dst=ff02::2/::,label=0/0,proto=58/0,tclass=0/0,hlimit=255/0,frag=no),icmpv6(type=133/0,code=0/0), actions:userspace(pid=0,slow_path(controller))
2017-12-13T12:40:23.505Z|00162|bridge|INFO|bridge s1: deleted interface s1 on port 65534
2017-12-13T12:40:23.505Z|00163|bridge|INFO|bridge s1: deleted interface s1-eth8 on port 8
2017-12-13T12:40:23.526Z|00164|bridge|INFO|bridge s2: deleted interface s2-eth4 on port 4
2017-12-13T12:40:23.526Z|00165|bridge|INFO|bridge s2: deleted interface s2-eth8 on port 8
2017-12-13T12:40:23.526Z|00166|bridge|INFO|bridge s2: deleted interface s2 on port 65534
2017-12-13T12:40:23.526Z|00167|bridge|INFO|bridge s2: deleted interface s2-eth6 on port 6
2017-12-13T12:40:23.526Z|00168|bridge|INFO|bridge s2: deleted interface s2-eth7 on port 7
2017-12-13T12:40:23.526Z|00169|bridge|INFO|bridge s2: deleted interface s2-eth2 on port 2
2017-12-13T12:40:23.527Z|00170|bridge|INFO|bridge s2: deleted interface s2-eth5 on port 5
2017-12-13T12:40:23.527Z|00171|bridge|INFO|bridge s2: deleted interface s2-eth3 on port 3
2017-12-13T12:40:23.527Z|00172|bridge|INFO|bridge s2: deleted interface s2-eth1 on port 1
2017-12-13T12:40:24.543Z|00001|ovs_rcu(urcu5)|WARN|blocked 1005 ms waiting for main to quiesce
2017-12-13T12:40:25.539Z|00002|ovs_rcu(urcu5)|WARN|blocked 2001 ms waiting for main to quiesce
2017-12-13T12:40:27.545Z|00003|ovs_rcu(urcu5)|WARN|blocked 4007 ms waiting for main to quiesce
2017-12-13T12:40:31.542Z|00004|ovs_rcu(urcu5)|WARN|blocked 8004 ms waiting for main to quiesce
2017-12-13T12:40:39.546Z|00005|ovs_rcu(urcu5)|WARN|blocked 16008 ms waiting for main to quiesce

James Page (james-page) wrote :

root 16037 1 0 12:39 ? 00:00:00 ovsdb-server: monitoring pid 16038 (healthy)
root 16038 16037 0 12:39 ? 00:00:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch
root 16051 1 0 12:39 ? 00:00:00 ovs-vswitchd: monitoring pid 16052 (healthy)
root 16052 16051 0 12:39 ? 00:00:00 [ovs-vswitchd] <defunct>

and

root 17517 16216 0 12:40 pts/0 00:00:00 ovs-vsctl --if-exists del-br s1 -- --if-exists del-br s2

Joseph Salisbury (jsalisbury) wrote :
James Page (james-page) wrote :

I don't have a data point for 4.11, but at 4.10 (as shipped in zesty) ovs 2.8.1 tests are OK.

James Page (james-page) wrote :

Looking at the test history - artful passed tests with 4.11 and 4.12 kernels with 2.8.0, but appears to have started failing sporadically when 4.13 entered the archive.

Joseph Salisbury (jsalisbury) wrote :

Would you be able to test some test kernels? If so, we can try to bisect down to which commit introduced the regression.

Greg Rose (gvrose) wrote :

James Page asked me to post some findings here:

Here’s the trace I’m getting (same as the one in comment #10):

[ 5152.142936] device s1 left promiscuous mode
[ 5152.427823] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5152.428422] IP: rtmsg_ifa+0x30/0xd0
[ 5152.428816] *pdpt = 0000000033f65001 *pde = 0000000000000000
[ 5152.428820]
[ 5152.429682] Oops: 0000 [#1] SMP
[ 5152.430046] Modules linked in: veth netconsole openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack ppdev snd_hda_codec_generic snd_hda_intel snd_hda_codec joydev snd_hda_core snd_hwdep snd_pcm input_leds serio_raw snd_timer snd pvpanic parport_pc i2c_piix4 soundcore mac_hid parport sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm crc32_pclmul drm_kms_helper pcbc aesni_intel syscopyarea aes_i586 sysfillrect crypto_simd sysimgblt fb_sys_fops psmouse cryptd virtio_net virtio_blk drm pata_acpi floppy
[ 5152.433348] CPU: 1 PID: 90 Comm: kworker/u4:3 Tainted: G W 4.13.0-16-generic #19-Ubuntu
[ 5152.433852] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5152.434346] Workqueue: netns cleanup_net
[ 5152.434816] task: f17aa100 task.stack: f4ef0000
[ 5152.435302] EIP: rtmsg_ifa+0x30/0xd0
[ 5152.435780] EFLAGS: 00010246 CPU: 1
[ 5152.436254] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 014000c0
[ 5152.436764] ESI: 00000000 EDI: f063a6c0 EBP: f4ef1dcc ESP: f4ef1db4
[ 5152.437267] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 5152.437780] CR0: 80050033 CR2: 00000000 CR3: 33c3f4a0 CR4: 001406f0
[ 5152.438311] Call Trace:
[ 5152.438816] __inet_del_ifa+0xbb/0x260
[ 5152.439344] ? igmpv3_clear_delrec+0x28/0xa0
[ 5152.439868] inetdev_event+0x22f/0x4e0
[ 5152.440401] ? skb_dequeue+0x5b/0x70
[ 5152.440934] ? wireless_nlevent_flush+0x4c/0x90
[ 5152.441487] notifier_call_chain+0x4e/0x70
[ 5152.442016] raw_notifier_call_chain+0x11/0x20
[ 5152.442554] call_netdevice_notifiers_info+0x2a/0x60
[ 5152.443097] rollback_registered_many+0x21c/0x380
[ 5152.443646] unregister_netdevice_many.part.102+0x10/0x80
[ 5152.444180] default_device_exit_batch+0x134/0x160
[ 5152.444709] ? do_wait_intr_irq+0x80/0x80
[ 5152.445223] ops_exit_list.isra.8+0x4d/0x60
[ 5152.445744] cleanup_net+0x18e/0x260
[ 5152.446264] process_one_work+0x1a0/0x390
[ 5152.446790] worker_thread+0x37/0x440
[ 5152.447321] kthread+0xf3/0x110
[ 5152.447843] ? process_one_work+0x390/0x390
[ 5152.448380] ? kthread_create_on_node+0x20/0x20
[ 5152.448919] ret_from_fork+0x19/0x24
[ 5152.449462] Code: 55 89 e5 57 56 53 89 d7 89 ce 83 ec 0c 85 c9 89 45 e8 c7 45 f0 00 00 00 00 74 06 8b 41 08 89 45 f0 8b 47 0c 31 c9 ba c0 00 40 01 <8b> 00 8b 80 20 03 00 00 6a ff 89 45 ec b8 60 00 00 00 e8 19 46
[ 5152.450719] EIP: rtmsg_ifa+0x30/0xd0 SS:ESP: 0068:f4ef1db4
[ 5152.451308] CR2: 0000000000000000
[ 5152.451885] ---[ end trace 5cdfc95a5b343f5c ]---

Looks to me like 4.13 is missing this set of patches from Cong Wang:

c...


James Page (james-page) wrote :

Thanks Greg

Joseph - would it be possible to get a 4.13 kernel prepared with the patch identified picked?

Changed in linux (Ubuntu Artful):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
status: Triaged → Fix Committed
assignee: nobody → Joseph Salisbury (jsalisbury)
tags: added: bionic
Joseph Salisbury (jsalisbury) wrote :

@Greg, do you happen to know which commit in that merge is the actual fix for this bug?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openvswitch (Ubuntu Artful):
status: New → Confirmed
Changed in openvswitch (Ubuntu):
status: New → Confirmed
Christian Ehrhardt  (paelzer) wrote :

From the Dup - FYI
To reproduce do:
$ autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G
$ pull-lp-source openvswitch
$ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.8.0~git20170809.7aa47a19d-0ubuntu1.dsc -- qemu ~/work/autopkgtest-artful-i386.img
# This guest currently will crash after a while of testing

Really easy to reproduce (and close to 100% fail rate)

Steve Langasek (vorlon)
summary: - openvswitch: kernel opps destroying interfaces on i386
+ openvswitch: kernel oops destroying interfaces on i386
Joseph Salisbury (jsalisbury) wrote :

I tried to build a test kernel with the patch set from Cong mentioned in comment #18. However, there was a build failure due to some missing prereq commits. I also had to backport many of the patches, since they were not clean cherry picks.

Per comment 16, the bug did not start happening until the 4.13 kernel was introduced to Artful. We may want to look for a commit to revert instead of Cong's patch set, since it may not pass all the SRU requirements.

I'll create a VM and see if I can reproduce the bug. If I can, I'll bisect down to the offending commit in 4.13.

Just to confirm, this is only happening on the i386 server release?

Joseph Salisbury (jsalisbury) wrote :

I'm attempting to reproduce the bug. However, I get a hang when I run the following:
$ autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G

Attached is a screen shot of the hang. It seems to be trying to boot the image.

Joseph Salisbury (jsalisbury) wrote :

Also, I'm trying to run this on a VM instead of bare hardware. Could that be an issue?

Christian Ehrhardt  (paelzer) wrote :

@Joseph - if you mean 2nd level so:
Machine
->KVM
  -> $ autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G
That could be an issue, as 2nd-level (nested) virtualization is known for only mostly working.

But why would you do so - since the tests are in VMs they are already more or less host release agnostic. You can run that on any host system - although in general having "autopkgtest" from $RELEASE-backports is often a good idea.

Just ran a check if buildvm works for me atm, and it did with the same cmdline you had (Xenial + autopkgtest backport) - it "appears" to hang at that stage, but is done after ~1-2 minutes. In 2nd level that could as well be just way more inefficient and take much longer?

Joseph Salisbury (jsalisbury) wrote :

@Christian, Thanks for the info. I am now able to reproduce the bug using your steps. Thanks again for that!

I'm going to dig deeper into this and see if I can bisect down to the commit that caused this. I next need to figure out how to swap in different test kernels. I think you explained how to do that in the duplicate bug, so I'll take a look there.

I'll post an update shortly.

Joseph Salisbury (jsalisbury) wrote :

I'm running into an issue mounting the image to get a kernel onto it. For some reason, I cannot mount /dev/nbd0p1:

sudo modprobe nbd

sudo qemu-nbd --connect=/dev/nbd0 ./autopkgtest-artful-i386.img

sudo mount /dev/nbd0p1 /mnt/test/
mount: /mnt/test: special device /dev/nbd0p1 does not exist.

sudo fdisk /dev/nbd0

Welcome to fdisk (util-linux 2.30.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): p
Disk /dev/nbd0: 12.2 GiB, 13098811392 bytes, 25583616 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xae40d318

Device Boot Start End Sectors Size Id Type
/dev/nbd0p1 * 2048 25583582 25581535 12.2G 83 Linux

fdisk can see that the nbd device has one partition, but I can't seem to mount it. Any suggestions?

Joseph Salisbury (jsalisbury) wrote :

After running qemu-nbd:

ls /dev/nbd*
nbd0 nbd1 nbd10 nbd11 nbd12 nbd13 nbd14 nbd15 nbd2 nbd3 nbd4 nbd5 nbd6 nbd7 nbd8 nbd9
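
One thing that may be worth checking (an assumption, not something confirmed in this thread): partition nodes such as /dev/nbd0p1 only appear if the nbd module was loaded with partition support and the kernel has re-read the partition table after the image was attached. A hedged sketch, where RUN and attach_image are hypothetical names and max_part=8 is an arbitrary value:

```shell
# RUN is the privilege wrapper; set RUN=echo to only print the commands.
RUN="${RUN:-sudo}"

attach_image() {
    image="$1"
    # Detach and unload first: max_part only takes effect at module load.
    $RUN qemu-nbd --disconnect /dev/nbd0
    $RUN modprobe -r nbd
    # Reload with partition support, attach the image, then ask the kernel
    # to re-read the partition table of the device.
    $RUN modprobe nbd max_part=8 &&
        $RUN qemu-nbd --connect=/dev/nbd0 "$image" &&
        $RUN partprobe /dev/nbd0
}
```

If this was the cause, `ls /dev/nbd0*` should list nbd0p1 after running `attach_image ./autopkgtest-artful-i386.img`.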

Joseph Salisbury (jsalisbury) wrote :

I'm able to install a kernel using kvm:
kvm -m 512 -hda ./autopkgtest-artful-i386.img

Is there a way to modify the autopkgtest command line to tell it which kernel to boot? When I have multiple test kernels installed, I usually just select them from the GRUB menu. What would be the equivalent here?

Christian Ehrhardt  (paelzer) wrote : Re: [Bug 1736390] Re: openvswitch: kernel oops destroying interfaces on i386

On Fri, Jan 5, 2018 at 9:27 PM, Joseph Salisbury
<email address hidden> wrote:
> I'm able to install a kernel using kvm:
> kvm -m 512 -hda ./autopkgtest-artful-i386.img

If that works for you fine.

> Is there a way to modify the autopkgtest command line to tell it which
> kernel to boot? When I have multiple test kernels installed, I usually
> just select them from the GRUB menu. What would be the equivalent here?

I must admit, for automated tests I set the one to test as the default
booted kernel.

If you really want to bisect you might want to go a step further.
The tool to drive via qemu is "autopkgtest-virt-qemu" and it has
"--qemu-options=".
That said (I never tried) you should be able to build (e.g. by bisect)
a valid kernel outside (make sure that it has all modules in a way to
work without having the kernel actually installed).
Via that you could use qemu options:
-kernel
-initrd
-append
as needed.
Once set up in a way to work you could really bisect through that.
OTOH that might be just as much work as installing them manually a few
times - so your choice what you prefer.
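
A rough sketch of what that could look like, keeping in mind the caveat above that this approach is untried: the helper below only assembles the string for --qemu-options. The function name, the file paths and the root= kernel argument are hypothetical and would need adapting to the actual image.

```shell
# Assemble "-kernel ... -initrd ... -append ..." for --qemu-options.
# The append value is kept to a single word here because how
# autopkgtest-virt-qemu splits the option string is not verified.
qemu_boot_options() {
    kernel="$1"
    initrd="$2"
    append="$3"
    printf '%s %s %s %s %s %s\n' \
        -kernel "$kernel" -initrd "$initrd" -append "$append"
}

# Example (hypothetical paths and kernel argument):
#   autopkgtest <test args> -- qemu \
#       --qemu-options="$(qemu_boot_options ./vmlinuz ./initrd.img root=/dev/vda1)" \
#       ~/work/autopkgtest-artful-i386.img
```

Building the option string in one place keeps each bisect step down to swapping the kernel and initrd paths.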

Joseph Salisbury (jsalisbury) wrote :

Thanks for the feedback, Christian. I would much rather install them manually. I am able to do that without a problem. However, I am unable to access the GRUB menu in the usual way to select a specific kernel.

I tried all the usual ways (holding shift, modifying the /etc/default/grub settings), but none seem to work.

@James Page, how did you tell your image to test the mainline kernel in comment #6? Do you identify the position of the new kernel in /boot/grub/grub.cfg and then modify GRUB_DEFAULT in /etc/default/grub to this number?

Christian Ehrhardt  (paelzer) wrote :

On Wed, Jan 10, 2018 at 5:45 PM, Joseph Salisbury
<email address hidden> wrote:
> Thanks for the feedback, Christian. I would much rather install them
> manually. I am able to do that without a problem. However, I am unable
> to access the GRUB menu in the usual way to select a specific kernel.

If you test through autotest, then there isn't a good way to manually
intercept "while" running.

> I tried all the usual way, holding shift, modifying /etc/default/grub
> setting, but none seem to work.

Other than just making sure that grub picks the right default by
making sure the to-be-tested kernel is the latest, I worked by modifying
grub.

Checking again if this still works ...
I had a zesty testbed to remove anyway, so I could kill it if needed.
By default it does boot "4.8.0-26-generic"

Note: working in guest image via:
$ qemu-system-x86_64 -m 1024 -smp 1 -nographic -net nic,model=virtio \
    -net user -enable-kvm -cpu kvm64,+vmx,+lahf_lm \
    ~/work/autopkgtest-zesty-amd64.img

I installed 4.7.10-040710_4.7.10-040710.201610220847 from mainline
builds as it is older and therefore would not be selected by grub
automatically.
After install I checked autopkgtest output...

autopkgtest [07:54:48]: testbed running kernel: Linux 4.8.0-26-generic
#28-Ubuntu SMP Tue Oct 18 14:39:52 UTC 2016

Ok, now let's modify grub to boot the older kernel:
I found that (at least in this case) the BIOS boot partition kind of
breaks update-grub.
/dev/sda1 227328 25583582 25356255 12.1G Linux filesystem
/dev/sda14 2048 10239 8192 4M BIOS boot
/dev/sda15 10240 227327 217088 106M EFI System

The middle one is the odd one - that is the non efi compat grub img
storage area.
Anyway - to get around that I was adding:

$ apt-get install --reinstall grub-efi
$ echo "GRUB_DISABLE_OS_PROBER=true" | sudo tee -a /etc/default/grub
So as usual e.g.:
$ echo 'GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.7.10-040710-generic"' | sudo tee -a /etc/default/grub
$ sudo update-grub

And e voila:
autopkgtest [08:20:29]: testbed running kernel: Linux
4.7.10-040710-generic #201610220847 SMP Sat Oct 22 12:50:14 UTC 2016

Other than the extra hoop I had to jump through for the BIOS boot there was
nothing special in my try.
And I'd assume that could just as well appear on real HW.

I hope that helps to drive your test kernels.
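
To find the exact titles to put into GRUB_DEFAULT, the entries in grub.cfg can be listed. A small sketch, assuming the usual Ubuntu grub.cfg layout where titles are single-quoted; list_grub_entries is a made-up helper name:

```shell
# Print the title of every menuentry and submenu line in a grub.cfg.
# The default path is the usual Ubuntu location; pass another file to override.
list_grub_entries() {
    cfg="${1:-/boot/grub/grub.cfg}"
    # Split on single quotes: field 2 is the quoted title.
    awk -F"'" '/^[ \t]*(menuentry|submenu) / { print $2 }' "$cfg"
}
```

A nested entry is then referenced as "submenu title>entry title", as in the GRUB_DEFAULT line above.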

Joseph Salisbury (jsalisbury) wrote :

Thanks for the feedback, Christian! This gives me enough to be able to try and bisect this issue down now. I'll post an update shortly.

Joseph Salisbury (jsalisbury) wrote :

I'm swapping the bug back in and working on the bisect again. For some reason, all of the prior kernels that worked are now freezing on the test while "Stopping Links":

*** Configuring hosts
h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14
*** Starting controller
c1
*** Starting 2 switches
s1 s2 ...
INFO:root:Running TCP performance test
*** Iperf: testing TCP bandwidth between h1 and h14
*** Results: ['20.9 Gbits/sec', '20.9 Gbits/sec']
INFO:root:Stopping network
*** Stopping 1 controllers
c1
*** Stopping 15 links
............

Has anyone else seen this?

Joseph Salisbury (jsalisbury) wrote :

The 4.10 based kernel does not experience this hang, but I am seeing it with 4.12. Maybe the 4.12 based kernels do in fact have the bug. I'll test further to find out.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → In Progress
Joseph Salisbury (jsalisbury) wrote :

After some more testing, it appears v4.11 final does not have the bug. Version 4.12-rc1 appears to be the first kernel version to hit the bug. I'll run a few more tests to confirm this. Once confirmed, I'll start a bisect between v4.11 and v4.12-rc1.

If others want to try these two versions, they can be downloaded from:

v4.11: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/
v4.12-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/
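
For anyone following along, a bisect over that range could be started roughly like this. A minimal sketch, assuming the current directory is a clone of the mainline kernel tree; GIT and start_bisect are illustrative names, not something used in this thread:

```shell
# GIT can be overridden, e.g. to point at a wrapper script.
GIT="${GIT:-git}"

# Start a bisect between a known-good and a known-bad revision.
# Note the argument order git expects: bad revision first, then good.
start_bisect() {
    good="$1"
    bad="$2"
    "$GIT" bisect start "$bad" "$good"
}

# Example:
#   start_bisect v4.11 v4.12-rc1
# Then build and boot each candidate kernel, run the openvswitch tests, and
# mark the result with "git bisect good" or "git bisect bad".
```

Because the failure is intermittent, each candidate probably needs several test runs before it is marked good.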

James Page (james-page) wrote :

@Joseph

Any update on your bisecting?

Joseph Salisbury (jsalisbury) wrote :

I'm still working on bisecting the issue down. I should have another update in a day or two.

James Page (james-page) wrote :

Marking OVS tasks as invalid as we think this issue is in the kernel.

Changed in openvswitch (Ubuntu Artful):
status: Confirmed → Invalid
Changed in openvswitch (Ubuntu Bionic):
status: Confirmed → Invalid
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The first bisect reported the following commit as the first bad commit:
2f34c1231bfc ("Merge tag 'drm-for-v4.12' of git://people.freedesktop.org/~airlied/linux")

However, because this commit is a merge, I have to perform another round of bisecting between this commit's parents. I'll have another update shortly.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I'm still in the process of bisecting this. Some of the kernels that did not exhibit the bug in earlier testing do exhibit it sometimes. I need to rerun the prior testing to confirm whether those kernels were really good or bad.
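
Because the failure is intermittent, a single passing run is weak evidence that a kernel is good. A minimal sketch of one way to guard against that, assuming a hypothetical reproducer command (not something used in this bug) that exits non-zero when the crash or hang is seen: rerun it several times and call the kernel good only if every run passes. The 0 and 1 return values match what `git bisect run` treats as good and bad.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original bisect: run a reproducer
# command several times and trust "good" only if every run succeeds.
#   $1      number of attempts
#   $2...   reproducer command and its arguments (an assumption; it must
#           exit non-zero when the crash or hang is observed)
verdict_after_retries() {
    attempts="$1"
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if ! "$@"; then
            return 1    # one failure is enough to call the kernel bad
        fi
        i=$((i + 1))
    done
    return 0            # every attempt passed: treat the kernel as good
}
```

For example, `verdict_after_retries 5 ./run-ovs-test.sh`, where `run-ovs-test.sh` is a placeholder for whatever drives the test VM. More attempts lower the chance of mislabeling a bad kernel as good, at the cost of a slower bisect.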

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I'm still working on the bisect. It looks like two commits may have introduced the bug, which is slowing the bisect process.

I found that the following commit causes a kernel trace with rtmsg_ifa as the EIP:

120645513f55 ("openvswitch: Add eventmask support to CT action.")

With this commit reverted, the tests work fine on v4.12; without the revert they freeze. This allows me to bisect into newer kernel versions now.

I'm bisecting further now to identify the commit that is causing the trace with igmp_group_dropped as the EIP.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It looks like only commit 120645513f55 would need to be reverted in v4.13.0-38. Can you test the following test kernel:

http://kernel.ubuntu.com/~jsalisbury/lp1736390/

Be sure to install both the linux-image and linux-image-extra .deb packages.

Revision history for this message
Andy Whitcroft (apw) wrote : Closing unsupported series nomination.

This bug was nominated against a series that is no longer supported, ie artful. The bug task representing the artful nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Artful):
status: In Progress → Won't Fix
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Joseph,
neither James nor I realized that this was waiting for a retest on our side.
The old kernel that you had linked is gone by now (together with Artful I assume) :-/

Would you mind prepping a new test kernel of your choice (This still is an issue in cosmic, so whatever works best for you is ok) so that we can check and verify if it helps?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a Bionic test kernel with a revert of commit 120645513f55. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
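
The rule in the note above can be sketched as a small shell helper. This is only an illustration: the function name is hypothetical, and it prints the package families named in the note rather than real file names, since the exact .deb names depend on the build.

```shell
#!/bin/sh
# Hypothetical helper: print which .deb package families to install for a
# test kernel, following the 4.15 cut-over described in the note.
#   $1  kernel version string, e.g. "4.13.0-38" or "4.15.0-34"
test_kernel_packages() {
    version="$1"
    major=${version%%.*}        # text before the first dot
    rest=${version#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 15 ]; }; then
        echo "linux-modules linux-modules-extra linux-image-unsigned"
    else
        echo "linux-image linux-image-extra"
    fi
}
```

Calling `test_kernel_packages "$(uname -r)"` on the test machine would report the set matching the currently running kernel; pass the test kernel's version explicitly when it differs.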

Thanks in advance!

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

While the tests that run on the autopkgtest infra suggest it still is an issue, I first tried to reproduce it as-is to be sure the trigger is good (new release, new kernel, new OVS):

$ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.9.0-0ubuntu1.dsc -- qemu --qemu-command=qemu-system-i386 --cpus 4 --ram-size=4096 ~/autopkgtest-bionic-i386.img
$ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.9.0-0ubuntu1.dsc -- qemu --cpus 4 --ram-size=4096 ~/autopkgtest-bionic-i386.img

It crashes faster with more CPUs, but I eventually reduced to 1 CPU to get clearer stack traces.

From the tests' point of view it hangs; the main console running dmesg -w shows the crash:

[ 56.320025] BUG: unable to handle kernel NULL pointer dereference at 00000000
[ 56.320760] IP: add_grec+0x28/0x450
[ 56.321137] *pdpt = 000000001ebe7001 *pde = 0000000000000000
[ 56.321699] Oops: 0000 [#1] SMP
[ 56.322009] Modules linked in: veth openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache kvm_intel ppdev kvm irqbypass joydev 9pnet_virtio input_leds parport_pc serio_raw 9pnet parport qemu_fw_cfg mac_hid sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq psmouse virtio_blk virtio_net i2c_piix4 pata_acpi floppy
[ 56.325571] CPU: 0 PID: 240 Comm: systemd-journal Tainted: G W 4.15.0-34-generic #37-Ubuntu
[ 56.326485] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 56.327356] EIP: add_grec+0x28/0x450
[ 56.327712] EFLAGS: 00010202 CPU: 0
[ 56.328052] EAX: 00000000 EBX: dda65420 ECX: 00000006 EDX: dda65420
[ 56.328651] ESI: dc489a00 EDI: dc489a00 EBP: d94c9f34 ESP: d94c9ef4
[ 56.329259] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 56.329774] CR0: 80050033 CR2: 00000000 CR3: 1e9adba0 CR4: 000006f0
[ 56.330379] Call Trace:
[ 56.330623] <SOFTIRQ>
[ 56.330864] mld_ifc_timer_expire+0x10e/0x260
[ 56.331285] ? igmp6_timer_handler+0x60/0x60
[ 56.331699] call_timer_fn+0x2f/0x120
[ 56.332066] ? igmp6_timer_handler+0x60/0x60
[ 56.332489] run_timer_softirq+0x3b5/0x410
[ 56.332899] ? rcu_process_callbacks+0xc8/0x470
[ 56.333353] ? __softirqentry_text_start+0x8/0x8
[ 56.333808] __do_softirq+0xae/0x255
[ 56.334163] ? __softirqentry_text_start+0x8/0x8
[ 56.334617] call_on_stack+0x45/0x50
[ 56.334971] </SOFTIRQ>
[ 56.335219] ? irq_exit+0xb5/0xc0
[ 56.335549] ? smp_apic_timer_interrupt+0x6c/0x120
[ 56.336022] ? apic_timer_interrupt+0x3c/0x44
[ 56.336451] Code: 74 26 00 3e 8d 74 26 00 55 89 e5 57 56 53 89 c6 83 ec 34 89 4d e8 65 a1 14 00 00 00 89 45 f0 31 c0 f6 42 44 08 8b 42 10 89 45 cc <8b> 00 c7 45 ec 00 00 00 00 0f 85 f1 01 00 00 8b 80 54 01 00 00
[ 56.338295] EIP: add_grec+0x28/0x450 SS:ESP: 0068:d94c9ef4
[ 56.338832] CR2: 0000000000000000
[ 56.339163] ---[ end trace 6b06ace1457ab251 ]---
[ 56.339616] Kernel panic - not syncing: Fatal exception in interrupt
[ 56.340448] Kernel Offset: 0x9000000 from 0xc1000000 (relocation range: 0xc0000000-0xdf7fdfff)
[ 56.341293] ---[ end Kernel panic - not syncing: Fatal exception in i...
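
The `Code:` line of an oops marks the first byte of the faulting instruction with angle brackets. As a small sketch (the function name is hypothetical), the marked byte and the one after it can be pulled out of a saved dmesg log like this; for a full disassembly the kernel tree ships scripts/decodecode.

```shell
#!/bin/sh
# Hypothetical helper: read oops text on stdin and print the byte marked
# with <..> on the "Code:" line together with the byte that follows it.
# Prints nothing if there is no such line or the marked byte is the last one.
faulting_bytes() {
    sed -n 's/.*Code:.*<\([0-9a-fA-F]\{2\}\)> \([0-9a-fA-F]\{2\}\).*/\1 \2/p'
}
```

For the trace above this yields `8b 00`, which on i386 encodes `mov (%eax),%eax`. With EAX shown as 00000000 in the register dump, that is consistent with the NULL pointer dereference reported in CR2.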

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

There is a 32bit kernel now posted to:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks,
installed that in the test env, after a manual reboot I got:

$ uname -a
Linux autopkgtest 4.15.0-34-generic #38~lp1736390Commit12064551Reverted SMP Thu Sep 13 13:28:33 UTC i686 i686 i686 GNU/Linux

The kernel change persists into the autopkgtest run:
autopkgtest [05:11:17]: testbed running kernel: Linux 4.15.0-34-generic #38~lp1736390Commit12064551Reverted SMP Thu Sep 13 13:28:33 UTC

The test kernel works fine where the other one failed.
To be sure, I ran it multiple times and with different CPU options enabled in KVM (e.g. to also run the DPDK tests, which need sse3).

But they all worked, no crash.
So yes, reverting that change seems to be the solution.
But what was it needed for, and what would break if it is reverted?

commit 120645513f55a4ac5543120d9e79925d30a0156f
Author: Jarno Rajahalme <email address hidden>
Date: Fri Apr 21 16:48:06 2017 -0700

    openvswitch: Add eventmask support to CT action.

    Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
    which can be used in conjunction with the commit flag
    (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
    conntrack events (IPCT_*) should be delivered via the Netfilter
    netlink multicast groups. Default behavior depends on the system
    configuration, but typically a lot of events are delivered. This can be
    very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
    types of events are of interest.

    Netfilter core init_conntrack() adds the event cache extension, so we
    only need to set the ctmask value. However, if the system is
    configured without support for events, the setting will be skipped due
    to extension not being found.

That is odd. I thought we had identified an Ubuntu-sauce patch in the past, but this is a normal upstream change.
I'd expect others to be affected as well and this to be fixed already. Or could it be that we are only affected by 1206455 due to some Ubuntu-sauce?

To check whether the latest upstream (no sauce, 4.19-rc3) fares better, I ran the latest mainline kernel.

autopkgtest [05:21:50]: testbed running kernel: Linux 4.19.0-041900rc3-generic #201809120832 SMP Wed Sep 12 12:47:16 UTC 2018

But that is crashing still.

@James: can you estimate what we lose on non-i386 when reverting that change for now?
@Joseph: what would we do now? Report it upstream? If so, what exactly: a description and link sent to the author and the ML, as we don't have a fix yet?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Could you give the latest mainline kernel a test before I ping upstream? It is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc4

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Joseph, due to a MAAS accident my test system got destroyed by a coworker.
I tested v4.19-rc3 as I wrote in comment #51. Do you mind accepting that as a valid "test latest mainline", even though it was not -rc4 as it would be now?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

That should be good. I just like to have the latest mainline already tested in case upstream asks for it. I'll ping upstream and see what the next steps should be.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote : [Regression] openvswitch: Add eventmask support to CT action.

Hi Jarno,

A kernel bug report was opened against Ubuntu [0].  This bug is a
regression introduced in v4.12-rc1.  The latest mainline kernel was
tested and still exhibits the bug.  The following commit was identified
as the cause of the regression:

    120645513f55 ("openvswitch: Add eventmask support to CT action.")

I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue?

Thanks,

Joe

[0] http://pad.lv/1736390

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a Bionic test kernel with a patch from upstream. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Can you test this kernel and see if it resolves this bug?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Joseph, I'm back from my PTO, but have to ask you again for an update - as in the past I'll need 32bit kernels for this test

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Note to myself: test instructions carried over from my original bug.

Update Kernel:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1712831/comments/12
and
sudo qemu-system-i386 -hda autopkgtest-bionic-i386.img -enable-kvm -nographic -curses -m 4096

Test:
sudo autopkgtest --shell-fail --apt-upgrade --no-built-binaries openvswitch_2.10.0-0ubuntu2.dsc -- qemu --qemu-options='-cpu host' --cpus 8 --ram-size=4096 ~/autopkgtest-bionic-i386.img

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built an i386 version of the Bionic test kernel with a patch from upstream. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1736390

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Reproduced the crash with the test case; it still triggers.

Installed 32bit Test kernel

It boots this one:
Linux 4.15.0-36-generic #40 SMP Fri Oct 12 00:17:54 UTC 2018

Seems to have no "special" version suffix to identify it other than #40 and build time.
But #40 and the build time indicate this is the provided test kernel.
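
When a test build does carry a marker in its build string, as the earlier "#38~lp1736390Commit12064551Reverted" kernel did, the check can be scripted. A minimal sketch, assuming the bug number is used as the tag (the function name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel build string carries a given tag.
#   $1  build string, e.g. the output of `uname -v` or `uname -a`
#   $2  tag to look for (assumed here to be the Launchpad bug number)
has_build_tag() {
    case "$1" in
        *"$2"*) return 0 ;;   # tag found somewhere in the string
        *)      return 1 ;;   # no tag: cannot confirm it is the test build
    esac
}
```

On the test machine this would be called as `has_build_tag "$(uname -v)" lp1736390`. It cannot identify an untagged build such as this #40 kernel, where the build number and build time remain the only hints.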

With that kernel it still fails.
Here is the updated BUG output from that kernel:

[ 74.352331] IP: add_grec+0x28/0x450
[ 74.353422] *pdpt = 000000001df53001 *pde = 0000000000000000
[ 74.355527] Oops: 0000 [#1] SMP
[ 74.356517] Modules linked in: veth openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache kvm_intel kvm irqbypass crc32_pclmul pcbc aesni_intel aes_i586 crypto_simd ppdev cryptd joydev input_leds 9pnet_virtio 9pnet parport_pc parport mac_hid serio_raw qemu_fw_cfg sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq psmouse virtio_blk virtio_net i2c_piix4 pata_acpi floppy
[ 74.367244] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G W 4.15.0-36-generic #40
[ 74.368932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 74.370719] EIP: add_grec+0x28/0x450
[ 74.371319] EFLAGS: 00010202 CPU: 2
[ 74.372213] EAX: 00000000 EBX: dd92c360 ECX: 00000006 EDX: dd92c360
[ 74.373451] ESI: d7406600 EDI: d7406600 EBP: d8db7f34 ESP: d8db7ef4
[ 74.374648] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 74.375540] CR0: 80050033 CR2: 00000000 CR3: 1e3e1220 CR4: 001406f0
[ 74.376881] Call Trace:
[ 74.377301] <SOFTIRQ>
[ 74.377708] ? pcpu_chunk_relocate+0x14/0x70
[ 74.378426] mld_ifc_timer_expire+0x10e/0x260
[ 74.379328] ? igmp6_timer_handler+0x60/0x60
[ 74.380047] call_timer_fn+0x2f/0x120
[ 74.380654] ? igmp6_timer_handler+0x60/0x60
[ 74.381367] run_timer_softirq+0x3b5/0x410
[ 74.382519] ? rcu_process_callbacks+0xc8/0x470
[ 74.383287] ? __softirqentry_text_start+0x8/0x8
[ 74.384396] __do_softirq+0xae/0x255
[ 74.385000] ? __softirqentry_text_start+0x8/0x8
[ 74.385769] call_on_stack+0x45/0x50
[ 74.386367] </SOFTIRQ>
[ 74.386800] ? irq_exit+0xb5/0xc0
[ 74.387377] ? smp_apic_timer_interrupt+0x6c/0x120
[ 74.388355] ? apic_timer_interrupt+0x3c/0x44
[ 74.389085] ? __sched_text_end+0x3/0x3
[ 74.389728] ? native_safe_halt+0x5/0x10
[ 74.390851] ? default_idle+0x1c/0x100
[ 74.391621] ? arch_cpu_idle+0x12/0x20
[ 74.392388] ? default_idle_call+0x1e/0x30
[ 74.393390] ? do_idle+0x145/0x1c0
[ 74.394410] ? cpu_startup_entry+0x65/0x70
[ 74.395432] ? start_secondary+0x18a/0x1d0
[ 74.396275] ? startup_32_smp+0x164/0x168
[ 74.397098] Code: 74 26 00 3e 8d 74 26 00 55 89 e5 57 56 53 89 c6 83 ec 34 89 4d e8 65 a1 14 00 00 00 89 45 f0 31 c0 f6 42 44 08 8b 42 10 89 45 cc <8b> 00 c7 45 ec 00 00 00 00 0f 85 f1 01 00 00 8b 80 54 01 00 00
[ 74.401207] EIP: add_grec+0x28/0x450 SS:ESP: 0068:d8db7ef4
[ 74.402470] CR2: 0000000000000000
[ 74.403158] ---[ end trace b2832e49d4542abf ]---
[ 74.404247] Kernel panic - not syncing: Fatal exception in interrupt
[ 74.405513] Kernel Offset: 0x9000000 from 0xc1000000 (relocati...

Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Cosmic):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Artful):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hmm, so are we giving up on this?

Revision history for this message
Juerg Haefliger (juergh) wrote :
Juerg Haefliger (juergh)
description: updated
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

There is an openvswitch related issue, bug 1813244. Perhaps these two are identical?

Revision history for this message
Andrea Righi (arighi) wrote :

I've done a test with the fix from bug #1813244 and the problem doesn't seem to happen. Probably a duplicate bug.

Brad Figg (brad-figg)
tags: added: cscc