System hang with Linux kernel due to mainline commit 24247aeeabe

Bug #1733662 reported by Rod Smith
This bug affects 3 people
Affects          Status          Importance   Assigned to        Milestone
linux (Ubuntu)   Fix Released    High         Joseph Salisbury
Artful           Fix Released    High         Joseph Salisbury
Bionic           Fix Committed   High         Joseph Salisbury

Bug Description

== SRU Justification ==
The following mainline commit introduced a regression in v4.14-rc1:
24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")

This commit made its way into Artful via Launchpad bug 1591609 as Artful commit
ac2fc5adab0f4b.

This bug caused regression tests to hang about one run in four
when running the cpu_offlining test.

The patch that fixes this regression was just submitted to mainline, so it is also
needed in Bionic.

== Fix ==
commit d47924417319e3b6a728c0b690f183e75bc2a702
Author: Thomas Gleixner <email address hidden>
Date: Tue Jan 16 19:59:59 2018 +0100

    x86/intel_rdt/cqm: Prevent use after free

== Regression Potential ==
Low. This patch fixes an existing regression, a use-after-free.

### Original Bug Description ###
In doing Ubuntu 17.10 regression testing, we've encountered one computer (boldore, a Cisco UCS C240 M4 [VIC]) that hangs about one in four times when running our cpu_offlining test. This test attempts to take all the CPU cores offline except one, then brings them back online again. The test ran successfully on boldore with previous releases; the hang appears only with 17.10. Reverting to Ubuntu 16.04.3, I found no problems; but when I upgraded the 16.04.3 installation to linux-image-4.13.0-16-generic, the problem appeared again, so I'm confident this is a problem with the kernel. I'm attaching two files, dmesg-output-4.10.txt and dmesg-output-4.13.txt, which show the dmesg output that appears when running the cpu_offlining test with 4.10.0-38 and 4.13.0-16 kernels, respectively; the system hung on the 4.13 run. (I was running "dmesg -w" in a second SSH login; the files are cut-and-pasted from that.)

I initiated this bug report from an Ubuntu 16.04.3 installation running a 4.10 kernel; but as I said, this applies to the 4.13 kernel.
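
For readers unfamiliar with the test, the offline/online cycle described above can be approximated through the kernel's sysfs CPU hotplug files. This is only a sketch, not the actual cpu_offlining script from the certification suite: the function names are made up for illustration, it assumes the standard /sys/devices/system/cpu/cpuN/online control files, and it relies on the boot CPU typically having no such file on x86, so that CPU stays online. Writing to the real sysfs path requires root.

```python
import re
from pathlib import Path


def hotpluggable_cpus(sysfs_cpu_dir):
    """Return the sorted CPU numbers that expose an 'online' control file."""
    cpus = []
    for entry in Path(sysfs_cpu_dir).iterdir():
        # Only cpuN directories count; cpufreq, cpuidle, etc. are skipped.
        match = re.fullmatch(r"cpu(\d+)", entry.name)
        if match and (entry / "online").exists():
            cpus.append(int(match.group(1)))
    return sorted(cpus)


def set_online(sysfs_cpu_dir, cpu, online):
    """Write 1 (online) or 0 (offline) to the CPU's control file."""
    control = Path(sysfs_cpu_dir) / f"cpu{cpu}" / "online"
    control.write_text("1" if online else "0")


def cycle_cpus(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Take every hotpluggable CPU offline, then bring them all back."""
    cpus = hotpluggable_cpus(sysfs_cpu_dir)
    for cpu in cpus:
        set_online(sysfs_cpu_dir, cpu, False)
    for cpu in cpus:
        set_online(sysfs_cpu_dir, cpu, True)
    return cpus
```

Pointing sysfs_cpu_dir at a scratch directory with the same layout allows a dry run without touching real CPUs.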

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.10.0-38-generic 4.10.0-38.42~16.04.1
ProcVersionSignature: User Name 4.10.0-38.42~16.04.1-generic 4.10.17
Uname: Linux 4.10.0-38-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
Date: Tue Nov 21 17:36:06 2017
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

Rod Smith (rodsmith) wrote :
tags: added: hwcert-server
Rod Smith (rodsmith) wrote :
Rod Smith (rodsmith) wrote :

I've discovered what may be the same bug on another system -- feebas, a Cisco UCS C220 M4 (Intel Series v3), with the same CPU type (Intel Xeon E5-2640 v3). I'm attaching dmesg output from it, but on this particular run, the computer did not hang indefinitely, although it did become unresponsive for a few seconds.

Rod Smith (rodsmith) wrote :

Here's the dmesg output from another run on feebas. In this case, the system has become unresponsive via SSH, although the console remains active.

Rod Smith (rodsmith) wrote :

I've tried upgrading to the latest development kernel, from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/, and re-testing. The details of the problem have changed (they were never 100% consistent), but the problem definitely still exists. I'm attaching dmesg output from three runs:

* run1.txt -- In this run, the cpu_offlining script successfully shut
  down all CPU nodes (except node 0, of course), but when bringing
  them up again, the system segfaulted after bringing up several
  nodes. Thereafter, any remotely substantive command (top or
  shutdown, for instance) hung, although bash remained responsive
  and I could take file listings with ls.
* run2.txt -- In this run, the cpu_offlining script segfaulted
  when taking CPU nodes offline. The system then became unreliable
  in the same way as with run 1.
* run3.txt -- In this run, the script seemed to complete successfully,
  but the dmesg output includes errors associated with bringing up
  several nodes. The system SEEMED TO operate normally thereafter,
  but my testing was limited.

Rod Smith (rodsmith) wrote :

Here are some more test runs on boldore, using different kernels, mostly from http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D. The attachment is a tarball containing dmesg output associated with runs of the cpu_offlining script. An overview:

* 4.10.0-38-generic: No hang or misbehavior; verbose dmesg output.
* 4.11.0-041100-generic: No hang or misbehavior; verbose dmesg output.
* 4.12.0-041200-generic: No hang or misbehavior; dmesg output is even
  more verbose and includes multiple "error -22" messages.
* 4.13.0-041300-generic: Similar to the above, but dmesg errors are
  now "error -19".
* 4.13.16-041316-generic: No system hang or misbehavior; dmesg
  output has no errors and is much shorter.
* 4.14.0-041400-generic: Segfault and limited functionality
  thereafter; dmesg has multiple "error -19" messages and multiple
  general protection fault dumps.
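
The numbers in these messages look like negated errno values, which is how kernel code conventionally reports errors. A quick way to decode them, assuming the script runs on Linux so that Python's errno table matches the kernel's numbering (describe_kernel_error is a hypothetical helper name):

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code such as -19 to (name, description)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# The two codes mentioned in the overview above.
for code in (-22, -19):
    print(code, *describe_kernel_error(code))
```

On Linux this reports -22 as EINVAL and -19 as ENODEV.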

Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

When you have a chance, could you also test the current mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc4/

This will tell us whether to perform a regular bisect to find the offending commit or, if the bug is already fixed in mainline, a "reverse" bisect to find the commit that fixes it.

tags: added: kernel-da-key performing-bisect
Joseph Salisbury (jsalisbury) wrote :

I see you already tested 4.15-rc1, but it's worthwhile to also test -rc4.

Changed in linux (Ubuntu Artful):
status: New → Triaged
Changed in linux (Ubuntu Bionic):
status: New → Triaged
Changed in linux (Ubuntu Artful):
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Rod Smith (rodsmith) wrote :

Joseph, I've just tested 4.15-rc4, and the script crashed and the system became responsive to only the simplest commands when bringing CPU 9 back up, accompanied by this output from dmesg:

[ 166.722460] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 166.722540] RIP: 0010:__kmalloc_track_caller+0xc5/0x210
[ 166.722578] RSP: 0000:ffffb75e8c7cbb08 EFLAGS: 00010206
[ 166.722615] RAX: 0000000000000000 RBX: 43ea0882f873c0e8 RCX: 00000000000001bf
[ 166.722663] RDX: 00000000000001be RSI: 0000000000000000 RDI: 0000000000021040
[ 166.722711] RBP: ffffb75e8c7cbb40 R08: ffff9cc35d341eaa R09: ffff9ca3ff807c00
[ 166.722757] R10: ffffb75e8c7cbd08 R11: bc159441a547de42 R12: ffff9cc35d341eaa
[ 166.722805] R13: 00000000014000c0 R14: 0000000000000007 R15: ffff9ca3ff807c00
[ 166.722852] FS: 0000000000000000(0000) GS:ffff9cc3ff240000(0000) knlGS:0000000000000000
[ 166.722905] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 166.722945] CR2: 0000000000000000 CR3: 0000001be7e09001 CR4: 00000000001606e0
[ 166.722992] Call Trace:
[ 166.723020] ? idr_alloc_cmn+0x97/0xd0
[ 166.723051] ? kstrdup_const+0x23/0x30
[ 166.723081] kstrdup+0x31/0x60
[ 166.723107] kstrdup_const+0x23/0x30
[ 166.723137] __kernfs_new_node+0x2c/0x120
[ 166.723168] kernfs_new_node+0x28/0x50
[ 166.723197] kernfs_create_dir_ns+0x34/0x90
[ 166.723229] sysfs_create_dir_ns+0x40/0x90
[ 166.723261] kobject_add_internal+0xac/0x2b0
[ 166.723294] kobject_add+0x71/0xd0
[ 166.723323] ? device_private_init+0x23/0x70
[ 166.723356] device_add+0x12c/0x680
[ 166.723385] cpu_device_create+0xe1/0x100
[ 166.723418] ? __slab_alloc+0x20/0x40
[ 166.723449] ? _cond_resched+0x19/0x40
[ 166.723481] cacheinfo_cpu_online+0x29a/0x3f0
[ 166.723515] ? get_cpu_cacheinfo+0x50/0x50
[ 166.723549] cpuhp_invoke_callback+0x9b/0x550
[ 166.723587] ? padata_replace+0xf0/0xf0
[ 166.725151] cpuhp_thread_fun+0xc4/0x150
[ 166.726682] smpboot_thread_fn+0xec/0x160
[ 166.728221] kthread+0x11e/0x140
[ 166.729701] ? sort_range+0x30/0x30
[ 166.731145] ? kthread_create_worker_on_cpu+0x70/0x70
[ 166.732551] ret_from_fork+0x1f/0x30
[ 166.733906] Code: 4d 01 e0 4d 8b 18 4d 33 99 40 01 00 00 4c 89 c3 4c 31 db 65 48 0f c7 0f 0f 94 c0 84 c0 74 ac 4d 39 d8 74 14 49 63 41 20 48 01 c3 <48> 33 1b 49 33 99 40 01 00 00 0f 18 0b 41 f7 c5 00 80 00 00 0f
[ 166.736776] RIP: __kmalloc_track_caller+0xc5/0x210 RSP: ffffb75e8c7cbb08
[ 166.738188] ---[ end trace 39ce10746b0f4324 ]---

If you want direct access to the affected hardware, that can be arranged. (If you've already got access to the certification network in 1SS, the affected system on which I've been doing most of the testing is boldore.) I'm also happy to run tests using test kernels that you give me.

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing mainline. The stack trace looks the same as prior kernels. We should perform a regular kernel bisect to identify the commit that introduced this regression.

Per comment #6, it sounds like none of the upstream kernels exhibit this bug. Is that correct?

If that is the case, it may be due to an Ubuntu SAUCE patch. Can you give an early 17.10 kernel a test:

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/+build/13358561

Rod Smith (rodsmith) wrote :

The upstream 4.14.0 kernel DOES segfault, but none of the 4.13-series kernels does. Some of the 4.13-series kernels do have "error -19" or "error -22" messages in their dmesg output, though.

I've tried the kernel at https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/+build/13358561, and it ran through the cpu_offlining script five times without error, so I think it's OK.

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. So we now know that 4.13.0-16 has the bug but 4.13.0-10 does not.

Can you next try 4.13.0-14:
https://launchpad.net/ubuntu/+source/linux/4.13.0-14.15/+build/13541235

Rod Smith (rodsmith) wrote :

4.13.0-14 failed when offlining CPU 9:

[ 104.500965] ------------[ cut here ]------------
[ 104.500968] kernel BUG at /build/linux-0p6sBa/linux-4.13.0/mm/slub.c:3878!
[ 104.501256] invalid opcode: 0000 [#1] SMP
[ 104.501422] Modules linked in: nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic igb crct10dif_pclmul crc32_pclmul ghash_clmulni_intel usbhid pcbc hid aesni_intel dca aes_x86_64 crypto_simd ptp glue_helper cryptd ahci pps_core i2c_algo_bit libahci megaraid_sas
[ 104.503659] CPU: 9 PID: 63 Comm: cpuhp/9 Not tainted 4.13.0-14-generic #15-Ubuntu
[ 104.504019] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 104.504537] task: ffff9a9838b6ae80 task.stack: ffffb7e90c7b8000
[ 104.504827] RIP: 0010:kfree+0x11c/0x160
[ 104.505003] RSP: 0018:ffffb7e90c7bbd60 EFLAGS: 00010246
[ 104.505311] RAX: ffffd9d77eff0020 RBX: ffff9a9800000000 RCX: 00000001802a001a
[ 104.505617] RDX: 0000000000000000 RSI: ffffd9d77fe02400 RDI: 000065a740000000
[ 104.505938] RBP: ffffb7e90c7bbd78 R08: ffff9a9838091ec0 R09: 00000001802a001a
[ 104.506255] R10: ffffd9d77f000000 R11: 0000000000000000 R12: ffffffff87798960
[ 104.506763] R13: ffffffff869dd4f0 R14: 0000000000000009 R15: 0000000000000001
[ 104.507216] FS: 0000000000000000(0000) GS:ffff9a983f240000(0000) knlGS:0000000000000000
[ 104.507638] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.507884] CR2: 00007ffdd8f1bff8 CR3: 00000016ff209000 CR4: 00000000001406e0
[ 104.508188] Call Trace:
[ 104.508311] kfree_const+0x20/0x30
[ 104.508468] kobject_put+0x91/0x1a0
[ 104.508626] device_unregister+0x28/0x60
[ 104.508796] cpu_cache_sysfs_exit+0x5a/0xc0
[ 104.508971] ? free_cache_attributes.part.7+0x110/0x110
[ 104.509201] cacheinfo_cpu_pre_down+0x48/0x50
[ 104.509401] cpuhp_invoke_callback+0x84/0x3b0
[ 104.509616] cpuhp_down_callbacks+0x42/0x80
[ 104.509812] cpuhp_thread_fun+0x88/0xe0
[ 104.509997] smpboot_thread_fn+0xec/0x160
[ 104.510182] kthread+0x125/0x140
[ 104.510322] ? sort_range+0x30/0x30
[ 104.510491] ? kthread_create_on_node+0x70/0x70
[ 104.510706] ret_from_fork+0x25/0x30
[ 104.510870] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c
[ 104.511761] RIP: kfree+0x11c/0x160 RSP: ffffb7e90c7bbd60
[ 104.512003] ---[ end trace 2290fcc444ad32ff ]---

Bash remained active, but I couldn't issue any significant commands.

Joseph Salisbury (jsalisbury) wrote :
Rod Smith (rodsmith) wrote :

4.13.0-12 seems to be OK; I ran it seven or eight times without a failure.
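
Because the hang is intermittent, a handful of clean runs does not prove a kernel is good. As a rough estimate, taking the "about one in four" failure rate from the bug description and assuming each run fails independently (an assumption, not something measured in this thread), the chance that a buggy kernel slips through can be computed as follows; both function names are illustrative:

```python
import math


def miss_probability(failure_rate, runs):
    """Chance that a buggy kernel survives `runs` independent test runs."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, max_miss):
    """Fewest clean runs that push the miss chance down to max_miss or less."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - failure_rate))


for runs in (5, 6, 8):
    print(runs, round(miss_probability(0.25, runs), 3))
print(runs_needed(0.25, 0.05))
```

Under these assumptions, five, six and eight clean runs leave roughly a 24%, 18% and 10% chance of wrongly calling a bad kernel good, and it takes 11 clean runs to get below 5%.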

Joseph Salisbury (jsalisbury) wrote :

There was no version 4.13.0-13, so I'll start a bisect between 4.13.0-12 and 4.13.0-14. I'll build a test kernel and post it shortly.

Changed in linux (Ubuntu Artful):
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

Hmm, now that I've looked at the commits between 4.13.0-12 and 4.13.0-14, bug 1734327 looks similar. I already built a test kernel for that bug, and was wondering if you could test it.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1734327/revert-test

Can you test that kernel and report back if it has the bug or not?

Rod Smith (rodsmith) wrote :

That one failed (the script stopped running after taking CPU 9 offline) with the following dmesg output:

[ 119.360953] ------------[ cut here ]------------
[ 119.360955] kernel BUG at /home/jsalisbury/bugs/lp1734327/ac8f82a-revert-test/ubuntu-artful/mm/slub.c:3878!
[ 119.361405] invalid opcode: 0000 [#1] SMP
[ 119.361586] Modules linked in: nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic pcbc igb usbhid dca aesni_intel hid aes_x86_64 crypto_simd glue_helper ptp cryptd ahci pps_core libahci i2c_algo_bit megaraid_sas
[ 119.363826] CPU: 9 PID: 63 Comm: cpuhp/9 Not tainted 4.13.0-19-generic #22~lp1731031TwoReverts
[ 119.364209] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 119.364687] task: ffff98cff8b49740 task.stack: ffffb3274c7b8000
[ 119.364973] RIP: 0010:kfree+0x11c/0x160
[ 119.365133] RSP: 0018:ffffb3274c7bbd60 EFLAGS: 00010246
[ 119.365356] RAX: fffff57a3bff0020 RBX: ffff98cf00000000 RCX: 0000000000000490
[ 119.365663] RDX: 0000000000000000 RSI: ffff98cfff25f4a0 RDI: 0000676f80000000
[ 119.365964] RBP: ffffb3274c7bbd78 R08: 000000000001f4a0 R09: ffffffffbb5dcf6a
[ 119.366262] R10: fffff57a3c000000 R11: 0000000000000000 R12: ffffffffbbf98e60
[ 119.366552] R13: ffffffffbb1dd820 R14: 0000000000000009 R15: 0000000000000001
[ 119.366844] FS: 0000000000000000(0000) GS:ffff98cfff240000(0000) knlGS:0000000000000000
[ 119.367176] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 119.367412] CR2: 000055cc84772018 CR3: 0000000e48e09000 CR4: 00000000001406e0
[ 119.367706] Call Trace:
[ 119.367824] kfree_const+0x20/0x30
[ 119.367975] kobject_put+0x91/0x1a0
[ 119.368134] device_unregister+0x28/0x60
[ 119.368311] cpu_cache_sysfs_exit+0x5a/0xc0
[ 119.368486] ? free_cache_attributes.part.7+0x110/0x110
[ 119.368709] cacheinfo_cpu_pre_down+0x48/0x50
[ 119.368897] cpuhp_invoke_callback+0x84/0x3b0
[ 119.369082] cpuhp_down_callbacks+0x42/0x80
[ 119.369253] cpuhp_thread_fun+0x88/0xe0
[ 119.369433] smpboot_thread_fn+0xec/0x160
[ 119.369598] kthread+0x125/0x140
[ 119.369732] ? sort_range+0x30/0x30
[ 119.369882] ? kthread_create_on_node+0x70/0x70
[ 119.370075] ret_from_fork+0x25/0x30
[ 119.370233] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 1c
[ 119.371052] RIP: kfree+0x11c/0x160 RSP: ffffb3274c7bbd60
[ 119.371313] ---[ end trace edef5d0868ec0d2a ]---

The system continued to run, and I was able to issue other commands (ifconfig, efibootmgr), but I rebooted just to be safe.

Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between v4.13.0-12 and v4.13.0-14. The kernel bisect will require testing of about 7-10 test kernels.
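
The "7-10 test kernels" estimate follows from bisection halving the candidate range on every test. A minimal sketch of that arithmetic over a simplified linear history (real git bisect works on a commit graph, so its own estimate can differ; the names below are illustrative):

```python
import math


def max_bisect_steps(commit_count):
    """Upper bound on tests needed to isolate one commit out of commit_count."""
    return math.ceil(math.log2(commit_count)) if commit_count > 1 else 0


def find_first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes the last commit is bad and that every commit after the first
    bad one is bad as well. Returns (first_bad_commit, tests_performed).
    """
    low, high = 0, len(commits) - 1
    tests = 0
    while low < high:
        mid = (low + high) // 2
        tests += 1
        if is_bad(commits[mid]):
            high = mid       # first bad commit is at mid or earlier
        else:
            low = mid + 1    # first bad commit is after mid
    return commits[low], tests


print(max_bisect_steps(128), max_bisect_steps(1024))
```

In this model, 7 to 10 tests correspond to somewhere between roughly 128 and 1024 candidate commits.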

I built the first test kernel, up to the following commit:
1c8d41925cff57972056048511a451040fa3b790

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Rod Smith (rodsmith) wrote :

There's nothing at the URL you posted, Joseph. Do I just need to give it more time to build, or is something wrong?

Joseph Salisbury (jsalisbury) wrote :

Sorry, the packages should be there now. You should only need the linux-image and linux-image-extra .deb files.

Rod Smith (rodsmith) wrote :

OK, I've run tests now. The system did not crash or otherwise misbehave, but the dmesg output was quite verbose, and included "error -19" messages. Here's a sample (apparently for just one CPU core; this sequence was repeated quite a few times):

[ 439.341956] smpboot: Booting Node 1 Processor 31 APIC 0x1f
[ 439.354783] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 439.354795] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 439.354814] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 439.354836] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[ 439.354849] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 439.354853] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 439.354859] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 439.354866] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 439.354870] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 439.354876] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 439.354882] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[ 439.354886] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[ 439.354892] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[ 439.354898] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[ 439.354902] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[ 439.354909] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[ 439.354915] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[ 439.354919] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[ 439.354925] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[ 439.354931] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[ 439.354936] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[ 439.354942] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[ 439.354948] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[ 439.354953] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[ 439.354960] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[ 439.354965] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[ 439.354978] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[ 439.354991] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[ 439.355003] EDAC sbridge: Seeking for: PCI ID 8086:2f6c
[ 439.355016] EDAC sbridge: Seeking for: PCI ID 8086:2f6d
[ 439.355029] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[ 439.355033] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[ 439.355039] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[ 439.355046] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[ 439.355049] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[ 439.355055] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[ 439.355062] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[ 439.355067] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[ 439.355073] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[ 439.355079] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[ 439.355084] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[ 439.355090] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[ 439.355095] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[ 439.355101] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[ 439.355107] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[ 439.355112] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[ 439.355117] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[ 439.355123] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[ 439.355355] EDAC MC0: Giving out device to module sb_eda...


Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
8d9d2235a82ea41e65eff607005ea4f334e2e503

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Rod Smith (rodsmith) wrote :

That one completed one run of the test OK, but then crashed on the second one, when bringing CPU 15 back online, with the following dmesg output:

[ 160.596312] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[ 160.596537] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[ 160.596679] EDAC sbridge: Some needed devices are missing
[ 160.627089] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[ 160.651100] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[ 160.651271] EDAC sbridge: Couldn't find mci handler
[ 160.651422] EDAC sbridge: Couldn't find mci handler
[ 160.651572] EDAC sbridge: Failed to register device with error -19.
[ 161.099074] BUG: unable to handle kernel paging request at 0000000180040100
[ 161.099512] IP: __kmalloc_node+0x135/0x2a0
[ 161.099704] PGD 1ff1f01067
[ 161.099705] P4D 1ff1f01067
[ 161.099871] PUD 0

[ 161.100373] Oops: 0000 [#2] SMP
[ 161.100548] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp intel_cstate kvm_intel kvm irqbypass intel_rapl_perf joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler mei_me mei shpchp lpc_ich acpi_pad mac_hid acpi_power_meter ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas fnic crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm pcbc igb hid_generic drm_kms_helper aesni_intel dca syscopyarea i2c_algo_bit sysfillrect aes_x86_64 sysimgblt usbhid libfcoe crypto_simd fb_sys_fops ahci ptp glue_helper hid mxm_wmi libfc cryptd libahci
[ 161.102507] pps_core drm enic scsi_transport_fc megaraid_sas wmi
[ 161.102856] CPU: 2 PID: 3686 Comm: python3 Tainted: G D 4.13.0-13-generic #14~lp1733662Commit8d9d2235a82ea41
[ 161.103230] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 161.103624] task: ffff8f3de5989740 task.stack: ffffa3a7ce288000
[ 161.104024] RIP: 0010:__kmalloc_node+0x135/0x2a0
[ 161.104431] RSP: 0018:ffffa3a7ce28bc30 EFLAGS: 00010246
[ 161.104846] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000f95
[ 161.105274] RDX: 0000000000000f94 RSI: 0000000000000000 RDI: 000000000001f3e0
[ 161.105705] RBP: ffffa3a7ce28bc70 R08: ffff8f3dffc9f3e0 R09: ffff8f3dff807c00
[ 161.106148] R10: ffffffffbb017760 R11: ffff8f5df8fa21f2 R12: 00000000014080c0
[ 161.106599] R13: 0000000000000008 R14: 0000000180040100 R15: ffff8f3dff807c00
[ 161.107057] FS: 00007f7849b98700(0000) GS:ffff8f3dffc80000(0000) knlGS:0000000000000000
[ 161.107530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 161.108014] CR2: 0000000180040100 CR3: 0000001ff6e6e000 CR4: 00000000001406e0
[ 161.108509] Call Trace:
[ 161.109012] ? alloc_cpumask_var_node+0x1f/0x30
[ 161.109523] ? on_each_cpu_cond+0x160/0x160
[ 161.110036] alloc_cpumask_var_node+0x1f/0x30
...


Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
83d4a97746e5fac08e2a1498c3649586bab953a3

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Rod Smith (rodsmith) wrote :

The build from http://kernel.ubuntu.com/~jsalisbury/lp1733662/ successfully completed about six runs of the test script, albeit with the verbose dmesg output that includes the "EDAC sbridge: Failed to register device with error -19" messages.

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. I'll mark that kernel as good. I think it's safe to ignore the "error -19" messages during the bisect. We just need to tell the bisect whether the kernel exhibits the original bug or not.
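
The good/bad rule described here can be written down as a small log filter. The sketch below is an assumption about how one might automate it, not a tool used in this bisect: it flags only the fatal signatures that appear in the dmesg excerpts in this thread and deliberately treats the "error -19" lines as noise. The names FATAL_PATTERNS and classify_dmesg are made up for illustration.

```python
import re

# Signatures of the original bug, taken from the dmesg excerpts in this thread.
FATAL_PATTERNS = [
    re.compile(r"kernel BUG at "),
    re.compile(r"BUG: unable to handle kernel paging request"),
    re.compile(r"general protection fault"),
    re.compile(r"Oops: "),
]


def classify_dmesg(text):
    """Return 'bad' if any fatal signature appears in the log, else 'good'.

    Lines such as 'Failed to register device with error -19' match none of
    the patterns, so they do not affect the verdict.
    """
    for line in text.splitlines():
        if any(pattern.search(line) for pattern in FATAL_PATTERNS):
            return "bad"
    return "good"
```

A hard hang that leaves nothing in the log would still be classified as good by this filter, so the verdict also has to take into account whether the test script finished.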

I built the next test kernel, up to the following commit:
97327adfdaf5d72053b1ce8d0847e93706c10dc6

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote :

That one hung much like the others, with the system responding only to very basic commands (mostly bash internals), although the dmesg output continued further after the kernel bug message. Here's the dmesg output:

[ 107.652875] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[ 107.652995] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[ 107.653010] EDAC sbridge: Some needed devices are missing
[ 107.675559] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[ 107.703606] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[ 107.703639] EDAC sbridge: Couldn't find mci handler
[ 107.704195] EDAC sbridge: Couldn't find mci handler
[ 107.704618] EDAC sbridge: Failed to register device with error -19.
[ 108.163612] smpboot: Booting Node 1 Processor 8 APIC 0x10
[ 108.189804] intel_rapl: Found RAPL domain package
[ 108.189810] intel_rapl: Found RAPL domain dram
[ 108.189812] intel_rapl: DRAM domain energy unit 15300pj
[ 108.190389] ------------[ cut here ]------------
[ 108.190390] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[ 108.191016] invalid opcode: 0000 [#1] SMP
[ 108.191511] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm input_leds irqbypass joydev mei_me intel_cstate ipmi_si intel_rapl_perf shpchp acpi_power_meter ipmi_devintf ipmi_msghandler mei lpc_ich mac_hid acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm pcbc fnic hid_generic drm_kms_helper igb syscopyarea aesni_intel usbhid dca sysfillrect i2c_algo_bit sysimgblt aes_x86_64 ptp fb_sys_fops crypto_simd mxm_wmi hid libfcoe glue_helper ahci cryptd libfc drm
[ 108.195174] libahci pps_core enic scsi_transport_fc megaraid_sas wmi
[ 108.195756] CPU: 8 PID: 302 Comm: kworker/8:3 Not tainted 4.13.0-13-generic #14~lp1733662Commit97327adfdaf5d
[ 108.196353] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 108.196971] Workqueue: events cpuset_hotplug_workfn
[ 108.197583] task: ffff8e3432fcae80 task.stack: ffffb5fb4e104000
[ 108.198236] RIP: 0010:kfree+0x11c/0x160
[ 108.198861] RSP: 0000:ffffb5fb4e107cc8 EFLAGS: 00010246
[ 108.199485] RAX: fffffb0ffeff0020 RBX: ffff8e3400000000 RCX: 000000018020001d
[ 108.200121] RDX: 0000000000000000 RSI: fffffb0fffd33600 RDI: 0000720b40000000
[ 108.200764] RBP: ffffb5fb4e107ce0 R08: ffff8e3434cd8c00 R09: 000000018020001d
[ 108.201405] R10: fffffb0fff000000 R11: 0000000000000000 R12: ffff8e343254f058
[ 108.202053] R13: ffffffff876ce3d3 R14: ffff8e34382b6d10 R15: 0000000000000000
[ 108.202703] FS: 0000000000000000(0000) GS:ffff8e343f200000(0000) knlGS:0000000000000000
[ 108.203367] CS: 0010 DS: 0000 ES: 0000 CR0: 0000...


Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
646779c79c8ab1382f81d79d346937c51746b07e

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote :

That one ran our test script half a dozen times without failure, albeit with the "Error -19" messages in the dmesg output.

Note that I'm about to EOD, so I probably won't get to the next one until next year. Have a good holiday, Joseph!

Joseph Salisbury (jsalisbury) wrote :

I hope you had a good holiday, Rod. I started up the bisect again.

I built the next test kernel, up to the following commit:
9ebf47f152918cce0caaa9c2767656635fbf59e4

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote :

Thanks, Joseph. My break was good; I hope yours was, too!

That latest version you posted completed half a dozen runs of the test script without incident, aside from the "error -19" messages.

Joseph Salisbury (jsalisbury) wrote :

The bisect should only require testing about 2 or 3 more kernels.

I built the next test kernel, up to the following commit:
aa0998e265482fd260b188dea8c03e7dd7c83c72

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote :

Joseph, that one also completed six runs with no problems except the "error -19" messages.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
e6108d5475696d0deaf37b59ff704aead9c5a8a7

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote :

That one completed its first run, but then crashed when bringing CPU 14 back online, with the following dmesg output:

[ 163.176945] ------------[ cut here ]------------
[ 163.176949] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[ 163.178043] invalid opcode: 0000 [#1] SMP
[ 163.178995] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate joydev input_leds shpchp ipmi_ssif intel_rapl_perf acpi_power_meter lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas mgag200 ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel syscopyarea pcbc sysfillrect fnic aesni_intel hid_generic sysimgblt igb fb_sys_fops aes_x86_64 dca usbhid crypto_simd i2c_algo_bit glue_helper libfcoe hid ahci ptp libfc mxm_wmi cryptd libahci
[ 163.186785] drm pps_core enic scsi_transport_fc megaraid_sas wmi
[ 163.188025] CPU: 14 PID: 93 Comm: cpuhp/14 Not tainted 4.13.0-13-generic #14~lp1733662Commite6108d5475696
[ 163.189294] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 163.190606] task: ffff8dbaf809c5c0 task.stack: ffffae2acc8a8000
[ 163.191926] RIP: 0010:kfree+0x11c/0x160
[ 163.193255] RSP: 0000:ffffae2acc8abb80 EFLAGS: 00010246
[ 163.194600] RAX: fffff9cb3bff0020 RBX: ffff8dba00000000 RCX: ffffae2acc8abb60
[ 163.195954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000728480000000
[ 163.197311] RBP: ffffae2acc8abb98 R08: ffffae2acc8abaec R09: 0000000000000002
[ 163.198703] R10: fffff9cb3c000000 R11: 0000000000000000 R12: ffff8d9aff94beb0
[ 163.200096] R13: ffffffffa6f2034b R14: ffff8dbaf27e4318 R15: ffff8dbaf27e4200
[ 163.201497] FS: 0000000000000000(0000) GS:ffff8dbaff380000(0000) knlGS:0000000000000000
[ 163.202919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 163.204351] CR2: 0000000000000000 CR3: 000000101aa09000 CR4: 00000000001406e0
[ 163.205802] Call Trace:
[ 163.207253] acpi_ns_get_node_unlocked+0xac/0xd8
[ 163.208704] ? kernfs_add_one+0xe4/0x130
[ 163.210183] ? down_timeout+0x37/0x60
[ 163.211644] ? acpi_os_wait_semaphore+0x4c/0x70
[ 163.213098] acpi_ns_get_node+0x41/0x58
[ 163.214550] ? acpi_ns_get_node+0x41/0x58
[ 163.216016] acpi_get_handle+0x95/0xbe
[ 163.217486] acpi_has_method+0x25/0x40
[ 163.218932] acpi_processor_get_performance_info+0x57/0x580
[ 163.220391] ? wrmsrl_on_cpu+0x57/0x70
[ 163.221870] acpi_processor_register_performance+0x5e/0xd0
[ 163.223354] __intel_pstate_cpu_init.part.16+0xed/0x2e0
[ 163.224835] ? intel_pstate_init_cpu+0xc9/0x2d0
[ 163.226323] intel_pstate_cpu_init+0x24/0x40
[ 163.227819] cpufreq_online+0xd8/0x750
[ 163.229301] ? cpufreq_online+0x750/0x750
[ 163.230781] cpuhp_cpufreq_online+0xe/0x20
[ 163.232262] cpuhp_invoke_callback+0x84/0x3b0
[ 163.233758] cpuhp_up_callbacks+0x36/0xc0
[ 163.235254] cpuhp_thr...

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
ac2fc5adab0f4b83f01214af61c8478c6ef186f9

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Revision history for this message
Rod Smith (rodsmith) wrote :

That one completed two runs, but on the second run, dmesg included the following message at one point:

[ 240.841694] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[ 240.842765] invalid opcode: 0000 [#1] SMP
[ 240.843718] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf ipmi_ssif joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter lpc_ich shpchp acpi_pad mac_hid mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc fnic mgag200 ttm hid_generic drm_kms_helper syscopyarea igb sysfillrect aesni_intel sysimgblt usbhid libfcoe fb_sys_fops aes_x86_64 dca hid crypto_simd i2c_algo_bit mxm_wmi glue_helper ptp cryptd ahci libfc libahci
[ 240.851457] drm pps_core megaraid_sas scsi_transport_fc enic wmi
[ 240.852693] CPU: 8 PID: 2724 Comm: irqbalance Not tainted 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[ 240.853965] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 240.855281] task: ffff9b62a76645c0 task.stack: ffffb973cf6fc000
[ 240.856603] RIP: 0010:kfree+0x11c/0x160
[ 240.857937] RSP: 0018:ffffb973cf6ffa08 EFLAGS: 00010246
[ 240.859280] RAX: fffff8803cff0020 RBX: ffff9b6200000000 RCX: 0000000000000000
[ 240.860632] RDX: 0000000000000000 RSI: ffff9b62b0eb5348 RDI: 000064dcc0000000
[ 240.861995] RBP: ffffb973cf6ffa20 R08: ffff9b62b22f70f0 R09: 0000000180220021
[ 240.863367] R10: fffff8803d000000 R11: 0000000000000001 R12: ffff9b62b1648780
[ 240.864756] R13: ffffffffb65dd4e0 R14: ffff9b62a872f0d8 R15: ffff9b62a872fac0
[ 240.866145] FS: 00007ff8c4d06740(0000) GS:ffff9b62bf200000(0000) knlGS:0000000000000000
[ 240.867562] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 240.868986] CR2: 00007fff9ef860f8 CR3: 0000003fe7876000 CR4: 00000000001406e0
[ 240.870438] Call Trace:
[ 240.871882] kfree_const+0x20/0x30
[ 240.873328] kernfs_put+0x71/0x180
[ 240.874778] kernfs_dop_release+0x12/0x20
[ 240.876218] __dentry_kill+0xe5/0x150
[ 240.877644] shrink_dentry_list+0x11f/0x2e0
[ 240.879078] d_invalidate+0x67/0x110
[ 240.880526] lookup_fast+0x2b9/0x310
[ 240.881968] ? dput.part.23+0x2d/0x1e0
[ 240.883393] walk_component+0x49/0x340
[ 240.884811] ? kernfs_iop_permission+0x4f/0x60
[ 240.886253] link_path_walk+0x1bc/0x590
[ 240.887690] ? path_init+0x177/0x2f0
[ 240.889105] path_lookupat+0x56/0x1f0
[ 240.890529] filename_lookup+0xb6/0x190
[ 240.891964] ? sprintf+0x51/0x70
[ 240.893387] ? __check_object_size+0xaf/0x1b0
[ 240.894822] ? strncpy_from_user+0x4d/0x170
[ 240.896240] user_path_at_empty+0x36/0x40
[ 240.897673] ? user_path_at_empty+0x36/0x40
[ 240.899101] vfs_statx+0x76/0xe0
[ 240.900517] SYSC_newstat+0x3d/0x70
[ 240.901934] ? ____fput+0xe/0x10
[ 240.903365] ? task_work_run+0x7b/0x90
[ 240.904783] ? exit_to_usermode...

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The bisect reported the following as the first bad commit:
commit ac2fc5adab0f4b83f01214af61c8478c6ef186f9
Author: Vikas Shivappa <email address hidden>
Date: Tue Aug 15 18:00:43 2017 -0700
    x86/intel_rdt/cqm: Improve limbo list processing

I built a test kernel with a revert of ac2fc5adab0.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Revision history for this message
Rod Smith (rodsmith) wrote :

I'm afraid that one fails, too, on the second run when bringing CPU10 back online. Here's the dmesg output:

[ 154.987312] smpboot: Booting Node 1 Processor 10 APIC 0x14
[ 154.992953] BUG: unable to handle kernel paging request at 0000317865646e69
[ 154.993932] IP: __kmalloc_track_caller+0x97/0x1f0
[ 154.994847] PGD 0
[ 154.994848] P4D 0

[ 154.997397] Oops: 0000 [#1] SMP
[ 154.998250] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev input_leds ipmi_ssif irqbypass mac_hid ipmi_si shpchp intel_cstate intel_rapl_perf acpi_power_meter ipmi_devintf acpi_pad mei_me lpc_ich ipmi_msghandler mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas mgag200 ttm fnic hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm_kms_helper pcbc usbhid syscopyarea igb sysfillrect libfcoe aesni_intel sysimgblt dca fb_sys_fops i2c_algo_bit aes_x86_64 hid crypto_simd glue_helper libfc ptp mxm_wmi ahci drm cryptd
[ 155.005714] libahci pps_core scsi_transport_fc enic megaraid_sas wmi
[ 155.006913] CPU: 10 PID: 69 Comm: cpuhp/10 Not tainted 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[ 155.008154] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 155.009427] task: ffff91c7b8785d00 task.stack: ffffa8760c7e8000
[ 155.010718] RIP: 0010:__kmalloc_track_caller+0x97/0x1f0
[ 155.012014] RSP: 0000:ffffa8760c7ebc48 EFLAGS: 00010206
[ 155.013308] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000014b9
[ 155.014618] RDX: 00000000000014b8 RSI: 0000000000000000 RDI: 000000000001f3e0
[ 155.015946] RBP: ffffa8760c7ebc80 R08: ffff91c7bf29f3e0 R09: ffff91a7bf807c00
[ 155.017284] R10: ffffa8760c7ebce0 R11: 0000000000000006 R12: 0000317865646e69
[ 155.018620] R13: 00000000014000c0 R14: 0000000000000007 R15: ffff91a7bf807c00
[ 155.019965] FS: 0000000000000000(0000) GS:ffff91c7bf280000(0000) knlGS:0000000000000000
[ 155.021329] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 155.022710] CR2: 0000317865646e69 CR3: 0000000ec6c09000 CR4: 00000000001406e0
[ 155.024101] Call Trace:
[ 155.025490] ? kvasprintf_const+0x45/0xa0
[ 155.026906] kvasprintf+0x66/0xd0
[ 155.028304] kvasprintf_const+0x45/0xa0
[ 155.029703] kobject_set_name_vargs+0x23/0x90
[ 155.031101] cpu_device_create+0xa4/0x100
[ 155.032485] ? smp_call_function_single+0xb9/0xe0
[ 155.033891] cacheinfo_cpu_online+0x2ac/0x400
[ 155.035295] ? get_cpu_cacheinfo+0x50/0x50
[ 155.036709] cpuhp_invoke_callback+0x84/0x3b0
[ 155.038101] cpuhp_up_callbacks+0x36/0xc0
[ 155.039513] cpuhp_thread_fun+0xd4/0xe0
[ 155.040923] smpboot_thread_fn+0xec/0x160
[ 155.042319] kthread+0x125/0x140
[ 155.043706] ? sort_range+0x30/0x30
[ 155.045107] ? kthread_create_on_node+0x70/0x70
[ 155.046515] ret_from_fork+0x25/0x30
[ 155.047906] Code: 08 65 4c 03 05 ab e5 7d 5b 49 83 78 10 00 4d 8b 20 0f 84 ef 00 00 00 4d 85 e4 0f 84 e6 00 00 00 49 63 41 20 4...

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The uname looks like you may still be running the kernel from comment #37. The test kernel with the revert should have a name like:

linux-image-4.13.0-21-generic_4.13.0-21.24~lp1733662Revert_amd64

The string "Revert" should be in the uname output.
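
A quick way to guard against booting the wrong test kernel is to check the build tag before starting a run. This is a minimal sketch, assuming the tag appears in the `uname -v` string as it does for these test builds; the function name `kernel_has_tag` is made up for illustration.

```shell
# Return 0 if a kernel version string carries the expected build tag.
# Arguments: $1 = tag to look for (e.g. "Revert"),
#            $2 = version string (defaults to the output of `uname -v`).
# The function name is a hypothetical helper, not part of any test suite.
kernel_has_tag() {
    local tag="$1"
    local version="${2:-$(uname -v)}"
    case "$version" in
        *"$tag"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example: warn before starting a test run on the wrong kernel.
if kernel_has_tag "Revert"; then
    echo "running the revert test kernel"
else
    echo "warning: 'Revert' not found in: $(uname -v)"
fi
```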

Revision history for this message
Rod Smith (rodsmith) wrote :

You're right. (I've got too many kernels installed on that system!) When I tested again, it got through eight runs without problems, beyond the "error -19" message. Here's the uname information, just to be sure:

$ uname -a
Linux oil-boldore 4.13.0-21-generic #24~lp1733662Revert SMP Mon Jan 8 15:35:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update. I'll ping the author of mainline commit 24247aeeabe99eab to get some feedback.

Before I do that, can you confirm the bug still exists with the latest mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc7/

Revision history for this message
Rod Smith (rodsmith) wrote :

Yes, it still exists. To confirm the kernel version:

$ uname -a
Linux oil-boldore 4.15.0-041500rc7-generic #201801072330 SMP Sun Jan 7 23:31:29 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The system hung bringing CPU 11 back online, with the following dmesg output:

[ 101.179624] smpboot: Booting Node 1 Processor 11 APIC 0x16
[ 101.727507] general protection fault: 0000 [#1] SMP PTI
[ 101.727812] Modules linked in: nls_iso8859_1 intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev input_leds irqbypass ipmi_ssif intel_cstate intel_rapl_perf ipmi_si acpi_power_meter mei_me shpchp ipmi_devintf ipmi_msghandler mei lpc_ich mac_hid acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic crct10dif_pclmul mgag200 crc32_pclmul ttm ghash_clmulni_intel pcbc usbhid ses igb drm_kms_helper enclosure dca syscopyarea sysfillrect aesni_intel scsi_transport_sas hid fnic aes_x86_64 sysimgblt ptp crypto_simd libfcoe fb_sys_fops glue_helper ahci pps_core mxm_wmi
[ 101.730450] cryptd libfc libahci i2c_algo_bit drm scsi_transport_fc enic megaraid_sas wmi
[ 101.730883] CPU: 6 PID: 3205 Comm: python3 Not tainted 4.15.0-041500rc7-generic #201801072330
[ 101.731319] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 101.731773] RIP: 0010:__kmalloc_node+0x16a/0x2c0
[ 101.732224] RSP: 0018:ffffa7d0cf86bbe0 EFLAGS: 00010206
[ 101.732682] RAX: 0000000000000000 RBX: 3b37355eb8b32f18 RCX: 0000000000000349
[ 101.733146] RDX: 0000000000000348 RSI: 0000000000000000 RDI: 0000000000027040
[ 101.733609] RBP: ffffa7d0cf86bc20 R08: ffff94818ede9cdc R09: ffff9461bf807c00
[ 101.734075] R10: ffffffffaaa16cc0 R11: c4c8a1df366db3c4 R12: 00000000014080c0
[ 101.734547] R13: 0000000000000008 R14: ffff94818ede9cdc R15: ffff9461bf807c00
[ 101.735023] FS: 00007f8b0a2c2700(0000) GS:ffff9461bfd80000(0000) knlGS:0000000000000000
[ 101.735510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.735997] CR2: 000056075d0c11a8 CR3: 0000001fe0b32003 CR4: 00000000001606e0
[ 101.736491] Call Trace:
[ 101.736988] ? alloc_cpumask_var_node+0x1f/0x30
[ 101.737488] ? on_each_cpu_cond+0x140/0x140
[ 101.737986] alloc_cpumask_var_node+0x1f/0x30
[ 101.738489] zalloc_cpumask_var_node+0xf/0x20
[ 101.738988] smpcfd_prepare_cpu+0x46/0xc0
[ 101.739493] cpuhp_invoke_callback+0x9b/0x550
[ 101.740012] ? init_idle+0x179/0x190
[ 101.740515] _cpu_up+0xb1/0x180
[ 101.741017] do_cpu_up+0x8b/0xb0
[ 101.741515] cpu_up+0x13/0x20
[ 101.742012] cpu_subsys_online+0x3d/0x90
[ 101.742510] device_online+0x4a/0x90
[ 101.743010] online_store+0x89/0xa0
[ 101.743506] dev_attr_store+0x18/0x30
[ 101.744003] sysfs_kf_write+0x37/0x40
[ 101.744501] kernfs_fop_write+0x11c/0x1a0
[ 101.744998] __vfs_write+0x37/0x170
[ 101.745494] ? common_file_perm+0x50/0x140
[ 101.745994] ? apparmor_file_permission+0x1a/0x20
[ 101.746495] ? security_file_permission+0x3b/0xc0
[ 101.746993] ? _cond_resched...

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe (Ubuntu Artful):
status: New → Confirmed
Changed in linux-hwe (Ubuntu):
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote : [REGRESSION][v4.14.y][v4.15] x86/intel_rdt/cqm: Improve limbo list processing

Hi Vikas,

A kernel bug report was opened against Ubuntu [0].  After a kernel
bisect, it was found that reverting the following commit resolved this bug:

commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
Author: Vikas Shivappa <email address hidden>
Date:   Tue Aug 15 18:00:43 2017 -0700

    x86/intel_rdt/cqm: Improve limbo list processing

The regression was introduced as of v4.14-rc1 and still exists with
current mainline.  The trace with v4.15-rc7 is in comment #44[1].

I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?

Thanks,

Joe
[0] http://pad.lv/1733662
[1] https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/44

summary: - System hang with Linux kernel 4.13, not with 4.10
+ System hang with Linux kernel due to mainline commit 24247aeeabe
Revision history for this message
tglx (tglx) wrote :

On Fri, 12 Jan 2018, Joseph Salisbury wrote:

> Hi Vikas,
>
> A kernel bug report was opened against Ubuntu [0].  After a kernel
> bisect, it was found that reverting the following commit resolved this bug:
>
> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> Author: Vikas Shivappa <email address hidden>
> Date:   Tue Aug 15 18:00:43 2017 -0700
>
>     x86/intel_rdt/cqm: Improve limbo list processing
>
>
> The regression was introduced as of v4.14-rc1 and still exists with
> current mainline.  The trace with v4.15-rc7 is in comment #44[1].
>
> I was hoping to get your feedback, since you are the patch author.  Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request?

That stinks like a use after free. Can you run with KASAN enabled?

Thanks,

 tglx

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Rod,

I built an Artful test kernel with KASAN enabled.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test this kernel as requested by upstream?
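
Before running the test, it can be worth confirming that the booted kernel really was built with KASAN. The sketch below assumes the kernel config is installed as `/boot/config-<release>`, as Ubuntu kernel packages normally do; the function name `kasan_setting` is hypothetical.

```shell
# Print the CONFIG_KASAN line from a kernel config file.
# Argument: $1 = config file (defaults to /boot/config-$(uname -r),
#           an assumption about where the distribution installs it).
kasan_setting() {
    local config="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$config" ]; then
        echo "cannot read $config"
        return 1
    fi
    # Anchor on the exact option so CONFIG_HAVE_ARCH_KASAN does not match.
    grep -E '^CONFIG_KASAN=' "$config" || echo "CONFIG_KASAN is not set"
}
```

Calling `kasan_setting` with no argument inspects the running kernel; a KASAN build should print `CONFIG_KASAN=y`.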

Revision history for this message
tglx (tglx) wrote :

Vikas, Fenghua can you please look at that ASAP?

On Sun, 14 Jan 2018, Thomas Gleixner wrote:

> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
>
> > Hi Vikas,
> >
> > A kernel bug report was opened against Ubuntu [0].  After a kernel
> > bisect, it was found that reverting the following commit resolved this bug:
> >
> > commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> > Author: Vikas Shivappa <email address hidden>
> > Date:   Tue Aug 15 18:00:43 2017 -0700
> >
> >     x86/intel_rdt/cqm: Improve limbo list processing
> >
> >
> > The regression was introduced as of v4.14-rc1 and still exists with
> > current mainline.  The trace with v4.15-rc7 is in comment #44[1].
> >
> > I was hoping to get your feedback, since you are the patch author.  Do
> > you think gathering any additional data will help diagnose this issue,
> > or would it be best to submit a revert request?
>
> That stinks like a use after free. Can you run with KASAN enabled?
>
> Thanks,
>
> tglx

Revision history for this message
Rod Smith (rodsmith) wrote :

Joseph,

The first run of your latest kernel completed; however, I noticed the following in the dmesg output:

[ 426.281083] ==================================================================
[ 426.286615] BUG: KASAN: use-after-free in find_first_bit+0x1f/0x80
[ 426.291841] Read of size 8 at addr ffff883ff7c1e780 by task cpuhp/31/195

[ 426.302209] CPU: 31 PID: 195 Comm: cpuhp/31 Not tainted 4.13.0-25-generic #29~lp1733662KASANenabled
[ 426.302213] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 426.302215] Call Trace:
[ 426.302233] dump_stack+0xb8/0x12d
[ 426.302241] ? dma_virt_map_sg+0xd3/0xd3
[ 426.302252] ? show_regs_print_info+0x41/0x41
[ 426.302263] print_address_description+0x6f/0x280
[ 426.302269] kasan_report+0x27a/0x370
[ 426.302276] ? find_first_bit+0x1f/0x80
[ 426.302288] __asan_load8+0x54/0x90
[ 426.302295] find_first_bit+0x1f/0x80
[ 426.302306] has_busy_rmid+0x47/0x70
[ 426.302314] intel_rdt_offline_cpu+0x4b4/0x510
[ 426.302321] ? clear_closid_rmid.isra.4+0x70/0x70
[ 426.302333] ? sysfs_remove_group+0x7a/0xc0
[ 426.302339] ? clear_closid_rmid.isra.4+0x70/0x70
[ 426.302351] cpuhp_invoke_callback+0x15f/0x7e0
[ 426.302360] ? cpuhp_kick_ap_work+0x2d0/0x2d0
[ 426.302372] ? __schedule+0x4f1/0xeb0
[ 426.302377] ? cpuhp_kick_ap_work+0x2d0/0x2d0
[ 426.302385] ? firmware_map_remove+0x1b1/0x1b1
[ 426.302395] ? migrate_swap_stop+0x2f0/0x2f0
[ 426.302402] ? firmware_map_remove+0x1b1/0x1b1
[ 426.302407] ? migrate_swap_stop+0x2f0/0x2f0
[ 426.302414] ? schedule+0xd8/0x2a0
[ 426.302421] ? __schedule+0xeb0/0xeb0
[ 426.302427] ? default_wake_function+0x2f/0x40
[ 426.302439] ? __wake_up_common+0xa1/0xc0
[ 426.302446] cpuhp_down_callbacks+0x52/0xa0
[ 426.302453] cpuhp_thread_fun+0x117/0x1a0
[ 426.302459] ? cpu_up+0x20/0x20
[ 426.302468] smpboot_thread_fn+0x20e/0x2f0
[ 426.302474] ? sort_range+0x30/0x30
[ 426.302482] kthread+0x1b7/0x1e0
[ 426.302488] ? sort_range+0x30/0x30
[ 426.302493] ? kthread_create_on_node+0xc0/0xc0
[ 426.302500] ret_from_fork+0x1f/0x30

[ 426.307683] Allocated by task 56:
[ 426.312817] save_stack_trace+0x1b/0x20
[ 426.312824] save_stack+0x43/0xd0
[ 426.312829] kasan_kmalloc+0xad/0xe0
[ 426.312834] __kmalloc+0x105/0x230
[ 426.312840] intel_rdt_online_cpu+0x5a8/0x830
[ 426.312846] cpuhp_invoke_callback+0x15f/0x7e0
[ 426.312850] cpuhp_thread_fun+0x8b/0x1a0
[ 426.312856] smpboot_thread_fn+0x20e/0x2f0
[ 426.312861] kthread+0x1b7/0x1e0
[ 426.312866] ret_from_fork+0x1f/0x30

[ 426.317887] Freed by task 195:
[ 426.322879] save_stack_trace+0x1b/0x20
[ 426.322887] save_stack+0x43/0xd0
[ 426.322891] kasan_slab_free+0x72/0xc0
[ 426.322896] kfree+0x94/0x1a0
[ 426.322902] intel_rdt_offline_cpu+0x17d/0x510
[ 426.322908] cpuhp_invoke_callback+0x15f/0x7e0
[ 426.322912] cpuhp_down_callbacks+0x52/0xa0
[ 426.322917] cpuhp_thread_fun+0x117/0x1a0
[ 426.322925] smpboot_thread_fn+0x20e/0x2f0
[ 426.322929] kthread+0x1b7/0x1e0
[ 426.322935] ret_from_fork+0x1f/0x30

[ 426.327837] The buggy address belongs to the object at ffff883ff7c1e780
                which belongs to the c...

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> Vikas on vacation until end of the month. Fenghua will look into this
> issue.
>
> On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> <mailto:<email address hidden>>> wrote:
>
>>
>> Vikas, Fenghua can you please look at that ASAP?
>>
>> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
>>
>>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
>>>
>>>> Hi Vikas,
>>>>
>>>> A kernel bug report was opened against Ubuntu [0].  After a kernel
>>>> bisect, it was found that reverting the following commit resolved
>>>> this bug:
>>>>
>>>> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
>>>> Author: Vikas Shivappa <<email address hidden>
>>>> <mailto:<email address hidden>>>
>>>> Date:   Tue Aug 15 18:00:43 2017 -0700
>>>>
>>>>     x86/intel_rdt/cqm: Improve limbo list processing
>>>>
>>>>
>>>> The regression was introduced as of v4.14-rc1 and still exists with
>>>> current mainline.  The trace with v4.15-rc7 is in comment #44[1].
>>>>
>>>> I was hoping to get your feedback, since you are the patch author.  Do
>>>> you think gathering any additional data will help diagnose this issue,
>>>> or would it be best to submit a revert request?
>>>
>>> That stinks like a use after free. Can you run with KASAN enabled?
>>>
>>> Thanks,
>>>
>>>    tglx

Here is some data with KASAN enabled:
https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/51

Are there any specific logs you would like to see, or specific actions
executed?

Thanks,

Joe

Revision history for this message
tglx (tglx) wrote :

On Tue, 16 Jan 2018, Joseph Salisbury wrote:
> On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> > Vikas on vacation until end of the month. Fenghua will look into this
> > issue.
> >
> > On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> > <mailto:<email address hidden>>> wrote:
> >
> >>
> >> Vikas, Fenghua can you please look at that ASAP?
> >>
> >> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> >>
> >>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> >>>
> >>>> Hi Vikas,
> >>>>
> >>>> A kernel bug report was opened against Ubuntu [0].  After a kernel
> >>>> bisect, it was found that reverting the following commit resolved
> >>>> this bug:
> >>>>
> >>>> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> >>>> Author: Vikas Shivappa <<email address hidden>
> >>>> <mailto:<email address hidden>>>
> >>>> Date:   Tue Aug 15 18:00:43 2017 -0700
> >>>>
> >>>>     x86/intel_rdt/cqm: Improve limbo list processing
> >>>>
> >>>>
> >>>> The regression was introduced as of v4.14-rc1 and still exists with
> >>>> current mainline.  The trace with v4.15-rc7 is in comment #44[1].
> >>>>
> >>>> I was hoping to get your feedback, since you are the patch author.  Do
> >>>> you think gathering any additional data will help diagnose this issue,
> >>>> or would it be best to submit a revert request?
> >>>
> >>> That stinks like a use after free. Can you run with KASAN enabled?
> >>>
> >>> Thanks,
> >>>
> >>>    tglx
>
>
> Here is some data with KASAN enabled:
> https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/51
>
> Are there any specific logs you would like to see, or specific actions
> executed?

No, the KASAN output is pretty clear where the issue is.

Thanks,

 tglx

Revision history for this message
Fenghua Yu (fyu) wrote :

> From: Thomas Gleixner [mailto:<email address hidden>]
> On Tue, 16 Jan 2018, Joseph Salisbury wrote:
> > On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> > > Vikas on vacation until end of the month. Fenghua will look into
> > > this issue.
> > >
> > > On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> > > <mailto:<email address hidden>>> wrote:
> > >
> > >>
> > >> Vikas, Fenghua can you please look at that ASAP?
> > >>
> > >> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> > >>
> > >>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> > >>>
> > >>>> Hi Vikas,
> > >>>>
> > >>>> A kernel bug report was opened against Ubuntu [0].  After a
> > >>>> kernel bisect, it was found that reverting the following commit
> > >>>> resolved this bug:
> > >>>>
> > >>>> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> > >>>> Author: Vikas Shivappa <<email address hidden>
> > >>>> <mailto:<email address hidden>>>
> > >>>> Date:   Tue Aug 15 18:00:43 2017 -0700
> > >>>>
> > >>>>     x86/intel_rdt/cqm: Improve limbo list processing
> > >>>>
> > >>>>
> > >>>> The regression was introduced as of v4.14-rc1 and still exists
> > >>>> with current mainline.  The trace with v4.15-rc7 is in comment #44[1].
> > >>>>
> > >>>> I was hoping to get your feedback, since you are the patch
> > >>>> author.  Do you think gathering any additional data will help
> > >>>> diagnose this issue, or would it be best to submit a revert request?
> > >>>
> > >>> That stinks like a use after free. Can you run with KASAN enabled?
> > >>>
> > >>> Thanks,
> > >>>
> > >>>    tglx
> >
> >
> > > Here is some data with KASAN enabled:
> > > https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/51
> >
> > Are there any specific logs you would like to see, or specific actions
> > executed?
>
> No, the KASAN output is pretty clear where the issue is.
>
> Thanks,
>
> tglx

Is this a Haswell-specific issue?

I ran the following test continuously without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
for ((;;)) do
        for ((i=1;i<88;i++)) do
                echo 0 >/sys/devices/system/cpu/cpu$i/online
        done
        echo "online cpus:"
        grep processor /proc/cpuinfo |wc
        for ((i=1;i<88;i++)) do
                echo 1 >/sys/devices/system/cpu/cpu$i/online
        done
        echo "online cpus:"
        grep processor /proc/cpuinfo|wc
done

I'm looking for a Haswell machine to reproduce the issue.

Thanks.

-Fenghua

Revision history for this message
tglx (tglx) wrote :

On Tue, 16 Jan 2018, Yu, Fenghua wrote:
> > From: Thomas Gleixner [mailto:<email address hidden>]
> Is this a Haswell specific issue?
>
> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
> for ((;;)) do
> for ((i=1;i<88;i++)) do
> echo 0 >/sys/devices/system/cpu/cpu$i/online
> done
> echo "online cpus:"
> grep processor /proc/cpuinfo |wc
> for ((i=1;i<88;i++)) do
> echo 1 >/sys/devices/system/cpu/cpu$i/online
> done
> echo "online cpus:"
> grep processor /proc/cpuinfo|wc
> done
>
> I'm finding a Haswell to reproduce the issue.

Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.

You simply do not run into it because on your machine

    is_llc_occupancy_enabled() is false...

Thanks,

 tglx

8<--------------------

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 88dcf8479013..99442370de40 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -525,10 +525,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 		 */
 		if (static_branch_unlikely(&rdt_mon_enable_key))
 			rmdir_mondata_subdir_allrdtgrp(r, d->id);
-		kfree(d->ctrl_val);
-		kfree(d->rmid_busy_llc);
-		kfree(d->mbm_total);
-		kfree(d->mbm_local);
 		list_del(&d->list);
 		if (is_mbm_enabled())
 			cancel_delayed_work(&d->mbm_over);
@@ -545,6 +541,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 			cancel_delayed_work(&d->cqm_limbo);
 		}
 
+		kfree(d->ctrl_val);
+		kfree(d->rmid_busy_llc);
+		kfree(d->mbm_total);
+		kfree(d->mbm_local);
 		kfree(d);
 		return;
 	}
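
For anyone trying to work out whether a given machine can exercise this path, one rough userspace check is whether the CPU advertises LLC occupancy monitoring. The sketch below assumes the `cqm_occup_llc` flag in `/proc/cpuinfo` is a reasonable proxy for the kernel-internal `is_llc_occupancy_enabled()` test; it is not the same check, and the function name is hypothetical.

```shell
# Report whether a cpuinfo-style file advertises LLC occupancy monitoring.
# Argument: $1 = file to inspect (defaults to /proc/cpuinfo).
# Assumption: the cqm_occup_llc flag approximates what the kernel's
# is_llc_occupancy_enabled() reports; the kernel check itself is internal.
has_llc_occupancy_flag() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # Only the first "flags" line is needed; -w avoids partial matches.
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw 'cqm_occup_llc'
}

if has_llc_occupancy_flag; then
    echo "LLC occupancy monitoring is advertised on this CPU"
else
    echo "LLC occupancy monitoring is not advertised on this CPU"
fi
```

On kernels with resctrl monitoring support, the events listed in `info/L3_MON/mon_features` under the resctrl mount point should give a more direct answer.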

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Yu, Fenghua wrote:
>>> From: Thomas Gleixner [mailto:<email address hidden>]
>> Is this a Haswell specific issue?
>>
>> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
>> for ((;;)) do
>> for ((i=1;i<88;i++)) do
>> echo 0 >/sys/devices/system/cpu/cpu$i/online
>> done
>> echo "online cpus:"
>> grep processor /proc/cpuinfo |wc
>> for ((i=1;i<88;i++)) do
>> echo 1 >/sys/devices/system/cpu/cpu$i/online
>> done
>> echo "online cpus:"
>> grep processor /proc/cpuinfo|wc
>> done
>>
>> I'm finding a Haswell to reproduce the issue.
> Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.
>
> You simply do not run into it because on your machine
>
> is_llc_occupancy_enabled() is false...
>
> Thanks,
>
> tglx
>
> 8<--------------------
>
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 88dcf8479013..99442370de40 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -525,10 +525,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> */
> if (static_branch_unlikely(&rdt_mon_enable_key))
> rmdir_mondata_subdir_allrdtgrp(r, d->id);
> - kfree(d->ctrl_val);
> - kfree(d->rmid_busy_llc);
> - kfree(d->mbm_total);
> - kfree(d->mbm_local);
> list_del(&d->list);
> if (is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> @@ -545,6 +541,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> cancel_delayed_work(&d->cqm_limbo);
> }
>
> + kfree(d->ctrl_val);
> + kfree(d->rmid_busy_llc);
> + kfree(d->mbm_total);
> + kfree(d->mbm_local);
> kfree(d);
> return;
> }

Thanks, Thomas.  I'll build some test kernels and have your patch tested
out.

Thanks,

Joe

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built Artful and mainline test kernels with the patch from tglx. The test kernels can be downloaded from:

Artful: http://kernel.ubuntu.com/~jsalisbury/lp1733662/artful
mainline: http://kernel.ubuntu.com/~jsalisbury/lp1733662/mainline

Can you test these kernels out and see if they resolve the bug?

Revision history for this message
Rod Smith (rodsmith) wrote :

That seems to have fixed it! I've run the test script six or seven times on both kernels, with nary a hiccup (aside from the "error -19" messages with the 4.13 kernel). Below is the reported kernel information from both your builds, just to be sure I booted the correct kernels.

$ uname -a
Linux oil-boldore 4.13.0-25-generic #29~lp1733662PatchFromUpstream SMP Wed Jan 17 20:13:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ uname -a
Linux oil-boldore 4.15.0-041500rc8-generic #201801172011 SMP Wed Jan 17 20:13:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Yu, Fenghua wrote:
>>> From: Thomas Gleixner [mailto:<email address hidden>]
>> Is this a Haswell specific issue?
>>
>> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
>> for ((;;)) do
>>     for ((i=1;i<88;i++)) do
>>         echo 0 >/sys/devices/system/cpu/cpu$i/online
>>     done
>>     echo "online cpus:"
>>     grep processor /proc/cpuinfo | wc
>>     for ((i=1;i<88;i++)) do
>>         echo 1 >/sys/devices/system/cpu/cpu$i/online
>>     done
>>     echo "online cpus:"
>>     grep processor /proc/cpuinfo | wc
>> done
>>
>> I'm looking for a Haswell machine to reproduce the issue.
> Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.
>
> You simply do not run into it because on your machine
>
> is_llc_occupancy_enabled() is false...
>
> Thanks,
>
> tglx
>
> 8<--------------------
>
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 88dcf8479013..99442370de40 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -525,10 +525,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> */
> if (static_branch_unlikely(&rdt_mon_enable_key))
> rmdir_mondata_subdir_allrdtgrp(r, d->id);
> - kfree(d->ctrl_val);
> - kfree(d->rmid_busy_llc);
> - kfree(d->mbm_total);
> - kfree(d->mbm_local);
> list_del(&d->list);
> if (is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> @@ -545,6 +541,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> cancel_delayed_work(&d->cqm_limbo);
> }
>
> + kfree(d->ctrl_val);
> + kfree(d->rmid_busy_llc);
> + kfree(d->mbm_total);
> + kfree(d->mbm_local);
> kfree(d);
> return;
> }

Hi Thomas,

Testing of your patch shows that your patch resolves the bug.  Thanks
for the assistance!  Is this something you could submit to mainline?

Thanks,

Joe
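
For readers following along, here is a minimal user-space sketch of the ordering problem the patch above addresses. It is not the kernel code. It assumes, as the diff suggests, that a teardown step guarded by is_llc_occupancy_enabled() still reads per-domain buffers after the original kfree() calls. The names struct domain, busy_map, release_buffers(), limbo_cleanup() and remove_domain() are hypothetical, and a cleared pointer stands in for freed memory so the stale access can be detected rather than actually performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-domain state; not the kernel's struct. */
struct domain {
    unsigned long *busy_map;   /* plays the role of d->rmid_busy_llc */
};

/* Free the buffer and clear the pointer, so a later read can be
 * detected instead of touching freed memory. */
static void release_buffers(struct domain *d)
{
    free(d->busy_map);
    d->busy_map = NULL;
}

/* Models a teardown step that still reads the buffer. Returns false
 * when the buffer is already gone, which in the kernel would be a
 * use after free. */
static bool limbo_cleanup(const struct domain *d)
{
    assert(d != NULL);
    if (d->busy_map == NULL)
        return false;
    (void)d->busy_map[0];      /* stands in for scanning the bitmap */
    return true;
}

/* Tear down one domain. free_first selects the pre-patch ordering;
 * occupancy_enabled models is_llc_occupancy_enabled(). Returns true
 * if no stale access was detected (false also on allocation failure). */
bool remove_domain(bool free_first, bool occupancy_enabled)
{
    struct domain *d = malloc(sizeof(*d));
    bool ok = true;

    if (d == NULL)
        return false;
    d->busy_map = calloc(4, sizeof(*d->busy_map));
    if (d->busy_map == NULL) {
        free(d);
        return false;
    }

    if (free_first)            /* pre-patch: buffers freed up front */
        release_buffers(d);
    if (occupancy_enabled)     /* step that still needs the buffers */
        ok = limbo_cleanup(d);
    if (!free_first)           /* patched: buffers freed at the end */
        release_buffers(d);

    free(d);
    return ok;
}
```

In this model remove_domain(true, true) reports the stale access, while remove_domain(false, true) does not. remove_domain(true, false) also passes, which mirrors the point in the quoted reply that a machine where is_llc_occupancy_enabled() is false never runs into the problem.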

tglx (tglx) wrote :

On Wed, 17 Jan 2018, Joseph Salisbury wrote:
> On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
>
> Testing of your patch shows that your patch resolves the bug.  Thanks
> for the assistance!  Is this something you could submit to mainline?

Already there :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d47924417319e3b6a728c0b690f183e75bc2a702

Tagged for stable.

Thanks,

 tglx

Joseph Salisbury (jsalisbury) wrote :

On 01/17/2018 05:55 PM, Thomas Gleixner wrote:
> On Wed, 17 Jan 2018, Joseph Salisbury wrote:
>> On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
>>
>> Testing of your patch shows that your patch resolves the bug.  Thanks
>> for the assistance!  Is this something you could submit to mainline?
> Already there :)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d47924417319e3b6a728c0b690f183e75bc2a702
>
> Tagged for stable.
>
> Thanks,
>
> tglx

Thanks so much!

no longer affects: linux-hwe (Ubuntu)
no longer affects: linux-hwe (Ubuntu Artful)
no longer affects: linux-hwe (Ubuntu Bionic)
Joseph Salisbury (jsalisbury) wrote :

I built one last Artful test kernel with the patch tglx submitted to mainline. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test this kernel and confirm it resolves the bug?

Rod Smith (rodsmith) wrote :

I ran it half a dozen times with your latest kernel and it seemed fine, aside from the usual "error -19" messages. To be sure it's the right one, here's the kernel version information:

ubuntu@oil-boldore:~$ uname -a
Linux oil-boldore 4.13.0-25-generic #29~lp1733662PatchInMainline SMP Thu Jan 18 15:58:13 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Joseph Salisbury (jsalisbury) wrote :
description: updated
Seth Forshee (sforshee)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Per Allansson (per-allansson) wrote :

I have similar issues on 16.04.4 with the latest HWE kernel, and when double-checking against the source code I can see that this fix is now AWOL from:

linux-image-4.13.0-36-generic 4.13.0-36.40~16.04.1

Changed in linux (Ubuntu Artful):
status: Fix Committed → In Progress
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
Rod Smith (rodsmith) wrote :

I've tested kernel 4.13.0-38-generic #43-Ubuntu from artful-proposed and the problem does not occur with that kernel.

tags: added: verification-done-artful
removed: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Jeff Lane  (bladernr)
tags: removed: hwcert-server