Kernel 3.13 fail to boot with LSI SAS1068E (Dell SAS 6/iR)

Bug #1276705 reported by PierreF
44
This bug affects 5 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
High
Unassigned
Trusty
Fix Released
High
Unassigned

Bug Description

We have recently upgraded an Dell R300 server to Trusty (was running fine in precise), and after upgrade it fail to boot.

It is an issue with the SAS controller during the initilisation. It fail to detect the disk, we have the following error in console log:

[ 36.539955] scsi4: error handler thread failed to spawn, error = -12
[ 36.552694] mptsas: ioc0: WARNING - Unable to register controller with SCSI subsystem

After this error, initramfs drop to a shell complaining that rootfs is not found. No disk is seen at all (cat /proc/partition only show sr0 - cdrom drive).

We have this issue with two different server (both R300, both Dell SAS 6/iR controller and same hardware).

We don't have this issue with another Dell server (R310, Dell PERC H200).

We also tester with old kernel (generic, 3.2.0-58.88), it is working.

Those server need a greater rootdelay (probably #579572), so we have rootdelay=45. If we remove rootdelay=45, then disk are correctly recognized ! (but few second too late, initramfs dropped to a shell. Pressing control-D resume normal boot)

So the issue is that with the (mandatory) rootdelay greater that 30 (default value I think), the disk are not detected due to the error shown above. This is a regression since those server worked in precise (and work with precise old kernel).

System information

* Dell R300 with Dell SAS 6/iR controller
* Ubuntu Trusty Tahr (14.04)
* Running arch: x86_64
* Kernel version: 3.13.0-7-generic (dpkg version : 3.13.0-7.25)
* Kernel command line: BOOT_IMAGE=/vmlinuz-3.13.0-7-generic root=UUID=174e14b5-46fc-479b-9f94-05cb33c75ac9 ro rootdelay=45 console=tty0 console=ttyS1,57600 quiet
* uname -a: Linux frtls-perf01 3.13.0-7-generic #25-Ubuntu SMP Tue Feb 4 10:19:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Attached files:

* console output when error occure.
* dmesg when system boot (no rootdelay, need to press control-d during initramfs boot)
* lspci -vnn

Tell me if you need more informations.

Revision history for this message
PierreF (pierre-fersing) wrote :
Revision history for this message
PierreF (pierre-fersing) wrote :
Revision history for this message
PierreF (pierre-fersing) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: trusty
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: performing-bisect
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

v3.6 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-quantal/
v3.7 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.7-raring/
v3.8 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-raring
v3.9 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-saucy
v3.10 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.10-saucy
v3.11-final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-saucy/
v3.12 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-trusty/

You don't have to test every kernel, just up until the kernel that first has this bug.

One thing to note, some of this kernels require both the linux-image and linux-image-extra .deb packages. You will know if the linux-image-extra .deb package exists at the download link.

Thanks in advance!

Revision history for this message
PierreF (pierre-fersing) wrote :

I booted kernel with following common option : "ro console=tty0 console=ttyS1,57600".

When booted with rootdelay, it's "rootdelay=45".

The result are the following:

* 3.13.0-7-generic, rootdelay => error
* 3.13.0-7-generic, no rootdelay => Ok
* 3.6, rootdelay => Ok
* 3.12, rootdelay => Ok, tested twice.
* 3.13.0-6-generic, rootdelay => error
* 3.13.0-6-generic, no rootdelay => Error... then on next try Ok. The error is due to some race condition ?
* Tested once more time 3.12 with rootdelay => Ok.
* 3.13.0-7-generic, no rootdelay => Ok

So the issue is between 3.12 and 3.13.

Also on 3.13 with same condition (console=tty0 console=ttyS1,57600), sometime we got the error, sometime we didn't get the error. We always got the error with rootdelay set.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for all the testing. So it sounds like this bug was introduced in the 3.13 kernel. In order to bisect, we need to narrow down further and identify which v3.13 release candidate introduced the regression.

However, it might be good to first test the latest mainline kernel to see if this has already been fixed upstream. The latest mainline kernel is v3.14-rc1, and can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-rc1-trusty/

Revision history for this message
PierreF (pierre-fersing) wrote :

It still occure with 3.14.0-031400rc1-generic.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

v3.13-rc3: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-rc3-trusty/

If v3.13-rc3 does not exhibit the bug then test v3.13-rc6:
v3.13-rc6: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-rc6-trusty/

If v3.13-rc3 does exhibit the bug then test v3.13-rc2:
v3.13-rc2: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-rc2-trusty

Thanks in advance!

Revision history for this message
PierreF (pierre-fersing) wrote :

None of them worked. All had the same issue.

Tested:

* v3.13-rc3
* v3.13-rc2
* v3.13-rc1 (http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-rc1-trusty/)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you confirm the bug does not exists in the 3.12 final kernel:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-saucy/

If that is the case, we can bisect between v3.12 final and v3.13-rc1.

Revision history for this message
PierreF (pierre-fersing) wrote :

Yes, I confirm that 3.12-saucy works.

Revision history for this message
Faidon Liambotis (paravoid) wrote :
Download full text (4.3 KiB)

We're also experiencing the same issue when booting the trusty installer on a Dell PowerEdge R610 with a Dell SAS 6/iR controller (LSI SAS1068E), resulting in the inability to install/run this system. This is an extremely popular controller, prevalent in a signifcant portion of Dell systems (or systems in general) out there, so this is a pretty serious bug for Trusty.

The kernel actually oopses, here's the relevant lines from dmesg:
[ 80.845434] Fusion MPT SAS Host driver 3.04.20
[ 80.845640] mptbase: ioc0: Initiating bringup
[ 80.987894] ipmi_si ipmi_si.0: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x2)
[ 80.987908] ipmi_si ipmi_si.0: IPMI kcs interface initialized
[ 81.550894] ioc0: LSISAS1068E B3: Capabilities={Initiator}
[ 111.853426] scsi0: error handler thread failed to spawn, error = -12
[ 111.853431] mptsas: ioc0: WARNING - Unable to register controller with SCSI subsystem
[ 111.853441] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 111.861275] IP: [<ffffffff81710382>] mutex_lock+0x12/0x2f
[ 111.866676] PGD 421531067 PUD 4207f2067 PMD 0
[ 111.871137] Oops: 0002 [#1] SMP
[ 111.874375] Modules linked in: mptsas(F+) mptscsih(F) mptbase(F) scsi_transport_sas(F) ipmi_)
[ 111.899462] CPU: 0 PID: 9847 Comm: systemd-udevd Tainted: GF 3.13.0-8-generic #27u
[ 111.908489] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 3.0.0 01/31/2011
[ 111.915955] task: ffff880420734800 ti: ffff880422356000 task.ti: ffff880422356000
[ 111.923496] RIP: 0010:[<ffffffff81710382>] [<ffffffff81710382>] mutex_lock+0x12/0x2f
[ 111.931319] RSP: 0018:ffff880422357b10 EFLAGS: 00010246
[ 111.936619] RAX: 0000000000000000 RBX: 0000000000000060 RCX: 0000000000000006
[ 111.943737] RDX: 0000000000000007 RSI: 00000000f0aef0ac RDI: 0000000000000060
[ 111.950856] RBP: ffff880422357b18 R08: 0000000000000086 R09: 000000000000037f
[ 111.957974] R10: 0000000000000000 R11: ffff8804223578ae R12: 0000000000000060
[ 111.965091] R13: ffff880423e9f000 R14: ffff880423e9f098 R15: ffffffffa01063a0
[ 111.972210] FS: 00007fd5221af7c0(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
[ 111.980283] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 111.986014] CR2: 0000000000000060 CR3: 000000042144f000 CR4: 00000000000007f0
[ 111.993132] Stack:
[ 111.995135] 0000000000000000 ffff880422357b40 ffffffff814c842d ffff880423da6000
[ 112.002549] 0000000000000000 ffff880423e9f000 ffff880422357b70 ffffffffa00ee2a1
[ 112.009966] ffff880423da6000 00000000ffffffff ffff880423e9f000 ffff880423e9f098
[ 112.017384] Call Trace:
[ 112.019830] [<ffffffff814c842d>] scsi_remove_host+0x1d/0x120
[ 112.025565] [<ffffffffa00ee2a1>] mptscsih_remove+0x31/0xc0 [mptscsih]
[ 112.032082] [<ffffffffa0100259>] mptsas_probe+0x419/0x5a0 [mptsas]
[ 112.038337] [<ffffffff8139c775>] local_pci_probe+0x45/0xa0
[ 112.043897] [<ffffffff8139da15>] ? pci_match_device+0xc5/0xd0
[ 112.049718] [<ffffffff8139db39>] pci_device_probe+0xd9/0x130
[ 112.055454] [<ffffffff814839b5>] driver_probe_device+0x125/0x3b0
[ 112.061534] [<ffffffff81483d13>] __driver_attach+0x93/0xa0
[ 112.067094] [<ffffffff81483c80>] ? __device_att...

Read more...

tags: added: kernel-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between v3.12 final and v3.13-rc1. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:
5cbb3d216e2041700231bcfc383ee5f8b7fc8b74

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1276705

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
PierreF (pierre-fersing) wrote :

Tested this kernel. It is NOT working, it has the issue.

Extract of console log:

[...]
Linux version 3.12.0-031200-generic (jsalisbury@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201402101715 SMP Thu Feb 13 14:58:01 UTC 2014
[...]
[ 42.455969] scsi4: error handler thread failed to spawn, error = -12
[ 42.541170] mptsas: ioc0: WARNING - Unable to register controller with SCSI subsystem
[ 42.630361] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[...]

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing and the update. I'm actually going to restart the bisect to just go through the ~drivers/scsi subsystem. That should speed up the process considerably.

I'll build the next test kernel and post it shortly.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
53151bbb83f11b358ac94eddd81347c581dc51ea

This kernel is from a new bisect, which will only bisect through the scsi directory.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1276705

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
PierreF (pierre-fersing) wrote :

This one is good, it is working:

[...]
Linux version 3.12.0-031200rc5-generic (jsalisbury@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201402131150 SMP Thu Feb 13 16:54:49 UTC 2014
[...]

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
323f6226a816f0b01514d25fba5529e0e68636c3

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1276705

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
PierreF (pierre-fersing) wrote :

This kernel version is also good:

Linux version 3.12.0-031200rc5-generic (jsalisbury@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201402131403 SMP Thu Feb 13 19:04:57 UTC 2014

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
2f466d33f5f60542d3d82c0477de5863b22c94b9

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1276705

Can you test that kernel and report back if it has the bug or not. I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
PierreF (pierre-fersing) wrote :

I can not test this kernel, it was only build for i386. The server is installed with amd64 :(

Because of timezone difference we can only test one kernel per day, to speed up the bisect, I've done one by myself, the result is the following:

$ git bisect log
# bad: [6ce4eac1f600b34f2f7f58f9cd8f0503d79e42ae] Linux 3.13-rc1
# good: [5e01dc7b26d9f24f39abace5da98ccbd6a5ceb52] Linux 3.12
git bisect start 'v3.13-rc1' 'v3.12' '--' 'drivers/scsi'
# good: [53151bbb83f11b358ac94eddd81347c581dc51ea] [SCSI] lpfc 8.3.43: Fixed not processing task management IOCB response status
git bisect good 53151bbb83f11b358ac94eddd81347c581dc51ea
# good: [323f6226a816f0b01514d25fba5529e0e68636c3] Merge tag 'fcoe-3.13' into for-linus
git bisect good 323f6226a816f0b01514d25fba5529e0e68636c3

[Above this point, I didn't build kernel. It was the result from your kernel. Bellow the result are from kernel compiled by myself]

# bad: [2f466d33f5f60542d3d82c0477de5863b22c94b9] Merge tag 'pci-v3.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad 2f466d33f5f60542d3d82c0477de5863b22c94b9
# bad: [0910c0bdf7c291a41bc21e40a97389c9d4c1960d] Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block
git bisect bad 0910c0bdf7c291a41bc21e40a97389c9d4c1960d
# good: [0324e74534241f3f00910ec04ef67de1fe1542f4] Merge tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
git bisect good 0324e74534241f3f00910ec04ef67de1fe1542f4
# good: [e37459b8e2c7db6735e39e019e448b76e5e77647] Merge branch 'blk-mq/core' into for-3.13/core
git bisect good e37459b8e2c7db6735e39e019e448b76e5e77647
# bad: [8ceafbfa91ffbdbb2afaea5c24ccb519ffb8b587] Merge branch 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm
git bisect bad 8ceafbfa91ffbdbb2afaea5c24ccb519ffb8b587
# good: [7d35496dd98229cdf923238367fd3b3833fbde52] ARM: 7796/1: scsi: Use dma_max_pfn(dev) helper for bounce_limit calculations
git bisect good 7d35496dd98229cdf923238367fd3b3833fbde52
# first bad commit: [8ceafbfa91ffbdbb2afaea5c24ccb519ffb8b587] Merge branch 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm

From my bisect, the commit which introduced the error is 8ceafbfa91ffbdbb2afaea5c24ccb519ffb8b587.

For information, to build the kernel I did the following:

git remote add ubuntu-trusty git://kernel.ubuntu.com/ubuntu/ubuntu-trusty.git

git checkout ubuntu-trusty/master -- debian
git checkout ubuntu-trusty/master -- debian.master
fakeroot debian/rules clean defaultconfigs
fakeroot debian/rules binary-generic skipmodule=true

Build area was cleaned after each build.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hmm, commit 8ceafbfa91ffbdbb2afaea5c24ccb519ffb8b587 is a merge commit, so the bisect should have went deeper. Did the bisect specifically tell you that commit 8ceafbfa91ffbdbb2afaea5c24ccb519ffb8b587 was the first bad commit?

Revision history for this message
Faidon Liambotis (paravoid) wrote :

Well, Pierre gave the full output, so the answer to your question is yes (that's really easy to verify).

The merge commit doesn't look like having any relevant code changes, so this is a dead end.

I actually think that limiting to drivers/scsi was a bad idea from the start. The v3.12..v3.13-rc1 drivers/scsi diff wasn't that large and it was clear that most of the changes were not relevant to this breakage. I'd personally redo the bisect. I'll try to find some time to do this, but Pierre, if you're already prepared for this, by all means ;)

Revision history for this message
PierreF (pierre-fersing) wrote :

Ok, I've restarted a bisect without limitation on driver/scsi (git bisect start v3.13-rc1 v3.12).

Git tell me it's 13 steps, will took some time, but during middle of next week we should have the bad commit.

Revision history for this message
PierreF (pierre-fersing) wrote :

Bisect finished. The first bad commit is 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21. It seem more likely as this commit concerne kthread (and the first error is "scsi4: error handler thread failed to spawn, error = -12").

I also attach my bisect log if needed.

Revision history for this message
Faidon Liambotis (paravoid) wrote :

Pretty sure that's it. -12 is -ENOMEM, and there are two sites in that commit that return -ENOMEM, right when we have an error message from the failure to spawn the SCSI error handler thread.

Note that the oops/backtrace is a red herring. There is a secondary, unrelated bug in the mptsas code that is triggered by this untested codepath: when scsi_host_alloc fails to allocate (and hence eventually set ioc->sh), "Unable to register controller with SCSI subsystem" is printed (which we see), then it jumps to out_mptsas_probe where mptscsih_remove() is called; however, mptscsih_remove() tries to scsi_remove_host(host), but host = ioc->sh & ioc->sh == NULL, as it was the reason we ended up here. The solution for this -again, unrelated- bug would be to have a different label for early failures that won't call mptscsih_remove(). I'll prepare a patch for this and submit it to linux-scsi shortly.

The real issue of this bug is why the kthread spawning fails; I haven't figured that out yet.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21 reverted.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1276705

Note there is a linux-image and linux-image-extra package, both need to be installed.

Can you test that kernel and report back if it has the bug or not.

Revision history for this message
PierreF (pierre-fersing) wrote :

Yes, this version is working:

Linux version 3.13.0-12-generic (root@gomeisa) (gcc version 4.8.2 (Ubuntu 4.8.2-15ubuntu3) ) #32 SMP Mon Feb 24 18:50:37 UTC 2014 (Ubuntu 3.13.0-12.32-generic 3.13.4)

Revision history for this message
PierreF (pierre-fersing) wrote :

Any update ?
If i can help for something tell me, but I don't know kernel and can't do debuging of it by myself.

I've tried to identify which ENOMEM cause the issue by added the printk (one before the first ENOMEM, one before the second ENOMEM, one after both ENOMEM)... but with just this change bug no longer occure !

I've already suspected that this bug is due to some race-condition because it seems to occure nearly everytime with rootdelay + serial console, and seems to sometime success when using neither rootdelay nor serial console.

I've attached the diff of printk I've added.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hmm, if you think there is a race condition, then there should be more testing done with the kernel posted in comment #28. If we can be certain that reverting commit 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21 fixes this bug, then I can send a message upstream and to the bug author to get their feeback.

tags: added: patch
Revision history for this message
PierreF (pierre-fersing) wrote :

By more testing, you just mean reboot several time on this kernel to check that the isssue do not appear sometime ?

During my bisect, I always booted 3 times on good kernel to make sure it was not by "luck" that the kernel worked. I also booted three time the kernel from comment #28.

To double check, I just booted 5 times with this kernel and all 5 times worked.

Revision history for this message
PierreF (pierre-fersing) wrote :

If it help, I've done another change (against git hash 786235ee):

diff --git a/kernel/kthread.c b/kernel/kthread.c
index b5ae3ee..25a4780 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -298,7 +298,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                 * that thread.
                 */
                if (xchg(&create->done, NULL))
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(-42);
                /*
                 * kthreadd (or new kernel thread) will call complete()
                 * shortly.

So, depending on error (-12 / -ENOMEM or -42) we could know which return triggered the bug.

Result is:

[ 37.607981] scsi4: error handler thread failed to spawn, error = -42

To make sure the race condition do not affect which error is returned, I've booted 5 times that kernel. Each time I get error = -42.

Revision history for this message
PierreF (pierre-fersing) wrote :

With few hopes, I've tried the latest kernel from:

* trusty: linux 3.13.0-16.36 (linux-image-3.13.0-16-generic)
* trusty-proposed (downloaded from launchpad directly) : linux 3.13.0-17.37

Both still have the bug.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the detailed feedback, Pierre. I'll ask for some feedback from upstream about reverting commit: 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21

tags: added: kernel-bug-exists-upstream
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Link to message sent upstream:
https://lkml.org/lkml/2014/3/14/475

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Just for completeness, can you test the latest mainline kernel, which can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-rc6-trusty/

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

That return statement is called only when wait_for_completion_killable()
returned an error. That is, the caller received SIGKILL while waiting for
kthreadd to create a kernel thread.

That matches your bisection result because commit 786235ee changed to return to
the caller when the caller received SIGKILL in order to allow the OOM killer to
kill the process waiting for kthreadd to create a kernel thread.
The changelog which I expected for that commit is shown below.

----------
[PATCH] kthread: Make kthread_create() killable.

Any user process callers of wait_for_completion() except global init process
might be chosen by the OOM killer while waiting for completion() call by some
other process which does memory allocation.

When such users are chosen by the OOM killer when they are waiting for
completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed
due to memory starvation because the OOM killer cannot kill such users.

kthread_create() is one of such users and this patch fixes the problem for
kthreadd by making kthread_create() killable.

Signed-off-by: Tetsuo Handa <email address hidden>
Cc: Oleg Nesterov <email address hidden>
Acked-by: David Rientjes <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
----------

I think there are two problems listed below.

  (a) Somebody is sending SIGKILL to the caller of kthread_create().

        Somebody is "systemd" waited for timeout?
        The caller is "PID: 9847 Comm: systemd-udevd" ?

  (b) Error handling of the caller of kthread_create() is wrong.

        mptsas_probe() calls mptsas_remove() when
        scsi_host_alloc() returned NULL due to receiving SIGKILL.

        But mptsas_remove() assumes that "ioc->sh = sh;" was already called
        with sh != NULL which means scsi_host_alloc() did not return NULL.

        scsi_host_alloc() can return NULL when kzalloc() returned NULL.
        In other words, the caller of scsi_host_alloc() must be prepared for
        scsi_host_alloc() returning NULL even if the caller did not receive
        SIGKILL while waiting for kthreadd to create a kernel thread.

Therefore, I don't think reverting commit 786235ee is appropriate because
the problem will again happen when kzalloc() in scsi_host_alloc() fails.

Revision history for this message
Faidon Liambotis (paravoid) wrote :

Tetsuo, see my comments above for the diagnosis of two different bugs. I've already identified (b) as well and it's completely orthogonal to the bug in question. No idea about (a), though, your comment is interesting.

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

OK. I read this thread.

I'm sure that somebody is sending SIGKILL to the systemd-udevd process
who is doing finit_module() system call, after waiting for 30 seconds.
However, since the probe function takes more than 30 seconds, the probe
function already received SIGKILL by the moment scsi_host_alloc() calls
kthread_run().

Therefore, reverting this commit will allow systemd-udevd process to
ignore SIGKILL until leaving finit_module() system call, making
kthread_run() to start successfully.

I think you can find the process who is sending SIGKILL after waiting
for 30 seconds, by inserting WARN_ON().

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

Would you try this patch?

Revision history for this message
PierreF (pierre-fersing) wrote :

I've tested the following:

* v3.14-rc6-trusty from comment #38 : still fail with same error.
* Kernel 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21 with patch check-sigkill : still got the fail to spawn thread. I will attach full output from serial console.
* Kernel 786235eeba0e1e85e5cbbb9f97d1087ad03dfa21 with patch kthread-defer-leaving.patch : also fail, but this time their is no error about failure to spawn thread. Only systemd-udevd blocked for more that 120 seconds. Also ouput of serial console attached.

Note: on second console output, command result from ps is not complet, serial console seem to discard output when we generate it too fast.

Revision history for this message
PierreF (pierre-fersing) wrote :
Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

Thank you. I missed that we are not allowed to call wait_for_completion() again
if wait_for_completion_timeout() succeeded, for do_wait_for_common() does
x->done-- which cancels x->done++ done by complete(). I must update this patch.

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

I changed this patch to call wait_for_completion() again only if
wait_for_completion_timeout() returned 0, for
wait_for_completion_timeout() will return non-0 if completed.

Revision history for this message
PierreF (pierre-fersing) wrote :

Yes, it is working!

With this new patch applied (on 786235ee), server boot without any issue.

I've attached the console ouput (which show no error).

As for other test, I've booter 5 times on this kernel to be sure it was not by luck that it work.

Thanks for this fix.

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

Great!

I updated this patch to be more OOM killer friendly.
I will propose this patch for 3.14-final.

Revision history for this message
PierreF (pierre-fersing) wrote :

I've tested the final patch againt both 786235ee and tag v3.14-rc6 (fa389e22). It still works.

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :
Download full text (9.0 KiB)

I reproduced a similar result using test patch shown below.

---------- test patch start ----------
diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index 5653e50..eaaa5e2 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -1412,6 +1412,9 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
   return 0;
  }

+ printk(KERN_WARNING "Sleep injection start\n");
+ ssleep(40);
+ printk(KERN_WARNING "Sleep injection end\n");
  sh = scsi_host_alloc(&mptspi_driver_template, sizeof(MPT_SCSI_HOST));

  if (!sh) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 52f881d..52ec166 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1051,6 +1051,7 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
  int override_rlimit;
  int ret = 0, result;

+ WARN_ON(sig == SIGKILL);
  assert_spin_locked(&t->sighand->siglock);

  result = TRACE_SIGNAL_IGNORED;
---------- test patch end ----------

---------- dmesg start ----------
[ 2.665872] Fusion MPT base driver 3.04.20
[ 2.666876] Copyright (c) 1999-2008 LSI Corporation
[ 2.676334] Fusion MPT SPI Host driver 3.04.20
[ 2.679817] mptbase: ioc0: Initiating bringup
[ 2.692059] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
[ 2.693733] e1000: Copyright (c) 1999-2006 Intel Corporation.
[ 2.761570] ioc0: LSI53C1030 B0: Capabilities={Initiator}
[ 2.919766] Sleep injection start
[ 3.062094] e1000 0000:02:00.0 eth0: (PCI:66MHz:32-bit) 00:0c:29:d7:20:d6
[ 3.064980] e1000 0000:02:00.0 eth0: Intel(R) PRO/1000 Network Connection
[ 35.716117] ------------[ cut here ]------------
[ 35.718364] WARNING: CPU: 2 PID: 174 at kernel/signal.c:1054 __send_signal+0x476/0x4b0()
[ 35.722092] Modules linked in: e1000 mptspi(+) mptscsih mptbase floppy
[ 35.725666] CPU: 2 PID: 174 Comm: systemd-udevd Not tainted 3.14.0-rc5+ #267
[ 35.729478] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/20/2012
[ 35.734379] 0000000000000009 ffff8800377c5d50 ffffffff816f9669 0000000000000000
[ 35.738226] ffff8800377c5d88 ffffffff8106970d ffff88007a470000 0000000000000009
[ 35.742039] ffff8800377c5ec0 0000000000000001 0000000000000003 ffff8800377c5d98
[ 35.745852] Call Trace:
[ 35.747083] [<ffffffff816f9669>] dump_stack+0x45/0x56
[ 35.749482] [<ffffffff8106970d>] warn_slowpath_common+0x7d/0xa0
[ 35.752228] [<ffffffff810697ea>] warn_slowpath_null+0x1a/0x20
[ 35.754904] [<ffffffff8107ac66>] __send_signal+0x476/0x4b0
[ 35.757467] [<ffffffff8107acde>] send_signal+0x3e/0x80
[ 35.759889] [<ffffffff8107b753>] do_send_sig_info+0x43/0x80
[ 35.762490] [<ffffffff8107bc16>] group_send_sig_info+0x46/0x50
[ 35.764040] [<ffffffff8107bd24>] kill_pid_info+0x34/0x50
[ 35.765244] [<ffffffff8107bdee>] SYSC_kill+0x8e/0x1a0
[ 35.766360] [<ffffffff810a18db>] ? account_user_time+0x8b/0xa0
[ 35.767654] [<ffffffff810a1ef4>] ? vtime_account_user+0x54/0x60
[ 35.768946] [<ffffffff81022635>] ? syscall_trace_enter+0x145/0x250
[ 35.770298] [<ffffffff8107d9be>] SyS_kill+0xe/0x10
[ 35.771373] [<ffffffff8170aabf>] tr...

Read more...

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

Pierre, would you give me a hand? I proposed the final patch but
I'm unable to prove that SIGKILL sent by systemd-udevd's 30 seconds
timeout is the trigger of this problem, for I don't have a real
machine which takes very long time upon initialization.

According to https://lkml.org/lkml/2014/3/18/396 , I assume that the
probe did not hang but Oleg assumes that the probe did hang and the
SIGKILL wakes up from the hung.

To test whether the probe did actually hang, I think we can ignore
the SIGKILL sent by systemd-udevd process, by applying the patch
shown below on a kernel without the final patch and booting with
rootdelay=45 . If the machine boots fine, we can assert that the
SIGKILL is not needed for waking up from the hung.

---------- patch start ----------
diff --git a/kernel/signal.c b/kernel/signal.c
index 52f881d..49359cb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1051,6 +1051,7 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
  int override_rlimit;
  int ret = 0, result;

+ WARN_ON(sig == SIGKILL);
  assert_spin_locked(&t->sighand->siglock);

  result = TRACE_SIGNAL_IGNORED;
@@ -2915,6 +2916,10 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
  info.si_code = SI_USER;
  info.si_pid = task_tgid_vnr(current);
  info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+ if (sig == SIGKILL && !strcmp(current->comm, "systemd-udevd")) {
+ printk(KERN_WARNING "Ignored SIGKILL by systemd-udevd\n");
+ return -ESRCH;
+ }

  return kill_something_info(sig, &info, pid);
 }
---------- patch end ----------

Revision history for this message
PierreF (pierre-fersing) wrote :

Applied patch on tag v3.14-rc6 (fa389e2), run kernel 4 four times, all worked.

We seen on output (full output attached):

[ 5.537193] mousedev: PS/2 mouse device common for all mice
[ [ 9.776032] floppy0: no floppy controllers found
[ 36.823538] Ignored SIGKILL by systemd-udevd
[ 38.356082] scsi4 : ioc0: LSISAS1068E B3, FwRev=00192f00h, Ports=1, MaxQ=266, IRQ=16
[ 38.408276] mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 9, phy 0, sas_addr 0x5000cca00f2e18fd
[...]

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

PierreF wrote:
> Applied patch on tag v3.14-rc6 (fa389e2), run kernel 4 four times, all
> worked.

Thank you!

Now we proved that systemd-udevd's 30 seconds timeout is the trigger of
this problem. It would be best if we can fix systemd side.

Joseph, is there any possibility that systemd-udevd's timeout is
extended to e.g. 120 seconds (or rootdelay= + alpha seconds if
explicitly given)?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Tetsuo,

Thanks so much for taking the time to work on this bug. I sent an email to the systemd mailing list asking for some feedback.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel that has a debugging patch from upstream[0].

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1276705

Can you test that kernel and report back ?

[0] https://lkml.org/lkml/2014/3/19/494

Revision history for this message
Kamilion (kamilion) wrote :

Just a note; I've got a lot of experience with the 1068E controller. Make sure you're running a recent firmware.

I have 1.33.00, "Phase 21", flashed on most of my 1068Es of various vendor brands.

http://lime-technology.com/forum/index.php?topic=12767.0

Versions at Phase 20 (1.32.00 and below) had significant issues while negotiating link speeds for me, I could either get everything running fine at 1.5/3Gbit and all the drives worked, or I would get 6Gbit and all the older drives would be missing.

I believe the latest dell-branded firmware is:
http://www.dell.com/support/home/us/en/04/Drivers/DriversDetails?driverId=8R2N4&fileId=3098294144&osCode=LNUX&productCode=poweredge-r300&languageCode=EN&categoryId=SF

Unfortunately, it seems to come only as an .exe, but I didn't try to download it.

I've got two 1068Es paired up with HP SAS Expanders:
http://www.amazon.com/HP-468406-B21-Sas-Expander-Card/dp/B0025ZQ16K
At one point I tried to connect I think about 60 drives (If I recall correctly, I got up to /dev/sdbd)

Getting that working properly was another firmware hassle story involving purchasing a used HP server just to flash it; but I won't bore you with it in a LP bug# comment.

But still -- out of date firmware can lead to wacky results.
Good luck convincing Kay to raise the udev timeout from 30sec!

Revision history for this message
PierreF (pierre-fersing) wrote :

I've tested with kernel from comment #56.

The kernel generated too much logs for IPMI serial console (which generated too much garbage), so I switched to a real serial console (and at 115kbauds).

I've attached a archive with 3 runs (the last run it the most interesting I think):

First run with serial console and a bigger kernel log buffer (log_buf_len=8M).
The hope with larger log buffer was to catch full kernel message with dmesg once server is running.
Sadly, after about 30 minutes, server was still printing stacks.
Serial console capture attached under name "01-serial-capture-large-buffer.txt"

Second run, still with serial console but with default kernel log (no log_buf_len)
This time, kernel booted fine (with exception to disk beeing discovered after rootdelay, but a ctrl+d resumed the boot process).
Note: this boot generated WAY less message and booted. The only change is the log_buf_len=8M present or not.
Serial console capture attached under name "02-serial-capture-default-buffer.txt"

Third run, this time with serial console disabled and very large kernel log buffer (log_buf_len=32M).
Probably the most interesting one, dmesg was complete (include very first messages). It is 12 MB large !
Server booted without any issue (disk detected before the end of rootdelay).

Revision history for this message
PierreF (pierre-fersing) wrote :

Joseph,

kernel freeze is planed in 7 days, which will arrive very fast. Do you think we could have a fix committed before this deadline ?

I still didn't tested the firmware upgrade. I didn't tested it to keep a machine which exhibit the bug... upgrading firmware is okay with a local machine, but always trickier with remote server :(

If their is something I can do for this issue, please tell me.

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

(a) Linux kernel guys think that a hardcoded timeout is a systemd bug.

    https://lkml.org/lkml/2014/3/23/42

(b) The systemd guys think that kernel module loading takes more than
    30 seconds is a kernel module's bug. But Linux kernel guys won't
    be able to fix it immediately. Also, solution by updating firmware
    won't be acceptable because there will be users who can't update
    firmware for some reason.

    http://lists.freedesktop.org/archives/systemd-devel/2014-March/018007.html

    The systemd guys suggest that adding OPTIONS+="event_timeout=120"
    to udev rules might help, but I don't think it will help.

    https://lkml.org/lkml/2014/3/22/207

(c) The LVM guys added OPTIONS+="event_timeout=180" before
    commit 786235ee was merged into Linux kernel.

    http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html

(d) I tried to rebuild systemd package with longer timeout but I was
    unable to build it without build failures. Therefore, I opened a
    bug report in order to ask for systemd package with longer
    timeout, but no response so far.

    https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1297248

Well, it is unlikely that this situation is solved within 7 days.
I think that applying the patch in comment #48 as a
"[trusty] UBUNTU: SAUCE:" patch is the safest choice for 14.04 LTS
kernel.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel based on the Trusty master-next branch. The kernel is available at:

http://kernel.ubuntu.com/~jsalisbury/l;1276705

Can you test this kernel and see if it resolves this bug?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry, typo in my link. The kernel is at:
http://kernel.ubuntu.com/~jsalisbury/lp1276705

Revision history for this message
PierreF (pierre-fersing) wrote :

I've tested this new kernel, it boot without issue on the server (as usual, I booted three time the kernel to make it always works well).

Revision history for this message
PierreF (pierre-fersing) wrote :

I see that trusty has now a kernel with the fix included:

$ cat changelog.Debian
linux (3.13.0-21.43) trusty; urgency=low
[...]
  [ Tetsuo Handa ]

  * SAUCE: kthread: Do not leave kthread_create() immediately upon SIGKILL.
[...]

After a apt-get dist-upgrade to this kernel, I've successfully booted the server 5 times. So this kernel fix the issue.

Thanks all for your work.

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Trusty):
status: Confirmed → Fix Released
Revision history for this message
Marco Scholl (traxanos) wrote :

for us the problem is not solved! we can't install ubuntu an hardware with LSI SAS1068E. now we get the next problem see attachment. we have DO AN additional test:

we have install ubuntu 13.10 and upgrade to 14.04 with latest kernel and have set rootdelay=120. we must multiple restart server to boot system.

so its not possible to use the most dell hardware with 14.04.

Revision history for this message
Tetsuo Handa (9-launchpad-i-love-sakura-ne-jp) wrote :

Hello, Marco.

The problem handled by this entry is about mptsas_probe() hitting

  scsi4: error handler thread failed to spawn, error = -12
  mptsas: ioc0: WARNING - Unable to register controller with SCSI subsystem
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000060

due to kthread_create() returning immediately upon SIGKILL.

But the problem you are reporting is about mptsas_remove().
It seems to me that what you are reporting is an irrelevant one.

tags: added: kernel-da-key
removed: kernel-key
tags: removed: performing-bisect
Revision history for this message
Luca Lesinigo (luca404) wrote :

I have reproduced today this same problem with a freshly installed and fully up-to-date Ubuntu Trusty 14.04.3 64bit on a Dell Poweredge R300 with the SAS6i/R controller. The system has 2 SAS disks configured as a RAID1 volume in the SAS 6i/R.

Switching to kernel linux-generic-lts-wily did not solve the problem.

We implemented the rootdelay workaround and that let the system boot without problems. After experimenting with various root delays we settled on 45 seconds, it seems ok after trying various cold and warm reboots. We tried 30 seconds but that was not enough.

Hope this helps...

To post a comment you must log in.