[arm64] compute nodes unstable after upgrading from 4.2 to 4.4 kernel

Bug #1602577 reported by Junien F
This bug affects 3 people
Affects                 Status          Importance   Assigned to
linux (Ubuntu)          Fix Released    High         Joseph Salisbury
linux (Ubuntu Xenial)   Fix Released    High         Joseph Salisbury

Bug Description

Hi,

In order to investigate bug LP#1531768, we upgraded some arm64 compute nodes (swirlices) to a 4.4 kernel. I think it made the VMs work better, but the hosts became extremely unstable.

After some time, getting a shell on them would be impossible. Connecting on the VSP, you'd get a prompt, and once you typed your username and password, you'd see the motd but the shell would never spawn.

Because of these instability issues, all the arm64 compute nodes are now back on 4.2. However, we managed to capture "perf record" data when a host was failing. I'll attach it to the bug. Perhaps it will give you hints as to what we can do to help you troubleshoot this bug further.

Once we have your instructions, we'll happily reboot one (or a few) nodes to 4.4 to continue troubleshooting.

Thanks !
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jul 12 12:54 seq
 crw-rw---- 1 root audio 116, 33 Jul 12 12:54 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.21
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: console=ttyS0,9600n8r ro compat_uts_machine=armv7l
ProcVersionSignature: Ubuntu 4.2.0-41.48~14.04.1-generic 4.2.8-ckt11
RelatedPackageVersions:
 linux-restricted-modules-4.2.0-41-generic N/A
 linux-backports-modules-4.2.0-41-generic N/A
 linux-firmware 1.127.22
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 4.2.0-41-generic aarch64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Revision history for this message
Junien F (axino) wrote :

perf data is at : https://private-fileshare.canonical.com/~axino/swirlix01.perf.data.LP1602577.xz

I'll run apport from this host, note that it's now running 4.2.

Version string of the unstable kernel was :
swirlix01 kernel: [ 0.000000] Linux version 4.4.0-30-generic (buildd@bos01-arm64-006) (gcc version 4.8.4 (Ubuntu/Linaro 4.8.4-2ubuntu1~14.04.3) ) #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 (Ubuntu 4.4.0-30.49~14.04.1-generic 4.4.13)
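
For anyone checking which kernel a node is actually running across several hosts, here is a minimal Python sketch that pulls the release out of a banner like the one above. The helper names and the regular expressions are illustrative assumptions, not part of apport or any Ubuntu tooling.

```python
import re

# Hypothetical helpers for illustration only.
_BANNER_RE = re.compile(r"Linux version (\S+)")
_MAJOR_MINOR_RE = re.compile(r"(\d+)\.(\d+)")


def kernel_release(banner):
    """Return the release (e.g. '4.4.0-30-generic') found in a banner, or None."""
    match = _BANNER_RE.search(banner)
    return match.group(1) if match else None


def major_minor(release):
    """Return (major, minor) as ints from the start of a release string."""
    match = _MAJOR_MINOR_RE.match(release)
    if match is None:
        raise ValueError("unrecognised release: %r" % release)
    return int(match.group(1)), int(match.group(2))


banner = ("swirlix01 kernel: [ 0.000000] Linux version 4.4.0-30-generic "
          "(buildd@bos01-arm64-006)")
release = kernel_release(banner)
print(release, major_minor(release))
```

The same function works on the contents of /proc/version, since that file starts with the same "Linux version" text.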

tags: added: apport-collected trusty uec-images
description: updated
Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : IwConfig.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote : WifiSyslog.txt

apport information

Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
tags: added: kernel-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Do you have a machine that could be used to test some kernels? If so, we can perform a kernel bisect to identify the commit that caused the regression.

Changed in linux (Ubuntu Xenial):
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Revision history for this message
Junien F (axino) wrote :

Yes we can do that. If you have documentation for this process, please let us know.

Thanks

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It could basically just require that you install and test kernels that we build.

cking mentioned that bisecting between 4.2 <-> 4.4 on this same platform caused various non-bootable issues in 4.3. Because of this, we may want to first try the mainline kernel to see if we can perform a "reverse" bisect, that is, if the bug is actually fixed in mainline.
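
As a rough illustration of the "reverse" bisect idea: instead of searching for the first bad kernel, you search for the first kernel where the bug is gone. The sketch below is a plain binary search in Python, not the tooling the kernel team uses. It assumes the candidates are ordered oldest to newest, the oldest is known bad, the newest is known good, and the fix stays in once it lands. The build labels and the position of the fix are made up.

```python
def reverse_bisect(versions, is_fixed):
    """Return (first fixed version, list of versions that had to be tested).

    Assumes versions[0] is known bad and versions[-1] is known good.
    """
    lo, hi = 0, len(versions) - 1  # lo: known bad, hi: known good
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(versions[mid])
        if is_fixed(versions[mid]):
            hi = mid   # fix is at mid or earlier
        else:
            lo = mid   # fix landed after mid
    return versions[hi], tested


# Hypothetical candidates; in practice each one is a test kernel to install.
candidates = ["build-%d" % i for i in range(8)]
first_fixed_index = 5  # invented for the example


def pretend_test(version):
    """Stand-in for 'install this kernel and see whether the host stays up'."""
    return candidates.index(version) >= first_fixed_index


print(reverse_bisect(candidates, pretend_test))
```

Each call to the test function corresponds to installing one kernel and waiting long enough to see whether the host hangs, so the number of tested builds is what dominates the effort.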

Can you test the following mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7-rc7/

Revision history for this message
Junien F (axino) wrote :
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry, the arm64 kernels are not built automatically like the other arches. I'll build it and post a link shortly.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you test out the latest 4.6 based yakkety kernel while I get an arm64 mainline kernel built?

The yakkety master-next kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1602577/yakkety/

Note that you need to install both the linux-image and linux-image-extra .deb packages.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

There is now a 4.7 arm64 mainline kernel available here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7/

Can you test this kernel when you have a chance?

Revision history for this message
Junien F (axino) wrote :

Hi Joseph,

I tried the 4.7, but it made the compute node unbootable : https://paste.ubuntu.com/21865889/
From the initramfs shell, I had to manually load a few modules (ahci_xgene, sd_mod and ext4) so I could mount the root filesystem and get back to the previous working version (4.2).

Was that expected ? Will the same thing happen with the 4.6 based yakkety kernel you pasted above ?

Thanks

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It wasn't expected that the system would not boot. The 4.8-rc1 kernel is now available, can you give it a try:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8-rc1/

tags: added: kernel-da-key
removed: kernel-key
Revision history for this message
Junien F (axino) wrote :

4.8-rc1 had the same behaviour (landed in the initramfs), so I figured I'd dig a bit. Long story short, this was entirely due to the following error :

/scripts/init-top/udev: line 14: can't create /sys/kernel/uevent_helper: Permission denied

CONFIG_UEVENT_HELPER was y in 4.2, but got switched to "not set" in at least 4.7 and 4.8.

I modified /usr/share/initramfs-tools/scripts/init-top/udev to not error out on the "echo" to /sys/kernel/uevent_helper, and then the node booted properly \o/

I'll let you know if it's more stable or not.
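
For reference, a minimal Python sketch of checking whether an option such as CONFIG_UEVENT_HELPER is enabled in a kernel config. It assumes the usual config text format ("CONFIG_X=y" or "# CONFIG_X is not set"). The two sample strings below are hand-written to mirror the difference described above, not copied from the real configs, and the function name is hypothetical.

```python
def config_value(config_text, option):
    """Return the option's value ('y', 'm', ...) or None if not set or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# %s is not set" % option:
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Illustrative fragments only, matching what was observed in this comment.
old_config = "CONFIG_UEVENT_HELPER=y\n"
new_config = "# CONFIG_UEVENT_HELPER is not set\n"

for name, text in (("4.2", old_config), ("4.8-rc1", new_config)):
    print(name, config_value(text, "CONFIG_UEVENT_HELPER"))
```

On an Ubuntu system the text would typically come from /boot/config-<release> for the kernel in question.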

Revision history for this message
Junien F (axino) wrote :

Well, this 4.8-rc1 doesn't have CONFIG_VIRTUALIZATION so it's not very useful for a compute node kernel ...

I reverted to 4.2. Can you please build a kernel with CONFIG_VIRTUALIZATION and CONFIG_KVM ?

Thanks

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a v4.8-rc1 kernel using the Xenial 4.4 kernel configs. This set of configs sets CONFIG_VIRTUALIZATION and CONFIG_KVM.

This test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1602577/v4.8-rc1With4.4Configs/

Can you give this test kernel a try?

Revision history for this message
Paul Collins (pjdc) wrote :

Kernel from #24 fails to boot with the following, and keeps looping. Unclear if it's due to a problem with our system or with the kernel. I've sought advice from someone more familiar with the hardware we're using and will update with any further info.

306 bytes read in 18 ms (16.6 KiB/s)
## Executing script at 4004000000
19403840 bytes read in 499 ms (37.1 MiB/s)
28340037 bytes read in 723 ms (37.4 MiB/s)
## Booting kernel from Legacy Image at 4002000000 ...
   Image Name: kernel 4.8.0-040800rc1-generic
   Created: 2016-08-11 4:06:14 UTC
   Image Type: ARM Linux Kernel Image (uncompressed)
   Data Size: 19403776 Bytes = 18.5 MiB
   Load Address: 00080000
   Entry Point: 00080000
   Verifying Checksum ... OK
## Loading init Ramdisk from Legacy Image at 4005000000 ...
   Image Name: ramdisk 4.8.0-040800rc1-generic
   Created: 2016-08-11 4:06:15 UTC
   Image Type: ARM Linux RAMDisk Image (gzip compressed)
   Data Size: 28339973 Bytes = 27 MiB
   Load Address: 00000000
   Entry Point: 00000000
   Verifying Checksum ... OK
ERROR: Did not find a cmdline Flattened Device Tree
Could not find a valid device tree

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you give the 4.8-rc8 kernel a try? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8-rc8/

tags: added: kernel-key
removed: kernel-da-key
Revision history for this message
Junien F (axino) wrote :

It's booting \o/ We're trying to replicate the instability, we'll let you know.

Revision history for this message
Martin Pitt (pitti) wrote :

My two instances have been up and idle for 1:30 hours by now. They don't have any actual lxd workload due to bug 1628946 (juju deploy currently fails), but the original hang bug actually happened on idle boxes. Thus, so far so good :-)

I'll check again tomorrow.

Revision history for this message
William Grant (wgrant) wrote :

(For those playing at home, the boot failure in comment #25 was because the m400 firmware preloads the dtb at 0x4003000000, which was clobbered by the >16MiB kernel. The latest 4.8 builds are gzipped, so they're short enough to leave it intact.)
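
The arithmetic behind that can be checked with a short Python sketch. The addresses and the byte count are taken from the boot log in the earlier comment and from the note above. It assumes the image sits in one contiguous block starting at the load address shown in the log.

```python
KERNEL_LOAD_ADDR = 0x4002000000  # "Booting kernel from Legacy Image at 4002000000"
DTB_ADDR = 0x4003000000          # where the m400 firmware preloads the dtb
IMAGE_BYTES = 19403840           # "19403840 bytes read" for the kernel image


def overruns(start, size, other_start):
    """True if a block of `size` bytes at `start` extends past `other_start`."""
    return start + size > other_start


gap = DTB_ADDR - KERNEL_LOAD_ADDR
print("room before the dtb: %d MiB" % (gap // (1024 * 1024)))
print("kernel overruns dtb:", overruns(KERNEL_LOAD_ADDR, IMAGE_BYTES, DTB_ADDR))
print("bytes past the dtb address:", KERNEL_LOAD_ADDR + IMAGE_BYTES - DTB_ADDR)
```

The gap between the two addresses is exactly 16 MiB, so the 18.5 MiB uncompressed image runs roughly 2.5 MiB past the start of the dtb.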

Revision history for this message
Martin Pitt (pitti) wrote :

My two arm64 instances had been idle for 16 hours, and after that fully busy with running tests for about 5 hours. So from my POV, the 4.8 kernel does not have the RCU hang (bug 1531768) any more, or at least it is much less noticeable. And apparently the host has survived about 24 hours as well now. I don't know what "after some time" in the bug description translates to (minutes? hours? days?), but so far this is looking pretty good \o/

Revision history for this message
Martin Pitt (pitti) wrote :

Both of my instances have been working happily for three days now. SHIP IT! :-)

Many thanks to the kernel team and Junien!

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

@Martin Pitt, is this with the upstream v4.8-rc8 kernel, or with an Ubuntu kernel? If it is with 4.8-rc8, we may still have to perform a reverse bisect to identify the commit that fixes the bug, so we can SRU it to the stable kernels. Yakkety will pick up this fix when we rebase, but kernels prior to Yakkety may not have the fix if it was not sent to upstream stable.

Revision history for this message
Martin Pitt (pitti) wrote :

@Joseph: I don't know what is running on the compute hosts, I don't have access to those. I suppose that Junien used the kernel you offered in comment #26, though?

Revision history for this message
Junien F (axino) wrote :

Yes, this is with the kernel from #26

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Junien. Now we know the fix is in the upstream v4.8-rc8 kernel. The next step is to identify the commit that fixes this bug, and then SRU it to Xenial.

It would be best to next test the latest upstream 4.4 kernel. That will tell us if the fix was already sent to upstream stable. If it was not, we will need to test some other upstream kernels to narrow down further when the fix landed in mainline.

Can you test the upstream 4.4.23 kernel? It can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23

Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
status: Triaged → In Progress
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
Revision history for this message
Junien F (axino) wrote :

upstream 4.4.23 has been installed on a compute node, we'll let you know how it goes

Revision history for this message
Junien F (axino) wrote :

FAOD

$ cat /proc/version
Linux version 4.4.23-040423-generic (kernel@tangerine) (gcc version 5.3.1 20160413 (Ubuntu/Linaro 5.3.1-14ubuntu2) ) #201609300709 SMP Fri Sep 30 11:52:40 UTC 2016

Revision history for this message
Martin Pitt (pitti) wrote :

My lxd-armhf1 node that is supposedly running on 4.4.23-040423-generic (on the compute host) has worked fine for the last 13 hours.

Revision history for this message
Junien F (axino) wrote :

Compute node is still up. I think we can consider upstream 4.4.23 to be good.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

That is great news. That means the fix in mainline was already cc'd to upstream stable. Xenial will get the fix through the normal upstream stable update process.

If you want to, we can still perform a kernel bisect to identify the fix for this bug, but that will require testing about 10 test kernels. Or we can just wait for the fix to come down via updates.
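
As a rough sanity check on the "about 10 test kernels" figure: a bisect halves the candidate range at every step, so the number of kernels to test grows with the base-2 logarithm of the number of commits. The Python sketch below shows that estimate. The commit counts are illustrative, not taken from the actual range in question.

```python
def bisect_steps(n_commits):
    """Worst-case number of test builds needed to single out one commit."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    # Integer equivalent of ceil(log2(n_commits)).
    return (n_commits - 1).bit_length()


for n in (2, 100, 1000, 1024):
    print(n, "commits ->", bisect_steps(n), "test kernels")
```

Ten steps is therefore in line with a range on the order of a thousand commits.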

Revision history for this message
Junien F (axino) wrote :

OK cool. I think we can wait. Do you have an ETA for the fix to come down ? And will trusty also get the fix ?

Revision history for this message
Martin Pitt (pitti) wrote :

Great news! Until that happens, is there any harm in leaving 4.8 or 4.4.23 running on the current compute nodes?

tags: added: kernel-da-key
removed: kernel-key
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Released