Write performance regression severely affecting hpsa controllers

Bug #1668557 reported by Niklas Edmundsson
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

After upgrading our HP(E) servers to xenial, we discovered a severe regression in write performance on Smart Array RAID logical drives.

Firmware is up to date with the latest HPE SPP release at the time of testing.

This performance regression has been verified at multiple sites with different HPE systems and OSes; downgrading the kernel brings back the expected performance.

We are NOT seeing this on Dell hardware with H730P (LSI based) controllers, using the same OS installs.

Our test system setup:

Ubuntu 16.04.2 LTS
Server: HP(E) DL380e, 36 G RAM, 1x E5-2420 CPU
RAID Controller: Smart Array P420 2GB FBWC
RAID setup: 24x 500G SAS HDDs in RAID50 with 2 parity groups.
File system: xfs

We are also seeing this issue on setups using DL380e with P430 controller and 14 HDDs in RAID6, but those systems are not available for testing.

Fast (normal, pre-regression) case: bulk I/O performance is approx. 1600-1800 MB/s sustained for both reads and writes, measured with a simple dd bs=256k based test.

Slow (regressed) case: write performance drops to approx. 500-600 MB/s sustained using the same tests and file system.
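
The report does not include the exact test commands, so the following is only a minimal sketch of what a dd bs=256k based bulk test might look like. The function names, the target path and the block count are assumptions; `conv=fdatasync` (GNU dd) is added so that the write figure includes flushing the data to disk rather than only filling the page cache.

```shell
#!/bin/sh
# Hypothetical helpers; the original test commands are not given in the report.

# Sequential write: COUNT blocks of 256 KiB to TARGET.
# conv=fdatasync makes dd flush the data before it reports its rate.
bulk_write_test() {
    target=$1
    count=$2
    dd if=/dev/zero of="$target" bs=256k count="$count" conv=fdatasync
}

# Sequential read of the same file. For a cold-cache read, drop the page
# cache first (as root): echo 3 > /proc/sys/vm/drop_caches
bulk_read_test() {
    target=$1
    dd if="$target" of=/dev/null bs=256k
}
```

For example, `bulk_write_test /mnt/test/ddfile 200000` (path assumed) would write about 49 GiB; dd prints the achieved throughput on stderr when it finishes.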

We have tested using the following OS installs and kernels:

Precise:

All tested Ubuntu kernels are fast.

Trusty:

trusty 3.13.0.110.118 fast
utopic 3.16.0.77.68 fast
vivid 3.19.0.80.62 slow
xenial 4.4.0.64.50 slow

Xenial:

xenial 4.4.0.64.68 slow
hwe 4.8.0.39.10 slow

mainline 3.12.64-031264.201610030943 fast
mainline 3.16.41-031641.201702270232 fast
mainline 3.17.8-031708.201501081837 fast
mainline 3.18.0-031800.201412071935 fast
mainline 3.18.12-031812.201504221338 fast
mainline 3.18.18-031818.201507101433 fast
mainline 3.18.21-031821.201509020527 fast
mainline 3.18.22-031822.201510031227 slow
mainline 3.18.23-031823.201510291931 slow
mainline 3.18.24-031824.201511031331 slow
mainline 3.18.47-031847.201701181631 slow
mainline 4.10.1-041001.201702260735 slow
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Feb 28 10:29 seq
 crw-rw---- 1 root audio 116, 33 Feb 28 10:29 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
MachineType: HP ProLiant DL380e Gen8
Package: linux (not installed)
PciMultimedia:

ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-64-generic root=/dev/mapper/rootvg-rootlv ro console=ttyS0,115200n8 noquiet nosplash nomodeset
ProcVersionSignature: Ubuntu 4.4.0-64.85-generic 4.4.44
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-64-generic N/A
 linux-backports-modules-4.4.0-64-generic N/A
 linux-firmware 1.157.8
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial xenial
Uname: Linux 4.4.0-64-generic x86_64
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: False
dmi.bios.date: 08/02/2014
dmi.bios.vendor: HP
dmi.bios.version: P73
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP73:bd08/02/2014:svnHP:pnProLiantDL380eGen8:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL380e Gen8
dmi.sys.vendor: HP

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1668557

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

Niklas Edmundsson (niklas-edmundsson) wrote : CRDA.txt

apport information

tags: added: apport-collected xenial
description: updated

Niklas Edmundsson (niklas-edmundsson) wrote : CurrentDmesg.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : JournalErrors.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : Lspci.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : Lsusb.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : ProcCpuinfo.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : ProcEnviron.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : ProcInterrupts.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : ProcModules.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : UdevDb.txt

apport information

Niklas Edmundsson (niklas-edmundsson) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Mattias Wadenstein (maswan) wrote :

Here is some iostat output and the command lines from the slow mode (xenial kernel).

Mattias Wadenstein (maswan) wrote :

Last fast mainline kernel iostat.

Mattias Wadenstein (maswan) wrote :

And iostat from the first slow kernel.

Christian Ehrhardt (paelzer) wrote :

Thanks, maswan, for the later logs, as discussed.

TL;DR: the kernel no longer knows about, or treats, this device as a RAID.

It now merges write requests: wrqm/s 0 -> ~500
As a result the writes are about 8x bigger: avgrq-sz ~1000 -> ~8000
But we know it is a RAID, so the requests have to be split again at the stripe size
An 8x increase in time per request would be OK, but we see: w_await ~10 -> ~600
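
iostat reports avgrq-sz in 512-byte sectors, so the two figures above translate into request sizes as in this small sketch (the helper name is made up for illustration):

```shell
# Hypothetical helper: convert a request size in 512-byte sectors to KiB.
sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

sectors_to_kib 1000   # fast kernels: about 500 KiB per request
sectors_to_kib 8000   # slow kernels: about 4000 KiB per request
```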

While checking the source, I asked on IRC for the output of the following in both the good and the bad case:
 $ for i in /sys/class/block/sda/queue/iosched/* /sys/class/block/sda/queue/*; do [ -f "$i" ] && echo "$i $(cat "$i")"; done
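
To compare the good and bad cases, the same information can be captured to a file on each kernel and diffed afterwards. This is only a sketch; the function name and the output file names are assumptions.

```shell
# Hypothetical helper: print "name value" for every readable attribute
# under QUEUE_DIR (e.g. /sys/class/block/sda/queue), including iosched/.
dump_queue_settings() {
    queue_dir=$1
    for f in "$queue_dir"/* "$queue_dir"/iosched/*; do
        [ -f "$f" ] || continue
        # Skip anything cat cannot read instead of aborting the dump.
        value=$(cat "$f" 2>/dev/null) || continue
        name=${f#"$queue_dir"/}
        echo "$name $value"
    done
}
```

Running `dump_queue_settings /sys/class/block/sda/queue > queue-$(uname -r).txt` on a fast and on a slow kernel and then diffing the two files shows which settings changed.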

Mattias Wadenstein (maswan) wrote :

sys/block output for the slow case

Mattias Wadenstein (maswan) wrote :

Fast style sys/block output

Christian Ehrhardt (paelzer) wrote :

The only difference is max_sectors_kb: 512 -> 4096

That matches the iostat output: the ~8000 there is in 512-byte sectors, which corresponds to the 4096 KiB maximum request size.
I have not yet found the patch in linux-stable that changes this, but it is worth booting a bad kernel and setting max_sectors_kb to 512 in /sys.
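
The experiment suggested here could be run along these lines. This is a minimal sketch: the helper name and its optional fourth argument (an alternative sysfs root, used only for dry runs) are assumptions, and writing under /sys requires root.

```shell
# Hypothetical helper: set a block queue attribute and print the result.
# SYSFS_ROOT defaults to /sys; writing there needs root.
set_queue_attr() {
    dev=$1
    attr=$2
    value=$3
    sysfs_root=${4:-/sys}
    path="$sysfs_root/class/block/$dev/queue/$attr"
    echo "$value" > "$path" || return 1
    # Read the value back to confirm what the kernel accepted.
    echo "$attr=$(cat "$path")"
}
```

On a slow kernel, `set_queue_attr sda max_sectors_kb 512` followed by a rerun of the write test would show whether the request size limit explains the regression. The setting is not persistent and is lost on reboot.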

Later, to tune for your RAID case with so many disks, you might also consider setting
/sys/class/block/sdb/queue/rotational = 0
Locality doesn't mean a lot for your setup, and this might save some CPU cycles.

Christian Ehrhardt (paelzer) wrote :

I think I found the change in linux stable: http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=20d74bf29cfae86649bf1ec75038c79a9bc5010f

The "fix" itself is correct: the maximum size really is bigger.

Yet in your case, with the drives very likely not being 4k-formatted and the strip size/offset being different, it acts as an anti-tuning that you now need to override.

Let me know what results you get with 512 (and maybe also rotational=0) set on a bad kernel.

Christian Ehrhardt (paelzer) wrote :

Theory confirmed: due to the fix, the real "capable" maximum is now set, which lets the block layer merge and hold back requests, only for them to be split again later by the RAID controller.

Following the IRC discussion, using that setting as a tuning to "fix" this is OK.

Closing, as it turned out not to be a bug.

The kernel already has the proper tunable (the one we used), but having a knob in the HW controller might be worthwhile so that not everybody has to find this out the hard way.

Changed in linux (Ubuntu):
status: Confirmed → Invalid

Niklas Edmundsson (niklas-edmundsson) wrote :

Last time I checked, the hpsa driver is part of the kernel shipped with Ubuntu...

Regardless of whether the core kernel is doing it right or not, the hpsa driver should not advertise a larger max_sectors_kb than it can handle with good performance.

FYI, the hpsa driver in the xenial default kernel advertises the stripe size, so this breaks performance for any setup with a stripe size of more than approx. 1 MiB.

Niklas Edmundsson (niklas-edmundsson) wrote :

As a workaround, this udev rule can be used.

Caveat: It works as intended for us, but I'm not a udev expert.

Niklas Edmundsson (niklas-edmundsson) wrote :

We realized that this likely affects older controllers using the cciss driver as well, so here is an updated udev rule for completeness.

Niklas Edmundsson (niklas-edmundsson) wrote :

Johan Guldmyr pointed out that my udev rule attempted to apply the attribute changes to partitions as well, which causes errors in system log files.

Uploading a fixed version.
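
The attached rule files are not part of this text. Purely as an illustration, and not the author's actual rule, a rule of the following shape would apply the limit to whole disks behind the hpsa or cciss drivers while skipping partitions. The helper name, the rule file name, the match keys chosen and the value 512 (taken from the fast kernels earlier in this report) are all assumptions.

```shell
# Hypothetical sketch: write a udev rule that caps max_sectors_kb.
# ENV{DEVTYPE}=="disk" skips partitions, which have no queue/ directory.
# DRIVERS== matches the driver of any parent device (here the controller).
write_max_sectors_rule() {
    rule_file=$1
    cat > "$rule_file" <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", DRIVERS=="hpsa|cciss", ATTR{queue/max_sectors_kb}="512"
EOF
}
```

As root, `write_max_sectors_rule /etc/udev/rules.d/60-hpsa-max-sectors.rules` followed by `udevadm control --reload` and `udevadm trigger --subsystem-match=block` would apply it; `cat /sys/class/block/sda/queue/max_sectors_kb` shows whether it took effect.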
