unable to enable iommu on HPE Proliant Gen9 server

Bug #1641593 reported by James Page
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
linux (Ubuntu) | Expired | High | Unassigned |

Bug Description

I'm using MAAS to enable the following kernel flags on install/boot:

  iommu=pt intel_iommu=on

in order to be able to pass SR-IOV VFs through to KVM guests; however, when these options are enabled, the servers fail to install (see attached screenshot).
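
For anyone reproducing this, it may help to first confirm that the flags actually reached the booted kernel. A minimal Python sketch, assuming a Linux host; the helper name `iommu_flags` is hypothetical and is not part of MAAS or any kernel tooling:

```python
def iommu_flags(cmdline: str) -> dict:
    """Return the iommu-related key=value options found on a kernel command line."""
    flags = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only keep the two options discussed in this bug.
        if sep and key in ("iommu", "intel_iommu"):
            flags[key] = value
    return flags
```

On a running system the input would typically be the contents of `/proc/cmdline`, for example `iommu_flags(open("/proc/cmdline").read())`.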

The install eventually fails; it looks like writes to one of the disks start to fail for some reason.

Servers are targeted with Xenial and the release 4.4 kernel (no HWE).

Here's the LSHW output from the system: http://pastebin.ubuntu.com/23875929/

Revision history for this message
James Page (james-page) wrote :

Device 03:00.0 is the SAS Smart Array controller in the server (output grabbed from a machine deployed without iommu enabled):

03:00.0 Serial Attached SCSI controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
        DeviceName: Embedded RAID 1
        Subsystem: Hewlett-Packard Company P440ar
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at 93300000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at 93400000 (64-bit, non-prefetchable) [size=1K]
        Region 4: I/O ports at 4000 [size=256]
        [virtual] Expansion ROM at 93480000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: hpsa
        Kernel modules: hpsa
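
When comparing several machines, the driver bound to the controller can be pulled out of saved `lspci -v` text rather than read by eye. A rough Python sketch, assuming output in the format shown above (slot address at the start of an unindented line, details indented); `driver_in_use` is a hypothetical helper, and output that includes a PCI domain prefix such as `0000:` would need the slot passed in that form:

```python
def driver_in_use(lspci_text: str, slot: str):
    """Return the 'Kernel driver in use' value for the given PCI slot, or None."""
    in_device = False
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new device record, e.g. "03:00.0 Serial ...".
            in_device = line.startswith(slot + " ")
        elif in_device and line.strip().startswith("Kernel driver in use:"):
            return line.split(":", 1)[1].strip()
    return None
```

For the output above, `driver_in_use(text, "03:00.0")` would be expected to return `hpsa`.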

Revision history for this message
James Page (james-page) wrote :

potentially related bug 1590072

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1641593

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
James Page (james-page)
description: updated
Revision history for this message
James Page (james-page) wrote :

Unable to run apport-collect, as the machine won't actually boot to a usable state.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
ryeterrell (ryeterrell)
tags: added: v-pil vpil
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Is it possible to install the system and then boot with different kernel versions, to see whether this only happens at install time or also on a regular boot after install?

We could bisect and also test newer kernels to see if there is a fix already if we can boot different kernels.

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Revision history for this message
James Page (james-page) wrote :

Hi Joseph

The problem happens both during install, and during a regular boot (for example if I enable iommu post install and then reboot).

Happy to try new kernel versions for testing.

Revision history for this message
James Page (james-page) wrote :

OK tried with:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc5/

still see the same issue with intel_iommu=on; booted fine prior to enabling.

Revision history for this message
James Page (james-page) wrote :

(I also tried a firmware upgrade for the storage controller to the latest version - no luck).

tags: added: kernel-bug-exists-upstream-4.9-rc5
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Is there a prior kernel version that does boot properly, like Trusty or Precise, etc?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Is there a prior kernel version that does boot properly? If there is, we can perform a kernel bisect to find the commit that introduced this bug.

Revision history for this message
James Page (james-page) wrote :

I'll try with trusty

Revision history for this message
James Page (james-page) wrote :

OK, so I tripped on another issue trying with trusty: the servers I'm using have NVMe SSDs for caching, and it looks like trusty's blkid is unable to read metadata from NVMe devices.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update. That would make it difficult to know if this is a regression, or if this bug always existed.

It might be worthwhile testing the latest Xenial kernel in -proposed:
https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/11278866

The 4.9-rc7 kernel is now also available, and might be worth testing. It is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc7/

tags: added: kernel-da-key
removed: kernel-key
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
description: updated
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We found that the latest Xenial kernel (4.4.0.62.65) from https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/11278866 fixes this issue, with no firmware updates required. We also tested with just the latest firmware updates, and that did not fix the issue. Latest firmware + 4.4.0.62.65 also works.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1641593] Re: unable to enable iommu on HPE Proliant Gen9 server

So, I appear to have spoken too soon on exactly what fixes this.

We have two systems being tested with 4.4.0-62-generic #83 - one with the
firmware update and one without.

The one with the firmware updates has been up for over 6 hours now without
any issues.

The one without firmware updates has been up for 40 minutes and is getting
I/O errors now.

I'm also seeing a system with 4.4.0-59-generic #80 and no firmware updates
boot up with iommu enabled; I will see how long it stays up.

I'll also test with 4.4.0-59-generic #80 and the firmware updates.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Doing some more testing, it looks like the systems without the firmware
update are not stable. I can sometimes, but not always, get them to boot
using either 4.4.0-59-generic #80 or 4.4.0-62-generic #83, but once
they're up, they don't last long. The longest I've seen is about 40
minutes before getting "sd 0:0:2:0: rejecting I/O to offline device" errors
and /dev/sda going offline. I can get it to go offline quicker - almost
immediately - by doing "cat /dev/urandom > /dev/sdb".

The two systems with the firmware updates both reliably boot up and stay up
using either 4.4.0-59-generic #80 or 4.4.0-62-generic #83, and haven't gone
offline yet from the "cat /dev/urandom > /dev/sdb" test. I will leave them
running overnight.
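
To spot the failure on a long-running test box without watching the console, the kernel log can be scanned for the error quoted above. A minimal Python sketch, assuming the message format matches the one seen here; `offline_devices` is a hypothetical helper name. Note that the `cat /dev/urandom > /dev/sdb` test overwrites the target disk, so it is only suitable for disks holding no data you need.

```python
import re

# Matches kernel messages such as:
#   sd 0:0:2:0: rejecting I/O to offline device
OFFLINE_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): rejecting I/O to offline device")


def offline_devices(log_lines):
    """Return the sorted SCSI addresses (H:C:T:L) that logged the offline error."""
    hits = set()
    for line in log_lines:
        m = OFFLINE_RE.search(line)
        if m:
            hits.add(m.group(1))
    return sorted(hits)
```

The input can be any iterable of log lines, for example the lines of captured `dmesg` output or of a saved kernel log file.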

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired