kernel crashes during boot unless IOMMU is disabled on Ryzen 1800X

Bug #1747463 reported by Peridot
This bug affects 9 people
Affects         Status     Importance  Assigned to  Milestone
Linux           Confirmed  Medium
linux (Ubuntu)  Confirmed  Medium      Unassigned
Bionic          Confirmed  Medium      Unassigned
Cosmic          Confirmed  Medium      Unassigned

Bug Description

I'm running Kubuntu Bionic on a Ryzen 1800X with a Biostar B350GT5 motherboard.

There are lots of logged AMD-Vi events, and I get IRQ crashes or ACPI hangs with a 'normal' boot. I got it to boot by disabling IOMMU in the BIOS and adding "iommu=soft" to the kernel boot options in GRUB.

Linux can then detect everything properly (all cores) and I've had zero crashes. The only issue is that it's using software IOMMU, which could carry a performance penalty because it has to copy all the data of some PCI devices into memory regions below 4 GB.

Alternatively it boots with the kernel option "acpi=off" but only detects a single core/thread.
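
For readers applying the iommu=soft workaround, here is a minimal sketch of adding a parameter to a kernel command-line string without duplicating it. The helper name add_param is hypothetical and is not part of GRUB or Ubuntu; the sample value "quiet splash" is assumed to be the current GRUB_CMDLINE_LINUX_DEFAULT setting in /etc/default/grub.

```shell
#!/bin/sh
# Hypothetical helper (not a GRUB tool): append a kernel parameter to a
# command-line string unless that exact parameter is already present.
add_param() {
    cmdline=$1
    param=$2
    case " $cmdline " in
        *" $param "*) printf '%s\n' "$cmdline" ;;                 # already there
        *) printf '%s\n' "${cmdline:+$cmdline }$param" ;;         # append it
    esac
}

# Assumed current value of GRUB_CMDLINE_LINUX_DEFAULT.
add_param "quiet splash" "iommu=soft"   # prints: quiet splash iommu=soft
```

The resulting string would go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by running sudo update-grub and rebooting.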

I attached a kernel log.

I believe(d) this might be related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671360
and https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085
---
ApportVersion: 2.20.8-0ubuntu8
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: fixme 1487 F.... pulseaudio
 /dev/snd/controlC0: fixme 1487 F.... pulseaudio
CurrentDesktop: KDE
DistroRelease: Ubuntu 18.04
HibernationDevice: RESUME=UUID=bc971fcc-8e63-4fa5-a149-af4af6c8eece
InstallationDate: Installed on 2018-01-31 (4 days ago)
InstallationMedia: Kubuntu 18.04 LTS "Bionic Beaver" - Alpha amd64 (20180131)
IwConfig:
 lo no wireless extensions.

 enp3s0 no wireless extensions.
MachineType: BIOSTAR Group B350GT5
Package: linux (not installed)
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-32-generic.efi.signed root=/dev/mapper/kubuntu--vg-root ro iommu=soft quiet splash vt.handoff=1
ProcVersionSignature: Ubuntu 4.13.0-32.35-generic 4.13.13
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-32-generic N/A
 linux-backports-modules-4.13.0-32-generic N/A
 linux-firmware 1.170
RfKill:

Tags: bionic
Uname: Linux 4.13.0-32-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 11/30/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 5.13
dmi.board.asset.tag: None
dmi.board.name: B350GT5
dmi.board.vendor: BIOSTAR Group
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr5.13:bd11/30/2017:svnBIOSTARGroup:pnB350GT5:pvr:rvnBIOSTARGroup:rnB350GT5:rvr:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: None
dmi.product.name: B350GT5
dmi.sys.vendor: BIOSTAR Group

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254611
crash logged using netconsole

I bought my daughter a notebook HP 15-ba062nc
(http://support.hp.com/us-en/product/HP-15-ba000-Notebook-PC-series/10862317/model/11792430).
Installed is Debian Stretch/Sid with kernel 4.9.6.

Successful boot without crash is possible with
    - disabled amdgpu (e.g. old nomodeset)
    - or disabled iommu (iommu=off)
otherwise the kernel crashes and the file-system is corrupted.

iommu=off is the much better option for now, because the notebook then runs
in an energy-efficient manner: the fan is quiet or stopped.

Attached are kernel messages captured using netconsole.

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254621
nomodeset - no crash

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254631
iommu=off - no crash

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254641
lspci -vvv

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254651
/proc/cpuinfo

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254661
crash logged using netconsole

In , fin4478 (fin4478-linux-kernel-bugs) wrote :

Do you have the amdgpu firmware installed?

When you file bugs against the amdgpu driver, use the latest kernel and Mesa code:
https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.11-wip
https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers

Problems might be fixed in the latest code.

Latest polaris firmware:
https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/

For how to create a custom kernel, see:
https://bugzilla.kernel.org/show_bug.cgi?id=193651

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

I doubt the problem is in the amdgpu driver. What about a bug in amd_iommu?
I think so because I tried to switch off the external GPU using the acpi_call module.
The following command was successful:

    echo '\_SB_.PCI0.VGA.PX02' > /proc/acpi/call

while running a kernel with no KMS (no amdgpu). The fan really went silent
after this, but the kernel crashed within several seconds, in a similar way
as with amdgpu and an active IOMMU.

The filesystem is corrupted after every crash. I'm afraid that the storage
controller goes through the IOMMU too and the crash causes some random writes to
disk :(. But maybe I am wrong, this ACPI call is actually illegal,
and amdgpu does something wrong regarding the IOMMU too. Nevertheless, amdgpu works
fine with iommu=off. Maybe the problem is some buggy BIOS/firmware
from the vendor.

I will try a newer kernel.

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

I tried kernel 4.10.0-rc6-amd64 from the Debian experimental archive and the result is very similar. To minimize harm to the filesystem, I booted into emergency mode with a read-only filesystem and tried to switch off the GPU using the ACPI call. There is a warning during the call, but something happened :)

ACPI Warning: \_SB.PCI0.VGA.PX02: Insufficient arguments - Caller passed 0, method requires 1 (20160930/nsarguments-256)

I did this twice - one time with and one time without iommu=off.

I'm attaching netconsole logs...

Is this proof that the problem is in amd_iommu.c?

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254695
logged using netconsole - 4.10.0rc6 with iommu=off, gpu turned off using acpi_call

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254697
logged using netconsole - 4.10.0rc6, gpu turned off using acpi_call -> crash

In , fin4478 (fin4478-linux-kernel-bugs) wrote :

Stock kernels have very little amdgpu driver code; see kernel.org and click diff. You have a very new AMD GPU, so use the command:
git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux

The kernel configuration files of official Debian kernels are available in /boot, named after the kernel release. Copy one to the linux directory as .config. Connect all your devices and run the command: make localmodconfig. You can also use the command make defconfig to create an initial .config file.

Use the command make xconfig and check that you have enabled Reroute Broken IRQ, Virtualization (KVM) and the 300 Hz CPU timer. I also disabled Swap, Kernel Debug, CPU Freq scaling and CPU handling in ACPI, and used the BIOS to control the CPU and devices. Under drivers->graphics->amdgpu, enable CIK support for a GCN 1.1 GPU and SI support for a GCN 1.0 GPU.

Create debian kernel package:
export CONCURRENCY_LEVEL=4
fakeroot make-kpkg --initrd kernel_image

Install the kernel package with Gdebi. To make the custom kernel boot, add this line to /etc/initramfs-tools/modules:
unix
And run: sudo update-initramfs -u
Reboot.

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

Created attachment 254739
drm-next-4.11-wip: boot into emergency mode - crash after modprobe amdgpu

In , fin4478 (fin4478-linux-kernel-bugs) wrote :

Comment on attachment 254739
drm-next-4.11-wip: boot into emergency mode - crash after modprobe amdgpu

You have Carrizo and Topaz GPUs. Can you disable one of them in the BIOS? The Linux driver does not support AMD Dual Graphics for speeding up fps. In the kernel configuration, you can try disabling IOMMU and vgaswitcheroo. From the kernel command line, you can blacklist PCI devices.

In , vaclav.ovsik (vaclav.ovsik-linux-kernel-bugs) wrote :

The BIOS is really primitive, there is nearly nothing regarding HW that can be changed :-/. I can continue to use iommu=off, it seems to be fine.
Thanks

In , mrromanze (mrromanze-linux-kernel-bugs) wrote :

Same issue persists on HP 15-ba028ur using latest mainline kernel. (4.12-rc7)

iommu=off
and
amd_iommu=fullflush
make boot possible.

In , mrromanze (mrromanze-linux-kernel-bugs) wrote :
Peridot (peridot) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1747463

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Peridot (peridot)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Peridot (peridot) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected bionic
description: updated
Peridot (peridot) wrote : CRDA.txt

apport information

Peridot (peridot) wrote : CurrentDmesg.txt

apport information

Peridot (peridot) wrote : JournalErrors.txt

apport information

Peridot (peridot) wrote : Lspci.txt

apport information

Peridot (peridot) wrote : Lsusb.txt

apport information

Peridot (peridot) wrote : ProcCpuinfo.txt

apport information

Peridot (peridot) wrote : ProcCpuinfoMinimal.txt

apport information

Peridot (peridot) wrote : ProcEnviron.txt

apport information

Peridot (peridot) wrote : ProcInterrupts.txt

apport information

Peridot (peridot) wrote : ProcModules.txt

apport information

Peridot (peridot) wrote : PulseList.txt

apport information

Peridot (peridot) wrote : UdevDb.txt

apport information

Peridot (peridot) wrote : WifiSyslog.txt

apport information

Peridot (peridot)
summary: - kernel crashes unless IOMMU is disabled on Ryzen 1800X
+ kernel crashes during boot unless IOMMU is disabled on Ryzen 1800X
description: updated
Peridot (peridot)
description: updated
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.15 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Peridot (peridot) wrote :

This is the result of booting the upstream kernel with IOMMU turned on in the BIOS.

Peridot (peridot) wrote :

When I exit the initramfs prompt pictured above, I get a full crash.

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Peridot (peridot) wrote :

I found it also boots with IOMMU turned on in the BIOS as long as you set iommu=soft, both with the Ubuntu kernel and the mainline kernel.

Kai-Heng Feng (kaihengfeng) wrote :

Is AMD-V enabled in BIOS?

Peridot (peridot) wrote :

AMD-V was not enabled. With it enabled, booting without iommu=soft results in a crash that is visible for only about a second before the HDMI connection is lost. This is with the Ubuntu kernel.

Peridot (peridot) wrote :

And with the mainline kernel

Peridot (peridot) wrote :

With iommu=soft, though, it boots on both kernels with IOMMU and AMD-V enabled in the BIOS.

Kai-Heng Feng (kaihengfeng) wrote :

Looks like there's a patch but not upstreamed yet,
https://patchwork.freedesktop.org/patch/157327/

Peridot (peridot) wrote :

The upstream bug https://bugs.freedesktop.org/show_bug.cgi?id=101029 makes reference to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683184 which is marked as fix released, just a heads up.

Kai-Heng Feng (kaihengfeng) wrote :

I don't see the patch in upstream Linux, though, so it's still worth a try.

Peridot (peridot) wrote :

I agree, I just meant that that bug might also be useful with debugging. It was closed because zesty reached EOL.

Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu Cosmic):
status: Confirmed → Triaged
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

The patch mentioned in the upstream bug report and comment #25 never landed in mainline. I tried to apply it to Bionic, but it does not apply cleanly. I'll work on backporting it and post a test kernel shortly. We can then update upstream with the testing results.

Joseph Salisbury (jsalisbury) wrote :

Can you see if this bug still exists in current mainline:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc5/

Joseph Salisbury (jsalisbury) wrote : iommu/amd: flush IOTLB for specific domains only (v2)

Hello Arindam,

There is a bug report[0] that you created a patch[1] for a while back.
However, the patch never landed in mainline.  There is a bug reporter in
Ubuntu[2] that is affected by this bug and is willing to test the
patch.  I attempted to build a test kernel with the patch, but it does
not apply to current mainline cleanly.  Do you still think this patch
may resolve this bug?  If so, is there a version of your patch available
that will apply to current mainline?

Thanks,

Joe

[0] https://bugs.freedesktop.org/show_bug.cgi?id=101029
[1] https://patchwork.freedesktop.org/patch/157327/
[2] http://pad.lv/1747463

Arindam Nath (arindam-nath) wrote :

Adding Tom.

Hi Joe,

My original patch was never accepted. Tom and Joerg worked on another patch series which was supposed to fix the issue in question in addition to doing some code cleanups. I believe their patches are already in the mainline. If I remember correctly, one of the patches disabled PCI ATS for the graphics card which was causing the issue.

Do you still see the issue with latest mainline kernel?

BR,
Arindam

Joseph Salisbury (jsalisbury) wrote :

Hi Arindam,

Thanks for the feedback.  Yes, the latest mainline kernel was tested,
and it is reported the bug still happens in the Ubuntu kernel bug[0].
Is there any specific diagnostic info we can collect that might help?

Thanks,

Joe

[0] http://pad.lv/1747463

Arindam Nath (arindam-nath) wrote :

> -----Original Message-----
> From: Joseph Salisbury [mailto:<email address hidden>]
> Sent: Tuesday, May 15, 2018 5:40 PM
> To: Nath, Arindam <email address hidden>
> Cc: <email address hidden>; Bridgman, John
> <email address hidden>; joro@8bytes.org; amd-
> <email address hidden>; <email address hidden>; <email address hidden>;
> Suthikulpanit, Suravee <email address hidden>; Deucher,
> Alexander <email address hidden>; Kuehling, Felix
> <email address hidden>; <email address hidden>; <email address hidden>;
> <email address hidden>; Lendacky, Thomas
> <email address hidden>
> Subject: Re: iommu/amd: flush IOTLB for specific domains only (v2)
>
> On 05/15/2018 04:03 AM, Nath, Arindam wrote:
> > Adding Tom.
> >
> > Hi Joe,
> >
> > My original patch was never accepted. Tom and Joerg worked on another
> patch series which was supposed to fix the issue in question in addition to do
> some code cleanups. I believe their patches are already in the mainline. If I
> remember correctly, one of the patches disabled PCI ATS for the graphics
> card which was causing the issue.
> >
> > Do you still see the issue with latest mainline kernel?
> >
> > BR,
> > Arindam
> >
> > -----Original Message-----
> > From: Joseph Salisbury [mailto:<email address hidden>]
> > Sent: Tuesday, May 15, 2018 1:17 AM
> > To: Nath, Arindam <email address hidden>
> > Cc: <email address hidden>; Bridgman, John
> > <email address hidden>; joro@8bytes.org;
> > <email address hidden>; <email address hidden>;
> <email address hidden>;
> > Suthikulpanit, Suravee <email address hidden>; Deucher,
> > Alexander <email address hidden>; Kuehling, Felix
> > <email address hidden>; <email address hidden>; <email address hidden>;
> > <email address hidden>
> > Subject: iommu/amd: flush IOTLB for specific domains only (v2)
> >
> > Hello Arindam,
> >
> > There is a bug report[0] that you created a patch[1] for a while back.
> However, the patch never landed in mainline.  There is a bug reporter in
> Ubuntu[2] that is affected by this bug and is willing to test the patch.  I
> attempted to build a test kernel with the patch, but it does not apply to
> currently mainline cleanly.  Do you still think this patch may resolve this
> bug?  If so, is there a version of your patch available that will apply to current
> mainline?
> >
> > Thanks,
> >
> > Joe
> >
> > [0] https://bugs.freedesktop.org/show_bug.cgi?id=101029
> > [1] https://patchwork.freedesktop.org/patch/157327/
> > [2] http://pad.lv/1747463
> >
> Hi Arindam,
>
> Thanks for the feedback.  Yes, the latest mainline kernel was tested, and it is
> reported the bug still happens in the Ubuntu kernel bug[0]. Is there any
> specific diagnostic info we can collect that might help?

Joe, I believe all the information needed is already provided in [2]. Let us wait for inputs from Tom and Joerg.

I could take a look at the issue locally, but it will take me some really long time since I am occupied with other assignments right now.

BR,
Arindam

>
> Thanks,
>
> Joe
>
> [0] http://pad.lv/1747463

Joseph Salisbury (jsalisbury) wrote :

On 05/15/2018 09:08 AM, Tom Lendacky wrote:
> On 5/15/2018 7:34 AM, Nath, Arindam wrote:
>>
>>> -----Original Message-----
>>> From: Joseph Salisbury [mailto:<email address hidden>]
>>> Sent: Tuesday, May 15, 2018 5:40 PM
>>> To: Nath, Arindam <email address hidden>
>>> Cc: <email address hidden>; Bridgman, John
>>> <email address hidden>; joro@8bytes.org; amd-
>>> <email address hidden>; <email address hidden>; <email address hidden>;
>>> Suthikulpanit, Suravee <email address hidden>; Deucher,
>>> Alexander <email address hidden>; Kuehling, Felix
>>> <email address hidden>; <email address hidden>; <email address hidden>;
>>> <email address hidden>; Lendacky, Thomas
>>> <email address hidden>
>>> Subject: Re: iommu/amd: flush IOTLB for specific domains only (v2)
>>>
>>> On 05/15/2018 04:03 AM, Nath, Arindam wrote:
>>>> Adding Tom.
>>>>
>>>> Hi Joe,
>>>>
>>>> My original patch was never accepted. Tom and Joerg worked on another
>>> patch series which was supposed to fix the issue in question in addition to do
>>> some code cleanups. I believe their patches are already in the mainline. If I
>>> remember correctly, one of the patches disabled PCI ATS for the graphics
>>> card which was causing the issue.
>>>> Do you still see the issue with latest mainline kernel?
>>>>
>>>> BR,
>>>> Arindam
>>>>
>>>> -----Original Message-----
>>>> From: Joseph Salisbury [mailto:<email address hidden>]
>>>> Sent: Tuesday, May 15, 2018 1:17 AM
>>>> To: Nath, Arindam <email address hidden>
>>>> Cc: <email address hidden>; Bridgman, John
>>>> <email address hidden>; joro@8bytes.org;
>>>> <email address hidden>; <email address hidden>;
>>> <email address hidden>;
>>>> Suthikulpanit, Suravee <email address hidden>; Deucher,
>>>> Alexander <email address hidden>; Kuehling, Felix
>>>> <email address hidden>; <email address hidden>; <email address hidden>;
>>>> <email address hidden>
>>>> Subject: iommu/amd: flush IOTLB for specific domains only (v2)
>>>>
>>>> Hello Arindam,
>>>>
>>>> There is a bug report[0] that you created a patch[1] for a while back.
>>> However, the patch never landed in mainline.  There is a bug reporter in
>>> Ubuntu[2] that is affected by this bug and is willing to test the patch.  I
>>> attempted to build a test kernel with the patch, but it does not apply to
>>> currently mainline cleanly.  Do you still think this patch may resolve this
>>> bug?  If so, is there a version of your patch available that will apply to current
>>> mainline?
>>>> Thanks,
>>>>
>>>> Joe
>>>>
>>>> [0] https://bugs.freedesktop.org/show_bug.cgi?id=101029
>>>> [1] https://patchwork.freedesktop.org/patch/157327/
>>>> [2] http://pad.lv/1747463
>>>>
>>> Hi Arindam,
>>>
>>> Thanks for the feedback.  Yes, the latest mainline kernel was tested, and it is
>>> reported the bug still happens in the Ubuntu kernel bug[0]. Is there any
>>> specific diagnostic info we can collect that might help?
>> Joe, I believe all the information needed is already provided in [2]. Let us wait for inputs from Tom and Joerg.
>>
>> I could take a look at the issue locally, but it will take me some really long time since I am occupied with oth...


Joseph Salisbury (jsalisbury) wrote :

@Peridot, I know you responded on IRC that the current mainline kernel still exhibits the bug. However, could you also add that test result to this bug report for upstream tracking?

Peridot (peridot) wrote :

I tested with 4.17 rc4 and the problem persists

Joseph Salisbury (jsalisbury) wrote :

@Peridot,
Request from Upstream:

For the original 4.13 kernel, I don't
see any attachments that have the AMD-Vi messages in question. Were they
completion timeouts (like in the later mainline kernel test, which I'll
get to in a bit) or I/O page fault messages? Without that information it
is hard to determine what the issue really is.

(Just as an FYI, if the IOMMU is disabled in BIOS, then iommu=soft is not
 necessary on the kernel command line).

For the upstream kernel test, since this is a Ryzen system, it's possible
that the BIOS does not have a requisite fix for SME and IOMMU (see [1]).
On the upstream kernel, if memory encryption is active by default without
this BIOS fix, then the result is AMD-Vi completion-wait timeout messages.
Try booting with mem_encrypt=off on the kernel command line or build a
kernel with CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n and see if that
allows the kernel to boot.

Thanks,
Tom

[1] https://bugzilla.kernel.org/show_bug.cgi?id=199513
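
Whether memory encryption is active by default depends on how the running kernel was built. Here is a minimal sketch for checking that from a kernel config file, assuming the distribution installs it as /boot/config-<release> (as Debian and Ubuntu kernels do). The helper name config_value is hypothetical, and the sample config lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel config on stdin and print the value of
# the given option ("y", "m", or a literal value). Prints "n" when the option
# is absent or appears only as "# CONFIG_... is not set".
config_value() {
    opt=$1
    val=$(sed -n "s/^${opt}=//p" | head -n 1)
    printf '%s\n' "${val:-n}"
}

# Illustrative input only; a real check would read /boot/config-<release>.
printf '%s\n' \
    'CONFIG_AMD_MEM_ENCRYPT=y' \
    '# CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set' |
    config_value CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT   # prints: n
```

On a live system this would be called as: config_value CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT < "/boot/config-$(uname -r)". A result of "y" suggests that mem_encrypt=off is worth trying on the kernel command line.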

Changed in linux (Ubuntu Bionic):
status: Triaged → Incomplete
Changed in linux (Ubuntu Cosmic):
status: Triaged → Incomplete
Peridot (peridot) wrote :

The attached screenshots are the result of booting 4.18 rc1 kernel

Peridot (peridot) wrote :

4.18 rc1

Peridot (peridot) wrote :

4.18 rc1 + mem_encrypt=off

Peridot (peridot) wrote :

4.18 rc1 + mem_encrypt=off

Peridot (peridot) wrote :

4.18 rc1 + mem_encrypt=off

Changed in linux (Ubuntu Bionic):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Cosmic):
status: Incomplete → Confirmed
Peridot (peridot) wrote :

All the tests were done with IOMMU turned off and SVE turned on in the BIOS, and it does not boot without iommu=soft.

When booting with SVE and IOMMU enabled in the BIOS, I got an endless screen of text and couldn't make anything out of it.

Booting with IOMMU turned on in the kernel and SVE turned off gives the screenshot below.

In , nickel (nickel-linux-kernel-bugs) wrote :

Hi,
I still encounter the same issue, ext4 fs corruption, using linux kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).

The laptop configs are: A6-9220 Radeon R4, 5 compute cores 2C+3G, 4 GB RAM, 200 GB Intel SSD (1st laptop) or 500 GB Toshiba HDD (2nd laptop).
I'm using the ALT Linux distribution (www.altlinux.org).
Boot and installation of the system from the LiveCD work flawlessly if the LAN cable is NOT attached. dmesg shows plenty of "AMD-Vi: Completion-Wait loop timed out" errors.
Connecting the LAN cable during LiveCD boot results in a graphical target boot failure or a kernel panic.
After the first reboot the system won't boot anyway, and ext4 filesystem corruption occurs.

Investigation revealed that switching IOMMU off (amd_iommu=off and/or iommu=soft) solves the issue. "amd_iommu=fullflush" doesn't work for me.

I've found several patches addressing (amd_)iommu issues in the linux-kernel mailing list archive, but they are either already applied to the kernels mentioned above or applying them doesn't solve the issue for me.
The above-mentioned patch (https://patchwork.freedesktop.org/patch/157327/) no longer applies to the mentioned kernel versions.
Therefore my question is: am I missing some patch that already solved the issue, or should I provide a more specific bug report?

In , nickel (nickel-linux-kernel-bugs) wrote :

Created attachment 281945
HP laptop dmesg output with plenty of errors while LAN cable present

In , nickel (nickel-linux-kernel-bugs) wrote :

Created attachment 281947
HP laptop dmesg output while IOMMU turned on, but no LAN cable

In , nickel (nickel-linux-kernel-bugs) wrote :

Created attachment 281949
HP laptop HW config via dmidecode

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
In , nickel (nickel-linux-kernel-bugs) wrote :

(In reply to Nikolai from comment #17)

> using linux
> kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).

Sorry for the typo. It should read 4.19.16... 4.20.17

In , nickel (nickel-linux-kernel-bugs) wrote :

Created attachment 282035
lspci -v for HP laptop 17-ak041ur

In , nickel (nickel-linux-kernel-bugs) wrote :

Created attachment 282037
lspci for HP laptop 17-ak041ur

calcatinge (calcatinge) wrote :

I have the same issue, but with a Ryzen 7 1700 with the following setups:

Setup 1:
Motherboard: Gigabyte B450 Aorus M (Latest BIOS installed)
RAM: 32 GB DDR4 Crucial 3000 MHz
GPU1: Sapphire Radeon RX 580 Nitro+ Special Edition 8GB GDDR5
GPU2: Asus Mining Radeon RX 470 4G GDDR5
SSD: 240 GB WD Green

Setup 2:
Motherboard: ASRock AB350M Pro4 (Latest BIOS installed)
RAM: 32 GB DDR4 Crucial 3000 MHz
GPU1: Sapphire Radeon RX 580 Nitro+ Special Edition 8GB GDDR5
GPU2: Sapphire Radeon RX 550 2G GDDR5
SSD: 240 GB WD Green

The issue first started after installing the second GPU on each of the systems. I can't even boot the system, as the errors appear right after BIOS initialization.

The errors are like:

[excerpt]
AMD-Vi: Completion-Wait loop timed out
AMD-Vi: Completion-Wait loop timed out
AMD-Vi: Completion-Wait loop timed out

...

AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb7560]

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.

....

AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb75d0]
AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb7700]
AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb7630]
AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb7660]
AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb7690]
AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb76c0]
AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb76f0]

...

AMD-Vi: Completion-Wait loop timed out
AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.3 domain=0x000 address=0x00000000fffd3990 flags=0x0070]

...

Buffer I/O error on dev dm-0, logical block 104857472, async page read
Buffer I/O error on dev dm-0, logical block 104857473, async page read
Buffer I/O error on dev dm-0, logical block 104857474, async page read
Buffer I/O error on dev dm-0, logical block 104857475, async page read
Buffer I/O error on dev dm-0, logical block 104857476, async page read
Buffer I/O error on dev dm-0, logical block 104857477, async page read

.....

After some research I discovered that it is an IOMMU issue.
I turned IOMMU off on both motherboards, and I managed to boot the system.
I am an architect and I use Blender for GPU rendering (that is the reason for having two GPUs). However, the AMDGPU-Pro driver from AMD's website (the only one Blender can use for GPU rendering, as it can't use the open one) breaks GNOME on Xorg: the session doesn't start at all and keeps returning to the GDM screen...

Any ideas?

Thanks.
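
For anyone comparing logs like the excerpt above, the AMD-Vi events can be tallied per device to see which PCI function is generating the timeouts. This is only a minimal sketch: the function name and regular expression are illustrative assumptions, and the pattern is based solely on the line format quoted in this comment, so other kernel versions may print the events differently.

```python
import re
from collections import Counter

# Matches lines in the format quoted in this thread, for example:
#   AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x...]
# Other kernel versions may format these events differently (assumption).
EVENT_RE = re.compile(
    r"AMD-Vi: Event logged \[(?P<event>\w+) device=(?P<device>[0-9a-fA-F:.]+)"
)


def tally_amd_vi_events(lines):
    """Count AMD-Vi events per (event type, PCI device) pair."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("event"), match.group("device"))] += 1
    return counts


# Example with three lines copied from the excerpt above.
sample = [
    "AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb7560]",
    "AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=22:00.0 address=0x0000000174bb75d0]",
    "AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.3 domain=0x000 address=0x00000000fffd3990 flags=0x0070]",
]
for (event, device), n in tally_amd_vi_events(sample).most_common():
    print(n, event, device)
```

The same function can be fed the lines of a saved dmesg or journalctl dump. The device IDs it reports can then be looked up in the lspci output.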

Revision history for this message
In , nickel (nickel-linux-kernel-bugs) wrote :

(In reply to Nikolai from comment #17)
> Hi,
> I still encounter a same issue concerned to ext4 fs corruption using linux
> kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).
>
> Laptop configs are A6-9220 radeon r4 5 compute cores 2c+3g, 4G RAM, 200GB
> Intel SSD (1st laptop) or 500GB Toshiba HDD (2nd laptop)

This patch works for me:

https://lkml.org/lkml/2019/4/8/331

One can also use 'pci=noats' as a temporary workaround.

Thanks to Joerg Roedel <email address hidden> who guided me to a solution.
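
To keep a parameter such as 'pci=noats' across reboots on Ubuntu, it normally goes into the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub. A minimal sketch of that edit follows. The helper name is hypothetical, and it assumes the value is wrapped in double quotes, as in the stock Ubuntu file.

```python
import re


def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with `param` appended to the quoted value of `key`.

    The parameter is not added twice. If `key` is not found, the text is
    returned unchanged. Assumes the value is enclosed in double quotes.
    """
    pattern = re.compile(rf'^({re.escape(key)}=")([^"]*)(")', re.MULTILINE)

    def repl(m):
        params = m.group(2).split()
        if param not in params:
            params.append(param)
        return m.group(1) + " ".join(params) + m.group(3)

    return pattern.sub(repl, grub_text, count=1)


# Example on an in-memory copy of the file contents.
before = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(before, "pci=noats"))
```

The sketch only transforms text; writing the result back to /etc/default/grub needs root, and `sudo update-grub` has to be run afterwards for the change to take effect on the next boot.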

Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Cosmic):
assignee: Joseph Salisbury (jsalisbury) → nobody
Revision history for this message
Alex (alexandersim) wrote :

I have also had this problem: MSI Tomahawk Arctic B350, Ryzen 1700, Vega 56 GPU.
Ubuntu 19.04 ignores the BIOS setting and attempts to boot, then the kernel panics with lots of "AMD-Vi: Completion-Wait loop timed out" messages.

After a reboot I find grub in rescue mode and all ext4 partitions wiped. I can't boot a live disc either, unless I swap to an Nvidia graphics card.

The only workaround is to add the kernel parameters, but booting Ubuntu shouldn't wipe my OS?

Revision history for this message
In , alexdeucher (alexdeucher-linux-kernel-bugs) wrote :

Does booting with amdgpu.runpm=0 on the kernel command line help?

Revision history for this message
In , alexdeucher (alexdeucher-linux-kernel-bugs) wrote :
Revision history for this message
In , nickel (nickel-linux-kernel-bugs) wrote :

(In reply to Alex Deucher from comment #25)
> Does booting with amdgpu.runpm=0 on the kernel command line help?

Yes it does. System is able to boot and no filesystem corruption occurs either.

So which solution is preferable in this case?

Revision history for this message
In , nickel (nickel-linux-kernel-bugs) wrote :

Created attachment 282309
dmesg for HP laptop 17-ak041ur with amdgpu.runpm=0 kernel parameter

Revision history for this message
Tommy Zone (tommyzone11) wrote :

I have the same issue on a Radeon 570 4 GB, Ryzen 1700, MSI Tomahawk on 18.04.

The only way to boot is to use iommu=soft.

Revision history for this message
In , Peridot (peridot) wrote :

No, only pci=noats works
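
Since different reporters in this thread end up with different workarounds (iommu=soft, pci=noats, amdgpu.runpm=0), it can help to confirm which ones the running kernel was actually booted with before reporting results. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the function names are illustrative, not part of any existing tool.

```python
WORKAROUNDS = ("iommu=soft", "pci=noats", "amdgpu.runpm=0")


def active_workarounds(cmdline, candidates=WORKAROUNDS):
    """Return the candidate parameters present in a kernel command line."""
    tokens = cmdline.split()
    return [c for c in candidates if c in tokens]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


# Example using the ProcKernelCmdLine value from the bug description.
reported = (
    "BOOT_IMAGE=/boot/vmlinuz-4.13.0-32-generic.efi.signed "
    "root=/dev/mapper/kubuntu--vg-root ro iommu=soft quiet splash vt.handoff=1"
)
print(active_workarounds(reported))
```

On an affected machine, `active_workarounds(read_cmdline())` gives the same check for the live system. The comparison is case-sensitive on purpose, because kernel parameters are.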
