nvme drive fails after some time

Bug #1910866 reported by Alan Pope 🍺🐧🐱 🦄
This bug affects 6 people
Affects                  Status         Importance   Assigned to   Milestone
Debian                   New            Undecided    Unassigned
linux (Ubuntu)           Confirmed      Undecided    Unassigned
linux (Ubuntu) Groovy    Fix Released   Undecided    Unassigned

Bug Description

Sorry for the vague title. I thought this was a hardware issue until someone else online mentioned that their NVMe drive goes read-only after some time. I tend not to reboot my system much, so I have a large journal. Either way, this happens once in a while: the / drive is fine, but /home is on an NVMe drive which just disappears. I reboot and everything is fine, but leave it long enough and it'll fail again.

Here's the most recent snippet about the nvme drive before I restarted the system.

Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, aborting
Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 449 QID 5 timeout, aborting
Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 450 QID 5 timeout, aborting
Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 451 QID 5 timeout, aborting
Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, reset controller
Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 22 QID 0 timeout, reset controller
Jan 08 19:21:04 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
Jan 08 19:21:25 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Jan 08 19:21:25 robot kernel: nvme nvme1: Removing after probe failure status: -19
Jan 08 19:21:41 robot kernel: INFO: task jbd2/nvme1n1p1-:731 blocked for more than 120 seconds.
Jan 08 19:21:41 robot kernel: jbd2/nvme1n1p1- D 0 731 2 0x00004000
Jan 08 19:21:45 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993784 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123967, lost async page write
Jan 08 19:21:45 robot kernel: EXT4-fs error (device nvme1n1p1): __ext4_find_entry:1535: inode #57278595: comm gsd-print-notif: reading directory lblock 0
Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993384 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123917, lost async page write
Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993320 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1833166472 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123909, lost async page write
Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1909398624 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 0, lost sync page write
Jan 08 19:21:45 robot kernel: EXT4-fs (nvme1n1p1): I/O error while writing superblock

ProblemType: Bug
DistroRelease: Ubuntu 20.10
Package: linux-image-5.8.0-34-generic 5.8.0-34.37
ProcVersionSignature: Ubuntu 5.8.0-34.37-generic 5.8.18
Uname: Linux 5.8.0-34-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu50.3
Architecture: amd64
CasperMD5CheckResult: skip
CurrentDesktop: ubuntu:GNOME
Date: Sat Jan 9 11:56:28 2021
InstallationDate: Installed on 2020-08-15 (146 days ago)
InstallationMedia: Ubuntu 20.04.1 LTS "Focal Fossa" - Release amd64 (20200731)
MachineType: Intel Corporation NUC8i7HVK
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.8.0-34-generic root=UUID=c212e9d4-a049-4da0-8e34-971cb7414e60 ro quiet splash vt.handoff=7
RebootRequiredPkgs:
 linux-image-5.8.0-36-generic
 linux-base
RelatedPackageVersions:
 linux-restricted-modules-5.8.0-34-generic N/A
 linux-backports-modules-5.8.0-34-generic N/A
 linux-firmware 1.190.2
SourcePackage: linux
UpgradeStatus: Upgraded to groovy on 2020-09-20 (110 days ago)
dmi.bios.date: 12/17/2018
dmi.bios.release: 5.6
dmi.bios.vendor: Intel Corp.
dmi.bios.version: HNKBLi70.86A.0053.2018.1217.1739
dmi.board.name: NUC8i7HVB
dmi.board.vendor: Intel Corporation
dmi.board.version: J68196-502
dmi.chassis.type: 3
dmi.chassis.vendor: Intel Corporation
dmi.chassis.version: 2.0
dmi.modalias: dmi:bvnIntelCorp.:bvrHNKBLi70.86A.0053.2018.1217.1739:bd12/17/2018:br5.6:svnIntelCorporation:pnNUC8i7HVK:pvrJ71485-502:rvnIntelCorporation:rnNUC8i7HVB:rvrJ68196-502:cvnIntelCorporation:ct3:cvr2.0:
dmi.product.family: Intel NUC
dmi.product.name: NUC8i7HVK
dmi.product.version: J71485-502
dmi.sys.vendor: Intel Corporation

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Which one is the failing one? Samsung or OCZ?

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

I'm on Ubuntu 20.04, and after updating to the HWE 5.8 kernel recently I have also been suffering from my NVMe drive becoming read-only after a period of time. I have since switched back to the 5.4 kernel and have not suffered the issue again.

I am on a single-disk system, so I had to run dmesg --follow remotely from another machine to retrieve log information.
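
For anyone wanting to capture logs the same way, a minimal sketch, assuming SSH access from a second machine ("xps" is a placeholder hostname):

$ # From the second machine: stream the kernel log over SSH and keep a local copy.
$ # -t allocates a tty so sudo can prompt for a password on the remote side.
$ ssh -t xps 'sudo dmesg --follow' | tee nvme-failure.log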

Here is a pastebin from around the time my system locks up: https://pastebin.ubuntu.com/p/FKsJV8VwRw/ (note it has similar errors: a timeout and abort, then a reset, then a call trace, etc.).

Here is a pastebin of the smartctl output: https://pastebin.ubuntu.com/p/W9w2nHYhd2/ The drive itself appears to be fine and not failing (it does seem to increment "Error Information Log Entries" when this lockup happens, but when viewing the error it is just full of 0xffff).

System info when the lockup happened:

Machine: Dell XPS 13 9360
Drive: THNSN5512GPUK NVMe TOSHIBA 512GB
Kernel at the time: $ apt policy linux-image-generic-hwe-20.04
linux-image-generic-hwe-20.04:
  Installed: 5.8.0.36.40~20.04.21
  Candidate: 5.8.0.36.40~20.04.21
  Version table:
 *** 5.8.0.36.40~20.04.21 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
        100 /var/lib/dpkg/status
     5.4.0.26.32 500
        500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages

Let me know if I can provide any more info :-)

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

It's the TOSHIBA-RD400 on /home for me that's failing.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Is this a regression? Did it start to happen after the upgrade from 5.4 to 5.8?

And is it possible to attach `lspci -vv` after the issue happens?

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

@kaihengfeng Yes, this is a regression after the upgrade from 5.4 to 5.8. After the upgrade I hit it multiple times, and now that I have switched back to 5.4 my machine is stable again.

I do not think I can run `lspci -vv` *after* the issue happens, as my NVMe drive goes read-only, so all commands fail.

This is the output of `sudo lspci -vv` on the kernel 5.4 and *before* it happens https://pastebin.ubuntu.com/p/tCshwbhpqs/ Let me know if also running this on 5.8 *before* it happens could be useful or not.

@popey are you able to run this command before and after it happens on your dual-disk system?

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

I can try, but I can't trigger it on demand. I had 60 days of uptime on my system before it happened last time, and 12 days the time before that; that gives you some idea of the interval between occurrences.

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

Note that for me it happens quite rapidly (sometimes after 5-10 minutes of high disk load). E.g. the first times it happened were when apt was running update-grub, and then when pip3 install was running. Then, to capture the logs above, I started a `find /` and a `find ~` at the same time, and this was enough to break it.

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

I've tried doing various IO intensive things to trigger it but no luck yet.

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

FYI, I have captured the `sudo lspci -vv` output on kernel 5.8 *before* the issue here: https://pastebin.ubuntu.com/p/GtZyTWzKTd/ It is subtly different from the 5.4 kernel output (which has not had the issue), in case that matters.

I was also able to reproduce the issue again by causing high disk I/O; specifically, I needed writes to be occurring for it to happen (I was recursively grepping the whole filesystem while installing apt/pip packages inside a Docker container).

This froze the system for 120 seconds until write timeouts occurred, then the disk was remounted read-only. After this point most commands on the system would fail with I/O errors (even basic ones such as "top", although some, such as "mount", still worked).

However, our plan was to try to retrieve more information by copying the lspci binary and its libraries into a tmpfs in RAM, so it would still be accessible when the disk stopped. This almost worked, but it appears a few more configuration files would need to be placed in RAM (I could run "lspci --help" but not "lspci" or "lspci -vv"). Instead, popey has suggested maybe using a USB key with debootstrap/chroot. (Any suggestions for how we can retrieve more information at this point are welcome, as are any commands that would be useful to run.)
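
A sketch of that tmpfs approach, untested against an actual failure and assuming /tmp/rescue as the scratch mount point (invoking the copied dynamic loader directly sidesteps some on-disk lookups; pci.ids is optional, since lspci falls back to numeric IDs without it):

$ mkdir -p /tmp/rescue
$ sudo mount -t tmpfs -o size=64M tmpfs /tmp/rescue
$ cp /usr/bin/lspci /tmp/rescue/
$ # copy every shared library lspci links against, plus the dynamic loader itself
$ for lib in $(ldd /usr/bin/lspci | grep -o '/[^ ]*'); do cp "$lib" /tmp/rescue/; done
$ cp /usr/share/misc/pci.ids /tmp/rescue/ 2>/dev/null || true
$ # after the disk drops out, run it via the copied loader:
$ /tmp/rescue/ld-linux-x86-64.so.2 --library-path /tmp/rescue /tmp/rescue/lspci -vv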

Also as a note, if I use REISUB ( https://en.m.wikipedia.org/wiki/Magic_SysRq_key#Uses ) to reboot the machine it enters a Dell BIOS/recovery thing that states that "No Hard Disk is found". Then after a full power off the machine works again.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Andrew, since you can reliably reproduce the issue, can you please test the latest mainline kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.11-rc3/amd64/

And we'll do a bisect or reverse-bisect based on the result.

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

@kaihengfeng

I have found that running the command "fio --name=basic --directory=/path/to/empty/directory --size=1G --rw=randrw --numjobs=4 --loops=5" runs fine on linux-image-5.4.0-59-generic, but when trying it with linux-image-5.8.0-36-generic it would freeze the system in the "Laying out IO file" stage. I checked on two subsequent boots that 5.8 does fail like this on an empty directory, and will now use this as my "test" for whether a kernel works or not.

I have installed the 5.11-rc3 mainline kernel you linked; note that I had to disable Secure Boot to be able to use it. This kernel worked successfully on two boots with the fio test above.

So in summary so far on my system with the fio test:
linux-image-5.4.0-59-generic: PASS
linux-image-5.8.0-36-generic: FAIL
linux-image-unsigned-5.11.0-051100rc3-generic: PASS

Please advise how to proceed here: should I start manually picking (bisecting) kernels between 5.8 and 5.11, or between 5.4 and 5.8?

I guess I should also try mainline 5.8, to ensure that the Ubuntu patches aren't what is causing the issue?

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

OK, so https://people.canonical.com/~kernel/info/kernel-version-map.html states that Ubuntu kernel 5.8.0-36.40~20.04.1 matches mainline version 5.8.18. I have installed 5.8.18 and it fails! So it is not the Ubuntu patches.

Ubuntu Kernels:
linux-image-5.4.0-59-generic: PASS
linux-image-5.8.0-36-generic: FAIL

Mainline Kernels:
linux-image-unsigned-5.8.18-050818-generic: FAIL
linux-image-unsigned-5.11.0-051100rc3-generic: PASS

I'll see if I can find where it changes from FAIL to PASS between 5.8.18 and 5.11 in the mainline kernels. Please advise if I should also/instead compare between 5.4 and 5.8.18 :-)
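
For reference, installing a mainline build from kernel.ubuntu.com is just fetching the .deb files from the version's amd64 directory and installing them; a sketch for 5.8.18 (the bracketed filenames are placeholders, copy the exact names from the directory index):

$ wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.8.18/amd64/<linux-modules-...-generic_..._amd64.deb>
$ wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.8.18/amd64/<linux-image-unsigned-...-generic_..._amd64.deb>
$ sudo dpkg -i linux-modules-*.deb linux-image-unsigned-*.deb
$ sudo reboot   # then select the new kernel from the GRUB menu if it is not the default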

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

So bisecting between 5.8.18 (bad) and 5.11-rc3 (good).

The following results with the mainline kernel
v5.11-rc3/ PASS
v5.9.12/ PASS
v5.9.10/ PASS
v5.9.9/ MISSING
v5.9.8/ FAIL (could not boot long enough for full test)
v5.9.7/ FAIL (could not boot long enough for full test)
v5.9.2/ FAIL (could not boot long enough for full test)
v5.8.18/ FAIL

Note that 5.9.2, 5.9.7, 5.9.8 all crashed during either boot or logging in (but after performing REISUB they all entered the Dell BIOS/recovery stating that the hard disk could not be found, so I assume this is the same failure).

From these results it appears that between 5.9.8 and 5.9.10 it was fixed.

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

And the bisect between 5.4.78 (good) and 5.8.18 (bad).

The following results with the mainline kernel
v5.8.18/ FAIL
v5.8.4/ FAIL
v5.8-rc5/ FAIL
v5.8-rc1/ FAIL
v5.7.19/ PASS
v5.7.18/ PASS
v5.7.16/ PASS
v5.6.14/ PASS
v5.4.78/ PASS

From these and the previous comment's results, it appears that the issue was introduced in 5.8-rc1 and then fixed in 5.9.9 or 5.9.10. (It is unfortunate that the 5.9.9 build is missing, so I cannot try it.)

@kaihengfeng let me know if there is any other information I can provide.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Thanks a lot!
Can you please test v5.7? Stable (point) releases aren't linear with the mainline kernel.

Once you are sure v5.7 is good, we can start a bisect:
$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect good v5.7
$ git bisect bad v5.8-rc1
$ make localmodconfig
$ make -j`nproc` deb-pkg
Install the newly built kernel, then reboot with it.
If it still has the same issue,
$ git bisect bad
Otherwise,
$ git bisect good
Repeat from "make -j`nproc` deb-pkg" until you find the offending commit.

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

@kaihengfeng

So v5.7 was fine, and after many reboots I have found that the commit below introduced the issue.

Do I also need to find when the issue was resolved (between v5.8-rc1 and v5.9.10), or is this information enough?

54b2fcee1db041a83b52b51752dade6090cf952f is the first bad commit
commit 54b2fcee1db041a83b52b51752dade6090cf952f
Author: Keith Busch <email address hidden>
Date: Mon Apr 27 11:54:46 2020 -0700

    nvme-pci: remove last_sq_tail

    The nvme driver does not have enough tags to wrap the queue, and blk-mq
    will no longer call commit_rqs() when there are no new submissions to
    notify.

    Signed-off-by: Keith Busch <email address hidden>
    Reviewed-by: Sagi Grimberg <email address hidden>
    Signed-off-by: Christoph Hellwig <email address hidden>
    Signed-off-by: Jens Axboe <email address hidden>

 drivers/nvme/host/pci.c | 23 ++++-------------------
 1 file changed, 4 insertions(+), 19 deletions(-)

And my $ git bisect log is the following FWIW.
git bisect start
# good: [3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162] Linux 5.7
git bisect good 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
# bad: [b3a9e3b9622ae10064826dccb4f7a52bd88c7407] Linux 5.8-rc1
git bisect bad b3a9e3b9622ae10064826dccb4f7a52bd88c7407
# bad: [ee01c4d72adffb7d424535adf630f2955748fa8b] Merge branch 'akpm' (patches from Andrew)
git bisect bad ee01c4d72adffb7d424535adf630f2955748fa8b
# bad: [16d91548d1057691979de4686693f0ff92f46000] Merge tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad 16d91548d1057691979de4686693f0ff92f46000
# good: [cfa3b8068b09f25037146bfd5eed041b78878bee] Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
git bisect good cfa3b8068b09f25037146bfd5eed041b78878bee
# good: [3fd911b69b3117e03181262fc19ae6c3ef6962ce] Merge tag 'drm-misc-next-2020-05-07' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good 3fd911b69b3117e03181262fc19ae6c3ef6962ce
# good: [1966391fa576e1fb2701be8bcca197d8f72737b7] mm/migrate.c: attach_page_private already does the get_page
git bisect good 1966391fa576e1fb2701be8bcca197d8f72737b7
# bad: [0c8d3fceade2ab1bbac68bca013e62bfdb851d19] bcache: configure the asynchronous registertion to be experimental
git bisect bad 0c8d3fceade2ab1bbac68bca013e62bfdb851d19
# bad: [84b8d0d7aa159652dc191d58c4d353b6c9173c54] nvmet: use type-name map for ana states
git bisect bad 84b8d0d7aa159652dc191d58c4d353b6c9173c54
# good: [72e6329f86c714785ac195d293cb19dd24507880] nvme-fc and nvmet-fc: revise LLDD api for LS reception and LS request
git bisect good 72e6329f86c714785ac195d293cb19dd24507880
# good: [e4fcc72c1a420bdbe425530dd19724214ceb44ec] nvmet-fc: slight cleanup for kbuild test warnings
git bisect good e4fcc72c1a420bdbe425530dd19724214ceb44ec
# good: [31fdad7be18992606078caed6ff71741fa76310a] nvme: consolodate io settings
git bisect good 31fdad7be18992606078caed6ff71741fa76310a
# bad: [2a5bcfdd41d68559567cec3c124a75e093506cc1] nvme-pci: align io queue count with allocted nvme_queue in nvme_probe
git bisect bad 2a5bcfdd41d68559567cec3c124a75e093506cc1
# good: [6623c5b3dfa5513190d729a8516db7a5163ec7de] nvme: clean up error handling in nvme_...


Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

OK, the fix will be in next 5.8 update:
commit f62ddacc4cb141b86ed647f9dd9eeb7653b0cc43
Author: Keith Busch <email address hidden>
Date: Fri Oct 30 10:28:54 2020 -0700

    Revert "nvme-pci: remove last_sq_tail"

    BugLink: https://bugs.launchpad.net/bugs/1908555

    [ Upstream commit 38210800bf66d7302da1bb5b624ad68638da1562 ]

    Multiple CPUs may be mapped to the same hctx, allowing mulitple
    submission contexts to attempt commit_rqs(). We need to verify we're
    not writing the same doorbell value multiple times since that's a spec
    violation.

    Revert commit 54b2fcee1db041a83b52b51752dade6090cf952f.

    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1878596
    Reported-by: "B.L. Jones" <email address hidden>
    Signed-off-by: Keith Busch <email address hidden>
    Signed-off-by: Sasha Levin <email address hidden>
    Signed-off-by: Kamal Mostafa <email address hidden>
    Signed-off-by: Ian May <email address hidden>

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

@kaihengfeng Thanks for the quick response! Bug 1908555 linked there only lists groovy as a target series; I hope this will also be applied to the focal HWE kernel :-)

Also I am happy to test any kernel in a -proposed channel or PPA to confirm it fixes the issue if that helps :-)

Revision history for this message
Terry Rudd (terrykrudd) wrote :

Andrew, we plan to address this in the Focal 5.8 hwe kernel and we're going to be building a test kernel. We would really appreciate you testing it since you have a reliable reproducer.

Revision history for this message
Marcelo Cerri (mhcerri) wrote :

Hi, Andrew.

I created a test kernel with the fix and it is available at:

https://kernel.ubuntu.com/~mhcerri/lp1910866_linux-5.8.0-38-generic_5.8.0-38.43+lp1910866_amd64.tar.gz

You can install it by extracting the tarball and installing the Debian packages:

$ tar xf lp1910866_linux-5.8.0-38-generic_5.8.0-38.43+lp1910866_amd64.tar.gz
$ sudo apt install ./*.deb

Please let us know if the test kernel solves the problem.

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

Thanks! I'll take a look :-)

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

@Marcelo So far it looks good :-) It passes the "fio" command test when A/B testing between a known bad kernel and this new kernel. I will continue running it on this machine over the weekend to ensure longer usage doesn't have any remaining issues - but looks like it resolves the issue so far :-D Thanks!

$ uname -a
Linux xps-13-9360 5.8.0-38-generic #43+lp1910866 SMP Fri Jan 15 20:29:27 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Changed in linux (Ubuntu Groovy):
status: New → In Progress
Changed in linux (Ubuntu Groovy):
status: In Progress → Fix Committed
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Thank you Andrew for your feedback!

We have applied the fix for groovy/linux (and focal/linux-hwe-5.8) and the new kernels will be available in -proposed soon. These packages are planned to be promoted to -updates early next week.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-groovy' to 'verification-done-groovy'. If the problem still exists, change the tag 'verification-needed-groovy' to 'verification-failed-groovy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!
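
In short, the procedure on that wiki page amounts to enabling the -proposed pocket and then installing only the kernel from it; a sketch (the release codename and the linux-generic meta-package are assumptions, adjust to your system):

$ echo "deb http://archive.ubuntu.com/ubuntu $(lsb_release -cs)-proposed main universe" | \
    sudo tee /etc/apt/sources.list.d/proposed.list
$ sudo apt update
$ # pull just the kernel meta-package from -proposed, not a full upgrade
$ sudo apt install -t $(lsb_release -cs)-proposed linux-generic
$ sudo reboot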

tags: added: verification-needed-groovy
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hello Alan or anyone else affected,

The fix for this bug is also available on the hwe kernel for Focal currently in -proposed (version 5.8.0-41.46~20.04.1). Feedback whether this kernel fixes the nvme issue would be appreciated.

Thank you.

Revision history for this message
Andrew Hayzen (ahayzen) wrote :

@Kleber I have installed the focal hwe kernel from proposed (as seen below). So far when A/B testing this kernel it is working correctly :-) I will continue running this kernel and report any issues I have.

Also note that I have been continuously running the test kernel (from comment 22) since last week and it has worked perfectly so far :-)

I look forward to this migrating from -proposed into focal.

$ uname -a
Linux xps-13-9360 5.8.0-41-generic #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ apt policy linux-generic-hwe-20.04
linux-generic-hwe-20.04:
  Installed: 5.8.0.41.46~20.04.27
  Candidate: 5.8.0.41.46~20.04.27
  Version table:
 *** 5.8.0.41.46~20.04.27 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     5.8.0.40.45~20.04.25 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     5.8.0.38.43~20.04.23 500
        500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
     5.4.0.26.32 500
        500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages

Revision history for this message
Kelsey Steele (kelsey-steele) wrote :

@Andrew, thank you for testing! I'm switching verification status to 'verification-done-groovy'.

tags: added: verification-done-groovy
removed: verification-needed-groovy
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.8.0-41.46

---------------
linux (5.8.0-41.46) groovy; urgency=medium

  * groovy/linux: 5.8.0-41.46 -proposed tracker (LP: #1912219)

  * Groovy update: upstream stable patchset 2020-12-17 (LP: #1908555) // nvme
    drive fails after some time (LP: #1910866)
    - Revert "nvme-pci: remove last_sq_tail"

  * initramfs unpacking failed (LP: #1835660)
    - SAUCE: lib/decompress_unlz4.c: correctly handle zero-padding around initrds.

  * overlay: permission regression in 5.4.0-51.56 due to patches related to
    CVE-2020-16120 (LP: #1900141)
    - ovl: do not fail because of O_NOATIME

 -- Kleber Sacilotto de Souza <email address hidden> Mon, 18 Jan 2021 17:01:08 +0100

Changed in linux (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

I'm seeing this in the focal kernel 5.4.0-88. Is this expected? Do I have to switch to the HWE kernel pointed at above to fix this?

The laptop had been stable for a long time and then suddenly started showing this exact symptom a few days ago. I'm wondering if this was introduced in the latest GA kernels for focal or if it was always there.

I'm not sure why I am still on the GA kernel to this date, probably because I have been upgrading from version to version for a long time. Anyway, if this is still present in the GA kernel for focal today, I think it is a problem that needs fixing.

Revision history for this message
Joshua Sjoding (joshua.sjoding) wrote :

We have an Ubuntu server running a set of eight Samsung 980 Pro PCIe 4.0 NVMe SSDs (model MZ-V8P1T0BW) on Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-88-generic x86_64). We've seen this happen at least 5 times over the past month, and not always on the same SSD. We first saw it happen on 5.4.0-81. Some samples from dmesg are below.

This is a production system that runs a set of virtual desktop instances. Thankfully we use these SSDs in a ZFS pool with four mirrored (RAID 1) vdev pairs, so the only outage we've had so far was when it hit both members of a mirrored pair. After a reboot the SSDs come back up.

[Mon Sep 6 12:58:36 2021] nvme nvme5: I/O 132 QID 46 timeout, aborting
[Mon Sep 6 12:58:37 2021] nvme nvme5: I/O 133 QID 46 timeout, aborting
[Mon Sep 6 12:58:39 2021] nvme nvme5: I/O 134 QID 46 timeout, aborting
[Mon Sep 6 12:58:40 2021] nvme nvme5: I/O 135 QID 46 timeout, aborting
[Mon Sep 6 12:58:40 2021] nvme nvme5: I/O 784 QID 48 timeout, aborting
[Mon Sep 6 12:58:41 2021] nvme nvme5: I/O 136 QID 46 timeout, aborting
[Mon Sep 6 12:58:41 2021] nvme nvme5: I/O 137 QID 46 timeout, aborting
[Mon Sep 6 12:58:42 2021] nvme nvme5: I/O 492 QID 28 timeout, aborting
[Mon Sep 6 12:59:07 2021] nvme nvme5: I/O 132 QID 46 timeout, reset controller
[Mon Sep 6 12:59:38 2021] nvme nvme5: I/O 24 QID 0 timeout, reset controller
[Mon Sep 6 13:00:29 2021] nvme nvme5: Device not ready; aborting reset
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:33 2021] INFO: task txg_quiesce:2172 blocked for more than 120 seconds.
[Mon Sep 6 13:00:33 2021] Tainted: P OE 5.4.0-81-generic #91-Ubuntu

[Tue Sep 21 21:18:36 2021] nvme nvme2: I/O 175 QID 38 timeout, aborting
[Tue Sep 21 21:18:37 2021] nvme nvme2: I/O 240 QID 26 timeout, aborting
[Tue Sep 21 21:18:47 2021] nvme nvme2: I/O 718 QID 23 timeout, aborting
[Tue Sep 21 21:18:56 2021] nvme nvme2: I/O 719 QID 23 timeout, aborting
[Tue Sep 21 21:19:06 2021] nvme nvme2: I/O 175 QID 38 timeout, reset controller
[Tue Sep 21 21:19:37 2021] nvme nvme2: I/O 17 QID 0 timeout, reset controller
[Tue Sep 21 21:20:27 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:47 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:47 2021] nvme nvme2: Removing after probe failure status: -19
[Tue Sep 21 21:21:08 2021] nvme nvme2: Device not ready; aborting reset

[Tue Oct 5 16:54:59 2021] nvme nvme6: I/O 1013 QID 38 timeout, aborting
[Tue Oct 5 16:54:59 2021] nvme nvme6: I/O 727 QID 39 timeout, aborting
[Tue Oct 5 16:55:03 20...


Revision history for this message
João Pedro Seara (jpseara) wrote :

Hello all.

I think this needs some serious attention. I just observed this same issue on my secondary NVMe drive on a ThinkPad T480s running Ubuntu 22.04 @ 5.15.0-40-generic.

Revision history for this message
João Pedro Seara (jpseara) wrote :

Still observing the same issue I described in my comment above, now upgraded to 5.15.0-46-generic.

This is very frustrating.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

João Pedro Seara,

Does this issue only happen after system sleep?

Revision history for this message
João Pedro Seara (jpseara) wrote :

Hello Kai-Heng,

Unrelated to system sleep.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please attach dmesg? Thanks!

Revision history for this message
João Pedro Seara (jpseara) wrote :

Attaching dmesg.

Revision history for this message
João Pedro Seara (jpseara) wrote :

Also attaching journalctl, which shows the problem happening several times. One of them: set 01 01:20:02 (September 1st, 01:20:02).

Revision history for this message
João Pedro Seara (jpseara) wrote :

There you go, Kai-Heng. Thanks.

Revision history for this message
João Pedro Seara (jpseara) wrote :

Hello, all.

I have solved this issue for now by applying the following workaround:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"

As per: https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Controller_failure_due_to_broken_APST_support
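
For anyone applying the same workaround: the line goes in /etc/default/grub, and it only takes effect after regenerating the GRUB configuration and rebooting. A sketch, with a check that the parameter is active:

$ sudo editor /etc/default/grub    # set GRUB_CMDLINE_LINUX_DEFAULT as above
$ sudo update-grub
$ sudo reboot
$ # after reboot:
$ cat /proc/cmdline
$ cat /sys/module/nvme_core/parameters/default_ps_max_latency_us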

Thanks,
JP

Revision history for this message
Pete (peter-kruse) wrote :

Hello,
Same for me. I solved the issue with the NVMe drive on kernel 5.19.0-28, likewise with the parameter:
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0"

Kind regards,
Peter

Revision history for this message
Pete (peter-kruse) wrote :

Kernel 5.19.0-28 FAIL

Revision history for this message
João Pedro Seara (jpseara) wrote (last edit ):

Pete,

It seems [1] suggests that in the latest kernel revisions the issue *may* be fixed. The same page links to the commit discussion, and it seems the fix only relates to Kingston drives, which is not my case.

Anyway, I will test the newest kernel for Jammy (5.15.0-57) and report back.

JP

[1] https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Controller_failure_due_to_broken_APST_support

Revision history for this message
João Pedro Seara (jpseara) wrote :

Well, no occurrences since my last post on Jan 19. Seems that something changed for the better.

Revision history for this message
João Pedro Seara (jpseara) wrote :

I spoke too soon. The problem still appears. :-)

Revision history for this message
faattori (fatordee) wrote :

I have just encountered this bug with Ubuntu 23.04 with kernel 6.2.0-20-generic. System is using default settings.

The NVMe is a Samsung SSD 970 EVO Plus 2TB with the latest 4B2QEXM7 firmware, which it has had since it left the factory.
The motherboard is an Asus TUF GAMING X670E-PLUS WIFI with firmware 1410.

My issue happens only after an extended period of time: more than a week, plus or minus a day or two.

The system turns read-only, and the last thing I see in journalctl -f is this:

touko 27 03:21:01 cereza kernel: nvme nvme0: I/O 657 (I/O Cmd) QID 14 timeout, aborting
touko 27 03:21:01 cereza kernel: nvme nvme0: Abort status: 0x0
touko 27 03:21:31 cereza kernel: nvme nvme0: I/O 657 (I/O Cmd) QID 14 timeout, aborting
touko 27 03:21:31 cereza kernel: nvme nvme0: Abort status: 0x0
touko 27 03:21:35 cereza kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller

I have now set nvme_core.default_ps_max_latency_us=1200 to see if the issue appears again, since according to smartctl this should disable the lowest power state of the drive:

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 7.59W - - 0 0 0 0 0 0
 1 + 7.59W - - 1 1 1 1 0 200
 2 + 7.59W - - 2 2 2 2 0 1000
 3 - 0.0500W - - 3 3 3 3 2000 1200
 4 - 0.0050W - - 4 4 4 4 500 9500

Revision history for this message
João Pedro Seara (jpseara) wrote :

@fatordee, please keep us updated.

Revision history for this message
João Pedro Seara (jpseara) wrote (last edit ):

I have implemented a workaround similar to @fatordee's:

$ sudo smartctl -a /dev/nvme0
(...)
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 3.00W - - 0 0 0 0 0 0
 1 + 2.60W - - 1 1 1 1 0 0
 2 + 1.70W - - 2 2 2 2 0 0
 3 - 0.0250W - - 3 3 3 3 5000 9000
 4 - 0.0025W - - 4 4 4 4 5000 44000
(...)

$ cat /etc/default/grub | grep latency
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=9000"

I used Ex_Lat from the state right before the last one, as per [1].

It's a less aggressive workaround, as this one just disables the lowest power state, instead of them all.

Seems to be working pretty well.

[1] https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Controller_failure_due_to_broken_APST_support
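
To confirm which power states APST actually uses after the change, the feature table can be read back with nvme-cli (assuming the nvme-cli package is installed and the drive is nvme0):

$ sudo apt install nvme-cli
$ # feature 0x0c is Autonomous Power State Transition; -H prints it human-readable
$ sudo nvme get-feature /dev/nvme0 -f 0x0c -H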
