Fans switching on and off every 10 seconds after update to kernel 5.8.0-34

Bug #1910562 reported by munbi
This bug affects 2 people
Affects                              Status      Importance  Assigned to  Milestone
linux (Ubuntu)                       Incomplete  Undecided   Unassigned
linux-signed-hwe-5.8 (Ubuntu)        Confirmed   Undecided   Unassigned
lm-sensors (Ubuntu)                  New         Undecided   Unassigned
xserver-xorg-video-amdgpu (Ubuntu)   Invalid     Undecided   Unassigned

Bug Description

After updating via apt dist-upgrade from kernel 5.4.0-59 to kernel 5.8.0-34 the fan on my machine started switching on (for an instant) and off every 10 seconds even when idle with CPU at 48/50°C.

Switching back to the previous kernel temporarily solves the problem, i.e. the fans stay off during light desktop work.

The new behavior is really annoying and I guess not healthy for the fans.

I'm on the latest Dell BIOS, with every other package up to date.

Did something in the thermal policies change between those two kernels?
Is it possible to go back to the previous behavior?

Thanks in advance for help

Machine: Dell Precision 7540, Intel i7-9750

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.8.0-34-generic 5.8.0-34.37~20.04.2
ProcVersionSignature: Ubuntu 5.8.0-34.37~20.04.2-generic 5.8.18
Uname: Linux 5.8.0-34-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.14
Architecture: amd64
CasperMD5CheckResult: skip
CurrentDesktop: ubuntu:GNOME
Date: Thu Jan 7 17:14:54 2021
InstallationDate: Installed on 2020-05-06 (246 days ago)
InstallationMedia: Ubuntu 20.04 LTS "Focal Fossa" - Release amd64 (20200423)
SourcePackage: linux-signed-hwe-5.8
UpgradeStatus: No upgrade log present (probably fresh install)
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu27.16
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: gabriele 2370 F.... pulseaudio
 /dev/snd/controlC1: gabriele 2370 F.... pulseaudio
CasperMD5CheckResult: skip
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 20.04
InstallationDate: Installed on 2020-05-06 (352 days ago)
InstallationMedia: Ubuntu 20.04 LTS "Focal Fossa" - Release amd64 (20200423)
MachineType: Dell Inc. Precision 7540
Package: lm-sensors 1:3.6.0-2ubuntu1
PackageArchitecture: amd64
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.8.0-50-generic root=UUID=c39f518b-f5c9-47c5-8e7f-42d970d2dedb ro quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 5.8.0-50.56~20.04.1-generic 5.8.18
RelatedPackageVersions:
 linux-restricted-modules-5.8.0-50-generic N/A
 linux-backports-modules-5.8.0-50-generic N/A
 linux-firmware 1.187.11
Tags: focal
Uname: Linux 5.8.0-50-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip docker lpadmin lxd plugdev sambashare sudo vboxusers wireshark
_MarkForUpload: True
dmi.bios.date: 01/08/2021
dmi.bios.release: 1.11
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.11.2
dmi.board.name: 0XMC3F
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 10
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.11.2:bd01/08/2021:br1.11:svnDellInc.:pnPrecision7540:pvr:rvnDellInc.:rn0XMC3F:rvrA00:cvnDellInc.:ct10:cvr:
dmi.product.family: Precision
dmi.product.name: Precision 7540
dmi.product.sku: 0926
dmi.sys.vendor: Dell Inc.

Revision history for this message
munbi (gabriele) wrote :

Kernel 5.4 lsmod, lspci, dmesg, sensors output

Revision history for this message
munbi (gabriele) wrote :

Kernel 5.8 lsmod, lspci, dmesg, sensors output

description: updated
Revision history for this message
munbi (gabriele) wrote :

Just as a note, the problem still persists in kernel 5.8.0-40.45

Revision history for this message
munbi (gabriele) wrote :

So, after testing several different kernels and live distros to pinpoint this bug, I finally found the problem: it's an interaction between lm-sensors and the amdgpu driver with kernels > 5.4.0.

I found out by chance because I noticed the problem happened only after logging in with a graphical session.

This is what is happening:
- a GNOME extension that monitors sensors/temperatures calls the 'sensors' utility from the lm-sensors package every 10 seconds
- sensors 'hangs' for a couple of seconds when poking something related to the amdgpu driver
- the amdgpu driver prints some warnings/errors on the VT console and in dmesg
- the fans start spinning for about one second
- sensors then continues normally, displaying the readouts from the other sensors
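
To see which chip stalls without going through the GNOME extension, the hwmon attributes of each chip can be read and timed directly. The following is only a sketch under assumptions: it relies on the standard /sys/class/hwmon layout (a `name` file plus `*_input` files per chip), and the function name `time_hwmon_reads` and the 0.5 s threshold are made up for this illustration.

```python
import glob
import os
import time


def time_hwmon_reads(root="/sys/class/hwmon"):
    """Return {chip_name: seconds spent reading all of its *_input files}."""
    timings = {}
    for chip_dir in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                name = f.read().strip()
        except OSError:
            continue  # not a readable hwmon entry
        start = time.monotonic()
        for attr in sorted(glob.glob(os.path.join(chip_dir, "*_input"))):
            try:
                with open(attr) as f:
                    f.read()
            except OSError:
                pass  # some attributes return an error instead of a value
        elapsed = time.monotonic() - start
        # Chips that share a name (hypothetical case) are summed together.
        timings[name] = timings.get(name, 0.0) + elapsed
    return timings


if __name__ == "__main__":
    for name, seconds in time_hwmon_reads().items():
        flag = "  <-- slow" if seconds > 0.5 else ""
        print(f"{name:30s} {seconds:6.3f} s{flag}")
```

If the behaviour described above holds, the amdgpu entry would be expected to take noticeably longer than the others, but that is something to verify on the affected machine rather than a given.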

This is the output of 'sensors', taken in a non-graphical console (Ctrl+Alt+F3) with kernel 5.8.0-41:

ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +0.00 A)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +37.0°C

ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +0.00 A)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1: +55.0°C

BAT0-acpi-0
Adapter: ACPI interface
in0: 8.48 V
curr1: 1000.00 uA

amdgpu-pci-0100
Adapter: PCI adapter
[ 112.780951] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.380939] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
vddgfx: 1.05 V
edge: +44.0°C (crit = +94.0°C, hyst = -273.1°C)
power1: 7.12 W (cap = 35.00 W)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +47.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +46.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +45.0°C (high = +100.0°C, crit = +100.0°C)
Core 5: +45.0°C (high = +100.0°C, crit = +100.0°C)

dell_smm-virtual-0
Adapter: Virtual device
fan1: 2480 RPM
fan2: 2471 RPM

nvme-pci-0200
Adapter: PCI adapter
Composite: +46.9°C (low = -273.1°C, high = +69.8°C)
                       (crit = +79.8°C)
Sensor 1: +46.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +47.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 5: +66.8°C (low = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: +25.0°C (crit = +107.0°C)

This is the complete kernel log from amdgpu when this happens:

[ 111.572873] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 112.780951] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.380939] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.411556] [drm] UVD and UVD ENC initialized successfully.
[ 113.521534] [drm] VCE initialized successfully.

It seems that lm-sensors poking t...


Revision history for this message
Paride Legovini (paride) wrote :

Hello munbi and thanks for this bug report. I marked the other two bugs you filed as duplicates of this one, and added "tasks" for the packages you filed the other bugs against. One bug report can track a problem across several packages, so there's no need for multiple reports.

I don't think xserver-xorg-video-amdgpu is really involved here. If you can reproduce this issue without logging in to the graphical console, i.e. by calling `sensors` from a tty, and you agree that this proves that Xorg is not involved, then please change the status of the xserver-xorg-video-amdgpu task to Invalid.

My feeling is that we're seeing a kernel issue here. If you can reproduce it with Hirsute (the current Ubuntu devel version) then I think we can add a "regular" kernel task, as the problem is not HWE-specific.

Currently linux-image-generic 5.10.0.14.16 is in the Hirsute -proposed pocket. Could you try to test that newer kernel and report how it behaves?

I'm setting the status of this bug to Incomplete for the moment, waiting for the results of your testing.

Changed in linux-signed-hwe-5.8 (Ubuntu):
status: New → Incomplete
Changed in lm-sensors (Ubuntu):
status: New → Incomplete
Changed in xserver-xorg-video-amdgpu (Ubuntu):
status: New → Incomplete
Revision history for this message
munbi (gabriele) wrote :

First of all, thanks for your time.

> One bug report can track a problem across several packages, so there's no need for multiple reports.

Sorry for the noise :-)

> If you can reproduce it with Hirsute

I can reproduce the problem using linux-image-generic 5.10.0.14.16 from hirsute in a tty.
To be sure I did it right, these are the steps I followed:

1. added this repo to the list of repos:
deb http://archive.ubuntu.com/ubuntu/ hirsute-proposed restricted main multiverse universe

2. updated the apt package lists and installed these packages:
linux-headers-generic/hirsute-proposed
linux-modules-5.10.0-14-generic/hirsute-proposed
linux-modules-extra-5.10.0-14-generic/hirsute-proposed
linux-image-5.10.0-14-generic/hirsute-proposed

3. Rebooted, chose kernel 5.10 from the advanced list in GRUB

4. At the GNOME login prompt, switched to a tty with Ctrl+Alt+F3

5. Checked kernel and packages:
$ uname -a
Linux bestia3 5.10.0-14-generic #15-Ubuntu SMP Fri Jan 29 15:10:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg -l | grep 5.10.0
ii linux-headers-5.10.0-14 5.10.0-14.15 all Header files related to Linux kernel version 5.10.0
ii linux-headers-5.10.0-14-generic 5.10.0-14.15 amd64 Linux kernel headers for version 5.10.0 on 64 bit x86 SMP
ii linux-headers-generic 5.10.0.14.16 amd64 Generic Linux kernel headers
ii linux-image-5.10.0-14-generic 5.10.0-14.15 amd64 Signed kernel image generic
ii linux-modules-5.10.0-14-generic 5.10.0-14.15 amd64 Linux kernel extra modules for version 5.10.0 on 64 bit x86 SMP
ii linux-modules-extra-5.10.0-14-generic 5.10.0-14.15 amd64 Linux kernel extra modules for version 5.10.0 on 64 bit x86 SMP

Now, calling 'sensors' from console still causes the problem, i.e. the output of sensors hangs a couple of seconds just after printing 'amdgpu-pci-0100' and the fans start spinning.
The only difference between kernel 5.10 and kernel 5.8 is that I don't see the kernel error messages directly on the tty, but they are present in dmesg:

[ 25.039603] vboxdrv: TSC mode is Invariant, tentative frequency 2592005345 Hz
[ 25.039607] vboxdrv: Successfully loaded version 6.1.18 (interface 0x00300000)
[ 25.251013] VBoxNetFlt: Successfully started.
[ 25.253147] VBoxNetAdp: Successfully started.
[ 41.364575] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 41.493122] [drm] UVD and UVD ENC initialized successfully.
[ 41.603133] [drm] VCE initialized successfully.
[ 41.609182] amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
[ 41.680294] rfkill: input handler enabled
[ 47.719525] Bluetooth: RFCOMM TTY layer initialized
[ 47.719530] Bluetooth: RFCOMM socket layer initialized
[ 47.719535] Bluetooth: RFCOMM ver 1.11
***** END OF BOOT ******

***** Called 'sensors' in tty vvv ******
[ 96.156579] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 96.285097] [drm]...


Changed in xserver-xorg-video-amdgpu (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
munbi (gabriele) wrote :

Hi Paride, were the tests and logs above sufficient to validate the bug? Can I do anything more to help look into this issue?

Regards, Gabriele.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux-signed-hwe-5.8 (Ubuntu) because there has been no activity for 60 days.]

Changed in linux-signed-hwe-5.8 (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for lm-sensors (Ubuntu) because there has been no activity for 60 days.]

Changed in lm-sensors (Ubuntu):
status: Incomplete → Expired
Revision history for this message
munbi (gabriele) wrote :

Please do not mark as expired. The problem is not resolved and I did all the tests requested.
If there is anything more I can do to help track down the problem, please feel free to ask.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Hi Gabriele,

Thank you for the bug report and the help with the investigation!

I apologize that this bug was marked as Expired; this is done by an automatic process after a period of inactivity. Somehow this fell through the cracks of our triage process, but now we're on it again :-).

Thank you for following up on Paride's request for further tests. I'm wondering here: I know that the 'sensors' command lets you specify which chips you want to get the temperature for, and given that this problem seems to happen every time you execute the command, I think it's worth trying to pinpoint which chip is causing the problem. Could you please try to run the command several times, specifying a different chip name each time, and see what happens? Maybe there's nothing to it, but we never know.
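
For reference, 'sensors' takes chip names as positional arguments (for example `sensors amdgpu-pci-0100`), and the chip name is the line printed just before each "Adapter:" line. A minimal sketch of how the per-chip runs could be generated; the helper name `chip_names` and the shortened sample text are illustrative only, not part of lm-sensors:

```python
def chip_names(sensors_output):
    """Extract chip names: the non-empty line right before each 'Adapter:' line."""
    names = []
    previous = ""
    for line in sensors_output.splitlines():
        if line.startswith("Adapter:") and previous.strip():
            names.append(previous.strip())
        previous = line
    return names


# Shortened sample in the same shape as the 'sensors' output in this report.
SAMPLE = """\
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +37.0°C

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx: 1.05 V

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +51.0°C
"""

for name in chip_names(SAMPLE):
    # Each printed line is a shell command that queries one chip only.
    print(f"time sensors {name}")
```

Running the printed commands one at a time should show whether the delay and the fan spin-up follow a single chip.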

Another thing I was going to ask you is this: based on your comments, it seemed to me that you were able to reproduce this problem not only in Ubuntu, but also in other distributions. Is that correct? If it is, do you think you could test a pristine Linux kernel (from https://kernel.org)? If the bug reproduces with it, then I would recommend opening a bug report against the upstream Linux project.

For this specific bug, I think it's worth adding the Ubuntu Linux kernel package and seeing what the kernel folks have to say. Maybe they can come up with other things you can try to determine what's going on.

I'm setting this bug status back to New and adding the Ubuntu Linux package to it. I'm also going to subscribe myself to this bug so that we don't miss any more updates from you. Thank you.

Changed in linux-signed-hwe-5.8 (Ubuntu):
status: Expired → New
Changed in lm-sensors (Ubuntu):
status: Expired → New
Changed in linux-signed-hwe-5.8 (Ubuntu):
status: New → Invalid
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1910562

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
munbi (gabriele) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
munbi (gabriele) attached further apport information, one file per comment: CRDA.txt, CurrentDmesg.txt, Dependencies.txt, IwConfig.txt, Lspci.txt, Lspci-vt.txt, Lsusb.txt, Lsusb-t.txt, Lsusb-v.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, RfKill.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

Changed in linux-signed-hwe-5.8 (Ubuntu):
status: Invalid → Confirmed
Revision history for this message
munbi (gabriele) wrote :

Hi Sergio. First of all, thank you (and anyone else involved) for your time investigating the issue.

> Could you please try to run the command several times, specifying the different chip

After reading the man page, I could not find how to specify a particular chip with the 'sensors' command.
But I'm sure the problem happens when sensors tries to read the temperature from the AMD discrete graphics card, because when the console output of sensors reaches this section during the scan:

amdgpu-pci-0100
Adapter: PCI adapter
[ --> HERE IT HANGS FOR 2 SECS <-- ]
vddgfx: 1.05 V
edge: +42.0°C (crit = +94.0°C, hyst = -273.1°C)
power1: 7.11 W (cap = 35.00 W)

it hangs for a couple of seconds and two things happen:
1. the fans start spinning
2. I see a drm-related warning appear in the dmesg logs:

[ 41.397707] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 42.609649] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 43.213733] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 43.242390] [drm] UVD and UVD ENC initialized successfully.
[ 43.352370] [drm] VCE initialized successfully.
[ 43.358578] amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
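
As a rough cross-check of the "hangs for 2 secs" observation, the timestamps in the excerpt above can be subtracted from each other. This is a small sketch, assuming the bracketed values are the usual dmesg seconds-since-boot timestamps; the helper name `window_seconds` is made up for the example.

```python
import re

# Matches the leading "[ 41.397707]" style timestamp of a dmesg line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def window_seconds(dmesg_lines):
    """Seconds between the first and last timestamped line (0.0 if none)."""
    stamps = []
    for line in dmesg_lines:
        match = STAMP.match(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps[-1] - stamps[0] if stamps else 0.0


EXCERPT = [
    "[ 41.397707] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).",
    "[ 42.609649] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!",
    "[ 43.213733] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!",
    "[ 43.242390] [drm] UVD and UVD ENC initialized successfully.",
    "[ 43.352370] [drm] VCE initialized successfully.",
    "[ 43.358578] amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes",
]

print(f"{window_seconds(EXCERPT):.2f} s")  # prints "1.96 s"
```

The amdgpu message burst spans roughly two seconds, which is consistent with the length of the pause seen in the 'sensors' output.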

Could it be that scanning amdgpu-pci-0100 causes the AMD card to reset, and that this is related to the fans starting?
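
One way to probe this would be to look at the card's runtime power-management state while idle and again right after a 'sensors' call. The sketch below is an assumption-laden illustration, not a confirmed diagnosis: the PCI address 0000:01:00.0 is taken from the dmesg lines above, `power/runtime_status` is the generic sysfs attribute for device runtime PM, and whether the card is actually runtime-suspended on this machine has not been verified in this report. The function name `runtime_status` is illustrative.

```python
import os


def runtime_status(pci_address="0000:01:00.0", root="/sys/bus/pci/devices"):
    """Return the device's runtime PM status string, or None if unreadable."""
    path = os.path.join(root, pci_address, "power", "runtime_status")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Run once while idle, then again right after calling 'sensors'.
    print(runtime_status())
```

If the status reads "suspended" while idle and "active" right after 'sensors', that would suggest the sensor read is waking the card rather than resetting it.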

> you were able to reproduce this problem not only in Ubuntu, but also in other distributions. Is that correct ?

I was able to reproduce on Ubuntu 20.04 with all 5.8 kernels available and with a 5.10.0-14 kernel installed from Hirsute beta repositories.

However, today I ran some tests again using my stable 20.04 and also the just-released final Hirsute 21.04 from a live USB. Below are the results of running the 'sensors' command in a terminal, ordered by Ubuntu version and kernel:

1. 20.04, 5.4.0-59-generic #65-Ubuntu SMP
- everything ok
- no fans spinning
- no dmesg warnings
- no sensors hanging during scan:
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx: N/A
edge: N/A (crit = +94.0°C, hyst = -273.1°C)
power1: N/A (cap = 35.00 W)

2. 20.04, 5.8.0-50-generic #56~20.04.1-Ubuntu SMP
- problem present
- fans spinning
- sensors hangs during output when reaching amd section:
amdgpu-pci-0100
Adapter: PCI adapter
[ 2 SECS ]
vddgfx: 1.05 V
edge: +38.0°C (crit = +94.0°C, hyst = -273.1°C)
power1: 7.12 W (cap = 35.00 W)
- dmesg warnings appearing when fans start:
[ 41.397707] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 42.609649] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 43.213733] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 43.242390] [drm] UVD and UVD ENC initialized successfully.
[ 43.352370] [drm] VCE initialized successfully.
[ 43.358578] amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes

3. USB/Live 21.04, 5.11.0-16-generic #17-Ubuntu SMP
- problem present
- fans spinning
- sensors hangs during output when reaching amd...


Revision history for this message
munbi (gabriele) wrote :

I've made the same tests with mainline kernel '5.12.4-051204-generic #202105140931 SMP Fri May 14 09:35:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux' and the results are the same, i.e. sensors command stopping when reaching amd section and this error message showing in dmesg:

[ 65.636330] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 67.104101] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 67.704167] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 67.735924] [drm] UVD and UVD ENC initialized successfully.
[ 67.845908] [drm] VCE initialized successfully.
[ 67.850738] amdgpu 0000:01:00.0: [drm] Cannot find any crtc or sizes
[ 68.144823] drm_dp_i2c_do_msg: 2 callbacks suppressed
[ 68.241327] [drm:lspcon_init [i915]] *ERROR* Failed to probe lspcon
[ 68.241388] [drm:lspcon_resume [i915]] *ERROR* LSPCON init failed on port D
[ 68.893269] [drm:lspcon_init [i915]] *ERROR* Failed to probe lspcon
[ 68.893321] [drm:lspcon_resume [i915]] *ERROR* LSPCON init failed on port D

So the problem is not Ubuntu related.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Thanks. Please file an upstream bug.
