calling nvidia-smi in udev rule is a gap to the driver from Nvidia

Bug #1839309 reported by Alex Tu
This bug affects 2 people
Affects                 Status         Importance   Assigned to      Milestone
HWE Next                Fix Released   Undecided    Unassigned
NVIDIA Drivers Ubuntu   Fix Released   Undecided    Alberto Milone
OEM Priority Project    Fix Released   Critical     Cyrus Lien

Bug Description

There is a timing issue (LP: #1812784) caused by nvidia-driver-430 relying on a udev rule in 71-nvidia.rules that calls nvidia-smi to create the device nodes "nvidia0" and "nvidiactl".

This behavior differs from the .run driver released by Nvidia, which relies on nvidia-modprobe (a setuid root utility that nvidia-installer installs by default). The issue does not occur on a system where the same version of the .run driver is installed directly.

According to Nvidia, they do not expect the nodes to be created by udev rules. The solution should therefore be either to follow Nvidia's approach and call nvidia-modprobe, or to find another way to close this gap that does not call nvidia-smi.

# Target:
 - remove nvidia-smi from 71-nvidia.rules, because it is a workaround and it impacts some platforms (LP: #1812784).
 - either follow Nvidia's approach and call nvidia-modprobe, or find another way to close this gap that does not call nvidia-smi

# Concern:
 - nvidia-modprobe is a setuid root utility; will that be a security concern?
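
For reference when weighing this concern, here is a minimal sketch (Python, not part of any package) of how one could check whether a given binary is setuid root. The helper names `is_setuid_root_mode` and `is_setuid_root` are made up for illustration; the path is the one that shows up in the strace output attached to this report.

```python
import os
import stat


def is_setuid_root_mode(mode, uid):
    """True if the mode/uid pair describes a regular file that is setuid and owned by root."""
    return stat.S_ISREG(mode) and uid == 0 and bool(mode & stat.S_ISUID)


def is_setuid_root(path):
    """Check an on-disk file; returns False if the file does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return is_setuid_root_mode(st.st_mode, st.st_uid)


if __name__ == "__main__":
    # Path taken from the strace output in this report.
    print(is_setuid_root("/usr/bin/nvidia-modprobe"))
```

This only reports the file's ownership and mode bits; it says nothing about whether the utility itself is safe to ship.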

# Machine environment:
 - on an ice lake machine (BIOS ID: 097B)
 - 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1d11] (rev a1)
 - kernel: 5.0.0-1016-oem-osp1
 - nvidia driver: nvidia-driver-430

# caseA : install nvidia-driver-430:
 1. install nvidia-driver-430 on machine.
 2. remove nvidia-smi from 71-nvidia.rules (same as https://github.com/tseliot/nvidia-graphics-drivers/pull/31)
 3. reboot machine to text mode.
 4. $ systemctl start gdm <= "nvidia0" and "nvidiactl" will not be created under /dev
 5. login fails with an Xorg failure:
"gdm-x-session[2165]: (EE) NVIDIA: Failed to initialize the NVIDIA kernel module."

# caseB : install .run from source code of nvidia-driver-430:
 1. execute NVIDIA-Linux-x86_64-430.26.run to install driver
 2. reboot machine to text mode.
 3. $ systemctl start gdm <= "nvidia0" and "nvidiactl" will be created automatically.
 4. login works well.
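
To compare the two cases after "systemctl start gdm", something like the following sketch could be used to report whether the two device nodes exist and which major:minor numbers they carry. It is only an illustration: `node_status` is a hypothetical helper, and the expected major number 195 is an assumption taken from the makedev(195, 255) call in the strace output in this report.

```python
import os
import stat

# Assumption: major number taken from makedev(195, 255) in the strace output.
NVIDIA_MAJOR = 195


def node_status(path):
    """Return 'missing', 'not a character device', or 'MAJOR:MINOR' for a path."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISCHR(st.st_mode):
        return "not a character device"
    return f"{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"


if __name__ == "__main__":
    for path in ("/dev/nvidiactl", "/dev/nvidia0"):
        print(path, node_status(path))
```

In caseA both paths would be expected to report "missing"; in caseB they should report a major number of 195.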

# Analyze :
In caseB, the strace of gnome-shell (which is started by the gdm systemd service) shows that it uses libnvidia-glsi.so.430.26 and tries to access /dev/nvidiactl but fails. It then calls nvidia-modprobe to create /dev/nvidiactl.

In caseA, the strace of gnome-shell shows the same sequence as caseB, but nvidia-modprobe is not present, so /dev/nvidiactl is never created.

On caseA, after copying nvidia-modprobe to /usr/bin and making it setuid root, it works as well as caseB.
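
The sequence described above can be sketched as a small simulation. This is not Nvidia's code; it is a guess at the decision flow based only on the strace observations, run against an in-memory fake /dev so that it touches no real devices. All names (`ensure_node`, `simulate`) are hypothetical.

```python
import errno


def ensure_node(path, node_exists, mknod, helper_exists, run_helper):
    """Mirror the sequence seen in the strace logs; all callables are injected."""
    if node_exists(path):
        return "already present"
    try:
        mknod(path)  # the unprivileged mknod fails with EACCES in both cases
        return "created directly"
    except PermissionError:
        pass
    if not helper_exists():
        return "gave up"  # caseA: no nvidia-modprobe installed
    run_helper()  # caseB: the setuid root helper creates the node
    return "created by helper" if node_exists(path) else "helper failed"


def simulate(helper_installed):
    """Run ensure_node against an in-memory fake /dev (no real device access)."""
    dev = set()

    def deny(path):
        raise PermissionError(errno.EACCES, "Permission denied")

    return ensure_node(
        "/dev/nvidiactl",
        node_exists=lambda p: p in dev,
        mknod=deny,
        helper_exists=lambda: helper_installed,
        run_helper=lambda: dev.add("/dev/nvidiactl"),
    )


if __name__ == "__main__":
    print("caseA:", simulate(helper_installed=False))  # gave up
    print("caseB:", simulate(helper_installed=True))   # created by helper
```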

related issues:
https://bugs.launchpad.net/somerville/+bug/1812784
 - on Precision-7730 NVIDIA Corporation GP104GLM [Quadro P5200 Mobile][10de:1bb5]
https://bugs.launchpad.net/somerville/+bug/1831013
https://bugs.launchpad.net/somerville/+bug/1834323
https://bugs.launchpad.net/nvidia/+bug/1839279
https://github.com/tseliot/nvidia-graphics-drivers/pull/34
https://github.com/tseliot/nvidia-graphics-drivers/pull/31
https://github.com/tseliot/nvidia-graphics-drivers/pull/32

nvidia-modprobe was rejected due to security concerns: https://bugs.launchpad.net/ubuntu/+source/nvidia-modprobe/+bug/1421209

Alex Tu (alextu)
Changed in oem-priority:
importance: Undecided → Critical
assignee: nobody → Alex Tu (alextu)
Alex Tu (alextu) wrote :

The strace of gnome-shell for the successful caseB.
Check line 137927:
mknod("/dev/nvidiactl", S_IFCHR|0666, makedev(195, 255)) = -1 EACCES (Permission denied)
It then calls /usr/bin/nvidia-modprobe to create it.
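
For readers not used to strace output, the arguments of that mknod call can be decoded with a few lines of Python. This is just an illustration of what the call asks for (a world-readable and world-writable character device with major 195, minor 255); `describe` is a made-up helper name.

```python
import os
import stat


def describe(mode, dev):
    """Render a mknod mode and device number the way `ls -l` would show them."""
    return f"{stat.filemode(mode)} {os.major(dev)}, {os.minor(dev)}"


if __name__ == "__main__":
    # Values copied from the strace line: S_IFCHR|0666, makedev(195, 255)
    print(describe(stat.S_IFCHR | 0o666, os.makedev(195, 255)))
```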

Alex Tu (alextu) wrote :

The strace of gnome-shell for the failed caseA.
Check line 5181:
mknod("/dev/nvidiactl", S_IFCHR|0666, makedev(195, 255)) = -1 EACCES (Permission denied)

It then cannot find nvidia-modprobe, so it gives up creating the device node:
stat("/usr/bin/nvidia-modprobe", 0x7ffd5ddb3f50) = -1 ENOENT (No such file or directory)
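
When going through a large strace log like the attached ones, a small filter can help locate these failing calls. The sketch below is a rough, assumption-laden parser: the regular expression only handles simple lines whose first argument is a quoted path, as in the two lines quoted in this report, and `parse_strace_line` is a hypothetical name.

```python
import re

# Matches lines such as:
#   stat("/usr/bin/nvidia-modprobe", 0x7ffd5ddb3f50) = -1 ENOENT (No such file or directory)
STRACE_RE = re.compile(
    r'^(?P<syscall>\w+)\("(?P<path>[^"]*)".*\)\s+=\s+(?P<ret>-?\d+)'
    r'(?:\s+(?P<errno>[A-Z0-9]+))?'
)


def parse_strace_line(line):
    """Return a dict with syscall, path, ret and errno, or None if the line does not match."""
    m = STRACE_RE.match(line.strip())
    if m is None:
        return None
    info = m.groupdict()
    info["ret"] = int(info["ret"])
    return info


if __name__ == "__main__":
    sample = ('stat("/usr/bin/nvidia-modprobe", 0x7ffd5ddb3f50) = -1 ENOENT '
              '(No such file or directory)')
    print(parse_strace_line(sample))
```

Filtering for entries whose path contains "nvidia" and whose return value is -1 should surface both the EACCES on mknod and the ENOENT on nvidia-modprobe.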

Alex Tu (alextu)
description: updated
Alex Tu (alextu) wrote :

short term workaround:
 - https://github.com/tseliot/nvidia-graphics-drivers/pull/35
 - ppa for test: https://code.launchpad.net/~alextu/+recipe/lp1839309-not-use-nvidia-smi-in-udev-rule

verified on
 - ice lake machine (BIOS ID: 097B)
 - Precision-7730 NVIDIA Corporation GP104GLM [Quadro P5200 Mobile] [10de:1bb5]

next step:
 - verify the packages built from the PPA: https://code.launchpad.net/~alextu/+recipe/lp1839309-not-use-nvidia-smi-in-udev-rule

description: updated
Alex Tu (alextu)
Changed in nvidia-drivers-ubuntu:
assignee: nobody → Alberto Milone (albertomilone)
Changed in oem-priority:
status: New → Triaged
You-Sheng Yang (vicamo) wrote :

I recently began some work to implement device_create() support in the nvidia driver, but I soon ran into a problem that forced me to stop.

device_create() takes a `struct class` pointer, which is supposed to be created by class_create(). However, class_create(), along with all other device class APIs, is exported GPL-only, which means the proprietary nvidia driver can never access them. End of story. Some other workaround must be used.

You-Sheng Yang (vicamo)
tags: added: nvidia-smi
description: updated
tags: added: oem-priority originate-from-1831013 somerville
You-Sheng Yang (vicamo) wrote :
Alberto Milone (albertomilone) wrote :

I think that using udev and another tool to create device nodes would be a simpler solution. I had some code to do that, but I never got around to completing it. I can have another look at this.

You-Sheng Yang (vicamo) wrote :

@Alberto, we're counting on you now. But please don't hesitate to assign this to me if you don't have the time right now.

Changed in hwe-next:
status: New → Triaged
Alex Tu (alextu)
Changed in oem-priority:
assignee: Alex Tu (alextu) → Leon Liao (lihow731)
Alberto Milone (albertomilone) wrote :

I have provided the tool that I promised in this PPA:

https://launchpad.net/~oem-solutions-group/+archive/ubuntu/nvidia-driver-staging

The fix is only available for nvidia-graphics-drivers-440 (440.64-0ubuntu3~0.20.04.2 for Focal, 440.64-0ubuntu0~0.19.10.4 for Eoan, 440.64-0ubuntu0~0.18.04.3 for Bionic).

Please test it and let me know if you find any issues.

You-Sheng Yang (vicamo) wrote :

To verify:

* bug 1834323, which depends on nvidia-smi being executed in udev hooks
* bug 1812784, which depends on nvidia-smi not being executed in udev hooks

You-Sheng Yang (vicamo) wrote :

Verified bug 1834323 with 440.64-0ubuntu0~0.18.04.3 for Bionic, in performance or power saving mode, and suspend/resume under DC and AC mode. So far so good, except there are two (probably unrelated) issues:

  1. X hangs when switching from power saving mode to performance mode by nvidia-settings.
  2. /sbin/ub-device-create is not stripped

No device at hand for bug 1812784 yet.

You-Sheng Yang (vicamo) wrote :

dmesg log for case 1 in comment #11. I tested suspend/resume under power saving mode first, and then tried to switch to performance mode. When X hung, I found nvidia modules being loaded in the background `dmesg -w` terminal window:

  nvidia-nvlink: Nvlink Core is being initialized, major device number 237
  NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
  nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.64 Fri Feb 21 00:43:19 UTC 2020
  [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
  ...

Then I had to reboot with sysrq.

This doesn't always happen: I have tried a few more times now but cannot reproduce it. Seen twice before comment #11.

Rex Tsai (chihchun) wrote :

Update from @tseliot on IRC:

- The device-create tool is in Groovy now, and it will make its way into Bionic and Focal on the next update

Rex Tsai (chihchun)
Changed in oem-priority:
assignee: Leon Liao (lihow731) → Cyrus Lien (cyruslien)
Alberto Milone (albertomilone) wrote :

All the drivers (except for the 340 driver) should have the device-create tool in Bionic and Focal.

Changed in nvidia-drivers-ubuntu:
status: New → Fix Released
Changed in oem-priority:
status: Triaged → Fix Released
Timo Aaltonen (tjaalton)
Changed in hwe-next:
status: Triaged → Fix Released