[Libvirt] IceLake CPU model not recognized

Bug #1978064 reported by Ammad Ali
This bug affects 10 people
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Confirmed     Low         Unassigned
libvirt                   Fix Released  Unknown
libvirt (Ubuntu)          Won't Fix     Undecided   Lena Voytek
  Focal                   Won't Fix     Undecided   Lena Voytek
  Jammy                   Won't Fix     Undecided   Lena Voytek
  Kinetic                 Won't Fix     Undecided   Lena Voytek

Bug Description

Hi,

I am using libvirt and QEMU on Ubuntu 22.04 on a Dell R750 with an Intel Xeon Gold 6338 CPU.

libvirt identifies my CPU as a Broadwell-generation model.

# virsh capabilities | more
<capabilities>

  <host>
    <uuid>4c4c4544-0050-4a10-8031-b6c04f534e33</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>Broadwell-noTSX-IBRS</model>
      <vendor>Intel</vendor>
      <microcode version='218104675'/>
......

Digging into it further, I found that the mpx extension does not exist in Ice Lake generation CPUs.

I have updated x86_Icelake-Server-noTSX.xml by changing the mpx feature line to:

    <feature name='mpx' removed='yes'/>

Now it picks up the correct host CPU model.

# virsh capabilities | more
<capabilities>

  <host>
    <uuid>4c4c4544-0050-4a10-8031-c2c04f534e33</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>Icelake-Server-noTSX</model>
      <vendor>Intel</vendor>
      <microcode version='218104675'/>
...

Is this a bug, or am I missing something?
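
When comparing several hosts, the detected model can be read out of the capabilities XML instead of paging through it. The following is a minimal Python sketch, assuming the XML text has already been captured (for example from `virsh capabilities`). The function name `host_cpu_model` and the trimmed sample document are illustrative and not part of libvirt.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Trimmed, well-formed sample modeled on the output above (illustrative only).
SAMPLE_CAPABILITIES = """
<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>Icelake-Server-noTSX</model>
      <vendor>Intel</vendor>
    </cpu>
  </host>
</capabilities>
"""


def host_cpu_model(capabilities_xml: str) -> Optional[str]:
    """Return the text of <host>/<cpu>/<model>, or None if it is absent."""
    root = ET.fromstring(capabilities_xml)
    node = root.find("./host/cpu/model")
    return node.text if node is not None else None


if __name__ == "__main__":
    print(host_cpu_model(SAMPLE_CAPABILITIES))
```

Running the same function before and after editing the cpu_map file would show whether the detected model changed.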

Tags: libvirt patch
Ammad Ali (syedammad83) wrote :

Correction: I am using Ubuntu 20.04 with libvirt 8.0 installed from the UCA Yoga repository.

Lena Voytek (lvoytek) wrote :

Hello Ammad, thank you for the bug report. This does seem to be a valid issue, and I can confirm mpx is shown to be a feature in x86_Icelake-Server-noTSX.xml in kinetic, jammy, and impish. Our version in focal doesn't have that file but x86_Icelake-Server.xml also shows mpx. This may be due to the fact that some 10th gen mobile processors still have mpx, causing some confusion since the rest of Ice Lake does not. I'll mark this down for our team to review.

Thanks!

tags: added: server-triage-discuss
Christian Ehrhardt  (paelzer) wrote :

Hi Ammad,
the cpu detection by libvirt is always an approximation as there are so many real cpus manufactured and only so few names for them.

In this case, MPX [1] was added to some chips as early as Skylake (2015), and Icelake, being a newer chip (~2019), is more likely to have more submodels with it present.
Those models in libvirt follow the definition in qemu like [2] which is usually added and defined by the HW manufacturer based on a common denominator of what these chips are able to do.
And it is important that these definitions match that qemu definition.

But in reality there are so many subsets that you'll always be able to find chips that "have" or "have not" a given feature even though all of them might be called by the same name.

What happens now is that in any real system libvirt detects all the features the cpu really has and then looks for the shortest list of added/removed features to express this.

Example:
CPU 1 general definition:
Name: Foo
Feature: 1 2 3

CPU 2 general definition:
Name: Bar
Feature: 1 2 3 4 5 6
Note: 4 5 6 might even be exclusive to this new series

If you now happen to buy a real chip from the "Bar" series, that has only features "1 2 3 4" it is still easier to express this as:
  "Foo +4"
Instead of
  "Bar -5 -6"

That is IMHO (simplified) what is happening here.
I said simplified because if there is a match to a known model/family then it will use that name (wins over the feature list length).
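
The Foo/Bar example above can be sketched in a few lines of Python. This is only an illustration of the "shortest list of added/removed features" idea, not libvirt's actual algorithm: the function `describe_host` and the `MODELS` table are hypothetical, and the sketch ignores the family/model signature match that wins over the feature list length.

```python
def describe_host(host_features, models):
    """Pick the model that needs the fewest added/removed features.

    Simplified: no signature matching, and `models` must not be empty.
    """
    best = None
    for name, model_features in models.items():
        added = sorted(host_features - model_features)
        removed = sorted(model_features - host_features)
        cost = len(added) + len(removed)
        # Keep the first model with the lowest cost seen so far.
        if best is None or cost < best[0]:
            best = (cost, name, added, removed)
    _, name, added, removed = best
    parts = [name] + [f"+{f}" for f in added] + [f"-{f}" for f in removed]
    return " ".join(parts)


# General definitions from the example above.
MODELS = {
    "Foo": {1, 2, 3},
    "Bar": {1, 2, 3, 4, 5, 6},
}

if __name__ == "__main__":
    # A real chip from the "Bar" series that only has features 1 2 3 4.
    print(describe_host({1, 2, 3, 4}, MODELS))  # Foo +4
```

With one more feature present (1 2 3 4 5), the same function returns "Bar -6", because removing one feature from Bar is now shorter than adding two to Foo.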

Now there are two detailed answers to this:
1. the generic one
2. the special one for MPX

I'll mention them for better later reference in extra posts.

[1]: https://en.wikipedia.org/wiki/Intel_MPX
[2]: https://gitlab.com/qemu-project/qemu/-/commit/8a11c62da9146dd89aee98947e6bd831e65a970d

Christian Ehrhardt  (paelzer) wrote :

1. Generic answer

This is an example about a similar issue and how it would usually be fixed/improved.

In 2019 there was a quite similar issue between Broadwell/Skylake.
Upstream's fix there was to recognize that more model/family combinations needed to be added to avoid such mismatches, which was done in [1].

In fact, not just a few:
367d96a5d6 cpu_map: Add more signatures for Skylake-Client CPU models
4ff74a806a cpu_map: Add more signatures for Broadwell CPU models
e58ca588cc cpu_map: Add more signatures for Haswell CPU models
194105fef1 cpu_map: Add more signatures for IvyBridge CPU models
4a3c3682f3 cpu_map: Add more signatures for SandyBridge CPU models
e89f877214 cpu_map: Add more signatures for Westmere CPU model
f349f3c53f cpu_map: Add more signatures for Nehalem CPU models
0a09e59457 cpu_map: Add more signatures for Penryn CPU model
c1f6a3269c cpu_map: Add more signatures for Conroe CPU model

Since then libvirt also learned to differentiate steppings, but it seems for your chip no data is present which is (assumption) why you see the behavior.

Maybe it is time to add such a barrage of new signatures here as well, but one would want to do so consistently across all projects. To that end, I think you'd want to file the same issue upstream at [2].

P.S. Before acting see next answer for why MPX is different

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1686895
[2]: https://gitlab.com/libvirt/libvirt/-/issues

Christian Ehrhardt  (paelzer) wrote :

2. special answer for MPX

MPX is special: contrary to what was originally intended, it is not available on Icelake, and it is being phased out [1].
The libvirt cpu_map doesn't know about it yet.

For the upstream project or the distribution this isn't as easy as removing the flag (well it might be, but needs to be checked). There are various cross-release and post-upgrade incompatibilities to consider.

Each of them is evaluated separately; this one was in fact already disabled in QEMU [2]. Since then, mpx is no longer really enabled if you use a 4.0 machine type or newer.

Still, such changes often carry a high risk of hidden problems and are best resolved consistently across all builds. Therefore one needs to report it upstream, which has already happened in [3].

Once there is a generally accepted fix to that we will look at its back-portability to different Ubuntu releases.

[1]: https://www.intel.com/content/www/us/en/support/articles/000059823/processors.html
[2]: https://gitlab.com/qemu-project/qemu/-/commit/ecb85fe48cacb2f8740186e81f2f38a2e02bd963
[3]: https://gitlab.com/libvirt/libvirt/-/issues/304

tags: removed: server-triage-discuss
Changed in libvirt:
status: Unknown → New
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Walid Moghrabi (walid-fdj) wrote :

Hi,

We are also affected by this bug, and it really impacts our deployment: we are now integrating more and more Dell PowerEdge R750 servers, whose CPUs lack the MPX flag, so they are seen as incompatible with the QEMU virtual CPU model we configured for the prior generation in our OpenStack clusters.
Instead of phasing out or adjusting older CPU definitions in libvirt, why not simply create a variant, as was done with the "-NOTSX" ones, for example?
Until we have a proper fix, this will be our way to go.

Alan Baghumian (alanbach) wrote :

Hello!

I agree with Walid. More and more firms will be switching to newer hardware and it would be nice if we can get this incorporated to get ahead of the curve, although it has not yet been fully merged upstream.

Thank you.

tags: added: server-todo
Lena Voytek (lvoytek) wrote :

I can take this bug and add a variant for kinetic and jammy.

Changed in libvirt (Ubuntu):
assignee: nobody → Lena Voytek (lvoytek)
status: Confirmed → In Progress
Changed in libvirt (Ubuntu Jammy):
status: New → Confirmed
assignee: nobody → Lena Voytek (lvoytek)
Christian Ehrhardt  (paelzer) wrote :

Thanks Lena, but just FYI those names usually should have agreement across distributions and therefore should start being discussed upstream.
I have not seen a submission yet, so please start providing one there when you look at this.

I haven't had much time to check whether a new named type (like -NOTSX) or one of the new versioning mechanisms is better. The upstream feedback might help with that choice.

Walid Moghrabi (walid-fdj) wrote :

Hi Lena,

Why not backport the fix to focal? Focal being an LTS, there are many OpenStack deployments out there based on it, and they will all be affected by this bug.

Lena Voytek (lvoytek) wrote :

Hi Walid,

Sure, I can backport the fix to focal too. It'll be slightly different since the xml files have changed since then but still doable, as long as there is someone who can confirm that they need it and have the right CPU to test it.
I'll add it as an affected distribution for now.

For the time being I'm still trying to work with upstream to see if they will take the fix so it is standardized for libvirt in all distributions.

Changed in libvirt (Ubuntu Focal):
assignee: nobody → Lena Voytek (lvoytek)
Ammad Ali (syedammad83) wrote :

Hello,

I am using libvirt openstack xena from UCA on focal. I have R750 and R650 servers with Xeon 6338 and Xeon 4314 CPUs. I can help in testing those fixes / proposed packages on them.

Walid Moghrabi (walid-fdj) wrote :

So do I (Openstack Ussuri on Focal).

Lena Voytek (lvoytek) wrote :

Thanks you two, I'll let you know when I have a fix ready to test

Changed in libvirt (Ubuntu Focal):
status: New → Confirmed
Ammad Ali (syedammad83) wrote :

Hi,

Just for testing I have modified /usr/share/libvirt/cpu_map/x86_Icelake-Server-noTSX.xml as below for mpx feature.

    <feature name='mpx' removed='yes'/>

After that, libvirt picks up the correct CPU model, i.e. Icelake-Server-noTSX.
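
For anyone who wants to script this test edit rather than apply it by hand, here is a hedged Python sketch. It works on an XML string only and does not touch any file on disk. The sample document is a hypothetical, heavily trimmed stand-in for a cpu_map model file (the real file lists many more features), and `mark_feature_removed` is an illustrative helper, not a libvirt tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a cpu_map model file.
SAMPLE_MODEL_XML = """
<cpus>
  <model name='Icelake-Server-noTSX'>
    <vendor name='Intel'/>
    <feature name='avx512f'/>
    <feature name='mpx'/>
  </model>
</cpus>
"""


def mark_feature_removed(model_xml: str, feature: str) -> str:
    """Set removed='yes' on every <feature> element with the given name."""
    root = ET.fromstring(model_xml)
    for node in root.iter("feature"):
        if node.get("name") == feature:
            node.set("removed", "yes")
    # Serialize back to text; ElementTree writes attributes with double quotes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(mark_feature_removed(SAMPLE_MODEL_XML, "mpx"))
```

Features other than the named one are left unchanged, so the result differs from the input only in the mpx line.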

Lena Voytek (lvoytek) wrote :

Hello all,

I've been working with libvirt upstream for a while now, but it seems they do not want to add a new CPU model for 10nm Icelake. Their reasoning is that the list of capabilities for a given processor is not used within libvirt, so they don't believe maintaining a noMPX version of Icelake-Server is worth it.

Moving forward there are a few options:

The first is to deploy OpenStack with custom libvirt flags. Information on how to do that can be found here: https://docs.openstack.org/nova/latest/admin/cpu-models.html#cpu-feature-flags
In this case, something like the following can be added to the conf file:

[libvirt]
cpu_mode = custom
cpu_models = Icelake-Server,Icelake-Server-noTSX
cpu_model_extra_flags = -mpx
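
To make explicit what the three options above express, here is a small Python sketch that reads the fragment with the standard `configparser` module. This is not how nova loads its configuration (nova uses oslo.config); `read_cpu_settings` is a hypothetical helper, and the handling of the `+`/`-` prefixes and the comma-separated lists is an assumption made for illustration.

```python
import configparser

# The fragment suggested above, held in a string for illustration.
NOVA_CONF_FRAGMENT = """
[libvirt]
cpu_mode = custom
cpu_models = Icelake-Server,Icelake-Server-noTSX
cpu_model_extra_flags = -mpx
"""


def read_cpu_settings(conf_text):
    """Return (mode, models, enabled_flags, disabled_flags) from [libvirt]."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["libvirt"]

    raw_models = section.get("cpu_models", fallback="")
    models = [m.strip() for m in raw_models.split(",") if m.strip()]

    enabled, disabled = [], []
    for flag in section.get("cpu_model_extra_flags", fallback="").split(","):
        flag = flag.strip()
        if not flag:
            continue
        # Assumption: a leading '-' disables a flag; '+' or no prefix enables it.
        if flag.startswith("-"):
            disabled.append(flag[1:])
        else:
            enabled.append(flag.lstrip("+"))
    return section.get("cpu_mode"), models, enabled, disabled


if __name__ == "__main__":
    print(read_cpu_settings(NOVA_CONF_FRAGMENT))
```

Read this way, the fragment asks for one of the two Icelake models with mpx switched off.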

Alternatively, <feature name='mpx' removed='yes'/> can be added manually in x86_Icelake-Server-noTSX.xml or x86_Icelake-Server.xml, but that doesn't seem very efficient or sustainable for deployment.

If neither of these work, I can try and push harder for upstream to specify noMPX as a model or find a way to make this work in Ubuntu specifically.

Let me know what you prefer. Thanks!

Alan Baghumian (alanbach) wrote :

Hi Lena,

Thank you so much for working on this.

I just messaged a client that was affected by this issue and hopefully they will return with some thoughts.

Best,
Alan

Ammad Ali (syedammad83) wrote :

Hi,

After setting the CPU mode to custom, the nova-compute service failed to start with the error below.

2022-10-05 13:55:41.130 1095229 ERROR nova.virt.libvirt.driver [-] CPU doesn't have compatibility.

Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service [-] Error starting thread.: nova.exception.InvalidCPUInfo: Configured CPU model: Icelake-Server-noTSX is not compatible with host CPU. Please correct your config and try again. Unacceptable CPU info: CPU doesn't have compatibility.

Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service Traceback (most recent call last):
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 986, in _check_cpu_compatibility
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service self._compare_cpu(cpu, self._get_cpu_info(), None)
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 9726, in _compare_cpu
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service raise exception.InvalidCPUInfo(reason=m % {'ret': ret, 'u': u})
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service 0
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service During handling of the above exception, another exception occurred:
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service Traceback (most recent call last):
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 806, in run_service
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service service.start()
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/service.py", line 159, in start
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service self.manager.init_host()
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1507, in init_host
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service self.driver.init_host(host=self.host)
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 814, in init_host
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service self._check_cpu_compatibility()
2022-10-05 13:55:41.131 1095229 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 992, in _check_cpu_...


Lena Voytek (lvoytek) wrote :

Thanks for looking into that Ammad. Knowing that the custom CPU mode is not a valid alternative should help my chances with upstream. If they still refuse in the end though I'll make sure to fix this in Ubuntu.

Lena Voytek (lvoytek) wrote :

Hello all,

It looks like upstream still will not take this change, so I am working on a fix for it in Ubuntu specifically. In the meantime, I've created a PPA (https://launchpad.net/~lvoytek/+archive/ubuntu/libvirt-add-icelake-10nm-to-cpu-map) that adds two new CPU models for 10nm Icelake:

Icelake-Server-noMPX
Icelake-Server-noTSX-noMPX

These can replace the use of Icelake-Server and Icelake-Server-noTSX respectively. If you would like to test the change in 20.04, 22.04, or 22.10, then you can add the PPA to your system with the following commands:

sudo add-apt-repository ppa:lvoytek/libvirt-add-icelake-10nm-to-cpu-map
sudo apt update

Thanks for your patience

Changed in libvirt (Ubuntu Jammy):
status: Confirmed → In Progress
Changed in libvirt (Ubuntu Focal):
status: Confirmed → In Progress
Changed in libvirt:
status: New → Fix Released
Bryce Harrington (bryce)
summary: - [Libvirt} IceLake CPU model not recognized
+ [Libvirt] IceLake CPU model not recognized
Paul Goins (vultaire) wrote :

I think it may be worth noting on the upstream bug (https://gitlab.com/libvirt/libvirt/-/issues/304) one of the key issues here. They did ask:

> Anyway, what is the actual issue here? The CPU model shown by virsh capabilities may be wrong, but that's mostly aesthetic issue as it is not used for anything.

The ticket got auto-closed as no one got back to them quickly enough.

That developer might be correct from a libvirt perspective, but perhaps not accurate from the wider software ecosystem perspective. If Nova *is* using "virsh capabilities" under the hood for its checks, that may be reason for them to take upstream action, or they may say Nova shouldn't do that, in which case it can be pushed to Nova.

I'm totally in favor of getting this to work on Ubuntu in some way, but we may want to add that context to the upstream ticket.

David Sedgmen (dsedgmen) wrote :

I believe the reason that adding `cpu_model_extra_flags = -mpx` does not work
is that the models are evaluated before the flags.

So it crashes on startup before it even tries to compare the flags.
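
The ordering problem described here can be shown with a toy Python model. Neither function below is nova or libvirt code; the feature sets and the names `check_model_first` and `check_flags_first` are hypothetical and only illustrate why the order of the two steps matters.

```python
# Toy feature sets; real models carry many more flags.
HOST_FEATURES = {"avx512f", "pku", "clwb"}          # hypothetical host without mpx
MODEL_FEATURES = {"avx512f", "pku", "clwb", "mpx"}  # hypothetical named model with mpx


def compatible(model_features, host_features):
    """A model is usable only if the host offers every feature it requires."""
    return model_features <= host_features


def check_model_first(model, host, disabled_flags):
    """Order described above: the bare model is compared, the flags are never used."""
    return compatible(model, host)


def check_flags_first(model, host, disabled_flags):
    """Alternative order: drop the disabled flags first, then compare."""
    return compatible(model - set(disabled_flags), host)


if __name__ == "__main__":
    print(check_model_first(MODEL_FEATURES, HOST_FEATURES, ["mpx"]))
    print(check_flags_first(MODEL_FEATURES, HOST_FEATURES, ["mpx"]))
```

In this toy setup the first check fails and the second passes, which matches the symptom reported earlier in the thread: the service refuses to start even though `-mpx` is configured.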

This might be a possible way forward if libvirt won't update the cpu_maps
and while we are waiting for "libvirt: Replace compareCPU() with compareHypervisorCPU()"
to be merged
https://review.opendev.org/c/openstack/nova/+/762330/20

tags: added: patch
Kashyap Chamarthy (kashyapc) wrote :

Hi, for OpenStack Nova, I submitted a patch here to remove the compareCPU() check in _check_cpu_compatibility(). It is causing much grief at this point:

Please see the details in the commit message:

https://review.opendev.org/c/openstack/nova/+/869587 -- libvirt: Remove compareCPU() check in _check_cpu_compatibility()

Please comment in the change if you have critique/thoughts.

Kashyap Chamarthy (kashyapc) wrote :

Hi @Lena: FWIW, I have also posted this patch for OpenStack Nova:

https://review.opendev.org/c/openstack/nova/+/762330 -- libvirt: Replace compareCPU() with compareHypervisorCPU()

However David Sedgmen (from comment#23 above) reported that using it as-is still fails the same way - i.e. the compat check fails in Nova. So we need to still think of a way to fix this for Nova.

Sylvain Bauza (sylvain-bauza) wrote :

As there are some action items for nova, I mark this bug report as Confirmed.

Changed in nova:
status: New → Confirmed
tags: added: libvirt
Changed in nova:
importance: Undecided → Low
Alan Baghumian (alanbach) wrote :

@Lena, it seems the workaround does not work on Cascadelake-based processors:

Error starting thread.: nova.exception.InvalidCPUInfo: Configured CPU model: Cascadelake-Server-noTSX is not compatible with host CPU.

The workaround was to use:

[libvirt]
cpu-model: Broadwell-noTSX
cpu-model-extra-flags: abm arat avx512bw avx512cd avx512dq avx512f avx512vl avx512vnni clflushopt clwb f16c pdpe1gb pku rdrand spec-ctrl ssbd vme xgetbv1 xsavec xsaveopt

Instead of:

[libvirt]
cpu-model: Cascadelake-Server-noTSX
cpu-model-extra-flags: -mpx

To make it work. Just wanted to paste it here in case anyone else encounters this issue.

Lena Voytek (lvoytek) wrote :

Hello all,
Since upstream blocked libvirt changes and this should be fixable in nova, I'll mark libvirt as wont-fix for now. If any help is needed for the fix on the libvirt side moving forward let me know.
Thanks

Changed in libvirt (Ubuntu):
status: In Progress → Won't Fix
Changed in libvirt (Ubuntu Focal):
status: In Progress → Won't Fix
Changed in libvirt (Ubuntu Jammy):
status: In Progress → Won't Fix
Changed in libvirt (Ubuntu Kinetic):
status: In Progress → Won't Fix
tags: removed: server-todo
Brett Milford (brettmilford) wrote :

I can see a patch that was linked in one of the other PRs was merged: https://review.opendev.org/c/openstack/nova/+/870794

Does this fix the issue identified in this bug for Nova?
