DNS failure while trying to fetch user-data

Bug #2008952 reported by Ken VanDine
This bug affects 1 person
Affects                | Status       | Importance | Assigned to          | Milestone
Netplan                | Invalid      | Undecided  | Danilo Egea Gondolfo |
cloud-init             | Fix Released | Medium     | Chad Smith           |
subiquity              | Invalid      | Undecided  | Unassigned           |
livecd-rootfs (Ubuntu) | Fix Released | Undecided  | Unassigned           |
systemd (Ubuntu)       | Invalid      | Undecided  | Unassigned           |

Bug Description

In testing netboot + autoinstall of the new ubuntu desktop subiquity based installer for 23.04, I found cloud-init is failing to retrieve user-data because it can't resolve the hostname in the URL. This same configuration does work for 22.04 based subiquity, so this seems to be a regression.

From the ipxe config:

imgargs vmlinuz initrd=initrd \
 ip=dhcp \
 iso-url=http://cdimage.ubuntu.com/daily-live/pending/lunar-desktop-amd64.iso \
 fsck.mode=skip \
 layerfs-path=minimal.standard.live.squashfs \
 autoinstall \
 'ds=nocloud-net;s=http://boot.linuxgroove.com/ubuntu/23.04/' \

That fails, but if we replace boot.linuxgroove.com with the IP it works.

Chad Smith (chad.smith)
Changed in cloud-init:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Chad Smith (chad.smith) wrote :

There are a couple of significant issues that may be leading to this symptom:

 1. On Desktop live server images, something is laying down a competing file for management of the wired NIC on my laptop:

/etc/netplan/01-network-manager-all.yaml
# Let NetworkManager manage all devices on this system
network:
  version: 2
  renderer: NetworkManager

Yet /etc/cloud/cloud.cfg specifies priority order of network renderers to prefer using netplan instead of NetworkManager:
   networks:
       renderers: ['netplan', 'eni', 'sysconfig']
       activators: ['netplan', 'eni', 'network-manager', 'networkd']

This causes cloud-init to also write out /etc/netplan/50-cloud-init.yaml which also tries to claim management of enp0s31f6 under systemd-networkd:

network:
  version: 2
  ethernets:
    enp0s21f6:
      dhcp4: true
      match:
        macaddress: 6c:24:08:9e:54:e7
      set-name: enp0s31f6

The result is nondeterministic behavior in early boot as NetworkManager and systemd-networkd fight over who owns the wired device. This could be the cause of the DNS lookup failures we are seeing.

Resolution: I think desktop images want to avoid telling cloud-init to render networking using netplan. I believe we are missing this proposed fix to livecd rootfs:

https://code.launchpad.net/~dbungert/livecd-rootfs/+git/livecd-rootfs/+merge/427445

Revision history for this message
Chad Smith (chad.smith) wrote :

journalctl on a failed laptop is showing a lot of throttling logs from NetworkManager such as:

  state change: unavailable -> disconnected (reason 'carrier changed', sys-iface-state: 'managed')

This makes me believe that networkd and NetworkManager are attempting to grab management of the NIC multiple times throughout boot, thereby rendering the network and carrier down.

I see cloud-init reflecting the 'down' carrier state of enp0s31f6 in the 'init' timeframe too:

 Cloud-init v. 23.1.1-0ubuntu1 running 'init' at Tue, 07 Mar 2023 03:17:56 +0000. Up 62.47 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enp0s31f6 | False | . | . | . | 6c:24:08:9e:54:e6 |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: | wlp0s20f3 | False | . | . | . | 38:7a:0e:2d:d0:bb |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+

2. The second interesting configuration 'error' to be aware of is that netplan doesn't like the 644 permissions on the rendered /etc/netplan/01-network-manager-all.yaml.

** (process:2015): WARNING **: 03:18:08.594: Permissions for /etc/netplan/01-network-manager-all.yaml are too open. Netplan configuration should NOT be accessible by others.

Full log paste: https://dpaste.com/APGZH46PN

Revision history for this message
Dan Bungert (dbungert) wrote :

Looking at the desktop ISOs, while the system is configured with Netplan to use NetworkManager, there seems to be some conflict here where systemd-networkd is trying to interact with the device. Some feedback would be appreciated. The practical result is that early boot cloud-init is not able to fetch data that it should be able to retrieve.

Dan Bungert (dbungert)
tags: added: rls-ll-incoming
Revision history for this message
Lukas Märdian (slyon) wrote :

I disagree with comment #1

/etc/netplan/01-network-manager-all.yaml and /etc/netplan/50-cloud-init.yaml should not be in conflict with each other, but will be merged by Netplan to produce the following configuration:

```yaml
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp0s21f6:
      dhcp4: true
      match:
        macaddress: 6c:24:08:9e:54:e7
      set-name: enp0s31f6

```

Which should render its config in /run/NetworkManager/system-connections instead of /run/systemd/network (which would be the case if the global renderer setting wasn't changed).
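
The merge described above can be illustrated with a simplified sketch. This is only a model under the assumption that merging behaves like a recursive mapping merge in which later files add to or override earlier ones; it is not Netplan's actual parser, and `merge_netplan` is a hypothetical helper name. The two input dictionaries mirror the YAML files quoted in this thread.

```python
from copy import deepcopy


def merge_netplan(base: dict, override: dict) -> dict:
    """Recursively merge two parsed YAML mappings; values from `override` win.

    Simplified model for illustration only, not Netplan's real merge logic.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_netplan(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Content of 01-network-manager-all.yaml as quoted in comment #1.
nm_all = {"network": {"version": 2, "renderer": "NetworkManager"}}

# Content of 50-cloud-init.yaml as quoted in comment #1.
cloud_init = {
    "network": {
        "version": 2,
        "ethernets": {
            "enp0s21f6": {
                "dhcp4": True,
                "match": {"macaddress": "6c:24:08:9e:54:e7"},
                "set-name": "enp0s31f6",
            }
        },
    }
}

merged = merge_netplan(nm_all, cloud_init)
```

In this model the global `renderer: NetworkManager` setting survives the merge and applies to the ethernet definition contributed by cloud-init.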

So I wonder what is the exact sequence of events for running the Netplan generator (by systemd), installing the 50-cloud-init.yaml file (by cloud-init), installing the 01-network-manager-all.yaml file (by the installer?), executing `netplan apply` (by cloud-init).

Also, what are the artifacts generated by Netplan in /run/systemd/network/ and /run/NetworkManager/system-connections/ ?

I assume some kind of race condition, where:
* cloud-init installs 50-cloud-init.yaml
* the Netplan generator runs (as a systemd generator during early boot)
* systemd-networkd is now controlling the interface
* the installer puts the 01-network-manager-all.yaml file in place
* cloud-init calls `netplan apply` at runtime
* the network configuration changes and NetworkManager is supposed to take over control
* interfaces are still managed by networkd from the earlier stage, and therefore end up in conflict

Revision history for this message
Dan Bungert (dbungert) wrote :

So while this bug talks about netboot, I think it isn't necessary to involve that for testing purposes. I suggest simply booting the desktop daily-live with the nocloud kernel command line listed above and seeing how things go. Are the various network tools doing what we expect?

Note that gnome-boxes is a convenient tool for running this ISO in a VM; it seems to have a better video driver than what kvm would suggest by default. It's still possible to give it a kernel command line with config like:
    <kernel>/srv/iso/lunar/vmlinuz</kernel>
    <initrd>/srv/iso/lunar/initrd</initrd>
    <cmdline>autoinstall layerfs-path=minimal.standard.live.squashfs ds=nocloud-net;s=http://boot.linuxgroove.com/ubuntu/23.04/</cmdline>

Revision history for this message
Dan Bungert (dbungert) wrote :

> So I wonder what is the exact sequence of events for running the Netplan generator (by systemd), installing the 50-cloud-init.yaml file (by cloud-init), installing the 01-network-manager-all.yaml file (by the installer?), executing `netplan apply` (by cloud-init).

By the time Subiquity has started, the bad interaction has already taken place.
In this nocloud case, cloud-init should have been able to retrieve the user-data and other things, but that failed. So at Subiquity start time, we ask for the autoinstall and get an empty answer. An empty answer is quite common - that is what happens in a normal interactive install - so it's not immediately obvious that a misbehavior has taken place.

Revision history for this message
Lukas Märdian (slyon) wrote :

01-network-manager-all.yaml seems to be shipped by livecd-rootfs

Changed in netplan:
assignee: nobody → Danilo Egea Gondolfo (danilogondolfo)
tags: added: foundations-todo
removed: rls-ll-incoming
Revision history for this message
Danilo Egea Gondolfo (danilogondolfo) wrote (last edit ):

It doesn't seem to be caused by a race between networkd and NetworkManager.

I reproduced the issue with qemu here and I see the name resolution failure happening a few seconds before NetworkManager started.

From /var/log/cloud-init.log:

2023-03-09 19:17:58,443 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed
Traceback (most recent call last):
...
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='boot.linuxgroove.com', port=80): Max retries exceeded with url: /ubuntu/23.04/meta-data (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f650b663610>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

From systemd's journal
Mar 09 19:18:01 ubuntu systemd[1]: Starting NetworkManager.service - Network Manager...

(edit)

resolved wasn't running either

2023-03-09T19:18:01.712656+00:00 ubuntu systemd[1]: Starting systemd-resolved.service - Network Name Resolution...

but it seems my instance had a nameserver:

2023-03-09 19:17:58,500 - stages.py[INFO]: Applying network configuration from initramfs bringup=True: {'config': [{'type': 'physical', 'name': 'ens3', 'subnets': [{'type': 'dhcp', 'control': 'manual', 'netmask': '255.255.255.0', 'broadcast': '10.0.2.255', 'gateway': '10.0.2.2', 'dns_nameservers': ['10.0.2.3']}], 'mac_address': '52:54:00:12:34:56'}], 'version': 1}

Revision history for this message
Danilo Egea Gondolfo (danilogondolfo) wrote :

As /etc/resolv.conf is a symlink, is it possible that the nameservers received via DHCP in the early boot stages are never stored in /etc/resolv.conf?

cloud-init tries to resolve that address before resolved is started and there is nothing at /etc/resolv.conf.

Does that make sense?
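
One way to check this hypothesis on a live system is sketched below. It is a minimal illustration, not something taken from the attached logs: `resolv_conf_status` is a hypothetical helper that reports whether the path is a symlink, whether its target exists yet, and which `nameserver` entries are present.

```python
import os


def resolv_conf_status(path="/etc/resolv.conf"):
    """Report symlink state and nameserver entries for a resolv.conf path."""
    info = {
        "path": path,
        "is_symlink": os.path.islink(path),
        "target": None,
        # os.path.isfile follows symlinks, so a dangling link yields False.
        "target_exists": os.path.isfile(path),
        "nameservers": [],
    }
    if info["is_symlink"]:
        info["target"] = os.path.realpath(path)
    if info["target_exists"]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                fields = line.split()
                if len(fields) >= 2 and fields[0] == "nameserver":
                    info["nameservers"].append(fields[1])
    return info
```

If the symlink target does not exist yet, or exists without any `nameserver` line, at the moment cloud-init runs, lookups through the libc resolver would be expected to fail in the way the traceback shows.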

Revision history for this message
Lukas Märdian (slyon) wrote :

Maybe we need some additional systemd service ordering, to make systemd-resolved start before calling into the DHCP client, so that it can properly receive the DNS servers.

Changed in netplan:
status: New → Invalid
Revision history for this message
Danilo Egea Gondolfo (danilogondolfo) wrote :
Revision history for this message
Danilo Egea Gondolfo (danilogondolfo) wrote :
Revision history for this message
Danilo Egea Gondolfo (danilogondolfo) wrote :
Revision history for this message
Danilo Egea Gondolfo (danilogondolfo) wrote :

I attached some logs that might be useful. After checking it again I realized that the syslog timestamps are a little off when compared to the systemd journal.

As shown in the timeline.txt (attached), cloud-init and systemd-resolved are starting at the same time.

So the name resolution might not be working yet when cloud-init needs it.

Revision history for this message
Nick Rosbrook (enr0n) wrote (last edit ):

It appears we have the following *ordering* relationship between systemd-resolved.service, cloud-init.service, network.target, and network-online.target:

network-online.target
|---network.target
....|---systemd-resolved.service
|---cloud-init.service

See attached graph for more details. This is consistent with the timeline shown in comment #13.

Since cloud-init.service apparently requires DNS, I would simply try to add `After=systemd-resolved.service` to `cloud-init.service`.

Revision history for this message
Nick Rosbrook (enr0n) wrote :

FWIW you can test this out by adding a drop-in config:

$ mkdir -p /etc/systemd/system/cloud-init.service.d
$ cat > /etc/systemd/system/cloud-init.service.d/10-after-systemd-resolved.conf << EOF
[Unit]
After=systemd-resolved.service
EOF

Revision history for this message
Dan Bungert (dbungert) wrote :

I played with two variations of dropping in some After= directives, but obtained similar failing results where the network state was not what cloud-init wanted at the time it started.

@Chad - what are your thoughts on some service reordering?

Revision history for this message
Chad Smith (chad.smith) wrote (last edit ):

Thank you all for your attention on this bug.

Sorry, my earlier comment on this bug was ill-informed and incorrect. I'm able to reproduce this through the server installer in qemu/kvm based installs too, so I can confirm that this isn't/wasn't NetworkManager and systemd-networkd fighting over management of the device, because server ISOs don't have network-manager installed.

Also, I am concerned with cloud-init.service being ordered specifically after systemd-resolved.service on all deployments, as we will be affecting all boots and delaying them on the systemd-resolved setup of DNS when only specific use-cases, such as NoCloudNet with an FQDN as kernel cmdline directive, may need that service to be active.
## Update: from testing in comment #20: After=systemd-resolved.service doesn't seem to add cost to the underlying boot, it just re-orders the resolved service earlier. But, even though resolved is "up" and active, it doesn't yet have the ability to resolve anything until NetworkManager-wait-online.service is complete and registers a connected NIC.

Some other datasources like GCP do rely on DNS resolution of the instance metadata service, but cloud-images inject a config into /etc/hosts to resolve that locally in absence of active DNS in early boot. Ec2 does also define instance-data:8773 as a potential fallback IMDS definition, but both IPv4 and IPv6 endpoints are defined earlier in the search order, so we never get back to that DNS lookup in all practical deployments.

## Update per comment #20, retries will work for systemd-networkd managed systems because systemd-networkd-wait-online.service happens after=network-pre.target and before=sysinit.target. Retries won't work for NetworkManager currently because NetworkManager is After=dbus.service which is After=sysinit.target

We may be able to avoid the cost of a strict `After=systemd-resolved.service` clause in cloud-init.service by adding sensible retries to the NoCloud datasource, with the following logic:

 1. Check if the seed URL's `netloc` is an IP address. If IP, no retries on failure.

 2a. When seed URL is non-IP, retry on the specific 'network resolution error' URLError raised, and retry X times for that failure mode

 - or -

 2b. When seed URL is non-IP, invoke socket.getaddrinfo to validate DNS resolution prior to attempting to download metadata; if not resolvable, retry only as long as systemd-resolved.service isn't yet active.

These retry approaches should allow us to avoid impacting typical boots on most systems, yet still support DNS-based needs for datasource detection in early boot if FQDN is used for IMDS.
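
A rough sketch of steps 1 and 2b follows. It is not cloud-init code: `seed_host_is_ip` and `wait_for_dns` are hypothetical names, the retry count and delay are placeholder values, and a bounded retry loop stands in for the "until systemd-resolved.service is active" condition. The `resolver` and `sleep` parameters exist only so the logic can be exercised without a network.

```python
import ipaddress
import socket
import time
from urllib.parse import urlsplit


def seed_host_is_ip(seed_url):
    """Step 1: return True when the seed URL host is a literal IPv4/IPv6 address."""
    host = urlsplit(seed_url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def wait_for_dns(seed_url, retries=30, delay=1.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep):
    """Step 2b: probe DNS for a non-IP seed host before fetching metadata.

    Returns True as soon as the host resolves (or is an IP literal),
    False if every attempt raises socket.gaierror.
    """
    if seed_host_is_ip(seed_url):
        return True  # IP seed: nothing to resolve, no retries needed
    parts = urlsplit(seed_url)
    if not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    for attempt in range(retries):
        try:
            resolver(parts.hostname, port)
            return True
        except socket.gaierror:
            if attempt + 1 < retries:
                sleep(delay)
    return False
```

With a seed such as `http://boot.linuxgroove.com/ubuntu/23.04/` the probe would retry while resolution fails; with an IP-based seed it returns immediately, so typical boots pay no extra cost.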

Revision history for this message
Chad Smith (chad.smith) wrote (last edit ):

I'm also going back to the tabular network state printed out by cloud-init.service, which shows that the primary network interface is still down at this point in boot, which it shouldn't be. cloud-init should be waiting for link up before starting cloud-init.service (the network boot stage). I'm working on the hypothesis that we are missing an `After=NetworkManager-wait-online.service`, which wasn't present in cloud-init because it didn't have to cope with devices not managed by systemd-networkd in ubuntu-server installs.

ci-info: | enp0s31f6 | False | . | . | . | 6c:24:08:9e:54:e6 |

Running a couple of tests in a customized Desktop Live iso installer now to confirm

# update per #20. Cloud-init.service can't be after NetworkManager-wait-online.service without introducing a systemd ordering cycle because NetworkManager.service is After=dbus.service which is After=sysinit.target. The only way currently that cloud-init.service can wait until After=NM-wait-online.service is for cloud-init.service to drop its Before=sysinit.target in desktop live installers (or NetworkManager to grow support for setup prior to dbus.service availability so it can drop the 'After=dbus.service' from systemd unit config)

Revision history for this message
Chad Smith (chad.smith) wrote :

I can confirm Dan's validation that After=systemd-resolved.service doesn't buy us anything here because, although systemd-resolved.service is up, NetworkManager.service && NetworkManager-wait-online.service don't finish bringing link up for related network devices until After=dbus.service timeframe. So blocking on systemd-resolved.service tells us only that the service is running, not that it provides useful DNS lookups.

Since dbus.service is After=sysinit.target and sysinit.target is After=cloud-init.service, we have an ordering cycle for NetworkManager that doesn't exist for systemd-networkd controlled environments. Any attempts to include After=NetworkManager-wait-online.service in the cloud-init.service definition result in ordering cycles, even if we try to add DefaultDependencies=no to both NetworkManager.service and NetworkManager-wait-online.service to prevent them from pulling in `After=sysinit.target` for ordering.

NetworkManager images (desktop) differ from cloud-init's systemd-networkd managed images (server) because systemd-networkd-wait-online.service doesn't have a strict After=dbus.service config; systemd-networkd can poll for dbus availability and use it once it's available. NetworkManager doesn't seem to have that facility, though I haven't dug deeply into NM yet.
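
The ordering cycle described in this comment can be reproduced on paper with a small graph check. The sketch below is illustrative only: the `After=` edges are the ones named in this thread (plus NetworkManager-wait-online.service being ordered after NetworkManager.service), not a dump of the real unit files, and `find_cycle` is a hypothetical helper using a plain depth-first search.

```python
def find_cycle(after):
    """Return one ordering cycle as a list of units, or None if there is none.

    `after` maps a unit to the units it is ordered After=.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}
    stack = []

    def visit(unit):
        state[unit] = GREY
        stack.append(unit)
        for dep in after.get(unit, ()):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path: we closed a loop.
                return stack[stack.index(dep):] + [dep]
            if dep_state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        state[unit] = BLACK
        return None

    for unit in list(after):
        if state.get(unit, WHITE) == WHITE:
            cycle = visit(unit)
            if cycle:
                return cycle
    return None


# Edges as described in this thread; Before=sysinit.target on
# cloud-init.service is expressed as sysinit.target After=cloud-init.service.
stock = {
    "sysinit.target": ["cloud-init.service"],
    "dbus.service": ["sysinit.target"],
    "NetworkManager.service": ["dbus.service"],
    "NetworkManager-wait-online.service": ["NetworkManager.service"],
    "cloud-init.service": [],
}

# Same graph with the attempted After=NetworkManager-wait-online.service.
attempted = dict(stock)
attempted["cloud-init.service"] = ["NetworkManager-wait-online.service"]
```

`find_cycle(stock)` finds nothing, while `find_cycle(attempted)` returns a loop that passes through sysinit.target, which is why the extra `After=` cannot be added while `Before=sysinit.target` remains.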

Revision history for this message
Chad Smith (chad.smith) wrote :
TLDR: My earlier suggestion to retry in cloud-init.service while awaiting DNS is not viable when thinking about NetworkManager. NetworkManager.service is After=sysinit.target due to After=dbus.service and cloud-init.service is Before=sysinit.target. NetworkManager is the only service bringing up the primary NIC in desktop images which gives cloud-init access to a functional DNS. Unless we can move NetworkManager.service before sysinit.target, I don't think we can leverage DNS.

In review of a KVM live desktop boot in which cloud-init.service defines After=systemd-resolved.service, we can see that cloud-init.service still blocks the start of NetworkManager (due to After=sysinit.target) and DNS is not active until enp1s0 is actually brought up by NetworkManager.

Here are snippets of the journalctl logs on a local KVM boot where we can see systemd-resolved coming up, then cloud-init.service with 30 retries and finally NetworkManager.service starting after cloud-init.service failed to download the metadata due to DNS resolution errors:

1. systemd-resolved "starts" @22:14:42.302181, which unblocks cloud-init.service
2. cloud-init.service @22:14:42.690949 (which emits that enp1s0's link is not actually up yet so no viable DNS at that time)
3. NetworkManager.service starts up @22:15:16.012686, only after cloud-init.service finishes 30 seconds of retries
4. systemd-resolved finally getting a viable DNS route through enp1s0 @22:15:19.630090 ubuntu systemd-resolved[1136]: enp1s0: Bus client set DNS server list to: 192.168.122.1
5. Network manager finally sees enp1s0 device activated @22:15:19.632846
6. NetworkManager-wait-online.service finally gets to CONNECTED status @22:15:19.637543
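
To make the gaps in this timeline concrete, the timestamps above can be subtracted directly. This is only a small worked calculation; the dictionary keys are labels invented here, and the values are the times quoted in the list.

```python
from datetime import datetime

# Timestamps copied from the numbered timeline above (same day, Mar 14).
EVENTS = {
    "resolved_started": "22:14:42.302181",
    "cloud_init_started": "22:14:42.690949",
    "networkmanager_starting": "22:15:16.012686",
    "dns_server_set": "22:15:19.630090",
    "device_activated": "22:15:19.632846",
    "wait_online_connected": "22:15:19.637543",
}


def seconds_between(start, end):
    """Return the elapsed seconds between two HH:MM:SS.ffffff timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


blocked = seconds_between(EVENTS["cloud_init_started"],
                          EVENTS["networkmanager_starting"])
until_dns = seconds_between(EVENTS["resolved_started"],
                            EVENTS["dns_server_set"])
```

Here `blocked` comes to roughly 33 seconds between cloud-init.service starting and NetworkManager.service starting, and `until_dns` to roughly 37 seconds between systemd-resolved reporting "started" and it actually receiving a DNS server for enp1s0.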

--- journalctl -b 0 -o short-precise | egrep 'enp1s0|resolved|NetworkManager|ci-info'
Mar 14 22:14:39.167190 ubuntu kernel: virtio_net virtio0 enp1s0: renamed from eth0
Mar 14 22:14:41.451373 ubuntu systemd[1]: Starting systemd-resolved.service - Network Name Resolution...
Mar 14 22:14:42.253268 ubuntu systemd-resolved[1136]: Positive Trust Anchors:
Mar 14 22:14:42.253583 ubuntu systemd-resolved[1136]: . IN DS 20326 8 2 e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
Mar 14 22:14:42.253685 ubuntu systemd-resolved[1136]: Negative trust anchors: home.arpa 10.in-addr.arpa 16.172.in-addr.arpa 17.172.in-addr.arpa 18.172.in-addr.arpa 19.172.in-addr.arpa 20.172.in-addr.arpa 21.172.in-addr.arpa 22.172.in-addr.arpa 23.172.in-addr.arpa 24.172.in-addr.arpa 25.172.in-addr.arpa 26.172.in-addr.arpa 27.172.in-addr.arpa 28.172.in-addr.arpa 29.172.in-addr.arpa 30.172.in-addr.arpa 31.172.in-addr.arpa 168.192.in-addr.arpa d.f.ip6.arpa corp home internal intranet lan local private test
Mar 14 22:14:42.300435 ubuntu systemd-resolved[1136]: Using system hostname 'ubuntu'.
Mar 14 22:14:42.302181 ubuntu systemd[1]: Started systemd-resolved.service - Network Name Resolution.
### cloud-init noticing enp1s0 has no link
Mar 14 22:14:42.690949 ubuntu cloud-init[1341]: ci-info: | enp1s0 | False | . | . | . | 52:54:00:5b:ba:d5 |
Mar 14 22:15:16.012686 ubuntu systemd[1]: Starting NetworkManager.service - Network Manager...
Mar 14 22:...

Revision history for this message
Chad Smith (chad.smith) wrote :

This issue with NetworkManager.service and systemd is reminiscent of the related feature request against systemd-networkd https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1636912 that also points out the ordering issues.

Revision history for this message
Chad Smith (chad.smith) wrote (last edit ):

While retries on DNS resolution failures do not work for cloud-init.service in environments running NetworkManager.service (desktop live ISOs), I have found that the retries work where systemd-networkd is bringing up network config (server live ISOs). This is because cloud-init.service is After=systemd-networkd-wait-online.service, which provides systemd-resolved with viable network interfaces that are up and providing access to functional DNS configuration.

Revision history for this message
Chad Smith (chad.smith) wrote :

Short-term solution here will likely be for livecd-rootfs to adjust cloud-init.service so that it drops `Before=sysinit.target` and adds an `After=NetworkManager-wait-online.service`. This is quite a bit like what Redhat and derivatives have done for a while: they have an `After=NetworkManager.service` in their cloud-init.service config and no `Before=sysinit.target`. Suse injects an `After=dbus.service`, which also puts it at the same boot timeframe as NetworkManager-based environments.[1]

It does mean that cloud-init datasource gets detected a couple seconds later in boot, meaning that any service depending on `/run/cloud-init/instance-data.json` will also get delayed a couple of seconds, but doesn't delay overall boot process.

Long-term solution may be to see if we can improve NetworkManager dependency on dbus.service so that it could support late-bind interaction once dbus.socket comes online.

References:
[1] upstream suse/redhat cloud-init.service configuration NetworkManager/dbus ordering https://github.com/canonical/cloud-init/blob/main/systemd/cloud-init.service.tmpl#L19-L27

Revision history for this message
Chad Smith (chad.smith) wrote :

Short-term proposal update:
   Looks like we can't get away with supplemental systemd drop-ins in /etc/systemd/system/cloud-init.service.d by themselves in livecd-rootfs, because we need to remove the "Before=sysinit.target" from the default shipped cloud-init.service config. Systemd drop-ins can only augment or add to this configuration, so we need to replace the entire [Unit] definition in /lib/systemd/system/cloud-init.service in order to remove the Before=sysinit.target; otherwise it will still conflict with NetworkManager.service ordering on After=dbus.service.
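
As an illustration of what "replacing the [Unit] definition" amounts to, the sketch below rewrites unit file text by removing `sysinit.target` from any `Before=` line and adding the new `After=` ordering. It is not the actual livecd-rootfs change: `reorder_cloud_init_unit` is a hypothetical helper and `SAMPLE_UNIT` is a made-up, shortened unit rather than the shipped cloud-init.service.

```python
def reorder_cloud_init_unit(unit_text,
                            drop_before="sysinit.target",
                            add_after=("NetworkManager-wait-online.service",)):
    """Return unit text without `drop_before` in Before= and with extra After= lines."""
    out = []
    section = None
    for line in unit_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped
            out.append(line)
            if section == "[Unit]":
                # Add the new ordering right after the section header.
                out.extend(f"After={unit}" for unit in add_after)
            continue
        if section == "[Unit]" and stripped.startswith("Before="):
            kept = [u for u in stripped[len("Before="):].split()
                    if u != drop_before]
            if kept:
                out.append("Before=" + " ".join(kept))
            continue  # a Before= line holding only drop_before is removed
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical, shortened unit used only to exercise the helper.
SAMPLE_UNIT = """\
[Unit]
Description=Example unit for illustration
After=cloud-init-local.service
Before=network-online.target
Before=sysinit.target
Before=sshd.service shutdown.target

[Service]
Type=oneshot
"""

rewritten = reorder_cloud_init_unit(SAMPLE_UNIT)
```

The full replacement file, rather than a drop-in, is what makes the removal possible, since the shipped `Before=sysinit.target` line is no longer read at all.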

Revision history for this message
Chad Smith (chad.smith) wrote :
Changed in cloud-init:
assignee: nobody → Chad Smith (chad.smith)
Chad Smith (chad.smith)
Changed in cloud-init:
status: Triaged → In Progress
Revision history for this message
Dan Bungert (dbungert) wrote :

What a mess!

I have uploaded the livecd-rootfs change proposed by Chad in #26. Note that there is another problem around jsonschema exposed by this that is in progress.

Changed in livecd-rootfs (Ubuntu):
status: New → Fix Committed
Revision history for this message
Chad Smith (chad.smith) wrote :

jsonschema 2.6.0 is now dropped from latest ubuntu-desktop-snap version 0+get.98600e08 in stable channel as of a couple hours ago per this issue/fix[1]. This resolves issues with curtin Tracebacks on jsonschema.validate(). Additionally cloud-init has a PR in progress[2] to avoid registering a strict draft4 schema with additionalProperties=False. The cloud-init fix is now unnecessary given the changes already published in ubuntu-desktop-installer to drop jsonschema 2.6.0 from the snap.

[1] drop duplicate python dependencies from site-packages https://github.com/canonical/ubuntu-desktop-installer/issues/1714

[2] https://github.com/canonical/cloud-init/pull/2098

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package livecd-rootfs - 2.817

---------------
livecd-rootfs (2.817) lunar; urgency=medium

  [ John Chittum ]
  * revert ipc change. kernel 6.2 will have the correct setting

 -- Steve Langasek <email address hidden> Mon, 27 Mar 2023 12:11:06 -0700

Changed in livecd-rootfs (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Chad Smith (chad.smith) wrote :

I can now confirm successful autoinstall runs with FQDN in kernel commandline in Desktop live installer ISOs dated 20230403. This allows cloud-init.service to be ordered `After=NetworkManager.service NetworkManager-wait-online.service` which ensures devices and resolved are both 'up' and active by the time cloud-init tries to download remote user-data/meta-data from a seedurl.

$ cat /var/log/installer/media-info # also found in /cdrom/.disk/info in ephemeral environment
Ubuntu 23.04 "Lunar Lobster" - Daily amd64 (20230403)

$ # installer version of the snap
2023-04-03 15:42:37,497 INFO subiquity:163 Starting Subiquity server revision 907 of snap /snap/ubuntu-desktop-installer/907

The correct systemd service ordering for cloud-init.service in Desktop live installer builds dated 20230403, which places cloud-init.service `After=NetworkManager.service NetworkManager-wait-online.service`, guarantees that the network is up before cloud-init datasource discovery runs. That also implies systemd-resolved has started and has adequate connectivity to resolve FQDNs on any NetworkManager discovered NICs.

ubuntu@ubuntu:~$ systemctl show -p After,Before cloud-init.service --no-pager
Before=sshd-keygen.service cloud-config.target network-online.target sshd.service shutdown.target systemd-user-sessions.service
After=cloud-init-local.service NetworkManager.service system.slice networking.service systemd-journald.socket systemd-networkd-wait-online.service NetworkManager-wait-online.service
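
A small sketch of checking that output programmatically is shown below. `parse_unit_properties` is a hypothetical helper; `SHOW_OUTPUT` is the `systemctl show` output pasted above, embedded as a string so the check does not depend on running systemctl.

```python
def parse_unit_properties(show_output):
    """Parse `systemctl show -p ...` KEY=VALUE lines into {key: [units]}."""
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.split()
    return props


# Output copied from the systemctl show command in this comment.
SHOW_OUTPUT = """\
Before=sshd-keygen.service cloud-config.target network-online.target sshd.service shutdown.target systemd-user-sessions.service
After=cloud-init-local.service NetworkManager.service system.slice networking.service systemd-journald.socket systemd-networkd-wait-online.service NetworkManager-wait-online.service
"""

props = parse_unit_properties(SHOW_OUTPUT)
ordered_after_nm = {"NetworkManager.service",
                    "NetworkManager-wait-online.service"} <= set(props["After"])
still_before_sysinit = "sysinit.target" in props["Before"]
```

For the output above, `ordered_after_nm` is true and `still_before_sysinit` is false, matching the intended ordering.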

This allows cloud-init to download remote user-data from an FQDN provided to the live desktop installer via the kernel parameter: `ds=nocloud-net;s=http://YOUR-DOMAIN/`

So, FQDN lookup seems to be resolved by the systemd service ordering after NetworkManager is up and functional.

There may be a secondary issue to file related to environments with nameservers being specifically provided for pxe-based installs after cloud-init properly downloads remote user-data from a remote FQDN but ordering of systemd network configuration seems to alleviate the DNS resolution aspect pointed to in this bug.

Revision history for this message
Chad Smith (chad.smith) wrote :

I can confirm, by manually launching images via kvm in virt-manager, that live desktop image builds as of 20230403 (per `/cdrom/.disk/info`) have the proper systemd service ordering, which places cloud-init.service `After=NetworkManager.service NetworkManager-wait-online.service`. This allows cloud-init to resolve DNS on Ubuntu ISOs where NetworkManager is the primary network backend.

We also found a secondary bug not related to the specific NetworkManager DNS issue: once cloud-init renders initial network config to detect the datasource, it writes direct network configuration to /etc/NetworkManager/system-connections. If networking changes are provided in autoinstall.network, ubuntu-desktop-installer (via subiquity) writes that network config to /etc/netplan/00-installer.yaml and invokes 'netplan apply'. This results in collisions in NetworkManager configuration, as netplan isn't aware of cloud-init's direct config in /etc/NetworkManager/system-connections/cloud-init-<device>.nmconnection.

https://bugs.launchpad.net/cloud-init/+bug/2015605

Revision history for this message
Chad Smith (chad.smith) wrote (last edit ):

Closing the cloud-init task here as 'fix released' because livecd-rootfs overlay config files now allow cloud-init.service to be ordered After=NetworkManager.service, which solves the immediate DNS issues in early boot. Long-term, cloud-init will need to spec out options to prefer ordering after NetworkManager versus systemd-networkd at systemd generator timeframe, because ordering After=NetworkManager is incompatible with cloud-init's default Before=sysinit.target.

We'll take that long-term work as a separate bug for cloud-init https://bugs.launchpad.net/cloud-init/+bug/2015949 to discern how best to position upstream cloud-init.service files to cope with service ordering conflicts to prefer NetworkManager.service over systemd-networkd.

Changed in cloud-init:
status: In Progress → Fix Released
Revision history for this message
James Falcon (falcojr) wrote :
Revision history for this message
Nick Rosbrook (enr0n) wrote :

My understanding from a quick read is that there is nothing to do in systemd. Please re-open if I am mistaken.

Changed in systemd (Ubuntu):
status: New → Invalid
Dan Bungert (dbungert)
Changed in subiquity:
status: New → Invalid
Benjamin Drung (bdrung)
tags: removed: foundations-todo