Bug #2008952 "DNS failure while trying to fetch user-data" : Bugs : cloud-init

Chad Smith (chad.smith) on 2023-03-02

Changed in cloud-init:
status:	New → Triaged
importance:	Undecided → Medium

Chad Smith (chad.smith) wrote on 2023-03-07:

#1

There are a couple of significant issues that may be leading to this symptom:

1. On Desktop live server images, something is laying down a competing file for management of the wired NIC on my laptop:

/etc/netplan/01-network-manager-all.yaml
# Let NetworkManager manage all devices on this system
network:
  version: 2
  renderer: NetworkManager

Yet /etc/cloud/cloud.cfg specifies a priority order of network renderers that prefers netplan instead of NetworkManager:
   network:
       renderers: ['netplan', 'eni', 'sysconfig']
       activators: ['netplan', 'eni', 'network-manager', 'networkd']

This causes cloud-init to also write out /etc/netplan/50-cloud-init.yaml which also tries to claim management of enp0s31f6 under systemd-networkd:

network:
  version: 2
  ethernets:
    enp0s21f6:
      dhcp4: true
      match:
        macaddress: 6c:24:08:9e:54:e7
      set-name: enp0s31f6

The result is nondeterministic behavior in early boot as NetworkManager and systemd-networkd fight over who owns the wired device. This could be the cause of the DNS lookup failures we are seeing.

Resolution: I think desktop images want to avoid telling cloud-init to render networking using netplan. I believe we are missing this proposed fix to livecd-rootfs:

https://code.launchpad.net/~dbungert/livecd-rootfs/+git/livecd-rootfs/+merge/427445

Chad Smith (chad.smith) wrote on 2023-03-07:

#2

journalctl on a failed laptop is showing a lot of throttled log messages from NetworkManager such as:

state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')

This makes me believe that networkd and NetworkManager are attempting to grab management of the NIC multiple times throughout boot, thereby taking the network and carrier down.

I see cloud-init reflecting the 'down' carrier state of enp0s31f6 in the 'init' timeframe too:

Cloud-init v. 23.1.1-0ubuntu1 running 'init' at Tue, 07 Mar 2023 03:17:56 +0000. Up 62.47 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: |   Device  |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enp0s31f6 | False |     .     |     .     |   .   | 6c:24:08:9e:54:e6 |
ci-info: |     lo    |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |     lo    |  True |  ::1/128  |     .     |  host |         .         |
ci-info: | wlp0s20f3 | False |     .     |     .     |   .   | 38:7a:0e:2d:d0:bb |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+

2. The second interesting configuration 'error' to be aware of is that netplan doesn't like the 644 permissions on the rendered /etc/netplan/01-network-manager-all.yaml.

** (process:2015): WARNING **: 03:18:08.594: Permissions for /etc/netplan/01-network-manager-all.yaml are too open. Netplan configuration should NOT be accessible by others.
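
To check a file for the kind of permissions this warning complains about, a minimal Python sketch follows. It assumes that "too open" means any group or other permission bit is set; the exact bits netplan tests are not stated in this thread, and `accessible_by_others` is a hypothetical helper name. The demo uses a throwaway temporary file rather than a real netplan config.

```python
import os
import stat
import tempfile


def accessible_by_others(path: str) -> bool:
    """Return True if any group or other permission bit is set on path.

    Assumption: this approximates the netplan "too open" check; the real
    check may differ.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


# Demonstrate on a temporary file, first with 644 and then with 600.
with tempfile.NamedTemporaryFile(suffix=".yaml") as tmp:
    os.chmod(tmp.name, 0o644)
    open_with_644 = accessible_by_others(tmp.name)
    os.chmod(tmp.name, 0o600)
    open_with_600 = accessible_by_others(tmp.name)
```

With mode 644 the helper reports the file as accessible by others; with mode 600 it does not.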

Full log paste: https://dpaste.com/APGZH46PN

Dan Bungert (dbungert) wrote on 2023-03-07:

#3

Looking at the desktop ISOs, while the system is configured with Netplan to use NetworkManager, there seems to be some conflict here where systemd-networkd is trying to interact with the device. Some feedback would be appreciated. The practical result is that early boot cloud-init is not able to fetch data that it should be able to retrieve.

Dan Bungert (dbungert) on 2023-03-07

tags:

added: rls-ll-incoming

Lukas Märdian (slyon) wrote on 2023-03-08:

#4

I disagree with comment #1

/etc/netplan/01-network-manager-all.yaml and /etc/netplan/50-cloud-init.yaml should not be in conflict with each other, but will be merged by Netplan to produce the following configuration:

```yaml
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp0s21f6:
      dhcp4: true
      match:
        macaddress: 6c:24:08:9e:54:e7
      set-name: enp0s31f6

```

Which should render its config in /run/NetworkManager/system-connections instead of /run/systemd/network (which would be the case if the global renderer setting wasn't changed).
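
The merge described above can be illustrated with a small Python sketch. This is not Netplan's actual merge code: it is a plain recursive dictionary merge, assuming the two files are applied in lexical filename order and that later values override earlier ones. The `merge` helper is hypothetical, and the dictionaries mirror the two YAML files quoted in this thread.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; later values win."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Contents of 01-network-manager-all.yaml
nm_all = {"network": {"version": 2, "renderer": "NetworkManager"}}

# Contents of 50-cloud-init.yaml
cloud_init = {
    "network": {
        "version": 2,
        "ethernets": {
            "enp0s21f6": {
                "dhcp4": True,
                "match": {"macaddress": "6c:24:08:9e:54:e7"},
                "set-name": "enp0s31f6",
            }
        },
    }
}

# Assumed order: 01-... is read first, then 50-...
merged = merge(nm_all, cloud_init)
```

In the merged result the global `renderer` from the first file is still present alongside the `ethernets` section from the second, which is why the interface would be rendered for NetworkManager rather than systemd-networkd.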

So I wonder what is the exact sequence of events for running the Netplan generator (by systemd), installing the 50-cloud-init.yaml file (by cloud-init), installing the 01-network-manager-all.yaml file (by the installer?), executing `netplan apply` (by cloud-init).

Also, what are the artifacts generated by Netplan in /run/systemd/network/ and /run/NetworkManager/system-connections/ ?

I assume some kind of race condition, where:
* cloud-init installs 50-cloud-init.yaml
* the Netplan generator is run (as a systemd generator during early boot)
* systemd-networkd is now controlling the interface
* the installer puts the 01-network-manager-all.yaml file in place
* cloud-init calls `netplan apply` at runtime
* the network configuration is changed and NetworkManager is supposed to take over control
* interfaces are still managed by networkd from the earlier stage, and therefore get into conflict

Dan Bungert (dbungert) wrote on 2023-03-08:

#5

So while this bug talks about netboot, I think it isn't necessary to involve that for testing purposes. I suggest simply booting the desktop daily-live with the nocloud kernel command line listed above and seeing how things go. Are the various network tools doing what we expect?

Note that gnome-boxes is a convenient tool for running this ISO in a VM; it seems to have a better video driver than what kvm would suggest by default. It's still possible to give it a kernel command line with config like:
    <kernel>/srv/iso/lunar/vmlinuz</kernel>
    <initrd>/srv/iso/lunar/initrd</initrd>
    <cmdline>autoinstall layerfs-path=minimal.standard.live.squashfs ds=nocloud-net;s=http://boot.linuxgroove.com/ubuntu/23.04/</cmdline>

Dan Bungert (dbungert) wrote on 2023-03-08:

#6

> So I wonder what is the exact sequence of events for running the Netplan generator (by systemd), installing the 50-cloud-init.yaml file (by cloud-init), installing the 01-network-manager-all.yaml file (by the installer?), executing `netplan apply` (by cloud-init).

By the time Subiquity has started, the bad interaction has already taken place.
In this nocloud case, cloud-init should have been able to retrieve the user-data and other things, but that failed. So at Subiquity start time, we ask for the autoinstall config and get an empty answer. An empty answer is quite common (it is what happens in a normal interactive install), so it's not immediately obvious that a misbehavior has taken place.

Lukas Märdian (slyon) wrote on 2023-03-09:

#7

01-network-manager-all.yaml seems to be shipped by livecd-rootfs

Changed in netplan:
assignee:	nobody → Danilo Egea Gondolfo (danilogondolfo)
tags:	added: foundations-todo removed: rls-ll-incoming

Danilo Egea Gondolfo (danilogondolfo) wrote on 2023-03-09 (last edit on 2023-03-09):

#8

It doesn't seem to be caused by a race between networkd and NetworkManager.

I reproduced the issue with qemu here and I see the name resolution failure happening a few seconds before NetworkManager started.

From /var/log/cloud-init.log:

2023-03-09 19:17:58,443 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed
Traceback (most recent call last):
...
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='boot.linuxgroove.com', port=80): Max retries exceeded with url: /ubuntu/23.04/meta-data (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f650b663610>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

From systemd's journal
Mar 09 19:18:01 ubuntu systemd[1]: Starting NetworkManager.service - Network Manager...

(edit)

systemd-resolved wasn't running yet either:

2023-03-09T19:18:01.712656+00:00 ubuntu systemd[1]: Starting systemd-resolved.service - Network Name Resolution...

but it seems my instance had a nameserver:

2023-03-09 19:17:58,500 - stages.py[INFO]: Applying network configuration from initramfs bringup=True: {'config': [{'type': 'physical', 'name': 'ens3', 'subnets': [{'type': 'dhcp', 'control': 'manual', 'netmask': '255.255.255.0', 'broadcast': '10.0.2.255', 'gateway': '10.0.2.2', 'dns_nameservers': ['10.0.2.3']}], 'mac_address': '52:54:00:12:34:56'}], 'version': 1}

Danilo Egea Gondolfo (danilogondolfo) wrote on 2023-03-09:

#9

As /etc/resolv.conf is a symlink, is it possible that the nameservers received via DHCP in the early boot stages are never stored in /etc/resolv.conf?

cloud-init tries to resolve that address before resolved is started and there is nothing at /etc/resolv.conf.

Does that make sense?
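
One way to test this hypothesis on an affected boot is to inspect the file directly. The sketch below is a hypothetical diagnostic helper (`describe_resolv_conf` is not part of cloud-init): it reports whether the path is a symlink, where it resolves to, and which `nameserver` entries are readable. The demo runs against throwaway files, not the real /etc/resolv.conf, and first exercises the dangling-symlink case.

```python
import os
import tempfile


def describe_resolv_conf(path: str = "/etc/resolv.conf") -> dict:
    """Report symlink status, resolved target, and nameserver entries."""
    info = {"is_symlink": os.path.islink(path), "target": None, "nameservers": []}
    if info["is_symlink"]:
        info["target"] = os.path.realpath(path)
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                fields = line.split()
                if len(fields) >= 2 and fields[0] == "nameserver":
                    info["nameservers"].append(fields[1])
    except OSError:
        # Dangling symlink or missing file: no nameservers are available.
        pass
    return info


# Demo: a symlink whose target does not exist yet, then the same symlink
# once the target has been written.
with tempfile.TemporaryDirectory() as tmpdir:
    stub = os.path.join(tmpdir, "stub-resolv.conf")
    link = os.path.join(tmpdir, "resolv.conf")
    os.symlink(stub, link)
    before = describe_resolv_conf(link)
    with open(stub, "w", encoding="utf-8") as fh:
        fh.write("nameserver 10.0.2.3\n")
    after = describe_resolv_conf(link)
```

If the real file looks like `before` at the moment cloud-init runs, the resolver has nothing to query, no matter which nameservers DHCP handed out earlier.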

Lukas Märdian (slyon) wrote on 2023-03-14:

#10

Maybe we need some additional systemd service ordering, to make systemd-resolved start before calling into the DHCP client, so that it can properly receive the DNS servers.

Danilo Egea Gondolfo (danilogondolfo) on 2023-03-14

Changed in netplan:
status:	New → Invalid

Danilo Egea Gondolfo (danilogondolfo) wrote on 2023-03-14:

#11

systemd_journal.txt (159.1 KiB, text/plain)

Danilo Egea Gondolfo (danilogondolfo) wrote on 2023-03-14:

#12

cloud-init.log (99.7 KiB, text/plain)

Danilo Egea Gondolfo (danilogondolfo) wrote on 2023-03-14:

#13

timeline.txt (6.0 KiB, text/plain)

Danilo Egea Gondolfo (danilogondolfo) wrote on 2023-03-14:

#14

I attached some logs that might be useful. After checking it again I realized that the syslog timestamps are a little off when compared to the systemd journal.

As shown in the timeline.txt (attached), cloud-init and systemd-resolved are starting at the same time.

So the name resolution might not be working yet when cloud-init needs it.

Nick Rosbrook (enr0n) wrote on 2023-03-14 (last edit on 2023-03-14):

#15

systemd-analyze-dot.svg (34.2 KiB, image/svg+xml)

It appears we have the following *ordering* relationship between systemd-resolved.service, cloud-init.service, network.target, and network-online.target:

network-online.target
|---network.target
....|---systemd-resolved.service
|---cloud-init.service

See attached graph for more details. This is consistent with the timeline shown in comment #13.

Since cloud-init.service apparently requires DNS, I would simply try to add `After=systemd-resolved.service` to `cloud-init.service`.

Nick Rosbrook (enr0n) wrote on 2023-03-14:

#16

FWIW you can test this out by adding a drop-in config:

$ cat > /etc/systemd/system/cloud-init.service.d/10-after-systemd-resolved.conf << EOF
[Unit]
After=systemd-resolved.service
EOF

Dan Bungert (dbungert) wrote on 2023-03-14:

#17

I played with two variations of dropping in some After= directives, but obtained similar failing results where the network state was not what cloud-init wanted at the time it started.

@Chad - what are your thoughts on some service reordering?

Chad Smith (chad.smith) wrote on 2023-03-14 (last edit on 2023-03-15):

#18

Thank you all for your attention on this bug.

Sorry, my earlier comment on this bug was ill-informed and incorrect. I'm able to reproduce this through server installer qemu/kvm-based installs as well, so I can confirm that this isn't/wasn't NetworkManager and systemd-networkd fighting over management of the device, because server ISOs don't have network-manager installed.

Also, I am concerned about cloud-init.service being ordered specifically after systemd-resolved.service on all deployments, as we would be affecting all boots and delaying them on the systemd-resolved setup of DNS when only specific use-cases, such as NoCloudNet with an FQDN in the kernel cmdline directive, may need that service to be active.
## Update: from testing in comment #20: ordering after systemd-resolved.service doesn't seem to add cost to the underlying boot; it just orders the resolved service earlier. But even though resolved is "up" and active, it can't resolve anything until NetworkManager-wait-online.service completes and registers a connected NIC.

Some other datasources, like GCP, do rely on DNS resolution of the instance metadata service, but cloud images inject a config into /etc/hosts to resolve that locally in the absence of active DNS in early boot. Ec2 also defines instance-data:8773 as a potential fallback IMDS endpoint, but both IPv4 and IPv6 endpoints are defined earlier in the search order, so in practical deployments we never fall back to that DNS lookup.

## Update per comment #20, retries will work for systemd-networkd managed systems because systemd-networkd-wait-online.service happens after=network-pre.target and before=sysinit.target. Retries won't work for NetworkManager currently because NetworkManager is After=dbus.service which is After=sysinit.target

We may be able to avoid the cost of a strict `After=systemd-resolved.service` clause in cloud-init.service by adding sensible retries to the NoCloud datasource:

1. Check if the seed URL's `netloc` is an IP address. If it is, do not retry on failure.

2a. When the seed URL is non-IP, retry X times when the specific 'name resolution error' UrlError is raised.

- or -

2b. When the seed URL is non-IP, invoke socket.getaddrinfo to validate DNS resolution prior to attempting to download metadata; if not resolvable, retry only as long as systemd-resolved.service isn't yet active.

These retry approaches should allow us to avoid impacting typical boots on most systems, yet still support DNS-based needs for datasource detection in early boot if FQDN is used for IMDS.
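
A rough Python sketch of steps 1 and 2b follows. It is not cloud-init's implementation: `host_is_ip` and `wait_for_dns` are hypothetical names, the retry count and delay are placeholder values, and the "retry only while systemd-resolved.service isn't active" condition is simplified to a fixed number of attempts. The resolver and sleep functions are injectable so the demo can simulate a DNS service that only starts answering on the third attempt, without touching the network.

```python
import ipaddress
import socket
import time
from urllib.parse import urlsplit


def host_is_ip(url: str) -> bool:
    """Step 1: return True if the URL's host is a literal IPv4/IPv6 address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def wait_for_dns(url, retries=30, delay=1.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep) -> bool:
    """Step 2b: block until the URL's host resolves, or retries run out."""
    if host_is_ip(url):
        return True  # nothing to resolve, so no retries are needed
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    for attempt in range(retries):
        try:
            resolver(parts.hostname, port)
            return True
        except socket.gaierror:
            if attempt < retries - 1:
                sleep(delay)
    return False


# Demo with a fake resolver that fails twice, then succeeds.
calls = []


def flaky_resolver(host, port):
    calls.append(host)
    if len(calls) < 3:
        raise socket.gaierror(socket.EAI_AGAIN, "Temporary failure in name resolution")
    return []


resolved = wait_for_dns(
    "http://boot.linuxgroove.com/ubuntu/23.04/",
    resolver=flaky_resolver,
    sleep=lambda _seconds: None,
)
```

In the demo the fake resolver is called three times before `wait_for_dns` returns True; a seed URL with a literal IP host returns immediately without calling the resolver at all.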

Chad Smith (chad.smith) wrote on 2023-03-14 (last edit on 2023-03-18):

#19

I'm also going back to the tabular network state printed by cloud-init.service, which shows that the primary network interface is still down at this point in boot, which it shouldn't be. cloud-init should be waiting for link up before starting cloud-init.service (network boot stage). I'm working on the hypothesis that we are missing an `After=NetworkManager-wait-online.service`, which wasn't present in cloud-init because it didn't have to cope with devices not managed by systemd-networkd in ubuntu-server installs.

ci-info: | enp0s31f6 | False | . | . | . | 6c:24:08:9e:54:e6 |

Running a couple of tests in a customized Desktop Live iso installer now to confirm

# update per #20: cloud-init.service can't be ordered after NetworkManager-wait-online.service without introducing a systemd ordering cycle, because NetworkManager.service is After=dbus.service, which is After=sysinit.target. The only way currently for cloud-init.service to wait until After=NetworkManager-wait-online.service is for cloud-init.service to drop its Before=sysinit.target in desktop live installers (or for NetworkManager to grow support for setup prior to dbus.service availability so it can drop 'After=dbus.service' from its systemd unit config).

Chad Smith (chad.smith) wrote on 2023-03-14:

#20

I can confirm Dan's validation that After=systemd-resolved.service doesn't buy us anything here because, although systemd-resolved.service is up, NetworkManager.service && NetworkManager-wait-online.service don't finish bringing link up for related network devices until After=dbus.service timeframe. So blocking on systemd-resolved.service tells us only that the service is running, not that it provides useful DNS lookups.

Since dbus.service is After=sysinit.target and sysinit.target is After=cloud-init.service, we have an ordering cycle for NetworkManager that doesn't exist for systemd-networkd controlled environments. Any attempt to include After=NetworkManager-wait-online.service in the cloud-init.service definition results in an ordering cycle, even if we add DefaultDependencies=no to both NetworkManager.service and NetworkManager-wait-online.service to prevent them from pulling in `After=sysinit.target` for ordering.

NetworkManager images (desktop) differ from cloud-init's systemd-networkd managed images (server) because systemd-networkd-wait-online.service doesn't have a strict After=dbus.service config; systemd-networkd can poll for dbus availability and use it once it's available. It doesn't seem that NetworkManager has that facility, though I haven't dug deeply into NM yet.
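
The ordering cycle described in this comment can be reproduced in miniature with Python's standard-library `graphlib` (Python 3.9+). This is only a model: the dependency map contains just the After= edges named in this thread, not the full unit graph systemd computes, and `has_ordering_cycle` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Map each unit to the set of units it is ordered After=.
after = {
    # cloud-init.service is Before=sysinit.target
    "sysinit.target": {"cloud-init.service"},
    "dbus.service": {"sysinit.target"},
    "NetworkManager.service": {"dbus.service"},
    "NetworkManager-wait-online.service": {"NetworkManager.service"},
}


def has_ordering_cycle(graph: dict) -> bool:
    """Return True if no valid start order exists for the given graph."""
    try:
        # static_order() is lazy, so consume it to trigger cycle detection.
        tuple(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True


cycle_today = has_ordering_cycle(after)

# Add the proposed After=NetworkManager-wait-online.service to cloud-init.
proposed = {**after, "cloud-init.service": {"NetworkManager-wait-online.service"}}
cycle_with_proposal = has_ordering_cycle(proposed)
```

The existing edges sort cleanly, while adding the proposed edge closes the loop cloud-init.service → sysinit.target → dbus.service → NetworkManager.service → NetworkManager-wait-online.service → cloud-init.service.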

Chad Smith (chad.smith) wrote on 2023-03-14:

#21

TLDR: My earlier suggestion to retry in cloud-init.service while awaiting DNS is not viable when thinking about NetworkManager. NetworkManager.service is After=sysinit.target due to After=dbus.service, and cloud-init.service is Before=sysinit.target. NetworkManager is the only service bringing up the primary NIC in desktop images, which is what gives cloud-init access to a functional DNS. Unless we can move NetworkManager.service before sysinit.target, I don't think we can leverage DNS.

In review of a KVM live desktop boot in which cloud-init.service defines After=systemd-resolved.service, we can see that cloud-init.service still blocks the start of NetworkManager (due to After=sysinit.target) and DNS is not active until enp1s0 is actually brought up by NetworkManager.

Here are snippets of the journalctl logs on a local KVM boot where we can see systemd-resolved coming up, then cloud-init.service with 30 retries and finally NetworkManager.service starting after cloud-init.service failed to download the metadata due to DNS resolution errors:

1. systemd-resolved "starts" @22:14:42.302181, which unblocks cloud-init.service
2. cloud-init.service @22:14:42.690949 (which emits that enp1s0's link is not actually up yet so no viable DNS at that time)
3. NetworkManager.service starts @22:15:16.012686, only after cloud-init.service finishes 30 seconds of retries
4. systemd-resolved finally getting a viable DNS route through enp1s0 @22:15:19.630090 ubuntu systemd-resolved[1136]: enp1s0: Bus client set DNS server list to: 192.168.122.1
5. NetworkManager finally sees enp1s0 device activated @22:15:19.632846
6. NetworkManager-wait-online.service finally gets to CONNECTED status @22:15:19.637543
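
To make the gaps in this timeline concrete, the following Python sketch computes the elapsed time between the timestamps listed above. The variable names are illustrative; the timestamps themselves are copied from the numbered steps.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two same-day HH:MM:SS.ffffff timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


resolved_started = "22:14:42.302181"    # step 1
cloud_init_started = "22:14:42.690949"  # step 2
nm_starting = "22:15:16.012686"         # step 3
dns_usable = "22:15:19.630090"          # step 4

# How long cloud-init.service held up NetworkManager.service.
cloud_init_to_nm = seconds_between(cloud_init_started, nm_starting)

# How long systemd-resolved was "up" before it had a DNS server to use.
resolved_to_dns = seconds_between(resolved_started, dns_usable)
```

That works out to roughly 33.3 seconds between cloud-init.service starting and NetworkManager.service starting, and roughly 37.3 seconds during which systemd-resolved was running without a usable DNS server.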

--- journalctl -b 0 -o short-precise | egrep 'enp1s0|resolved|NetworkManager|ci-info'
Mar 14 22:14:39.167190 ubuntu kernel: virtio_net virtio0 enp1s0: renamed from eth0
Mar 14 22:14:41.451373 ubuntu systemd[1]: Starting systemd-resolved.service - Network Name Resolution...
Mar 14 22:14:42.253268 ubuntu systemd-resolved[1136]: Positive Trust Anchors:
Mar 14 22:14:42.253583 ubuntu systemd-resolved[1136]: . IN DS 20326 8 2 e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
Mar 14 22:14:42.253685 ubuntu systemd-resolved[1136]: Negative trust anchors: home.arpa 10.in-addr.arpa 16.172.in-addr.arpa 17.172.in-addr.arpa 18.172.in-addr.arpa 19.172.in-addr.arpa 20.172.in-addr.arpa 21.172.in-addr.arpa 22.172.in-addr.arpa 23.172.in-addr.arpa 24.172.in-addr.arpa 25.172.in-addr.arpa 26.172.in-addr.arpa 27.172.in-addr.arpa 28.172.in-addr.arpa 29.172.in-addr.arpa 30.172.in-addr.arpa 31.172.in-addr.arpa 168.192.in-addr.arpa d.f.ip6.arpa corp home internal intranet lan local private test
Mar 14 22:14:42.300435 ubuntu systemd-resolved[1136]: Using system hostname 'ubuntu'.
Mar 14 22:14:42.302181 ubuntu systemd[1]: Started systemd-resolved.service - Network Name Resolution.
### cloud-init noticing enp1s0 has no link
Mar 14 22:14:42.690949 ubuntu cloud-init[1341]: ci-info: | enp1s0 | False |     .     |     .     |   .   | 52:54:00:5b:ba:d5 |
Mar 14 22:15:16.012686 ubuntu systemd[1]: Starting NetworkManager.service - Network Manager...
Mar 14 22:15:16.285604 ubuntu NetworkManager[1546]: <info>  [1678832116.2854] NetworkManager (version 1.40.12) is starting... (boot:1f3e11ab-fea1-4f3e-a6ef-fb9d48b0c4d6)
Mar 14 22:15:16.285972 ubuntu NetworkManager[1546]: <info>  [1678832116.2859] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 10-dns-resolved.conf, 20-connectivity-ubuntu.conf, no-mac-addr-change.conf) (run: 10-globally-managed-devices.conf) (etc: default-wifi-powersave-on.conf)
Mar 14 22:15:16.302542 ubuntu systemd[1]: Started NetworkManager.service - Network Manager.
Mar 14 22:15:16.303605 ubuntu NetworkManager[1546]: <info>  [1678832116.3035] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Mar 14 22:15:16.316514 ubuntu systemd[1]: Starting NetworkManager-wait-online.service - Network Manager Wait Online...
Mar 14 22:15:16.354249 ubuntu NetworkManager[1546]: <info>  [1678832116.3542] manager[0x5629cd53f000]: monitoring kernel firmware directory '/lib/firmware'.
Mar 14 22:15:16.354423 ubuntu NetworkManager[1546]: <info>  [1678832116.3544] monitoring ifupdown state file '/run/network/ifstate'.
Mar 14 22:15:16.356985 ubuntu dbus-daemon[1490]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.14' (uid=0 pid=1546 comm="/usr/sbin/NetworkManager --no-daemon" label="unconfined")
Mar 14 22:15:16.527848 ubuntu NetworkManager[1546]: <info>  [1678832116.5278] hostname: hostname: using hostnamed
Mar 14 22:15:16.527869 ubuntu NetworkManager[1546]: <info>  [1678832116.5278] hostname: static hostname changed from (none) to "ubuntu"
Mar 14 22:15:16.528346 ubuntu NetworkManager[1546]: <info>  [1678832116.5283] dns-mgr: init: dns=systemd-resolved rc-manager=unmanaged (auto), plugin=systemd-resolved
Mar 14 22:15:16.531098 ubuntu NetworkManager[1546]: <info>  [1678832116.5310] manager[0x5629cd53f000]: rfkill: Wi-Fi hardware radio set enabled
Mar 14 22:15:16.531208 ubuntu NetworkManager[1546]: <info>  [1678832116.5311] manager[0x5629cd53f000]: rfkill: WWAN hardware radio set enabled
Mar 14 22:15:16.536471 ubuntu NetworkManager[1546]: <info>  [1678832116.5364] Loaded device plugin: NMAtmManager (/usr/lib/x86_64-linux-gnu/NetworkManager/1.40.12/libnm-device-plugin-adsl.so)
Mar 14 22:15:16.546189 ubuntu NetworkManager[1546]: <info>  [1678832116.5461] Loaded device plugin: NMBluezManager (/usr/lib/x86_64-linux-gnu/NetworkManager/1.40.12/libnm-device-plugin-bluetooth.so)
Mar 14 22:15:16.551128 ubuntu NetworkManager[1546]: <info>  [1678832116.5511] Loaded device plugin: NMTeamFactory (/usr/lib/x86_64-linux-gnu/NetworkManager/1.40.12/libnm-device-plugin-team.so)
Mar 14 22:15:16.553375 ubuntu NetworkManager[1546]: <info>  [1678832116.5533] Loaded device plugin: NMWifiFactory (/usr/lib/x86_64-linux-gnu/NetworkManager/1.40.12/libnm-device-plugin-wifi.so)
Mar 14 22:15:16.553688 ubuntu NetworkManager[1546]: <info>  [1678832116.5536] Loaded device plugin: NMWwanFactory (/usr/lib/x86_64-linux-gnu/NetworkManager/1.40.12/libnm-device-plugin-wwan.so)
Mar 14 22:15:16.554369 ubuntu NetworkManager[1546]: <info>  [1678832116.5543] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Mar 14 22:15:16.554716 ubuntu NetworkManager[1546]: <info>  [1678832116.5547] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Mar 14 22:15:16.555036 ubuntu NetworkManager[1546]: <info>  [1678832116.5550] manager: Networking is enabled by state file
Mar 14 22:15:16.556335 ubuntu dbus-daemon[1490]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.14' (uid=0 pid=1546 comm="/usr/sbin/NetworkManager --no-daemon" label="unconfined")
Mar 14 22:15:16.557609 ubuntu NetworkManager[1546]: <info>  [1678832116.5575] settings: Loaded settings plugin: ifupdown ("/usr/lib/x86_64-linux-gnu/NetworkManager/1.40.12/libnm-settings-plugin-ifupdown.so")
Mar 14 22:15:16.557936 ubuntu NetworkManager[1546]: <info>  [1678832116.5579] settings: Loaded settings plugin: keyfile (internal)
Mar 14 22:15:16.558552 ubuntu NetworkManager[1546]: <info>  [1678832116.5585] ifupdown: management mode: unmanaged
Mar 14 22:15:16.558849 ubuntu NetworkManager[1546]: <info>  [1678832116.5588] ifupdown: interfaces file /etc/network/interfaces doesn't exist
Mar 14 22:15:16.562187 ubuntu NetworkManager[1546]: <info>  [1678832116.5621] dhcp: init: Using DHCP client 'internal'
Mar 14 22:15:16.562326 ubuntu NetworkManager[1546]: <info>  [1678832116.5623] device (lo): carrier: link connected
Mar 14 22:15:16.562921 ubuntu NetworkManager[1546]: <info>  [1678832116.5629] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Mar 14 22:15:16.565623 ubuntu NetworkManager[1546]: <info>  [1678832116.5656] manager: (enp1s0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Mar 14 22:15:16.566802 ubuntu NetworkManager[1546]: <info>  [1678832116.5667] device (enp1s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
# enp1s0 Link is finally UP
Mar 14 22:15:16.567801 ubuntu systemd-networkd[1445]: enp1s0: Link UP
Mar 14 22:15:16.567808 ubuntu systemd-networkd[1445]: enp1s0: Gained carrier
Mar 14 22:15:16.576312 ubuntu systemd[1]: Starting NetworkManager-dispatcher.service - Network Manager Script Dispatcher Service...
Mar 14 22:15:16.593064 ubuntu NetworkManager[1546]: <info>  [1678832116.5930] failed to open /run/network/ifstate
Mar 14 22:15:16.594050 ubuntu systemd[1]: Started NetworkManager-dispatcher.service - Network Manager Script Dispatcher Service.
Mar 14 22:15:16.603054 ubuntu NetworkManager[1546]: <info>  [1678832116.6029] modem-manager: ModemManager available
# NM sees link is connected
Mar 14 22:15:16.603995 ubuntu NetworkManager[1546]: <info>  [1678832116.6038] device (enp1s0): carrier: link connected
Mar 14 22:15:16.605979 ubuntu NetworkManager[1546]: <info>  [1678832116.6059] device (enp1s0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
Mar 14 22:15:16.615688 ubuntu NetworkManager[1546]: <info>  [1678832116.6156] policy: auto-activating connection 'netplan-enp1s0' (cac41fbe-bc18-3d87-bba7-af2af7f8ffab)
Mar 14 22:15:16.616090 ubuntu NetworkManager[1546]: <info>  [1678832116.6160] device (enp1s0): Activation: starting connection 'netplan-enp1s0' (cac41fbe-bc18-3d87-bba7-af2af7f8ffab)
Mar 14 22:15:16.616159 ubuntu NetworkManager[1546]: <info>  [1678832116.6161] device (enp1s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Mar 14 22:15:16.616301 ubuntu NetworkManager[1546]: <info>  [1678832116.6162] manager: NetworkManager state is now CONNECTING
Mar 14 22:15:16.616404 ubuntu NetworkManager[1546]: <info>  [1678832116.6163] device (enp1s0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Mar 14 22:15:16.616904 ubuntu NetworkManager[1546]: <info>  [1678832116.6168] device (enp1s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Mar 14 22:15:16.617119 ubuntu NetworkManager[1546]: <info>  [1678832116.6171] dhcp4 (enp1s0): activation: beginning transaction (timeout in 45 seconds)
Mar 14 22:15:18.020005 ubuntu avahi-daemon[1489]: Joining mDNS multicast group on interface enp1s0.IPv6 with address fe80::5054:ff:fe5b:bad5.
Mar 14 22:15:18.020209 ubuntu systemd-networkd[1445]: enp1s0: Gained IPv6LL
Mar 14 22:15:18.020103 ubuntu avahi-daemon[1489]: New relevant interface enp1s0.IPv6 for mDNS.
Mar 14 22:15:18.020113 ubuntu avahi-daemon[1489]: Registering new address record for fe80::5054:ff:fe5b:bad5 on enp1s0.*.
Mar 14 22:15:19.622278 ubuntu NetworkManager[1546]: <info>  [1678832119.6222] dhcp4 (enp1s0): state changed new lease, address=192.168.122.87
Mar 14 22:15:19.622596 ubuntu NetworkManager[1546]: <info>  [1678832119.6225] policy: set 'netplan-enp1s0' (enp1s0) as default for IPv4 routing and DNS
Mar 14 22:15:19.623291 ubuntu avahi-daemon[1489]: Joining mDNS multicast group on interface enp1s0.IPv4 with address 192.168.122.87.
Mar 14 22:15:19.623380 ubuntu avahi-daemon[1489]: New relevant interface enp1s0.IPv4 for mDNS.
Mar 14 22:15:19.623395 ubuntu avahi-daemon[1489]: Registering new address record for 192.168.122.87 on enp1s0.IPv4.
Mar 14 22:15:19.628244 ubuntu systemd-resolved[1136]: enp1s0: Bus client set default route setting: yes
Mar 14 22:15:19.630090 ubuntu systemd-resolved[1136]: enp1s0: Bus client set DNS server list to: 192.168.122.1
Mar 14 22:15:19.630505 ubuntu NetworkManager[1546]: <info>  [1678832119.6304] device (enp1s0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Mar 14 22:15:19.632315 ubuntu NetworkManager[1546]: <info>  [1678832119.6322] device (enp1s0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Mar 14 22:15:19.632447 ubuntu NetworkManager[1546]: <info>  [1678832119.6324] device (enp1s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Mar 14 22:15:19.632671 ubuntu NetworkManager[1546]: <info>  [1678832119.6326] manager: NetworkManager state is now CONNECTED_SITE
Mar 14 22:15:19.632846 ubuntu NetworkManager[1546]: <info>  [1678832119.6328] device (enp1s0): Activation: successful, device activated.
Mar 14 22:15:19.633170 ubuntu NetworkManager[1546]: <info>  [1678832119.6331] manager: startup complete
Mar 14 22:15:19.637543 ubuntu systemd[1]: Finished NetworkManager-wait-online.service - Network Manager Wait Online.
Mar 14 22:15:19.863983 ubuntu NetworkManager[1546]: <info>  [1678832119.8639] manager: NetworkManager state is now CONNECTED_GLOBAL
Mar 14 22:15:29.879294 ubuntu systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Mar 14 22:15:56.584613 ubuntu NetworkManager[1546]: <info>  [1678832156.5845] agent-manager: agent[c06ac3176022f796,:1.47/org.gnome.Shell.NetworkAgent/1000]: agent registered
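
To make the device state timeline in a log like the one above easier to compare between boots, the NetworkManager "state change" lines can be pulled out with a small script. This is only a sketch: the regular expression is an assumption based on the line format seen in this log, and `state_changes` is a hypothetical helper name, not part of any tool mentioned in this bug.

```python
import re

# Pattern derived from the NetworkManager journal lines quoted in this bug;
# other NetworkManager versions may format these lines differently.
STATE_CHANGE = re.compile(
    r"device \((?P<dev>[^)]+)\): state change: "
    r"(?P<old>[\w-]+) -> (?P<new>[\w-]+) \(reason '(?P<reason>[^']*)'"
)


def state_changes(lines, device):
    """Return (old, new, reason) tuples for one device, in log order."""
    changes = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match and match.group("dev") == device:
            changes.append(
                (match.group("old"), match.group("new"), match.group("reason"))
            )
    return changes


# Three lines copied from the journal output above.
SAMPLE = [
    "Mar 14 22:15:16.566802 ubuntu NetworkManager[1546]: <info>  [1678832116.5667] device (enp1s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')",
    "Mar 14 22:15:16.605979 ubuntu NetworkManager[1546]: <info>  [1678832116.6059] device (enp1s0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')",
    "Mar 14 22:15:19.632447 ubuntu NetworkManager[1546]: <info>  [1678832119.6324] device (enp1s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')",
]

for old, new, reason in state_changes(SAMPLE, "enp1s0"):
    print(f"{old} -> {new} ({reason})")
```

A device that repeatedly bounces between `unavailable` and `disconnected` in this output would be consistent with the contention between networkd and NetworkManager described earlier in this bug.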

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-03-15:

#22

This issue with NetworkManager.service and systemd is reminiscent of the related feature request against systemd-networkd https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1636912 that also points out the ordering issues.

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-03-15 (last edit on 2023-03-15):

#23

While retries on DNS resolution failures do not work for cloud-init.service in environments running NetworkManager.service (desktop live ISOs), I have found that the retries do work where systemd-networkd is bringing up network config (server live ISOs). This is because cloud-init.service is ordered After=systemd-networkd-wait-online.service, which provides systemd-resolved with viable network interfaces that are up and have access to functional DNS configuration.
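
As an illustration of why retries only help when the network is actually on its way up, here is a minimal sketch of a bounded DNS retry loop. It is not cloud-init's retry implementation; `resolve_with_retries` and its parameters are hypothetical, and the resolver is injectable so the behavior can be exercised without a network.

```python
import socket
import time


def resolve_with_retries(host, attempts=5, delay=1.0,
                         resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying on socket.gaierror up to `attempts` times.

    Returns a sorted list of unique addresses, or re-raises the last
    resolution error once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            infos = resolver(host, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] holds the address string.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as err:
            last_error = err
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

If no interface comes up during the retry window, as on the desktop ISOs where NetworkManager starts later, every attempt fails the same way and the loop only delays the error.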

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-03-22:

#24

The short-term solution here will likely be for livecd-rootfs to modify cloud-init.service so that it drops `Before=sysinit.target` and adds `After=NetworkManager-wait-online.service`. This is quite similar to what Red Hat and derivatives have done for a while: they have an `After=NetworkManager.service` in their cloud-init.service config and no `Before=sysinit.target`. SUSE injects an `After=dbus.service`, which also puts it at the same boot timeframe as NetworkManager-based environments.[1]

This does mean that the cloud-init datasource gets detected a couple of seconds later in boot, so any service depending on `/run/cloud-init/instance-data.json` will also be delayed by a couple of seconds, but it doesn't delay the overall boot process.

A long-term solution may be to see if we can improve NetworkManager's dependency on dbus.service so that it could support late-bind interaction once dbus.socket comes online.

References:
[1] upstream suse/redhat cloud-init.service configuration NetworkManager/dbus ordering https://github.com/canonical/cloud-init/blob/main/systemd/cloud-init.service.tmpl#L19-L27

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-03-23:

#25

Short-term proposal update:
It looks like we can't get away with supplemental systemd drop-ins in /etc/systemd/system/cloud-init.service.d by themselves in livecd-rootfs, because we need to remove the `Before=sysinit.target` from the default shipped cloud-init.service config. Systemd drop-ins can only augment or add configuration here, so we need to replace the entire [Unit] definition in /lib/systemd/system/cloud-init.service in order to remove the `Before=sysinit.target`; otherwise it will still conflict with the NetworkManager.service ordering on After=dbus.service.
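
The kind of rewrite described above can be sketched as a small text transformation over a unit file. This is an illustration only, not the actual livecd-rootfs change: `reorder_unit` is a hypothetical helper, and `SAMPLE_UNIT` is a made-up, trimmed unit rather than the shipped cloud-init.service.

```python
def reorder_unit(unit_text,
                 drop_before=("sysinit.target",),
                 add_after=("NetworkManager.service",
                            "NetworkManager-wait-online.service")):
    """Return unit text with some Before= targets removed from [Unit]
    and extra After= lines added to it. Other sections are untouched."""
    out = []
    in_unit = False
    for line in unit_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_unit = stripped == "[Unit]"
            out.append(line)
            if in_unit:
                # Directive order inside a section does not matter,
                # so add the new ordering right after the header.
                out.extend(f"After={name}" for name in add_after)
            continue
        if in_unit and stripped.startswith("Before="):
            kept = [t for t in stripped[len("Before="):].split()
                    if t not in drop_before]
            if kept:
                out.append("Before=" + " ".join(kept))
            continue
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical, trimmed unit text used only for this example.
SAMPLE_UNIT = """\
[Unit]
Description=Example early-boot service
After=cloud-init-local.service
Before=network-online.target
Before=sysinit.target

[Service]
Type=oneshot
ExecStart=/bin/true
"""

print(reorder_unit(SAMPLE_UNIT))
```

Because the result is a complete unit file rather than a drop-in, it has to take the place of the shipped unit, which matches the approach described in this comment.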

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-03-23:

#26

PR up for discussion on this ordering change in livecd https://code.launchpad.net/~chad.smith/livecd-rootfs/+git/livecd-rootfs/+merge/439586

Changed in cloud-init:
assignee:	nobody → Chad Smith (chad.smith)

Chad Smith (chad.smith) on 2023-03-23

Changed in cloud-init:
status:	Triaged → In Progress

Revision history for this message

Dan Bungert (dbungert) wrote on 2023-03-25:

#27

What a mess!

I have uploaded the livecd-rootfs change proposed by Chad in #26. Note that there is another problem around jsonschema exposed by this that is in progress.

Changed in livecd-rootfs (Ubuntu):
status:	New → Fix Committed

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-03-27:

#28

jsonschema 2.6.0 has now been dropped from the latest ubuntu-desktop-installer snap version 0+git.98600e08 in the stable channel as of a couple of hours ago, per this issue/fix[1]. This resolves issues with curtin tracebacks on jsonschema.validate(). Additionally, cloud-init has a PR in progress[2] to avoid registering a strict draft4 schema with additionalProperties=False. The cloud-init fix is now unnecessary given the changes already published in ubuntu-desktop-installer to drop jsonschema 2.6.0 from the snap.

[1] drop duplicate python dependencies from site-packages https://github.com/canonical/ubuntu-desktop-installer/issues/1714

[2] https://github.com/canonical/cloud-init/pull/2098

Revision history for this message

Launchpad Janitor (janitor) wrote on 2023-03-30:

#29

This bug was fixed in the package livecd-rootfs - 2.817

---------------
livecd-rootfs (2.817) lunar; urgency=medium

[ John Chittum ]
* revert ipc change. kernel 6.2 will have the correct setting

-- Steve Langasek <email address hidden> Mon, 27 Mar 2023 12:11:06 -0700

Changed in livecd-rootfs (Ubuntu):
status:	Fix Committed → Fix Released

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-04-03:

#30

I can now confirm successful autoinstall runs with an FQDN on the kernel command line in Desktop live installer ISOs dated 20230403. These images order cloud-init.service `After=NetworkManager.service NetworkManager-wait-online.service`, which ensures devices and systemd-resolved are both up and active by the time cloud-init tries to download remote user-data/meta-data from a seed URL.

$ cat /var/log/installer/media-info # also found in /cdrom/.disk/info in ephemeral environment
Ubuntu 23.04 "Lunar Lobster" - Daily amd64 (20230403)

$ # installer version of the snap
2023-04-03 15:42:37,497 INFO subiquity:163 Starting Subiquity server revision 907 of snap /snap/ubuntu-desktop-installer/907

The correct systemd service ordering for cloud-init.service is present in Desktop live installer builds dated 20230403. Placing cloud-init.service `After=NetworkManager.service NetworkManager-wait-online.service` guarantees that the network is up before cloud-init datasource discovery runs, which also implies systemd-resolved has started and has adequate connectivity to resolve FQDNs on any NICs discovered by NetworkManager.

ubuntu@ubuntu:~$ systemctl show -p After,Before cloud-init.service --no-pager
Before=sshd-keygen.service cloud-config.target network-online.target sshd.service shutdown.target systemd-user-sessions.service
After=cloud-init-local.service NetworkManager.service system.slice networking.service systemd-journald.socket systemd-networkd-wait-online.service NetworkManager-wait-online.service

This allows cloud-init to download remote user-data from an FQDN provided to the live desktop installer via the kernel parameter: `ds=nocloud-net;s=http://YOUR-DOMAIN/`

So, FQDN lookup seems to be resolved by the systemd service ordering after NetworkManager is up and functional.

There may be a secondary issue to file related to environments where nameservers are specifically provided for PXE-based installs after cloud-init properly downloads remote user-data from a remote FQDN, but the ordering of systemd network configuration seems to alleviate the DNS resolution aspect pointed to in this bug.
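
The ordering check shown in this comment can also be scripted against the text printed by `systemctl show -p After,Before`. The following is a minimal sketch; `parse_show` and `missing_after` are hypothetical helper names, and the sample text is the output quoted above.

```python
def parse_show(output):
    """Parse KEY=value lines from `systemctl show` into {key: [units]}."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.split()
    return props


def missing_after(props, required):
    """Return the required units that are absent from After=."""
    after = props.get("After", [])
    return [unit for unit in required if unit not in after]


# Output copied from the `systemctl show` run in this comment.
SHOW_OUTPUT = """\
Before=sshd-keygen.service cloud-config.target network-online.target sshd.service shutdown.target systemd-user-sessions.service
After=cloud-init-local.service NetworkManager.service system.slice networking.service systemd-journald.socket systemd-networkd-wait-online.service NetworkManager-wait-online.service
"""

props = parse_show(SHOW_OUTPUT)
required = ["NetworkManager.service", "NetworkManager-wait-online.service"]
print("missing from After=:", missing_after(props, required))
print("still Before=sysinit.target:", "sysinit.target" in props["Before"])
```

On an image without the livecd-rootfs change, one would expect the NetworkManager units to be reported as missing and `sysinit.target` to still appear in `Before=`.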

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-04-07:

#31

I can confirm, after manually launching images via KVM in virt-manager, that live desktop image builds as of 20230403 (per `/cdrom/.disk/info`) have the proper systemd service ordering, which places cloud-init.service `After=NetworkManager.service NetworkManager-wait-online.service`. This allows cloud-init to resolve DNS on Ubuntu ISOs where NetworkManager is the primary network backend.

We also found a secondary bug not related to the specific NetworkManager DNS issue: once cloud-init renders initial network config to detect the datasource, it writes direct network configuration to /etc/NetworkManager/system-connections. If networking changes are provided in autoinstall.network, ubuntu-desktop-installer (via subiquity) writes that network config to /etc/netplan/00-installer.yaml and invokes 'netplan apply'. This results in collisions in NetworkManager configuration, as netplan isn't aware of cloud-init's direct config in /etc/NetworkManager/system-connections/cloud-init-<device>.nmconnection.

https://bugs.launchpad.net/cloud-init/+bug/2015605

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-04-12 (last edit on 2023-04-12):

#32

Closing the cloud-init task here as 'Fix Released' because the livecd-rootfs overlay config files now allow cloud-init.service to be ordered After=NetworkManager.service, which solves the immediate DNS issues in early boot. Long-term, cloud-init will need to spec out options to prefer ordering after NetworkManager versus systemd-networkd at systemd generator timeframe, because ordering After=NetworkManager.service is incompatible with cloud-init's default Before=sysinit.target.

We'll take that long-term work as a separate bug for cloud-init https://bugs.launchpad.net/cloud-init/+bug/2015949 to discern how best to position upstream cloud-init.service files to cope with service ordering conflicts to prefer NetworkManager.service over systemd-networkd.

Changed in cloud-init:
status:	In Progress → Fix Released

Revision history for this message

James Falcon (falcojr) wrote on 2023-05-12:

#33

Tracked in Github Issues as https://github.com/canonical/cloud-init/issues/4085

Revision history for this message

Nick Rosbrook (enr0n) wrote on 2023-06-09:

#34

My understanding from a quick read is that there is nothing to do in systemd. Please re-open if I am mistaken.

Changed in systemd (Ubuntu):
status:	New → Invalid

Dan Bungert (dbungert) on 2023-06-09

Changed in subiquity:
status:	New → Invalid

Benjamin Drung (bdrung) on 2023-10-23

tags:

removed: foundations-todo

Affects	Status	Importance	Assigned to
Netplan	Invalid	Undecided	Danilo Egea Gondolfo
cloud-init	Fix Released	Medium	Chad Smith
subiquity	Invalid	Undecided	Unassigned
livecd-rootfs (Ubuntu)	Fix Released	Undecided	Unassigned
systemd (Ubuntu)	Invalid	Undecided	Unassigned