Azure: cannot start walinux agent (Transaction order is cyclic.)

Bug #1623570 reported by Scott Moser
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
walinuxagent (Ubuntu)
Fix Released
High
Scott Moser
Xenial
Fix Released
Medium
Unassigned

Bug Description

==== Begin SRU Template ====
[Impact]
On Azure (and only on Azure), cloud-init utilizes the walinuxagent
to do some of the instance initialization. A change to cloud-init
to support installation of packages on a systemd system (bug 1576692),
exposed an issue in the walinuxagent.service file.

The end result is freshly booted instances on azure with the cloud-init
fix and the old version of walinuxagent would fail to import ssh keys
that were provided to the instance.

[Test Case]
 * Start an instance on Azure
 * upgrade to cloud-init at 0.7.7-31-g65ace7b or better
   this is in both xenial-proposed and yakkety.

 * clean up, so it looks like a first boot.
   sudo sh -c 'rm -Rf /var/lib/cloud /var/log/cloud-init* /var/lib/waagent/*.crt'
 * At this point a reboot would cause failure, and you would see WARN messages in the logs.

 * upgrade walinuxagent with a fix (2.1.5-0ubuntu2 in yakkety)

 * reboot
 * ssh back in,
   * There should be no WARN in /var/log/cloud-init.log
     grep WARN /var/log/cloud-init.log
   * There should exist a file /var/lib/waagent/*.crt
   * walinuxagent.service should be running
     systemctl status walinuxagent.service

Note, that in event of failure you can still get into the system by
providing user-data that populates your ssh keys:
  #!/bin/sh
  ssh-import-id smoser

[Regression Potential]
Changing order of things in boot is scary. This is no different.
This would in theory allow walinux-agent to start earlier in the boot.
In practice, however, it is started by cloud-init.service so it was
already running that early in the boot.
==== End SRU Template ====

When bringing up the Azure datasource in cloud-init.service, cloud-init tries 'service start walinuxagent'.

That previously worked fine, and the agent would start and then would produce the certificate files that cloud-init needed (for ssh keys and things).

I found this when testing SRU for 0.7.7-31-g65ace7b-0ubuntu1~16.04.1
but it is likely present also in 0.7.7-31-g65ace7b-0ubuntu1 (yakkety)

Now, however we see a log like:
Sep 14 14:53:18 smoser0914x [CLOUDINIT] DataSourceAzure.py[DEBUG]: Getting metadata via agent. hostname=smoser0914x cmd=['service', 'walinuxagent', 'start']
Sep 14 14:53:18 smoser0914x [CLOUDINIT] util.py[DEBUG]: Running command hostname with allowed return codes [0] (shell=False, capture=True)
Sep 14 14:53:18 smoser0914x [CLOUDINIT] DataSourceAzure.py[DEBUG]: invoking agent: ['service', 'walinuxagent', 'start']
Sep 14 14:53:18 smoser0914x [CLOUDINIT] util.py[DEBUG]: Running command ['service', 'walinuxagent', 'start'] with allowed return codes [0] (shell=False, capture=True)
Sep 14 14:53:18 smoser0914x [CLOUDINIT] util.py[WARNING]: agent command '['service', 'walinuxagent', 'start']' failed.
Sep 14 14:53:19 smoser0914x [CLOUDINIT] util.py[DEBUG]: agent command '['service', 'walinuxagent', 'start']' failed.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 145, in get_metadata_from_agent
    invoke_agent(agent_cmd)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 452, in invoke_agent
    util.subp(cmd, shell=(not isinstance(cmd, list)))
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1832, in subp
    cmd=args)
cloudinit.util.ProcessExecutionError: Unexpected error while running command.
Command: ['service', 'walinuxagent', 'start']
Exit code: 1
Reason: -
Stdout: ''
Stderr: "
  Failed to start walinuxagent.service: Transaction order is cyclic. See system logs for details.
  See system logs and 'systemctl status walinuxagent.service' for details

I believe the relevant change is in 34a26f7f
  https://git.launchpad.net/cloud-init/commit/?id=34a26f7f59f2963691e36ca0476bec9fc9ccef63
That added multi-user.target to the list of After for cloud-init-final.service.

Related bugs:
 * bug 1576692: fully support package installation in systemd

Scott Moser (smoser)
Changed in cloud-init:
status: New → Confirmed
Changed in cloud-init (Ubuntu):
status: New → Confirmed
no longer affects: ubuntu (Ubuntu)
Changed in cloud-init (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
Changed in cloud-init (Ubuntu):
importance: Undecided → High
description: updated
Revision history for this message
Scott Moser (smoser) wrote :

I backed out the single change below, and this functions again as it was. I've also tried using '--job-mode=ignore-dependencies', but i still get the:

  Failed to start walinuxagent.service: Transaction order is cyclic. See system logs for details.
  See system logs and 'systemctl status walinuxagent.service' for details.

$ git diff 0.7.7..0.7.8 systemd/cloud-final.service
diff --git a/systemd/cloud-final.service b/systemd/cloud-final.service
index 3927710..b8f69b7 100644
--- a/systemd/cloud-final.service
+++ b/systemd/cloud-final.service
@@ -1,6 +1,6 @@
 [Unit]
 Description=Execute cloud user/final scripts
-After=network-online.target cloud-config.service rc-local.service
+After=network-online.target cloud-config.service rc-local.service multi-user.target
 Wants=network-online.target cloud-config.service

 [Service]

Revision history for this message
Scott Moser (smoser) wrote :

So we have a couple of options here:
a.) use '__builtin__' mode in cloud-init for the walinux agent functionality.
this in theory should work, but we have not largely tested it. Basically
this path has cloud-init doing the metadata service exchange itself rather than relying on walinux-agent to pull the files it needs and then using them.

I've noticed one issue with this, is that walinuxagent.service is not started. Per journalctl,
  multi-user.target: Breaking ordering cycle by deleting job walinuxagent.service/start

b.) remove or change 'After' 'cloud-final' in walinuxagent.service
I'm not exactly sure why this is here, but I believe it was so that cloud-init had an opportunity to configure walinuxagent or otherwise stop them from fighting. That said, since cloud-init.service is starting walinux-agent (and has been for quite some time), it would seem that an After of 'cloud-init' should be sufficient.

It seems that because of the cyclic issue, 'b' is basically required.

Scott Moser (smoser)
Changed in walinuxagent (Ubuntu):
status: New → In Progress
assignee: nobody → Scott Moser (smoser)
importance: Undecided → High
Changed in walinuxagent (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Scott Moser (smoser) wrote :
Scott Moser (smoser)
description: updated
Scott Moser (smoser)
no longer affects: cloud-init
Scott Moser (smoser)
description: updated
Revision history for this message
Stephen A. Zarkos (stevez) wrote :

Hi Scott,

I think we would actually like to move to a world where cloud-init can function independently of the agent and does not start it. The walinuxagent service can then start after cloud-init is finished, and function as a guest agent to handle things like VM extensions, etc.

This behavior of cloud-init starting walinuxagent was needed early on to allow the agent to communicate with the wire server so that the provisioning ISO could be attached to the VM. That's no longer needed. All cloud-init needs to do is obtain the certificates for the SSH public key and post 'ready' to the fabric once it has processed its cloud-config.

Steve

Revision history for this message
Scott Moser (smoser) wrote :

Steve,
Right. That is 'a' in comment #2 above.

In order to do that, we still need the change I've proposed or walinuxagent.service will never start.
I've not suggested that we make the cloud-init change to use '__builtin__' now because I'm not interested in SRUing that at the moment.

The change I've suggested to walinuxagent *does* make it more independent of cloud-init.

Revision history for this message
Scott Moser (smoser) wrote :

My only hesitancy against 'builtin' is doing that now. It seems like the proposed change is the safer of the two, and I want/need the cloud-init update currently in -proposed to get into -updates sooner than later.

Ubuntu images have never used the __builtin__ path, so I just consider it less tested.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package walinuxagent - 2.1.5-0ubuntu2

---------------
walinuxagent (2.1.5-0ubuntu2) yakkety; urgency=medium

  * debian/walinuxagent.service: remove cloud-final.service from
    both Wants and After. (LP: #1623570)

 -- Scott Moser <email address hidden> Wed, 14 Sep 2016 13:06:56 -0400

Changed in walinuxagent (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Scott Moser (smoser) wrote :

I'm removing the cloud-init tasks as pitti agrees the diagnosis seems sane.

@Steve,
feel free to open a bug against cloud-init to use the builtin support, and we can do that independently of this.

no longer affects: cloud-init (Ubuntu)
Revision history for this message
Martin Pitt (pitti) wrote : Please test proposed package

Hello Scott, or anyone else affected,

Accepted walinuxagent into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/walinuxagent/2.1.3-0ubuntu4.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in walinuxagent (Ubuntu Xenial):
status: Confirmed → Fix Committed
no longer affects: cloud-init (Ubuntu Xenial)
Revision history for this message
Martin Pitt (pitti) wrote :

Hello Scott, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-1-g3705bb5-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Revision history for this message
Scott Moser (smoser) wrote :

verified as in description.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for walinuxagent has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package walinuxagent - 2.1.3-0ubuntu4.1

---------------
walinuxagent (2.1.3-0ubuntu4.1) xenial-proposed; urgency=medium

  * debian/walinuxagent.service: remove cloud-final.service from
    both Wants and After. (LP: #1623570)

 -- Scott Moser <email address hidden> Wed, 14 Sep 2016 15:17:38 -0400

Changed in walinuxagent (Ubuntu Xenial):
status: Fix Committed → Fix Released
Scott Moser (smoser)
description: updated
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.