MAAS Provider: LXC did not get DHCP address, stuck in "pending"

Bug #1353008 reported by David Britton
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
cloud-init
Fix Released
High
Unassigned
cloud-init (Ubuntu)
Fix Released
Medium
Unassigned
Trusty
Fix Released
Medium
Unassigned

Bug Description

=== Begin SRU Information ===
This bug causes lxc containers created by the ubuntu-cloud template (lxc-create -t ubuntu-cloud) to sometimes not obtain an IP address, and thus not correctly boot to completion.

The bug is in an assumption by cloud-init that /run is mounted before the cloud-init-local job is run. The fix is very simply to guarantee that it is via modification to its upstart 'start on'.

When booting with an initramfs /run will be mounted before /, so the race condition is not possible. Thus, the failure case is only either in non-initramfs boot (which is very unlikely) or in lxc boot. The lxc case seems only to occur very rarely, somewhere well under one percent of the time.

[Test Case]
A test case is written at [1] that launches many instances in an attempt brute force find the error. However, I've not been able to make it fail.

The original bug reporter has been running with the 'start on' change and has seen no errors since.

We will request the original bug reporter to apply the uploaded changes and run through their battery.

[Regression Potential]
The possibility for regression here is in the second boot of an instance. The following scenario is a change of behavior:
 * the user boots an instance with NoCloud or ConfigDrive in ds=local mode
 * user changes /etc/network/interfaces in a way that would cause
   static-networking to not be emitted on subsequent boot
 * user reboots
Now, instead of a quick boot, the user may see cloud-init-nonet blocking on network coming up.

This would be a uncommon scenario, and the broken-etc-network-interfaces scenario is already one that causes timeouts on boot.

--
[1] http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-new-instance

=== End SRU Information ===

Note, that after I went onto the system, it *did* have an IP address.

      0/lxc/3:
        agent-state: pending
        instance-id: juju-machine-0-lxc-3
        series: trusty
        hardware: arch=amd64

cloud-init-output.log snip:

Cloud-init v. 0.7.5 running 'init' at Mon, 04 Aug 2014 23:57:12 +0000. Up 572.29 seconds.
ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | . | . | 00:16:3e:34:aa:57 |
ci-info: +--------+------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Cloud-init v. 0.7.5 running 'modules:config' at Mon, 04 Aug 2014 23:57:12 +0000. Up 572.99 seconds.
Cloud-init v. 0.7.5 running 'modules:final' at Mon, 04 Aug 2014 23:57:14 +0000. Up 574.42 seconds.
Cloud-init v. 0.7.5 finished at Mon, 04 Aug 2014 23:57:14 +0000. Datasource DataSourceNoCloudNet [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]. Up 574.54 seconds

syslog on system, showing DHCPACK 1 second later:

root@juju-machine-0-lxc-3:/home/ubuntu# grep DHCP /var/log/syslog
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 255.255.255.255 port 67 (xid=0x1687c544)
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPOFFER of 10.96.3.173 from 10.96.0.10
Aug 4 23:57:13 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10
Aug 5 05:28:15 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 10.96.0.10 port 67 (xid=0x1687c544)
Aug 5 05:28:15 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10
Aug 5 11:15:00 juju-machine-0-lxc-3 dhclient: DHCPREQUEST of 10.96.3.173 on eth0 to 10.96.0.10 port 67 (xid=0x1687c544)
Aug 5 11:15:00 juju-machine-0-lxc-3 dhclient: DHCPACK of 10.96.3.173 from 10.96.0.10

It appears in every case, cloud-init init-local has failed very early visible in juju logs /var/lib/juju/containers/<container>/console.log:

Traceback (most recent call last):
  File "/usr/bin/cloud-init", line 618, in <module>
    sys.exit(main())
  File "/usr/bin/cloud-init", line 614, in main
    get_uptime=True, func=functor, args=(name, args))
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1875, in log_time
    ret = func(*args, **kwargs)
  File "/usr/bin/cloud-init", line 491, in status_wrapper
    force=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1402, in sym_link
    os.symlink(source, link)
OSError: [Errno 2] No such file or directory

Related branches

Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :
Curtis Hovey (sinzui)
tags: added: lxc maas-provider network
Changed in juju-core:
status: New → Triaged
Revision history for this message
Curtis Hovey (sinzui) wrote :

Looks like the machine was shutdown by something other than juju. juju wasn't installed in the machine before it was terminated. Since there wasn't a first call home, juju will wait indefinitely. Could maas have terminated the machine that juju requested?

Changed in juju-core:
status: Triaged → Incomplete
David Britton (dpb)
Changed in juju-core:
status: Incomplete → New
Revision history for this message
David Britton (dpb) wrote :

I've reproduced this again.

cloud-init appears to fail fast if the ip address is not yet allocated by the upstart job "network-interface-eth0". As you can see in my paste, cloud-init doesn't appear to wait at all, like I'm used to, it just fails.

This might take someone with deeper expertise (smoser?) to look into.

I'll attach full logs from the lxc

Revision history for this message
David Britton (dpb) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.21-alpha1
Revision history for this message
Scott Moser (smoser) wrote :

cloud-init doesn't "fail fast". Rather, the 'cloud-init' upstart job will not execute until either:
a.) all network interfaces defined in /etc/network/interfaces as 'auto' are up
or
b.) cloud-init-nonet has ended (which basically blocks boot for up to 120 seconds or until 'a' is accomplished).

one of the two scenarios must be reached before the 'mounted MOUNTPOINT=/' event occurs. That event then basically blocks the rest of boot. so cloud init forces a pinch point in boot on networking coming up.

something else must be blocking networking from coming up until afer mounted MOUNTPOINT=/ has occurred.

Possibly a dupe of https://bugs.launchpad.net/cloud-init/+bug/1345433

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Let's keep in mind the way juju creates containers: first, it creates a template container. Then, it clones it.

From the cloud init logs, I can't see cloud-init from the cloned container waiting 120s for an IP. In fact, doesn't look like it waited even 60s.

- template container ends: Cloud-init v. 0.7.5 finished at Mon, 04 Aug 2014 23:56:18
- cloned container starts: Cloud-init v. 0.7.5 running 'init' at Mon, 04 Aug 2014 23:57:12
- cloned container fails at, or before: Cloud-init v. 0.7.5 running 'modules:config' at Mon, 04 Aug 2014 23:57:12
- dhcp server gets a lease request, offers it, and the container ACKs it: all within this second: 23:57:13

Revision history for this message
David Britton (dpb) wrote :

Removing duplicate status. After understanding more of the failure mode, it's certainly distinct from 1345433. The problem here is the initial log message printed by "cloud-init init-local" is failing with a symlink error.

Traceback (most recent call last):
  File "/usr/bin/cloud-init", line 618, in <module>
    sys.exit(main())
  File "/usr/bin/cloud-init", line 614, in main
    get_uptime=True, func=functor, args=(name, args))
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1875, in log_time
    ret = func(*args, **kwargs)
  File "/usr/bin/cloud-init", line 491, in status_wrapper
    force=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1402, in sym_link
    os.symlink(source, link)
OSError: [Errno 2] No such file or directory

Aug 22 13:55:52 juju-trusty-lxc-template [CLOUDINIT] util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'

Thus it's hard to see in the log what is going wrong.

  - "cloud-init init-local" does the waiting for network bit.
  - "cloud-init init" expects the network devices to be up

Two issues:

1) Why is the symlink failing?

From the error message (No such file or directory), the only ways I've gotten it to repeat is if the directory where the link is requested to be placed does not exist, or if using relative paths, the directory cannot be calculated -- i.e., dirname(source) doesn't exist.

2) Why doesn't cloud-init-init block on success of cloud-init, and not just completion of it?

Perhaps this is a limitation of the upstart job chain. I'm not familiar enough with that to speak to it. Or, perhaps it's by design.

I'll add more to the bug debugging wise as I find it.

Revision history for this message
David Britton (dpb) wrote :
description: updated
Revision history for this message
David Britton (dpb) wrote :

FYI -- I can hit this pretty often in my test environment. If there is some way to add instrumentation to cloud-init I would be happy to debug further.

Revision history for this message
Ian Booth (wallyworld) wrote :

I'm removing this from 1.20 series as any Juju related work (if any is required) will be done for the next release.

no longer affects: juju-core/1.20
Revision history for this message
Scott Moser (smoser) wrote :

I pushed a test that creates and destroys lxc containers in an effort to reproduce my suspicion of the problem to http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-new-instance .

I've not been able to make it fail anywhere, though.

Changed in cloud-init:
status: New → In Progress
importance: Undecided → High
Revision history for this message
David Britton (dpb) wrote : Re: [Bug 1353008] Re: MAAS Provider: LXC did not get DHCP address, stuck in "pending"

On Wed, Sep 10, 2014 at 03:15:43PM -0000, Scott Moser wrote:
> I pushed a test that creates and destroys lxc containers in an effort to
> reproduce my suspicion of the problem to http://bazaar.launchpad.net
> /~cloud-init-dev/cloud-init/cloud-init-test/view/head:/tests/lxc-test-
> new-instance .
>
> I've not been able to make it fail anywhere, though.

FYI, with the workaround that you provided (of checking that /run is
mounted), I haven't seen the error since, but only have about 1500
containers created so far.

--
David Britton <email address hidden>

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.21-alpha1 → 1.21-alpha2
Revision history for this message
Scott Moser (smoser) wrote :

fixed in trunk at revno 1012

Changed in cloud-init:
status: In Progress → Fix Committed
Scott Moser (smoser)
Changed in cloud-init (Ubuntu):
assignee: nobody → Scott Moser (smoser)
importance: Undecided → Medium
status: New → Fix Released
Changed in cloud-init (Ubuntu Trusty):
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
Scott Moser (smoser) wrote :

This is fixed in
  utopic: version 0.7.6~bzr1016-0ubuntu1 uploaded 2014-09-12
  trusty: version 0.7.5-0ubuntu1.2 uploaded to -proposed queue 2014-09-12

currently in unapproved at https://launchpad.net/ubuntu/trusty/+queue?queue_state=1&queue_text=cloud-init

Scott Moser (smoser)
description: updated
description: updated
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello David, or anyone else affected,

Accepted cloud-init into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.7.5-0ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
David Britton (dpb) wrote :

With scott's help, I spun this about 2000x over the weekend and did not hit the issue. I would call it verified from my perspective.

Changing tag.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.5-0ubuntu1.2

---------------
cloud-init (0.7.5-0ubuntu1.2) trusty-proposed; urgency=medium

  * d/patches/lp-1353008-cloud-init-local-needs-run.conf:
    backport change to cloud-init-local.conf to depend on /run being
    mounted (LP: #1353008)
 -- Scott Moser <email address hidden> Wed, 17 Sep 2014 09:15:54 -0400

Changed in cloud-init (Ubuntu Trusty):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Ian Booth (wallyworld) wrote :

Removing this bug from juju-core as the issue required a cloud init fix.

no longer affects: juju-core
Revision history for this message
Scott Moser (smoser) wrote :

This is fixed in cloud-init 0.7.7

Changed in cloud-init:
status: Fix Committed → Fix Released
Changed in cloud-init (Ubuntu):
assignee: Scott Moser (smoser) → Roufique Hossain (roufique)
Changed in cloud-init:
assignee: nobody → Roufique Hossain (roufique)
Changed in cloud-init (Ubuntu Trusty):
assignee: nobody → Roufique Hossain (roufique)
Steve Langasek (vorlon)
Changed in cloud-init (Ubuntu):
assignee: Roufique Hossain (roufique) → nobody
Changed in cloud-init (Ubuntu Trusty):
assignee: Roufique Hossain (roufique) → nobody
Changed in cloud-init:
assignee: Roufique Hossain (roufique) → nobody
Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.