cloud-init timeout waiting for metadata service on EC2

Bug #745930 reported by Scott Moser
This bug affects 3 people
Affects: cloud-init (Ubuntu)
Status: Won't Fix
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

Binary package hint: cloud-init

In testing for natty beta-1, we hit a timeout waiting for the metadata service.

This would occur when 'cloud-init' ran, meaning that a network interface was already up.

This occurred on m1.large in us-west-1.

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: cloud-init 0.6.1-0ubuntu4
ProcVersionSignature: User Name 2.6.38-7.39-virtual 2.6.38
Uname: Linux 2.6.38-7-virtual x86_64
Architecture: amd64
Date: Wed Mar 30 17:56:22 2011
Ec2AMI: ami-bb3261fe
Ec2AMIManifest: ubuntu-images-testing-us-west-1/ubuntu-natty-daily-amd64-server-20110329.manifest.xml
Ec2AvailabilityZone: us-west-1c
Ec2InstanceType: m1.large
Ec2Kernel: aki-9ba0f1de
Ec2Ramdisk: unavailable
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Scott Moser (smoser) wrote :
tags: added: iso-testing
Revision history for this message
Dave Walker (davewalker) wrote :

Scott, was networking up at that stage, and was it possible to connect to the instance externally?

Have you only seen this once?

Thanks.

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Scott Moser (smoser)
description: updated
Revision history for this message
Scott Moser (smoser) wrote :

I had seen this recently on a lucid instance, where it caused a publish failure in the daily build scripts.
That occurred on an m1.large of:
  us-west-1 ami-f5bfefb0 canonical ebs/ubuntu-lucid-10.04-amd64-server-20110201.1

Revision history for this message
Scott Moser (smoser) wrote :

This only occurred once in today's beta-1 testing. I have seen it on very rare occasions, as noted above.

I started digging a bit through my build logs. The daily build process launches an m1.large instance in each region to facilitate populating EBS root instances. On 2011-03-03, I updated the utility instances that we use to the latest lucid refreshed version (ebs/ubuntu-lucid-10.04-amd64-server-20110201.1), so this first set of figures covers only that date onward.

- 66 build logs (17 failed for one reason or another)
- 298 instances launched from 2011-03-03 to 2011-03-30.
- 9 of 298 instances could not be reached via ssh
- 3 of those 9 instances were due to a timeout on the metadata service; these were:
  2011-03-26 us-west-1 maverick-desktop-uec-20110326
  2011-03-28 us-west-1 natty-server-uec-20110328
  2011-03-29 us-west-1 lucid-server-uec-20110329

So, 298 instances launched, 3 failures on metadata service. All of those in us-west-1 and occurred in the last 5 days.

Going back further, through all the logs I have since ~ 2010-03-16:
- 704 build logs
- 2620 instances launched
- 28 failed to have an elastic IP associated and be reachable via ssh.
- 4 TOTAL had console messages suggesting a timeout on the metadata service:
  2011-03-01 us-east-1 lucid-server-uec-20110301
  2011-03-26 us-west-1 maverick-desktop-uec-20110326
  2011-03-28 us-west-1 natty-server-uec-20110328
  2011-03-29 us-west-1 lucid-server-uec-20110329
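Expressed as rates, the two samples above work out as follows (quick arithmetic on the counts reported here, not part of the original report):

```python
# Metadata-service timeout rates implied by the counts above.
recent = 3 / 298     # instances launched 2011-03-03 .. 2011-03-30
overall = 4 / 2620   # all instances logged since ~2010-03-16
print(f"recent: {recent:.1%}, overall: {overall:.2%}")
```

So the failure is rare overall (~0.15%) but spiked to ~1% in the recent sample, all of it in us-west-1.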

Changed in cloud-init (Ubuntu):
status: Incomplete → New
importance: Undecided → Low
Revision history for this message
Eric Hammond (esh) wrote :

In my experience in 2008-2009, just because networking is up on an EC2 instance does not mean that 169.254.169.254 is going to accept connections and requests for meta-data and user-data. You need to wait for this to become available. I had code to do this in Ubuntu AMIs I built back then.

    perl -MIO::Socket::INET -e 'until(new IO::Socket::INET("169.254.169.254:80")){sleep 1}'

I used to think I was waiting for networking to come up, but it became clear that the metadata service was sometimes not listening even when networking was working and I had a local IP address on the instance.
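Eric's perl one-liner can be sketched in Python as a polling loop. This is a hypothetical helper for illustration, not cloud-init's code; the host and port are the standard EC2 metadata endpoint, and `_connect` is injectable only so the loop can be exercised without a real network:

```python
import socket
import time

def wait_for_metadata(host="169.254.169.254", port=80, sleep=1,
                      max_tries=None, _connect=None):
    """Poll until a TCP connection to the metadata service succeeds.

    Returns the number of attempts taken, or None if max_tries ran out.
    """
    connect = _connect or (lambda: socket.create_connection(
        (host, port), timeout=2).close())
    tries = 0
    while max_tries is None or tries < max_tries:
        tries += 1
        try:
            connect()       # raises OSError until the service is listening
            return tries
        except OSError:
            time.sleep(sleep)
    return None
```

With no `max_tries` this loops forever, exactly like the perl version.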

Revision history for this message
Eric Hammond (esh) wrote :

On second glance, cloud-init may actually be retrying the connection and the problem is that sometimes it takes longer than the current number of retries.

Revision history for this message
Scott Moser (smoser) wrote :

Well, if you look at the attached console logs, you can see that cloud-init is retrying. In lucid, we retried 100 times with an increasing back-off, resulting in 17 minutes or so of trying. The build scripts gave up after 5 minutes.

In natty (both to reduce the pain if cloud-init was installed outside of EC2, and because I had extensive data -- 2000+ instances showing that waiting did not help), I reduced that timeout significantly, but made it configurable in /etc/cloud/cloud.cfg.
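For reference, the EC2 datasource wait can be tuned with something like the following in /etc/cloud/cloud.cfg. The key names here follow later cloud-init documentation and are an assumption for 0.6.1; check the release's own docs for the exact names:

```yaml
# Hypothetical /etc/cloud/cloud.cfg fragment -- key names may differ in 0.6.1.
datasource:
  Ec2:
    max_wait: 120   # total seconds to keep retrying the metadata service
    timeout: 50     # per-request timeout in seconds
```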

From the console logs attached:
-- lucid --
  cloud-init running: Sun, 27 Mar 2011 10:14:09 +0000. up 10.57 seconds
  waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id
  10:14:11 [ 1/100]: url error [timed out]
  ...
  10:16:56 [31/100]: url error [timed out]

-- natty --
  cloud-init start running: Wed, 30 Mar 2011 17:01:16 +0000. up 10.08 seconds
  2011-03-30 17:01:18,555 - DataSourceEc2.py[WARNING]: waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id
  2011-03-30 17:01:18,555 - DataSourceEc2.py[WARNING]: 17:01:18 [ 1/30]: url error [timed out]
  ...
  2011-03-30 17:03:55,735 - DataSourceEc2.py[WARNING]: 17:03:55 [30/30]: url error [timed out]
  2011-03-30 17:04:01,742 - DataSourceEc2.py[CRITICAL]: giving up on md after 165 seconds

So, at the point at which the build publishing scripts gave up, cloud-init had been re-trying (after network had already come up) for 2 minutes and 35 seconds. In the natty test, it gave up after 2 minutes 45 seconds of trying.
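The lucid numbers (100 tries with an increasing back-off, "17 minutes or so") are consistent with a simple linear schedule. The schedule below is an assumption for illustration, not cloud-init's actual code:

```python
def backoff_schedule(max_tries, base=1, step_every=5):
    # Hypothetical linear back-off: the sleep starts at `base` seconds and
    # grows by one second every `step_every` attempts.
    return [base + i // step_every for i in range(max_tries)]

# Lucid-like settings: 100 tries -> 1050 seconds (17.5 minutes) of sleeping,
# in the ballpark of the "17 minutes or so" figure above.
print(sum(backoff_schedule(100)) / 60)
```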

If the metadata service isn't up 3 minutes after the instance is booted, then there is a bug in the platform; expecting an instance to wait 3 minutes is not acceptable.

Dave Walker (davewalker)
tags: added: server-nrs
Dave Walker (davewalker)
Changed in cloud-init (Ubuntu):
milestone: none → ubuntu-11.04
Scott Moser (smoser)
Changed in cloud-init (Ubuntu):
milestone: ubuntu-11.04 → none
Dave Walker (davewalker)
tags: removed: server-nrs
Revision history for this message
Scott Moser (smoser) wrote :

I'm marking this WONT FIX, as I do truly believe it is a platform issue, not anything that cloud-init could recover from.

Changed in cloud-init (Ubuntu):
status: New → Won't Fix
Revision history for this message
mumcowpig (a2907395) wrote :

Don't ask me why I installed the aptitude "task" group uec on my Ubuntu 10.04.2 server edition VirtualBox image. I wanted a minimal, streamlined Ubuntu virtual machine for developing for the Heroku cedar stack, which is based on Ubuntu 10.04 and Ruby 1.9.2.

My VirtualBox host machine is running the linux-server (PAE) kernel with the dkms-compiled virtualbox-ose kernel module loaded.

Almost every boot/service hang can be addressed by holding Shift on boot and choosing the (single-user) recovery kernel/initramfs image in grub/lilo. From there, I can choose a command prompt with networking and uninstall services like cloud-init, then reboot.

The design problem is that cloud-init by default runs before the recovery menu appears. At a minimum, I wish cloud-init would drop to the recovery console after its 100 attempts so I could fix this.

Is there perhaps a key sequence or kernel parameter I could add to the recovery grub boot menu that would allow me to disable this service on a one-shot basis?
