root partition may not be grown

Bug #937352 reported by Scott Moser
This bug affects 2 people
Affects                          Status         Importance   Assigned to     Milestone
cloud-utils                      Fix Released   High         Scott Moser
cloud-initramfs-tools (Ubuntu)   Fix Released   High         Unassigned
cloud-utils (Ubuntu)             Fix Released   High         Scott Moser

Bug Description

Not a dupe of bug 906722, but it seems very similar.

Occasionally on OpenStack clouds, we're seeing the root partition not being grown on first boot as it should be.

This was reported on precise alpha-2, and I have observed it there myself; I will attach console logs. It happens sporadically.

The output of the relevant section of initramfs in the failed case looks like this:

| Begin: Running /scripts/init-premount ... done.
| Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
| Begin: Running /scripts/local-premount ... done.
| [ 0.940836] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
| Begin: Running /scripts/local-bottom ... [ 0.955771] FDC 0 is a S82078B
| [ 1.012874] vda: vda1
| GROWROOT: WARNING: resize failed: FAILED: failed to resize
| ***** WARNING: Resize failed, attempting to revert ******
| Re-reading the partition table ...
| BLKRRPART: Device or resource busy
| The command to re-read the partition table failed.
| Run partprobe(8), kpartx(8) or reboot your system now,
| before using mkfs
| ***** Appears to have gone OK ****
| [ 1.114017] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
| done.
| done.
| Begin: Running /scripts/init-bottom ... done.

In the successful case, it looks like this:

| Begin: Running /scripts/init-premount ... done.
| Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
| Begin: Running /scripts/local-premount ... done.
| [ 1.035664] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
| Begin: Running /scripts/local-bottom ... [ 1.052365] FDC 0 is a S82078B
| [ 1.094201] vda: vda1
| [ 1.135831] vda: vda1
| GROWROOT: CHANGED: partition=1 start=16065 old: size=4176900 end=4192965 new: size=20948760,end=20964825
| [ 1.164067] Refined TSC clocksource calibration: 2532.595 MHz.
| [ 1.202383] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
| done.
| done.
| Begin: Running /scripts/init-bottom ... done.

The code running here is growroot/scripts/local-bottom/growroot [1]; growpart is part of cloud-utils [2].

Other relevant info: in at least one case, a reboot got the partition resized.

--
[1] http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/cloud-initramfs-tools/precise/view/head:/growroot/scripts/local-bottom/growroot
[2] http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/cloud-utils/precise/view/head:/bin/growpart

Related bugs:
 * bug 942788: sfdisk without --no-reread is likely to cause race conditions

Tags: canonistack
Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Scott Moser (smoser) wrote :
Scott Moser (smoser)
description: updated
Revision history for this message
Scott Moser (smoser) wrote :

This attachment does the following to a given block device (a rough sketch of the steps appears below):
  * create a partition table for the block device with first partition using roughly half disk space
  * mkfs first partition
  * mount partition
  * run growpart --dry-run
  * umount partition
  * run growpart

If you run it in a loop, it will most likely fail over 50 runs:
  for((i=0;i<50;i++)); do sudo ./go-sfdisk-crazy /dev/vdb; done

I've tested/developed this on an OpenStack instance, pointing it at the ephemeral disk (/dev/vdb).
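
The attachment itself is not reproduced here. A minimal sketch of the same sequence of steps, assuming an ext4 filesystem, the classic sfdisk input format, and a 'udevadm settle' to wait for the partition node (all assumptions for illustration, not taken from the attachment), might look like this:

  #!/bin/sh
  # Rough sketch only; NOT the actual go-sfdisk-crazy attachment.
  set -e
  dev="$1"                              # e.g. /dev/vdb
  part="${dev}1"
  mp="$(mktemp -d)"

  # create a partition table with the first partition covering ~half the disk
  sectors=$(blockdev --getsz "$dev")    # disk size in 512-byte sectors
  echo "2048,$((sectors / 2)),83" | sfdisk -uS "$dev"
  udevadm settle                        # wait for the partition device node

  mkfs.ext4 -F "$part"                  # mkfs first partition
  mount "$part" "$mp"                   # mount partition
  growpart --dry-run "$dev" 1           # run growpart --dry-run
  umount "$mp"                          # umount partition
  growpart "$dev" 1                     # run growpart
  rmdir "$mp"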

Revision history for this message
Scott Moser (smoser) wrote :

Note: it does seem that maybe this should be calling partprobe.
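
For reference, forcing the kernel to pick up a changed table separately from the write would look roughly like this (illustrative only; partprobe may not be available in every environment, such as the initramfs):

  # force a re-read of the partition table with partprobe instead of
  # relying on sfdisk's BLKRRPART ioctl
  partprobe /dev/vdb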

Revision history for this message
Scott Moser (smoser) wrote :

This snippet seems to reproduce the issue for me sporadically, and it is much simpler than the attached script. It assumes that /dev/vdb is already partitioned and that there is already a filesystem on /dev/vdb1.

sudo sh -c 'd=$1; p=${d}1; mp=/mnt; cleanup() { umount $mp >/dev/null 2>&1; }; trap cleanup EXIT; umount $p >/dev/null 2>&1; while :; do mount $p $mp && umount $mp && sfdisk --re-read $d || exit 1; echo -n .; done' -- /dev/vdb

Revision history for this message
Stefan Bader (smb) wrote :

Experimenting a bit with the snippet from comment #5, I can see the same thing happening for a Xen-based guest. One interesting fact here is that the condition is only temporary: adding a 'sleep 1' between the umount and the sfdisk command suppresses the failure.
So it feels like umount returns while the kernel is internally still cleaning up, and trying to revalidate the partition table before that has finished fails.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-initramfs-tools (Ubuntu):
status: New → Confirmed
Revision history for this message
Stefan Bader (smb) wrote :

Adding a "sync" call between umount and sfdisk seems to avoid the problem. At least the loop ran until I stopped it instead of failing very early (between one and three iterations).

Revision history for this message
Stefan Bader (smb) wrote :

The snippet from comment #5 actually has a major flaw: there is no waiting in the loop after the 'sfdisk --re-read', so we do actually run into races with the udev-triggered partition checks (blkid/cdrom_id). Adding a 'udevadm settle' after the sfdisk call prevents all the failures I saw before.
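
For concreteness, a variant of the comment #5 loop with the settle added after the re-read might look like the following (a sketch, not the exact commands used; run as root, and it assumes /dev/vdb1 already holds a filesystem, as in the original snippet):

  d=/dev/vdb; p=${d}1; mp=/mnt
  while :; do
      mount "$p" "$mp"       || exit 1
      umount "$mp"           || exit 1
      sfdisk --re-read "$d"  || exit 1
      udevadm settle          # wait for the udev-triggered probes (blkid etc.)
      echo -n .
  done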

The go-sfdisk-crazy script behaves similarly. I am not sure why I had no issues before, but this morning I saw it fail. Looking closer, the failures started before the grow step, though: it usually had issues during the partition setup, and then all the rest went wrong too. That went away after adding a settle after the resize.

There also was something very wrong with the CHS output, which seemed to change (even later, on the successful runs); I am not yet sure I understand what is happening there. Overall, though, I think we may be hunting in the wrong direction. Looking at the initial failing output, there is just the message about sfdisk failing, and then (I think that is in the recovery) the re-read of the partition table fails. Could it be that sfdisk reported failure but triggered a partition table update anyway, and that the call in the recovery then fails because we are still racing against udev commands triggered by the previous call?

Revision history for this message
Stefan Bader (smb) wrote :

Side note: it seems the CHS output for a partially used drive will not necessarily match the drive, even when fdisk showed the drive's values while the partial partition was created. Adding a second partition that fills the rest of the disk fixes that... And sfdisk never seems happy with the values when doing an 'sfdisk -l'...

Revision history for this message
Scott Moser (smoser) wrote :

Regarding CHS, I'm fairly sure that it's not important. The only reason I actually use it is so that I can specify (and see) values in sectors rather than in a larger unit.

Revision history for this message
Scott Moser (smoser) wrote :

Sorry to have brought us down the wrong road with some of my snippets above.

I just uploaded a new cloud-utils (growpart) which will show the original output of the failed sfdisk command.
In the original summary of this bug, all that is present is the output of the *restore* command run after the sfdisk call that changed the table had failed.

The next builds will have the new growpart, and we can upload them and try to catch the failure again.

Revision history for this message
Scott Moser (smoser) wrote :

I've now uploaded some new images, with the improved growroot, and the output of a failed resize looks like this:

------
Begin: Running /scripts/local-bottom ... [ 0.892501] vda: vda1
GROWROOT: WARNING: resize failed: attempt to resize /dev/vda failed. sfdisk output below:
| Checking that no-one is using this disk right now ...
| OK
|
| Disk /dev/vda: 1305 cylinders, 255 heads, 63 sectors/track
| Old situation:
| Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
|
| Device Boot Start End #cyls #blocks Id System
| /dev/vda1 * 1 260 260 2088450 83 Linux
| /dev/vda2 0 - 0 0 0 Empty
| /dev/vda3 0 - 0 0 0 Empty
| /dev/vda4 0 - 0 0 0 Empty
| New situation:
| Units = sectors of 512 bytes, counting from 0
|
| Device Boot Start End #sectors Id System
| /dev/vda1 * 16065 20964824 20948760 83 Linux
| /dev/vda2 0 - 0 0 Empty
| /dev/vda3 0 - 0 0 Empty
| /dev/vda4 0 - 0 0 Empty
| Successfully wrote the new partition table
|
| Re-reading the partition table ...
| BLKRRPART: Device or resource busy
| The command to re-read the partition table failed.
| Run partprobe(8), kpartx(8) or reboot your system now,
| before using mkfs
| If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
| to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
| (See fdisk(8).)
FAILED: failed to resize
***** WARNING: Resize failed, attempting to revert ******
Re-reading the partition table ...
BLKRRPART: Device or resource busy
The command to re-read the partition table failed.
Run partprobe(8), kpartx(8) or reboot your system now,
before using mkfs
***** Appears to have gone OK ****
[ 1.006837] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
-------

So growroot tried to resize, and the resize failed (BLKRRPART reported the device busy). It then tried to set the partition table back to the original, and that BLKRRPART call failed as well.

So something other than the mount has the block device open.
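
In a normal userspace environment one could check what is holding the device open with something like the following (assuming the tools are installed; inside the initramfs they typically are not, which is part of what makes this hard to observe):

  # list processes that have the block device open; none of these tools
  # are guaranteed to exist in an initramfs
  fuser -v /dev/vda
  lsof /dev/vda
  ls -l /proc/[0-9]*/fd/* 2>/dev/null | grep /dev/vda   # /proc-only fallback (also matches vda1)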

Revision history for this message
Scott Moser (smoser) wrote :

I've just uploaded what we believe/hope will be a fix to this issue.
After much discussion, we decided to add a 'udevadm settle' prior to invoking 'growpart'.

The suspicion is that, in the initial flurry of device activation, one of the udev-spawned events still had an open file handle on the block device (/dev/vda in this case) when sfdisk made the BLKRRPART call.

Unfortunately we did not definitively prove this, or even catch a 'ps' output when there was a failure. The one thing I fear here is that 'udevadm settle' is quite likely to add enough delay to hide the issue even if it is not the real solution.
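
For illustration, the shape of that change is roughly the following (a sketch of the idea only, not the actual growroot local-bottom script; the device and partition number are placeholders):

  udevadm settle        # let the initial flurry of udev events finish first
  growpart /dev/vda 1   # then ask growpart to rewrite the partition table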

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-initramfs-tools - 0.4ubuntu1

---------------
cloud-initramfs-tools (0.4ubuntu1) precise; urgency=low

  * growroot: run 'udevadm settle' before attempting growpart to
    allow initial events to finish (LP: #937352)
 -- Scott Moser <email address hidden> Thu, 23 Feb 2012 22:19:20 -0500

Changed in cloud-initramfs-tools (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Scott Moser (smoser) wrote :

The committed fix did not seem to fix this.

summary: - root partition in may not be grown
+ root partition may not be grown
Changed in cloud-initramfs-tools (Ubuntu):
status: Fix Released → In Progress
Scott Moser (smoser)
Changed in cloud-utils:
assignee: nobody → Scott Moser (smoser)
importance: Undecided → High
status: New → Fix Committed
Changed in cloud-initramfs-tools (Ubuntu):
importance: Undecided → High
status: In Progress → Fix Released
Changed in cloud-utils (Ubuntu):
assignee: nobody → Scott Moser (smoser)
importance: Undecided → High
milestone: none → ubuntu-12.04-beta-1
status: New → In Progress
Revision history for this message
Scott Moser (smoser) wrote :

There is a lot of confusing information in the above comments. To sum up, the solution that I'll be uploading shortly for precise does the following:
 * adds to cloud-initramfs-growroot a 'udevadm settle' before invoking growpart to apply the change; this ensures that any disk events for the root block device have finished before sfdisk is run.
 * in cloud-utils's growpart, passes the '--no-reread' argument to sfdisk when making changes to the disk (see the sketch below).
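
For illustration, the two halves of the fix amount to something like this (a sketch using the partition values from the log above, not the actual growroot/growpart code):

  udevadm settle      # growroot: wait for the initial disk events to finish
  # growpart: write the new table without asking the kernel to re-read it
  # right away (--no-reread), avoiding the BLKRRPART failure seen in the logs;
  # the kernel picks up the change via partprobe, partx or the next reboot
  echo "16065,20948760,83,*" | sfdisk --no-reread -uS /dev/vda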

description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-utils - 0.25-0ubuntu4

---------------
cloud-utils (0.25-0ubuntu4) precise; urgency=low

  * growpart: invoke sfdisk with '--no-reread' to avoid udev race
    conditions (LP: #937352)
 -- Scott Moser <email address hidden> Tue, 28 Feb 2012 14:40:51 -0500

Changed in cloud-utils (Ubuntu):
status: In Progress → Fix Released
James Troup (elmo)
tags: added: canonistack
Scott Moser (smoser)
Changed in cloud-utils:
status: Fix Committed → Fix Released