race condition with neutron in nova migrate code

Bug #1357599 reported by Aaron Rosen
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  High        Aaron Rosen
Icehouse                  Fix Released  High        Aaron Rosen

Bug Description

The tempest test that resizes an instance intermittently fails with a neutron virtual interface timeout error. This happens because resize_instance() calls:

            disk_info = self.driver.migrate_disk_and_power_off(
                    context, instance, migration.dest_host,
                    instance_type, network_info,
                    block_device_info)

which calls destroy(), which unplugs the vifs. Then,

            self.driver.finish_migration(context, migration, instance,
                                         disk_info,
                                         network_info,
                                         image, resize_instance,
                                         block_device_info, power_on)

is called, which expects a vif_plugged event. Since this happens on the same host, the neutron agent is not able to detect that the vif was unplugged and then plugged again, because it happens so fast. To fix this, we should check whether we are migrating to the same host; if we are, we should not expect to get an event.
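
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: PollingAgent and plugged_events_after_resize are hypothetical names made up for illustration and are not neutron or nova code. The model assumes the agent learns about ports only by comparing snapshots between polls, so an unplug followed by a plug inside a single polling interval leaves no visible difference and no vif_plugged event is produced.

```python
class PollingAgent:
    """Toy stand-in for a polling agent (hypothetical, not neutron code)."""

    def __init__(self):
        self.known = set()   # ports seen at the previous poll
        self.events = []     # events the agent would report

    def poll(self, host_ports):
        """Diff the current ports against the previous snapshot."""
        current = set(host_ports)
        for port in sorted(current - self.known):
            self.events.append(("vif_plugged", port))
        for port in sorted(self.known - current):
            self.events.append(("vif_unplugged", port))
        self.known = current


def plugged_events_after_resize(poll_during_gap):
    """Return the vif_plugged events reported for an unplug/plug cycle."""
    ports = {"vif-1"}
    agent = PollingAgent()
    agent.poll(ports)          # agent already knows the port before the resize
    agent.events.clear()

    ports.discard("vif-1")     # destroy() unplugs the vif
    if poll_during_gap:
        agent.poll(ports)      # slow path: agent notices the unplug
    ports.add("vif-1")         # finish_migration() plugs it again
    agent.poll(ports)

    return [e for e in agent.events if e[0] == "vif_plugged"]
```

In this model, plugged_events_after_resize(False) returns an empty list, which corresponds to the case where the waiter times out, while plugged_events_after_resize(True) returns one vif_plugged event.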

8d1] Setting instance vm_state to ERROR
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] Traceback (most recent call last):
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] File "/opt/stack/new/nova/nova/compute/manager.py", line 3714, in finish_resize
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] disk_info, image)
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] File "/opt/stack/new/nova/nova/compute/manager.py", line 3682, in _finish_resize
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] old_instance_type, sys_meta)
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] six.reraise(self.type_, self.value, self.tb)
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] File "/opt/stack/new/nova/nova/compute/manager.py", line 3677, in _finish_resize
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] block_device_info, power_on)
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5302, in finish_migration
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] block_device_info, power_on)
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 3792, in _create_domain_and_network
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] raise exception.VirtualInterfaceCreateException()
2014-08-14 00:03:58.010 1276 TRACE nova.compute.manager [instance: dca468e4-d26f-4ae2-a522-7d02ef7c98d1] VirtualInterfaceCreateException: Virtual Interface creation failed

Aaron Rosen (arosen)
Changed in nova:
assignee: nobody → Aaron Rosen (arosen)
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/116354

Changed in nova:
status: New → In Progress
Joe Gordon (jogo)
Changed in nova:
milestone: none → juno-3
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/119260

Thierry Carrez (ttx)
Changed in nova:
milestone: juno-3 → juno-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/116354
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=80deed660815b9ba69e424b83805a21768b02bb6
Submitter: Jenkins
Branch: master

commit 80deed660815b9ba69e424b83805a21768b02bb6
Author: Aaron Rosen <email address hidden>
Date: Fri Aug 22 10:35:09 2014 -0700

    Fix race condition with vif plugging in finish migrate

    The tempest test that does a resize on an instance from time to time fails
    with neutron with a virtual interface timeout error. The reason why this
    occurs is that nova-compute calls a vifs_unplug() then vifs_plug() fairly
    quickly and the neutron-agent doesn't realize this happens because the way
    it detects this is via polling. This patch fixes this problem by passing
    vifs_already_plugged=True to _create_domain_and_network to avoid waiting
    for the neutron notification.

    Initially, I thought this problem could be solved by setting
    vifs_already_plugged=True if migrate.source_host == migrate_desk.dest_host
    but it turns out that this race condition can still occur if migrated to
    another host though it's fairly unlikely. Always setting
    vifs_already_plugged=True ensures we never hit this issue on migrate.

    Change-Id: I90424ad50ac993c4abb049aa0654a94b225b5ebb
    Closes-bug: 1357599
    Closes-bug: 1357476
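
The effect of passing vifs_already_plugged=True can be sketched as follows. This is a simplified illustration, not the actual nova implementation: create_domain_and_network, VirtualInterfaceCreateError, and the plug_vifs and vif_plugged parameters are stand-ins invented for this example, and a threading.Event is assumed to play the role of the neutron notification.

```python
import threading


class VirtualInterfaceCreateError(Exception):
    """Stand-in for nova's VirtualInterfaceCreateException."""


def create_domain_and_network(plug_vifs, vif_plugged, timeout,
                              vifs_already_plugged=False):
    """Plug the vifs and optionally wait for a plug notification.

    plug_vifs:   callable that plugs the interfaces (hypothetical)
    vif_plugged: threading.Event set when the notification arrives
    timeout:     seconds to wait for the notification
    """
    plug_vifs()
    if vifs_already_plugged:
        # Caller says no event should be expected, so do not wait.
        return "started-without-waiting"
    if not vif_plugged.wait(timeout):
        # No notification arrived in time: the failure seen in the trace.
        raise VirtualInterfaceCreateError("Virtual Interface creation failed")
    return "started-after-event"
```

With an event that is never set, calling the function without the flag raises after the timeout, while calling it with vifs_already_plugged=True returns immediately. That is the behaviour the commit message relies on to avoid the race on migrate.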

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/119260
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0d69163d6a430d95d2bc0b30e111703d536fbcbd
Submitter: Jenkins
Branch: stable/icehouse

commit 0d69163d6a430d95d2bc0b30e111703d536fbcbd
Author: Aaron Rosen <email address hidden>
Date: Fri Aug 22 10:35:09 2014 -0700

    Fix race condition with vif plugging in finish migrate

    The tempest test that does a resize on an instance from time to time fails
    with neutron with a virtual interface timeout error. The reason why this
    occurs is that nova-compute calls a vifs_unplug() then vifs_plug() fairly
    quickly and the neutron-agent doesn't realize this happens because the way
    it detects this is via polling. This patch fixes this problem by passing
    vifs_already_plugged=True to _create_domain_and_network to avoid waiting
    for the neutron notification.

    Initially, I thought this problem could be solved by setting
    vifs_already_plugged=True if migrate.source_host == migrate_desk.dest_host
    but it turns out that this race condition can still occur if migrated to
    another host though it's fairly unlikely. Always setting
    vifs_already_plugged=True ensures we never hit this issue on migrate.

    Change-Id: I90424ad50ac993c4abb049aa0654a94b225b5ebb
    Closes-bug: 1357599
    Closes-bug: 1357476
    (cherry picked from commit 80deed660815b9ba69e424b83805a21768b02bb6)

tags: added: in-stable-icehouse
Sean Dague (sdague)
Changed in nova:
status: Fix Committed → Confirmed
status: Confirmed → Fix Committed
luozhipeng (806670512-q)
description: updated
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2