detach interface fails as instance info cache is corrupted

Bug #1326183 reported by Praveen Yalagandula
This bug affects 2 people
Affects                    Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)   Fix Released  High        wangpan
nova (Icehouse)            Fix Released  High        Matt Riedemann

Bug Description

Performing attach/detach interface on a VM sometimes results in an interface that can't be detached from the VM.
I traced it to corrupted instance info cache caused by a non-atomic update of that information.
Details on how to reproduce the bug are as follows. Since this is due to a race condition, the test can take quite a bit of time before it hits the bug.
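The failure mode is a classic lost update. A minimal sketch (plain Python, not nova code; names are illustrative) of how a stale read by the heal task can wipe out a concurrently attached VIF:

```python
# Sketch of the race: "cache" stands in for the instance's
# info_cache.network_info list of VIFs.
cache = ["vif-testnet1"]  # VIFs currently in the cache

# 1. The periodic heal task reads the cache (snapshot taken here).
heal_snapshot = list(cache)

# 2. Meanwhile, interface-attach adds a new VIF and writes the cache.
cache = cache + ["vif-testnet2"]

# 3. The heal task finishes and writes back its stale snapshot,
#    silently discarding the VIF attached in step 2.
cache = heal_snapshot

print(cache)  # ['vif-testnet1'] -- the attached VIF is gone
```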

Steps to reproduce:

1) Set up devstack from trunk with the following local.conf:
disable_service n-net
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service q-metering
RECLONE=yes
# and other options as set in the trunk's local

2) Create a few networks:
$> neutron net-create testnet1
$> neutron net-create testnet2
$> neutron net-create testnet3
$> neutron subnet-create testnet1 192.168.1.0/24
$> neutron subnet-create testnet2 192.168.2.0/24
$> neutron subnet-create testnet3 192.168.3.0/24

3) Create a testvm in testnet1:
$> nova boot --flavor m1.tiny --image cirros-0.3.2-x86_64-uec --nic net-id=`neutron net-list | grep testnet1 | cut -f 2 -d ' '` testvm

4) Run the following shell script, which attaches and detaches interfaces on this VM in the remaining two networks in a loop until it hits the issue:
--------
#! /bin/bash
c=10000
netid1=`neutron net-list | grep testnet2 | cut -f 2 -d ' '`
netid2=`neutron net-list | grep testnet3 | cut -f 2 -d ' '`
while [ $c -gt 0 ]
do
   echo "Round: " $c
   echo -n "Attaching two interfaces... "
   nova interface-attach --net-id $netid1 testvm
   nova interface-attach --net-id $netid2 testvm
   echo "Done"
   echo "Sleeping until both those show up in interfaces"
   waittime=0
   while [ $waittime -lt 60 ]
   do
       count=`nova interface-list testvm | wc -l`
       if [ $count -eq 7 ]
       then
           break
       fi
       sleep 2
       (( waittime+=2 ))
   done
   echo "Waited for " $waittime " seconds"
   echo "Detaching both... "
   nova interface-list testvm | grep $netid1 | awk '{print "deleting ",$4; system("nova interface-detach testvm "$4 " ; sleep 2");}'
   nova interface-list testvm | grep $netid2 | awk '{print "deleting ",$4; system("nova interface-detach testvm "$4 " ; sleep 2");}'
   echo "Done; check interfaces are gone in a minute."
   waittime=0
   while [ $waittime -lt 60 ]
   do
       count=`nova interface-list testvm | wc -l`
       echo "line count: " $count
       if [ $count -eq 5 ]
       then
           break
       fi
       sleep 2
       (( waittime+=2 ))
   done
   if [ $waittime -ge 60 ]
   then
      echo "bad case"
      exit 1
   fi
   echo "Interfaces are gone"
   (( c-- ))
done
---------

Eventually the test will stop with a failure ("bad case"), and the remaining interface (from either testnet2 or testnet3) cannot be detached at all.

Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

I traced it to a non-atomic instance info cache update. The _heal_instance_info_cache periodic
task runs in parallel with these API calls, and due to a race condition between the two, the info
cache gets corrupted.

I put in a debug statement to print a stack trace for each cache update; in the failed case below you can see that the "_heal_instance_info_cache" periodic task overwrote the info cache incorrectly and hence corrupted it.

------
aviuser@ubuntu-1204-server:~/devstack$ grep -A 1 "Updating cache" /opt/stack/logs/screen/screen-n-cpu.log | grep -A 4 1aea47a5-8677-4a68-aec6-74c8573dde52
2014-06-03 17:13:27.870 DEBUG nova.network.base_api [req-8a4531a0-1b8c-4c2d-815d-0e126515d446 demo demo] Updating cache with info: [VIF({'ovs_interfaceid': u'9a697bd9-d2fb-4924-8298-b76c8d1c7c4f', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u'192.168.1.2'})], 'version': 4, 'meta': {'dhcp_server': u'192.168.1.3'}, 'dns': [], 'routes': [], 'cidr': u'192.168.1.0/24', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'192.168.1.1'})})], 'meta': {'injected': False, 'tenant_id': u'adeb5c20087c42ccb7f4561a7d9cba6e'}, 'id': u'a8159f6d-dc4a-4eab-a6e2-b9d3a626f4f2', 'label': u'testnet1'}), 'devname': u'tap9a697bd9-d2', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:d1:50:3f', 'active': True, 'type': u'ovs', 'id': u'9a697bd9-d2fb-4924-8298-b76c8d1c7c4f', 'qbg_params': None}), VIF({'ovs_interfaceid': u'1aea47a5-8677-4a68-aec6-74c8573dde52', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u'192.168.2.143'})], 'version': 4, 'meta': {'dhcp_server': u'192.168.2.3'}, 'dns': [], 'routes': [], 'cidr': u'192.168.2.0/24', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'192.168.2.1'})})], 'meta': {'injected': False, 'tenant_id': u'adeb5c20087c42ccb7f4561a7d9cba6e'}, 'id': u'1f411a5a-4664-4383-a44a-9d83bef7c1ca', 'label': u'testnet2'}), 'devname': u'tap1aea47a5-86', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:12:24:fc', 'active': False, 'type': u'ovs', 'id': u'1aea47a5-8677-4a68-aec6-74c8573dde52', 'qbg_params': None})] from (pid=5206) update_instance_cache_with_nw_info /opt/stack/nova/nova/network/base_api.py:38
2014-06-03 17:13:27.871 DEBUG nova.network.base_api [req-8a4531a0-1b8c-4c2d-815d-0e126515d446 demo demo] traceback: [('/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py', 194, 'main', 'result = function(*args, **kwargs)'), ('/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py', 129, '<lambda>', 'yield lambda: self._dispatch_and_reply(incoming)'), ('/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py', 134, '_dispatch_and_reply', 'incoming.message))'), ('/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py', 177, '_dispatch', 'return self._do_dispatch(endpoint, method, ctxt, args)'), ('/u...


Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

Once corrupted, the info cache is never repaired, and hence detach interface fails.

Unfortunately, the code that refreshes the instance cache is very hard to follow, as the cache
is updated from multiple points in the code in different ways. However, for this
specific bug, I could ensure there is no race condition by putting a lock around
the code that obtains the network information and updates the instance info cache.

My patch to get around this bug is at https://review.openstack.org/#/c/97662/

Changed in nova:
status: New → In Progress
assignee: nobody → Praveen Yalagandula (ypraveen-5)
tags: added: icehouse-backport-potential
Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

This issue is present in the Icehouse distribution (2014.1). I could not figure out how to add that as an affected version too. Can someone add it or let me know how to add it?

Changed in nova:
assignee: Praveen Yalagandula (ypraveen-5) → Sean Dague (sdague)
Revision history for this message
Sean Dague (sdague) wrote :

This patch is still active, and will hopefully help with a number of related issues.

Changed in nova:
importance: Undecided → High
Revision history for this message
Alex Xu (xuhj) wrote :

Is there any error message from detach_interface? If the interface detach failed, there should be some message.

Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

Alex, there was no error message, since detach_interface is an asynchronous call.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like bug 1357055 is related.

Changed in nova:
assignee: Sean Dague (sdague) → Dan Smith (danms)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/123917

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/97662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=232cbfe67ffb7696f115830c711a960af5fa0828
Submitter: Jenkins
Branch: master

commit 232cbfe67ffb7696f115830c711a960af5fa0828
Author: Praveen Yalagandula <email address hidden>
Date: Wed Jun 4 00:42:56 2014 +0000

    Neutron: Atomic update of instance info cache

    In the Neutron network API implementation, there is a race condition
    between a thread performing periodic task to read and heal instance
    network info cache and another thread servicing interface-attach or
    interface-detach calls. This patch ensures that instance info cache is
    read and then updated in a synchronized block to ensure atomicity.

    Please see the bug report for more details.

    Change-Id: Ifc76f2b1cce834b3c9927359ac9b957bc9f9c65f
    Closes-Bug: #1326183
    Co-Authored-By: Dan Smith <email address hidden>

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/123917
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dfb0e0fc1604c0673fb9af8316dbf0e5d2c93cd1
Submitter: Jenkins
Branch: stable/icehouse

commit dfb0e0fc1604c0673fb9af8316dbf0e5d2c93cd1
Author: Matt Riedemann <email address hidden>
Date: Thu Sep 25 06:48:47 2014 -0700

    Neutron: Atomic update of instance info cache

    In the Neutron network API implementation, there is a race condition
    between a thread performing periodic task to read and heal instance
    network info cache and another thread servicing interface-attach or
    interface-detach calls. This patch ensures that instance info cache is
    read and then updated in a synchronized block to ensure atomicity.

    Please see the bug report for more details.

    Conflicts:
            nova/network/base_api.py
            nova/network/neutronv2/api.py
            nova/tests/network/test_neutronv2.py

    NOTE(mriedem): Notes on conflicts:

    1. base_api didn't exist until Juno so that's where refresh_cache
       lives now.
    2. nova.network.neutronv2.api is using oslo.i18n in Juno rather
       than the old gettextutils import.
    3. test_neutronv2 didn't start using mock until Juno and the change
       that added the test class is probably something that we won't
       be backporting, so the test class definition is brought back
       with this change for the new test case.

    Change-Id: Ifc76f2b1cce834b3c9927359ac9b957bc9f9c65f
    Closes-Bug: #1326183
    Co-Authored-By: Dan Smith <email address hidden>
    (cherry picked from commit 232cbfe67ffb7696f115830c711a960af5fa0828)

tags: added: in-stable-icehouse
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
Revision history for this message
Juan Pavlik (jjpavlik) wrote :

I seem to have found the same problem in Grizzly, is it possible?

Revision history for this message
wangpan (hzwangpan) wrote :

I believe this bug still exists in the master branch; the steps to reproduce are:
1) pull latest master branch
2) add time.sleep(30) to _heal_instance_info_cache, like below:
diff --git a/nova/compute/manager.py b/nova/compute/manager.py
index f240709..9184e08 100644
--- a/nova/compute/manager.py
+++ b/nova/compute/manager.py
@@ -5358,6 +5358,10 @@ class ComputeManager(manager.Manager):
                     break
 
         if instance:
+            if instance['uuid'] == '61c1159c-3999-4bd5-9029-1757a55bce3d':  # one of the new instances created in step 4 (e.g. instanceB)
+                LOG.debug("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-------------------------xxxxxxxxxxxxx")
+                time.sleep(30)
+                LOG.debug("yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-------------------------xxxxxxxxxxxxx")
             # We have an instance now to refresh
             try:
                 # Call to network API to get instance info.. this will
3) restart nova-compute
4) create two new instances (e.g. instanceA & instanceB)
5) create a new port
6) attach the new port to instanceB when you see the debug log "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-------------------------xxxxxxxxxxxxx"
7) after the attach succeeds, nova list shows two IPs belonging to instanceB
8) after you see the debug log "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-------------------------xxxxxxxxxxxxx" (it may take a few seconds), nova list shows only ONE IP belonging to instanceB, but nova interface-list still shows two ports attached to instanceB

Changed in nova:
assignee: Dan Smith (danms) → wangpan (hzwangpan)
assignee: wangpan (hzwangpan) → nobody
Revision history for this message
wangpan (hzwangpan) wrote :

The commit https://review.openstack.org/97662 only adds a lock around refresh_cache, but the network info that the periodic task _heal_instance_info_cache has already fetched may still be the old, stale data (from before the attach), so a new VIF/port attached while this periodic task is running will be lost after _heal_instance_info_cache writes the cache.
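The residual race described above can be sketched in plain Python (not nova code; names are illustrative): if the heal task gathers its network info before taking the lock, the locked write still installs a pre-attach snapshot.

```python
import threading

lock = threading.Lock()
cache = ["vif-A"]

# 1. Heal task gathers network info OUTSIDE the lock (stale snapshot).
heal_nw_info = list(cache)

# 2. interface-attach runs, taking the lock for its read + update.
with lock:
    cache = cache + ["vif-B"]

# 3. Heal task now takes the lock and writes its pre-attach snapshot.
with lock:
    cache = heal_nw_info

print(cache)  # ['vif-A'] -- vif-B is lost despite the lock
```

The lock makes each write atomic, but it cannot repair a snapshot that was already stale when the lock was acquired.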

wangpan (hzwangpan)
Changed in nova:
assignee: nobody → wangpan (hzwangpan)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/164738

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm going to mark this incomplete since I think we need a recreate with this logging to find out what's failing in the neutronv2 API before we can move forward with this, and verify it's still a problem in Kilo.

https://review.openstack.org/#/c/164738/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/139900
Reason: I'm going to abandon this since we need a new bug filed (the bug was closed against other changes around locking while doing the nw info cache refresh), and we should have a recreate with the logging of the failure here:

https://review.openstack.org/#/c/164738/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/164738
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3244063a5cabd76a4651df8c0d8ff496ffc465d4
Submitter: Jenkins
Branch: master

commit 3244063a5cabd76a4651df8c0d8ff496ffc465d4
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 16 08:14:40 2015 -0700

    Log exception from deallocate_port_for_instance for triage

    detach_interface is a cast operation and sometimes
    NeutronClientExceptions slip through the neutronv2 API, so let's be sure
    to log any exceptions that come up from the network API method so we can
    triage them later.

    Related-Bug: #1326183

    Change-Id: I1e96128b8a502b32d1e651101d9bfd606ed4855b
