detach interface fails as instance info cache is corrupted

Bug #1326183 reported by Praveen Yalagandula
This bug affects 2 people
Affects                    Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)   Fix Released  High        wangpan
nova (Icehouse)            Fix Released  High        Matt Riedemann

Bug Description

Performing attach/detach interface on a VM sometimes results in an interface that can't be detached from the VM.
I traced it to corrupted instance info cache caused by a non-atomic update of that information.
Details on how to reproduce the bug are as follows. Since this is due to a race condition, the test can take quite a bit of time before it hits the bug.
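The failure mode is a classic lost update. A minimal sketch (plain Python, not nova code; names are illustrative) of how a stale read by the heal task can wipe out a concurrently attached VIF:

```python
# Sketch of the race: "cache" stands in for the instance's
# info_cache.network_info list of VIFs.
cache = ["vif-testnet1"]  # VIFs currently in the cache

# 1. The periodic heal task reads the cache (snapshot taken here).
heal_snapshot = list(cache)

# 2. Meanwhile, interface-attach adds a new VIF and writes the cache.
cache = cache + ["vif-testnet2"]

# 3. The heal task finishes and writes back its stale snapshot,
#    silently discarding the VIF attached in step 2.
cache = heal_snapshot

print(cache)  # ['vif-testnet1'] -- the attached VIF is gone
```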

Steps to reproduce:

1) Set up devstack from trunk with the following local.conf:
disable_service n-net
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service q-metering
RECLONE=yes
# and other options as set in the trunk's local

2) Create a few networks:
$> neutron net-create testnet1
$> neutron net-create testnet2
$> neutron net-create testnet3
$> neutron subnet-create testnet1 192.168.1.0/24
$> neutron subnet-create testnet2 192.168.2.0/24
$> neutron subnet-create testnet3 192.168.3.0/24

3) Create a testvm in testnet1:
$> nova boot --flavor m1.tiny --image cirros-0.3.2-x86_64-uec --nic net-id=`neutron net-list | grep testnet1 | cut -f 2 -d ' '` testvm

4) Run the following shell script, which attaches and detaches interfaces on this VM in the remaining two networks in a loop until it hits the issue:
--------
#! /bin/bash
c=10000
netid1=`neutron net-list | grep testnet2 | cut -f 2 -d ' '`
netid2=`neutron net-list | grep testnet3 | cut -f 2 -d ' '`
while [ $c -gt 0 ]
do
   echo "Round: " $c
   echo -n "Attaching two interfaces... "
   nova interface-attach --net-id $netid1 testvm
   nova interface-attach --net-id $netid2 testvm
   echo "Done"
   echo "Sleeping until both those show up in interfaces"
   waittime=0
   while [ $waittime -lt 60 ]
   do
       count=`nova interface-list testvm | wc -l`
       if [ $count -eq 7 ]
       then
           break
       fi
       sleep 2
       (( waittime+=2 ))
   done
   echo "Waited for " $waittime " seconds"
   echo "Detaching both... "
   nova interface-list testvm | grep $netid1 | awk '{print "deleting ",$4; system("nova interface-detach testvm "$4 " ; sleep 2");}'
   nova interface-list testvm | grep $netid2 | awk '{print "deleting ",$4; system("nova interface-detach testvm "$4 " ; sleep 2");}'
   echo "Done; check interfaces are gone in a minute."
   waittime=0
   while [ $waittime -lt 60 ]
   do
       count=`nova interface-list testvm | wc -l`
       echo "line count: " $count
       if [ $count -eq 5 ]
       then
           break
       fi
       sleep 2
       (( waittime+=2 ))
   done
   if [ $waittime -ge 60 ]
   then
      echo "bad case"
      exit 1
   fi
   echo "Interfaces are gone"
   (( c-- ))
done
---------

Eventually the test will stop with a failure ("bad case"), and the remaining interface (from either testnet2 or testnet3) cannot be detached at all.

Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

I traced it to a non-atomic instance info cache update. The _heal_instance_info_cache periodic
task runs in parallel with these API calls, and due to a race condition between the two, the info
cache gets corrupted.

I put in a debug statement to print a stack trace for each cache update; in the failed case below you can see that the "_heal_instance_info_cache" periodic task overwrote the info cache incorrectly and hence corrupted it.

------
aviuser@ubuntu-1204-server:~/devstack$ grep -A 1 "Updating cache" /opt/stack/logs/screen/screen-n-cpu.log | grep -A 4 1aea47a5-8677-4a68-aec6-74c8573dde52
2014-06-03 17:13:27.870 DEBUG nova.network.base_api [req-8a4531a0-1b8c-4c2d-815d-0e126515d446 demo demo] Updating cache with info: [VIF({'ovs_interfaceid': u'9a697bd9-d2fb-4924-8298-b76c8d1c7c4f', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u'192.168.1.2'})], 'version': 4, 'meta': {'dhcp_server': u'192.168.1.3'}, 'dns': [], 'routes': [], 'cidr': u'192.168.1.0/24', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'192.168.1.1'})})], 'meta': {'injected': False, 'tenant_id': u'adeb5c20087c42ccb7f4561a7d9cba6e'}, 'id': u'a8159f6d-dc4a-4eab-a6e2-b9d3a626f4f2', 'label': u'testnet1'}), 'devname': u'tap9a697bd9-d2', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:d1:50:3f', 'active': True, 'type': u'ovs', 'id': u'9a697bd9-d2fb-4924-8298-b76c8d1c7c4f', 'qbg_params': None}), VIF({'ovs_interfaceid': u'1aea47a5-8677-4a68-aec6-74c8573dde52', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u'192.168.2.143'})], 'version': 4, 'meta': {'dhcp_server': u'192.168.2.3'}, 'dns': [], 'routes': [], 'cidr': u'192.168.2.0/24', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'192.168.2.1'})})], 'meta': {'injected': False, 'tenant_id': u'adeb5c20087c42ccb7f4561a7d9cba6e'}, 'id': u'1f411a5a-4664-4383-a44a-9d83bef7c1ca', 'label': u'testnet2'}), 'devname': u'tap1aea47a5-86', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:12:24:fc', 'active': False, 'type': u'ovs', 'id': u'1aea47a5-8677-4a68-aec6-74c8573dde52', 'qbg_params': None})] from (pid=5206) update_instance_cache_with_nw_info /opt/stack/nova/nova/network/base_api.py:38
2014-06-03 17:13:27.871 DEBUG nova.network.base_api [req-8a4531a0-1b8c-4c2d-815d-0e126515d446 demo demo] traceback: [('/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py', 194, 'main', 'result = function(*args, **kwargs)'), ('/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py', 129, '<lambda>', 'yield lambda: self._dispatch_and_reply(incoming)'), ('/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py', 134, '_dispatch_and_reply', 'incoming.message))'), ('/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py', 177, '_dispatch', 'return self._do_dispatch(endpoint, method, ctxt, args)'), ('/u...


Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

Once corrupted, the info cache is never repaired, and hence detach interface fails.

Unfortunately, the code that refreshes the instance cache is very hard to follow, as the cache
is updated from multiple points in the code in different ways. However, for this
specific bug, I could ensure there is no race condition by putting a lock around
the code that obtains the network information and updates the instance info cache.

My patch to get around this bug is at https://review.openstack.org/#/c/97662/

Changed in nova:
status: New → In Progress
assignee: nobody → Praveen Yalagandula (ypraveen-5)
tags: added: icehouse-backport-potential
Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

This issue is present in the Icehouse distribution (2014.1). I could not figure out how to add that as an affected version too. Can someone add it or let me know how to add it?

Changed in nova:
assignee: Praveen Yalagandula (ypraveen-5) → Sean Dague (sdague)
Revision history for this message
Sean Dague (sdague) wrote :

This patch is still active, and will hopefully help with a number of related issues.

Changed in nova:
importance: Undecided → High
Revision history for this message
Alex Xu (xuhj) wrote :

Is there any error message from detach_interface? If the interface detach failed, there should be some message.

Revision history for this message
Praveen Yalagandula (ypraveen-5) wrote :

Alex, there was no error message, since detach_interface is an asynchronous call.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like bug 1357055 is related.

Changed in nova:
assignee: Sean Dague (sdague) → Dan Smith (danms)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/123917

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/97662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=232cbfe67ffb7696f115830c711a960af5fa0828
Submitter: Jenkins
Branch: master

commit 232cbfe67ffb7696f115830c711a960af5fa0828
Author: Praveen Yalagandula <email address hidden>
Date: Wed Jun 4 00:42:56 2014 +0000

    Neutron: Atomic update of instance info cache

    In the Neutron network API implementation, there is a race condition
    between a thread performing periodic task to read and heal instance
    network info cache and another thread servicing interface-attach or
    interface-detach calls. This patch ensures that instance info cache is
    read and then updated in a synchronized block to ensure atomicity.

    Please see the bug report for more details.

    Change-Id: Ifc76f2b1cce834b3c9927359ac9b957bc9f9c65f
    Closes-Bug: #1326183
    Co-Authored-By: Dan Smith <email address hidden>

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/123917
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dfb0e0fc1604c0673fb9af8316dbf0e5d2c93cd1
Submitter: Jenkins
Branch: stable/icehouse

commit dfb0e0fc1604c0673fb9af8316dbf0e5d2c93cd1
Author: Matt Riedemann <email address hidden>
Date: Thu Sep 25 06:48:47 2014 -0700

    Neutron: Atomic update of instance info cache

    In the Neutron network API implementation, there is a race condition
    between a thread performing periodic task to read and heal instance
    network info cache and another thread servicing interface-attach or
    interface-detach calls. This patch ensures that instance info cache is
    read and then updated in a synchronized block to ensure atomicity.

    Please see the bug report for more details.

    Conflicts:
            nova/network/base_api.py
            nova/network/neutronv2/api.py
            nova/tests/network/test_neutronv2.py

    NOTE(mriedem): Notes on conflicts:

    1. base_api didn't exist until Juno so that's where refresh_cache
       lives now.
    2. nova.network.neutronv2.api is using oslo.i18n in Juno rather
       than the old gettextutils import.
    3. test_neutronv2 didn't start using mock until Juno and the change
       that added the test class is probably something that we won't
       be backporting, so the test class definition is brought back
       with this change for the new test case.

    Change-Id: Ifc76f2b1cce834b3c9927359ac9b957bc9f9c65f
    Closes-Bug: #1326183
    Co-Authored-By: Dan Smith <email address hidden>
    (cherry picked from commit 232cbfe67ffb7696f115830c711a960af5fa0828)

tags: added: in-stable-icehouse
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
Revision history for this message
Juan Pavlik (jjpavlik) wrote :

I seem to have found the same problem in Grizzly, is it possible?

Revision history for this message
wangpan (hzwangpan) wrote :

I believe this bug still exists in the master branch; the steps to reproduce are:
1) pull latest master branch
2) add time.sleep(30) to _heal_instance_info_cache, like below:
diff --git a/nova/compute/manager.py b/nova/compute/manager.py
index f240709..9184e08 100644
--- a/nova/compute/manager.py
+++ b/nova/compute/manager.py
@@ -5358,6 +5358,10 @@ class ComputeManager(manager.Manager):
                     break
 
         if instance:
+            if instance['uuid'] == '61c1159c-3999-4bd5-9029-1757a55bce3d':  # one of the new instances created in step 4 (e.g. instanceB)
+                LOG.debug("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-------------------------xxxxxxxxxxxxx")
+                time.sleep(30)
+                LOG.debug("yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-------------------------xxxxxxxxxxxxx")
             # We have an instance now to refresh
             try:
                 # Call to network API to get instance info.. this will
3) restart nova-compute
4) create two new instances (e.g. instanceA & instanceB)
5) create a new port
6) attach the new port to instanceB when you see the debug log "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-------------------------xxxxxxxxxxxxx"
7) after the attach succeeds, nova list shows two IPs belonging to instanceB
8) after you see the debug log "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-------------------------xxxxxxxxxxxxx" (it may take a few seconds), nova list shows only ONE IP belonging to instanceB, but nova interface-list still shows two ports attached to instanceB

Changed in nova:
assignee: Dan Smith (danms) → wangpan (hzwangpan)
assignee: wangpan (hzwangpan) → nobody
Revision history for this message
wangpan (hzwangpan) wrote :

The commit https://review.openstack.org/97662 only adds a lock around refresh_cache, but the network info that the periodic task _heal_instance_info_cache has already fetched may still be the old, stale data (from before the attach), so a new VIF/port attached while this periodic task is running will be lost after _heal_instance_info_cache writes the cache.
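The residual race described above can be sketched in plain Python (not nova code; names are illustrative): if the heal task gathers its network info before taking the lock, the locked write still installs a pre-attach snapshot.

```python
import threading

lock = threading.Lock()
cache = ["vif-A"]

# 1. Heal task gathers network info OUTSIDE the lock (stale snapshot).
heal_nw_info = list(cache)

# 2. interface-attach runs, taking the lock for its read + update.
with lock:
    cache = cache + ["vif-B"]

# 3. Heal task now takes the lock and writes its pre-attach snapshot.
with lock:
    cache = heal_nw_info

print(cache)  # ['vif-A'] -- vif-B is lost despite the lock
```

The lock makes each write atomic, but it cannot repair a snapshot that was already stale when the lock was acquired.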

wangpan (hzwangpan)
Changed in nova:
assignee: nobody → wangpan (hzwangpan)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/164738

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm going to mark this incomplete since I think we need a recreate with this logging to find out what's failing in the neutronv2 API before we can move forward with this, and verify it's still a problem in Kilo.

https://review.openstack.org/#/c/164738/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/139900
Reason: I'm going to abandon this since we need a new bug filed (the bug was closed against other changes around locking while doing the nw info cache refresh), and we should have a recreate with the logging of the failure here:

https://review.openstack.org/#/c/164738/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/164738
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3244063a5cabd76a4651df8c0d8ff496ffc465d4
Submitter: Jenkins
Branch: master

commit 3244063a5cabd76a4651df8c0d8ff496ffc465d4
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 16 08:14:40 2015 -0700

    Log exception from deallocate_port_for_instance for triage

    detach_interface is a cast operation and sometimes
    NeutronClientExceptions slip through the neutronv2 API, so let's be sure
    to log any exceptions that come up from the network API method so we can
    triage them later.

    Related-Bug: #1326183

    Change-Id: I1e96128b8a502b32d1e651101d9bfd606ed4855b
