Nova n-cpu log full of instance_info_cache stack traces due to attempting to send events about deleted instances

Bug #1304968 reported by Sean Dague
This bug affects 4 people
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  High        Dan Smith
Icehouse                  Fix Released  High        YangLei

Bug Description

The bulk of the stack traces in n-cpu come from emit_event being triggered on a VM delete; by the time we get to emit_event the instance has already been deleted (we see this exception 183 times in this log, which means it's happening on *every* compute terminate), so when we try to look up the instance we hit the exception found here:

    @base.remotable_classmethod
    def get_by_instance_uuid(cls, context, instance_uuid):
        db_obj = db.instance_info_cache_get(context, instance_uuid)
        if not db_obj:
            raise exception.InstanceInfoCacheNotFound(
                    instance_uuid=instance_uuid)
        return InstanceInfoCache._from_db_object(context, cls(), db_obj)
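
For illustration, here is a minimal, self-contained sketch of why the lifecycle-event path trips this exception (hypothetical, simplified names, not Nova's actual modules): the handler reacts to the libvirt event only after terminate_instance has already removed the info cache row, so the lookup has nothing to return and raises.

    # Hypothetical sketch of the failing path; names are simplified stand-ins.
    class InstanceInfoCacheNotFound(Exception):
        """Stand-in for nova.exception.InstanceInfoCacheNotFound."""
        def __init__(self, instance_uuid):
            super(InstanceInfoCacheNotFound, self).__init__(
                'Info cache for instance %s could not be found.' % instance_uuid)

    # Pretend DB table: instance_uuid -> info cache row; a delete removes the row.
    _info_cache_db = {}

    def instance_info_cache_get(instance_uuid):
        return _info_cache_db.get(instance_uuid)

    def get_by_instance_uuid(instance_uuid):
        # Mirrors the method quoted above: a missing row becomes an exception.
        db_obj = instance_info_cache_get(instance_uuid)
        if not db_obj:
            raise InstanceInfoCacheNotFound(instance_uuid=instance_uuid)
        return db_obj

    def handle_lifecycle_event(instance_uuid):
        # The libvirt STOPPED event arrives after the terminate has finished,
        # so this lookup fails and emit_event logs "Exception dispatching event".
        return get_by_instance_uuid(instance_uuid)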

A log trace of this interaction looks like this:

2014-04-08 11:14:25.475 DEBUG nova.openstack.common.lockutils [req-fe9db989-416e-4da0-986c-e68336e3c602 TenantUsagesTestJSON-153098759 TenantUsagesTestJSON-953946497] Semaphore / lock released "do_terminate_instance" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:252
2014-04-08 11:14:25.907 DEBUG nova.openstack.common.lockutils [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] Got semaphore "75da98d7-bbd5-42a2-ad6f-7a66e38977fa" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:168
2014-04-08 11:14:25.907 DEBUG nova.openstack.common.lockutils [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] Got semaphore / lock "do_terminate_instance" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:248
2014-04-08 11:14:25.907 DEBUG nova.openstack.common.lockutils [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] Got semaphore "<function _lock_name at 0x41635f0>" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:168
2014-04-08 11:14:25.908 DEBUG nova.openstack.common.lockutils [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] Got semaphore / lock "_clear_events" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:248
2014-04-08 11:14:25.908 DEBUG nova.openstack.common.lockutils [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] Semaphore / lock released "_clear_events" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:252
2014-04-08 11:14:25.928 AUDIT nova.compute.manager [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] [instance: 75da98d7-bbd5-42a2-ad6f-7a66e38977fa] Terminating instance
2014-04-08 11:14:25.989 DEBUG nova.objects.instance [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] Lazy-loading `system_metadata' on Instance uuid 75da98d7-bbd5-42a2-ad6f-7a66e38977fa obj_load_attr /opt/stack/new/nova/nova/objects/instance.py:519
2014-04-08 11:14:26.209 DEBUG nova.network.api [req-687de0cf-67fc-434f-927f-4c37665ad5d8 FixedIPsTestJson-234831436 FixedIPsTestJson-1960919997] Updating cache with info: [VIF({'ovs_interfaceid': None, 'network': Network({'bridge': u'br100', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': u'fixed', 'floating_ips': [], 'address': u'10.1.0.2'})], 'version': 4, 'meta': {u'dhcp_server': u'10.1.0.1'}, 'dns': [IP({'meta': {}, 'version': 4, 'type': u'dns', 'address': u'8.8.4.4'})], 'routes': [], 'cidr': u'10.1.0.0/24', 'gateway': IP({'meta': {}, 'version': 4, 'type': u'gateway', 'address': u'10.1.0.1'})}), Subnet({'ips': [], 'version': None, 'meta': {u'dhcp_server': None}, 'dns': [], 'routes': [], 'cidr': None, 'gateway': IP({'meta': {}, 'version': None, 'type': u'gateway', 'address': None})})], 'meta': {u'tenant_id': None, u'should_create_bridge': True, u'bridge_interface': u'eth0'}, 'id': u'9751787e-f41c-4299-be13-941c901f6d18', 'label': u'private'}), 'devname': None, 'qbh_params': None, 'meta': {}, 'details': {}, 'address': u'fa:16:3e:d8:87:38', 'active': False, 'type': u'bridge', 'id': u'db1ac48d-805a-45d3-9bb9-786bb5855673', 'qbg_params': None})] update_instance_cache_with_nw_info /opt/stack/new/nova/nova/network/api.py:74
2014-04-08 11:14:27.661 2894 DEBUG nova.virt.driver [-] Emitting event <nova.virt.event.LifecycleEvent object at 0x4932e50> emit_event /opt/stack/new/nova/nova/virt/driver.py:1207
2014-04-08 11:14:27.661 2894 INFO nova.compute.manager [-] Lifecycle event 1 on VM 75da98d7-bbd5-42a2-ad6f-7a66e38977fa
2014-04-08 11:14:27.773 2894 ERROR nova.virt.driver [-] Exception dispatching event <nova.virt.event.LifecycleEvent object at 0x4932e50>: Info cache for instance 75da98d7-bbd5-42a2-ad6f-7a66e38977fa could not be found.
Traceback (most recent call last):

  File "/opt/stack/new/nova/nova/conductor/manager.py", line 597, in _object_dispatch
    return getattr(target, method)(context, *args, **kwargs)

  File "/opt/stack/new/nova/nova/objects/base.py", line 151, in wrapper
    return fn(self, ctxt, *args, **kwargs)

  File "/opt/stack/new/nova/nova/objects/instance.py", line 500, in refresh
    self.info_cache.refresh()

  File "/opt/stack/new/nova/nova/objects/base.py", line 151, in wrapper
    return fn(self, ctxt, *args, **kwargs)

  File "/opt/stack/new/nova/nova/objects/instance_info_cache.py", line 103, in refresh
    self.instance_uuid)

  File "/opt/stack/new/nova/nova/objects/base.py", line 112, in wrapper
    result = fn(cls, context, *args, **kwargs)

  File "/opt/stack/new/nova/nova/objects/instance_info_cache.py", line 70, in get_by_instance_uuid
    instance_uuid=instance_uuid)

InstanceInfoCacheNotFound: Info cache for instance 75da98d7-bbd5-42a2-ad6f-7a66e38977fa could not be found.

2014-04-08 11:14:27.840 2894 INFO nova.virt.libvirt.driver [-] [instance: 75da98d7-bbd5-42a2-ad6f-7a66e38977fa] Instance destroyed successfully.

Raw logs for a failure: http://logs.openstack.org/38/62038/14/check/check-tempest-dsvm-full/86cde16/logs/screen-n-cpu.txt.gz?level=TRACE

Specific failure point: http://logs.openstack.org/38/62038/14/check/check-tempest-dsvm-full/86cde16/logs/screen-n-cpu.txt.gz?level=DEBUG#_2014-04-08_11_14_25_928

Revision history for this message
Sean Dague (sdague) wrote :

I'm raising this to High because it's hit on *every* instance terminate.

Changed in nova:
importance: Undecided → High
tags: added: compute
Changed in nova:
status: New → Triaged
tags: added: libvirt
Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: Triaged → In Progress
assignee: nobody → Dan Smith (danms)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/86389
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=53ad788af6ced83bd9d6e58a25196a325d60fc4e
Submitter: Jenkins
Branch: master

commit 53ad788af6ced83bd9d6e58a25196a325d60fc4e
Author: Dan Smith <email address hidden>
Date: Wed Apr 9 09:29:09 2014 -0700

    Read deleted instances during lifecycle events

    This prevents us from failing to find the associated instance if
    the lifecycle event happens after the instance has been deleted,
    as in the case of a ... delete.

    Change-Id: I819ddf8ff68edc407d7932bf43771c440f514f26
    Closes-bug: #1304968
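
For context, here is a minimal sketch of the approach named in the commit title ("read deleted instances during lifecycle events"); it is not the actual diff (see the linked review for that), and it assumes a Nova development environment where these modules are importable. The idea is to build a request context that is allowed to read soft-deleted rows before looking the instance up in the lifecycle event handler.

    # Sketch only, not the merged change: read_deleted='yes' lets the lookup
    # succeed even if terminate_instance has already soft-deleted the instance.
    from nova import context as nova_context
    from nova import objects

    def handle_lifecycle_event(event):
        ctxt = nova_context.get_admin_context(read_deleted='yes')
        return objects.Instance.get_by_uuid(ctxt, event.get_instance_uuid())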

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-1
status: Fix Committed → Fix Released
Revision history for this message
YangLei (yanglyy) wrote :

Dan Smith,

Will you cherry-pick the code change to icehouse?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/112520

Revision history for this message
Matt Riedemann (mriedem) wrote :

We still have ~235K hits on this in 7 days; 20K of those are in stable/icehouse runs alone, and the rest, on master, might be related to grenade jobs.

After the stable/icehouse backport is merged (approved now):

https://review.openstack.org/#/c/112520/1

We should open a new bug to track new failures on master, since the same issue can crop up elsewhere in the instance delete flow and we just have to handle those cases one by one.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/112520
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4e1e217000f5cc2ad5c199793f3ed2aefc89ec46
Submitter: Jenkins
Branch: stable/icehouse

commit 4e1e217000f5cc2ad5c199793f3ed2aefc89ec46
Author: Dan Smith <email address hidden>
Date: Wed Apr 9 09:29:09 2014 -0700

    Read deleted instances during lifecycle events

    This prevents us from failing to find the associated instance if
    the lifecycle event happens after the instance has been deleted,
    as in the case of a ... delete.

    Change-Id: I819ddf8ff68edc407d7932bf43771c440f514f26
    Closes-bug: #1304968
    (cherry picked from commit 53ad788af6ced83bd9d6e58a25196a325d60fc4e)

tags: added: in-stable-icehouse
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is the logstash query I'm using:

message:"Exception dispatching event" AND message:"Info cache for instance" AND message:"could not be found" AND tags:"screen-n-cpu.txt"

http://goo.gl/hHMGdO

Now that the backport is merged, we can see whether this starts dropping off in stable/icehouse and grenade runs.

In a 4-hour logstash window it does drop off, so this is looking good.

Revision history for this message
Matt Riedemann (mriedem) wrote :

There are actually 0 hits in the last hour, so this can likely be closed now.

Thierry Carrez (ttx)
Changed in nova:
milestone: juno-1 → 2014.2