ec2 metadata service is very slow

Bug #851159 reported by Scott Moser
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Jesse Andrews
Milestone: 2012.1

Bug Description

$ time python -c 'import boto.utils; boto.utils.get_instance_metadata()'

real 0m5.740s
user 0m0.190s
sys 0m0.110s

I've not done a lot of digging on this, but I suspect that
nova/api/ec2/cloud.py:get_metadata is doing some heavy queries.

cloud-init makes at least 2 separate requests to the metadata service on boot, so we see something like this in cloud-init.log:

2011-09-15 18:17:53,844 - cloud-init[INFO]: cloud-init start running: Thu, 15 Sep 2011 18:17:53 +0000. up 6.01 seconds
2011-09-15 18:17:53,855 - __init__.py[DEBUG]: searching for data source in ['DataSourceNoCloudNet', 'DataSourceOVFNet', 'DataSourceEc2']
2011-09-15 18:17:53,989 - DataSourceEc2.py[DEBUG]: removed the following from metadata urls: ['http://instance-data:8773']
2011-09-15 18:17:53,989 - DataSourceEc2.py[DEBUG]: Searching the following metadata urls: ['http://169.254.169.254']
2011-09-15 18:17:54,350 - DataSourceEc2.py[DEBUG]: Using metadata source: 'http://169.254.169.254'
2011-09-15 18:18:00,027 - __init__.py[DEBUG]: found data source DataSourceEc2
2011-09-15 18:18:00,050 - cloud-init[DEBUG]: found data source: DataSourceEc2

Note the 5-second lapse between 'Using' and 'found'. 'Using' only indicates that the port was open; 'found' means data was actually read from it.

Revision history for this message
Vish Ishaya (vishvananda) wrote : Re: [Bug 851159] [NEW] ec2 metadata service is very slow

Not only does it do heavy queries, it rebuilds the entire metadata set for every single request made to the metadata server. Horribly inefficient! I want to break the metadata code out into its own binary as well. I will be filing a blueprint to that effect soon. We definitely need some efficiency cleanup for the metadata server as well.


Revision history for this message
Thierry Carrez (ttx) wrote :

Note that at some point Soren proposed that the metadata service be handled at the compute-node level rather than hitting the API servers.

Todd Willey (xtoddx)
Changed in nova:
assignee: nobody → Todd Willey (xtoddx)
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Confirming this.

Timing a simple wget of metadata in EC2 and OpenStack:
  ec2: 0m0.009s
  openstack: 0m1.089s

Changed in nova:
status: New → Confirmed
Revision history for this message
Jesse Andrews (anotherjesse) wrote :

ttx, even if it is handled at the compute layer (for instance, cloudbuilder's deploys put a nova-api server on each box and route metadata requests to the local API server), it is still wildly inefficient.

It creates a dictionary of all possible data and then returns a subset of that hash.

The code is very naive.
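
For illustration, a rough Python sketch of the pattern being described; the names are hypothetical, not the actual nova code:

    def _db_lookup(instance_id, field):
        """Stand-in for a per-field database or queue lookup."""
        return "value-of-%s-for-%s" % (field, instance_id)

    def get_metadata_eager(instance_id, path):
        # Anti-pattern: every key is computed on every request,
        # even though only one value is returned.
        data = {
            "ami-id": _db_lookup(instance_id, "ami-id"),
            "hostname": _db_lookup(instance_id, "hostname"),
            "public-keys": _db_lookup(instance_id, "public-keys"),
            # ... dozens more keys in the real handler ...
        }
        return data[path]

    def get_metadata_lazy(instance_id, path):
        # Only the requested key triggers a lookup.
        return _db_lookup(instance_id, path)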

Revision history for this message
Scott Moser (smoser) wrote : Re: [Bug 851159] Re: ec2 metadata service is very slow

On Thu, 29 Sep 2011, anotherjesse wrote:

> It creates a dictionary of all possible data and then returns a
> subset of that hash.

And it does this for every request, i.e., no caching.
cloud-init and others (boto.utils.get_instance_metadata()) crawl this
service, making maybe a couple dozen requests in rapid succession.
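
For illustration, a simplified sketch of the crawl pattern such clients use (not the actual boto or cloud-init code): each directory listing and each leaf value is a separate HTTP GET, so a single boot easily turns into a couple dozen requests.

    from urllib.request import urlopen

    def crawl(base="http://169.254.169.254/latest/meta-data/"):
        tree = {}
        listing = urlopen(base, timeout=5).read().decode().splitlines()
        for name in listing:
            if name.endswith("/"):
                # Sub-directory: recurse, one more GET per level.
                tree[name.rstrip("/")] = crawl(base + name)
            else:
                # Leaf: one more GET per key.
                tree[name] = urlopen(base + name, timeout=5).read().decode()
        return tree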

Revision history for this message
Thierry Carrez (ttx) wrote :

Ugh. Sounds like there is room for improvement :)

Revision history for this message
Todd Willey (xtoddx) wrote :

Fix is currently up for review: https://review.openstack.org/#change,809

Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → High
status: Confirmed → In Progress
Chuck Short (zulcss)
tags: added: ec2
Revision history for this message
Mark McLoughlin (markmc) wrote :

The bug has been idle for a few months and the patch needs work/rebasing. I think it's safe to say it's not In Progress any more.

@todd: feel free to pick it up again, of course - just hoping that de-assigning it might cause someone else to pick it up.

Changed in nova:
assignee: Todd Willey (xtoddx) → nobody
status: In Progress → Confirmed
Revision history for this message
Lionel Porcheron (lionel.porcheron) wrote :

This week we had an opportunity to see this bug on our internal cloud:
We rebooted the whole cloud for maintenance purposes (200 VMs hosted on 10 servers with one cloud controller).
While the servers were booting, the API node was so overloaded by cloud-init waiting for metadata that it was mostly unusable. We were unable to make more than one API call every 2 to 5 minutes for several hours!

We also tried some tuning of the SQLAlchemy pool size without much success. The only solution was to drop the DNAT rules redirecting the metadata calls to make things usable again.

For the record, the cloud controller is a Xeon Quad Core 2.5 GHz with 8 GB of RAM. No slow queries were seen on MySQL, so it was pure Python code that was taking the time (and CPU; nova-api was constantly at 100% CPU).

Revision history for this message
Scott Moser (smoser) wrote :

On Sat, 11 Feb 2012, Lionel Porcheron wrote:

> While the servers were booting, the API node was so overloaded by
> cloud-init waiting for metadata that it was mostly unusable. [...]

Were you using something recent? This is known to be terribly broken in
diablo. However, it should be at least better in essex.

Additionally, I've tested the HP public cloud, and their metadata service
returns from "python -c 'import boto.utils;
boto.utils.get_instance_metadata()'" in under 0.2 seconds.

I think what they've done is put some caching in the middle, as the
first request I did just now took maybe 4 seconds, but the second and
subsequent requests took under 0.2 seconds.

Revision history for this message
Lionel Porcheron (lionel.porcheron) wrote :

Scott,

Yes, I confirm it is diablo. I have not had the opportunity to test on essex yet, but good to know it should at least be better.

James Troup (elmo)
tags: added: canonistack
Revision history for this message
Thierry Carrez (ttx) wrote :

Based on the last comments, please confirm whether it's still unbelievably slow ("buggy") in essex, or whether it now falls into the "could be faster" wishlist category.

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Jesse Andrews (anotherjesse) wrote :

It is still incredibly slow. Is this important enough to fix for essex?

The code is probably incredibly easy to fix - it is a matter of breaking out https://github.com/openstack/nova/blob/master/nova/api/metadata/handler.py#L142 so we only load what is needed.

Changed in nova:
status: Incomplete → Confirmed
milestone: none → essex-rc1
assignee: nobody → anotherjesse (anotherjesse)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Thierry Carrez (ttx) wrote :

@Jesse: any progress on this?

Revision history for this message
Jesse Andrews (anotherjesse) wrote :

The merge prop is here: https://review.openstack.org/#change,5042

It is ready for merge except for where we put the in-memory cache - moving it to nova/common per markmc's request.

Changed in nova:
assignee: Jesse Andrews (anotherjesse) → Vish Ishaya (vishvananda)
Changed in nova:
assignee: Vish Ishaya (vishvananda) → Jesse Andrews (anotherjesse)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/5042
Committed: http://github.com/openstack/nova/commit/1bcf5f5431d3c9620596f5329d7654872235c7ee
Submitter: Jenkins
Branch: master

commit 1bcf5f5431d3c9620596f5329d7654872235c7ee
Author: Jesse Andrews <email address hidden>
Date: Wed Mar 7 13:05:28 2012 -0800

    improve speed of metadata

     * don't load every possible answer, only do what is needed
     * cache instance data for a given address for 15 seconds
       using either memcache or fake memcache (in-memory).
       This means only a single queue/db lookup for multiple calls
       to metadata service
     * add cache expiry to fake memcache (don't grow forever)
       and move it to nova.common.memorycache

    Addresses Bug #851159

    Change-Id: Icf794156e055b18915b8b5be9ba2ab97d2338bbe
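
A minimal sketch of the caching idea described in the commit message above; the class and function names here are illustrative, not the actual nova.common.memorycache code:

    import time

    class FakeMemcache(object):
        """Tiny in-memory stand-in for memcache, with per-key expiry."""

        def __init__(self):
            self._store = {}  # key -> (expires_at, value)

        def get(self, key):
            entry = self._store.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if expires_at < time.time():
                del self._store[key]  # expired entries don't pile up
                return None
            return value

        def set(self, key, value, ttl=15):
            self._store[key] = (time.time() + ttl, value)

    _cache = FakeMemcache()

    def instance_data_for(address, load_from_db):
        # load_from_db is a hypothetical loader doing the queue/db lookup.
        # Multiple metadata requests from the same address within the TTL
        # window reuse the cached data, so only one lookup is made.
        data = _cache.get(address)
        if data is None:
            data = load_from_db(address)
            _cache.set(address, data, ttl=15)
        return data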

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-rc1 → 2012.1