ec2 metadata service is very slow

Bug #851159 reported by Scott Moser
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Jesse Andrews
Milestone: 2012.1

Bug Description

$ time python -c 'import boto.utils; boto.utils.get_instance_metadata()'

real 0m5.740s
user 0m0.190s
sys 0m0.110s

I've not done a lot of digging on this, but I suspect that
nova/api/ec2/cloud.py:get_metadata is doing some heavy queries.

cloud-init makes at least 2 separate requests to the metadata service on boot, so we see something like this in cloud-init.log:

2011-09-15 18:17:53,844 - cloud-init[INFO]: cloud-init start running: Thu, 15 Sep 2011 18:17:53 +0000. up 6.01 seconds
2011-09-15 18:17:53,855 - __init__.py[DEBUG]: searching for data source in ['DataSourceNoCloudNet', 'DataSourceOVFNet', 'DataSourceEc2']
2011-09-15 18:17:53,989 - DataSourceEc2.py[DEBUG]: removed the following from metadata urls: ['http://instance-data:8773']
2011-09-15 18:17:53,989 - DataSourceEc2.py[DEBUG]: Searching the following metadata urls: ['http://169.254.169.254']
2011-09-15 18:17:54,350 - DataSourceEc2.py[DEBUG]: Using metadata source: 'http://169.254.169.254'
2011-09-15 18:18:00,027 - __init__.py[DEBUG]: found data source DataSourceEc2
2011-09-15 18:18:00,050 - cloud-init[DEBUG]: found data source: DataSourceEc2

Note the 5-second lapse between 'Using' and 'found'. 'Using' only indicates that the port was open; 'found' means data was actually read from it.

Revision history for this message
Vish Ishaya (vishvananda) wrote : Re: [Bug 851159] [NEW] ec2 metadata service is very slow

Not only does it do heavy queries, it rebuilds the entire metadata set for every single request made to the metadata server. Horribly inefficient! I want to break the metadata code out into its own binary as well. I will be filing a blueprint to that effect soon. We definitely need some efficiency cleanup for the metadata server as well.


Revision history for this message
Thierry Carrez (ttx) wrote :

Note that at some point Soren proposed that the metadata service be handled at the compute-node level rather than hitting the API servers.

Todd Willey (xtoddx)
Changed in nova:
assignee: nobody → Todd Willey (xtoddx)
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Confirming this.

Timing a simple wget of metadata in EC2 and OpenStack:
  ec2: 0m0.009s
  openstack: 0m1.089s

Changed in nova:
status: New → Confirmed
Revision history for this message
Jesse Andrews (anotherjesse) wrote :

ttx, even if it is handled at the compute layer (for instance, cloudbuilder's deploys put a nova-api server on each box and route metadata requests to the local API server), it is still wildly inefficient.

It creates a dictionary of all possible data and then returns a subset of that hash.

The code is very naive.
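
For illustration, a rough Python sketch of the pattern being described; the names are hypothetical, not the actual nova code:

    def _db_lookup(instance_id, field):
        """Stand-in for a per-field database or queue lookup."""
        return "value-of-%s-for-%s" % (field, instance_id)

    def get_metadata_eager(instance_id, path):
        # Anti-pattern: every key is computed on every request,
        # even though only one value is returned.
        data = {
            "ami-id": _db_lookup(instance_id, "ami-id"),
            "hostname": _db_lookup(instance_id, "hostname"),
            "public-keys": _db_lookup(instance_id, "public-keys"),
            # ... dozens more keys in the real handler ...
        }
        return data[path]

    def get_metadata_lazy(instance_id, path):
        # Only the requested key triggers a lookup.
        return _db_lookup(instance_id, path)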

Revision history for this message
Scott Moser (smoser) wrote : Re: [Bug 851159] Re: ec2 metadata service is very slow

On Thu, 29 Sep 2011, anotherjesse wrote:

> It creates a dictionary of all possible data and then returns a
> subset of that hash.

And it does this for every request, i.e., no caching.
cloud-init and others (boto.utils.get_instance_metadata()) crawl this
service, making maybe a couple dozen requests in rapid succession.
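
For illustration, a simplified sketch of the crawl pattern such clients use (not the actual boto or cloud-init code): each directory listing and each leaf value is a separate HTTP GET, so a single boot easily turns into a couple dozen requests.

    from urllib.request import urlopen

    def crawl(base="http://169.254.169.254/latest/meta-data/"):
        tree = {}
        listing = urlopen(base, timeout=5).read().decode().splitlines()
        for name in listing:
            if name.endswith("/"):
                # Sub-directory: recurse, one more GET per level.
                tree[name.rstrip("/")] = crawl(base + name)
            else:
                # Leaf: one more GET per key.
                tree[name] = urlopen(base + name, timeout=5).read().decode()
        return tree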

Revision history for this message
Thierry Carrez (ttx) wrote :

Ugh. Sounds like there is room for improvement :)

Revision history for this message
Todd Willey (xtoddx) wrote :

Fix is currently up for review: https://review.openstack.org/#change,809

Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → High
status: Confirmed → In Progress
Chuck Short (zulcss)
tags: added: ec2
Revision history for this message
Mark McLoughlin (markmc) wrote :

The bug has been idle for a few months and the patch needs work/rebasing. I think it's safe to say it's not In Progress any more.

@todd: feel free to pick it up again, of course - just hoping that de-assigning it might cause someone else to pick it up.

Changed in nova:
assignee: Todd Willey (xtoddx) → nobody
status: In Progress → Confirmed
Revision history for this message
Lionel Porcheron (lionel.porcheron) wrote :

This week we had an opportunity to see this bug on our internal cloud:
We rebooted the whole cloud for maintenance purposes (200 VMs hosted on 10 servers with one cloud controller).
While the servers were booting, the API node was so overloaded by cloud-init waiting for metadata that it was mostly unusable. We were unable to make more than one API call every 2 to 5 minutes for several hours!

We also tried some tuning of the SQLAlchemy pool size without much success. The only solution was to drop the DNAT rules redirecting the metadata calls to make things usable again.

For the record, the cloud controller is a Xeon Quad Core 2.5 GHz with 8 GB of RAM. No slow queries were seen on MySQL, so it was pure Python code that was taking the time (and CPU; nova-api was constantly at 100% CPU).

Revision history for this message
Scott Moser (smoser) wrote :

On Sat, 11 Feb 2012, Lionel Porcheron wrote:

> While the servers were booting, the API node was so overloaded by
> cloud-init waiting for metadata that it was mostly unusable. [...]

Were you using something recent? This is known to be terribly broken in
diablo. However, it should be at least better in essex.

Additionally, I've tested the HP public cloud, and their metadata service
returns from "python -c 'import boto.utils;
boto.utils.get_instance_metadata()'" in under 0.2 seconds.

I think what they've done is put some caching in the middle, as the
first request I did just now took maybe 4 seconds, but the second and
subsequent requests took under 0.2 seconds.

Revision history for this message
Lionel Porcheron (lionel.porcheron) wrote :

Scott,

Yes, I confirm it is diablo. I have not had the opportunity to test on essex yet, but good to know it should at least be better.

James Troup (elmo)
tags: added: canonistack
Revision history for this message
Thierry Carrez (ttx) wrote :

Based on the last comments, please confirm whether it's still unbelievably slow ("buggy") in essex, or whether it now falls into the "could be faster" wishlist category.

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Jesse Andrews (anotherjesse) wrote :

It is still incredibly slow. Is this important enough to fix for essex?

The code is probably incredibly easy to fix - it is a matter of breaking out https://github.com/openstack/nova/blob/master/nova/api/metadata/handler.py#L142 so we only load what is needed.

Changed in nova:
status: Incomplete → Confirmed
milestone: none → essex-rc1
assignee: nobody → anotherjesse (anotherjesse)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Thierry Carrez (ttx) wrote :

@Jesse: any progress on this?

Revision history for this message
Jesse Andrews (anotherjesse) wrote :

The merge prop is here: https://review.openstack.org/#change,5042

It is ready for merge except for where we put the in-memory cache - moving it to nova/common per markmc's request.

Changed in nova:
assignee: Jesse Andrews (anotherjesse) → Vish Ishaya (vishvananda)
Changed in nova:
assignee: Vish Ishaya (vishvananda) → Jesse Andrews (anotherjesse)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/5042
Committed: http://github.com/openstack/nova/commit/1bcf5f5431d3c9620596f5329d7654872235c7ee
Submitter: Jenkins
Branch: master

commit 1bcf5f5431d3c9620596f5329d7654872235c7ee
Author: Jesse Andrews <email address hidden>
Date: Wed Mar 7 13:05:28 2012 -0800

    improve speed of metadata

     * don't load every possible answer, only do what is needed
     * cache instance data for a given address for 15 seconds
       using either memcache or fake memcache (in-memory).
       This means only a single queue/db lookup for multiple calls
       to metadata service
     * add cache expiry to fake memcache (don't grow forever)
       and move it to nova.common.memorycache

    Addresses Bug #851159

    Change-Id: Icf794156e055b18915b8b5be9ba2ab97d2338bbe
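
A minimal sketch of the caching idea described in the commit message above; the class and function names here are illustrative, not the actual nova.common.memorycache code:

    import time

    class FakeMemcache(object):
        """Tiny in-memory stand-in for memcache, with per-key expiry."""

        def __init__(self):
            self._store = {}  # key -> (expires_at, value)

        def get(self, key):
            entry = self._store.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if expires_at < time.time():
                del self._store[key]  # expired entries don't pile up
                return None
            return value

        def set(self, key, value, ttl=15):
            self._store[key] = (time.time() + ttl, value)

    _cache = FakeMemcache()

    def instance_data_for(address, load_from_db):
        # load_from_db is a hypothetical loader doing the queue/db lookup.
        # Multiple metadata requests from the same address within the TTL
        # window reuse the cached data, so only one lookup is made.
        data = _cache.get(address)
        if data is None:
            data = load_from_db(address)
            _cache.set(address, data, ttl=15)
        return data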

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-rc1 → 2012.1