cloud-init network error when using MAAS/juju

Bug #1345433 reported by HeinMueck
This bug affects 5 people
Affects     Status        Importance  Assigned to           Milestone
cloud-init  Expired       Medium      Scott Moser
juju-core   Fix Released  Medium      Dimiter Naydenov

Bug Description

I am setting up a testing environment for Trusty, MAAS and juju, all on KVM guests. I provisioned 5 machines through MAAS and juju add-machine, and all went fine.

When the systems reboot - all 5 of them - the network configuration fails, and with it cloud-init. You get the "cloud-init-nonet waiting 120 seconds ..." message and then the subsequent "gave up waiting ..." message.

I added some debug output to cloud-init-nonet, hence the odd lines in boot.log. What is interesting is that after all the waiting you finally get a login prompt, and the network is up as it should be!

I added bridge_maxwait 0 to the br0 stanza in /etc/network/eth0.config, but that does not change anything. Upstart job logs do not show any errors.

I have tested the br0 configuration on a fresh 14.04 server installation with no cloud features and it works without any changes.

I thought it might be a firewall issue, so I removed ufw.conf as a test - the reboot behaved the same.

All packages used are the latest from the 14.04 repositories, freshly updated.

Related bugs:
 * bug 1031065: cloud-init-nonet runs 'start networking' explicitly
 * bug 800824: cloud-init-nonet times out in lxc
 * bug 925122: container's udevadm trigger --add affects the host
 * bug 643289: idmapd does not starts to work after system reboot

Revision history for this message
HeinMueck (cperz) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

There is a lot of information missing from your bug report.
I'm not sure how 'br0' comes into play at all here.

Unless you've modified something, the installed system will not expect or use a 'br0'. And if you *have* modified something, then you're going to need to explain what that is that you've modified.

The cloud-init-nonet upstart job blocks further boot until all 'auto' network adapters in /etc/network/interfaces (including those sourced from /etc/network/interfaces.d/*) are up. That is by design, and I suspect it is not broken.
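For reference, a minimal sketch of the kind of /etc/network/interfaces layout this refers to (illustrative only; the installed file on a given image may differ):

# only interfaces marked 'auto' here, or in files pulled in via 'source',
# are waited for by cloud-init-nonet
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

source /etc/network/interfaces.d/*.cfg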

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
HeinMueck (cperz) wrote :

I am confused, but maybe that's normal. Anyway, br0 gets created when you provision a bare-metal machine using juju. It's the bridge for the LXC containers you might want to address when deploying charms.

But yesterday I did not know how br0 was actually put in place. I could see the config files, but not where they came from.

I know better today: I found this in /var/lib/cloud/instances/node-fb671a3c-0a47-11e4-8074-5254000e3fd6/scripts/runcmd

------------[snip]---------------
ifdown eth0
cat > /etc/network/eth0.config << EOF
iface eth0 inet manual

auto br0
iface br0 inet dhcp
  bridge_ports eth0
EOF

sed -i "s/iface eth0 inet dhcp/source \/etc\/network\/eth0.config/" /etc/network/interfaces
ifup br0

------------[snap]---------------

This must be coming from some juju-juju.

1. When I activate only eth0 and boot, all is fine and DHCP works.
2. When I activate the br0 setup, cloud-init-nonet waits the full timeout and gives up, although I see the DHCP requests coming in on the server and I can see later that the interfaces are up and working.

It seems to me that for some reason the upstart events cloud-init-nonet needs in order to stop are lost or never fired.
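For anyone wanting to check that hypothesis while the boot is stalled, a few purely diagnostic commands along these lines (run from another console) should show whether the missing event really is the problem:

# is the nonet job still waiting, and did the bridge actually come up?
status cloud-init-nonet
ifquery --state br0
# manually emit the event the job is waiting for; if boot continues,
# the missing static-network-up event was indeed the blocker
initctl emit --no-wait static-network-up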

Revision history for this message
HeinMueck (cperz) wrote :

I think I found the reason.

The mechanism used to stop cloud-init-nonet only works if all your interfaces are physical and have a corresponding kernel object. Those interfaces are set up by the network-interface job, and static-network-up is emitted when they are all done.

As soon as you have a bridge in your system, setup of the virtual interfaces is handled by the networking job, so emitting static-network-up is postponed until that job has finished. cloud-init-nonet starts before the networking job and prevents it from being started - it waits the full time and then gives up.

System configuration continues as expected and the network is configured correctly - but cloud-init no longer cares, as /var/lib/cloud/data/no-net exists by then.

I am looking for a solution now, trying to make sure that cloud-init-nonet only runs after the networking job has been started.
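A rough sketch of that direction, using an upstart override (illustrative only, not the patch attached later in this bug; a real change would have to preserve whatever conditions the stock job already requires):

# /etc/init/cloud-init-nonet.override -- illustrative sketch only:
# do not begin the nonet wait until the 'networking' job, which brings up
# bridges and other virtual interfaces, has at least been started
start on started networking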

Revision history for this message
HeinMueck (cperz) wrote :

Just a question: why not use the static-network-up event to start the network-dependent parts of cloud-init, instead of stalling the complete boot process while waiting for it? I'm sure you have had this discussion before, but I could not find it.

I would guess that some actions of cloud-init rely on the no-net file to figure out whether they can run or not. Turn it into a has-network marker instead.

Stalling seems a bit against the whole concept of upstart.

Revision history for this message
HeinMueck (cperz) wrote :

Here is a patch that is now working for me. Edited some old copy-paste comments too.

It works for systems with physical interfaces and virtual interfaces defined in /etc/network/interfaces. I have looked into network-manager, but I don't see a way to make this work without actually extending all the network scripts to fit the current cloud-init architecture. But I may well be wrong here.

HeinMueck (cperz)
Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Scott Moser (smoser) wrote :

We block boot by design.
If we didn't block boot, then there would be no way to guarantee consumption of user-data would take place at a defined point during boot.

As it is done right now, you can be assured that modifying just about any service or config file during a boothook will have the same effect as doing so before you booted the instance at all.
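As an illustration of the guarantee meant here, user-data along these lines (the service name and path are made up for the example) is only safe because boot is blocked until cloud-init has consumed it:

#cloud-boothook
#!/bin/sh
# hypothetical example: because boot is blocked, this line lands in the
# config file before the (made-up) myservice job starts and reads it
echo "bind-address: 0.0.0.0" >> /etc/myservice/myservice.conf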

no longer affects: juju
Revision history for this message
Scott Moser (smoser) wrote :

If you're posting a patch, please do so in 'diff -u' format. The default output of diff is far less useful (not your fault; arguably 'diff -u' should be its default).
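For example, something along these lines produces a patch in the requested format (the filenames are just placeholders):

# unified diff between the original and the edited file
diff -u cloud-init-nonet.conf.orig cloud-init-nonet.conf > cloud-init-nonet.patch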

description: updated
Revision history for this message
HeinMueck (cperz) wrote :
Curtis Hovey (sinzui)
tags: added: network
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: cloud-installer landscape
David Britton (dpb)
tags: removed: cloud-installer landscape
Scott Moser (smoser)
Changed in cloud-init:
status: New → In Progress
assignee: nobody → Scott Moser (smoser)
importance: Undecided → High
Revision history for this message
Scott Moser (smoser) wrote :

Attaching a doc on how to attempt a reproduction.
I've not been successful yet, though. David reports that roughly 1 in 1000 instances hit this.

Revision history for this message
HeinMueck (cperz) wrote :

Scott,

are you sure you are commenting on the right bug? It looks like your attempt was meant for another problem.

Revision history for this message
Scott Moser (smoser) wrote :

Yes, this is the right bug. The reproduction shows how to do this and demonstrates the bug they're hitting outside of juju, using the tools that juju uses.

Revision history for this message
HeinMueck (cperz) wrote :

Then let me show you how I reproduced the bug I wrote this report for:

1. Add the following to /etc/network/interfaces on any Ubuntu system using cloud-init:

auto br0
iface br0 inet dhcp
  bridge_ports eth0

2. Reboot.

The failure rate is 100%. Maybe the referenced bugs made the original problem look bigger than it is. The patch tries to fix this one by also taking virtual networking into account.

Revision history for this message
Scott Moser (smoser) wrote :

I was wrong - wrong bug. Sorry for the confusion.

Changed in cloud-init:
status: In Progress → Confirmed
importance: High → Medium
Revision history for this message
Janghoon-Paul Sim (janghoon) wrote :

I have the same issue.
It only happens on Trusty deployed by juju with the MAAS provider.
On Precise, cloud-init brings up br0 correctly.

One more related bug:
https://bugs.launchpad.net/juju-core/+bug/1271144

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

This happens to me also. Let me clarify things a little bit. From my point of view:

When you provision a server with MAAS, it expects all networks to be physical. But in my case I have to deploy servers as OpenStack compute nodes, so I have to bridge all interfaces with Open vSwitch. This is my choice, and I did it this way.

ovs-vsctl show
8d08d8e4-49f2-4243-b1db-7641984a8530
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "phy-br0"
            Interface "phy-br0"
    Bridge br-ext
        Port br-ext
            Interface br-ext
                type: internal
    Bridge br-int
        fail_mode: secure
        Port "int-br0"
            Interface "int-br0"
        Port "qvoea759488-64"
            tag: 1
            Interface "qvoea759488-64"
        Port "qvod1b2e6dc-7f"
            tag: 1
            Interface "qvod1b2e6dc-7f"
        Port "eth1"
            Interface "eth1"
        Port br-int
            Interface br-int
                type: internal
        Port int-br-ex
            Interface int-br-ex
        Port "qvoc0571326-4c"
            tag: 1
            Interface "qvoc0571326-4c"
    ovs_version: "2.0.2"

In my case br-int communicates with the internal integration network, where MAAS is reachable for me.

br-int    Link encap:Ethernet  HWaddr 68:05:ca:1a:09:50
          inet addr:172.16.0.100  Bcast:172.16.31.255  Mask:255.255.0.0
          inet6 addr: fe80::e46e:23ff:fe3c:c279/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:577447 errors:0 dropped:0 overruns:0 frame:0
          TX packets:724745 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3114547329 (3.1 GB)  TX bytes:908737979 (908.7 MB)

MAAS is 172.16.0.40.

What I think happens is that Ubuntu brings up all interfaces except the Open vSwitch ones. Then it checks whether the network is up, but it is not, because the bridges and the rest of the configuration are not in place yet - the Open vSwitch configuration has not even started. So it waits for the network interfaces and finally fails.

After a while of waiting, it starts the Open vSwitch stuff and brings the network up, but cloud-init has already failed. In my case it is even worse, because I depend on Ceph, and Ceph is started after Open vSwitch, so it also fails and stays failed with no recovery. So I have to do it manually after the machine boots.

I think the solution is to bring up the Open vSwitch configuration together with the rest of the system. I read about it, but it seems that I cannot put the configuration in /etc/network/interfaces since that does not work, so I'm sticking with the ovsdb configuration.
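For reference, the ovsdb-backed way of defining such a bridge persists across reboots without any /etc/network/interfaces stanza; a minimal sketch using the bridge and port names from the output above:

# stored in the Open vSwitch database, so it is recreated by the
# openvswitch-switch service on every boot
ovs-vsctl --may-exist add-br br-int
ovs-vsctl --may-exist add-port br-int eth1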

What do you think? Can this be the problem?

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :
Revision history for this message
Alex Tomkins (tomkins) wrote :

I'm having annoying issues with this exact problem: every boot takes an additional 130 seconds because I have a bridge interface configured.

Comments #4 and #14 describe this perfectly. Add a bridge interface: because bridge interfaces aren't real physical devices, no net-device-added event is emitted for them - the bridge is configured by networking.conf instead, but by then it is too late.

Revision history for this message
HeinMueck (cperz) wrote :

@gadLinux - as Alex says, comments 4 and 14 have the reason, comment 10 the patch. It hits you as soon as you have more than plain ethX interfaces in your system.

What bugs me is not that it takes some more time to boot, but that cloud-init thinks there is no network and thus will not run - will not do its job. Umpf.

And of course, if you run into a problem like that during boot, you end up declaring the status of your machine unknown, because you never know which later-installed package also subsequently fails, maybe silently. You are likely to have more problems than just cloud-init not running. Pretty deadly for a serious production environment.

I am surprised that this bug does not get much attention.

Revision history for this message
Alex Tomkins (tomkins) wrote :

Old-style interface aliases will also cause the same problem, e.g. eth0:1 - another interface which isn't physical, so no event gets emitted for it.
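A minimal stanza that triggers the same wait (the address here is just an illustrative static example):

# eth0:1 is an alias, not a kernel device, so no net-device-added event
# is ever emitted for it and static-network-up stays blocked
auto eth0:1
iface eth0:1 inet static
    address 192.0.2.10
    netmask 255.255.255.0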

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Medium
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

With this change that landed on juju-core trunk (https://github.com/juju/juju/pull/1046) and was also backported to the 1.21 and 1.20 branches, the issues with LXC/KVM containers that got stuck in the "pending" state or failed to start due to incorrect networking configuration (bridge name, interface to use, etc.) should be fixed. The bridge name is now "juju-br0" for both LXC and KVM containers on MAAS.

Changed in juju-core:
status: Triaged → Fix Committed
assignee: nobody → Dimiter Naydenov (dimitern)
milestone: none → 1.20.12
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Larry Michel (lmic) wrote :

I think we are hitting an issue associated with this fix. Please see https://bugs.launchpad.net/juju-core/+bug/1395908.

The problem I am seeing is the juju-br0 bridge being configured with eth2:

auto lo
iface eth3 inet dhcp
iface eth4 inet dhcp
iface eth5 inet dhcp
auto eth0
iface eth0 inet dhcp
iface eth1 inet dhcp
iface eth2 inet manual
auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth2

The problem is that only eth0 is connected and eth2 is disconnected. How does it determine which interface to use for juju-br0?

Revision history for this message
Larry Michel (lmic) wrote :

This looks to be reproducible on servers that have two types of network card. It's looking like a regression.

Revision history for this message
Larry Michel (lmic) wrote :

On servers with one type of network card we have no issue. I have one server where the second set of cards is not configured as eth2..ethX, and juju-br0 still gets configured with eth2. I have added the details to bug 1395908.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

@Larry,

Can you please attach the lshw XML output stored in your MAAS for those machines with 2 types of cards (eth0, eth2)?

I suspect the machines originally had a single interface and were commissioned OK by MAAS, then their hardware changed (e.g. an extra NIC was added or the NIC ordering was rearranged) without recommissioning them. Since the juju-br0 fix, juju now tries to discover the primary NIC on the machine (the one to plug into juju-br0) using the lshw XML dump of the machine. There is a bug #1390404 I filed against MAAS that will improve this and allow us to discover what NICs are there via the MAAS API rather than with such workarounds.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

After analysing the issue I can confirm that bug #1395908 describes an edge case where most of the network interfaces on the nodes MAAS provisions for Juju are disabled. The juju-br0 fix does not take disabled="true" on <network /> elements into account when parsing the lshw XML dump gathered during commissioning.
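For illustration, a disabled NIC shows up in the lshw XML roughly like this (fragment paraphrased, not taken from the affected machines):

<node id="network:1" class="network" disabled="true">
  <description>Ethernet interface</description>
  <logicalname>eth2</logicalname>
  <serial>00:11:22:33:44:55</serial>
</node>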

There will be a fix for this issue with backports for 1.20.14 and 1.21.

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

Hi @cperz and @tomkins,

I must say that I spent my morning compiling and patching the cloud-init 0.7.5 revision with the patch from comment #10, and the system behaves the same.

It may be because I'm using OVS (Open vSwitch), but I see no improvement.

Every - I say EVERY - server I build to integrate with OpenStack has this problem. I'm evaluating dropping MAAS, since the cloud-init side seems to be the cause of all the headaches.

And indeed, it is incredible that nobody notices this; maybe nobody uses MAAS+OpenStack for deployment.

This bug is marked Importance: Medium, but as @cperz said, it leaves all my other daemons unconfigured or bound to the wrong devices because init did not do the job right. Yes, after a wait of more than 120 seconds it starts and brings up the interfaces, because all the Open vSwitch stuff runs after the timeout. But it leaves everything that required the network in a stale state that makes daemons unable to connect to the network.

I have to start everything by hand and solve problems manually.

And all because nobody cared about systems with more than one interface and some bridges. Unbelievable!

My cluster is small and I can deal with it, but what do others do?

Please increase the importance of this bug, since it's not solved.

Thank you a lot in advance.

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

In fact the servers take more than 120 seconds.

It can be around 5-10 minutes, depending on the number of bridges. It's really a mess.

Revision history for this message
james beedy (jamesbeedy) wrote :

I am affected by some variant of this bug when I provision nodes with multiple interfaces and eth0 is not used.
I am experiencing a slew - no wait, a plethora - of different bugs revolving around this, too many to accurately write up from my phone. I will give a complete write-up tonight or tomorrow. Bottom line: can we get this fixed ASAP?

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

Today I did a lot of testing. The situation is as follows.

If you have a bridge that is brought up by Open vSwitch, and you tell /etc/network/interfaces that this bridge should get an IP address via DHCP once it is created, then cloud-init will wait for it. Since Open vSwitch is one of the last things to be activated, boot will hang until cloud-init gives up.

Revision history for this message
Karan (karan-pugla) wrote :

This is how I made it work while using Open vSwitch bridges on OpenStack compute nodes.
cloud-init-nonet blocks all other services until it receives the 'static-network-up' event.
The 'static-network-up' event is emitted only when all the interfaces configured as 'auto' in /etc/network/interfaces are up.
The script responsible for emitting 'static-network-up' is '/etc/network/if-up.d/upstart'.
Since the openvswitch service is blocked while cloud-init-nonet is active, bridges created by openvswitch that are also marked 'auto' in /etc/network/interfaces will not come up, and the 'static-network-up' event will never be emitted.
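The check that hook performs is roughly along these lines (a paraphrased sketch, not a verbatim copy of the packaged script):

# /etc/network/if-up.d/upstart (paraphrased): after each interface comes up,
# emit static-network-up only once every 'auto' interface reports as up
for iface in $(ifquery --list --allow auto); do
    ifquery --state "$iface" >/dev/null || exit 0
done
initctl emit --no-wait static-network-up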

So the solution is to use the interfaces hotplug subsystem: instead of declaring the bridge as 'auto', declare it as 'allow-ovs':

BEFORE:

auto eth0
iface eth0 inet manual

auto br-ex
iface br-ex inet static
    pre-up ifup eth0
    gateway 10.10.0.6
    address 10.10.208.135/16
    mtu 1500
    post-down ifdown eth0

AFTER:

auto eth0
iface eth0 inet manual

allow-ovs br-ex
iface br-ex inet static
    pre-up ifup eth0
    gateway 10.10.0.6
    address 10.10.208.135/16
    mtu 1500
    post-down ifdown eth0

This will bring up all the devices on boot and statically assign the IPs. It works because the openvswitch init service specifically brings up interfaces that are declared with the 'ovs' allow class.
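A quick way to confirm the reclassification took effect, assuming the stock ifupdown tooling:

# br-ex should have moved from the 'auto' class to the 'ovs' class
ifquery --list --allow auto
ifquery --list --allow ovs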

Revision history for this message
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Confirmed → Expired