juju 2.1 beta5 can reset networking while hooks are running

Bug #1662586 reported by Francis Ginther
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Autopilot Log Analyser
Fix Committed
High
Francis Ginther
Canonical Juju
Fix Released
Critical
John A Meinel
2.1
Fix Released
Critical
John A Meinel
Landscape Server
Fix Released
Undecided
Francis Ginther

Bug Description

This was found while using juju 2.1 beta5 from ppa:juju/devel and testing landscape autopilot in CI. These logs are from https://ci.lscape.net/job/landscape-system-tests/5209.

Landscape autopilot deploys some openstack services in containers, while some run on the host machine itself. We are seeing install hook failures on the host machine due to networking being disabled during hook execution:

[landscape-0-inner-logs/ceph-osd-1/var/log/juju/unit-ceph-osd-1.log]
2017-02-07 10:43:26 INFO install dpkg-query: package 'python-netaddr' is not installed and no information is available
2017-02-07 10:43:26 INFO install Use dpkg --info (= dpkg-deb --info) to examine archive files,
2017-02-07 10:43:26 INFO install and dpkg --contents (= dpkg-deb --contents) to list their contents.
2017-02-07 10:43:27 INFO install Reading package lists...
2017-02-07 10:43:27 INFO install Building dependency tree...
2017-02-07 10:43:27 INFO install Reading state information...
2017-02-07 10:43:27 INFO install The following additional packages will be installed:
2017-02-07 10:43:27 INFO install ieee-data
2017-02-07 10:43:27 INFO install Suggested packages:
2017-02-07 10:43:27 INFO install ipython python-netaddr-docs
2017-02-07 10:43:27 INFO install The following NEW packages will be installed:
2017-02-07 10:43:27 INFO install ieee-data python-netaddr
2017-02-07 10:43:27 INFO install 0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
2017-02-07 10:43:27 INFO install Need to get 1004 kB of archives.
2017-02-07 10:43:27 INFO install After this operation, 5239 kB of additional disk space will be used.
2017-02-07 10:43:27 INFO install Err:1 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 ieee-data all 20150531.1
2017-02-07 10:43:27 INFO install Cannot initiate the connection to 10.96.0.10:8000 (10.96.0.10). - connect (101: Network is unreachable)
2017-02-07 10:43:27 INFO install Err:2 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 python-netaddr all 0.7.18-1
2017-02-07 10:43:27 INFO install Cannot initiate the connection to 10.96.0.10:8000 (10.96.0.10). - connect (101: Network is unreachable)
2017-02-07 10:43:27 INFO install E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/i/ieee-data/ieee-data_20150531.1_all.deb Cannot initiate the connection to 10.96.0.10:8000 (10.96.0.10). - connect (101: Network is unreachable)
2017-02-07 10:43:27 INFO install
2017-02-07 10:43:27 INFO install E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python-netaddr/python-netaddr_0.7.18-1_all.deb Cannot initiate the connection to 10.96.0.10:8000 (10.96.0.10). - connect (101: Network is unreachable)
2017-02-07 10:43:27 INFO install
2017-02-07 10:43:27 INFO install E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
2017-02-07 10:43:27 INFO install dpkg-query: package 'python-netifaces' is not installed and no information is available
2017-02-07 10:43:27 INFO install Use dpkg --info (= dpkg-deb --info) to examine archive files,
2017-02-07 10:43:27 INFO install and dpkg --contents (= dpkg-deb --contents) to list their contents.
2017-02-07 10:43:27 INFO install Reading package lists...
2017-02-07 10:43:27 INFO install Building dependency tree...
2017-02-07 10:43:27 INFO install Reading state information...
2017-02-07 10:43:27 INFO install The following NEW packages will be installed:
2017-02-07 10:43:27 INFO install python-netifaces
2017-02-07 10:43:27 INFO install 0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
2017-02-07 10:43:27 INFO install Need to get 16.7 kB of archives.
2017-02-07 10:43:27 INFO install After this operation, 52.2 kB of additional disk space will be used.
2017-02-07 10:43:27 INFO install Err:1 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 python-netifaces amd64 0.10.4-0.1build2
2017-02-07 10:43:27 INFO install Cannot initiate the connection to 10.96.0.10:8000 (10.96.0.10). - connect (101: Network is unreachable)
2017-02-07 10:43:27 INFO install E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/n/netifaces/python-netifaces_0.10.4-0.1build2_amd64.deb Cannot initiate the connection to 10.96.0.10:8000 (10.96.0.10). - connect (101: Network is unreachable)
2017-02-07 10:43:27 INFO install
2017-02-07 10:43:27 INFO install E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

These timestamps match with messages in syslog indicating that eth0 is being set to promiscuous mode:

[landscape-0-inner-logs/ceph-osd-1/var/log/syslog]
Feb 7 10:43:21 elkhart systemd[1]: Started LXD - main daemon.
Feb 7 10:43:22 elkhart ntpd[2576]: Listen normally on 10 lxdbr0 10.0.0.1:123
Feb 7 10:43:22 elkhart ntpd[2576]: Listen normally on 11 lxdbr0 [fe80::f8ac:cfff:fef8:2944%7]:123
Feb 7 10:43:22 elkhart ntpd[2576]: Deleting interface #8 lxdbr0, fe80::4428:2aff:fe93:8e5b%6#123, interface stats: received=0, sent=0, dropped=0, active_time=18 secs
Feb 7 10:43:22 elkhart ntpd[2576]: Deleting interface #9 lxdbr0, fe80::1%6#123, interface stats: received=0, sent=0, dropped=0, active_time=18 secs
Feb 7 10:43:22 elkhart ntpd[2576]: new interface(s) found: waking up resolver
Feb 7 10:43:24 elkhart systemd-timesyncd[594]: Timed out waiting for reply from 91.189.89.198:123 (ntp.ubuntu.com).
Feb 7 10:43:25 elkhart kernel: [ 68.806766] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Feb 7 10:43:25 elkhart systemd[1]: Reloading OpenBSD Secure Shell server.
Feb 7 10:43:25 elkhart systemd[1]: Reloaded OpenBSD Secure Shell server.
Feb 7 10:43:25 elkhart systemd-udevd[4988]: Could not generate persistent MAC address for br-eth0: No such file or directory
Feb 7 10:43:25 elkhart kernel: [ 68.891670] device eth0 entered promiscuous mode
Feb 7 10:43:25 elkhart kernel: [ 68.894098] IPv6: ADDRCONF(NETDEV_UP): br-eth0: link is not ready
Feb 7 10:43:25 elkhart systemd[1]: Started ifup for br-eth0.
Feb 7 10:43:25 elkhart sh[5069]: ifup: waiting for lock on /run/network/ifstate.br-eth0
Feb 7 10:43:25 elkhart systemd[1]: Found device /sys/subsystem/net/devices/br-eth0.
Feb 7 10:43:27 elkhart ntpd[2576]: Deleting interface #3 eth0, 10.96.60.114#123, interface stats: received=0, sent=5, dropped=0, active_time=42 secs
Feb 7 10:43:27 elkhart ntpd[2576]: Deleting interface #5 eth0, fe80::2e59:e5ff:fe3b:1120%2#123, interface stats: received=0, sent=0, dropped=0, active_time=42 secs
Feb 7 10:43:29 elkhart kernel: [ 72.746818] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Feb 7 10:43:29 elkhart kernel: [ 72.746968] br-eth0: port 1(eth0) entered forwarding state
Feb 7 10:43:29 elkhart kernel: [ 72.746988] br-eth0: port 1(eth0) entered forwarding state
Feb 7 10:43:29 elkhart kernel: [ 72.747149] IPv6: ADDRCONF(NETDEV_CHANGE): br-eth0: link becomes ready
Feb 7 10:43:29 elkhart systemd[1]: Reloading OpenBSD Secure Shell server.
Feb 7 10:43:29 elkhart systemd[1]: Reloaded OpenBSD Secure Shell server.
Feb 7 10:43:29 elkhart sh[5069]: ifup: interface br-eth0 already configured
Feb 7 10:43:30 elkhart ntpd[2576]: Listen normally on 12 br-eth0 10.96.60.114:123

In a brief hangout discussion with Aaron Bentley, it sounded like the hook execution and network changes should run serially and not cause the install hook to fail.

Revision history for this message
Francis Ginther (fginther) wrote :
Ryan Beisner (1chb1n)
tags: added: uosci
David Britton (dpb)
Changed in landscape:
milestone: none → 17.01
status: New → Confirmed
Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
John A Meinel (jameinel)
Changed in juju:
status: New → Triaged
assignee: nobody → John A Meinel (jameinel)
importance: Undecided → Critical
milestone: none → 2.1-rc1
Changed in autopilot-log-analyser:
assignee: nobody → Francis Ginther (fginther)
Changed in landscape:
assignee: nobody → Francis Ginther (fginther)
Changed in autopilot-log-analyser:
importance: Undecided → High
Changed in autopilot-log-analyser:
status: New → In Progress
Changed in juju:
milestone: 2.1-rc1 → 2.2.0-alpha1
Revision history for this message
John A Meinel (jameinel) wrote :
Changed in autopilot-log-analyser:
status: In Progress → Fix Committed
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
Revision history for this message
Francis Ginther (fginther) wrote :

With juju 2.1 rc1, it now appears that the network reset is serialized with the charm hooks.

Ian Booth (wallyworld)
Changed in juju:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Francis Ginther (fginther) wrote :

Fixed in juju 2.1 final.

Changed in landscape:
status: Confirmed → Fix Committed
Revision history for this message
Chad Smith (chad.smith) wrote :

Even though network-restarts are serialized w/ hooks, this bridge setup will still affect customers who attempt to deploy juju services to containers on the controller machine.

Landscape autopilot deployments using juju 2.1.1 have seen a disruption in the juju's mongodb due to this initial bridge configuration during container creation on the controller node. lp:1677353

When this first bridge for the first container is created on the controller node, mongodb connections get severed and any juju API calls can get lost in the restart of juju services.

We've file a bug against landscape related to this issue lp:1677353

Changed in landscape:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.