Juju agents do not handle reboots

Bug #863526 reported by Clint Byrum
This bug affects 5 people
Affects: pyjuju
Status: Fix Released
Importance: High
Assigned to: William Reade
Milestone: florence

Bug Description

Currently, if a juju-managed machine reboots, it's pretty much lost to juju without heavy manual intervention. Even then, it may have missed state changes that are critical to its configuration.

The agents will need to be run via upstart jobs, and the state topology will need to be designed so that changes are acknowledged by the agents in some manner.
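
For illustration, a minimal sketch of what such an upstart job might look like; the job name, paths, and exec line here are assumptions, not the job pyjuju actually ships:

    # /etc/init/juju-machine-agent.conf -- illustrative sketch only
    description "juju machine agent (sketch)"

    start on runlevel [2345]
    stop on runlevel [!2345]

    # Respawn the agent if it dies, so a crash or reboot no longer
    # orphans the machine.
    respawn

    # The module path is a guess based on the log output later in this
    # report (juju.agents.machine); the real job may invoke twistd.
    exec /usr/bin/python -m juju.agents.machine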

Tags: production
Changed in juju:
status: New → Triaged
importance: Undecided → High
Changed in juju:
milestone: none → florence
Kapil Thangavelu (hazmat) wrote:

The recent landing of upstartification and restart support should help with this, but it needs verification.

Clint Byrum (clint-fewbar) wrote:

I did some smoke testing of this, and it seems like there is still work to do.

Doing a 'restart juju-machine-agent' on a machine, I got this:

2012-02-21 23:35:40,411:1524(0xb7108b70):ZOO_INFO@check_events@1632: session establishment complete on server [10.207.38.151:2181], sessionId=0x135a23c31a5000a, negotiated timeout=10000
2012-02-21 23:35:40,424: juju.agents.machine@DEBUG: Units changed old:set([]) new:set(['mysql/0'])
2012-02-21 23:35:40,424: juju.agents.machine@DEBUG: Starting service unit: mysql/0 ...
2012-02-21 23:35:40,425: juju.agents.machine@INFO: Machine agent started id:1 deploy:<class 'juju.machine.unit.UnitMachineDeployment'> provider:'ec2'
2012-02-21 23:35:40,533: juju.agents.machine@DEBUG: Downloading charm local:oneiric/mysql-1329352875 to /var/lib/juju/charms
2012-02-21 23:35:41,061: juju.agents.machine@DEBUG: Starting service unit mysql/0
2012-02-21 23:35:42,247: juju.agents.machine@INFO: Started service unit mysql/0
2012-02-21 23:37:44,036:1524(0xb7108b70):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 26ms
2012-02-21 23:37:47,385:1524(0xb7108b70):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 15ms
2012-02-21 23:37:50,739:1524(0xb7108b70):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 19ms
2012-02-21 23:58:38,875:1524(0xb7108b70):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 19ms
2012-02-22 00:00:43,114:1524(0xb736d6c0):ZOO_INFO@zookeeper_close@2304: Closing zookeeper sessionId=0x135a23c31a5000a to [10.207.38.151:2181]

2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.3
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@662: Client environment:host.name=domU-12-31-39-09-F0-F6
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@670: Client environment:os.arch=3.0.0-14-virtual
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@671: Client environment:os.version=#23-Ubuntu SMP Mon Nov 21 23:40:55 UTC 2011
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@679: Client environment:user.name=(null)
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@687: Client environment:user.home=/root
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@log_env@699: Client environment:user.dir=/
2012-02-22 00:00:43,901:3654(0xb74f76c0):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=domU-12-31-39-15-25-69.compute-1.internal:2181 sessionTimeout=10000 watcher=0xb730e5c0 sessionId=0 sessionPasswd=<null> context=0x8d62028 flags=0
2012-02-22 00:00:43,903:3654(0xb7294b70):ZOO_INFO@check_events@1585: initiated connection to server [10.207.38.151:2181]
2012-02-22 00:00:43,907:3654(0xb7294b70):ZOO_INFO@check_events@1632: session establishment complete on server [10.207.38.151:2181], sessionId=0x135a23c31a5000d, negotiated timeout=10000
2012-02-22 00:00:43,918: juju.agents.machine@DEBUG: Units changed old:set([]) new:set(['mysql/0'])
2012-02-22 00:00:43,918: juju.agents.machine@DEBUG: Starting service unit: mysql/0 ...
2012-02-22 00:00:43,918: juju.agents.machine@INFO: Machine agent started id:1 deploy...


William Reade (fwereade) wrote:

charm download:

Agreed, it's not necessary. I believe it's not actively harmful, though... have I missed something?

config-changed:

As I recall, this has always been run automatically when a unit starts up; again, it may be redundant, but I don't *think* it'll be harmful.

missing machine:

Hmm, I guess we should filter slightly differently in the ec2 provider -- we should include stopped machines but not terminated ones (see the sketch at the end of this comment). (We should also not filter whatever the moving-to-stopped state is ("shutting-down"?)... and if that means we still see machines in the process of termination, so be it.)

IMO these are 3 distinct bugs, and I don't personally see the first two as very high priority; I'm not entirely clear on the ramifications of a config-changed change, but the others should be pretty trivial to fix. Opinions?
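
For concreteness, a hedged sketch of the "missing machine" filtering described above. boto is used purely for illustration -- pyjuju's provider used a different EC2 client -- and the function name is mine:

    import boto.ec2

    # EC2's transitional states are 'stopping' (on the way to stopped)
    # and 'shutting-down' (on the way to terminated); only fully
    # terminated instances are treated as gone, per the suggestion above.
    GONE_STATES = frozenset(['terminated'])

    def live_instances(region='us-east-1'):
        conn = boto.ec2.connect_to_region(region)
        instances = [i for r in conn.get_all_instances() for i in r.instances]
        # A stopped (e.g. rebooting) machine still counts as live, and
        # 'shutting-down' instances stay visible until they terminate.
        return [i for i in instances if i.state not in GONE_STATES]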

Clint Byrum (clint-fewbar) wrote: Re: [Bug 863526] Re: Juju agents do not handle reboots

Excerpts from William Reade's message of Wed Feb 22 08:32:20 UTC 2012:
> charm download:
>
> Agreed not necessary. I believe it's not actively harmful, though...
> have I missed something?
>

Understood, makes sense, and no you have not missed anything.

> config-changed:
>
> As I recall, this has always been run automatically when a unit starts
> up; again, it may be redundant, but I don't *think* it'll be harmful.
>

Actually, I kind of like this... re-asserting as often as makes sense is
good, and after a reboot/restart it makes a lot of sense.

> missing machine:
>
> Hmm, I guess we should filter slightly differently in the ec2 provider
> -- we should include stopped machines but not terminated ones. (we
> should also not filter whatever the moving-to-stopped state is
> ("shutting-down"?)... and if that means we still see machines in the
> process of termination, so be it).
>
> IMO these are 3 distinct bugs, and I don't personally see the first two
> as very high priority; I'm not entirely clear on the ramifications of a
> config-changed change, but the others should be pretty trivial to fix.
> Opinions?
>

The first two aren't even bugs, IMO. Let's just leave them be. For the other
one, it sounds like it's just a cosmetic bug. You sound like you understand
it better than I do, so please do flesh out the details in a bug report.

Clint Byrum (clint-fewbar) wrote:

I've tested rebooting individual machines, and it works fine. I think this is *fixed*.

There is a new issue to replace it, but it only affects EC2. The bootstrap node *might* change addresses if it is rebooted on EC2, and when it does, none of the agents will be able to find it because they hard-code its IP.
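
For illustration, one possible mitigation sketch under stated assumptions (boto again for illustration only; the function name and flow are mine, not pyjuju's): have agents ask EC2 where the bootstrap instance is now, rather than trusting an address recorded at deploy time.

    import boto.ec2

    def current_zk_address(instance_id, region='us-east-1', port=2181):
        # Look up the bootstrap instance's current private DNS name,
        # instead of reusing an IP baked in when the agent was deployed.
        conn = boto.ec2.connect_to_region(region)
        reservation = conn.get_all_instances(instance_ids=[instance_id])[0]
        instance = reservation.instances[0]
        return "%s:%d" % (instance.private_dns_name, port)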

Changed in juju:
assignee: nobody → William Reade (fwereade)
status: Triaged → Fix Released