"juju models" hangs

Bug #1611093 reported by Jason Hobbs
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Christian Muirhead

Bug Description

I've been testing multiple models on the same controller for OIL, and my controller is in a state now where when I run "juju models" it hangs.

http://paste.ubuntu.com/22726515/

I've attached /var/log from controller machine 0.

This is with 2.0 beta 14.

Tags: eda oil oil-2.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Is there a work around for this? I'm setting up to test again, but if it repeats, this will be a blocker preventing OIL from adopting (and further testing) multi-model controllers, which is a high priority item for us.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

When `juju model` hangs, what happens if you cancel it (Ctrl-C) and run it again? Does it succeed the second time. Knowing this will help establish where the problem is.

It would also be helpful if you could provide the output of this (when it hangs):

JUJU_LOGGING_CONFIG="<root>=TRACE" juju model --debug

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1611093] Re: "juju models" hangs

It always hangs when it's in this state. Running it again results in it
hanging again.

On Mon, Aug 8, 2016 at 7:03 PM, Menno Smits <email address hidden>
wrote:

> When `juju model` hangs, what happens if you cancel it (Ctrl-C) and run
> it again? Does it succeed the second time. Knowing this will help
> establish where the problem is.
>
> It would also be helpful if you could provide the output of this (when
> it hangs):
>
> JUJU_LOGGING_CONFIG="<root>=TRACE" juju model --debug
>
> ** Changed in: juju-core
> Status: Triaged => Incomplete
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1611093
>
> Title:
> "juju models" hangs
>
> Status in juju-core:
> Incomplete
>
> Bug description:
> I've been testing multiple models on the same controller for OIL, and
> my controller is in a state now where when I run "juju models" it
> hangs.
>
> http://paste.ubuntu.com/22726515/
>
> I've attached /var/log from controller machine 0.
>
> This is with 2.0 beta 14.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju-core/+bug/1611093/+subscriptions
>

Revision history for this message
Larry Michel (lmic) wrote :

@Anastasia, we have identified this bug as being critical for OIL.

Changed in juju-core:
status: Incomplete → New
Revision history for this message
Larry Michel (lmic) wrote :

Moving this back to incomplete since we're waiting for logs.

Changed in juju-core:
status: New → Incomplete
Changed in juju-core:
milestone: 2.0.0 → 2.0-beta16
Changed in juju-core:
milestone: 2.0-beta16 → 2.0-beta17
Revision history for this message
John A Meinel (jameinel) wrote :

Is it only 'juju models' that hangs? Do other things like "juju status" succeed? Do you already know what models are there, and can try to run 'juju status' in all of them?

Look at the content of "ListModels" API call, there doesn't seem to be anything that should be blocking specifically (it looks to all be Read requests). Can you also give an indication of the load on the controller itself? I don't think that's the factor, but its worth asking.

affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta17 → none
milestone: none → 2.0-beta17
Changed in juju:
milestone: 2.0-beta17 → none
Revision history for this message
Christian Muirhead (2-xtian) wrote :

I've got more understanding of the state the system is in, and I can see the point in the logs where whatever happened to put it in that state happened, but there's not enough information being logged to show what caused the problem. I'd like to make a binary with clearer logging in a couple of places and send it to Jason to try to reproduce the problem again, if that's ok?

More detail:

On 2016-09-16 between 18:20:57 and 18:21:35, in machine-0.log we can see a series of 12 log entries saying:

2016-09-16 18:20:57 ERROR juju.worker.dependency engine.go:539 "undertaker" manifold worker returned unexpected error: cannot remove model: an error occurred, unable to remove model

The undertaker removes dead models, it's failed when trying to delete the associated records, but the logging has been done in such a way that we can't see what the actual error is here.

Immediately after that another set of messages start:

2016-09-16 18:21:36 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings 97d9cc41-39ac-40ef-855c-ce4cd7a74475:r#23#provider#neutron-gateway/0} for applying transaction 57dc3802e545490f7e5d3701_d9b4cb4e

(Also the same error message from the undertaker.)

Those "cannot find document" errors happen continually after that, but only with 3 different transaction identifiers (and missing documents):
{settings 65914efc-bef1-474b-82a6-17c2727a8647:r#11#provider#glance/0} for applying transaction 57dd7166e545490f7e6ff657_744f5a98
{settings 97d9cc41-39ac-40ef-855c-ce4cd7a74475:r#23#provider#neutron-gateway/0} for applying transaction 57dc3802e545490f7e5d3701_d9b4cb4e
{settings f884855b-8afc-42f8-86f6-320aeacd7331:r#12#provider#keystone/0} for applying transaction 57dc67d5e545490f7e5ffd2e_9312baa6

Anything else that tries to run a transaction that touches a model gets queued up behind those 3 stuck transactions, which is why `juju models` and `juju controllers --refresh` hang. Looking in the DB, those settings records are indeed missing, but the only transactions that touch them are the ones that created them initially and the stuck transactions. Each of those transactions should delete those settings records, but also some settings records that are still present in the collection. So it looks like somehow the transactions have been partially applied, hopefully some better logging will show how that's happening.

There are also some unusual messages about cleanups failing that seem like they could be related somehow, although they start happening much earlier. I'm going to add more detail to that error message as well to see what's failing there.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Actually, I think I've worked it out - the cleanup to remove relation settings deletes the records directly rather than using a transaction. That means that the sequence of events is something like this:

1. We start destroying the model.
2. -> We tear down the applications
3. -> We tear down the relations
4. -> As part of that we queue a cleanup to remove the relation settings
5. The destroy finishes and we mark the model as dead.
6. The undertaker starts the removal process.
7. -> It makes a txn to remove the remaining settings docs, including the relation settings because the cleanup hasn't run yet.
8. Cleanups run before that txn.
9. -> The queued cleanup deletes the relation settings directly.
10. The txn fails to run and starts blocking stuff that touches anything related to the model, because it has an assertion that the model is dead.

The settings records to be deleted in the blocked transactions are all relation settings except one model settings item in each one, and in each case that's the only record that's still there.

So the fix is to make the relation settings cleanup transactional, I think. The bit of code that raises the error in mgo/txn is failing to find a stash record for a missing item - removing the relation settings transactionally will create the stash records as it deletes the rows.

I'm going to quickly make that change and get some binaries ready for another test.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

I've built a binary with the change that I think should fix this and uploaded it to the OIL CI bastion host.
Changes are here: https://github.com/juju/juju/compare/master...babbageclunk:settings-hang?expand=1

tags: added: eda
Changed in juju:
status: Incomplete → In Progress
assignee: nobody → Christian Muirhead (2-xtian)
milestone: none → 2.0-rc2
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.