model not successfully destroyed, and error on "juju list-models"

Bug #1611159 reported by Jason Hobbs
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Christian Muirhead

Bug Description

I'm testing multi model support in OIL.

OIL got hung up because a model failed to be destroyed, and 'juju list-models' doesn't work anymore.

The model is 'ci-oil-slave2'. Here's a timeline of what happened with it:

2016-08-08 21:58:02,322 - Model Added
2016-08-08 22:55:01,250 - destroy-model ran against it.
2016-08-09 00:12:44,258 - destroy-model ran against it a second time
2016-08-09 00:12:59,333 - Model add failed with "already exists"

Here's the output of "JUJU_LOGGING_CONFIG="<root>=TRACE" juju models --debug" now:
http://paste.ubuntu.com/22756389/

I've attached logs from the controller.

This is similar to bug 1611097, except there was almost an hour between adding and destroying the model, so it doesn't appear to be the same thing.

It's also similar to bug 1611093, but there the juju client hangs, whereas here we get an error output.

Without a workaround or a fix, this bug will block OIL from switching to multi-model/single controller testing.

This was with beta 14.

Tags: oil oil-2.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
Changed in juju-core:
importance: High → Critical
milestone: 2.0.0 → 2.0-beta15
Changed in juju-core:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Revision history for this message
Larry Michel (lmic) wrote :

@Anastasia, we have identified this bug internally as being one of the most critical ones for OIL.

Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
Revision history for this message
Ian Booth (wallyworld) wrote :

Do you have a log of what CLI commands were run against Juju? Can you attach the script or CLI history?

Revision history for this message
Ian Booth (wallyworld) wrote :

The issue with list-models not working is that the original problem with destroy hanging has resulted in the model being partially removed. The core model record is still in the db, but the status record is not. The list failure is symptomatic of bad data left in the db.

Revision history for this message
Ian Booth (wallyworld) wrote :

And the controller logs are filled with errors reflecting corruption of documents related to the model in question after a connection error:

2016-08-08 22:56:32 ERROR juju.rpc server.go:540 error writing response: write tcp 10.244.241.49:17070->10.244.241.55:54642: write: broken pipe
...
2016-08-09 00:32:15 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings 95ed3564-4ee3-4505-81d7-da4297d34ea4:r#1#peer#cinder/0} for applying transaction 57a90e06e5454915765572ef_84bcc115

The db will need to be cleaned up manually.

The connection between juju and mongo just died for a reason not apparent from the logs:

2016-08-08 19:56:26 ERROR juju.rpc server.go:540 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:46778: write: connection reset by peer

and this led to the mongo corruption.

I'm not sure what we can do here. Was the system under heavy load?
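
For anyone needing to inspect or clean up the db by hand, the rough approach is to connect to the controller's mongodb and look for the document named in the error. This is only a sketch based on the usual juju 2.0 defaults (controller model name, agent.conf path, mongo port and user may differ on these beta releases):

  juju ssh -m controller 0          # the controller model may be named 'admin' on older betas
  # on the controller machine, find the mongo password in the agent config
  sudo grep statepassword /var/lib/juju/agents/machine-0/agent.conf
  # connect to juju's mongodb (self-signed cert, so skip certificate validation)
  mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin \
      -u machine-0 -p <statepassword> localhost:37017/juju
  # then, in the mongo shell, look for the document named in the error message
  db.settings.find({"_id": "95ed3564-4ee3-4505-81d7-da4297d34ea4:r#1#peer#cinder/0"})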

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I don't have a log of all the juju commands run on the CLI. This is all done by an automated system. That's something juju should be logging if it's needed to debug issues.

We:

bootstrap the controller, once.

For each test run, we pick a model name and run 'juju destroy-model --yes <controller name>:<model name>' to make sure it doesn't exist already. Then we sleep 15 seconds.

Then we 'juju add-model <model-name>'
Then we 'juju deploy -m <controller name>:<model name> <bundle.yaml>'.
Then we use 'juju wait' to wait for the deployment to finish.
Then we deploy and run tempest; you can see those commands here:
https://git.launchpad.net/bundletester-tempest/tree/run.sh
Then we destroy the model: 'juju destroy-model --yes <controller name>:<model name>'

The system is a single-core VM (the core is dedicated to the VM) with 8GB of RAM. All it was doing was hosting the juju controller/mongodb.
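
Put together as a script, each run looks roughly like this (a sketch of the steps above, not the actual OIL automation; $CONTROLLER, $MODEL and bundle.yaml are placeholders):

  # once, up front
  juju bootstrap ...                              # bootstrap the controller

  # per test run
  juju destroy-model --yes $CONTROLLER:$MODEL     # make sure the model doesn't already exist
  sleep 15
  juju add-model $MODEL
  juju deploy -m $CONTROLLER:$MODEL bundle.yaml
  juju wait                                       # plugin; waits for the deployment to settle
  # deploy and run tempest (see the run.sh linked above)
  juju destroy-model --yes $CONTROLLER:$MODEL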

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1611159] Re: model not successfully destroyed, and error on "juju list-models"

Also, what do you mean by "the original problem with destroy hanging"? I
never saw destroy hang; it doesn't block. It just didn't work all the
way, apparently.

Jason

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Also, we see the broken pipe error in almost every deployment, if not all of them, including the ones that succeed.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

We are waiting on recreations and further logs.

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Which further logs do you need?

Changed in juju-core:
status: Incomplete → New
Changed in juju-core:
milestone: 2.0-beta16 → 2.0-beta17
importance: Critical → High
Changed in juju-core:
status: New → Triaged
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta17 → none
milestone: none → 2.0-beta17
Revision history for this message
Ian Booth (wallyworld) wrote :

@jason, looks like there's enough to get started digging a bit. At first glance, I don't hold much hope for giving an answer other than that there has been a level of data corruption. The broken pipe errors you see could possibly be caused by I/O and/or network contention during the deploy; we've seen in the past that bundles which deploy on top of many LXD containers on a machine can cause disk thrashing and connection dropouts. This should not result in mongo losing data as indicated in an earlier message, but anecdotally we've seen various scenarios where mongo gets corrupted. The TL;DR of the messages in the logs is that documents that Juju expects to find in the database have simply disappeared. Juju cannot recover from that. We need to try and understand what is causing mongo to lose data.

Revision history for this message
Ian Booth (wallyworld) wrote :

Another possibility worth investigating is that Juju may be constructing the transaction op slice in the wrong order. It is important that subordinate records be written before the main parent record. There's no guarantee that mongo will write the entire slice atomically; if it gets interrupted, another process may come in, read a parent record, and then attempt to read subordinate records which have not been written yet. On the surface that could also explain the observed behaviour, especially on a busy system.

Revision history for this message
Ian Booth (wallyworld) wrote :

The above comment is interesting, but I've checked the code and it seems Juju is doing the right thing. Plus the fact that it's mgo-txn-resumer and not Juju per se implies something more sinister.

Changed in juju:
assignee: Alexis Bruemmer (alexis-bruemmer) → Christian Muirhead (2-xtian)
description: updated
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta18 → 2.0-beta19
Changed in juju:
milestone: 2.0-beta19 → 2.0-rc1
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I've been trying to reproduce this with beta18 and haven't been able to yet. My test setup has been running for about 18 hours with no failures. Previously I could hit it within a couple of hours. I'm going to let it run over the weekend.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I've hit this again on beta 18.

Here's the output of juju models, which hangs now:
https://pastebin.canonical.com/165822/

In machine-0.log on the controller node:
2016-09-17 23:40:30 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings 97d9cc41-39ac-40ef-855c-ce4cd7a74475:r#23#provider#neutron-gateway/0} for applying transaction 57dc3802e545490f7e5d3701_d9b4cb4e

The environment is still up and I've given @2-xtian instructions on how to access it.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs from the beta 18 reproduce.

Changed in juju:
milestone: 2.0-rc1 → 2.0-rc2
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This may be able to clear things up when this is hit:

https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge
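
For reference, the recovery described on that wiki page boils down to roughly the following, run on the controller machine (a sketch only; check the wiki for the mgopurge version matching your juju release and the exact invocation):

  sudo systemctl stop jujud-machine-0     # stop the juju agent (service name may differ)
  sudo ./mgopurge                         # purge incomplete/orphaned mgo transactions
  sudo systemctl start jujud-machine-0    # restart the agent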

Revision history for this message
Larry Michel (lmic) wrote :

I also recreated this and commented in bug 1566426; looking at the error messages, they match the mongodb transaction errors. For that one, I had tried the steps from https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge, but they did not work because of the version of my client at the time. Per @anastasia, the wiki has been updated to reflect the client version to use, so it should work now. In my case, the workaround to kill the controller was to power off the controller and then try kill-controller (I had to run it twice for the controller to be completely destroyed).

My comment and logs are at:
https://bugs.launchpad.net/juju/+bug/1566426/comments/13
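
For reference, that workaround amounts to something like the following (a sketch; the power-off step depends on the provider, and <controller name> is a placeholder):

  # power off the controller machine out of band (MAAS / virsh / IPMI, as appropriate)
  juju kill-controller <controller name>    # may need to be run twice, as noted above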

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Based on the log messages it seems like this is the same problem as described here: https://bugs.launchpad.net/juju/+bug/1611093/comments/9

I've built a binary that changes the relation settings cleanup to use transactions - it should fix this as well.

Changed in juju:
status: Triaged → In Progress
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released