model not successfully destroyed, and error on "juju list-models"

Bug #1611159 reported by Jason Hobbs
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Christian Muirhead

Bug Description

I'm testing multi model support in OIL.

OIL got hung up because a model failed to be destroyed, and 'juju list-models' doesn't work anymore.

The model is 'ci-oil-slave2'. Here's a timeline of what happened with it:

2016-08-08 21:58:02,322 - Model Added
2016-08-08 22:55:01,250 - destroy-model ran against it.
2016-08-09 00:12:44,258 - destroy-model ran against it a second time
2016-08-09 00:12:59,333 - Model add failed with "already exists"

Here's the output of "JUJU_LOGGING_CONFIG="<root>=TRACE" juju models --debug" now:
http://paste.ubuntu.com/22756389/

I've attached logs from the controller.

This is similar to bug 1611097, except there was almost an hour between adding and destroying the model, so it doesn't appear to be the same thing.

It's also similar to bug 1611093, but there the juju client hangs, whereas here we get an error output.

Without a workaround or a fix, this bug will block OIL from switching to multi-model/single controller testing.

This was with beta 14.

Tags: oil oil-2.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
Changed in juju-core:
importance: High → Critical
milestone: 2.0.0 → 2.0-beta15
Changed in juju-core:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Revision history for this message
Larry Michel (lmic) wrote :

@Anastasia, we have identified this bug internally as being one of the most critical ones for OIL.

Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
Revision history for this message
Ian Booth (wallyworld) wrote :

Do you have a log of what CLI commands were run against Juju? Can you attach the script or CLI history?

Revision history for this message
Ian Booth (wallyworld) wrote :

The issue with list-models not working is that the original problem with destroy hanging has resulted in the model being partially removed. The core model record is still in the db, but the status record is not. The list failure is symptomatic of bad data left in the db.

Revision history for this message
Ian Booth (wallyworld) wrote :

And the controller logs are filled with errors reflecting corruption of documents related to the model in question after a connection error:

2016-08-08 22:56:32 ERROR juju.rpc server.go:540 error writing response: write tcp 10.244.241.49:17070->10.244.241.55:54642: write: broken pipe
...
2016-08-09 00:32:15 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings 95ed3564-4ee3-4505-81d7-da4297d34ea4:r#1#peer#cinder/0} for applying transaction 57a90e06e5454915765572ef_84bcc115

The db will need to be cleaned up manually.

The connection between juju and mongo just died for a reason not apparent from the logs:

2016-08-08 19:56:26 ERROR juju.rpc server.go:540 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:46778: write: connection reset by peer

and this led to the mongo corruption.

I'm not sure what we can do here. Was the system under heavy load?
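
For anyone needing to inspect or clean up the db by hand, the rough approach is to connect to the controller's mongodb and look for the document named in the error. This is only a sketch based on the usual juju 2.0 defaults (controller model name, agent.conf path, mongo port and user may differ on these beta releases):

  juju ssh -m controller 0          # the controller model may be named 'admin' on older betas
  # on the controller machine, find the mongo password in the agent config
  sudo grep statepassword /var/lib/juju/agents/machine-0/agent.conf
  # connect to juju's mongodb (self-signed cert, so skip certificate validation)
  mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin \
      -u machine-0 -p <statepassword> localhost:37017/juju
  # then, in the mongo shell, look for the document named in the error message
  db.settings.find({"_id": "95ed3564-4ee3-4505-81d7-da4297d34ea4:r#1#peer#cinder/0"})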

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I don't have a log of all the juju commands run on the CLI. This is all done by an automated system. That's something juju should be logging if it's needed to debug issues.

We:

bootstrap the controller, once.

For each test run, we pick a model name and run 'juju destroy-model --yes <controller name>:<model name>' to make sure it doesn't exist already. Then we sleep 15 seconds.

Then we 'juju add-model <model-name>'
Then we 'juju deploy -m <controller name>:<model name> <bundle.yaml>'.
Then we use 'juju wait' to wait for the deployment to finish.
Then we deploy and run tempest; you can see those commands here:
https://git.launchpad.net/bundletester-tempest/tree/run.sh
Then we destroy the model: 'juju destroy-model --yes <controller name>:<model name>'

The system is a single-core VM (the core is dedicated to the VM) with 8GB of RAM. All it was doing was hosting the juju controller/mongodb.
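
Put together as a script, each run looks roughly like this (a sketch of the steps above, not the actual OIL automation; $CONTROLLER, $MODEL and bundle.yaml are placeholders):

  # once, up front
  juju bootstrap ...                              # bootstrap the controller

  # per test run
  juju destroy-model --yes $CONTROLLER:$MODEL     # make sure the model doesn't already exist
  sleep 15
  juju add-model $MODEL
  juju deploy -m $CONTROLLER:$MODEL bundle.yaml
  juju wait                                       # plugin; waits for the deployment to settle
  # deploy and run tempest (see the run.sh linked above)
  juju destroy-model --yes $CONTROLLER:$MODEL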

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1611159] Re: model not successfully destroyed, and error on "juju list-models"

Also, what do you mean by "the original problem with destroy hanging"? I
never saw destroy hang; it doesn't block. It just didn't work all the
way, apparently.

Jason

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Also, we see the broken pipe error in almost every deployment, if not all of them, including the ones that succeed.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

We are waiting on recreations and further logs.

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Which further logs do you need?

Changed in juju-core:
status: Incomplete → New
Changed in juju-core:
milestone: 2.0-beta16 → 2.0-beta17
importance: Critical → High
Changed in juju-core:
status: New → Triaged
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta17 → none
milestone: none → 2.0-beta17
Revision history for this message
Ian Booth (wallyworld) wrote :

@jason, looks like there's enough to get started digging a bit. At first glance, I don't hold much hope for giving an answer other than that there has been a level of data corruption. The broken pipe errors you see could possibly be caused by I/O and/or network contention during the deploy; we've seen in the past that bundles which deploy on top of many LXD containers on a machine can cause disk thrashing and connection dropouts. This should not result in mongo losing data as indicated in an earlier message, but anecdotally we've seen various scenarios where mongo gets corrupted. The TL;DR of the messages in the logs is that documents that Juju expects to find in the database have simply disappeared. Juju cannot recover from that. We need to try and understand what is causing mongo to lose data.

Revision history for this message
Ian Booth (wallyworld) wrote :

Another possibility worth investigating is that Juju may be constructing the transaction op slice in the wrong order. It is important that subordinate records be written before the main parent record. There's no guarantee that mongo will write the entire slice atomically; if it gets interrupted, another process may come in, read a parent record, and then attempt to read subordinate records which have not been written yet. On the surface that could also explain the observed behaviour, especially on a busy system.

Revision history for this message
Ian Booth (wallyworld) wrote :

The above comment is interesting, but I've checked the code and it seems Juju is doing the right thing. Plus the fact that it's mgo-txn-resumer and not Juju per se implies something more sinister.

Changed in juju:
assignee: Alexis Bruemmer (alexis-bruemmer) → Christian Muirhead (2-xtian)
description: updated
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta18 → 2.0-beta19
Changed in juju:
milestone: 2.0-beta19 → 2.0-rc1
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I've been trying to reproduce this with beta18 and haven't been able to yet. My test setup has been running for about 18 hours with no failures. Previously I could hit it within a couple of hours. I'm going to let it run over the weekend.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I've hit this again on beta 18.

Here's the output of juju models, which hangs now:
https://pastebin.canonical.com/165822/

In machine-0.log on the controller node:
2016-09-17 23:40:30 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings 97d9cc41-39ac-40ef-855c-ce4cd7a74475:r#23#provider#neutron-gateway/0} for applying transaction 57dc3802e545490f7e5d3701_d9b4cb4e

The environment is still up and I've given @2-xtian instructions on how to access it.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs from the beta 18 reproduce.

Changed in juju:
milestone: 2.0-rc1 → 2.0-rc2
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This may be able to clear things up when this is hit:

https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge
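
For reference, the recovery described on that wiki page boils down to roughly the following, run on the controller machine (a sketch only; check the wiki for the mgopurge version matching your juju release and the exact invocation):

  sudo systemctl stop jujud-machine-0     # stop the juju agent (service name may differ)
  sudo ./mgopurge                         # purge incomplete/orphaned mgo transactions
  sudo systemctl start jujud-machine-0    # restart the agent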

Revision history for this message
Larry Michel (lmic) wrote :

I also recreated this and commented in bug 1566426; looking at the error messages, they match the mongodb transaction errors. For that one, I had tried the steps from https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge, but they did not work because of the version of my client at the time. Per @anastasia, the wiki has been updated to reflect the client version to use, so it should work now. In my case, the workaround to kill the controller was to power off the controller and then try kill-controller (I had to run it twice for the controller to be completely destroyed).

My comment and logs are at:
https://bugs.launchpad.net/juju/+bug/1566426/comments/13
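
For reference, that workaround amounts to something like the following (a sketch; the power-off step depends on the provider, and <controller name> is a placeholder):

  # power off the controller machine out of band (MAAS / virsh / IPMI, as appropriate)
  juju kill-controller <controller name>    # may need to be run twice, as noted above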

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Based on the log messages it seems like this is the same problem as described here: https://bugs.launchpad.net/juju/+bug/1611093/comments/9

I've built a binary that changes the relation settings cleanup to use transactions - it should fix this as well.

Changed in juju:
status: Triaged → In Progress
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released