Juju 2.0-beta12 userdata execution fails on Windows

Bug #1604474 reported by Adrian Vladu
This bug affects 3 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Nate Finch

Bug Description

The Juju userdata for the Windows agent fails when executed because it does not retry the connection using the public IP of the stateserver. It seems that only the stateserver's private IP is used.

This is the JujuUserdata log, retrieved from the azure portal:
http://paste.ubuntu.com/20047276/

As seen in
    http://reports.vapour.ws/releases/issue/5789759d749a5616aa83b491

Adrian Vladu (avladu)
summary: - Juju 2.0-beta12 agent fails with Windows when on Azure
+ Juju 2.0-beta12 userdata execution fails with Windows when on Azure
summary: - Juju 2.0-beta12 userdata execution fails with Windows when on Azure
+ Juju 2.0-beta12 userdata execution fails on Windows if azure-provider
+ is used
Cheryl Jennings (cherylj) wrote : Re: Juju 2.0-beta12 userdata execution fails on Windows if azure-provider is used

Can you attach /var/log/juju/machine-0.log for machine 0 in the controller model?

Changed in juju-core:
status: New → Incomplete
Gabriel Samfira (gabriel-samfira) wrote :

There are 2 issues here:

1) The use of Write-Error inside the catch block of the userdata. If $ErrorActionPreference is set to "Stop" in PowerShell, and something gets written to STDERR via Write-Error, the execution will stop with an error. It's one of the silliest things in PowerShell, but such is life.

https://github.com/juju/juju/blob/master/cloudconfig/windowsuserdatafiles/retry.ps1#L18 <-- one such example

2) The state machine logs show this:

2016/07/19 23:16:41 http: TLS handshake error from 86.120.209.201:47386: tls: no cipher suite supported by both client and server

This is the only entry that gets written to the log when the userdata attempts to download the juju tools.

Even though TLS 1.2 is enabled in the userdata, the cipher suite configured by juju does not appear to be supported by the client.

Gabriel Samfira (gabriel-samfira) wrote :

I suspect this is not limited to the azure provider. This will most likely happen on all providers when deploying Windows workloads.

Changed in juju-core:
status: Incomplete → Triaged
importance: Undecided → Critical
milestone: none → 2.0-beta13
Cheryl Jennings (cherylj) wrote :

I see that our deployment tests in CI have also been failing.

summary: - Juju 2.0-beta12 userdata execution fails on Windows if azure-provider
- is used
+ Juju 2.0-beta12 userdata execution fails on Windows
tags: added: oil-2.0
tags: added: oil
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
Nate Finch (natefinch)
Changed in juju-core:
status: Triaged → In Progress
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Nate Finch (natefinch) wrote :

It looks like there's some problem with PowerShell connecting to a TLS 1.2 server (our code was switched to only support TLS 1.2). I can replicate this with a trivial Go program that runs an HTTPS server with the same TLS configuration we use for Juju, and then trying to get a page from that server. It works from Chrome, but it won't work from PowerShell (PowerShell requests a maximum of TLS 1.1).

So far, I haven't figured out why this is so. .Net 4.5 is installed, and powershell is using it, and according to what I've read, TLS 1.2 is enabled by default.

Gabriel Samfira (gabriel-samfira) wrote :

Nate,

I don't think it's the protocol. I think it's the cypher:

https://github.com/juju/utils/blob/master/tls_transport_go13.go#L43-L49

try commenting out:

https://github.com/juju/utils/blob/master/tls_transport_go13.go#L60

and see if it works on your system.

Nate Finch (natefinch) wrote :

I'm pretty sure it's both. In my tests, a webclient request from powershell defaults to max version of TLS 1.1. If I explicitly tell it to use TLS 1.2, then it will, thusly:

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

But as you said, the default cyphersuites enabled on Windows do not line up with what we have supported in Juju.

I believe Windows supports the cyphersuites we use (at least some of them), but they're just not enabled. However, I've been having trouble figuring out how to enable them in Windows. There are many examples on the internet, but they're not super straightforward. There's a couple different methods mentioned, either group policy or in the registry... I tried both without any success... it doesn't help that windows is talking about cyphersuites with a _p384 etc at the end, which no one else seems to understand, e.g. TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA_P256

Nate Finch (natefinch) wrote :

Here's a raw dump of what windows is requesting (which I got by hacking the go stdlib to print them out from the handshake):

2016/07/25 19:50:53 3c
2016/07/25 19:50:53 2f
2016/07/25 19:50:53 3d
2016/07/25 19:50:53 35
2016/07/25 19:50:53 5
2016/07/25 19:50:53 a
2016/07/25 19:50:53 c027
2016/07/25 19:50:53 c013
2016/07/25 19:50:53 c014
2016/07/25 19:50:53 c02b
2016/07/25 19:50:53 c023
2016/07/25 19:50:53 c02c
2016/07/25 19:50:53 c024
2016/07/25 19:50:53 c009
2016/07/25 19:50:53 c00a
2016/07/25 19:50:53 40
2016/07/25 19:50:53 32
2016/07/25 19:50:53 6a
2016/07/25 19:50:53 38
2016/07/25 19:50:53 13
2016/07/25 19:50:53 4

c02c is TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
c02b is TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256

(see here for the names/values of go's supported ciphers: https://golang.org/pkg/crypto/tls/#pkg-constants)

Both of which are explicitly supported in our list.

This is our list:
 tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 (c02c)
 tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256 (c02b)

 tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (c030)
 tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (c02f)

...ahh, doing some more spelunking I see that the cert I generated for my test server was RSA, so only the latter two ciphers would work with it... except that they're not in the windows supported list. If I generate an ECDSA cert, I can get a ciphersuite to work.... but I'm still getting an error from webclient:

Windows PowerShell
Copyright (C) 2012 Microsoft Corporation. All rights reserved.

PS C:\Users\Administrator> [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
PS C:\Users\Administrator> (New-Object System.Net.WebClient).DownloadFile("https://127.0.0.1:10443", "out.txt")
Exception calling "DownloadFile" with "2" argument(s): "The underlying connection was closed: Could not establish
trust relationship for the SSL/TLS secure channel."
At line:1 char:1
+ (New-Object System.Net.WebClient).DownloadFile("https://127.0.0.1:10443", "out.t ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : WebException

I'm not getting any printed error on the serverside now, which is weird.

Nate Finch (natefinch) wrote :

OK, so interesting note - we currently generate a certificate with an RSA private key, which means the client must use an RSA ciphersuite... any other ciphersuites we supposedly support are useless for Juju. Unfortunately, the RSA ciphersuites we support are not supported by windows:

TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256

Supported list:

https://support.microsoft.com/en-us/kb/2929781

Windows supports TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 & TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, but those require an ECDSA cert, which is not what we're using.

Anastasia (anastasia-macmood) wrote :
John A Meinel (jameinel) wrote :

I believe we want to move to ECDSA. Is it possible to do both as an interim measure?

Curtis Hovey (sinzui)
description: updated
tags: added: ci regression
Nate Finch (natefinch) wrote :

Note that windows supports TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, which is in our list, so we can do ECDSA whenever needed.

Nate Finch (natefinch)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) wrote :

CI is still seeing failures testing windows charms in azure
    http://reports.vapour.ws/releases/issue/5789759d749a5616aa83b491

We retested Juju builds from July 1 and they pass. The current Juju fails.

Changed in juju-core:
status: Fix Committed → In Progress
Richard Harding (rharding) wrote : Re: [Bug 1604474] Re: Juju 2.0-beta12 userdata execution fails on Windows

Thanks Curtis, we'll look into it.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
ryeterrell (ryeterrell)
tags: added: vpil
Nate Finch (natefinch) wrote :

I'm currently getting a weird error reported in the azure portal, with the windows VMs failing to start:

Provisioning failed. Shrinking a disk from 136367309312 bytes to 34359738880 bytes is not supported.. ResizeDiskError

Richard Harding (rharding) wrote :

I think that was tied to a bug that axw was working on and should have a fix in trunk. Are you up to date on the codebase?

Andrew Wilkins (axwalk) wrote :

Rick, I suspect the changes I made have actually caused the "ResizeDiskError" issue. I guess for Windows we'll have to set the minimum disk size to 127GiB.
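The byte counts in the Azure error work out to roughly 127 GiB and 32 GiB. A small Go sketch of the arithmetic and of the kind of minimum-size clamp suggested above; the function and constant names are hypothetical and this is not the code from the actual fix.

```go
package main

import "fmt"

const gib = uint64(1) << 30

// windowsMinDiskGiB is the minimum suggested in this comment.
const windowsMinDiskGiB = 127

// rootDiskGiB raises the requested root disk size to the Windows
// minimum when needed and leaves other requests unchanged.
func rootDiskGiB(requestedGiB uint64, windows bool) uint64 {
	if windows && requestedGiB < windowsMinDiskGiB {
		return windowsMinDiskGiB
	}
	return requestedGiB
}

func main() {
	// Sizes from the Azure error message, rounded down to whole GiB.
	fmt.Println(uint64(136367309312)/gib, uint64(34359738880)/gib) // 127 32
	fmt.Println(rootDiskGiB(32, true), rootDiskGiB(32, false))     // 127 32
}
```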

Andrew Wilkins (axwalk) wrote :

I'm merging a fix for the ResizeDiskError: https://github.com/juju/juju/pull/5972. Azure was able to provision a Windows VM with this change in place.

Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
Nate Finch (natefinch) wrote :

Userdata is working fine... we're seeing some other problems that are unrelated to this bug.

Changed in juju-core:
status: In Progress → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16