service haproxy reload sometimes fails to pick up new TLS certificates

Bug #1828496 reported by Barry Price
This bug affects 3 people
Affects Status Importance Assigned to Milestone
haproxy (Ubuntu)
Confirmed
Undecided
Unassigned
Xenial
Won't Fix
Undecided
Unassigned
Bionic
Confirmed
Undecided
Unassigned

Bug Description

I suspect this is the same thing reported on StackOverflow:

"I had this same issue where even after reloading the config, haproxy would randomly serve old certs. After looking around for many days the issue was that "reload" operation created a new process without killing the old one. Confirm this by "ps aux | grep haproxy"."

https://stackoverflow.com/questions/46040504/haproxy-wont-recognize-new-certificate

In our setup, we automate Let's Encrypt certificate renewals, and a fresh certificate will trigger a reload of the service. But occasionally this reload doesn't seem to do anything.

Will update with details next time it happens, and hopefully confirm the multiple process theory.
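
The renewal automation described above could be sketched as a deploy hook along these lines; the function name, paths, and reload command are illustrative assumptions, not taken from this report:

```shell
#!/bin/sh
# Hypothetical deploy hook run after a certificate renewal.
install_and_reload() {
    live_dir="$1"      # e.g. /etc/letsencrypt/live/example.com
    haproxy_pem="$2"   # e.g. /etc/haproxy/certs/example.com.pem
    reload_cmd="$3"    # e.g. "systemctl reload haproxy.service"

    # haproxy expects the certificate chain and private key
    # concatenated into a single PEM file
    cat "$live_dir/fullchain.pem" "$live_dir/privkey.pem" > "$haproxy_pem"

    # Whether this reload reliably takes effect is the subject of this bug
    $reload_cmd
}
```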


Revision history for this message
Robie Basak (racb) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

When you update with details, please make sure to provide full reproduction steps and include details of the Ubuntu release and package versions you used. When done, please change the bug status back to New, and we can look at it again.

Changed in haproxy (Ubuntu):
status: New → Incomplete
Revision history for this message
Barry Price (barryprice) wrote :

This is haproxy 1.6.3-1ubuntu0.2 on Xenial/16.04 running on amd64 hardware.

Reproduction steps:

1. Install haproxy and configure it to use a TLS certificate
2. Renew and replace that certificate
3. Run 'service haproxy reload'
4. Sometimes this starts serving the new certificate, sometimes it doesn't
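
One way to check step 4 objectively (hypothetical helper names; requires openssl) is to compare the expiry date served on the wire with the one in the renewed file:

```shell
# Hypothetical check for step 4: is the listener serving the renewed cert?
served_expiry() {
    host_port="$1"   # e.g. localhost:443
    echo | openssl s_client -connect "$host_port" 2>/dev/null \
        | openssl x509 -noout -enddate
}

file_expiry() {
    cert_file="$1"   # e.g. /etc/haproxy/certs/example.com.pem
    openssl x509 -noout -enddate -in "$cert_file"
}
# If the two notAfter= dates differ after a reload, the bug has struck.
```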

Changed in haproxy (Ubuntu):
status: Incomplete → New
Revision history for this message
Paride Legovini (paride) wrote :

Hi,

I see you are using service(1) to reload haproxy. The service command was written to manage SysV init scripts, but nowadays it prefers calling systemctl when possible. You quoted that:

  After looking around for many days the issue was that "reload"
  operation created a new process without killing the old one.

and as haproxy ships with both SysV and systemd init files, this makes me think that service(1) may for some reason fail to call systemctl and end up reloading haproxy via the SysV script, causing the double process. The first process would in this case keep running, serving the old certificate. However, on the Xenial system where I tried to reproduce the issue, `service haproxy reload` calls `systemctl reload` as expected. From /var/log/syslog:

May 23 09:39:42 xenial systemd[1]: Reloading HAProxy Load Balancer.
May 23 09:39:42 xenial haproxy[26047]: Configuration file is valid
May 23 09:39:42 xenial haproxy-systemd-wrapper[25945]: haproxy-systemd-wrapper: re-executing
May 23 09:39:42 xenial systemd[1]: Reloaded HAProxy Load Balancer.
May 23 09:39:42 xenial haproxy-systemd-wrapper[25945]: haproxy-systemd-wrapper: executing /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 26035

Do you happen to see two haproxy processes when a `service haproxy reload` fails to make haproxy serve the new certificate? Can you still reproduce the issue if you reload haproxy using systemctl instead of service? Can you please attach the tail of /var/log/syslog the next time you encounter the issue? Thank you.

Revision history for this message
Paride Legovini (paride) wrote :

I'm marking this report as Incomplete for now, as we are still missing some pieces of information needed to determine whether this is actually a bug in Ubuntu. After providing them, please change the bug status back to New, and we will look at it again. Thank you!

Changed in haproxy (Ubuntu):
status: New → Incomplete
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Note that there is a systemd wrapper process in xenial:
  411 ? Ss 0:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
  413 ? S 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
  432 ? Ss 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds

After a reload (not restart), that particular process (411) stays, but its children, which are what actually serve the content, are restarted:
  411 ? Ss 0:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
  671 ? S 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 432
  675 ? Ss 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 432

Maybe there is a bad interaction between reload, certs, and existing connections. The tests I've done so far are rather static, with a simple frontend and backend.
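
One way to spot the suspected leftover processes (a hypothetical helper, not from this report) is to compare the running haproxy PIDs with what the pidfile tracks:

```shell
# Hypothetical helper: print running haproxy PIDs that the pidfile no
# longer tracks -- leftovers that may still serve the old certificate.
find_stale_workers() {
    pidfile_pids="$1"   # e.g. "$(cat /run/haproxy.pid)"
    running_pids="$2"   # e.g. "$(pgrep -x haproxy)"
    for pid in $running_pids; do
        # report any running PID absent from the pidfile
        echo "$pidfile_pids" | grep -qx "$pid" || echo "$pid"
    done
}
```

Usage on a live system would look like: `find_stale_workers "$(cat /run/haproxy.pid)" "$(pgrep -x haproxy)"`.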

Revision history for this message
Barry Price (barryprice) wrote :

Never mind, here it is on Xenial:

https://pastebin.ubuntu.com/p/3zbQdnTBtF/

There's nothing relevant to haproxy in syslog, but here are the relevant lines from /var/log/haproxy.log:

https://pastebin.ubuntu.com/p/HJT3WRc8Dw/

While the proxy processes apparently did stop, the pid 2215 process did not.

Running 'systemctl restart haproxy.service' afterwards killed both processes, and started a single fresh one with a new pid, which I think is correct behaviour. So this does indeed seem likely to be down to using 'service' here.

Revision history for this message
Paride Legovini (paride) wrote :

Hi Barry,

Glad that you have a workaround, but I don't think we nailed the problem. You are now using the 'restart' action, which is stronger than a 'reload', and I bet that a 'service haproxy restart' would have worked too. The 'restart' action fully stops and then starts the service, while 'reload' sends the USR2 signal to haproxy. The systemd unit file defining the reload action comes from upstream.

This is what I found out. As of the version of haproxy in Xenial (1.6.x), the only two things mentioning the use of SIGUSR2 to reload the service are the upstream changelog entry:

  - MINOR: systemd wrapper: re-execute on SIGUSR2

and the systemd unit file actually using it.

With haproxy 1.8, upstream announced [1] much improved management of multiple haproxy processes with the newly introduced master-worker mode, finally documenting [2] SIGUSR2 as the correct way to reload haproxy:

    In master-worker mode, it is not needed to start a new haproxy
    process in order to reload the configuration. The master process
    reacts to the SIGUSR2 signal by reexecuting itself with the -sf
    parameter followed by the PIDs of the workers. The master will
    then parse the configuration file and fork new workers.
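
The mechanism described above boils down to delivering SIGUSR2 to the PID recorded in the pidfile; a minimal sketch (the pidfile path matches the ps listings elsewhere in this report, the helper name is invented):

```shell
# Minimal sketch of the documented reload path: send SIGUSR2 to the
# first PID recorded in the pidfile.
reload_via_usr2() {
    pidfile="$1"   # e.g. /run/haproxy.pid
    kill -USR2 "$(head -n1 "$pidfile")"
}
```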

My take is that in haproxy 1.6 SIGUSR2 isn't working well as a way to reload the haproxy configuration file, and a full 'restart' may be necessary.

It is still not clear to me why I *do* see the PIDs of the haproxy processes changing with a 'service haproxy reload', while Andreas doesn't.

[1] https://www.haproxy.com/blog/whats-new-haproxy-1-8/
[2] https://cbonte.github.io/haproxy-dconv/1.8/management.html

Barry Price (barryprice)
Changed in haproxy (Ubuntu):
status: Incomplete → New
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Is there any new detail to the re-opening of the case?

Revision history for this message
Barry Price (barryprice) wrote :

It was marked Incomplete in #4, so there's new detail in #5, #7 and #8 - I just missed re-opening it until now.

Revision history for this message
Paride Legovini (paride) wrote :

Hi Barry,

I tried again to reproduce the issue on Xenial, but without success. I tested by 'reloading' the service several times, and in all the reloads the PIDs of the haproxy processes did change, as expected given the version of haproxy currently in Xenial.

In order to investigate the issue we need some more details. To begin with:

1) How often do you experience the problem? Is this a rare, difficult to reproduce issue, or something you can easily reproduce? I'm not asking to evaluate the severity of the issue, but to be able to test in a meaningful way.

2) When you hit the problem, can you check if the PIDs of the haproxy processes change after reloading the service?

I imagine a scenario where you have several auto-renewing certificates with a hook/script reloading haproxy on cert renewal, and from time to time you find out that an old certificate is being served, even though it has been renewed and the haproxy service reloaded. In this case it would be nice to log a few things before and after the reload (e.g. verify that the new certs are actually in place, `ps fauxww`, `systemctl status`, haproxy.log, ...), sleeping for a few seconds after the reload command is issued to give the daemons time to restart.
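
That before/after logging could be sketched like this; the log path is illustrative, and the reload command is a parameter so the sketch stays generic:

```shell
#!/bin/sh
# Hypothetical diagnostic wrapper around the reload, as suggested above.
log_reload() {
    logfile="$1"                                   # e.g. /var/log/haproxy-reload-debug.log
    reload_cmd="${2:-systemctl reload haproxy.service}"
    {
        date
        echo '--- before reload ---'
        ps fauxww | grep '[h]aproxy' || echo 'no haproxy processes'
        $reload_cmd
        sleep 5   # give the daemons time to settle
        echo '--- after reload ---'
        ps fauxww | grep '[h]aproxy' || echo 'no haproxy processes'
    } >> "$logfile" 2>&1
}
```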

Once again after providing additional information please set the bug status back to New. Thanks!

Changed in haproxy (Ubuntu):
status: New → Incomplete
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Going over the details from comment #7

This is the state before the reload:
ubuntu@foo:~$ ps auxfwww | grep haproxy
root 1346 0.0 0.0 4356 684 ? Ss May22 0:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 2210 0.0 0.2 42644 10520 ? S May22 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 1378
haproxy 2215 2.7 0.8 68576 36308 ? Ss May22 84:46 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 1378

-sf means to send the finish signal (which is SIGTTOU and SIGUSR1 according to haproxy(1)) to the pids listed after startup, which is pid 1378 in this case. There is no haproxy 1378 in this list, so I wonder if the "before" state was already a bit borked and what haproxy does if the pids listed after -sf do not exist.

After reload, we have:
ubuntu@foo:~$ ps auxfwww | grep haproxy
root 1346 0.0 0.0 4356 724 ? Ss May22 0:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 2210 0.0 0.2 42644 10520 ? S May22 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 1378
haproxy 2215 2.7 0.8 68496 36228 ? Ss May22 84:47 | \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 1378
haproxy 8151 0.0 0.2 42644 10456 ? S 07:36 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 2215
haproxy 8152 2.0 0.2 43048 10568 ? Ss 07:36 0:00 \_ /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 2215
ubuntu@foo:~$

Here we can see new haproxy processes with -sf pointing at the previous one (2215). The ones with -sf 1378 are still there, and will probably remain there until a full restart.

Revision history for this message
Barry Price (barryprice) wrote :

Hi Paride, sorry for the late reply - summer holidays.

1). Unfortunately it's a rare one, and I'm not sure exactly what state the service needs to be in, in order to trigger the bug.

2). I believe the pastebin in comment #7 is representative, and that the PIDs don't change, but rather new haproxy processes appear alongside the older ones. I'll certainly confirm and update this bug next time this pops up. It just might be a while before that happens again.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for haproxy (Ubuntu) because there has been no activity for 60 days.]

Changed in haproxy (Ubuntu):
status: Incomplete → Expired
Changed in haproxy (Ubuntu):
status: Expired → New
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in haproxy (Ubuntu):
status: New → Confirmed
Revision history for this message
Haw Loeung (hloeung) wrote :

Saw this elsewhere, on Bionic:

| root 6522 0.0 0.7 2089380 2035060 ? Ss Mar11 0:36 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 60208 12488 19796 -x /run/haproxy/admin.sock
| haproxy 19796 0.0 0.5 8374612 1518588 ? Ssl Mar12 370:39 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 62903 -x /run/haproxy/admin.sock
| haproxy 63249 0.0 0.5 8490864 1468892 ? Ssl 19:13 185:53 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 60208 12488 19796 -x /run/haproxy/admin.sock

New configs weren't made live because the old process was still around, even after a reload and HAProxy spawning a new one.

Changed in haproxy (Ubuntu Xenial):
status: New → Confirmed
Changed in haproxy (Ubuntu Bionic):
status: New → Confirmed
Revision history for this message
Haw Loeung (hloeung) wrote :

| haproxy:
| Installed: 1.8.8-1ubuntu0.9
| Candidate: 1.8.8-1ubuntu0.9
| Version table:
| *** 1.8.8-1ubuntu0.9 500
| 500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
| 500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
| 100 /var/lib/dpkg/status

Revision history for this message
Haw Loeung (hloeung) wrote :

Looks like HAProxy's 'hard-stop-after'[1] config might be the solution to this.

"""
Defines the maximum time allowed to perform a clean soft-stop.

This may be used to ensure that the instance will quit even if connections
remain opened during a soft-stop (for example with long timeouts for a proxy
in tcp mode). It applies both in TCP and HTTP mode.
"""

[1]http://cbonte.github.io/haproxy-dconv/1.8/configuration.html#3.1-hard-stop-after
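
A minimal example of that setting in haproxy.cfg (the 30s value is an illustrative choice, not a recommendation from this report):

```
global
    # Force old processes to exit at most 30s after a soft-stop/reload,
    # so a leftover worker cannot keep serving a stale certificate.
    hard-stop-after 30s
```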

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Xenial has reached its end of standard support.

Changed in haproxy (Ubuntu Xenial):
status: Confirmed → Won't Fix