frequent crashes when using do-resolve()

Bug #1894879 reported by Simon Déziel
Affects / Status / Importance / Assigned to / Milestone:
  haproxy (Ubuntu): Fix Released / Undecided / Unassigned
  haproxy (Ubuntu Focal): Fix Released / Undecided / Unassigned

Bug Description

[Impact]

* using the do-resolve action causes frequent crashes (>15 times/day on
  a given machine) and sometimes puts haproxy into a bogus state where DNS
  resolution stops working forever without crashing (DoS)

* to address the problem, 3 upstream fixes that were released in a later
  micro release are backported. Those fixes address problems affecting
  the do-resolve action that was introduced in the 2.0 branch (which is
  why only Focal is affected)

[Test case]

Those crashes are due to the do-resolve action not being thread-safe. As
such, it is hard to trigger this on demand, except with production load.

The proposed fix should be tested for any regression, ideally on an SMP-enabled
(multiple CPUs/cores) machine, as the crashes happen more easily on those.

haproxy should be configured as in your production environment and put
under test to see if it still behaves as it should. The more concurrent
traffic the better, especially if you make use of the do-resolve action.

Any crash should be visible in the journal, so keep an eye on
"journalctl -fu haproxy.service" during the tests. Abnormal exits should
be visible with:

 journalctl -u haproxy --grep ALERT | grep -F 'exited with code '

With the proposed fix, there should be none even when under heavy concurrent traffic.
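To quantify abnormal exits during a test run, the journal filter above can be turned into a small script. This is a sketch only: sample.log is a stand-in for real "journalctl -u haproxy" output, seeded with log lines borrowed from the original description.

```shell
#!/bin/sh
# sample.log stands in for captured `journalctl -u haproxy` output.
cat > sample.log <<'EOF'
Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)
Sep 08 00:18:54 foo haproxy[17164]: [ALERT] 251/001854 (17164) : Current worker #1 (17166) exited with code 134 (Aborted)
Sep 08 01:00:00 foo haproxy[17300]: [NOTICE] 251/010000 (17300) : New worker #1 (17301) forked
EOF

# Count ALERT lines reporting an abnormal worker exit; on a real journal,
# with the proposed fix, this count should stay at 0.
grep ALERT sample.log | grep -cF 'exited with code '
```

On a live system the same pipeline applies directly to the journalctl command shown above instead of the canned file.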

[Regression Potential]

The backported fixes address locking and memory management issues. It's
possible they could introduce more locking problems or memory leaks. That
said, they only change code related to the DNS/do-resolve action, which is
already known to trigger frequent crashes.

The backported fixes come from a later micro version (2.0.16, released on
2020-07-17), so they have already seen real-world production usage. Upstream
has since released another micro version and did not revert or rework the
DNS/do-resolve action.

Furthermore, the patch was tested on a production machine that used to
crash >15 times per day. 48 hours after the patch deployment, we have seen
no crash and no sign of regression.

[Original description]

haproxy 2.0.13-2 keeps crashing for us:

# journalctl --since yesterday -u haproxy --grep ALERT | grep -F 'exited with code '
Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)
Sep 07 19:45:23 foo haproxy[16876]: [ALERT] 250/194523 (16876) : Current worker #1 (16877) exited with code 139 (Segmentation fault)
Sep 07 19:49:01 foo haproxy[16916]: [ALERT] 250/194901 (16916) : Current worker #1 (16919) exited with code 139 (Segmentation fault)
Sep 07 19:49:02 foo haproxy[16939]: [ALERT] 250/194902 (16939) : Current worker #1 (16942) exited with code 139 (Segmentation fault)
Sep 07 19:49:03 foo haproxy[16953]: [ALERT] 250/194903 (16953) : Current worker #1 (16955) exited with code 139 (Segmentation fault)
Sep 07 19:49:37 foo haproxy[16964]: [ALERT] 250/194937 (16964) : Current worker #1 (16965) exited with code 139 (Segmentation fault)
Sep 07 23:41:13 foo haproxy[16982]: [ALERT] 250/234113 (16982) : Current worker #1 (16984) exited with code 139 (Segmentation fault)
Sep 07 23:41:14 foo haproxy[17076]: [ALERT] 250/234114 (17076) : Current worker #1 (17077) exited with code 139 (Segmentation fault)
Sep 07 23:43:20 foo haproxy[17090]: [ALERT] 250/234320 (17090) : Current worker #1 (17093) exited with code 139 (Segmentation fault)
Sep 07 23:43:50 foo haproxy[17113]: [ALERT] 250/234350 (17113) : Current worker #1 (17116) exited with code 139 (Segmentation fault)
Sep 07 23:43:51 foo haproxy[17134]: [ALERT] 250/234351 (17134) : Current worker #1 (17135) exited with code 139 (Segmentation fault)
Sep 07 23:44:44 foo haproxy[17146]: [ALERT] 250/234444 (17146) : Current worker #1 (17147) exited with code 139 (Segmentation fault)
Sep 08 00:18:54 foo haproxy[17164]: [ALERT] 251/001854 (17164) : Current worker #1 (17166) exited with code 134 (Aborted)
Sep 08 00:27:51 foo haproxy[17263]: [ALERT] 251/002751 (17263) : Current worker #1 (17266) exited with code 139 (Segmentation fault)
Sep 08 00:30:36 foo haproxy[17286]: [ALERT] 251/003036 (17286) : Current worker #1 (17289) exited with code 134 (Aborted)
Sep 08 00:37:01 foo haproxy[17307]: [ALERT] 251/003701 (17307) : Current worker #1 (17310) exited with code 139 (Segmentation fault)
Sep 08 00:40:31 foo haproxy[17331]: [ALERT] 251/004031 (17331) : Current worker #1 (17334) exited with code 139 (Segmentation fault)
Sep 08 00:41:14 foo haproxy[17650]: [ALERT] 251/004114 (17650) : Current worker #1 (17651) exited with code 139 (Segmentation fault)
Sep 08 00:41:59 foo haproxy[17669]: [ALERT] 251/004159 (17669) : Current worker #1 (17672) exited with code 139 (Segmentation fault)

The server in question uses a config that looks like this:

global
  maxconn 50000
  log /dev/log local0
  log /dev/log local1 notice
  chroot /var/lib/haproxy
  stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
  stats timeout 30s
  user haproxy
  group haproxy
  daemon

defaults
  maxconn 50000
  log global
  mode tcp
  option tcplog
  # dontlognull won't log sessions where the DNS resolution failed
  #option dontlognull
  timeout connect 5s
  timeout client 15s
  timeout server 15s

resolvers mydns
  nameserver local 127.0.0.1:53
  accepted_payload_size 1400
  timeout resolve 1s
  timeout retry 1s
  hold other 30s
  hold refused 30s
  hold nx 30s
  hold timeout 30s
  hold valid 10s
  hold obsolete 30s

frontend foo
  ...

  # dns lookup
  tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
  tcp-request content capture var(txn.remote_ip) len 40

  # XXX: the remaining rejections happen in the backend to provide better logging
  # reject connection on DNS resolution error
  use_backend be_dns_failed unless { var(txn.remote_ip) -m found }

  ...
  # at this point, we should let the connection through
  default_backend be_allowed
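The be_dns_failed and be_allowed backends are elided from the excerpt above; a minimal, hypothetical sketch of what a reject-with-logging backend could look like (the backend bodies and the server address are placeholders, not taken from the reporter's config):

```
backend be_dns_failed
  # No servers here: the connection is rejected, but the tcplog entry still
  # records that it was routed to this backend after a failed DNS lookup.
  tcp-request content reject

backend be_allowed
  server app1 192.0.2.10:443
```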

When reaching out to haproxy folks in #haproxy, https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae was mentioned as a potential fix.

https://www.haproxy.org/bugs/bugs-2.0.13.html shows 3 commits with "dns":

* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=39eb766
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=74d704f

It would be great to have those (at least ef131ae) SRU'ed to Focal (Bionic isn't affected as it runs 1.8).

Additional information:

# lsb_release -rd
Description: Ubuntu 20.04.1 LTS
Release: 20.04

# apt-cache policy haproxy
haproxy:
  Installed: 2.0.13-2
  Candidate: 2.0.13-2
  Version table:
 *** 2.0.13-2 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
        100 /var/lib/dpkg/status


Revision history for this message
Simon Déziel (sdeziel) wrote :

I've deployed a test package with all 3 git commits included. I will report here after a few days of testing, and hopefully have a debdiff to propose ;)

Revision history for this message
Simon Déziel (sdeziel) wrote :
description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

+0003-BUG-CRITICAL-dns-Make-the-do-resolve-action-thread-safe.patch

This is available at 2020/07/23: 2.2.1

+0004-BUG-MEDIUM-dns-Release-answer-items-when-a-DNS-resolution-is-freed.patch

This is available at 2020/07/23: 2.2.1

+0005-BUG-MEDIUM-dns-Dont-yield-in-do-resolve-action-on-a-final.patch

This one is available at 2020/07/31 : 2.2.2

So we are good in Groovy.

Changed in haproxy (Ubuntu):
status: New → Fix Released
Changed in haproxy (Ubuntu Focal):
status: New → Triaged
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Simon, you said you would be testing in production ... do you have any update on that, apart from the debdiff?

For Focal, we have:

# issue #236

BUG/MAJOR: dns: Make the do-resolve action thread-safe

> This patch should fix the issue #236. It must be backported as far as 2.0.

# issue #222

BUG/MEDIUM: dns: Don't yield in do-resolve action on a final evaluation

> This patch is related to the issue #222. It must be backported as far as 2.0.

BUG/MEDIUM: dns: Release answer items when a DNS resolution is freed

> This patch should solve the issue #222. It must be backported, at least, as far as 2.0, and probably, with caution, as far as 1.8 or 1.7.

I'm +1 in having those in Focal as well... let us know how your tests went and we can sponsor your SRU.

Thanks!

-rafaeldtinoco

Revision history for this message
Simon Déziel (sdeziel) wrote :

Rafael, yes, production testing went/is going well and we are more than 2 days in (it's mentioned/buried in the regression potential section). Thanks for your sponsoring offer!

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Perfect! Sorry for missing that feedback. I'll upload (sponsor) it for you and it will then get reviewed by the SRU team for acceptance. But I don't think there will be any issues.

Revision history for this message
Simon Déziel (sdeziel) wrote :

Thanks for adding the d/control part :)

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

From autopkgtest:

---
proxy-localhost PASS
cli PASS
proxy-localhost PASS
---

$ git push pkg upload/2.0.13-2ubuntu0.1
Counting objects: 17, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (17/17), done.
Writing objects: 100% (17/17), 6.47 KiB | 602.00 KiB/s, done.
Total 17 (delta 9), reused 0 (delta 0)
To ssh://git.launchpad.net/ubuntu/+source/haproxy
 * [new tag] upload/2.0.13-2ubuntu0.1 -> upload/2.0.13-2ubuntu0.1

$ debdiff *.dsc 2>&1 | diffstat -l | grep -v gpg
changelog
control
patches/lp1894879-BUG-CRITICAL-dns-Make-the-do-resolve-action-thread-safe.patch
patches/lp1894879-BUG-MEDIUM-dns-Dont-yield-in-do-resolve-action-on-a-final.patch
patches/lp1894879-BUG-MEDIUM-dns-Release-answer-items-when-a-DNS-resolution-is-freed.patch
patches/series

$ dput ubuntu haproxy_2.0.13-2ubuntu0.1_source.changes
Checking signature on .changes
Uploading to ubuntu (via ftp to upload.ubuntu.com):
  Uploading haproxy_2.0.13-2ubuntu0.1.dsc: done.
  Uploading haproxy_2.0.13-2ubuntu0.1.debian.tar.xz: done.
  Uploading haproxy_2.0.13-2ubuntu0.1_source.buildinfo: done.
  Uploading haproxy_2.0.13-2ubuntu0.1_source.changes: done.
Successfully uploaded packages.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

@sdeziel,

Could you please verify the fix from the -proposed archive once it gets there (after the SRU team approves the upload) and change the tag from "verification-needed" to "verification-done" when appropriate?

I took the liberty of changing the patch names and their position in the series file to help a bit with package management.

Thanks for the contribution!

Revision history for this message
Simon Déziel (sdeziel) wrote :

Sure thing, I'll keep an eye out for -proposed and will report back.

Revision history for this message
Simon Déziel (sdeziel) wrote :

Quick update: it's been roughly a week and still no crash with the pre-proposed package, so that's good! Still waiting for the update to show up in -proposed.

Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Simon, or anyone else affected,

Accepted haproxy into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/haproxy/2.0.13-2ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in haproxy (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Simon Déziel (sdeziel) wrote :

The proposed version is running on 2 production machines and no crash (nor any other problem) was observed since then. Marking as verification-done.

# apt-get install haproxy
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
  vim-haproxy haproxy-doc
The following packages will be upgraded:
  haproxy
1 upgraded, 0 newly installed, 0 to remove and 10 not upgraded.
Need to get 1,519 kB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 haproxy amd64 2.0.13-2ubuntu0.1 [1,519 kB]
Fetched 1,519 kB in 1s (2,614 kB/s)
(Reading database ... 14826 files and directories currently installed.)
Preparing to unpack .../haproxy_2.0.13-2ubuntu0.1_amd64.deb ...
Unpacking haproxy (2.0.13-2ubuntu0.1) over (2.0.13-2) ...
Setting up haproxy (2.0.13-2ubuntu0.1) ...
Processing triggers for rsyslog (8.2001.0-1ubuntu1.1) ...
Processing triggers for systemd (245.4-4ubuntu3.2) ...

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package haproxy - 2.0.13-2ubuntu0.1

---------------
haproxy (2.0.13-2ubuntu0.1) focal; urgency=medium

  * Backport dns related fixes from git to resolve crashes when
    using do-resolve action (LP: #1894879)
    - BUG/CRITICAL: dns: Make the do-resolve action thread safe
    - BUG/MEDIUM: dns: Release answer items when a DNS resolution is freed
    - BUG/MEDIUM: dns: Don't yield in do resolve action on a final

 -- Simon Deziel <email address hidden> Tue, 08 Sep 2020 17:16:14 +0000

Changed in haproxy (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for haproxy has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
David (liewebagency-deactivatedaccount) wrote : Re: [Bug 1894879] frequent crashes when using do-resolve()

Hello,

Is it possible to run with the old net manager and Focal? I've installed 16.04 now to get eth0 and network manager.

There are no server control panels that can set up and control IP addresses, like bringing up eth0 eth01. The .yaml system I hate so much, it's the worst idea ever, it destroys everything!

> 11. sep. 2020 kl. 18:04 skrev Launchpad Bug Tracker <email address hidden>:
>
> ** Merge proposal linked:
> https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/haproxy/+git/haproxy/+merge/390625
>
> --
> You received this bug notification because you are subscribed to Focal.
> Matching subscriptions: <email address hidden>
> https://bugs.launchpad.net/bugs/1894879
>
> Title:
> frequent crashes when using do-resolve()
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1894879/+subscriptions

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi David,
your comment seems out of context here. If there is an issue I'd recommend filing a new bug for netplan and discussing things there, but if all you need is "Is it possible to run with the old net manager and focal" then look at [1].

[1]: https://netplan.io/faq/#how-to-go-back-to-ifupdown
