dnsmasq often using 100% of CPU if the upstream nameserver is unreachable

Bug #1980146 reported by mixmastamyk
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
dnsmasq (Ubuntu)
Triaged
Medium
Unassigned
Focal
Triaged
Medium
Unassigned
Jammy
Triaged
Medium
Unassigned
Kinetic
Won't Fix
Medium
Unassigned

Bug Description

Release: 22.04
Codename: jammy
Kernel: Linux 5.15.0-40-generic x86_64

⏵ apt-cache policy dnsmasq
  Installed: 2.86-1.1ubuntu0.1

dnsmasq is caught in a loop after every dns request, resulting in 100% CPU usage for several minutes each time. This leads to a hot and lethargic computer. During this time thousands of the following messages (see below) are printed from strace.

The loop tends to obsess on denied connections, but there are so many I'm not 100% sure.

systemd-resolved is _not_ running, some bugs refer to that.

⏵ head /etc/dnsmasq.d/foo.conf

address=/#/127.0.0.2
port=53
resolv-file=/var/run/NetworkManager/resolv.conf

⏵ sudo strace -p 3519 (dnsmasq)

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}], 9, -1) = 1 ([{fd=4, revents=POLLIN}])

recvmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(60224), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=28 => 16, msg_iov=[{iov_base="\302\221\1\0\0\1\0\0\0\0\0\0\17classify-client\10ser"..., iov_len=4096}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("lo"), ipi_spec_dst=inet_addr("127.0.0.53"), ipi_addr=inet_addr("127.0.0.53")}}], msg_controllen=32, msg_flags=0}, 0) = 54

ioctl(4, SIOCGIFNAME, {ifr_ifindex=1, ifr_name="lo"}) = 0

sendto(14, "\302\221\1\0\0\1\0\0\0\0\0\0\17classify-client\10ser"..., 54, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16) = 54

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

hi, mixmastamyk. Thanks for taking the time to report this bug.

Can you track those spikes to any specific network operation? Do you have logs for these specific spikes?

Finally, would you be able to provide a minimal reproducer with steps and configuration so we can reproduce the issue?

Revision history for this message
mixmastamyk (ubforums) wrote :

Hi, it typically happens while using firefox, when it makes any of its automatic or requested network accesses. I originally thought firefox was hammering it, but I was also able to provoke the behavior with another tool, wget I think.

In terms of config, the /etc/dnsmasq.d/foo.conf above is it, with the exception of about ~40 server=... lines. The file checks out though:

⏵ dnsmasq --test
dnsmasq: syntax check OK.

The command below will stop the issue, for a while:

⏵ sudo systemctl restart dnsmasq

Revision history for this message
sefs (sefsinc) wrote :

I have been monitoring the same issue for months. But have not been able to identify the root cause.
Worst yet it not only affects my system but also my router.

The dnsmasq goes to 100%cpu and somehow forces my pfsense to also begin to consume 100%cpu
if i kill dnsmasq then the router comes back to normal.

Please send help.

tags: added: server-triage-discuss
tags: removed: server-triage-discuss
Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Hi mixmastamyk,

Would you be able to provide the following information:

  * dpkg --get-selections | cut -f 1
  * /etc/resolv.conf
  * /etc/NetworkManager/NetworkManager.conf
  * /etc/dnsmasq.conf
  * /run/dnsmasq/resolv.conf
  * /var/run/dhcpd.leases (or /run/dhcpd.leases?)

Do make sure you remove any sensitive information from those.

Also, when your system hits the 100% issue, could you provide the following info:

  * ps aux
    - Is logrotate running? (Or some other 'heavy' process?)
  * journalctl -f
    - Any mention of "maximum number of concurrent dns queries reached"?

https://github.com/jedisct1/dnsblast provides a tool which could be used to try to reproduce the issue.

@sefs:

Your issue may be unrelated to mixmastamyk's bug. Wouldyou be able to file a separate bug providing the same information I just requested to mixmastamyk above?

Thanks!

Revision history for this message
mixmastamyk (ubforums) wrote :
Revision history for this message
mixmastamyk (ubforums) wrote :

⏵ cat /etc/resolv.conf
cat: /etc/resolv.conf: No such file or directory

⏵ cat /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile

[ifupdown]
managed=false

[device]
wifi.scan-rand-mac-address=no

⏵ cat /etc/dnsmasq.conf
## everything is commented out here ##

⏵ cat /run/dnsmasq/resolv.conf
cat: /run/dnsmasq/resolv.conf: No such file or directory

⏵ cat /var/run/dhcpd.leases /run/dhcpd.leases
cat: /var/run/dhcpd.leases: No such file or directory
cat: /run/dhcpd.leases: No such file or directory

(This is an otherwise standard Ubuntu Mate system on Jammy, with dnsmasq and ssh installed.)

Revision history for this message
mixmastamyk (ubforums) wrote (last edit ):

This is over ssh. No one is logged in at desktop, although lightdm is up.

dnsmasq:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
dnsmasq 823 97.7 0.0 15724 2268 ? R 15:22 23:35 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d

Revision history for this message
mixmastamyk (ubforums) wrote :

⏵ journalctl -f

Nothing new coming up.

⏵ cat /var/run/NetworkManager/resolv.conf
# Generated by NetworkManager
search socal.rr.com
nameserver 127.0.0.53
options edns0 trust-ad

Revision history for this message
mixmastamyk (ubforums) wrote :

I don't have to do anything to reproduce it, just boot the machine and it starts happening:

⏵ sudo strace -p 823

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=13, events=POLLIN}], 8, -1) = 1 ([{fd=4, revents=POLLIN}])

recvmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(33704), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=28 => 16, msg_iov=[{iov_base="4\312\1\0\0\1\0\0\0\0\0\0\0014\19\18\0015\1c\1d\0017\0016\0016\1b"..., iov_len=4096}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("lo"), ipi_spec_dst=inet_addr("127.0.0.53"), ipi_addr=inet_addr("127.0.0.53")}}], msg_controllen=32, msg_flags=0}, 0) = 90

ioctl(4, SIOCGIFNAME, {ifr_ifindex=1, ifr_name="lo"}) = 0

sendto(13, "4\312\1\0\0\1\0\0\0\0\0\0\0014\19\18\0015\1c\1d\0017\0016\0016\1b"..., 90, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16) = 90

As mentioned, I think it happens when an address is denied by dnsmasq.

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thanks for the information you provided. I set up a Ubuntu Jammy VM with ssh and dnsmasq running and unfortunately I cannot reproduce it just with that. There might be something else happening in your system.

If you could provide information about all processes running in your system when this happens it would be great. It could at least serve as a lead for us to keep investigating this issue, right now I am not sure what's going on.

Changed in dnsmasq (Ubuntu):
status: New → Incomplete
Revision history for this message
mixmastamyk (ubforums) wrote :

Spent the whole morning on it and finally figured it out—I had dnsmasq looking
at the wrong DNS server because it was looking at the wrong resolv.conf:

    ⏵ cat /var/run/NetworkManager/resolv.conf
    # Generated by NetworkManager
    nameserver 127.0.0.53

I remember now that early on there were too many daemons managing DNS !!
(NetworkManager, systemd-resolved, dnsmasq). Couldn't get them all happy so I
removed systemd-resolved.

That would be fine if there was something running at 127.0.0.53 but there
wasn't since I'd uninstalled it.
I needed the original (though dynamic) resolv.conf and found it at
no-stub-resolv.conf:

    ⏵ head /etc/dnsmasq.d/foo.conf

    address=/#/127.0.0.2
    port=53
    # resolv-file=/var/run/NetworkManager/resolv.conf
    resolv-file=/var/run/NetworkManager/no-stub-resolv.conf

This file has my correct DNS info from the router and now dnsmasq is working
quietly as expected. I see there are other ways to configure things
(NetworkManager can run dnsmasq too!) but I like this way because I can do
everything in just one file. No need to instruct NetworkManager to write
/etc/resolv.conf either.

Was inclined to close this, but thinking about it, dnsmasq should not hammer a
non-existent DNS server to the point of saturating the CPU.

True, the server address was misconfigured but dnsmasq should check *at most* once every few seconds. Definitely needs a delay in there in this case.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thanks for the extra details, I was able to reproduce the problem here.

Here are the steps:

$ lxc launch ubuntu-daily:jammy dnsmasq-cpu-bug
$ lxc shell dnsmasq-cpu-bug
# apt update
# apt install -y dnsmasq
# systemctl disable --now systemd-resolved.service
# systemctl start dnsmasq.service
# dig gnu.org

In another terminal, you can shell into the container and do an "htop" to check that dnsmasq will be using 100% of one CPU core.

I was able to reproduce this all the way back to Focal, but Bionic isn't impacted by this bug.

I believe it's worth trying to report it upstream and seeing if this rings any bells. I'm going to include this bug in our backlog; right now everybody from the Ubuntu Server team is busy with the upcoming Kinetic release.

mixmastamyk, if you feel like reporting this upstream please post the link to the email thread here so that we can keep an eye on it.

Thanks.

Changed in dnsmasq (Ubuntu Focal):
status: New → Triaged
Changed in dnsmasq (Ubuntu Jammy):
status: New → Triaged
Changed in dnsmasq (Ubuntu Kinetic):
status: Incomplete → Triaged
Changed in dnsmasq (Ubuntu Focal):
importance: Undecided → Medium
Changed in dnsmasq (Ubuntu Kinetic):
importance: Undecided → Medium
Changed in dnsmasq (Ubuntu Jammy):
importance: Undecided → Medium
Simon Déziel (sdeziel)
summary: - dnsmasq often using 100% of CPU
+ dnsmasq often using 100% of CPU if the upstream nameserver is
+ unreachable
Revision history for this message
Utkarsh Gupta (utkarsh) wrote :

Ubuntu 22.10 (Kinetic Kudu) has reached end of life, so this bug will not be fixed for that specific release.

Changed in dnsmasq (Ubuntu Kinetic):
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.