sssd got killed due to segfault in ubuntu 16.04

Bug #1883614 reported by Kamal Sahoo
This bug affects 2 people
Affects         Status      Importance  Assigned to  Milestone
sssd (Ubuntu)   Incomplete  Undecided   Unassigned
Xenial          Won't Fix   Undecided   Unassigned

Bug Description

SSSD.LOG
--------------
(Sun Jun 14 20:23:53 2020) [sssd] [mt_svc_sigkill] (0x0010): [pam][13305] is not responding to SIGTERM. Sending SIGKILL.
(Sun Jun 14 20:29:34 2020) [sssd] [monitor_restart_service] (0x0010): Process [nss], definitely stopped!

apport.log:
--------------
ERROR: apport (pid 623) Sun Jun 14 20:25:21 2020: Unhandled exception:
Traceback (most recent call last):
  File "/usr/share/apport/apport", line 515, in <module>
    get_pid_info(pid)
  File "/usr/share/apport/apport", line 62, in get_pid_info
    proc_pid_fd = os.open('/proc/%s' % pid, os.O_RDONLY | os.O_PATH | os.O_DIRECTORY)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/13305'
ERROR: apport (pid 623) Sun Jun 14 20:25:21 2020: pid: 623, uid: 0, gid: 0, euid: 0, egid: 0
ERROR: apport (pid 623) Sun Jun 14 20:25:21 2020: environment: environ({})
root@gamma13:/var/log# ps -fp 13305
UID PID PPID C STIME TTY TIME CMD

syslog
--------------
Jun 14 20:20:32 gamma13 sssd[be[myorg]]: Starting up
Jun 14 20:22:06 gamma13 kernel: [2543859.316724] sssd_pam[13305]: segfault at a4 ip 00007f0f77329989 sp 00007fff35844480 error 4 in libdbus-1.so.3.14.6[7f0f772ff000+4b000]
Jun 14 20:22:06 gamma13 sssd[be[myorg]]: Starting up
Jun 14 20:22:53 gamma13 sssd: Killing service [pam], not responding to pings!
Jun 14 20:23:53 gamma13 sssd: [pam][13305] is not responding to SIGTERM. Sending SIGKILL.
Jun 14 20:23:58 gamma13 sssd[pam]: Starting up
Jun 14 20:24:27 gamma13 smtpd[1732]: smtp-in: session 689f0b74a7b74828: connection from host gamma13.internal.myorg.com.internal.myorg.com [local] established
Jun 14 20:25:01 gamma13 CRON[1041]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 14 20:25:01 gamma13 CRON[1042]: (root) CMD (/usr/sbin/icsisnap /var/lib/icsisnap)
Jun 14 20:27:57 gamma13 sssd[be[myorg]]: Starting up
Jun 14 20:27:58 gamma13 systemd[1]: Started Session 16859 of user kamals.
Jun 14 20:29:18 gamma13 sssd[be[myorg]]: Starting up
Jun 14 20:29:28 gamma13 sssd[nss]: Starting up
Jun 14 20:29:30 gamma13 sssd[nss]: Starting up
Jun 14 20:29:37 gamma13 sssd[nss]: Starting up
Jun 14 20:29:37 gamma13 sssd: Exiting the SSSD. Could not restart critical service [nss].
Jun 14 20:29:44 gamma13 sssd[be[myorg]]: Shutting down
Jun 14 20:29:44 gamma13 sssd[pam]: Shutting down
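
The kernel segfault line in the syslog excerpt above can be decoded a little further. In that line, `segfault at a4` is the faulting address, `ip` is the instruction pointer, and the bracketed pair after the library name is the base address and size of the mapping that contains `ip`. On x86, `error 4` means a user-mode read of a page that was not present. As a rough sketch (the helper name `segv_offset` is made up for illustration), subtracting the base from `ip` gives the offset of the faulting instruction inside libdbus:

```shell
# Hypothetical helper: offset of the faulting instruction inside the mapping.
# Usage: segv_offset <ip> <mapping base>   (both hex, without the 0x prefix)
segv_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values from the sssd_pam line above:
#   ip 00007f0f77329989 ... in libdbus-1.so.3.14.6[7f0f772ff000+4b000]
segv_offset 00007f0f77329989 7f0f772ff000    # prints 0x2a989
```

With the libdbus debug symbols installed, that offset could be passed to `addr2line -f -e <path to libdbus-1.so.3.14.6> 0x2a989` to name the function, assuming the mapping shown starts at file offset 0. The fault address 0xa4 is small, which would be consistent with reading a structure field through a NULL pointer, though the log alone cannot confirm that.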

[Another server had this log

Jun 9 21:12:52 grid kernel: [5088481.338650] rpcmgr[1409]: segfault at 7fa5541c1d13 ip 00007fa5dcb5be8f sp 00007fa5d35ccc80 error 4 in libpthread-2.23.so[7fa5dcb54000+18000]

]

kamals@gamma13:~$ uname -r
4.4.0-178-generic
kamals@gamma13:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"

Hi, sssd got killed for the second time with a segfault; the log above is from the latest sssd shutdown.

Please help me fix this issue.

Tags: bot-comment
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1883614/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → sssd (Ubuntu)
Changed in sssd (Ubuntu):
status: New → Opinion
status: Opinion → New
Kamal Sahoo (ksahoo-sifive) wrote :

Is anyone looking into this, or should I close it?

Paride Legovini (paride) wrote :

Hi and thanks for this bug report. The segmentation fault is certainly something that should not happen, but there isn't really enough information here for a developer to begin working on it.

Are you able to identify anything that can trigger the crash?

Is anything else crashing in a similar way on the same system? Random segmentation faults may indicate a memory corruption issue.

Did the rpcmgr segfault happen on the same machine or on a different one, as the system hostname in the log snippet suggests? If it happened on a different machine, do you have any reason to think it's related to the sssd crash?

Could you please include a fuller version of the system logs?

I'm marking this bug report as Incomplete for the moment. Please change its status back to New after commenting back and we'll look at it again. Thanks!

Changed in sssd (Ubuntu):
status: New → Incomplete
Kamal Sahoo (ksahoo-sifive) wrote :

Are you able to identify anything that can trigger the crash?
- Nothing apart from the segfault itself: no load, no OOM, no I/O pressure, etc. The machine was really quiet when it crashed; most of the activity seems to come after the event and is likely the sssd restart.

Is anything else crashing in a similar way on the same system? Random segmentation faults may indicate a memory corruption issue.
- No, only SSSD.

Did the rpcmgr segfault happen on the same machine or on a different one, as the system hostname in the log snippet suggests? If it happened on a different machine, do you have any reason to think it's related to the sssd crash?
- It's from a different machine. I found it logged just a second before `sssd` got killed.

Could you please include a fuller version of the system logs?
- Sure.

Changed in sssd (Ubuntu):
status: Incomplete → New
Sergio Durigan Junior (sergiodj) wrote :

Thank you, KK.

It is unfortunate that we are not able to reproduce the bug here. Would you happen to have the coredump file that was generated by the crash? That would help us a lot. If you have neither the coredump nor steps to reproduce the bug, could you please make sure to attach a coredump when the crash happens again?

Without either of those, we can't really act on it. Therefore, I am marking this as Incomplete again.

Another thing worth mentioning is that this bug happens on a quite old Ubuntu release, so if you can't reproduce it, maybe you would like to give it a try with a newer Ubuntu release.

Changed in sssd (Ubuntu):
status: New → Incomplete
Kamal Sahoo (ksahoo-sifive) wrote :

Hi Sergio,

There is no core dump, and I am not aware of how to generate one. Is there anything you can suggest to stop sssd from getting killed repeatedly on multiple servers running Ubuntu 16.04 (4.4.0-178-generic)? If you know of a fix, could you please guide me through the steps? The fix must not crash the machine.

Thank you.

Changed in sssd (Ubuntu):
status: Incomplete → New
Lucas Kanashiro (lucaskanashiro) wrote :

Unfortunately, we cannot guide you through a fix or a workaround because we are not able to reproduce the bug. If by any means you can provide a core dump as Sergio mentioned (or better, step-by-step instructions to reproduce it), you can come back to this bug and we will be happy to investigate this crash.

I am marking this bug as Incomplete again since it is not actionable on our side. If you have any information that can help us investigate this bug, please change the status to New again.

Changed in sssd (Ubuntu):
status: New → Incomplete
Kamal Sahoo (ksahoo-sifive) wrote :

sssd crashed again. I am not seeing any issue with the machine; system resources were and are free, less than 30% utilized.

Aug 7 10:06:12 gamma14 sssd: Killing service [myOrg], not responding to pings!
Aug 7 10:06:22 gamma14 systemd[1]: sssd.service: Main process exited, code=dumped, status=6/ABRT
Aug 7 10:06:22 gamma14 sssd[pam]: Shutting down
Aug 7 10:06:22 gamma14 sssd[nss]: Shutting down
Aug 7 10:06:24 gamma14 systemd[1]: sssd.service: Unit entered failed state.
Aug 7 10:06:24 gamma14 systemd[1]: sssd.service: Failed with result 'core-dump'.
Aug 7 10:09:02 gamma14 CRON[1031]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && /usr/lib/php/sessionclean)
Aug 7 10:10:01 gamma14 CRON[1586]: (root) CMD (/usr/sbin/icsisnap /var/lib/icsisnap)

root@gamma14:/var/log/sssd# cat sssd.log
(Fri Aug 7 10:06:12 2020) [sssd] [talloc_log_fn] (0x0010): Bad talloc magic value - unknown value

These are the logs I found; apart from them, I found nothing suspicious anywhere.

Core dumps are not generated because:

|10:47:56|kamals@gamma14:[crash]> ulimit -a
core file size (blocks, -c) 0

core file size of "0" means that core files won't get generated at all.
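
Note that `ulimit -a` in a login shell reports the limits of that shell, which are not necessarily the ones systemd applies to sssd. A minimal sketch for checking a running process instead, assuming a Linux `/proc` filesystem (the helper name `core_limit_of` is hypothetical):

```shell
# Hypothetical helper: print the soft and hard core-size limits of a process,
# as recorded in /proc/<pid>/limits.
core_limit_of() {
    awk '/^Max core file size/ { print "soft=" $5, "hard=" $6 }' "/proc/$1/limits"
}

# Example: limits of the current shell. For sssd, pass the PID of the
# service instead, e.g. the one shown by: systemctl show -p MainPID sssd
core_limit_of $$
```

If the soft limit reported for the service is 0, a systemd drop-in for the unit that sets `LimitCORE=infinity` under `[Service]`, followed by `systemctl daemon-reload` and a restart of sssd, is one way to raise it.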

Please confirm whether "Bad talloc magic value - unknown value" is the issue, and what it means.

Changed in sssd (Ubuntu):
status: Incomplete → New
Kamal Sahoo (ksahoo-sifive) wrote :

Attaching the file generated in /var/crash

Kamal Sahoo (ksahoo-sifive) wrote :

Attached the file generated in /var/crash when sssd died.

Changed in sssd (Ubuntu Xenial):
status: New → Confirmed
Changed in sssd (Ubuntu):
status: New → Triaged
Rafael David Tinoco (rafaeldtinoco) wrote :

Unfortunately, it looks like the crash file does not contain a valid dump:

(gdb) bt
#0 __GI_raise (sig=7551) at ../sysdeps/unix/sysv/linux/raise.c:37
#1 0x00007f63fb16d02a in __GI_abort () at abort.c:87
#2 0x00000000023953f0 in ?? ()
#3 0x000000000239bd80 in ?? ()
#4 0x0000000000000001 in ?? ()
#5 0x0000000000000000 in ?? ()

The stack unwinding is corrupted, and in

#0 __GI_raise (sig=7551) at ../sysdeps/unix/sysv/linux/raise.c:37
        pid = 0
        selftid = 0

the values are scrambled (frame #1 seems to have a correctly mapped page address, according to ProcMaps).

judging by the message:

---

(Fri Aug 7 10:06:12 2020) [sssd] [talloc_log_fn] (0x0010): Bad talloc magic value - unknown value

it is very likely that an attempt was made to realloc or free a buffer previously allocated by talloc, and the talloc code found that the buffer's magic value was corrupted, indicating a memory issue in the consumer (sssd in this case).

----

Ideas for debugging it:

----

Attempt #1
==========

sysctl -w kernel.core_pattern=core

would make the kernel write core dumps directly, without going through apport (to see if it helps). I also tried to find other dumps in our crash repository at:

https://errors.ubuntu.com/?release=Ubuntu 16.04&package=sssd&period=year

but there weren't any other failures we could use.
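
How the kernel treats a given `kernel.core_pattern` value follows core(5): a leading `|` pipes the dump to a program (this is how apport hooks in), an absolute path is used as a file name template, and anything else is created relative to the crashing process's working directory. The sketch below only classifies a pattern; `describe_core_pattern` is a made-up name.

```shell
# Hypothetical helper: say what the kernel will do with a core_pattern value.
describe_core_pattern() {
    case $1 in
        '|'*) echo "piped: ${1#|}" ;;
        /*)   echo "absolute: $1" ;;
        *)    echo "relative: $1" ;;
    esac
}

# Classify the value currently in effect (readable without root).
describe_core_pattern "$(cat /proc/sys/kernel/core_pattern 2>/dev/null)"
```

With the plain value `core`, the file ends up in the working directory of the crashing sssd process, which for a daemon may well be `/`. An absolute template such as `/var/tmp/core.%e.%p` (`%e` is the executable name, `%p` the PID) makes the dump easier to find; that path is only an example.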

----

Attempt #2
==========

Another possibility would be to run valgrind and discover the issue:

Put this in your bashrc:
--
# debug symbols

getsymbols() {

    binfile=$1

    # For each library the binary links against, find the owning package,
    # then install whichever -dbg / -dbgsym packages exist for it.
    for pkg in $(for file in $(ldd "$binfile" | awk '{print $1}' | xargs); do dpkg -S "$file" 2>/dev/null ; done | awk '{print $1}' | sed 's:\:.*\:::g' | sort -u); do apt-cache pkgnames | grep -E "($pkg-dbg$|$pkg-dbgsym$)" ; done | xargs sudo apt-get install -y

}
--

and enable:

deb http://ddebs.ubuntu.com xenial main restricted universe multiverse
deb http://ddebs.ubuntu.com xenial-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com xenial-proposed main restricted universe multiverse

in your /etc/apt/sources.list, then run apt-get update.
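
The three ddebs lines differ only in the pocket suffix, so they can be generated for whatever release is installed. A small sketch (the function name `ddebs_sources` and the target file name in the comment are assumptions, not part of the instructions above):

```shell
# Print the ddebs (debug symbol archive) entries for a release codename.
ddebs_sources() {
    for pocket in "" -updates -proposed; do
        echo "deb http://ddebs.ubuntu.com $1$pocket main restricted universe multiverse"
    done
}

ddebs_sources xenial
# To use a separate file rather than editing sources.list directly:
#   ddebs_sources "$(lsb_release -cs)" | sudo tee /etc/apt/sources.list.d/ddebs.list
```

apt will also need to trust the ddebs archive signing key before `apt-get update` accepts the new entries.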

Then you do:

getsymbols /usr/sbin/sssd

it will install a bunch of "dbg and dbgsym" packages (for sssd dependencies). At the end you execute:

apt-get install sssd-common-dbgsym sssd-ad-common-dbgsym sssd-ad-dbgsym sssd-dbus-dbgsym sssd-ipa-dbgsym sssd-krb5-common-dbgsym sssd-ldap-dbgsym sssd-proxy-dbgsym

and you will have all debugging symbols installed for sssd and its dependencies. Then, instead of starting sssd from systemd, you run it with valgrind:

"""
$ sudo valgrind --tool=memcheck --trace-children=yes --leak-check=yes --leak-resolution=med --show-leak-kinds=definite --track-origins=yes /usr/sbin/sssd -i -f
"""

This will generate output that you can save and attach here. Hopefully, with all debug symbols in place, valgrind's memcheck will tell us where the memory corruption happened (including a stack trace).

Feel free to attach the valgrind output file to this case.
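
To keep the memcheck output in files that can be attached, valgrind's `--log-file` option can be added to the command above; `%p` in the file name expands to the PID, so with `--trace-children=yes` each sssd child process gets its own log. The sketch below then pulls out the lines most relevant to heap corruption. The log path and the helper name `summarize_memcheck` are assumptions.

```shell
# Assumed invocation (options abbreviated, plus a per-process log file):
#   sudo valgrind --tool=memcheck --trace-children=yes --track-origins=yes \
#       --log-file=/var/tmp/valgrind-sssd.%p.log /usr/sbin/sssd -i -f

# Hypothetical helper: show invalid accesses/frees and the error summary
# from one or more memcheck log files, prefixed with the file name.
summarize_memcheck() {
    grep -HE 'Invalid (read|write|free)|ERROR SUMMARY' "$@"
}
```

A call such as `summarize_memcheck /var/tmp/valgrind-sssd.*.log` gives a quick overview of which process logged errors; the full log files are still what should be attached, since the stack traces follow the matched lines.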

Rafael David Tinoco (rafaeldtinoco) wrote :

For the record only: between 1.13.4 and 1.14.2 (the next version after xenial) we have:

2dcf7b9 NSS: Fix NSS responder to cope with fully-qualified usernames
1510d12 test_ipa_subdom_server: Workaround for slow krb5 + SELinux
c19374b cache_req tests: use leak check in test fixtures
2dd75ea TOOLS: Fix memory leak after getline() failed
556801e SUDO: fix potential memory leak in sdap_sudo_init
343b053 NSS: fix a use-after-free issue
1584db9 intg: fix assert messages in test_memory_cache
8c9ecf0 PROXY: fix minor memory leak
3fa03d5 SDAP: fix minor memory leak
a2d6d4d IPA: fix minor memory leak
12440d2 AD: fix minor memory leak

and from 1.14.2 to 1.16.5 (the latest release of the 1.x series) we have:

acce032 lib/cifs_idmap_sss: fixed unaligned mem access
d1c9308 tests: fix mocking krb5_creds in test_copy_ccache
f62d2af CRYPTO: Save prefix in s3crypt_sha512
6191cf8 KCM: be aware that size_t might have different size than other integers
e354ec7 DP/LDAP: Only increase the initgrTimestamp when the full initgroups DP request finishes
fd17e09 mmap_cache: Remove unnecessary memchr in client code
2951a9a CRYPTO: Suppress warning Wstringop-truncation
613a832 KCM: Add some forgotten NULL checks
f772649 KRB5: Fix access_provider=krb5
327a166 PAM: fix memory leak in pam_sss
e6a5f8c WATCHDOG: Avoid non async-signal-safe from the signal_handler
f78b2dd krb5: fix two memory leaks

as commits that *could* be related.

Rafael David Tinoco (rafaeldtinoco) wrote :

Oops, in comment #11 I forgot to ask you to install "valgrind" before running it, but I think by now you have already figured that out.

Kamal Sahoo (ksahoo-sifive) wrote :

Thank you so much Rafael. Will try and update you. Thanks

Christian Ehrhardt  (paelzer) wrote :

I've just dup'ed another new case onto this, maybe this will revive the old case.

@KK if you happened to have found a solution/root-cause and thereby didn't come back here this is the chance to update the case :-)

And while touching the case, I just wanted to restate that various problems in this kind of setup all seem to show the same surface symptom, "Bad talloc magic value".
But in those cases debugging revealed a different issue each time.
Examples:
https://pagure.io/SSSD/sssd/issue/3523
https://bugzilla.redhat.com/show_bug.cgi?id=1577335
https://bugzilla.redhat.com/show_bug.cgi?id=1502686

Sergio Durigan Junior (sergiodj) wrote :

Xenial is in ESM state right now, and as such there are no immediate plans to fix this bug in it.

Changed in sssd (Ubuntu Xenial):
status: Confirmed → Won't Fix
Bryce Harrington (bryce) wrote :

Is this reproducible on any newer Ubuntu releases?

Changed in sssd (Ubuntu):
status: Triaged → Incomplete