ASTERISK-25212: [patch]Segfault when using DEBUG_FD

[Home]

Summary: ASTERISK-25212: [patch]Segfault when using DEBUG_FD_LEAKS

Reporter: Walter Doekes (wdoekes) Labels:

Date Opened: 2015-06-30 09:31:18 Date Closed: 2015-07-02 07:53:20

Priority: Major Regression? No

Status: Closed/Complete Components: Applications/app_dial Channels/chan_sip/General Core/General

Versions: 11.19.0 Frequency of
Occurrence Occasional

Related
Issues:
is related to ASTERISK-25213 [patch]Possibility of deadlock in chan_sip INVITE early Replace code

Environment: Attachments: ( 0) bt.txt
( 1) btall.txt
( 2) btallfull.txt
( 3) btfull.txt
( 4) infothreads.txt
( 5) issueA25212-fix_fdleak_overflow_and_cleanup.patch

Description: [Edit by Rusty - copying part of Walter's comment up here so it will be seen quickly for those first looking at the issue]
Ok. So DETECT_DEADLOCKS wasn't the cause. After disabling it, it still crashes on the same spot.
Same when disabling DEBUG_THREADS.

Details how to reproduce are in ASTERISK-25213. Crash happens with and without an attempt at fixing the refer_locked issue.
[end copy paste]

When attempting to reproduce a deadlock, I enabled DEBUG_THREADS and DETECT_DEADLOCKS.

Instead of detecting the deadlock, sometimes Asterisk crashes instead.

Recent crash contains this:

{noformat}
(gdb) bt
#0 pthread_cond_signal@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S:69
#1 0x0000000000509909 in __ast_cond_signal (filename=0x5c993b "autoservice.c", lineno=222, func=0x5c9a10 <__PRETTY_FUNCTION__.13263> "ast_autoservice_start", cond_name=0x5c9967 "&as_cond",
cond=0x885680 <as_cond>) at lock.c:508
#2 0x0000000000456ec8 in ast_autoservice_start (chan=0x3b31188) at autoservice.c:222
#3 0x00000000004573ba in ast_autoservice_chan_hangup_peer (chan=0x3b31188, peer=0x7eff1038ef68) at autoservice.c:317
#4 0x00007efed32345c7 in dial_exec_full (chan=0x3b31188, data=0x7efec6cc6b10 "SIP/bob", peerflags=0x7efec6cc64c0, continue_exec=0x0) at app_dial.c:3078
#5 0x00007efed323499a in dial_exec (chan=0x3b31188, data=0x7efec6cc6b10 "SIP/bob") at app_dial.c:3130
#6 0x0000000000533b49 in pbx_exec (c=0x3b31188, app=0x27c0440, data=0x7efec6cc6b10 "SIP/bob") at pbx.c:1677
#7 0x000000000053e7c8 in pbx_extension_helper (c=0x3b31188, con=0x0, context=0x3b31fd8 "default", exten=0x3b32028 "bob", priority=1, label=0x0, callerid=0x3df43b0 "alice", action=E_SPAWN,
found=0x7efec6ccabfc, combined_find_spawn=1) at pbx.c:4970
{noformat}

{noformat}
(gdb) disass
Dump of assembler code for function pthread_cond_signal@@GLIBC_2.3.2:
...
0x00007eff16ab3c4e <+62>: je 0x7eff16ab3c67 <pthread_cond_signal@@GLIBC_2.3.2+87>
0x00007eff16ab3c50 <+64>: mov 0x20(%r8),%rcx
=> 0x00007eff16ab3c54 <+68>: mov 0x10(%rcx),%r11d
0x00007eff16ab3c58 <+72>: and $0x30,%r11d
0x00007eff16ab3c5c <+76>: cmp $0x20,%r11d
...
(gdb) up
#1 0x0000000000509909 in __ast_cond_signal (filename=0x5c993b "autoservice.c", lineno=222, func=0x5c9a10 <__PRETTY_FUNCTION__.13263> "ast_autoservice_start", cond_name=0x5c9967 "&as_cond",
cond=0x885680 <as_cond>) at lock.c:508
508 return pthread_cond_signal(cond);
(gdb) print cond
$55 = (ast_cond_t *) 0x885680 <as_cond>
(gdb) print *cond
$56 = {__data = {__lock = 1, __futex = 1638, __total_seq = 8389759012580623496, __wakeup_seq = 7809628147932885606, __woken_seq = 6874024004411486821, __mutex = 0x7069700074696e69, __nwaiters = 101,
__broadcast_seq = 2063597568}, __size = "\001\000\000\000f\006\000\000\210\004\000\000_intfrnal_alertpipe_init\000pipe\000\000\000\000\000\000{", __align = 7035156430849}
(gdb) p *(char(*)[48])cond
$57 = "\001\000\000\000f\006\000\000\210\004\000\000_intfrnal_alertpipe_init\000pipe\000\000\000\000\000\000{"
{noformat}

That doesn't look like valid values in the cond struct.

{noformat}
(gdb) info registers
rax 0xca 202
rbx 0x10 16
rcx 0x7069700074696e69 8100128552057925225
rdx 0x1 1
rsi 0x5 5
rdi 0x885684 8935044
rbp 0x7efec6cc5520 0x7efec6cc5520
rsp 0x7efec6cc54f0 0x7efec6cc54f0
r8 0x885680 8935040
r9 0x843510 8664336
r10 0x7efec4456de0 139632679677408
r11 0x7efec6ccafc8 139632722096072
r12 0x3b32028 62070824
r13 0x1 1
r14 0x3caa942 63613250
r15 0x27c04a0 41682080
rip 0x509909 0x509909 <__ast_cond_signal+39>
eflags 0x10217 [ CF PF AF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
{noformat}

Cheers,
Walter Doekes
OSSO B.V.

Comments: By: Asterisk Team (asteriskteam) 2015-06-30 09:31:19.320-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Walter Doekes (wdoekes) 2015-06-30 10:55:50.239-0500

Ok. So DETECT_DEADLOCKS wasn't the cause. After disabling it, it still crashes on the same spot.
Same when disabling DEBUG_THREADS.

Details how to reproduce are in ASTERISK-25213. Crash happens with and without an attempt at fixing the refer_locked issue.

Just sit back and wait, and it dies after half a minute (at around 600+ calls):
{noformat}
*CLI> Segmentation fault (core dumped)

real 0m23.660s
user 0m1.661s
sys 0m0.609s
{noformat}

Crashes do not always happen in the same spot, cause looks like overwritten memory in all cases. Enabled MALLOC_DEBUG and of course all the crashes went away. *sigh*

(Or was it the DEBUG_FD_LEAKS option?)
By: Walter Doekes (wdoekes) 2015-07-01 05:31:39.045-0500

Hm. Appears to be the DEBUG_FD_LEAKS indeed. After finishing the work on ASTERISK-25213 and observing no crashes all morning, I recompiled with DEBUG_FD_LEAKS.

It did not take long until I got this:
{noformat}
*CLI> ==21778== Thread 25:
==21778== Invalid read of size 4
==21778== at 0x6268C54: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.S:69)
==21778== by 0x456E17: ast_autoservice_start (autoservice.c:222)
==21778== by 0x457309: ast_autoservice_chan_hangup_peer (autoservice.c:317)
==21778== by 0x200BA5C6: dial_exec_full (app_dial.c:3078)
==21778== by 0x200BA999: dial_exec (app_dial.c:3130)
==21778== by 0x52F6EA: pbx_exec (pbx.c:1677)
==21778== by 0x53A369: pbx_extension_helper (pbx.c:4970)
==21778== by 0x53D937: ast_spawn_extension (pbx.c:6100)
==21778== by 0x53F047: __ast_pbx_run (pbx.c:6575)
==21778== by 0x540A8B: pbx_thread (pbx.c:6905)
==21778== by 0x5A000B: dummy_start (utils.c:1223)
==21778== by 0x6264181: start_thread (pthread_create.c:312)
==21778== Address 0x7069700074696e79 is not stack'd, malloc'd or (recently) free'd
{noformat}

{noformat}
==21778== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==21778== General Protection Fault
==21778== at 0x6268C54: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.S:69)
==21778== by 0x456E17: ast_autoservice_start (autoservice.c:222)
==21778== by 0x457309: ast_autoservice_chan_hangup_peer (autoservice.c:317)
==21778== by 0x200BA5C6: dial_exec_full (app_dial.c:3078)
==21778== by 0x200BA999: dial_exec (app_dial.c:3130)
==21778== by 0x52F6EA: pbx_exec (pbx.c:1677)
==21778== by 0x53A369: pbx_extension_helper (pbx.c:4970)
==21778== by 0x53D937: ast_spawn_extension (pbx.c:6100)
==21778== by 0x53F047: __ast_pbx_run (pbx.c:6575)
==21778== by 0x540A8B: pbx_thread (pbx.c:6905)
==21778== by 0x5A000B: dummy_start (utils.c:1223)
==21778== by 0x6264181: start_thread (pthread_create.c:312)
{noformat}
By: Walter Doekes (wdoekes) 2015-07-02 04:16:32.929-0500

This {{issueA25212-fix_fdleak_overflow_and_cleanup.patch}} ought to fix it.

The problem was that the pipe() function did not do any bounds checking, overwriting past the fdleaks static structure.

I also cleaned some, amongst others consistently making use of ARRAY_LEN.