Summary: | ASTERISK-25212: [patch]Segfault when using DEBUG_FD_LEAKS | ||||
Reporter: | Walter Doekes (wdoekes) | Labels: | |||
Date Opened: | 2015-06-30 09:31:18 | Date Closed: | 2015-07-02 07:53:20 | ||
Priority: | Major | Regression? | No | ||
Status: | Closed/Complete | Components: | Applications/app_dial Channels/chan_sip/General Core/General | ||
Versions: | 11.19.0 | Frequency of Occurrence | Occasional | ||
Related Issues: |
| ||||
Environment: | Attachments: | ( 0) bt.txt ( 1) btall.txt ( 2) btallfull.txt ( 3) btfull.txt ( 4) infothreads.txt ( 5) issueA25212-fix_fdleak_overflow_and_cleanup.patch | |||
Description: | [Edit by Rusty - copying part of Walter's comment up here so it will be seen quickly for those first looking at the issue]
Ok. So DETECT_DEADLOCKS wasn't the cause. After disabling it, it still crashes on the same spot. Same when disabling DEBUG_THREADS. Details how to reproduce are in ASTERISK-25213. Crash happens with and without an attempt at fixing the refer_locked issue. [end copy paste] When attempting to reproduce a deadlock, I enabled DEBUG_THREADS and DETECT_DEADLOCKS. Instead of detecting the deadlock, sometimes Asterisk crashes instead. Recent crash contains this: {noformat} (gdb) bt #0 pthread_cond_signal@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S:69 #1 0x0000000000509909 in __ast_cond_signal (filename=0x5c993b "autoservice.c", lineno=222, func=0x5c9a10 <__PRETTY_FUNCTION__.13263> "ast_autoservice_start", cond_name=0x5c9967 "&as_cond", cond=0x885680 <as_cond>) at lock.c:508 #2 0x0000000000456ec8 in ast_autoservice_start (chan=0x3b31188) at autoservice.c:222 #3 0x00000000004573ba in ast_autoservice_chan_hangup_peer (chan=0x3b31188, peer=0x7eff1038ef68) at autoservice.c:317 #4 0x00007efed32345c7 in dial_exec_full (chan=0x3b31188, data=0x7efec6cc6b10 "SIP/bob", peerflags=0x7efec6cc64c0, continue_exec=0x0) at app_dial.c:3078 #5 0x00007efed323499a in dial_exec (chan=0x3b31188, data=0x7efec6cc6b10 "SIP/bob") at app_dial.c:3130 #6 0x0000000000533b49 in pbx_exec (c=0x3b31188, app=0x27c0440, data=0x7efec6cc6b10 "SIP/bob") at pbx.c:1677 #7 0x000000000053e7c8 in pbx_extension_helper (c=0x3b31188, con=0x0, context=0x3b31fd8 "default", exten=0x3b32028 "bob", priority=1, label=0x0, callerid=0x3df43b0 "alice", action=E_SPAWN, found=0x7efec6ccabfc, combined_find_spawn=1) at pbx.c:4970 {noformat} {noformat} (gdb) disass Dump of assembler code for function pthread_cond_signal@@GLIBC_2.3.2: ... 0x00007eff16ab3c4e <+62>: je 0x7eff16ab3c67 <pthread_cond_signal@@GLIBC_2.3.2+87> 0x00007eff16ab3c50 <+64>: mov 0x20(%r8),%rcx => 0x00007eff16ab3c54 <+68>: mov 0x10(%rcx),%r11d 0x00007eff16ab3c58 <+72>: and $0x30,%r11d 0x00007eff16ab3c5c <+76>: cmp $0x20,%r11d ... (gdb) up #1 0x0000000000509909 in __ast_cond_signal (filename=0x5c993b "autoservice.c", lineno=222, func=0x5c9a10 <__PRETTY_FUNCTION__.13263> "ast_autoservice_start", cond_name=0x5c9967 "&as_cond", cond=0x885680 <as_cond>) at lock.c:508 508 return pthread_cond_signal(cond); (gdb) print cond $55 = (ast_cond_t *) 0x885680 <as_cond> (gdb) print *cond $56 = {__data = {__lock = 1, __futex = 1638, __total_seq = 8389759012580623496, __wakeup_seq = 7809628147932885606, __woken_seq = 6874024004411486821, __mutex = 0x7069700074696e69, __nwaiters = 101, __broadcast_seq = 2063597568}, __size = "\001\000\000\000f\006\000\000\210\004\000\000_intfrnal_alertpipe_init\000pipe\000\000\000\000\000\000{", __align = 7035156430849} (gdb) p *(char(*)[48])cond $57 = "\001\000\000\000f\006\000\000\210\004\000\000_intfrnal_alertpipe_init\000pipe\000\000\000\000\000\000{" {noformat} That doesn't look like valid values in the cond struct. {noformat} (gdb) info registers rax 0xca 202 rbx 0x10 16 rcx 0x7069700074696e69 8100128552057925225 rdx 0x1 1 rsi 0x5 5 rdi 0x885684 8935044 rbp 0x7efec6cc5520 0x7efec6cc5520 rsp 0x7efec6cc54f0 0x7efec6cc54f0 r8 0x885680 8935040 r9 0x843510 8664336 r10 0x7efec4456de0 139632679677408 r11 0x7efec6ccafc8 139632722096072 r12 0x3b32028 62070824 r13 0x1 1 r14 0x3caa942 63613250 r15 0x27c04a0 41682080 rip 0x509909 0x509909 <__ast_cond_signal+39> eflags 0x10217 [ CF PF AF IF RF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 {noformat} Cheers, Walter Doekes OSSO B.V. | ||||
Comments: | By: Asterisk Team (asteriskteam) 2015-06-30 09:31:19.320-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Walter Doekes (wdoekes) 2015-06-30 10:55:50.239-0500 Ok. So DETECT_DEADLOCKS wasn't the cause. After disabling it, it still crashes on the same spot. Same when disabling DEBUG_THREADS. Details how to reproduce are in ASTERISK-25213. Crash happens with and without an attempt at fixing the refer_locked issue. Just sit back and wait, and it dies after half a minute (at around 600+ calls): {noformat} *CLI> Segmentation fault (core dumped) real 0m23.660s user 0m1.661s sys 0m0.609s {noformat} Crashes do not always happen in the same spot, cause looks like overwritten memory in all cases. Enabled MALLOC_DEBUG and of course all the crashes went away. *sigh* (Or was it the DEBUG_FD_LEAKS option?) By: Walter Doekes (wdoekes) 2015-07-01 05:31:39.045-0500 Hm. Appears to be the DEBUG_FD_LEAKS indeed. After finishing the work on ASTERISK-25213 and observing no crashes all morning, I recompiled with DEBUG_FD_LEAKS. It did not take long until I got this: {noformat} *CLI> ==21778== Thread 25: ==21778== Invalid read of size 4 ==21778== at 0x6268C54: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.S:69) ==21778== by 0x456E17: ast_autoservice_start (autoservice.c:222) ==21778== by 0x457309: ast_autoservice_chan_hangup_peer (autoservice.c:317) ==21778== by 0x200BA5C6: dial_exec_full (app_dial.c:3078) ==21778== by 0x200BA999: dial_exec (app_dial.c:3130) ==21778== by 0x52F6EA: pbx_exec (pbx.c:1677) ==21778== by 0x53A369: pbx_extension_helper (pbx.c:4970) ==21778== by 0x53D937: ast_spawn_extension (pbx.c:6100) ==21778== by 0x53F047: __ast_pbx_run (pbx.c:6575) ==21778== by 0x540A8B: pbx_thread (pbx.c:6905) ==21778== by 0x5A000B: dummy_start (utils.c:1223) ==21778== by 0x6264181: start_thread (pthread_create.c:312) ==21778== Address 0x7069700074696e79 is not stack'd, malloc'd or (recently) free'd {noformat} {noformat} ==21778== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==21778== General Protection Fault ==21778== at 0x6268C54: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.S:69) ==21778== by 0x456E17: ast_autoservice_start (autoservice.c:222) ==21778== by 0x457309: ast_autoservice_chan_hangup_peer (autoservice.c:317) ==21778== by 0x200BA5C6: dial_exec_full (app_dial.c:3078) ==21778== by 0x200BA999: dial_exec (app_dial.c:3130) ==21778== by 0x52F6EA: pbx_exec (pbx.c:1677) ==21778== by 0x53A369: pbx_extension_helper (pbx.c:4970) ==21778== by 0x53D937: ast_spawn_extension (pbx.c:6100) ==21778== by 0x53F047: __ast_pbx_run (pbx.c:6575) ==21778== by 0x540A8B: pbx_thread (pbx.c:6905) ==21778== by 0x5A000B: dummy_start (utils.c:1223) ==21778== by 0x6264181: start_thread (pthread_create.c:312) {noformat} By: Walter Doekes (wdoekes) 2015-07-02 04:16:32.929-0500 This {{issueA25212-fix_fdleak_overflow_and_cleanup.patch}} ought to fix it. The problem was that the pipe() function did not do any bounds checking, overwriting past the fdleaks static structure. I also cleaned some, amongst others consistently making use of ARRAY_LEN. |