Summary: | ASTERISK-17399: deadlock related to "handle_tcptls_connection" - tcp transport | ||
Reporter: | Luke H (luckman212) | Labels: | |
Date Opened: | 2011-02-13 16:16:48.000-0600 | Date Closed: | 2011-09-22 07:18:31 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Channels/chan_sip/TCP-TLS |
Versions: | 1.8.2 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ( 0) backtrace_1297948105.txt ( 1) backtrace_1297964340.txt ( 2) gdb.txt ( 3) gdb2.txt ( 4) gdb3.txt ( 5) locks_1297948105.txt ( 6) locks_1297964340.txt ( 7) locks.txt ( 8) locks3.txt | |
Description: | I have been experiencing very frequent deadlocks (usually 2-3x per day, sometimes more) in Asterisk 1.8.2.3 since I enabled the TCP transport for one of my endpoints. This *could* be a coincidence but it does seem to correspond with the rest of the debugging info that I have. I'm running Asterisk 1.8.2.3 as part of a Pbx-In-a-Flash distribution (CentOS 5.5) in a SIP-only (no DAHDI hardware) setup. 3 SIP trunks and 4 SIP peers, a small & simple setup. What happens is that my pbx will be humming along and then suddenly I will notice that I can no longer make or receive any calls. The *CLI is still responsive, but I notice that my trunks have stopped refreshing their registrations (sip show registry will indicate Reg.Time not updating). I believe that Asterisk is deadlocking during the registration process for this particular endpoint for which I've enabled the TCP transport. I've compiled Asterisk with DEBUG_THREADS and DONT_OPTIMIZE and when the deadlock occurs, I run the following command: pbx*CLI> core show locks This shows a handful of locks, but the most interesting one is (e.g): === Thread ID: 85511056 (handle_tcptls_connection started at [ 274] tcptls.c ast_tcptls_server_root()) === ---> Waiting for Lock #0 (chan_sip.c): MUTEX 23962 handle_request_do &netlock 0x4977660 (1) /usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x811edbe] /usr/sbin/asterisk(__ast_pthread_mutex_lock+0xaa) [0x81185b9] /usr/lib/asterisk/modules/chan_sip.so [0x4933c89] /usr/lib/asterisk/modules/chan_sip.so [0x48c8c6d] /usr/lib/asterisk/modules/chan_sip.so [0x48c8422] /usr/sbin/asterisk [0x8186000] /usr/sbin/asterisk [0x8195ee1] /lib/libpthread.so.0 [0xadd832] /lib/libc.so.6(clone+0x5e) [0xa1cf6e] === --- ---> Locked Here: chan_sip.c line 23962 (handle_request_do) === ------------------------------------------------------------------- === === Thread ID: 86985616 (handle_tcptls_connection started at [ 274] tcptls.c ast_tcptls_server_root()) === ---> Waiting for Lock #0 (chan_sip.c): MUTEX 23962 handle_request_do &netlock 0x4977660 (1) /usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x811edbe] /usr/sbin/asterisk(__ast_pthread_mutex_lock+0xaa) [0x81185b9] /usr/lib/asterisk/modules/chan_sip.so [0x4933c89] /usr/lib/asterisk/modules/chan_sip.so [0x48c8c6d] /usr/lib/asterisk/modules/chan_sip.so [0x48c8422] /usr/sbin/asterisk [0x8186000] /usr/sbin/asterisk [0x8195ee1] /lib/libpthread.so.0 [0xadd832] /lib/libc.so.6(clone+0x5e) [0xa1cf6e] === --- ---> Locked Here: chan_sip.c line 23962 (handle_request_do) === ------------------------------------------------------------------- === ======================================================================= I have gdb installed but my skills w/ it are severely limited and I don't know how to debug further. I think there is a chance that this issue has been reported before, at https://issues.asterisk.org/view.php?id=17675 However that was with 1.6.x and the issue had been closed, so I am opening this here in hopes that it can be debugged. | ||
Comments: | By: Luke H (luckman212) 2011-02-15 08:50:47.000-0600 Just happened again. This time I did a backtrace, which I've attached. Also, I want to add that I think this may also be related to res_timing_pthread. I do not have res_timing_timerfd available (CentOS 5.5 kernel too old) and res_timing_dahdi causes MusicOnHold to stop working so I've disabled it. I only have res_timing_pthread. By: Luke H (luckman212) 2011-02-15 15:22:35.000-0600 Just deadlocked again. This time I also have the output of "core show locks" to go along with the backtrace. Want to add some more possibly relevant info: On this particular system (CPU=Atom230) my hardware clocksource in grub.conf is "acpi_pm". I don't know if that affects timing but it would make sense to me if the answer was YES. I have hpet, tsc, jiffies, and pit vailable as well. If you believe this to be the root of the problem then I will change & retest. By: Luke H (luckman212) 2011-02-16 11:21:39.000-0600 Happened again, Added another backtrace & locks output. By: David Woolley (davidw) 2011-02-16 11:46:10.000-0600 I'm not familiar with the internals of res_timing_pthread, but a blocked write seems to be part of the deadlock cycle. My guess is that you will find something like it is using a pipe to deliver the timing pulses, but the pipe has filled and the the bit that would read events off the pipe is waiting for a lock held by the thread writing to the pipe. By: Luke H (luckman212) 2011-02-17 07:16:30.000-0600 I realized I don't think I was capturing "full" backtraces. now I am issuing bt bt full thread apply all bt here's the latest backtrace & core-show-locks output. By: Luke H (luckman212) 2011-02-17 11:58:47.000-0600 Adding another locks & backtrace. Is it helpful if keep submitting these or is it now redundant? By: Luke H (luckman212) 2011-02-20 21:41:55.000-0600 Is this possibly related to https://issues.asterisk.org/view.php?id=18845 ? By: Russell Bryant (russell) 2011-03-11 12:27:44.000-0600 Please try res_timing_dahdi now. I believe the MOH issue you mentioned has been fixed. By: Luke H (luckman212) 2011-04-20 14:58:18 I have now updated the following: DAHDI: 2.4.1.2+2.4.1 Wanpipe: 1.0.11 Asterisk: 1.8.3.2 (release) Sadly, the MOH issue is not fixed. Should I try the latest SVN? By: Leif Madsen (lmadsen) 2011-05-16 08:36:26 Please try the latest SVN to determine if the issue has already been resolved. By: Luke H (luckman212) 2011-05-16 22:03:29 russel, lmadsen: i've compiled SVN-branch-1.8-r319261M and tested, I am sure it does not fix the 'streaming MOH goes silent after moh reload' issue. As for the deadlocks with res_timing_pthread when res_timing_dahdi is disabled + using TCP transport, I can't say because I've since re-enabled res_timing_dahdi. By: Leif Madsen (lmadsen) 2011-09-22 07:18:31.831-0500 I'm seeing the use of res_timing_pthread in all the logs, so I'm closing this issue out as res_timing_pthread has known issues that appear as a deadlock. |