Summary:ASTERISK-17399: deadlock related to "handle_tcptls_connection" - tcp transport
Reporter:Luke H (luckman212)Labels:
Date Opened:2011-02-13 16:16:48.000-0600Date Closed:2011-09-22 07:18:31
Versions:1.8.2 Frequency of
Environment:Attachments:( 0) backtrace_1297948105.txt
( 1) backtrace_1297964340.txt
( 2) gdb.txt
( 3) gdb2.txt
( 4) gdb3.txt
( 5) locks_1297948105.txt
( 6) locks_1297964340.txt
( 7) locks.txt
( 8) locks3.txt
Description:I have been experiencing very frequent deadlocks (usually 2-3x per day, sometimes more) in Asterisk since I enabled the TCP transport for one of my endpoints.  This *could* be a coincidence but it does seem to correspond with the rest of the debugging info that I have.

I'm running Asterisk as part of a Pbx-In-a-Flash distribution (CentOS 5.5) in a SIP-only (no DAHDI hardware) setup.  3 SIP trunks and 4 SIP peers, a small & simple setup.

What happens is that my pbx will be humming along and then suddenly I will notice that I can no longer make or receive any calls.  The *CLI is still responsive, but I notice that my trunks have stopped refreshing their registrations (sip show registry will indicate Reg.Time not updating).  I believe that Asterisk is deadlocking during the registration process for this particular endpoint for which I've enabled the TCP transport.

I've compiled Asterisk with DEBUG_THREADS and DONT_OPTIMIZE and when the deadlock occurs, I run the following command:

pbx*CLI> core show locks

This shows a handful of locks, but the most interesting one is (e.g):

=== Thread ID: 85511056 (handle_tcptls_connection started at [  274] tcptls.c ast_tcptls_server_root())
=== ---> Waiting for Lock #0 (chan_sip.c): MUTEX 23962 handle_request_do &netlock 0x4977660 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x811edbe]
       /usr/sbin/asterisk(__ast_pthread_mutex_lock+0xaa) [0x81185b9]
       /usr/lib/asterisk/modules/chan_sip.so [0x4933c89]
       /usr/lib/asterisk/modules/chan_sip.so [0x48c8c6d]
       /usr/lib/asterisk/modules/chan_sip.so [0x48c8422]
       /usr/sbin/asterisk [0x8186000]
       /usr/sbin/asterisk [0x8195ee1]
       /lib/libpthread.so.0 [0xadd832]
       /lib/libc.so.6(clone+0x5e) [0xa1cf6e]
=== --- ---> Locked Here: chan_sip.c line 23962 (handle_request_do)
=== -------------------------------------------------------------------
=== Thread ID: 86985616 (handle_tcptls_connection started at [  274] tcptls.c ast_tcptls_server_root())
=== ---> Waiting for Lock #0 (chan_sip.c): MUTEX 23962 handle_request_do &netlock 0x4977660 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x811edbe]
       /usr/sbin/asterisk(__ast_pthread_mutex_lock+0xaa) [0x81185b9]
       /usr/lib/asterisk/modules/chan_sip.so [0x4933c89]
       /usr/lib/asterisk/modules/chan_sip.so [0x48c8c6d]
       /usr/lib/asterisk/modules/chan_sip.so [0x48c8422]
       /usr/sbin/asterisk [0x8186000]
       /usr/sbin/asterisk [0x8195ee1]
       /lib/libpthread.so.0 [0xadd832]
       /lib/libc.so.6(clone+0x5e) [0xa1cf6e]
=== --- ---> Locked Here: chan_sip.c line 23962 (handle_request_do)
=== -------------------------------------------------------------------

I have gdb installed but my skills w/ it are severely limited and I don't know how to debug further.  

I think there is a chance that this issue has been reported before, at

However that was with 1.6.x and the issue had been closed, so I am opening this here in hopes that it can be debugged.
Comments:By: Luke H (luckman212) 2011-02-15 08:50:47.000-0600

Just happened again.  This time I did a backtrace, which I've attached.  Also, I want to add that I think this may also be related to res_timing_pthread.  I do not have res_timing_timerfd available (CentOS 5.5 kernel too old) and res_timing_dahdi causes MusicOnHold to stop working so I've disabled it.  I only have res_timing_pthread.

By: Luke H (luckman212) 2011-02-15 15:22:35.000-0600

Just deadlocked again.
This time I also have the output of "core show locks" to go along with the backtrace.  

Want to add some more possibly relevant info:
On this particular system (CPU=Atom230) my hardware clocksource in grub.conf is "acpi_pm".  I don't know if that affects timing but it would make sense to me if the answer was YES.  I have hpet, tsc, jiffies, and pit vailable as well.  If you believe this to be the root of the problem then I will change & retest.

By: Luke H (luckman212) 2011-02-16 11:21:39.000-0600

Happened again, Added another backtrace & locks output.

By: David Woolley (davidw) 2011-02-16 11:46:10.000-0600

I'm not familiar with the internals of res_timing_pthread, but a blocked write seems to be part of the deadlock cycle.

My guess is that you will find something like it is using a pipe to deliver the timing pulses, but the pipe has filled and the the bit that would read events off the pipe is waiting for a lock held by the thread writing to the pipe.

By: Luke H (luckman212) 2011-02-17 07:16:30.000-0600

I realized I don't think I was capturing "full" backtraces.
now I am issuing
bt full
thread apply all bt

here's the latest backtrace & core-show-locks output.

By: Luke H (luckman212) 2011-02-17 11:58:47.000-0600

Adding another locks & backtrace.  Is it helpful if keep submitting these or is it now redundant?

By: Luke H (luckman212) 2011-02-20 21:41:55.000-0600

Is this possibly related to https://issues.asterisk.org/view.php?id=18845 ?

By: Russell Bryant (russell) 2011-03-11 12:27:44.000-0600

Please try res_timing_dahdi now.  I believe the MOH issue you mentioned has been fixed.

By: Luke H (luckman212) 2011-04-20 14:58:18

I have now updated the following:

Wanpipe:  1.0.11
Asterisk: (release)

Sadly, the MOH issue is not fixed.  Should I try the latest SVN?

By: Leif Madsen (lmadsen) 2011-05-16 08:36:26

Please try the latest SVN to determine if the issue has already been resolved.

By: Luke H (luckman212) 2011-05-16 22:03:29

russel, lmadsen:  i've compiled SVN-branch-1.8-r319261M and tested, I am sure it does not fix the 'streaming MOH goes silent after moh reload' issue.  As for the deadlocks with res_timing_pthread when res_timing_dahdi is disabled + using TCP transport, I can't say because I've since re-enabled res_timing_dahdi.

By: Leif Madsen (lmadsen) 2011-09-22 07:18:31.831-0500

I'm seeing the use of res_timing_pthread in all the logs, so I'm closing this issue out as res_timing_pthread has known issues that appear as a deadlock.