ASTERISK-17399: deadlock related to "handle_tcptls_connection"

[Home]

Summary: ASTERISK-17399: deadlock related to "handle_tcptls_connection" - tcp transport

Reporter: Luke H (luckman212) Labels:

Date Opened: 2011-02-13 16:16:48.000-0600 Date Closed: 2011-09-22 07:18:31

Priority: Critical Regression? No

Status: Closed/Complete Components: Channels/chan_sip/TCP-TLS

Versions: 1.8.2 Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) backtrace_1297948105.txt
( 1) backtrace_1297964340.txt
( 2) gdb.txt
( 3) gdb2.txt
( 4) gdb3.txt
( 5) locks_1297948105.txt
( 6) locks_1297964340.txt
( 7) locks.txt
( 8) locks3.txt

Description: I have been experiencing very frequent deadlocks (usually 2-3x per day, sometimes more) in Asterisk 1.8.2.3 since I enabled the TCP transport for one of my endpoints. This *could* be a coincidence but it does seem to correspond with the rest of the debugging info that I have.

I'm running Asterisk 1.8.2.3 as part of a Pbx-In-a-Flash distribution (CentOS 5.5) in a SIP-only (no DAHDI hardware) setup. 3 SIP trunks and 4 SIP peers, a small & simple setup.

What happens is that my pbx will be humming along and then suddenly I will notice that I can no longer make or receive any calls. The *CLI is still responsive, but I notice that my trunks have stopped refreshing their registrations (sip show registry will indicate Reg.Time not updating). I believe that Asterisk is deadlocking during the registration process for this particular endpoint for which I've enabled the TCP transport.

I've compiled Asterisk with DEBUG_THREADS and DONT_OPTIMIZE and when the deadlock occurs, I run the following command:

pbx*CLI> core show locks

This shows a handful of locks, but the most interesting one is (e.g):

=== Thread ID: 85511056 (handle_tcptls_connection started at [ 274] tcptls.c ast_tcptls_server_root())
=== ---> Waiting for Lock #0 (chan_sip.c): MUTEX 23962 handle_request_do &netlock 0x4977660 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x811edbe]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xaa) [0x81185b9]
/usr/lib/asterisk/modules/chan_sip.so [0x4933c89]
/usr/lib/asterisk/modules/chan_sip.so [0x48c8c6d]
/usr/lib/asterisk/modules/chan_sip.so [0x48c8422]
/usr/sbin/asterisk [0x8186000]
/usr/sbin/asterisk [0x8195ee1]
/lib/libpthread.so.0 [0xadd832]
/lib/libc.so.6(clone+0x5e) [0xa1cf6e]
=== --- ---> Locked Here: chan_sip.c line 23962 (handle_request_do)
=== -------------------------------------------------------------------
===
=== Thread ID: 86985616 (handle_tcptls_connection started at [ 274] tcptls.c ast_tcptls_server_root())
=== ---> Waiting for Lock #0 (chan_sip.c): MUTEX 23962 handle_request_do &netlock 0x4977660 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x811edbe]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xaa) [0x81185b9]
/usr/lib/asterisk/modules/chan_sip.so [0x4933c89]
/usr/lib/asterisk/modules/chan_sip.so [0x48c8c6d]
/usr/lib/asterisk/modules/chan_sip.so [0x48c8422]
/usr/sbin/asterisk [0x8186000]
/usr/sbin/asterisk [0x8195ee1]
/lib/libpthread.so.0 [0xadd832]
/lib/libc.so.6(clone+0x5e) [0xa1cf6e]
=== --- ---> Locked Here: chan_sip.c line 23962 (handle_request_do)
=== -------------------------------------------------------------------
===
=======================================================================

I have gdb installed but my skills w/ it are severely limited and I don't know how to debug further.

I think there is a chance that this issue has been reported before, at
https://issues.asterisk.org/view.php?id=17675

However that was with 1.6.x and the issue had been closed, so I am opening this here in hopes that it can be debugged.

Comments: By: Luke H (luckman212) 2011-02-15 08:50:47.000-0600

Just happened again. This time I did a backtrace, which I've attached. Also, I want to add that I think this may also be related to res_timing_pthread. I do not have res_timing_timerfd available (CentOS 5.5 kernel too old) and res_timing_dahdi causes MusicOnHold to stop working so I've disabled it. I only have res_timing_pthread.

By: Luke H (luckman212) 2011-02-15 15:22:35.000-0600

Just deadlocked again.
This time I also have the output of "core show locks" to go along with the backtrace.

Want to add some more possibly relevant info:
On this particular system (CPU=Atom230) my hardware clocksource in grub.conf is "acpi_pm". I don't know if that affects timing but it would make sense to me if the answer was YES. I have hpet, tsc, jiffies, and pit vailable as well. If you believe this to be the root of the problem then I will change & retest.
By: Luke H (luckman212) 2011-02-16 11:21:39.000-0600

Happened again, Added another backtrace & locks output.
By: David Woolley (davidw) 2011-02-16 11:46:10.000-0600

I'm not familiar with the internals of res_timing_pthread, but a blocked write seems to be part of the deadlock cycle.

My guess is that you will find something like it is using a pipe to deliver the timing pulses, but the pipe has filled and the the bit that would read events off the pipe is waiting for a lock held by the thread writing to the pipe.
By: Luke H (luckman212) 2011-02-17 07:16:30.000-0600

I realized I don't think I was capturing "full" backtraces.
now I am issuing
bt
bt full
thread apply all bt

here's the latest backtrace & core-show-locks output.
By: Luke H (luckman212) 2011-02-17 11:58:47.000-0600

Adding another locks & backtrace. Is it helpful if keep submitting these or is it now redundant?
By: Luke H (luckman212) 2011-02-20 21:41:55.000-0600

Is this possibly related to https://issues.asterisk.org/view.php?id=18845 ?
By: Russell Bryant (russell) 2011-03-11 12:27:44.000-0600

Please try res_timing_dahdi now. I believe the MOH issue you mentioned has been fixed.
By: Luke H (luckman212) 2011-04-20 14:58:18

I have now updated the following:

DAHDI: 2.4.1.2+2.4.1
Wanpipe: 1.0.11
Asterisk: 1.8.3.2 (release)

Sadly, the MOH issue is not fixed. Should I try the latest SVN?
By: Leif Madsen (lmadsen) 2011-05-16 08:36:26

Please try the latest SVN to determine if the issue has already been resolved.
By: Luke H (luckman212) 2011-05-16 22:03:29

russel, lmadsen: i've compiled SVN-branch-1.8-r319261M and tested, I am sure it does not fix the 'streaming MOH goes silent after moh reload' issue. As for the deadlocks with res_timing_pthread when res_timing_dahdi is disabled + using TCP transport, I can't say because I've since re-enabled res_timing_dahdi.

By: Leif Madsen (lmadsen) 2011-09-22 07:18:31.831-0500

I'm seeing the use of res_timing_pthread in all the logs, so I'm closing this issue out as res_timing_pthread has known issues that appear as a deadlock.