ASTERISK-17418: Asterisk deadlocks with a SIP channel locked

[Home]

Summary: ASTERISK-17418: Asterisk deadlocks with a SIP channel locked

Reporter: Kirill Katsnelson (kkm) Labels:

Date Opened: 2011-02-16 18:46:49.000-0600 Date Closed: 2011-09-22 07:40:22

Priority: Critical Regression? No

Status: Closed/Complete Components: General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) astlocks-2011-02-15-1005
( 1) astlocks-2011-02-15-1647
( 2) backtrace-redacted-2011-02-15-1005
( 3) backtrace-redacted-2011-02-15-1647

Description: We run a complex application with calls originated via AMI Originate command to Local channel. The local side of channel obtains an "agent" from a queue, and the remote side is calling a customer via an outgoing SIP trunk.

****** STEPS TO REPRODUCE ******

Asterisk freezes sometimes once a few weeks, sometimes a few times a day. We have been getting the freezes since an upgrade from 1.6 to 1.8.1. The error is still there as of 1.8.3-rc2.

Two episode's backtrace and lock traces are attached.

****** ADDITIONAL INFORMATION ******

Quite obvious what is going on: the do_monitor thread hold a lock #1, goes into find_channel loop, stumbles upon another locked channel #2 and waits for it. Other threads are mostly locked on the same lock #1 this thread holds.

Now, the mystery lies with the lock #2. That lock (0xe45970) is held by a pbx_thread which is seemingly not waiting for a lock:

=== Thread ID: 140177859758416 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) 0x7F7DB3894950
=== ---> Lock #0 (channel.c): MUTEX 3636 __ast_read chan 0xe45970 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x4ea973]
/usr/sbin/asterisk(__ast_pthread_mutex_trylock+0xad) [0x4e3b3e]
/usr/sbin/asterisk(__ao2_trylock+0x53) [0x445b89]
/usr/lib/asterisk/modules/chan_sip.so [0x7f7dc0c86b38] handle_request_do 23992 if (!p->owner || !ast_channel_trylock(p->owner))
/usr/lib/asterisk/modules/chan_sip.so [0x7f7dc0c867f2] sipsock_read 23936 handle_request_do(&req, &addr);
/usr/sbin/asterisk(ast_io_wait+0x1ba) [0x4dd9e0]
/usr/lib/asterisk/modules/chan_sip.so [0x7f7dc0c88684] do_monitor 24455 res = ast_io_wait(io, res);
/usr/sbin/asterisk [0x56aed5]
/lib/libpthread.so.0 [0x7f7ddb3ed3ba]
/lib/libc.so.6(clone+0x6d) [0x7f7ddbc4902d]

I annotated function names and source lines on the right.

The mystery part is that a pbx_thread should not ever enter do_monitor, and seemingly cannot have obtained the lock with that stack backtrace. I cannot understand what is going on.

Here's some additional info from the GDB.

The do_monitor thread is easy to locate from thread backtrace dumps:

(gdb) thr 17
[Switching to thread 17 (process 31955)]#3 0x00000000004e384c in __ast_pthread_mutex_lock (filename=0x5a0af4 "channel.c", lineno=1641, func=0x5a2940 "ast_channel_cmp_cb", mutex_name=0x5a20d8 "chan", t=0xe45970) at lock.c:230
230 res = pthread_mutex_lock(&t->mutex);
(gdb) fr 3
#3 0x00000000004e384c in __ast_pthread_mutex_lock (filename=0x5a0af4 "channel.c", lineno=1641, func=0x5a2940 "ast_channel_cmp_cb", mutex_name=0x5a20d8 "chan", t=0xe45970) at lock.c:230
230 res = pthread_mutex_lock(&t->mutex);
(gdb) p t->mutex
$27 = {__data = {__lock = 2, __count = 1, __owner = 25423, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\001\000\000\000Oc\000\000\001\000\000\000\001", '\0' <repeats 22 times>, __align = 4294967298}

So we know that the mutex we are blocked on is held by lwp 25423. Also, note the channel name which this lock belongs to

(gdb) fr 5
ASTERISK-1 0x000000000046dcff in ast_channel_cmp_cb (obj=0xe46558, arg=0x7f7db571c040, flags=0) at channel.c:1641
1641 ast_channel_lock(chan);
(gdb) p chan->name
$28 = 0xcf1e8c "SIP/netborder-out-0000186a"

lwp 25423 runs a pbx thread. it's backtrace shows that the channel is in a Dial application, invoking its on connect gosub, that in turn waiting for 2500 ms of silence in WaitForSilence. This is quite what it is supposed to do in our dialplan.

(gdb) thr 30
[Switching to thread 30 (process 25423)]#0 0x00007f7ddbc3af7b in read () from /lib/libc.so.6
(gdb) fr 10
ASTERISK-6 0x000000000050f8e1 in ast_spawn_extension (c=0xe46558, context=0xe46ab0 "wait-for-am-completion", exten=0xe46b00 "sw_885_\"Unknown\"", priority=12, callerid=0x119e630 "~~s~~", found=0x7f7db388e12c, combined_find_spawn=1)
at pbx.c:4605
4605 return pbx_extension_helper(c, NULL, context, exten, priority, NULL, callerid, E_SPAWN, found, combined_find_spawn);
(gdb) p c->name
$29 = 0xcf1e8c "SIP/netborder-out-0000186a"

The channel locked was really created in the Dial() application on that same thread that holds its lock. At this point, I cannot understand what is going on. Sould I not trust the backtrace from core show locks?

Comments: By: Kirill Katsnelson (kkm) 2011-02-16 18:59:11.000-0600

Real phone numbers have been replaced with all nines, and customer names with the string "CustomerName" in the backtraces. The stacks were not changed otherwise, obviously.

As an additional note on the frequency of the freeze, the application may process quite tens of thousands calls, with 40 calls going on at the same time, before a lock-up.
By: Russell Bryant (russell) 2011-02-28 17:05:39.000-0600

This looks like a problem in res_timing_timerfd. Can you try installing DAHDI and using res_timing_dahdi and see if that fixes your problem?
By: Kirill Katsnelson (kkm) 2011-02-28 17:51:34.000-0600

Very unfortunately, we had to cancel our upgrade to 1.8 and reverted the only production server back to 1.6, mostly because of this issue. So currently I have no setup that would be used in reproducing this problem any more.

I think we should be able to try a fix, should it be a likely fix. So I have two questions here:
1. Is it quite likely that DAHDI will work around the problem?
2. We have no telephony cards here, SIP only. Will DAHDI work without the hardware? I have never considered it, so I know very little about it.
By: Kadir Terzi (kterzi) 2011-03-01 08:27:08.000-0600

I am not quite sure if this is related to my issue (https://issues.asterisk.org/view.php?id=18569), but I have tried all 3 timings one by one. Unfortunately none of them helps to my problem. I still get deadlocks.
By: Kirill Katsnelson (kkm) 2011-03-03 20:40:44.000-0600

Is it possible that kernel upgrade would help the issue, if that's a timerfd issue?
By: Kadir Terzi (kterzi) 2011-03-04 02:20:34.000-0600

I was not able to use timerfd with kernel 2.6.26-2-amd64, that's why upgraded to 2.6.32-5-amd64. Unfortunately it did not help. Do you think a newer one might help?
By: Kirill Katsnelson (kkm) 2011-05-06 17:51:16

Got a lock-up with timerfd in 2.6.35-28-server ASTERISK-46-Ubuntu SMP x86_64 GNU/Linux. Just logging for the sake of keeping history of the issue. It does not look like a kernel upgrade would help.
By: Leif Madsen (lmadsen) 2011-06-04 10:22:55

This will very likely be resolved with the closure and commit of code in ASTERISK-16711
By: Kirill Katsnelson (kkm) 2011-06-04 12:15:08

@lmadsen: I see some code committed on 05/31, are you talking about that commit? should I try it already?