Summary:ASTERISK-17005: Deadlock on channel list during masquerade and channel search.
Reporter:Johnny B. Goode (chuckberry)Labels:
Date Opened:2010-11-23 10:02:00.000-0600Date Closed:2011-02-08 14:06:27.000-0600
Versions:Frequency of
Description:Found a deadlock while handling Local channel masquerades and the Pickup application.

Follows the stack trace and a brief explanation:

(gdb) thread 2
[Switching to thread 2 (Thread 0xa56ffb90 (LWP 15220))]#0  0xb7fc0410 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7fc0410 in __kernel_vsyscall ()
#1  0xb7e52589 in __lll_lock_wait () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb7e4dbb4 in _L_lock_236 () from /lib/tls/i686/cmov/libpthread.so.0
#3  0xb7e4d60b in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
#4  0xb7f46ea6 in pthread_setschedparam () from /lib/tls/i686/cmov/libc.so.6
ASTERISK-1  0x08095104 in ast_channel_search_locked (is_match=0xb7468380 <find_by_mark>, data=0xa56face0) at /root/ASTERISK/asterisk-
ASTERISK-2  0xb7468043 in pickup_exec (chan=0xa61fc088, data=0xa56fadd9) at app_directed_pickup.c:222
ASTERISK-3  0x08100df9 in pbx_exec (c=0xa61fc088, app=0xb7b6e808, data=0xa56fadd9) at pbx.c:1352
ASTERISK-4  0xb7899fbf in execif_exec (chan=0xa61fc088, data=0xa56fd148) at app_exec.c:257
ASTERISK-5  0x08100df9 in pbx_exec (c=0xa61fc088, app=0x8881028, data=0xa56fd148) at pbx.c:1352
ASTERISK-6 0x0810d170 in pbx_extension_helper (c=0xa61fc088, con=0x0, context=0xa61fc2fc "entrada-ramais", exten=0xa61fc34c "55", priority=3, label=0x0, callerid=0xa6178978 "2903",
   action=E_SPAWN, found=0xa56ff254, combined_find_spawn=1) at pbx.c:3715
ASTERISK-7 0x0810f6e6 in __ast_pbx_run (c=0xa61fc088, args=0x0) at pbx.c:4186
ASTERISK-8 0x08110f80 in pbx_thread (data=0xa61fc088) at pbx.c:4567
ASTERISK-9 0x08151a5b in dummy_start (data=0xa5c56a60) at utils.c:968
ASTERISK-10 0xb7e4b4fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
ASTERISK-11 0xb7f39e5e in __call_fallocate () from /lib/tls/i686/cmov/libc.so.6
ASTERISK-12 0x00000000 in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 0xa58f3b90 (LWP 15213))]#0  0xb7fc0410 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7fc0410 in __kernel_vsyscall ()
#1  0xb7e4f21b in pthread_rwlock_wrlock () from /lib/tls/i686/cmov/libpthread.so.0
#2  0x08095899 in ast_channel_free (chan=0xb7e4f21b) at /root/ASTERISK/asterisk-
#3  0x0809915f in ast_do_masquerade (original=0xa61028b8) at channel.c:4869
#4  0x08099761 in ast_waitfor_nandfds (c=0xa58e8b50, n=2, fds=0x0, nfds=0, exception=0x0, outfd=0x0, ms=0xa58e9024) at channel.c:2129
ASTERISK-1  0x0809a07f in ast_waitfor_n (c=0xa58e8b50, n=2, ms=0xa58e9024) at channel.c:2448
ASTERISK-2  0xb73b549c in dial_exec_full (chan=0x88ecd70, data=<value optimized out>, peerflags=0xa58e9900, continue_exec=0x0) at app_dial.c:879
ASTERISK-3  0xb73b9ff9 in dial_exec (chan=0x88ecd70, data=0xa58e99d8) at app_dial.c:2353
ASTERISK-4  0x08100df9 in pbx_exec (c=0x88ecd70, app=0xa6599258, data=0xa58e99d8) at pbx.c:1352
ASTERISK-5  0xb7899bf4 in tryexec_exec (chan=0x88ecd70, data=0xa58ec0b8) at app_exec.c:184
ASTERISK-6 0x08100df9 in pbx_exec (c=0x88ecd70, app=0x8880fe8, data=0xa58ec0b8) at pbx.c:1352
ASTERISK-7 0x0810d170 in pbx_extension_helper (c=0x88ecd70, con=0x0, context=0x88ecfe4 "physical-dial", exten=0x88ed034 "2900", priority=4, label=0x0, callerid=0x88df0a8 "00100000000",
   action=E_SPAWN, found=0xa58ee3b8, combined_find_spawn=1) at pbx.c:3715
ASTERISK-8 0xb78c1518 in _macro_exec (chan=0x88ecd70, data=<value optimized out>, exclusive=0) at app_macro.c:398
ASTERISK-9 0x08100df9 in pbx_exec (c=0x88ecd70, app=0xb7b3ac18, data=0xa58f1148) at pbx.c:1352
ASTERISK-10 0x0810d170 in pbx_extension_helper (c=0x88ecd70, con=0x0, context=0x88ecfe4 "physical-dial", exten=0x88ed034 "2900", priority=2, label=0x0, callerid=0x88df0a8 "00100000000",
   action=E_SPAWN, found=0xa58f3254, combined_find_spawn=1) at pbx.c:3715
ASTERISK-11 0x0810f6e6 in __ast_pbx_run (c=0x88ecd70, args=0x0) at pbx.c:4186
ASTERISK-12 0x08110f80 in pbx_thread (data=0x88ecd70) at pbx.c:4567
ASTERISK-13 0x08151a5b in dummy_start (data=0x88d5dc0) at utils.c:968
ASTERISK-14 0xb7e4b4fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
ASTERISK-15 0xb7f39e5e in __call_fallocate () from /lib/tls/i686/cmov/libc.so.6
ASTERISK-16 0x00000000 in ?? ()

The execution stack above shows this scenario:

* thread 3 is called with channel X lock already held.
* thread 2 is called, and helds channel list lock while iterating over channels.
* thread 3 tries to lock channel list, but thread 2 already owns it.
* thread 2 tries to lock channel X while iterating, but thread 3 already owns it.

We are now in a classic deadlock situation. Channel list receives new wrlock requests but they never succed, so new calls get not processed by the PBX.

One solution is to change the search to a try-lock approach, where the list and the channels are try-locked, and everything is unlocked if some channel cannot be locked in a reasonable time.

NOTE: This bug was detected on - diff'ing both and releaved no changes in the affected areas, so I am reporting this case for as is missing from checkboxes.
Comments:By: Leif Madsen (lmadsen) 2010-12-07 10:32:13.000-0600

Please verify this against (there is a tag for it now, the release is coming this morning).

Many changes to chan_local were changed in that version which may have resolved that issue already.

By: Johnny B. Goode (chuckberry) 2010-12-23 08:20:47.000-0600

lmadsen, thanks for the reply.

As the problem was happening on our main PBX, which needs to be reliable, we upgraded it to version 1.8. Since this version already has the astobj2 container for the channel list, this should make it less prone to deadlocks.

I'm sorry, but for now I cannot go back to 1.6.2 series on that machine.

I will try to include in some internal stress/general tests, and report any misbehaviour.

By: Johnny B. Goode (chuckberry) 2010-12-28 09:28:33.000-0600

Updating my previous message: things didn't work very well with 1.8, we were experiencing Pickup issues when dialing multiple channels at once. So, we are going back to, as the new issue is unacceptable for daily usage.

I will keep this issue updated with any new information.

By: Leif Madsen (lmadsen) 2011-02-08 14:06:27.000-0600

Closing this issue as the reporter never provided additional information, so I can only assume resolved the issue :)