Summary:ASTERISK-23013: [patch] Deadlock between 'sip show channels' command and attended transfer handling
Reporter:Ben Smithurst (bensmithurst)Labels:
Date Opened:2013-12-17 07:59:51.000-0600Date Closed:2016-07-21 17:40:35
Versions:11.6.0 Frequency of
Environment:Attachments:( 0) deadlock-fix.diff
( 1) sip_show_channels.pl
Description:We have been seeing an occasional freeze of Asterisk which we have tracked down to an apparent deadlock between code handling 'sip show channels' and code handling attended transfers.

We've been seeing this maybe once every couple of days on servers handling ~200-300 concurrent calls at peak time.  'sip show channels' is being executed every 5 minutes by a munin plugin.

I'll dig out the full trace in a second - if I still have a copy - but some important details are below.

(gdb) t 6
[Switching to thread 6 (Thread 0x7fc1fa7ee700 (LWP 4006))]
#0  0x00007fc205f9bcec in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fc205f9bcec in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fc205f97354 in _L_lock_997 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fc205f971b7 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00000000005012a1 in __ast_pthread_mutex_lock (filename=0x5b0c14 "astobj2.c", lineno=1087, func=0x5b0f50 "internal_ao2_callback", mutex_name=0x5b0dfe "c", t=0x7fc2005aac20) at lock.c:273
#4  0x000000000044f5a8 in __ao2_lock (user_data=0x7fc2005aac78, lock_how=AO2_LOCK_REQ_RDLOCK, file=0x5b0c14 "astobj2.c", func=0x5b0f50 "internal_ao2_callback", line=1087, var=0x5b0dfe "c") at astobj2.c:192
#5  0x0000000000450d3e in internal_ao2_callback (c=0x7fc2005aac78, flags=OBJ_POINTER, cb_fn=0x7fc1fd1ca63d, arg=0x7fc1fa7e55e0, data=0x0, type=DEFAULT, tag=0x0, file=0x0, line=0, func=0x0) at astobj2.c:1087
#6  0x0000000000451281 in __ao2_callback (c=0x7fc2005aac78, flags=OBJ_POINTER, cb_fn=0x7fc1fd1ca63d <dialog_cmp_cb>, arg=0x7fc1fa7e55e0) at astobj2.c:1206
#7  0x00000000004513da in __ao2_find (c=0x7fc2005aac78, arg=0x7fc1fa7e55e0, flags=OBJ_POINTER) at astobj2.c:1237
#8  0x00007fc1fd178b05 in get_sip_pvt_byid_locked (callid=0x7fc1d019972a "5718d144-f8d09c30@", totag=0x7fc1d019974a "as012f4e00", fromtag=0x7fc1d0199758 "329b6e344ad2400o0") at chan_sip.c:17616
#9  0x00007fc1fd1a256d in local_attended_transfer (transferer=0x7fc1cc593fb8, current=0x7fc1fa7ec490, req=0x7fc1fa7ed330, seqno=104, nounlock=0x7fc1fa7ed248) at chan_sip.c:25924
#10 0x00007fc1fd1a446c in handle_request_refer (p=0x7fc1cc593fb8, req=0x7fc1fa7ed330, seqno=104, nounlock=0x7fc1fa7ed248) at chan_sip.c:26393
#11 0x00007fc1fd1ad03a in handle_incoming (p=0x7fc1cc593fb8, req=0x7fc1fa7ed330, addr=0x7fc1fa7ed2a0, recount=0x7fc1fa7ed24c, nounlock=0x7fc1fa7ed248) at chan_sip.c:28298
#12 0x00007fc1fd1ada7b in handle_request_do (req=0x7fc1fa7ed330, addr=0x7fc1fa7ed2a0) at chan_sip.c:28484
#13 0x00007fc1fd1ad678 in sipsock_read (id=0x1e429a0, fd=14, events=1, ignore=0x0) at chan_sip.c:28415
#14 0x00000000004fba8e in ast_io_wait (ioc=0x1e304f0, howlong=145) at io.c:292
#15 0x00007fc1fd1af4d4 in do_monitor (data=0x0) at chan_sip.c:29013
#16 0x000000000058de4c in dummy_start (data=0x7fc2005a4f90) at utils.c:1162
#17 0x00007fc205f94b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#18 0x00007fc206d76a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#19 0x0000000000000000 in ?? ()

Thread 6 above is handling an attended transfer, the key pointer is transferer=0x7fc1cc593fb8 on frame 9, this is locked in handle_request_refer (frame 10).
              /* both p and p->owner _MUST_ be locked while calling local_attended_transfer */
              if ((res = local_attended_transfer(p, &current, req, seqno, nounlock))) {
                      goto handle_refer_cleanup; /* We're done with the transfer */

Frame 9 then calls get_sip_pvt_byid_locked, which does an ao2_t_find call over the dialogs list.  Subsequently frame 5 tries to lock the dialogs list (in AO2 terms, it is the ‘c’ parameter, the container).


(gdb) t 22
[Switching to thread 22 (Thread 0x7fc1e1842700 (LWP 45850))]
#0  0x00007fc205f9bcec in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fc205f9bcec in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fc205f97354 in _L_lock_997 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fc205f971b7 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00000000005012a1 in __ast_pthread_mutex_lock (filename=0x7fc1fd1da774 "chan_sip.c", lineno=20871, func=0x7fc1fd1f4280 "show_channels_cb", mutex_name=0x7fc1fd1e723f "cur", t=0x7fc1cc593f60) at lock.c:273
#4  0x000000000044f5a8 in __ao2_lock (user_data=0x7fc1cc593fb8, lock_how=AO2_LOCK_REQ_MUTEX, file=0x7fc1fd1da774 "chan_sip.c", func=0x7fc1fd1f4280 "show_channels_cb", line=20871, var=0x7fc1fd1e723f "cur") at astobj2.c:192
#5  0x00007fc1fd18b778 in show_channels_cb (__cur=0x7fc1cc593fb8, __arg=0x7fc1e1841430, flags=2) at chan_sip.c:20871
#6  0x0000000000450e23 in internal_ao2_callback (c=0x7fc2005aac78, flags=OBJ_NODATA, cb_fn=0x7fc1fd18b70e, arg=0x7fc1e1841430, data=0x0, type=DEFAULT, tag=0x0, file=0x0, line=0, func=0x0) at astobj2.c:1101
#7  0x0000000000451281 in __ao2_callback (c=0x7fc2005aac78, flags=OBJ_NODATA, cb_fn=0x7fc1fd18b70e <show_channels_cb>, arg=0x7fc1e1841430) at astobj2.c:1206
#8  0x00007fc1fd18be28 in sip_show_channels (e=0x7fc1fd403d60, cmd=-4, a=0x7fc1e1841470) at chan_sip.c:20942
#9  0x00000000004a1912 in ast_cli_command_full (uid=0, gid=109, fd=215, s=0x7fc1e1841750 "sip show channels") at cli.c:2588
#10 0x00000000004a1a60 in ast_cli_command_multiple_full (uid=0, gid=109, fd=215, size=18, s=0x7fc1e1841ba0 "sip show channels") at cli.c:2611
#11 0x0000000000446951 in netconsole (vconsole=0x8622c0) at asterisk.c:1341
#12 0x000000000058de4c in dummy_start (data=0x7fc1e403d2b0) at utils.c:1162
#13 0x00007fc205f94b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#14 0x00007fc206d76a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x0000000000000000 in ?? ()

This is handling the ‘sip show channels’ command (see args in frames 9/10).  This iterates over ‘dialogs’ in a similar way, locking the container, and for each entry, show_channels_cb is called (frame 5, note that the __cur pointer is the same as the important transferer pointer from thread 6), subsequently this pointer is locked at frame 4.

So in summary my assessment is as follows

Thread 6: attended transfer, locks a sip_pvt, then the dialogs list
Thread 22: sip show channels, locks the dialogs list, then a sip_pvt

I've a patch which solves the problem for us, but it's really a workaround at best.  For our purposes a bit of missing information in 'sip show channels' output is far preferable to a deadlock though.

Comments:By: Ben Smithurst (bensmithurst) 2013-12-17 08:02:18.846-0600

Attached [^sip_show_channels.pl] is just a dirty little Perl script I used to help reproduce the problem - running that script spinning on 'sip show channels' and then initiating an attended transfer would easily make Asterisk deadlock (first time I tried).

By: Ben Smithurst (bensmithurst) 2013-12-17 08:08:18.586-0600

The full backtrace contains some sensitive customer information (phone numbers etc) - would an obfuscated backtrace be any use, or is the backtrace of the two deadlocked threads included in the description adequate here?

By: Matt Jordan (mjordan) 2013-12-17 09:42:07.386-0600

The analysis should be fine Ben - thanks. The fact that the order is swapped between threads is a clear indication that someone isn't following the correct locking order.