ASTERISK-15304: racecondition leading to deadlock in chan

[Home]

Summary: ASTERISK-15304: racecondition leading to deadlock in chan_local

Reporter: Marcus Hunger (fnordian) Labels:

Date Opened: 2009-12-11 10:12:09.000-0600 Date Closed: 2011-06-07 14:00:51

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_local

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments:

Description: Hi,

there's a race in chan_local between local_hangup and queuing a dtmf (ast_senddigit ... local_queue_frame), which can cause hanging channels and a hanging ast_senddigit.

This is possible because local_digit_*-functions assume that they were called on an unlocked channel, which is not always the cases, e.g. manager_play_dtmf uses a locked channel to call ast_senddigit. With this wrong assumption, deadlock_avoidance code is not called.

Comments: By: Leif Madsen (lmadsen) 2009-12-11 14:34:17.000-0600

Is there any sort of console output when this happens to indicate that it is happening? Or does Asterisk just crash, or hang?
By: Marcus Hunger (fnordian) 2009-12-14 03:54:46.000-0600

Unfortunatly not. channel and ami hangs, that's all.

here are backtraces auf the hanging threads:

(gdb) thread 10
[Switching to thread 10 (process 27614)]#0 0x00002aad049f1df5 in nanosleep () from /lib/libc.so.6
(gdb) bt
#0 0x00002aad049f1df5 in nanosleep () from /lib/libc.so.6
#1 0x00002aad04a231e4 in usleep () from /lib/libc.so.6
#2 0x00002aaaab932e8d in local_queue_frame (p=0x2aaab28fc070, isoutbound=1, f=0x44f6e2a0, us=0x2aaaaffbfd90, us_locked=0) at chan_local.c:233
#3 0x00002aaaab934313 in local_digit_end (ast=0x2aaaaffbfd90, digit=68 'D', duration=<value optimized out>) at chan_local.c:469
#4 0x000000000043fb36 in ast_senddigit_end (chan=0x2aaaaffbfd90, digit=0 '\0', duration=0) at channel.c:3297
ASTERISK-1 0x00002aaaab59c165 in manager_play_dtmf (s=0x44f6ef80, m=0x44f6e710) at app_senddtmf.c:101
ASTERISK-2 0x0000000000486bfe in process_message (s=0x44f6ef80, m=0x44f6e710) at manager.c:3049
ASTERISK-3 0x0000000000487186 in do_message (s=0x44f6ef80) at manager.c:3163
ASTERISK-4 0x000000000048af8a in session_do (data=<value optimized out>) at manager.c:3221
ASTERISK-5 0x00000000004cded3 in handle_tcptls_connection (data=<value optimized out>) at tcptls.c:223
ASTERISK-6 0x00000000004d76dc in dummy_start (data=<value optimized out>) at utils.c:968
ASTERISK-7 0x00002aad04ca5f1a in start_thread () from /lib/libpthread.so.0
ASTERISK-8 0x00002aad04a295d2 in clone () from /lib/libc.so.6
ASTERISK-9 0x0000000000000000 in ?? ()
(gdb) thread 53
[Switching to thread 53 (process 4427)]#0 0x00002aad049f1df5 in nanosleep () from /lib/libc.so.6
(gdb) bt
#0 0x00002aad049f1df5 in nanosleep () from /lib/libc.so.6
#1 0x00002aad04a231e4 in usleep () from /lib/libc.so.6
#2 0x00002aaaab934b22 in local_hangup (ast=0x2aaaaffbec70) at chan_local.c:617
#3 0x000000000044355e in ast_hangup (chan=0x2aaaaffbec70) at channel.c:1717
#4 0x00002aaaad198fdb in dial_exec_full (chan=0x2aaab1a8b0b0, data=<value optimized out>, peerflags=0x47c76980, continue_exec=0x0) at app_dial.c:2091
ASTERISK-1 0x00002aaaad19bc46 in dial_exec (chan=0x47c75c30, data=0x0) at app_dial.c:2143
ASTERISK-2 0x0000000000498d6a in pbx_exec (c=0x2aaab1a8b0b0, app=0x6bab20, data=0x47c78a60) at pbx.c:960
ASTERISK-3 0x000000000049f97a in pbx_extension_helper (c=0x2aaab1a8b0b0, con=<value optimized out>, context=0x2aaab1a8b500 "initialCall", exten=0x2aaab1a8b550 "1234567", priority=11, label=0x0,
callerid=0x18316d0 "00000000", action=E_SPAWN, found=0x47c7b0ec, combined_find_spawn=1) at pbx.c:3225
ASTERISK-4 0x000000000049fe80 in ast_spawn_extension (c=0x47c75c30, context=0x0, exten=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, priority=-1299201936, callerid=<value optimized out>,
found=<value optimized out>, combined_find_spawn=1) at pbx.c:3681
ASTERISK-5 0x00000000004a306a in __ast_pbx_run (c=0x2aaab1a8b0b0, args=0x0) at pbx.c:3770
ASTERISK-6 0x00000000004a4a3b in pbx_thread (data=0x47c75c30) at pbx.c:4057
ASTERISK-7 0x00000000004d76dc in dummy_start (data=<value optimized out>) at utils.c:968
ASTERISK-8 0x00002aad04ca5f1a in start_thread () from /lib/libpthread.so.0
ASTERISK-9 0x00002aad04a295d2 in clone () from /lib/libc.so.6
ASTERISK-10 0x0000000000000000 in ?? ()
(gdb)
By: Leif Madsen (lmadsen) 2009-12-14 09:36:12.000-0600

Do you have DONT_OPTIMIZE enabled in the compiler flags of menuselect when you generate this backtrace?
By: Marcus Hunger (fnordian) 2009-12-14 09:40:24.000-0600

No, I don't have this in my config. The error was on a production-system.
By: Leif Madsen (lmadsen) 2009-12-14 12:28:11.000-0600

It's typically OK to run DONT_OPTIMIZE on production systems -- I do it on all my customers systems so that when a crash like this occurs, a core file that generates a useful backtrace is produced.
By: Leif Madsen (lmadsen) 2009-12-14 12:29:13.000-0600

I've acknowledged this for now, but it will be up to a developer to determine if this backtrace is useful or not -- typically without DONT_OPTIMIZE enabled, the backtrace doesn't provide enough information. If you can either reproduce this, or if it happens often enough that you can enable DONT_OPTIMIZE until you can get this to happen again, that would be useful.
By: Stéphan Kochen (skochen) 2010-01-13 05:13:25.000-0600

I believe we're seeing the same problem on Asterisk 1.4.28, in a slightly different situation. We recently reinstalled with DEBUG_THREADS to get a better idea of what is going on. In "core show locks", I now see the following:

=== Thread ID: 88636304 (pbx_thread started at [ 2643] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_local.c): MUTEX 559 local_hangup &ast->lock 0x9551850 (1)
=== ---> Tried and failed to get Lock #1 (chan_local.c): MUTEX 588 local_hangup &p->chan->lock 0x962f250 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 52378512 (pbx_thread started at [ 2643] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_local.c): MUTEX 201 local_queue_frame &us->lock 0x962f250 (1)
=== ---> Tried and failed to get Lock #1 (chan_local.c): MUTEX 195 local_queue_frame &other->lock 0x9551850 (0)
=== -------------------------------------------------------------------

This is the first time I'm lock debugging in Asterisk, but this looks like a deadlock to me. The same output contains a second, similar case. Eventually, lots of things calling channel_find_locked get caught up in it as well, causing chan_sip to become unresponsive, and the CLI to hang on "core show channels" for example.

When not caught up in the deadlock while it is happening, the CLI may log nothing at all, or may log lots of:
channel.c: Exceptionally long queue length queuing to Local/403@default-a54d,1

Backtraces are provided below (also without DONT_OPTIMIZE). It looks like one thread hangs when app_queue calls hangupcalls() to hang up all it's agents. The agents are Local channels in this setup. The other thread hangs because one of those channels is apparently getting answered in the same split second.

=== Thread ID: 88636304 (pbx_thread started at [ 2643] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_local.c): MUTEX 559 local_hangup &ast->lock 0x9551850 (1)
=== ---> Tried and failed to get Lock #1 (chan_local.c): MUTEX 588 local_hangup &p->chan->lock 0x962f250 (0)
Thread 21 (process 12142):
#0 0x00366410 in __kernel_vsyscall ()
#1 0x007a1996 in nanosleep () from /lib/libc.so.6
#2 0x007db15c in usleep () from /lib/libc.so.6
#3 0x00a03e57 in local_hangup (ast=0x95517d8) at chan_local.c:589
#4 0x080a39b5 in ast_hangup ()
ASTERISK-1 0x07d40a7b in try_calling (qe=0x547dc40, options=0x547dbdf "twr", announceoverride=0x547dbe4 "", url=0x547dbe3 "", tries=0x547dde0, noption=0x547dddc, agi=0x547dbe8 "") at app_queue.c:1751
ASTERISK-2 0x07d437e9 in queue_exec (chan=0x9624750, data=0x547fe58) at app_queue.c:4124
ASTERISK-3 0x08113ace in pbx_extension_helper ()
ASTERISK-4 0x00c4e4f9 in _macro_exec (chan=0x9624750, data=0x5484ee8, exclusive=0) at app_macro.c:352
ASTERISK-5 0x08113ace in pbx_extension_helper ()
ASTERISK-6 0x08117921 in __ast_pbx_run ()
ASTERISK-7 0x08118b4e in pbx_thread ()
ASTERISK-8 0x0816297f in dummy_start ()
ASTERISK-9 0x008ed5ab in start_thread () from /lib/libpthread.so.0
ASTERISK-10 0x007e1cfe in clone () from /lib/libc.so.6

=== Thread ID: 52378512 (pbx_thread started at [ 2643] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_local.c): MUTEX 201 local_queue_frame &us->lock 0x962f250 (1)
=== ---> Tried and failed to get Lock #1 (chan_local.c): MUTEX 195 local_queue_frame &other->lock 0x9551850 (0)
Thread 20 (process 12146):
#0 0x00366410 in __kernel_vsyscall ()
#1 0x007a1996 in nanosleep () from /lib/libc.so.6
#2 0x007db15c in usleep () from /lib/libc.so.6
#3 0x00a02f7e in local_queue_frame (p=0x962d2b8, isoutbound=1, f=0x31e887c, us=0x962f1d8, us_locked=1) at chan_local.c:200
#4 0x00a04cf2 in local_indicate (ast=0x962f1d8, condition=-1, data=0x0, datalen=0) at chan_local.c:399
ASTERISK-1 0x08099762 in ast_indicate_data ()
ASTERISK-2 0x0809ad0e in ast_answer ()
ASTERISK-3 0x00283a59 in ast_bridge_call (chan=0x962f1d8, peer=0x9667d58, config=0x31e9c7c) at res_features.c:1729
ASTERISK-4 0x004f697e in dial_exec_full (chan=0x962f1d8, data=<value optimized out>, peerflags=0x31e9e14, continue_exec=0x0) at app_dial.c:1889
ASTERISK-5 0x004f9342 in dial_exec (chan=0x962f1d8, data=0x31ebe78) at app_dial.c:1937
ASTERISK-6 0x08113ace in pbx_extension_helper ()
ASTERISK-7 0x00c4e4f9 in _macro_exec (chan=0x962f1d8, data=0x31f0ee8, exclusive=0) at app_macro.c:352
ASTERISK-8 0x08113ace in pbx_extension_helper ()
ASTERISK-9 0x08117921 in __ast_pbx_run ()
ASTERISK-10 0x08118b4e in pbx_thread ()
ASTERISK-11 0x0816297f in dummy_start ()
ASTERISK-12 0x008ed5ab in start_thread () from /lib/libpthread.so.0
ASTERISK-13 0x007e1cfe in clone () from /lib/libc.so.6
By: Stéphan Kochen (skochen) 2010-01-25 05:33:31.000-0600

Curiously, we're seeing this problem after a week of uptime, give or take a few hours.

I don't see the link, but perhaps it will make sense to someone else?

(I was not able to test this on 1.4.29, because it burdens our customers with some regressions that have popped up.)
By: Leif Madsen (lmadsen) 2010-01-25 08:45:29.000-0600

skochen: and the regressions you've found -- have those been reported either by yourself or someone else?
By: Stéphan Kochen (skochen) 2010-01-25 09:23:08.000-0600

lmadsen: Yes, issue ASTERISK-15453, attended transfer is broken.

I also have another issue with Followme, where the receiving line is silent, but I'm still investigating that myself.
By: Stéphan Kochen (skochen) 2010-02-11 09:40:57.000-0600

I know it's not a lot to go by, but we're now running the server in question on version 1.4.26. The uptime just passed the two week mark.

So it might have been a change somewhere between 1.4.26 and 1.4.28. That's a large window, though.
By: Leif Madsen (lmadsen) 2010-02-22 16:04:53.000-0600

This may already be fixed per the other related issues. The one note on 16630 reads as: "lmadesen: I just installed 244151 and tested my reported scenario and the transfer problem is fixed in this revision."

Please test after revision 244151 and report back. Thanks!
By: Paul Belanger (pabelanger) 2010-04-28 15:29:57

Suspended due to lack of activity. Please request a bug marshal in #asterisk-bugs on the IRC network irc.freenode.net to reopen the issue should you have the additional information requested.

Further information can be found at http://www.asterisk.org/developers/bug-guidelines