Summary:ASTERISK-19639: [patch] - Deadlock in queue with attended transfer
Reporter:Byron Clark (byronclark)Labels:
Date Opened:2012-04-04 14:53:06Date Closed:2014-01-09 14:57:57.000-0600
Versions: Frequency of
One Time
Environment:Attachments:( 0) backtrace.txt
( 1) queue_memberstatus_deadlock.patch
Description:I haven't been able to find an easy way to reproduce this deadlock, but from the backtrace it appears to happen when a transfer occurs on a queue agent while a new caller is entering the queue. That said, it seems to be somewhat rare.
Comments:By: Byron Clark (byronclark) 2012-04-04 14:55:02.293-0500

[^backtrace.txt] contains a backtrace generated when I killed the deadlocked process with SIGABRT.

The threads that appear to be deadlocked are threads 56 and 13.  Here's my quick analysis on which locks are held by those threads:
* Thread 56 (attended transfer)
** Locks Held
*** channels (channel.c:6437)
*** channel (channel.c:6440)
** Awaiting Lock
*** queue (app_queue.c:4014)

* Thread 13 (looking up devices to ring channel) [queue]
** Locks Held
*** queue (app_queue.c:3050)
** Awaiting Lock
*** channels (channel.c:1742)

By: Matt Jordan (mjordan) 2012-04-10 09:15:07.638-0500

Debugging deadlocks: Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install).  This will then give you the console command "core show locks." When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt
# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt
gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!

If you can get the 'core show locks' output with a matching backtrace, that'd be great - if you're unable to do so, please let us know.  Getting a 'core show locks' makes debugging and resolving these types of issues much easier then using a backtrace alone.

By: Byron Clark (byronclark) 2012-04-10 09:39:48.106-0500

We've seen a pretty big performance hit running with lock debugging enabled at our current load. Usually, when we get a frequent deadlock that we can't debug from the backtrace we'll push out a build with lock debugging enabled. On this one we've only seen this deadlock once and don't have a good test case to reproduce it with lock debugging enabled.

By: Matt Jordan (mjordan) 2012-04-17 08:37:42.014-0500

I'll acknowledge this issue, but without 'core show locks' it may be difficult for us to resolve, and the response time may reflect that.  If you can find a way to get the 'core show locks' output and attach it to this issue, let me know.

By: Jeff Hutchins (jhutchins) 2012-05-09 16:41:28.343-0500

This is an as yet untested patch that I believe fixes this issue. I'm going to test it, but if anyone has any comment on why this is a bad idea it would be greatly appreciated.

By: Jeff Hutchins (jhutchins) 2012-05-15 17:38:19.912-0500

The patch appears to have fixed the deadlock for us.

By: Jeff Hutchins (jhutchins) 2012-06-20 14:50:31.446-0500

What needs to be done to get action on this ticket?

By: Paul Belanger (pabelanger) 2014-01-09 11:03:01.660-0600

Which version of asterisk is your patch against? Pretty sure I just hit this bug.

By: Jeff Hutchins (jhutchins) 2014-01-09 11:09:22.562-0600

It was written against and then ignored because Digium apparently hate it when people fix their crappy code for free.

By: Matt Jordan (mjordan) 2014-01-09 13:33:19.286-0600

Actually, I just missed the update in the asterisk-bugs e-mails that a patch had been provided for the issue. Sometimes, a prod in #asterisk, #asterisk-dev, or on the mailing lists is helpful. Sorry about that!

And, as an aside, there is a whole community of Asterisk developers who have commit access who can - and often do - look and solve issues in the issue tracker. Paul happens to be one of those, in fact. It isn't just Digium :-).

By: Matt Jordan (mjordan) 2014-01-09 14:57:49.137-0600

Good news! With a bit of trolling through the history that Paul provided in #asterisk-dev, this particular piece of "crappy code" has already been fixed.

The locking inversion occurred because of the following:
* Masquerade locks the channel. During the masquerade, the app_queue fixup function is called, grabbing the Queue lock.
* At the same time, calling get_queue_member_status *can* cause the channel to be locked. If get_queue_member_status is called within the Queue lock, this causes a deadlock.

However, in r378663, rmudgett modified the code in this place such that it no longer calls get_queue_member_status in the first place (note that this was done as part of the cleanup in ASTERISK-16115; see [this patch here|https://issues.asterisk.org/jira/secure/attachment/45753/jira_asterisk_16115_single_q_v1.8.patch]). Now, since all we need to do there is clear their status, that is done via {{member_call_pending_clear}} - which, while it locks the queue member, does not have to go out to the device state/extension state core.

Since this no longer appears to be a bug in the latest Asterisk 1.8, I'm going to go ahead and close this out as Fixed.