[Home]

Summary:ASTERISK-11730: Deadlock in chan_local
Reporter:callguy (callguy)Labels:
Date Opened:2008-03-26 10:46:31Date Closed:2008-04-07 11:08:48
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Channels/chan_local
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 12307.patch
( 1) csl-12307.txt
( 2) gdb_dump.30326.txt
Description:We are seeing in certain situations (mainly a SIP hairpin that redirects a call on the same server out to another user via the local channel) a deadlock will occur. Once the deadlock occurs the system will continue to process calls for a period of time, but will eventually hang completely.

Console output, core show locks, and output of ast_grab core attached.
Comments:By: callguy (callguy) 2008-03-26 10:51:03

Here are 2 asterisk log snippets.  In both cases notice the ERROR:

/usr/src/asterisk-trunk/1.4/r105049/asterisk-1.4/include/asterisk/lock.h: chan_local.c line 179 (local_queue_frame): Error releasing mutex: Operation not permitted

In both cases it the same person dialing a ten digit local number in their pbx, which causes a SIP hairpin and creation of a LOCAL channel.  

FIRST EXAMPLE

[Mar 24 12:08:28] VERBOSE[22690] logger.c:     -- Executing [15555555555@XXXX:1] Goto("SIP/XXXX-p2533-b6e6b2e8", "5555555555|1") in new stack
[Mar 24 12:08:28] VERBOSE[22690] logger.c:     -- Goto (XXXX,5555555555,1)
[Mar 24 12:08:28] VERBOSE[22690] logger.c:     -- Executing [5555555555@XXXX:1] Macro("SIP/XXXX-p2533-b6e6b2e8", "voip-nocongest-out|5555555555|srvr|70") in new stack
[Mar 24 12:08:28] VERBOSE[22690] logger.c:     -- Executing [s@macro-voip-nocongest-out:1] Dial("SIP/XXXX-p2533-b6e6b2e8", "SIP/5555555555@srvr|70") in new stack
[Mar 24 12:08:28] VERBOSE[22690] logger.c:     -- Called 5555555555@srvr
[Mar 24 12:08:28] VERBOSE[22690] logger.c:     -- Now forwarding SIP/XXXX-p2533-b6e6b2e8 to 'Local/5555555555@pstn-in' (thanks to SIP/srvr-0a6228d0)
[Mar 24 12:08:28] VERBOSE[22690] logger.c:     -- Local/5555555555@pstn-in-19eb,1 answered SIP/XXXX-p2533-b6e6b2e8
[Mar 24 12:09:34] ERROR[22690] /usr/src/asterisk-trunk/1.4/r105049/asterisk-1.4/include/asterisk/lock.h: chan_local.c line 179 (local_queue_frame): mutex '&us->lock' freed more times than we've locked!
[Mar 24 12:09:34] ERROR[22690] /usr/src/asterisk-trunk/1.4/r105049/asterisk-1.4/include/asterisk/lock.h: chan_local.c line 179 (local_queue_frame): Error releasing mutex: Operation not permitted [Mar 24 12:09:34] ERROR[22690] /usr/src/asterisk-trunk/1.4/r105049/asterisk-1.4/include/asterisk/lock.h: channel.c line 1226 (ast_channel_free): Error destroying mutex &chan->lock: Device or resource busy
[Mar 24 12:09:34] VERBOSE[22690] logger.c:   == Spawn extension (macro-voip-nocongest-out, s, 1) exited non-zero on 'SIP/XXXX-p2533-b6e6b2e8' in macro 'voip-nocongest-out'
[Mar 24 12:09:34] VERBOSE[22690] logger.c:   == Spawn extension (macro-voip-nocongest-out, s, 1) exited non-zero on 'SIP/XXXX-p2533-b6e6b2e8'


SECOND EXAMPLE

[Mar 24 12:52:14] VERBOSE[25184] logger.c:     -- Executing [15555555555@XXXX:1] Goto("SIP/XXXX-p2533-b6a98078", "5555555555|1") in new stack
[Mar 24 12:52:14] VERBOSE[25184] logger.c:     -- Goto (XXXX,5555555555,1)
[Mar 24 12:52:14] VERBOSE[25184] logger.c:     -- Executing [5555555555@XXXX:1] Macro("SIP/XXXX-p2533-b6a98078", "voip-nocongest-out|5555555555|srvr|70") in new stack
[Mar 24 12:52:14] VERBOSE[25184] logger.c:     -- Executing [s@macro-voip-nocongest-out:1] Dial("SIP/XXXX-p2533-b6a98078", "SIP/5555555555@srvr|70") in new stack
[Mar 24 12:52:14] VERBOSE[25184] logger.c:     -- Called 5555555555@srvr
[Mar 24 12:52:14] VERBOSE[25184] logger.c:     -- Now forwarding SIP/XXXX-p2533-b6a98078 to 'Local/5555555555@pstn-in' (thanks to SIP/srvr-09505590)
[Mar 24 12:52:14] VERBOSE[25184] logger.c:     -- Local/5555555555@pstn-in-f751,1 answered SIP/XXXX-p2533-b6a98078
[Mar 24 12:53:32] ERROR[25184] /usr/src/asterisk-trunk/1.4/r105049/asterisk-1.4/include/asterisk/lock.h: chan_local.c line 179 (local_queue_frame): mutex '&us->lock' freed more times than we've locked!
[Mar 24 12:53:32] ERROR[25184] /usr/src/asterisk-trunk/1.4/r105049/asterisk-1.4/include/asterisk/lock.h: chan_local.c line 179 (local_queue_frame): Error releasing mutex: Operation not permitted [Mar 24 12:54:04] ERROR[25184] /usr/src/asterisk-trunk/1.4/r105049/asterisk-1.4/include/asterisk/lock.h: channel.c line 1226 (ast_channel_free): Error destroying mutex &chan->lock: Device or resource busy
[Mar 24 12:54:04] VERBOSE[25184] logger.c:   == Spawn extension (macro-voip-nocongest-out, s, 1) exited non-zero on 'SIP/XXXX-p2533-b6a98078' in macro 'voip-nocongest-out'
[Mar 24 12:54:04] VERBOSE[25184] logger.c:   == Spawn extension (macro-voip-nocongest-out, s, 1) exited non-zero on 'SIP/XXXX-p2533-b6a98078'

By: Mark Michelson (mmichelson) 2008-03-26 16:25:33

callguy - great core show locks output. The problem is that I can't really get a good idea of the code path being followed that is leading from __ast_read to local_queue_frame. The backtrace you provided would have been helpful, but it is full of ??'s instead of useful function names. I'm not going to ask if you compiled with DONT_OPTIMIZE turned on because I know that you've uploaded backtraces before with that flag turned on and you still have those odd ?? backtraces. Unfortunately, I'm not sure what to do to get useful backtraces from your machine.

So, instead, the way I suggest tackling this is, if this is reliably reproducible, provide step-by-step instructions for how to reproduce so that I or someone else may be able to try to reproduce it locally. I can see from the output you've provided that apparently a person calls from a SIP phone to another SIP phone, which forwards to a Local extension...but from there it isn't clear what's happening.

By: callguy (callguy) 2008-03-26 16:44:41

putnopvut: I have no idea why we get that output - we use ast_grab_core and that's what it always ends-up with. I just checked and if I use gdb to re-run the output manually it looks more reasonable, so I'll upload that in a minute.

The issue with reproducibility here is that it is a combination that we can reproduce, but only under load and certain conditions that we can't completely figure out.

If you can't find what you need in the updated backtrace let me know and I can arrange to get you access to the box.

By: Mark Michelson (mmichelson) 2008-03-27 11:11:28

What appears to be the problem here is that there are two local channels that are, for lack of a better term, "pointed" at each other in such a way that the component channels of each local channel are the same as the component channels of the far end local channel. If this is somewhat confusing, I can provide a bit more information about how local channels work.

There are potentially two approaches to fixing this deadlock. First, if we can figure out how the local channels have become pointed at each other the way they are, we can figure out if this is a problem itself or if it is something that should be handleable. If it should be handleable, then the correct course of action would be to play with the locking order to make it so that this deadlock cannot happen. I suspect that this local channel configuration should be handleable. I plan on spending some time today trying to see if I can come up with the proper course of action.

By: Mark Michelson (mmichelson) 2008-03-27 13:21:31

I found the code path that causes this deadlock.

Here's the stack which leads to the problem:

__ast_read -> ast_read_generator_actions -> moh_files_generator -> ast_write -> local_write -> local_queue_frame

What was confusing before was that there actually is deadlock avoidance built into the local_queue_frame function. local_queue_frame is called with the assumption that the channel passed in is locked. local_queue_frame has to lock a second channel too, and if it fails to lock it, then it unlocks the alreday locked channel (as well as the local_pvt structure) to allow whatever thread is currently using the channel we're attempting to lock to finish with it. The problem here is that by following the code path shown above, the initial channel passed into local_queue_frame is locked TWICE beforehand. This means that releasing the lock on that channel does nothing since it was recursively locked.

Now that the problem has been identified, I'll work on finding a solution.

By: Mark Michelson (mmichelson) 2008-03-28 09:22:41

I have uploaded 12307.patch which should fix this problem. Please test to be sure that this deadlock is resolved. Thanks!

By: callguy (callguy) 2008-03-28 09:42:48

putnopvut: thanks for your help! we'll get this onto some boxes to test over the weekend and let you know how it goes.

By: Mark Michelson (mmichelson) 2008-04-01 14:22:01

Hi callguy. I was wondering if you were able to test over the weekend.

By: callguy (callguy) 2008-04-01 14:27:15

putnopvut: yes - we have it on one box, and haven't had any issues. I plan to put it on a few more tonight then let it run for a few days, which should give us a better idea of whether the problem is actually solved, but so far so good.

By: Digium Subversion (svnbot) 2008-04-07 11:04:06

Repository: asterisk
Revision: 113065

U   branches/1.4/main/channel.c

------------------------------------------------------------------------
r113065 | mmichelson | 2008-04-07 11:04:03 -0500 (Mon, 07 Apr 2008) | 13 lines

This fix prevents a deadlock that was experienced in chan_local. There was
deadlock prevention in place in chan_local, but it would not work in a specific
case because the channel was recursively locked. By unlocking the channel prior
to calling the generator's generate callback in ast_read_generator_actions(), we
prevent the recursive locking, and therefore the deadlock.

(closes issue ASTERISK-11730)
Reported by: callguy
Patches:
     12307.patch uploaded by putnopvut (license 60)
Tested by: callguy


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=113065

By: Digium Subversion (svnbot) 2008-04-07 11:07:49

Repository: asterisk
Revision: 113066

_U  trunk/
U   trunk/main/channel.c

------------------------------------------------------------------------
r113066 | mmichelson | 2008-04-07 11:07:48 -0500 (Mon, 07 Apr 2008) | 21 lines

Merged revisions 113065 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r113065 | mmichelson | 2008-04-07 11:08:45 -0500 (Mon, 07 Apr 2008) | 13 lines

This fix prevents a deadlock that was experienced in chan_local. There was
deadlock prevention in place in chan_local, but it would not work in a specific
case because the channel was recursively locked. By unlocking the channel prior
to calling the generator's generate callback in ast_read_generator_actions(), we
prevent the recursive locking, and therefore the deadlock.

(closes issue ASTERISK-11730)
Reported by: callguy
Patches:
     12307.patch uploaded by putnopvut (license 60)
Tested by: callguy


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=113066

By: Digium Subversion (svnbot) 2008-04-07 11:08:48

Repository: asterisk
Revision: 113067

_U  branches/1.6.0/
U   branches/1.6.0/main/channel.c

------------------------------------------------------------------------
r113067 | mmichelson | 2008-04-07 11:08:44 -0500 (Mon, 07 Apr 2008) | 29 lines

Merged revisions 113066 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r113066 | mmichelson | 2008-04-07 11:12:30 -0500 (Mon, 07 Apr 2008) | 21 lines

Merged revisions 113065 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r113065 | mmichelson | 2008-04-07 11:08:45 -0500 (Mon, 07 Apr 2008) | 13 lines

This fix prevents a deadlock that was experienced in chan_local. There was
deadlock prevention in place in chan_local, but it would not work in a specific
case because the channel was recursively locked. By unlocking the channel prior
to calling the generator's generate callback in ast_read_generator_actions(), we
prevent the recursive locking, and therefore the deadlock.

(closes issue ASTERISK-11730)
Reported by: callguy
Patches:
     12307.patch uploaded by putnopvut (license 60)
Tested by: callguy


........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=113067