[Home]

Summary:ASTERISK-14585: [patch] Deadlock after peer answeres queue call
Reporter:Christoph Stadlmann (cstadlmann)Labels:
Date Opened:2009-08-03 08:43:02Date Closed:2011-06-07 14:01:05
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/Channels
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) _chan_sip.c.patch
Description:We have the following set up:

Asterisk is hosting a queue with 4 members, ring strategy 'least recent'. The complete setup is realtime, but I don't think this has anything to do with the deadlock.

The queue is playing MOH, which is a wav-file (no .mp3). After the SIP peer answeres the call, MOH stops playing, and exactly in this moment the deadlock occurs.

It does not matter if queue members are 'Local/' or 'SIP/' channels, patching Asterisk as stated in Bug 14112 only makes the deadlock occur every 6 hours, with SIP channels the deadlock occurs every 3 hours or so.

The only way to get Asterisk running again is to 'kill -9' the process. No other CLI command than one 'core show locks' is working, any subsequent CLI command does not produce any response anymore.

****** ADDITIONAL INFORMATION ******

   -- Executing [s-A1_HV@extensions-queue:1] Answer("SIP/2244404001-083a4bc8", "") in new stack
   -- Executing [s-A1_HV@extensions-queue:2] Set("SIP/2244404001-083a4bc8", "CDR(accountcode)=Q-1249305180.297") in new stack
   -- Executing [s-A1_HV@extensions-queue:3] Set("SIP/2244404001-083a4bc8", "_QUID=Q-1249305180.297") in new stack
   -- Executing [s-A1_HV@extensions-queue:4] Set("SIP/2244404001-083a4bc8", "MONITOR_FILENAME=/var/spool/asterisk/monitor/qs-A1_HV--1249305180.297") in new stack
   -- Executing [s-A1_HV@extensions-queue:5] Playback("SIP/2244404001-083a4bc8", "custom/ringtone_1s") in new stack
   -- <SIP/2244404001-083a4bc8> Playing 'custom/ringtone_1s' (language 'de')
   -- Executing [s-A1_HV@extensions-queue:6] Queue("SIP/2244404001-083a4bc8", "A1_HV|t|||") in new stack
   -- Started music on hold, class 'A1HVPeak', on SIP/2244404001-083a4bc8
Extension Changed 601[internal] new state Ringing for Notify User 603
Extension Changed 601[internal] new state Ringing for Notify User 601
Extension Changed 601[internal] new state Ringing for Notify User 602
Extension Changed 601[internal] new state Ringing for Notify User 604
Extension Changed 601[internal] new state Ringing for Notify User 600
   -- SIP/601-083a62b8 is ringing
   -- SIP/601-083a62b8 is ringing
   -- SIP/601-083a62b8 is ringing
   -- SIP/601-083a62b8 is ringing
Extension Changed 601[internal] new state InUse for Notify User 603
Extension Changed 601[internal] new state InUse for Notify User 601
Extension Changed 601[internal] new state InUse for Notify User 602
Extension Changed 601[internal] new state InUse for Notify User 604
Extension Changed 601[internal] new state InUse for Notify User 600
   -- SIP/601-083a62b8 answered SIP/2244404001-083a4bc8
   -- Stopped music on hold on SIP/2244404001-083a4bc8
smartelpbx*CLI> core show locks
smartelpbx*CLI>
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3062475664 (do_monitor           started at [16811] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 16659 do_monitor &iflock 0xb691eba0 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 16671 do_monitor &sip->lock 0x83a62b8 (1)
=== ---> Waiting for Lock #2 (pbx.c): MUTEX 5785 pbx_builtin_getvar_helper (channel lock) 0x84227f8 (1)
=== --- ---> Locked Here: channel.c line 2565 (ast_indicate_data)
=== -------------------------------------------------------------------
===
=== Thread ID: 3059358608 (pbx_thread           started at [ 2597] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2565 ast_indicate_data (channel lock) 0x84227f8 (1)
=== ---> Waiting for Lock #1 (chan_sip.c): MUTEX 4017 sip_indicate &p->lock 0x83a62b8 (1)
=== --- ---> Locked Here: chan_sip.c line 16671 (do_monitor)
=== -------------------------------------------------------------------
===
=======================================================================
Comments:By: Christoph Stadlmann (cstadlmann) 2009-08-05 09:54:01

Asterisk just crashed again...

smartelpbx*CLI> core show locks
smartelpbx*CLI>
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3062631312 (do_monitor           started at [16811] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 16659 do_monitor &iflock 0xb6944fa0 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 16671 do_monitor &sip->lock 0x83d6640 (1)
=== ---> Waiting for Lock #2 (pbx.c): MUTEX 5785 pbx_builtin_getvar_helper (channel lock) 0x839ac68 (1)
=== --- ---> Locked Here: channel.c line 2565 (ast_indicate_data)
=== -------------------------------------------------------------------
===
=== Thread ID: 3060509584 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81b6ac0 (1)
=== ---> Tried and failed to get Lock #1 (channel.c): MUTEX 1127 channel_find_locked (channel lock) 0x839ac68 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 3059051408 (pbx_thread           started at [ 2597] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2565 ast_indicate_data (channel lock) 0x839ac68 (1)
=== ---> Waiting for Lock #1 (chan_sip.c): MUTEX 4017 sip_indicate &p->lock 0x83d6640 (1)
=== --- ---> Locked Here: chan_sip.c line 16671 (do_monitor)
=== -------------------------------------------------------------------
===
=======================================================================

Maybe the whole issue has something to do with indication and sip notify message processing?

By: Christoph Stadlmann (cstadlmann) 2009-08-10 09:07:21

I recently upgraded to SVN-branch-1.4-r210575, Asterisk was now running für 4 days without problems, but just now the deadlock occurred again, just after a peer answered a queue call.
I'm using 'Local/' channels right now for queue members because it seems that Asterisk is more stable than using 'SIP/' channel members...


smartelpbx*CLI> core show locks
smartelpbx*CLI>
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3062274960 (do_monitor           started at [16891] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 16739 do_monitor &iflock 0xb68ef2e0 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 16751 do_monitor &sip->lock 0x83fbc18 (1)
=== ---> Waiting for Lock #2 (pbx.c): MUTEX 5814 pbx_builtin_getvar_helper (channel lock) 0x8864498 (1)
=== --- ---> Locked Here: rtp.c line 1546 (ast_rtp_early_bridge)
=== -------------------------------------------------------------------
===
=== Thread ID: 3060153232 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81b7c00 (1)
=== ---> Lock #1 (channel.c): MUTEX 1090 channel_find_locked &(&channels)->lock 0x81ab448 (1)
=== ---> Tried and failed to get Lock #2 (channel.c): MUTEX 1127 channel_find_locked (channel lock) 0x8864498 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 3058936720 (pbx_thread           started at [ 2623] pbx.c ast_pbx_start())
=== ---> Lock #0 (rtp.c): MUTEX 1544 ast_rtp_early_bridge (channel lock) 0x83f32c0 (1)
=== ---> Lock #1 (rtp.c): MUTEX 1546 ast_rtp_early_bridge (channel lock) 0x8864498 (1)
=== ---> Waiting for Lock #2 (chan_sip.c): MUTEX 18798 sip_get_rtp_peer &p->lock 0x83fbc18 (1)
=== --- ---> Locked Here: chan_sip.c line 16751 (do_monitor)
=== -------------------------------------------------------------------
===
=======================================================================


Please can anybody tell me what's wrong?

By: Mark Michelson (mmichelson) 2009-08-10 10:54:05

The problem in your output is quite clear. The first thread is attempting to lock a sip_pvt and then a channel, and the third thread is trying to lock the channel and then the sip_pvt. This lock ordering is what is causing the deadlock to occur.

That being said, there are some strange things occurring here. Firstly, if I update my copy of 1.4 to rev 210575, The line numbers in chan_sip.c do not match the line numbers in your core show locks output. As an example, the first thread in your output shows that sip->lock was acquired on line 16751. In my copy, that line is in the middle of a multi-line if statement which is not attempting to acquire any locks. Secondly, I cannot find any place in that section of code which calls any function which would result in pbx_builtin_getvar_helper being called without first safely acquiring the channel lock. Is your copy of chan_sip modified in some way? I do not see how such a deadlock could be occurring.

By: Christoph Stadlmann (cstadlmann) 2009-08-10 12:02:19

mmichelson,

sorry, I forgot to mention that I have some patches and backports in my chan_sip.c.
I'll attach the patches, but don't add a disclaimer, because almost all code is written by someone else.

By: Christoph Stadlmann (cstadlmann) 2009-08-10 12:11:05

Just right before the line 16751 of chan_sip.c, there are the following lines in the code:

/*! \note If we can't get a lock on an interface, skip it and come
* back later. Note that there is the possibility of a deadlock with
* sip_hangup otherwise, because sip_hangup is called with the channel
* locked first, and the iface lock is attempted second.
*/
if (ast_mutex_trylock(&sip->lock))
continue;

Sorry, I was a kind of blind not to read what's in the comment, which states that there IS a possibility of deadlock...

So please could you assist in coping with that issue?

By: Mark Michelson (mmichelson) 2009-08-10 15:45:04

I see that comment, but using svn annotate, I can see that the DEADLOCK_AVOIDANCE code was added after that comment was put in the code. The author of the DEADLOCK_AVOIDANCE loop probably should have removed that comment, since it is not accurate. Of course, I may just be missing some operation causing pbx_builtin_getvar_helper to be called in between acquiring the sip_pvt lock and safely acquiring the channel lock. I'll take another look to be certain.

By: Mark Michelson (mmichelson) 2009-08-10 15:53:05

All right, I have verified that the problem is not in a stock build of Asterisk 1.4, and the problem is in your patch.

In do_monitor, after acquiring the sip_pvt lock, you are calling your transmit_cid_update function. One of the first operations it does is to call pbx_builtin_getvar_helper (specifically, it is getting the value of the "CHANNELPICKUP" variable). This is what is causing the deadlock. This can be avoided by using a DEADLOCK_AVOIDANCE loop similar to what is done later in the do_monitor function. If you add such a loop just before calling transmit_cid_update, the problem should go away.

Since the problem is in your patch and not in the code we distribute, I am going to close this issue.