Summary: | ASTERISK-14585: [patch] Deadlock after peer answeres queue call | ||
Reporter: | Christoph Stadlmann (cstadlmann) | Labels: | |
Date Opened: | 2009-08-03 08:43:02 | Date Closed: | 2011-06-07 14:01:05 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/Channels |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) _chan_sip.c.patch | |
Description: | We have the following set up: Asterisk is hosting a queue with 4 members, ring strategy 'least recent'. The complete setup is realtime, but I don't think this has anything to do with the deadlock. The queue is playing MOH, which is a wav-file (no .mp3). After the SIP peer answeres the call, MOH stops playing, and exactly in this moment the deadlock occurs. It does not matter if queue members are 'Local/' or 'SIP/' channels, patching Asterisk as stated in Bug 14112 only makes the deadlock occur every 6 hours, with SIP channels the deadlock occurs every 3 hours or so. The only way to get Asterisk running again is to 'kill -9' the process. No other CLI command than one 'core show locks' is working, any subsequent CLI command does not produce any response anymore. ****** ADDITIONAL INFORMATION ****** -- Executing [s-A1_HV@extensions-queue:1] Answer("SIP/2244404001-083a4bc8", "") in new stack -- Executing [s-A1_HV@extensions-queue:2] Set("SIP/2244404001-083a4bc8", "CDR(accountcode)=Q-1249305180.297") in new stack -- Executing [s-A1_HV@extensions-queue:3] Set("SIP/2244404001-083a4bc8", "_QUID=Q-1249305180.297") in new stack -- Executing [s-A1_HV@extensions-queue:4] Set("SIP/2244404001-083a4bc8", "MONITOR_FILENAME=/var/spool/asterisk/monitor/qs-A1_HV--1249305180.297") in new stack -- Executing [s-A1_HV@extensions-queue:5] Playback("SIP/2244404001-083a4bc8", "custom/ringtone_1s") in new stack -- <SIP/2244404001-083a4bc8> Playing 'custom/ringtone_1s' (language 'de') -- Executing [s-A1_HV@extensions-queue:6] Queue("SIP/2244404001-083a4bc8", "A1_HV|t|||") in new stack -- Started music on hold, class 'A1HVPeak', on SIP/2244404001-083a4bc8 Extension Changed 601[internal] new state Ringing for Notify User 603 Extension Changed 601[internal] new state Ringing for Notify User 601 Extension Changed 601[internal] new state Ringing for Notify User 602 Extension Changed 601[internal] new state Ringing for Notify User 604 Extension Changed 601[internal] new state Ringing for Notify User 600 -- SIP/601-083a62b8 is ringing -- SIP/601-083a62b8 is ringing -- SIP/601-083a62b8 is ringing -- SIP/601-083a62b8 is ringing Extension Changed 601[internal] new state InUse for Notify User 603 Extension Changed 601[internal] new state InUse for Notify User 601 Extension Changed 601[internal] new state InUse for Notify User 602 Extension Changed 601[internal] new state InUse for Notify User 604 Extension Changed 601[internal] new state InUse for Notify User 600 -- SIP/601-083a62b8 answered SIP/2244404001-083a4bc8 -- Stopped music on hold on SIP/2244404001-083a4bc8 smartelpbx*CLI> core show locks smartelpbx*CLI> ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <file> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: 3062475664 (do_monitor started at [16811] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 16659 do_monitor &iflock 0xb691eba0 (1) === ---> Lock #1 (chan_sip.c): MUTEX 16671 do_monitor &sip->lock 0x83a62b8 (1) === ---> Waiting for Lock #2 (pbx.c): MUTEX 5785 pbx_builtin_getvar_helper (channel lock) 0x84227f8 (1) === --- ---> Locked Here: channel.c line 2565 (ast_indicate_data) === ------------------------------------------------------------------- === === Thread ID: 3059358608 (pbx_thread started at [ 2597] pbx.c ast_pbx_start()) === ---> Lock #0 (channel.c): MUTEX 2565 ast_indicate_data (channel lock) 0x84227f8 (1) === ---> Waiting for Lock #1 (chan_sip.c): MUTEX 4017 sip_indicate &p->lock 0x83a62b8 (1) === --- ---> Locked Here: chan_sip.c line 16671 (do_monitor) === ------------------------------------------------------------------- === ======================================================================= | ||
Comments: | By: Christoph Stadlmann (cstadlmann) 2009-08-05 09:54:01 Asterisk just crashed again... smartelpbx*CLI> core show locks smartelpbx*CLI> ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <file> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: 3062631312 (do_monitor started at [16811] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 16659 do_monitor &iflock 0xb6944fa0 (1) === ---> Lock #1 (chan_sip.c): MUTEX 16671 do_monitor &sip->lock 0x83d6640 (1) === ---> Waiting for Lock #2 (pbx.c): MUTEX 5785 pbx_builtin_getvar_helper (channel lock) 0x839ac68 (1) === --- ---> Locked Here: channel.c line 2565 (ast_indicate_data) === ------------------------------------------------------------------- === === Thread ID: 3060509584 (session_do started at [ 2522] manager.c accept_thread()) === ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81b6ac0 (1) === ---> Tried and failed to get Lock #1 (channel.c): MUTEX 1127 channel_find_locked (channel lock) 0x839ac68 (0) === ------------------------------------------------------------------- === === Thread ID: 3059051408 (pbx_thread started at [ 2597] pbx.c ast_pbx_start()) === ---> Lock #0 (channel.c): MUTEX 2565 ast_indicate_data (channel lock) 0x839ac68 (1) === ---> Waiting for Lock #1 (chan_sip.c): MUTEX 4017 sip_indicate &p->lock 0x83d6640 (1) === --- ---> Locked Here: chan_sip.c line 16671 (do_monitor) === ------------------------------------------------------------------- === ======================================================================= Maybe the whole issue has something to do with indication and sip notify message processing? By: Christoph Stadlmann (cstadlmann) 2009-08-10 09:07:21 I recently upgraded to SVN-branch-1.4-r210575, Asterisk was now running für 4 days without problems, but just now the deadlock occurred again, just after a peer answered a queue call. I'm using 'Local/' channels right now for queue members because it seems that Asterisk is more stable than using 'SIP/' channel members... smartelpbx*CLI> core show locks smartelpbx*CLI> ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <file> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: 3062274960 (do_monitor started at [16891] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 16739 do_monitor &iflock 0xb68ef2e0 (1) === ---> Lock #1 (chan_sip.c): MUTEX 16751 do_monitor &sip->lock 0x83fbc18 (1) === ---> Waiting for Lock #2 (pbx.c): MUTEX 5814 pbx_builtin_getvar_helper (channel lock) 0x8864498 (1) === --- ---> Locked Here: rtp.c line 1546 (ast_rtp_early_bridge) === ------------------------------------------------------------------- === === Thread ID: 3060153232 (session_do started at [ 2522] manager.c accept_thread()) === ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81b7c00 (1) === ---> Lock #1 (channel.c): MUTEX 1090 channel_find_locked &(&channels)->lock 0x81ab448 (1) === ---> Tried and failed to get Lock #2 (channel.c): MUTEX 1127 channel_find_locked (channel lock) 0x8864498 (0) === ------------------------------------------------------------------- === === Thread ID: 3058936720 (pbx_thread started at [ 2623] pbx.c ast_pbx_start()) === ---> Lock #0 (rtp.c): MUTEX 1544 ast_rtp_early_bridge (channel lock) 0x83f32c0 (1) === ---> Lock #1 (rtp.c): MUTEX 1546 ast_rtp_early_bridge (channel lock) 0x8864498 (1) === ---> Waiting for Lock #2 (chan_sip.c): MUTEX 18798 sip_get_rtp_peer &p->lock 0x83fbc18 (1) === --- ---> Locked Here: chan_sip.c line 16751 (do_monitor) === ------------------------------------------------------------------- === ======================================================================= Please can anybody tell me what's wrong? By: Mark Michelson (mmichelson) 2009-08-10 10:54:05 The problem in your output is quite clear. The first thread is attempting to lock a sip_pvt and then a channel, and the third thread is trying to lock the channel and then the sip_pvt. This lock ordering is what is causing the deadlock to occur. That being said, there are some strange things occurring here. Firstly, if I update my copy of 1.4 to rev 210575, The line numbers in chan_sip.c do not match the line numbers in your core show locks output. As an example, the first thread in your output shows that sip->lock was acquired on line 16751. In my copy, that line is in the middle of a multi-line if statement which is not attempting to acquire any locks. Secondly, I cannot find any place in that section of code which calls any function which would result in pbx_builtin_getvar_helper being called without first safely acquiring the channel lock. Is your copy of chan_sip modified in some way? I do not see how such a deadlock could be occurring. By: Christoph Stadlmann (cstadlmann) 2009-08-10 12:02:19 mmichelson, sorry, I forgot to mention that I have some patches and backports in my chan_sip.c. I'll attach the patches, but don't add a disclaimer, because almost all code is written by someone else. By: Christoph Stadlmann (cstadlmann) 2009-08-10 12:11:05 Just right before the line 16751 of chan_sip.c, there are the following lines in the code: /*! \note If we can't get a lock on an interface, skip it and come * back later. Note that there is the possibility of a deadlock with * sip_hangup otherwise, because sip_hangup is called with the channel * locked first, and the iface lock is attempted second. */ if (ast_mutex_trylock(&sip->lock)) continue; Sorry, I was a kind of blind not to read what's in the comment, which states that there IS a possibility of deadlock... So please could you assist in coping with that issue? By: Mark Michelson (mmichelson) 2009-08-10 15:45:04 I see that comment, but using svn annotate, I can see that the DEADLOCK_AVOIDANCE code was added after that comment was put in the code. The author of the DEADLOCK_AVOIDANCE loop probably should have removed that comment, since it is not accurate. Of course, I may just be missing some operation causing pbx_builtin_getvar_helper to be called in between acquiring the sip_pvt lock and safely acquiring the channel lock. I'll take another look to be certain. By: Mark Michelson (mmichelson) 2009-08-10 15:53:05 All right, I have verified that the problem is not in a stock build of Asterisk 1.4, and the problem is in your patch. In do_monitor, after acquiring the sip_pvt lock, you are calling your transmit_cid_update function. One of the first operations it does is to call pbx_builtin_getvar_helper (specifically, it is getting the value of the "CHANNELPICKUP" variable). This is what is causing the deadlock. This can be avoided by using a DEADLOCK_AVOIDANCE loop similar to what is done later in the do_monitor function. If you add such a loop just before calling transmit_cid_update, the problem should go away. Since the problem is in your patch and not in the code we distribute, I am going to close this issue. |