|Summary:||ASTERISK-18740: Deadlock in queues during dialplan reload|
|Reporter:||Byron Clark (byronclark)||Labels:|
|Date Opened:||2011-10-19 15:55:23||Date Closed:||2011-11-09 14:45:15.000-0600|
|Environment:||Real Asterisk version: 184.108.40.206-rc2 OS: CentOS 5.4. Platform: x86_64||Attachments:||( 0) after_jira_asterisk_18740_v1.8.patch_deadlock.txt|
( 1) ASTERISK-18740.patch
( 2) backtrace.full.txt
( 3) backtrace.txt
( 4) jira_asterisk_18740_v1.8.patch
( 5) locks.txt
( 6) sip_exists_exten_dlock_2.diff
( 7) sip_exists_exten_dlock_3.diff
( 8) sip_exists_exten_dlock.diff
|Description:||On PBX nodes handling ~1200 SIP users, we're seeing a deadlock almost every time "dialplan reload" is run. The deadlock appears to take place only when a call to a queue is starting.|
The deadlock is reproducible on a smaller system by running "while true; do asterisk -rx 'dialplan reload'; done" in a shell and then dialing an extension that places the call in a queue.
|Comments:||By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:13:07.164-0500|
I can see 2 options here:
1) Hold the channels container lock while trying to get the conlock, then release it again,
or hold the channels lock while the dialplan is reloaded.
2) Unlock "tmp" in sip_new while calling ast_exists_extension. This is my favored option.
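To illustrate option 2, here is a minimal sketch in Python rather than Asterisk's C. It assumes a simplified two-lock model: one lock stands in for the channel ("tmp") and one for the conlock, and the reload path is assumed to need the channel lock while it holds the conlock. The names simulate, new_call and reload_dialplan are hypothetical, and timeouts are used so a would-be deadlock shows up as a failed lookup instead of a hang.

```python
import threading


def simulate(release_channel_first, lookup_timeout=0.5, reload_timeout=3.0):
    """Return True if the dialplan lookup completed, False if it was blocked.

    Simplified model (an assumption, not Asterisk code): the new-call path
    holds the channel lock and needs the contexts lock; the reload path
    holds the contexts lock and needs the channel lock.
    """
    channel_lock = threading.Lock()    # stands in for the lock on "tmp"
    contexts_lock = threading.Lock()   # stands in for the conlock
    channel_held = threading.Event()
    contexts_held = threading.Event()
    result = {}

    def reload_dialplan():
        channel_held.wait()            # force the problematic interleaving
        with contexts_lock:
            contexts_held.set()
            # The reload path ends up needing the channel lock.
            if channel_lock.acquire(timeout=reload_timeout):
                channel_lock.release()

    def new_call():
        channel_lock.acquire()
        channel_held.set()
        contexts_held.wait()
        if release_channel_first:
            channel_lock.release()     # option 2: unlock before the lookup
        # Model of the extension lookup: it needs the contexts lock.
        got = contexts_lock.acquire(timeout=lookup_timeout)
        if got:
            contexts_lock.release()
        if release_channel_first:
            channel_lock.acquire()     # relock once the lookup is done
        channel_lock.release()
        result["lookup_completed"] = got

    threads = [threading.Thread(target=reload_dialplan),
               threading.Thread(target=new_call)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["lookup_completed"]


if __name__ == "__main__":
    print("lookup with channel held:    ", simulate(False))
    print("lookup with channel released:", simulate(True))
```

In this model the lookup only gets through when the channel lock is dropped first. Without the timeouts, the first case would leave both threads waiting on each other indefinitely.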
By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:16:49.370-0500
I don't like deadlocks!!!
Patch that unlocks the channel to prevent the deadlock during dialplan reload.
By: Byron Clark (byronclark) 2011-10-20 09:21:53.512-0500
That patch does seem like the simplest route, but is it safe to unlock tmp there?
By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:24:34.304-0500
As it has just been created and is sitting around waiting to do something (i.e. no PBX thread is running on it), it should be fine. We are holding the ref to it, so it won't disappear either.
The lock should come after the if and else. I'll post a v2 soon, just to be pedantic about locking while the exten is being set.
By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:25:34.588-0500
More pedantic: lock before changing exten.
By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:27:39.190-0500
The one thing I see is that the locking order gets messed up. This can lead to problems, as the channel will be locked after the pvt.
By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:59:47.295-0500
Make sure the locking order is not inverted.
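The ordering concern can be sketched as follows, again in Python rather than C. It assumes the intended order is channel first, then pvt, as the comment about the channel being locked after the pvt implies. OrderedLock, LockOrderError, relock_naive and relock_ordered are hypothetical helpers for illustration only and are not part of Asterisk.

```python
import threading


class LockOrderError(RuntimeError):
    """Raised when a lock is requested out of rank order."""


class OrderedLock:
    """A lock with a rank; lower-ranked locks must be taken first."""

    _held = threading.local()          # per-thread list of held locks

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self._lock = threading.Lock()

    def acquire(self):
        held = getattr(OrderedLock._held, "stack", [])
        for other in held:
            if other.rank > self.rank:
                raise LockOrderError(
                    "%s requested while holding %s" % (self.name, other.name))
        self._lock.acquire()
        OrderedLock._held.stack = held + [self]

    def release(self):
        OrderedLock._held.stack.remove(self)
        self._lock.release()


# Assumed order: channel (rank 1) before pvt (rank 2).
channel = OrderedLock("channel", 1)
pvt = OrderedLock("pvt", 2)


def relock_naive():
    """Relock the channel while still holding the pvt: inverted order."""
    pvt.acquire()                      # the caller already holds the pvt
    try:
        channel.acquire()              # channel taken after pvt
    except LockOrderError:
        return False
    else:
        channel.release()
        return True
    finally:
        pvt.release()


def relock_ordered():
    """Drop the pvt, take the channel, then retake the pvt."""
    pvt.acquire()                      # the caller already holds the pvt
    pvt.release()                      # give it up first...
    channel.acquire()                  # ...so the channel is taken first
    pvt.acquire()
    pvt.release()
    channel.release()
    return True


if __name__ == "__main__":
    print("naive relock keeps order:  ", relock_naive())
    print("ordered relock keeps order:", relock_ordered())
```

The checker only reports an inversion; it does not prove that a given patch is safe, since briefly dropping the pvt opens a window in which other threads can run.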
By: Byron Clark (byronclark) 2011-10-20 10:41:16.390-0500
Thanks for the patch. I'm going to build and test under load.
By: Byron Clark (byronclark) 2011-10-20 12:47:15.384-0500
[^ASTERISK-18740.patch] is a slightly modified version of [^sip_exists_exten_dlock_3.diff]. It uses the correct locking function on the channel.
I've been unable to reproduce the deadlock with the patch.
By: Gregory Hinton Nietsky (irroot) 2011-10-20 12:59:37.547-0500
Shot, thanks for the feedback... nice job.
By: Byron Clark (byronclark) 2011-10-24 16:55:50.085-0500
[^ASTERISK-18740.patch] has been running under full load for a few days now. No deadlocks, and no corruption that I've seen.
By: Richard Mudgett (rmudgett) 2011-11-08 18:45:37.635-0600
[^jira_asterisk_18740_v1.8.patch] should take care of the deadlock in the general case of a hint extension callback. This patch is a continuation of the fix for ASTERISK-17760.
I was initially going to just commit [^ASTERISK-18740.patch] as the most expedient way to avoid the deadlock. However, I found that the hint callback notifying of a hint extension removal during a dialplan merge was called with too many locks held.
By: Richard Mudgett (rmudgett) 2011-11-08 18:51:27.724-0600
Please test the new patch.
By: Byron Clark (byronclark) 2011-11-09 10:49:02.475-0600
I'm still getting the deadlock with [^jira_asterisk_18740_v1.8.patch]. [^after_jira_asterisk_18740_v1.8.patch_deadlock.txt] is the "core show locks" output.
By: Byron Clark (byronclark) 2011-11-09 10:50:42.594-0600
Most recent testing was done against asterisk 220.127.116.11-rc2 with this patch applied.
By: Richard Mudgett (rmudgett) 2011-11-09 14:09:28.544-0600
I finally found the backtrace path that the core show locks file inconveniently lists as unknown calls between ast_merge_contexts_and_delete() and ast_device_state(). It is the call to ast_add_extension_nolock() in ast_merge_contexts_and_delete(). The changes [^jira_asterisk_18740_v1.8.patch] makes are after the deadlock path.
I am just going to commit [^ASTERISK-18740.patch] as it adequately fixes this deadlock.