Summary: ASTERISK-18740: Deadlock in queues during dialplan reload
Reporter: Byron Clark (byronclark)
Labels:
Date Opened: 2011-10-19 15:55:23
Date Closed: 2011-11-09 14:45:15.000-0600
Status: Closed/Complete
Components: Applications/app_queue PBX/pbx_config
Versions: Frequency of
is related to ASTERISK-19009 Deadlock on sip_new and load_realtime_queue.
Environment: Real Asterisk version: OS: CentOS 5.4. Platform: x86_64
Attachments:
( 0) after_jira_asterisk_18740_v1.8.patch_deadlock.txt
( 1) ASTERISK-18740.patch
( 2) backtrace.full.txt
( 3) backtrace.txt
( 4) jira_asterisk_18740_v1.8.patch
( 5) locks.txt
( 6) sip_exists_exten_dlock_2.diff
( 7) sip_exists_exten_dlock_3.diff
( 8) sip_exists_exten_dlock.diff
Description: On PBX nodes handling ~1200 SIP users, we're seeing a deadlock almost every time "dialplan reload" is run. The deadlock appears to take place only when a call to a queue is starting.

The deadlock is reproducible on a smaller system by running "while true; do asterisk -rx 'dialplan reload'; done" in a shell and then dialing an extension that places the call in a queue.
Comments:
By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:13:07.164-0500

I can see two options here:

1) Hold the channels container lock while trying to get the conlock, then release it again; or hold the channels lock while the dialplan is reloaded.
2) Unlock "tmp" in sip_new while calling ast_exists_extension. This is my favored option.
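
A minimal sketch of option 2, assuming plain pthread mutexes in place of Asterisk's own lock wrappers. The `struct channel` type, the `conlock` variable, the `exists_extension` and `set_exten_unlocked_lookup` functions, and the fallback argument are all hypothetical stand-ins for illustration, not the code from the attached patches:

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the dialplan contexts lock. */
static pthread_mutex_t conlock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical, heavily reduced channel: a lock plus an extension field. */
struct channel {
    pthread_mutex_t lock;
    char exten[32];
};

/* Stand-in for a dialplan lookup that must take conlock.
 * For the sketch, only extension "100" is treated as existing. */
static int exists_extension(const char *exten)
{
    int found;

    pthread_mutex_lock(&conlock);
    found = (strcmp(exten, "100") == 0);
    pthread_mutex_unlock(&conlock);
    return found;
}

/* Called with chan->lock held; returns with chan->lock held.
 * The channel lock is dropped around the lookup so that this thread
 * never waits on conlock while still holding the channel lock. */
static void set_exten_unlocked_lookup(struct channel *chan,
                                      const char *wanted,
                                      const char *fallback)
{
    int found;

    pthread_mutex_unlock(&chan->lock);   /* release before the lookup */
    found = exists_extension(wanted);
    pthread_mutex_lock(&chan->lock);     /* retake before changing exten */

    snprintf(chan->exten, sizeof(chan->exten), "%s",
             found ? wanted : fallback);
}
```

The sketch relies on the same assumption raised in the comments that follow: the channel must not be touched by anyone else while it is briefly unlocked.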

By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:16:49.370-0500

I don't like deadlocks!

Patch that unlocks the channel to prevent the deadlock during a dialplan reload.

By: Byron Clark (byronclark) 2011-10-20 09:21:53.512-0500

That patch does seem like the simplest route, but is it safe to unlock tmp there?

By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:24:34.304-0500

As it has just been created and is sitting around waiting to do something (i.e. no PBX thread is running on it), it should be fine. We are holding the ref to it, so it won't disappear either.

The lock should come after the if and else. I'll post a v2 soon, just to be pedantic, so that the lock is held while the exten is being set.

By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:25:34.588-0500

More pedantic: lock before changing exten.

By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:27:39.190-0500

The one thing I see is that the locking order gets messed up. This can lead to problems, as the channel will be locked after the pvt.

By: Gregory Hinton Nietsky (irroot) 2011-10-20 09:59:47.295-0500

Make sure the locking order is not inverted.
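
One way to keep the order the previous comment asks for (channel first, then pvt) can be sketched as follows, again with plain pthread mutexes. The `struct locks` type and the `relock_in_order` function are hypothetical names for illustration, and the sketch does not reproduce the attached diff:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for a channel lock and a SIP pvt lock. */
struct locks {
    pthread_mutex_t chan;
    pthread_mutex_t pvt;
};

/* Called with only l->pvt held.
 * Returns with both locks held, acquired in channel-then-pvt order.
 * Taking chan while still holding pvt would invert that order, so pvt
 * is released first and retaken after the channel lock. */
static void relock_in_order(struct locks *l)
{
    pthread_mutex_unlock(&l->pvt);
    pthread_mutex_lock(&l->chan);
    pthread_mutex_lock(&l->pvt);
}
```

Because pvt is released for a moment, a caller following this pattern would have to treat any pvt state it read earlier as possibly stale.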

By: Byron Clark (byronclark) 2011-10-20 10:41:16.390-0500

Thanks for the patch. I'm going to build and test under load.

By: Byron Clark (byronclark) 2011-10-20 12:47:15.384-0500

[^ASTERISK-18740.patch] is a slightly modified version of [^sip_exists_exten_dlock_3.diff]. It uses the correct locking function on the channel.

I've been unable to reproduce the deadlock with the patch.

By: Gregory Hinton Nietsky (irroot) 2011-10-20 12:59:37.547-0500

Shot, thanks for the feedback ... nice job.

By: Byron Clark (byronclark) 2011-10-24 16:55:50.085-0500

[^ASTERISK-18740.patch] has been running under full load for a few days now. No deadlocks, and no corruption that I've seen.

By: Richard Mudgett (rmudgett) 2011-11-08 18:45:37.635-0600

[^jira_asterisk_18740_v1.8.patch] should take care of the deadlock in the general case of a hint extension callback.  This patch is a continuation of the fix for ASTERISK-17760.

I was initially going to just commit [^ASTERISK-18740.patch] as the most expedient way to avoid the deadlock.  However, I found that the hint callback notifying of a hint extension removal during a dialplan merge was called with too many locks held.

By: Richard Mudgett (rmudgett) 2011-11-08 18:51:27.724-0600

Please test the new patch.

By: Byron Clark (byronclark) 2011-11-09 10:49:02.475-0600

I'm still getting the deadlock with [^jira_asterisk_18740_v1.8.patch]. [^after_jira_asterisk_18740_v1.8.patch_deadlock.txt] is the "core show locks" output.

By: Byron Clark (byronclark) 2011-11-09 10:50:42.594-0600

Most recent testing was done against asterisk with this patch applied.

By: Richard Mudgett (rmudgett) 2011-11-09 14:09:28.544-0600

I finally found the backtrace path that the core show locks file inconveniently lists as unknown calls between ast_merge_contexts_and_delete() and ast_device_state().  It is the call to ast_add_extension_nolock() in ast_merge_contexts_and_delete().  The changes [^jira_asterisk_18740_v1.8.patch] makes are after the deadlock path.

I am just going to commit [^ASTERISK-18740.patch] as it adequately fixes this deadlock.