|Summary:||ASTERISK-17760: [patch] deadlock in chan_sip|
|Reporter:||Byron Clark (byronclark)||Labels:|
|Date Opened:||2011-04-27 11:16:12||Date Closed:||2011-07-22 15:49:32|
|Environment:||Attachments:||( 0) 1.8_handle_statechange_deadlock_fix.patch|
( 1) ASTERISK-17760.patch
( 2) backtrace.txt
( 3) handle_statechange_deadlock_fix_v2.patch
( 4) handle_statechange_deadlock_fix.patch
( 5) issue_19191.patch
( 6) jira_asterisk_17760_v1.8.patch
|Description:||I've only seen this happen twice, both instances separated by about a month. |
The only symptom is that Asterisk stops responding to SIP packets.
This time I ran 'pkill -11 asterisk' when it happened so I could get a core dump. So, when looking at the backtrace, the segfault isn't the problem.
****** ADDITIONAL INFORMATION ******
I'm running asterisk 220.127.116.11.3 with patches for the following issues applied:
- issue ASTERISK-16961
- issue ASTERISK-17118
- issue ASTERISK-17268
|Comments:||By: Gregory Hinton Nietsky (irroot) 2011-04-27 12:12:12|
There seems to be a lock on adding a hint with ao2_callback handle_statechange
see ASTERISK-17666 and if you could supply "core show locks"
1.6.2 will be security release only.
By: Byron Clark (byronclark) 2011-04-27 12:14:13
I'll get the "core show locks" output when it happens again.
By: Byron Clark (byronclark) 2011-05-03 17:36:00
Here's the problem:
- The thread calling ast_get_hint (thread 38 in the backtrace) holds a lock on the hints container when it attempts to lock contexts by calling ast_rdlock_contexts().
- The thread calling ast_add_hint (thread 33 in the backtrace) holds a lock on the contexts (acquired with ast_rdlock_contexts()) and attempts to lock the hints container.
We're probably seeing this more than others because the initial hint setup in configuration requires a realtime lookup.
By: Byron Clark (byronclark) 2011-05-03 17:42:31
It appears that ast_rdlock_contexts() and ast_wrlock_contexts() both call ast_mutex_lock which in turn calls pthread_mutex_lock, so these definitely aren't rwlocks.
By: Byron Clark (byronclark) 2011-05-04 12:54:51
handle_statechange_deadlock_fix.patch (patch against 1.6.2 svn) takes the contexts lock before taking the lock on the hints container so that we don't get the out of order requests in ast_add_hint.
By: Byron Clark (byronclark) 2011-05-04 12:59:55
The same lock ordering problem (ast_hint_extension() tries to take the contexts lock while hints is already locked) in 1.8. 1.8_handle_statechange_deadlock_fix.patch is the same patch against 1.8 svn.
By: Byron Clark (byronclark) 2011-05-04 13:00:36
I'm assuming that it's safe to take the contexts lock when it will be taken again by the same thread because it is a pthread_mutex_t and thus recursive.
By: Byron Clark (byronclark) 2011-05-12 12:01:37
The original version of this patch exposed a deadlock in queue handling. handle_statechange_deadlock_fix_v2.patch adds the same locking semantics to queues (take the contexts lock before locking the queues container) as is used for hints.
Patch is against svn trunk but applies fairly cleanly to 1.8 (I'm testing it against 1.8.4 and 18.104.22.168.3).
By: Byron Clark (byronclark) 2011-05-16 11:24:10
These patches are currently unsafe. They reintroduce the deadlock fixed in issue ASTERISK-16961.
By: Byron Clark (byronclark) 2011-05-16 11:47:15
This deadlock only happens after the introduction of ASTERISK-16961 and is somewhat related to ASTERISK-17607.
By: Byron Clark (byronclark) 2011-05-20 09:32:21
issue_19191.patch (generated against trunk, but also needed for 1.8 and 1.6.2) is a far simpler patch that seems to be avoiding the deadlock on our production systems. The real issue is that ASTERISK-16961 changes the locking order for hints and contexts. This patch just makes sure that the hints lock is taken before the contexts lock on the path where we were seeing the deadlock.
By: Byron Clark (byronclark) 2011-06-06 10:35:06.519-0500
We ran into another location where the lock order being reversed causes a deadlock. [^ASTERISK-17760.patch] fixes both code paths.
By: Richard Mudgett (rmudgett) 2011-07-08 11:06:15.713-0500
[^jira_asterisk_17760_v1.8.patch] is the first patch posted in the reviewboard link
See the reviewboard description for more details.
By: Richard Mudgett (rmudgett) 2011-07-22 15:49:32.933-0500
Committed to -r329332 trunk.
Merged revisions 329331 via svnmerge from
r329331 | rmudgett | 2011-07-22 15:43:07 -0500 (Fri, 22 Jul 2011) | 55 lines
Merged revisions 329299 via svnmerge from
r329299 | rmudgett | 2011-07-22 10:44:58 -0500 (Fri, 22 Jul 2011) | 48 lines
Deadlocks dealing with dialplan hints during reload.
There are two remaining different deadlocks reported dealing with dialplan
The deadlock in ASTERISK-17666 is caused by invalid locking order in
ast_remove_hint(). The hints container must be locked before the hint
The deadlock in ASTERISK-17760 is caused by a catch-22 situation in
handle_statechange(). The deadlock is caused by not having the conlock
before calling the watcher callbacks. Unfortunately, having that lock
causes a different deadlock as reported in ASTERISK-16961.
* Fixed ast_remove_hint() locking order.
* Made handle_statechange() no longer call the watcher callbacks holding
any locks that matter.
* Made hint ao2 destructor do the watcher callbacks for extension
deactivation to guarantee that they get called.
* Fixed hint reference leak in ast_add_hint() if the callback container
* Fixed hint reference leak in complete_core_show_hint() for every hint it
found for CLI tab completion.
* Adjusted locking in ast_merge_contexts_and_delete() for safety.
* Added context_merge_lock to prevent ast_merge_contexts_and_delete() and
handle_statechange() from interfering with each other.
* Fixed ast_change_hint() not taking into account that the extension is
used for the hash key.
(closes issue ASTERISK-17666)
Reported by: irroot
Tested by: irroot
(closes issue ASTERISK-17760)
Reported by: Byron Clark
Tested by: irroot
By: Steven Wheeler (swheeler) 2011-09-29 15:01:41.735-0500
We are seeing this same problem in 1.6.2, any chance the fix will be back ported?
By: Richard Mudgett (rmudgett) 2011-09-29 15:12:46.488-0500
Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions.