Summary:ASTERISK-17760: [patch] deadlock in chan_sip
Reporter:Byron Clark (byronclark)Labels:
Date Opened:2011-04-27 11:16:12Date Closed:2011-07-22 15:49:32
Versions:Frequency of
is duplicated byASTERISK-18156 SIP messages stop being processed with res_timing_dahdi
is duplicated byASTERISK-18205 Deadlock in app_queue when loading real-time queues and handling state change.
is related toASTERISK-17666 [patch] Deadlock in ast_remove_hint Race with ao2_callback
Environment:Attachments:( 0) 1.8_handle_statechange_deadlock_fix.patch
( 1) ASTERISK-17760.patch
( 2) backtrace.txt
( 3) handle_statechange_deadlock_fix_v2.patch
( 4) handle_statechange_deadlock_fix.patch
( 5) issue_19191.patch
( 6) jira_asterisk_17760_v1.8.patch
Description:I've only seen this happen twice, both instances separated by about a month.  
The only symptom is that Asterisk stops responding to SIP packets.

This time I ran 'pkill -11 asterisk' when it happened so I could get a core dump. So, when looking at the backtrace, the segfault isn't the problem.


I'm running asterisk with patches for the following issues applied:
- issue ASTERISK-16961
- issue ASTERISK-17118
- issue ASTERISK-17268
Comments:By: Gregory Hinton Nietsky (irroot) 2011-04-27 12:12:12

There seems to be a lock on adding a hint with ao2_callback handle_statechange

see ASTERISK-17666 and if you could supply "core show locks"

1.6.2 will be security release only.

By: Byron Clark (byronclark) 2011-04-27 12:14:13

I'll get the "core show locks" output when it happens again.

By: Byron Clark (byronclark) 2011-05-03 17:36:00

Here's the problem:
- The thread calling ast_get_hint (thread 38 in the backtrace) holds a lock on the hints container when it attempts to lock contexts by calling ast_rdlock_contexts().
- The thread calling ast_add_hint (thread 33 in the backtrace) holds a lock on the contexts (acquired with ast_rdlock_contexts()) and attempts to lock the hints container.

We're probably seeing this more than others because the initial hint setup in configuration requires a realtime lookup.

By: Byron Clark (byronclark) 2011-05-03 17:42:31

It appears that ast_rdlock_contexts() and ast_wrlock_contexts() both call ast_mutex_lock which in turn calls pthread_mutex_lock, so these definitely aren't rwlocks.

By: Byron Clark (byronclark) 2011-05-04 12:54:51

handle_statechange_deadlock_fix.patch (patch against 1.6.2 svn) takes the contexts lock before taking the lock on the hints container so that we don't get the out of order requests in ast_add_hint.

By: Byron Clark (byronclark) 2011-05-04 12:59:55

The same lock ordering problem (ast_hint_extension() tries to take the contexts lock while hints is already locked) in 1.8.  1.8_handle_statechange_deadlock_fix.patch is the same patch against 1.8 svn.

By: Byron Clark (byronclark) 2011-05-04 13:00:36

I'm assuming that it's safe to take the contexts lock when it will be taken again by the same thread because it is a pthread_mutex_t and thus recursive.

By: Byron Clark (byronclark) 2011-05-12 12:01:37

The original version of this patch exposed a deadlock in queue handling.  handle_statechange_deadlock_fix_v2.patch adds the same locking semantics to queues (take the contexts lock before locking the queues container) as is used for hints.

Patch is against svn trunk but applies fairly cleanly to 1.8 (I'm testing it against 1.8.4 and

By: Byron Clark (byronclark) 2011-05-16 11:24:10

These patches are currently unsafe.  They reintroduce the deadlock fixed in issue ASTERISK-16961.

By: Byron Clark (byronclark) 2011-05-16 11:47:15

This deadlock only happens after the introduction of ASTERISK-16961 and is somewhat related to ASTERISK-17607.

By: Byron Clark (byronclark) 2011-05-20 09:32:21

issue_19191.patch (generated against trunk, but also needed for 1.8 and 1.6.2) is a far simpler patch that seems to be avoiding the deadlock on our production systems.  The real issue is that ASTERISK-16961 changes the locking order for hints and contexts.  This patch just makes sure that the hints lock is taken before the contexts lock on the path where we were seeing the deadlock.

By: Byron Clark (byronclark) 2011-06-06 10:35:06.519-0500

We ran into another location where the lock order being reversed causes a deadlock. [^ASTERISK-17760.patch] fixes both code paths.

By: Richard Mudgett (rmudgett) 2011-07-08 11:06:15.713-0500

[^jira_asterisk_17760_v1.8.patch] is the first patch posted in the reviewboard link

See the reviewboard description for more details.

By: Richard Mudgett (rmudgett) 2011-07-22 15:49:32.933-0500

Committed to -r329332 trunk.

Merged revisions 329331 via svnmerge from

 r329331 | rmudgett | 2011-07-22 15:43:07 -0500 (Fri, 22 Jul 2011) | 55 lines

 Merged revisions 329299 via svnmerge from

   r329299 | rmudgett | 2011-07-22 10:44:58 -0500 (Fri, 22 Jul 2011) | 48 lines

   Deadlocks dealing with dialplan hints during reload.

   There are two remaining different deadlocks reported dealing with dialplan

   The deadlock in ASTERISK-17666 is caused by invalid locking order in
   ast_remove_hint().  The hints container must be locked before the hint

   The deadlock in ASTERISK-17760 is caused by a catch-22 situation in
   handle_statechange().  The deadlock is caused by not having the conlock
   before calling the watcher callbacks.  Unfortunately, having that lock
   causes a different deadlock as reported in ASTERISK-16961.

   * Fixed ast_remove_hint() locking order.

   * Made handle_statechange() no longer call the watcher callbacks holding
   any locks that matter.

   * Made hint ao2 destructor do the watcher callbacks for extension
   deactivation to guarantee that they get called.

   * Fixed hint reference leak in ast_add_hint() if the callback container
   constructor failed.

   * Fixed hint reference leak in complete_core_show_hint() for every hint it
   found for CLI tab completion.

   * Adjusted locking in ast_merge_contexts_and_delete() for safety.

   * Added context_merge_lock to prevent ast_merge_contexts_and_delete() and
   handle_statechange() from interfering with each other.

   * Fixed ast_change_hint() not taking into account that the extension is
   used for the hash key.

   (closes issue ASTERISK-17666)
   Reported by: irroot
   Tested by: irroot
   JIRA SWP-3318

   (closes issue ASTERISK-17760)
   Reported by: Byron Clark
   Tested by: irroot
   JIRA SWP-3393

   Review: https://reviewboard.asterisk.org/r/1313/

By: Steven Wheeler (swheeler) 2011-09-29 15:01:41.735-0500

We are seeing this same problem in 1.6.2, any chance the fix will be back ported?

By: Richard Mudgett (rmudgett) 2011-09-29 15:12:46.488-0500

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions.