[Home]

Summary:ASTERISK-16834: [patch] hint state changes deadlock problem
Reporter:antonio (antonio)Labels:
Date Opened:2010-10-19 13:45:26Date Closed:2010-12-06 13:16:43.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/PBX
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) bug18165-1.6.patch
( 1) hint_ao2_locks-1.6.2.13.patch
( 2) issue18165.patch
Description:What happens is there are two threads that lock resources in a different order causing a deadlock.

The threads are:

1. from taskprocessor handle_statechange() is called. This one locks ast_rdlock_contexts, and then eventually call cb_extensionstate() which immediately does a sip_pvt_lock()

2. do_monitor calls sipsock_read() which calls handle_request_do() which calls find_call(), which does a sip_pvt_lock() too. The incoming request is a subscription, so it calls ast_get_hint(), which calls ast_hint_extension() which calls ast_rdlock_contexts(). Deadlock.


****** ADDITIONAL INFORMATION ******

There are related locking problems with reloading the dialplan as well.
This issue was first reported on the -dev list here:
http://lists.digium.com/pipermail/asterisk-dev/2010-October/046509.html, created this issue as per Russel's request.
Comments:By: antonio (antonio) 2010-10-19 13:49:35

It hard to reproduce this at will, but it's easy to wait for it: it happens every two days for me.

By: Stefan Schmidt (schmidts) 2010-10-19 14:37:01

with 1.6.2.14rc1 it should happens on more times cause handle_statechange is just much slower in 1.6.2.14 so the possible deadlock could easier occur.

i had found an easy way to reproduce this by sending 400 subscribes per second with sipp but it doesnt happen allways.

By: Jeff Peeler (jpeeler) 2010-10-28 10:32:51

Schmidts will you share that sipp scenario file?

By: Stefan Schmidt (schmidts) 2010-10-28 10:43:20

jeff have a look at issue 18145 (its private). take the subscribe.xml scenario and i have also added the inf file and also hints i use.

i have seen this happing when sending around 10k subscribes in with 400 subscribes per second and doing some dialplan reloads.

By: Michael Gaudette (bluefox) 2010-11-01 08:15:29

This just happened to me twice this morning, and here is what I managed to fetch before I "panicked-restarted" Asterisk.

This was the last warning in the CLI before total deadlock:
[Nov  1 09:07:00] WARNING[6135]: pbx.c:7480 add_pri_lockopt: Unable to register extension '115', priority -1 in 'hints-parking', already in use

Could this be linked to parking hints, as opposed to ordinary subscribe?

By: Steve Davies (one47) 2010-11-01 09:32:00

We can cause what I believe is this issue fairly simply in our test lab. We use:

- multiple copies of "pjsua" registered (about 100 handsets)
- 2 or 3 handsets BLF monitoring for 40 to 80 hints each
- A call script which dials 10 calls per second. The calls hangup after 2 seconds

pjsua uses its default fairly short SIP subscribe period, the handsets re-subscribe their BLFs in a "flurry" every 30 minutes, and there is constant call/BLF state activity.

I have so far failed to get any data from asterisk as it locks solid after an hour or two - This may be because it is in a small "appliance" environment, and logs are on a RAMFS. This lock does not appear to happen in 1.6.2.10, but does occur in 1.6.2.11 to 1.6.2.13; I assume this is due to timing in the BLF code paths as there are no obvious changes.

I am willing to test patches in our test environment.

By: Michael Gaudette (bluefox) 2010-11-01 14:51:45

0018052 is also most likely a duplicate.

By: Stefan Schmidt (schmidts) 2010-11-02 00:50:12

bluefox it could be that there is a relation between this and 0018052 but without logs its hard to say.

the relation i have set is definitly the same bug but in two different versions.

By: antonio (antonio) 2010-11-02 01:30:40

Looking at the aforementioned mailing list thread: the root cause of this issue is already known, but it requires a rewrite of hint state change processing. Russell took a shot at this, but someone else is now assigned to finalizing Russell's patch. No timeframe was mentioned.

By: Jeff Peeler (jpeeler) 2010-11-03 12:47:20

I created a branch for this here:
http://svn.digium.com/svn/asterisk/team/jpeeler/issue18165

I'm also attaching the patch that will hopefully apply to later versions (it is 1.4 based).

By: Michael Gaudette (bluefox) 2010-11-04 07:46:30

I'll wait until someone who can reproduce it at will confirms it works on 1.6, and I'll try it on a prod system.  But I'm certainly willing to go ahead quickly.

By: Steve Davies (one47) 2010-11-08 11:32:31.000-0600

I am working on a 1.6.2.13 version of this patch. In the meantime, the original patch appears to be missing a call to
     ao2_ref(hint, -1);
in the
     if(hint) { ... }
test near the start of ast_add_hint.

I do not have 1.4 downloaded, so could not easily generate an updated patch.

By: Jeff Peeler (jpeeler) 2010-11-08 17:30:46.000-0600

Thanks for pointing that out, I plan on committing shortly as it has been tested here with no problems.

By: Digium Subversion (svnbot) 2010-11-09 11:38:02.000-0600

Repository: asterisk
Revision: 294384

U   branches/1.4/include/asterisk.h
U   branches/1.4/main/asterisk.c
U   branches/1.4/main/pbx.c

------------------------------------------------------------------------
r294384 | jpeeler | 2010-11-09 11:38:00 -0600 (Tue, 09 Nov 2010) | 47 lines

Fix a deadlock in device state change processing.

Copied from some notes from the original author (Russell):

Deadlock scenario:
Thread 1: device state change thread
 Holds - rdlock on contexts
 Holds - hints lock
 Waiting on channels container lock

Thread 2: SIP monitor thread
 Holds the "iflock"
 Holds a sip_pvt lock
 Holds channel container lock
 Waiting for a channel lock

Thread 3: A channel thread (chan_local in this case)
 Holds 2 channel locks acquired within app_dial
 Holds a 3rd channel lock it got inside of chan_local
 Holds a local_pvt lock
 Waiting on a rdlock of the contexts lock

A bunch of other threads waiting on a wrlock of the contexts lock


To address this deadlock, some locking order rules must be put in place and
enforced. Existing relevant rules:

1) channel lock before a pvt lock
2) contexts lock before hints lock
3) channels container before a channel

What's missing is some enforcement of the order when you involve more than any
two. To fix this problem, I put in some code that ensures that (at least in the
code paths involved in this bug) the locks in (3) come before the locks in (2).
To change the operation of thread 1 to comply, I converted the storage of hints
to an astobj2 container. This allows processing of hints without holding the
hints container lock. So, in the code path that led to thread 1's state, it no
longer holds either the contexts or hints lock while it attempts to lock the
channels container.

(closes issue ASTERISK-16834)
Reported by: antonio

ABE-2583


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=294384

By: Jeff Peeler (jpeeler) 2010-11-10 15:50:21.000-0600

Merge to other branches in progress, reopening for patch attachment.

By: Stefan Schmidt (schmidts) 2010-11-10 16:06:23.000-0600

jeff please have a look at https://reviewboard.asterisk.org/r/1003/ this is a "merge" of your 1.4 patch with my patch to speed up statechange handling but in most parts it should be the same like yours but for trunk.

regards

stefan

By: Jeff Peeler (jpeeler) 2010-11-10 17:10:08.000-0600

Steve is going to upload his 1.6 version of his patch tomorrow and we'll compare to make sure I didn't miss anything. I found 3 places that probably need fixing in the original patch.

Stefan - I will look at your reviewboard request after I finish merging up to 1.8 (your patch will go into trunk instead of this one). At minimum though we'll need to make sure you add the things I missed in the 1.4 version.



By: Steve Davies (one47) 2010-11-11 03:52:41.000-0600

Hi, Just uploaded hint_ao2_locks-1.6.2.13.patch - This is a straight conversion of the 1.4 patch. It has been running for about 3 days, and for the last 12 hours has been put under constant BLF load.

100 (pjsua) devices, 10 calls per second (resulting in 30 state changes per second) monitored by 3 handsets. Hints re-subscribed hourly, and occasional dialplan reloads.

I have probably missed something :)

By: Stefan Schmidt (schmidts) 2010-11-11 04:18:29.000-0600

thanks you your work steve!

i have noticed one possible problem (which is covered from the original patch or atleast in the branch)
in the handle_statechange function on row: 3890 you have to check if hint->exten exists, after you have take the lock on it to make sure this exten wasnt destroy in between.

and a simple change which cause a better performance. Dont make hints just a linklist use the extenname (ast_get_extension_name(hint->exten)) for hash value and init the container with a usefull amount of buckets like 563.
extenname is unique per context so this would decrease the search time without causing any problems. even if you have only 1 hint per context then it would be again a linklist and not slower.

By: Jeff Peeler (jpeeler) 2010-11-11 13:29:03.000-0600

Yes, I actually did miss a few things and they have been corrected now. The only one that mattered though was some bad locking in ast_add_hint. I went ahead and also pulled in the small performance improvement for the number of buckets too. Stefan you can either wait for my review or look at the small differences in complete_core_show_hint, handle_show_hint, and ast_merge_contexts_and_delete.

By: Digium Subversion (svnbot) 2010-11-11 13:31:02.000-0600

Repository: asterisk
Revision: 294639

_U  branches/1.6.2/
U   branches/1.6.2/include/asterisk.h
U   branches/1.6.2/main/asterisk.c
U   branches/1.6.2/main/pbx.c

------------------------------------------------------------------------
r294639 | jpeeler | 2010-11-11 13:31:01 -0600 (Thu, 11 Nov 2010) | 53 lines

Merged revisions 294384 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
 r294384 | jpeeler | 2010-11-09 11:37:59 -0600 (Tue, 09 Nov 2010) | 47 lines
 
 Fix a deadlock in device state change processing.
 
 Copied from some notes from the original author (Russell):
 
 Deadlock scenario:
 Thread 1: device state change thread
   Holds - rdlock on contexts
   Holds - hints lock
   Waiting on channels container lock
 
 Thread 2: SIP monitor thread
   Holds the "iflock"
   Holds a sip_pvt lock
   Holds channel container lock
   Waiting for a channel lock
 
 Thread 3: A channel thread (chan_local in this case)
   Holds 2 channel locks acquired within app_dial
   Holds a 3rd channel lock it got inside of chan_local
   Holds a local_pvt lock
   Waiting on a rdlock of the contexts lock
 
 A bunch of other threads waiting on a wrlock of the contexts lock
 
 
 To address this deadlock, some locking order rules must be put in place and
 enforced. Existing relevant rules:
 
 1) channel lock before a pvt lock
 2) contexts lock before hints lock
 3) channels container before a channel
 
 What's missing is some enforcement of the order when you involve more than any
 two. To fix this problem, I put in some code that ensures that (at least in the
 code paths involved in this bug) the locks in (3) come before the locks in (2).
 To change the operation of thread 1 to comply, I converted the storage of hints
 to an astobj2 container. This allows processing of hints without holding the
 hints container lock. So, in the code path that led to thread 1's state, it no
 longer holds either the contexts or hints lock while it attempts to lock the
 channels container.
 
 (closes issue ASTERISK-16834)
 Reported by: antonio
 
 ABE-2583
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=294639

By: Digium Subversion (svnbot) 2010-11-11 13:42:08.000-0600

Repository: asterisk
Revision: 294640

_U  branches/1.8/
U   branches/1.8/include/asterisk.h
U   branches/1.8/main/asterisk.c
U   branches/1.8/main/pbx.c

------------------------------------------------------------------------
r294640 | jpeeler | 2010-11-11 13:42:07 -0600 (Thu, 11 Nov 2010) | 60 lines

Merged revisions 294639 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.6.2

................
 r294639 | jpeeler | 2010-11-11 13:31:00 -0600 (Thu, 11 Nov 2010) | 53 lines
 
 Merged revisions 294384 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r294384 | jpeeler | 2010-11-09 11:37:59 -0600 (Tue, 09 Nov 2010) | 47 lines
   
   Fix a deadlock in device state change processing.
   
   Copied from some notes from the original author (Russell):
   
   Deadlock scenario:
   Thread 1: device state change thread
     Holds - rdlock on contexts
     Holds - hints lock
     Waiting on channels container lock
   
   Thread 2: SIP monitor thread
     Holds the "iflock"
     Holds a sip_pvt lock
     Holds channel container lock
     Waiting for a channel lock
   
   Thread 3: A channel thread (chan_local in this case)
     Holds 2 channel locks acquired within app_dial
     Holds a 3rd channel lock it got inside of chan_local
     Holds a local_pvt lock
     Waiting on a rdlock of the contexts lock
   
   A bunch of other threads waiting on a wrlock of the contexts lock
   
   
   To address this deadlock, some locking order rules must be put in place and
   enforced. Existing relevant rules:
   
   1) channel lock before a pvt lock
   2) contexts lock before hints lock
   3) channels container before a channel
   
   What's missing is some enforcement of the order when you involve more than any
   two. To fix this problem, I put in some code that ensures that (at least in the
   code paths involved in this bug) the locks in (3) come before the locks in (2).
   To change the operation of thread 1 to comply, I converted the storage of hints
   to an astobj2 container. This allows processing of hints without holding the
   hints container lock. So, in the code path that led to thread 1's state, it no
   longer holds either the contexts or hints lock while it attempts to lock the
   channels container.
   
   (closes issue ASTERISK-16834)
   Reported by: antonio
   
   ABE-2583
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=294640