Summary: | ASTERISK-16834: [patch] hint state changes deadlock problem | ||
Reporter: | antonio (antonio) | Labels: | |
Date Opened: | 2010-10-19 13:45:26 | Date Closed: | 2010-12-06 13:16:43.000-0600 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Core/PBX |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) bug18165-1.6.patch ( 1) hint_ao2_locks-1.6.2.13.patch ( 2) issue18165.patch | |
Description: | What happens is there are two threads that lock resources in a different order causing a deadlock. The threads are: 1. from taskprocessor handle_statechange() is called. This one locks ast_rdlock_contexts, and then eventually call cb_extensionstate() which immediately does a sip_pvt_lock() 2. do_monitor calls sipsock_read() which calls handle_request_do() which calls find_call(), which does a sip_pvt_lock() too. The incoming request is a subscription, so it calls ast_get_hint(), which calls ast_hint_extension() which calls ast_rdlock_contexts(). Deadlock. ****** ADDITIONAL INFORMATION ****** There are related locking problems with reloading the dialplan as well. This issue was first reported on the -dev list here: http://lists.digium.com/pipermail/asterisk-dev/2010-October/046509.html, created this issue as per Russel's request. | ||
Comments: | By: antonio (antonio) 2010-10-19 13:49:35 It hard to reproduce this at will, but it's easy to wait for it: it happens every two days for me. By: Stefan Schmidt (schmidts) 2010-10-19 14:37:01 with 1.6.2.14rc1 it should happens on more times cause handle_statechange is just much slower in 1.6.2.14 so the possible deadlock could easier occur. i had found an easy way to reproduce this by sending 400 subscribes per second with sipp but it doesnt happen allways. By: Jeff Peeler (jpeeler) 2010-10-28 10:32:51 Schmidts will you share that sipp scenario file? By: Stefan Schmidt (schmidts) 2010-10-28 10:43:20 jeff have a look at issue 18145 (its private). take the subscribe.xml scenario and i have also added the inf file and also hints i use. i have seen this happing when sending around 10k subscribes in with 400 subscribes per second and doing some dialplan reloads. By: Michael Gaudette (bluefox) 2010-11-01 08:15:29 This just happened to me twice this morning, and here is what I managed to fetch before I "panicked-restarted" Asterisk. This was the last warning in the CLI before total deadlock: [Nov 1 09:07:00] WARNING[6135]: pbx.c:7480 add_pri_lockopt: Unable to register extension '115', priority -1 in 'hints-parking', already in use Could this be linked to parking hints, as opposed to ordinary subscribe? By: Steve Davies (one47) 2010-11-01 09:32:00 We can cause what I believe is this issue fairly simply in our test lab. We use: - multiple copies of "pjsua" registered (about 100 handsets) - 2 or 3 handsets BLF monitoring for 40 to 80 hints each - A call script which dials 10 calls per second. The calls hangup after 2 seconds pjsua uses its default fairly short SIP subscribe period, the handsets re-subscribe their BLFs in a "flurry" every 30 minutes, and there is constant call/BLF state activity. I have so far failed to get any data from asterisk as it locks solid after an hour or two - This may be because it is in a small "appliance" environment, and logs are on a RAMFS. This lock does not appear to happen in 1.6.2.10, but does occur in 1.6.2.11 to 1.6.2.13; I assume this is due to timing in the BLF code paths as there are no obvious changes. I am willing to test patches in our test environment. By: Michael Gaudette (bluefox) 2010-11-01 14:51:45 0018052 is also most likely a duplicate. By: Stefan Schmidt (schmidts) 2010-11-02 00:50:12 bluefox it could be that there is a relation between this and 0018052 but without logs its hard to say. the relation i have set is definitly the same bug but in two different versions. By: antonio (antonio) 2010-11-02 01:30:40 Looking at the aforementioned mailing list thread: the root cause of this issue is already known, but it requires a rewrite of hint state change processing. Russell took a shot at this, but someone else is now assigned to finalizing Russell's patch. No timeframe was mentioned. By: Jeff Peeler (jpeeler) 2010-11-03 12:47:20 I created a branch for this here: http://svn.digium.com/svn/asterisk/team/jpeeler/issue18165 I'm also attaching the patch that will hopefully apply to later versions (it is 1.4 based). By: Michael Gaudette (bluefox) 2010-11-04 07:46:30 I'll wait until someone who can reproduce it at will confirms it works on 1.6, and I'll try it on a prod system. But I'm certainly willing to go ahead quickly. By: Steve Davies (one47) 2010-11-08 11:32:31.000-0600 I am working on a 1.6.2.13 version of this patch. In the meantime, the original patch appears to be missing a call to ao2_ref(hint, -1); in the if(hint) { ... } test near the start of ast_add_hint. I do not have 1.4 downloaded, so could not easily generate an updated patch. By: Jeff Peeler (jpeeler) 2010-11-08 17:30:46.000-0600 Thanks for pointing that out, I plan on committing shortly as it has been tested here with no problems. By: Digium Subversion (svnbot) 2010-11-09 11:38:02.000-0600 Repository: asterisk Revision: 294384 U branches/1.4/include/asterisk.h U branches/1.4/main/asterisk.c U branches/1.4/main/pbx.c ------------------------------------------------------------------------ r294384 | jpeeler | 2010-11-09 11:38:00 -0600 (Tue, 09 Nov 2010) | 47 lines Fix a deadlock in device state change processing. Copied from some notes from the original author (Russell): Deadlock scenario: Thread 1: device state change thread Holds - rdlock on contexts Holds - hints lock Waiting on channels container lock Thread 2: SIP monitor thread Holds the "iflock" Holds a sip_pvt lock Holds channel container lock Waiting for a channel lock Thread 3: A channel thread (chan_local in this case) Holds 2 channel locks acquired within app_dial Holds a 3rd channel lock it got inside of chan_local Holds a local_pvt lock Waiting on a rdlock of the contexts lock A bunch of other threads waiting on a wrlock of the contexts lock To address this deadlock, some locking order rules must be put in place and enforced. Existing relevant rules: 1) channel lock before a pvt lock 2) contexts lock before hints lock 3) channels container before a channel What's missing is some enforcement of the order when you involve more than any two. To fix this problem, I put in some code that ensures that (at least in the code paths involved in this bug) the locks in (3) come before the locks in (2). To change the operation of thread 1 to comply, I converted the storage of hints to an astobj2 container. This allows processing of hints without holding the hints container lock. So, in the code path that led to thread 1's state, it no longer holds either the contexts or hints lock while it attempts to lock the channels container. (closes issue ASTERISK-16834) Reported by: antonio ABE-2583 ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=294384 By: Jeff Peeler (jpeeler) 2010-11-10 15:50:21.000-0600 Merge to other branches in progress, reopening for patch attachment. By: Stefan Schmidt (schmidts) 2010-11-10 16:06:23.000-0600 jeff please have a look at https://reviewboard.asterisk.org/r/1003/ this is a "merge" of your 1.4 patch with my patch to speed up statechange handling but in most parts it should be the same like yours but for trunk. regards stefan By: Jeff Peeler (jpeeler) 2010-11-10 17:10:08.000-0600 Steve is going to upload his 1.6 version of his patch tomorrow and we'll compare to make sure I didn't miss anything. I found 3 places that probably need fixing in the original patch. Stefan - I will look at your reviewboard request after I finish merging up to 1.8 (your patch will go into trunk instead of this one). At minimum though we'll need to make sure you add the things I missed in the 1.4 version. By: Steve Davies (one47) 2010-11-11 03:52:41.000-0600 Hi, Just uploaded hint_ao2_locks-1.6.2.13.patch - This is a straight conversion of the 1.4 patch. It has been running for about 3 days, and for the last 12 hours has been put under constant BLF load. 100 (pjsua) devices, 10 calls per second (resulting in 30 state changes per second) monitored by 3 handsets. Hints re-subscribed hourly, and occasional dialplan reloads. I have probably missed something :) By: Stefan Schmidt (schmidts) 2010-11-11 04:18:29.000-0600 thanks you your work steve! i have noticed one possible problem (which is covered from the original patch or atleast in the branch) in the handle_statechange function on row: 3890 you have to check if hint->exten exists, after you have take the lock on it to make sure this exten wasnt destroy in between. and a simple change which cause a better performance. Dont make hints just a linklist use the extenname (ast_get_extension_name(hint->exten)) for hash value and init the container with a usefull amount of buckets like 563. extenname is unique per context so this would decrease the search time without causing any problems. even if you have only 1 hint per context then it would be again a linklist and not slower. By: Jeff Peeler (jpeeler) 2010-11-11 13:29:03.000-0600 Yes, I actually did miss a few things and they have been corrected now. The only one that mattered though was some bad locking in ast_add_hint. I went ahead and also pulled in the small performance improvement for the number of buckets too. Stefan you can either wait for my review or look at the small differences in complete_core_show_hint, handle_show_hint, and ast_merge_contexts_and_delete. By: Digium Subversion (svnbot) 2010-11-11 13:31:02.000-0600 Repository: asterisk Revision: 294639 _U branches/1.6.2/ U branches/1.6.2/include/asterisk.h U branches/1.6.2/main/asterisk.c U branches/1.6.2/main/pbx.c ------------------------------------------------------------------------ r294639 | jpeeler | 2010-11-11 13:31:01 -0600 (Thu, 11 Nov 2010) | 53 lines Merged revisions 294384 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r294384 | jpeeler | 2010-11-09 11:37:59 -0600 (Tue, 09 Nov 2010) | 47 lines Fix a deadlock in device state change processing. Copied from some notes from the original author (Russell): Deadlock scenario: Thread 1: device state change thread Holds - rdlock on contexts Holds - hints lock Waiting on channels container lock Thread 2: SIP monitor thread Holds the "iflock" Holds a sip_pvt lock Holds channel container lock Waiting for a channel lock Thread 3: A channel thread (chan_local in this case) Holds 2 channel locks acquired within app_dial Holds a 3rd channel lock it got inside of chan_local Holds a local_pvt lock Waiting on a rdlock of the contexts lock A bunch of other threads waiting on a wrlock of the contexts lock To address this deadlock, some locking order rules must be put in place and enforced. Existing relevant rules: 1) channel lock before a pvt lock 2) contexts lock before hints lock 3) channels container before a channel What's missing is some enforcement of the order when you involve more than any two. To fix this problem, I put in some code that ensures that (at least in the code paths involved in this bug) the locks in (3) come before the locks in (2). To change the operation of thread 1 to comply, I converted the storage of hints to an astobj2 container. This allows processing of hints without holding the hints container lock. So, in the code path that led to thread 1's state, it no longer holds either the contexts or hints lock while it attempts to lock the channels container. (closes issue ASTERISK-16834) Reported by: antonio ABE-2583 ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=294639 By: Digium Subversion (svnbot) 2010-11-11 13:42:08.000-0600 Repository: asterisk Revision: 294640 _U branches/1.8/ U branches/1.8/include/asterisk.h U branches/1.8/main/asterisk.c U branches/1.8/main/pbx.c ------------------------------------------------------------------------ r294640 | jpeeler | 2010-11-11 13:42:07 -0600 (Thu, 11 Nov 2010) | 60 lines Merged revisions 294639 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.6.2 ................ r294639 | jpeeler | 2010-11-11 13:31:00 -0600 (Thu, 11 Nov 2010) | 53 lines Merged revisions 294384 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r294384 | jpeeler | 2010-11-09 11:37:59 -0600 (Tue, 09 Nov 2010) | 47 lines Fix a deadlock in device state change processing. Copied from some notes from the original author (Russell): Deadlock scenario: Thread 1: device state change thread Holds - rdlock on contexts Holds - hints lock Waiting on channels container lock Thread 2: SIP monitor thread Holds the "iflock" Holds a sip_pvt lock Holds channel container lock Waiting for a channel lock Thread 3: A channel thread (chan_local in this case) Holds 2 channel locks acquired within app_dial Holds a 3rd channel lock it got inside of chan_local Holds a local_pvt lock Waiting on a rdlock of the contexts lock A bunch of other threads waiting on a wrlock of the contexts lock To address this deadlock, some locking order rules must be put in place and enforced. Existing relevant rules: 1) channel lock before a pvt lock 2) contexts lock before hints lock 3) channels container before a channel What's missing is some enforcement of the order when you involve more than any two. To fix this problem, I put in some code that ensures that (at least in the code paths involved in this bug) the locks in (3) come before the locks in (2). To change the operation of thread 1 to comply, I converted the storage of hints to an astobj2 container. This allows processing of hints without holding the hints container lock. So, in the code path that led to thread 1's state, it no longer holds either the contexts or hints lock while it attempts to lock the channels container. (closes issue ASTERISK-16834) Reported by: antonio ABE-2583 ........ ................ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=294640 |