Summary: | ASTERISK-16120: Race condition causes manager session event list to underflow causing null pointer de-ref and crash. | ||
Reporter: | David Woolley (davidw) | Labels: | |
Date Opened: | 2010-05-20 12:26:26 | Date Closed: | 2010-06-01 11:42:04 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/ManagerInterface |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) eventpurge-full.log ( 1) eventpurge-gdb.txt | |
Description: | There is a flaw in the reference count logic for events in manager.c which means about once in the order of 100,000,000 events, the last event gets purged, resulting in a trying to dereference a null last_ev pointer at: 0x0811b951 in process_events (s=0xb773819c) at manager.c:2685 2685 while ( (eqe = NEW_EVENT(s)) ) { Scenario. Run Asterisk for long enough with a manager logged in and events being generated. Expect. Nothing untoward. Get. Crash, as above. ****** ADDITIONAL INFORMATION ****** Again this is a race condition and again I have done a patch to exacerbate it, although in this case, the patch is too intimately connected with the bug to be able to separate the test patch from the fix. The normal logic for manager.c is to have a session holding a usecount on at least the last event that it output, and to hang onto that last event as the link to the next one, when it arrives. When a new event arrives, a pointer to it is fetched, from the previous event structure, which is the line that is crashing, then the reference count is updated. This update is safe, because events are purged in list order and it still has a reference count on the previous one. When it has finished with the event, it calls unref_event, on the last event (not the one it has just output). This decrements the use count, then follows the chain forwards. The problem is that this event is vulnerable to purging in the few instructions between the decrement and the chaining! Every five seconds, events are purged if they are between the list head and one with a non-zero use count (or the list tail entry). The diagnostic patch uses sleep to extend this window from 10s of nano-seconds to a second. It then takes a relatively short time for the 5 second purge cycle to land in the middle of this gap. Also confusing about the code is that it increments the reference count on event it has fetched to output (ege above), but then refetches this value, hidden within unref_event, to follow the chain. That somewhat obfuscates the subjects of the reference count manipulation. For reference the current code for unref_event is: static struct eventqent *unref_event(struct eventqent *e) { ast_atomic_fetchadd_int(&e->usecount, -1); return AST_LIST_NEXT(e, eq_next); } The real life case was on 1.6.1.0, and took about four months. I can see no obvious code difference between 1.6.1.0, 1.6.2 SVN and trunk SVN. The configuration used was essentially make samples with agent mark enabled in manager.conf. The patch was to add a sleep between the above two lines. I'm not sure it is worth uploading files for either of these. The test was to make a call to 600 from an anoymous SIP phone and then wait. So, for the moment, I'll just update the backtraces, plus some prints. (If you are wondering, the 100,000,000 is based on a five second purge cycle and no more than 50ns between the count decrement and the pointer fetch.) | ||
Comments: | By: Tilghman Lesher (tilghman) 2010-05-20 13:29:50 Work on the related issue ASTERISK-16005 produced a patch that should fix that exact race condition. By: David Woolley (davidw) 2010-05-21 11:49:58 The code looks OK, with the caveats that I'm assuming that the list management macros do sensible things. As my original test started from an assumption about where the race condition was and it is now assumed to have gone away, we can only really do a general regression test of the AMI interface part of our application, and make sure there are no obvious new problems. That will probably happen sometime next week. By: Digium Subversion (svnbot) 2010-06-01 11:41:03 Repository: asterisk Revision: 266682 U trunk/main/manager.c ------------------------------------------------------------------------ r266682 | tilghman | 2010-06-01 11:41:00 -0500 (Tue, 01 Jun 2010) | 16 lines Eliminate stale manager events after a set interval, even if AMI clients don't query for them. Actions (or failures to act) by external clients should not cause memory leaks in Asterisk, especially when those continued leaks could cause Asterisk to misbehave later. (closes issue ASTERISK-16005) Reported by: mav3rick Patches: 20100510__issue17234.diff.txt uploaded by tilghman (license 14) 20100517__issue17234__trunk.diff.txt uploaded by tilghman (license 14) Tested by: mav3rick, davidw (closes issue ASTERISK-16120) Reported by: davidw ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=266682 By: Digium Subversion (svnbot) 2010-06-01 11:42:04 Repository: asterisk Revision: 266683 _U branches/1.6.2/ U branches/1.6.2/main/manager.c ------------------------------------------------------------------------ r266683 | tilghman | 2010-06-01 11:42:02 -0500 (Tue, 01 Jun 2010) | 23 lines Merged revisions 266682 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ........ r266682 | tilghman | 2010-06-01 11:41:00 -0500 (Tue, 01 Jun 2010) | 16 lines Eliminate stale manager events after a set interval, even if AMI clients don't query for them. Actions (or failures to act) by external clients should not cause memory leaks in Asterisk, especially when those continued leaks could cause Asterisk to misbehave later. (closes issue ASTERISK-16005) Reported by: mav3rick Patches: 20100510__issue17234.diff.txt uploaded by tilghman (license 14) 20100517__issue17234__trunk.diff.txt uploaded by tilghman (license 14) Tested by: mav3rick, davidw (closes issue ASTERISK-16120) Reported by: davidw ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=266683 |