[Home]

Summary:ASTERISK-16120: Race condition causes manager session event list to underflow causing null pointer de-ref and crash.
Reporter:David Woolley (davidw)Labels:
Date Opened:2010-05-20 12:26:26Date Closed:2010-06-01 11:42:04
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/ManagerInterface
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) eventpurge-full.log
( 1) eventpurge-gdb.txt
Description:There is a flaw in the reference count logic for events in manager.c which means about once in the order of 100,000,000 events, the last event gets purged, resulting in a trying to dereference a null last_ev pointer at:

0x0811b951 in process_events (s=0xb773819c) at manager.c:2685
2685       while ( (eqe = NEW_EVENT(s)) ) {

Scenario.  Run Asterisk for long enough with a manager logged in and events being generated.

Expect.  Nothing untoward.

Get. Crash, as above.

****** ADDITIONAL INFORMATION ******

Again this is a race condition and again I have done a patch to exacerbate it, although in this case, the patch is too intimately connected with the bug to be able to separate the test patch from the fix.

The normal logic for manager.c is to have a session holding a usecount on at least the last event that it output, and to hang onto that last event as the link to the next one, when it arrives.  When a new event arrives, a pointer to it is fetched, from the previous event structure, which is the line that is crashing, then the reference count is updated.  This update is safe, because events are purged in list order and it still has a reference count on the previous one.

When it has finished with the event, it calls unref_event, on the last event (not the one it has just output).  This decrements the use count, then follows the chain forwards.  The problem is that this event is vulnerable to purging in the few instructions between the decrement and the chaining!  Every five seconds, events are purged if they are between the list head and one with a non-zero use count (or the list tail entry).

The diagnostic patch uses sleep to extend this window from 10s of nano-seconds to a second.  It then takes a relatively short time for the 5 second purge cycle to land in the middle of this gap.

Also confusing about the code is that it increments the reference count on event it has fetched to output (ege above), but then refetches this value, hidden within unref_event, to follow the chain.  That somewhat obfuscates the subjects of the reference count manipulation.

For reference the current code for unref_event is:

static struct eventqent *unref_event(struct eventqent *e)
{
ast_atomic_fetchadd_int(&e->usecount, -1);
return AST_LIST_NEXT(e, eq_next);
}

The real life case was on 1.6.1.0, and took about four months.

I can see no obvious code difference between 1.6.1.0, 1.6.2 SVN and trunk SVN.

The configuration used was essentially make samples with agent mark enabled in manager.conf.  The patch was to add a sleep between the above two lines.  I'm not sure it is worth uploading files for either of these.  The test was to make a call to 600 from an anoymous SIP phone and then wait.  So, for the moment, I'll just update the backtraces, plus some prints.

(If you are wondering, the 100,000,000 is based on a five second purge cycle and no more than 50ns between the count decrement and the pointer fetch.)
Comments:By: Tilghman Lesher (tilghman) 2010-05-20 13:29:50

Work on the related issue ASTERISK-16005 produced a patch that should fix that exact race condition.

By: David Woolley (davidw) 2010-05-21 11:49:58

The code looks OK, with the caveats that I'm assuming that the list management macros do sensible things.

As my original test started from an assumption about where the race condition was and it is now assumed to have gone away, we can only really do a general regression test of the AMI interface part of our application, and make sure there are no obvious new problems.  That will probably happen sometime next week.

By: Digium Subversion (svnbot) 2010-06-01 11:41:03

Repository: asterisk
Revision: 266682

U   trunk/main/manager.c

------------------------------------------------------------------------
r266682 | tilghman | 2010-06-01 11:41:00 -0500 (Tue, 01 Jun 2010) | 16 lines

Eliminate stale manager events after a set interval, even if AMI clients don't query for them.

Actions (or failures to act) by external clients should not cause memory leaks
in Asterisk, especially when those continued leaks could cause Asterisk to
misbehave later.

(closes issue ASTERISK-16005)
Reported by: mav3rick
Patches:
      20100510__issue17234.diff.txt uploaded by tilghman (license 14)
      20100517__issue17234__trunk.diff.txt uploaded by tilghman (license 14)
Tested by: mav3rick, davidw

(closes issue ASTERISK-16120)
Reported by: davidw

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=266682

By: Digium Subversion (svnbot) 2010-06-01 11:42:04

Repository: asterisk
Revision: 266683

_U  branches/1.6.2/
U   branches/1.6.2/main/manager.c

------------------------------------------------------------------------
r266683 | tilghman | 2010-06-01 11:42:02 -0500 (Tue, 01 Jun 2010) | 23 lines

Merged revisions 266682 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r266682 | tilghman | 2010-06-01 11:41:00 -0500 (Tue, 01 Jun 2010) | 16 lines
 
 Eliminate stale manager events after a set interval, even if AMI clients don't query for them.
 
 Actions (or failures to act) by external clients should not cause memory leaks
 in Asterisk, especially when those continued leaks could cause Asterisk to
 misbehave later.
 
 (closes issue ASTERISK-16005)
  Reported by: mav3rick
  Patches:
        20100510__issue17234.diff.txt uploaded by tilghman (license 14)
        20100517__issue17234__trunk.diff.txt uploaded by tilghman (license 14)
  Tested by: mav3rick, davidw
 
 (closes issue ASTERISK-16120)
  Reported by: davidw
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=266683