Summary:ASTERISK-19215: Segfault chan_sip originating call
Reporter:Jac Barben (jbarben)Labels:
Date Opened:2012-01-18 15:46:34.000-0600Date Closed:2012-01-30 06:58:34.000-0600
Versions: Frequency of
Environment:RHEL 6 2.6.32-131.21.1.el6.x86_64Attachments:( 0) bt.txt
( 1) btfull.txt
( 2) log.txt
( 3) morelog.tgz
Description:Core dump under load.  Affected area is always an outbound SIP call (chan_sip).  Net result is always a reference to __ast_channel_alloc,  __ast_channel_alloc_ap, internal_ao2_alloc, _ast_calloc, calloc, _int_malloc
Comments:By: Jac Barben (jbarben) 2012-01-18 15:48:23.608-0600

gdb backtrace data from recent crash

By: Jac Barben (jbarben) 2012-01-18 16:00:32.552-0600

I'm not sure this will help... this is an excerpt of the "full" log the crash occured at 11:04:19..  so that's what's here.

By: Jac Barben (jbarben) 2012-01-18 16:11:42.231-0600

more background:  We made the jump from 1.2 to 1.8 in December, 2011.  !.2 did not use the ao2 container.  

By: David Woolley (davidw) 2012-01-19 06:22:44.726-0600

Segmentation faults in malloc tend to indicate general memory corruption prior to the call, so the case against ao2 isn't actually very strong.

What I would be looking for, in the logs, is ERRORs relating to locking problems, which might be up to several minutes before the crash.

By: Matt Jordan (mjordan) 2012-01-19 08:29:22.345-0600

As David said, this is highly unlikely to be related to the ao2 container, which is used extensively throughout the codebase, and not just in chan_sip.

To determine what is going on here, please provide a DEBUG log that includes more then just the information that occurs at the second Asterisk crashes.  A SIP trace (using either sip set debug on or a pcap from wireshark) would also be helpful.

By: Jac Barben (jbarben) 2012-01-19 08:51:14.611-0600

I defer to you both :)
Attached is morelog.tgz. Uncompressed this is file 30 minutes prior to the crash at 11:04:19.  There are lots of "ERROR" messages but only 3 categories:
astobj2 bad magic number
app_fax ...
chan_sip.c No SRTP module loaded

There are plenty of references to "Deadlock avoided" and a couple of references to "Released clone lock" both references related to channel.c

By: Jac Barben (jbarben) 2012-01-19 09:01:40.645-0600

Regarding a pcap dump... do you want everything on the interface or are you looking specifically for SIP messages?

By: Matt Jordan (mjordan) 2012-01-19 12:17:20.618-0600

Jac - it appears as if something has messed up the ref counting on an ao2 object (the bad magic number errors).  These usually indicate that a user of an ao2 object did something illegal.

SIP pcaps wouldn't hurt, but at the same time, may not be needed for this problem.  What is really useful is a log file that is not normally created, which tracks the ref counting of the ao2 objects.  I can put together a patch that will turn this logging on if you want.  Alternatively, if you feel comfortable with it, you can enable the ref counting by adding:
#define REF_DEBUG 1
before inclusiong of the astobj2.h header file in chan_sip.c, app_fax.c, and rtp_engine.c.  That should turn the ref debugging on where we need it.

As an aside, why did you choose to use app_fax over res_fax?

By: Jac Barben (jbarben) 2012-01-19 12:32:32.776-0600

Matt - I can enable the ref counting.  In the meantime, I've already enabled SIP tracing.

You know... I actually started with the res_fax and regressed to app_fax.  I can't remember the specifics... but in short it didn't *seem* to work as well.

By: Jac Barben (jbarben) 2012-01-27 09:42:14.276-0600

Fellas:  I think we can close this one.  I did enable the SIP Debug, added my own external PCAP trace and enabled the ref counting.  In short, your tip regarding "locking" lead me to a race condition created by a third-party implementation of "follow-me" where the subscriber channel was being deleted before the caller channel and the caller channel was trying to reference the subscriber channel at teardown.

The "magic number" problem mysteriously went away after I recompiled with ref counting enabled.

By: Matt Jordan (mjordan) 2012-01-30 06:58:16.202-0600

Yikes.  Yeah, having an orphaned channel reference would be all sorts of no good.  Thanks for hunting that one down yourself - those take awhile to figure out.