Summary:ASTERISK-20551: Segfault when scheduled provisional keepalive is handled - dialog has already been destroyed
Reporter:David Brillert (aragon)
Labels:
Date Opened:2012-10-10 18:57:41
Date Closed:2012-11-07 11:49:59.000-0600
Versions: Frequency of
is related to ASTERISK-22079 Segfault: INTERNAL_OBJ (user_data=0x6374652f) at astobj2.c:120
Environment:SVN-branch-1.8-r374802
Attachments:
( 0) cli.txt
( 1) gdb_bt.txt
( 2) segfault_ref_counts.rar
Description:While debugging a ref count issue for another bug report, I ran into a segfault.
Ref count debugging was enabled at the time, so I am attaching:
gdb backtrace
CLI before crash
ref count log
Comments:By: Matt Jordan (mjordan) 2012-10-15 20:48:55.709-0500

The backtrace indicates that the ao2 dialog pointer passed to the scheduler was pointing to garbage when the scheduler serviced the item.  At first glance, this indicates an imbalance in the ref counting on the dialog: somewhere, something decreased the ref count on the scheduled dialog one more time than it should have, so when the scheduler serviced the request, the pointer it had was no longer good.
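
To illustrate the kind of imbalance described above, here is a minimal sketch in Python. It is a toy model only, not Asterisk's ao2 or scheduler API: `Dialog`, `Scheduler`, `keepalive`, and `simulate` are hypothetical names made up for this example. It assumes the scheduled item holds its own reference to the dialog, and shows how a single extra unref elsewhere leaves the scheduler with a pointer to an already destroyed object.

```python
class Dialog:
    """Toy stand-in for a ref-counted dialog (hypothetical, not the ao2 API)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1       # the creator holds the first reference
        self.destroyed = False

    def ref(self):
        self.refcount += 1
        return self

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True   # in C, the memory would be freed here


class Scheduler:
    """Toy scheduler: each pending entry holds its own dialog reference."""

    def __init__(self):
        self.pending = []

    def add(self, callback, dialog):
        # Bump the ref count before storing the pointer in the schedule.
        self.pending.append((callback, dialog.ref()))

    def run(self):
        results = []
        pending, self.pending = self.pending, []
        for callback, dialog in pending:
            results.append(callback(dialog))
            dialog.unref()          # release the scheduler's reference
        return results


def keepalive(dialog):
    # In C, touching a destroyed dialog reads freed memory and may segfault.
    if dialog.destroyed:
        return "use-after-free: " + dialog.name
    return "keepalive sent: " + dialog.name


def simulate(extra_unrefs):
    dialog = Dialog("call-1")
    sched = Scheduler()
    sched.add(keepalive, dialog)    # ref count is now 2
    dialog.unref()                  # owner drops its reference, count is 1
    for _ in range(extra_unrefs):
        dialog.unref()              # the bug: one unref too many
    return sched.run()
```

With balanced counts, `simulate(0)` services the keep alive on a live dialog. With one extra unref, `simulate(1)` reaches the callback after the dialog has been destroyed, which is the situation the backtrace suggests.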

Unfortunately, a dig through the ref count log doesn't really indicate this - everything that was scheduled during a provisional keep alive was properly ref bumped prior to the keep alive being scheduled, and none of them were destroyed before the system crashed.
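
The check described above can be sketched roughly as follows, in Python. This is an assumption-laden illustration, not a parser for the actual ref count log: it takes a hypothetical, already-extracted list of `(object_id, action)` events, where `+1` and `-1` are ref changes and `sched` and `fire` are made-up markers for when a keep alive was scheduled and serviced. The function name `audit` is likewise hypothetical.

```python
from collections import defaultdict


def audit(events):
    """Return ids of objects whose ref count reached zero while still scheduled.

    events: iterable of (object_id, action) pairs, where action is one of
    '+1', '-1', 'sched', or 'fire' (a hypothetical, pre-parsed format).
    """
    refs = defaultdict(int)     # running ref count per object
    scheduled = set()           # objects with a keep alive still pending
    suspects = []
    for obj, action in events:
        if action == "sched":
            scheduled.add(obj)
        elif action == "fire":
            scheduled.discard(obj)
        else:
            refs[obj] += int(action)
            if refs[obj] <= 0 and obj in scheduled and obj not in suspects:
                suspects.append(obj)
    return suspects
```

An empty result corresponds to what the log showed here: every scheduled dialog still held a reference when its keep alive was due.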

The only indication as to what happened prior to the crash in the logs is this:

[2012-10-10 19:22:20] WARNING[31843]: app_dial.c:2341 dial_exec_full: Unable to create channel of type 'SIP' (cause 20 - Subscriber absent)

Unfortunately, this too feels like a bit of a red herring - (1) there are a lot of them in your ref logs, indicating that you have some peers that we can't create an address for, and (2) tracking the objects in the ref log at what appears to be the crash, none of them appear to have been scheduled at any point in time.

And everything still points to the crash happening when the scheduler fires the provisional keep alive on a dialog that has already been disposed of.

This is quite odd.

So... hm.  At this point it feels like there's something we're still missing here.  If you can reproduce this crash, can you see if you can provide a DEBUG log with 'sip set debug on' as well?  I'm sure it will be quite large - so you may need to rotate the log file and only include the portion that leads up to the actual crash itself.

By: David Brillert (aragon) 2012-11-07 11:11:28.495-0600

I haven't been able to reproduce this since I opened the report.

By: David Brillert (aragon) 2012-11-07 11:49:59.506-0600

I think we were able to fix this with some changes to our all-hangup AGI script. Asterisk was hanging up a channel prematurely, and we didn't detect that hangup in the script.