ASTERISK-20551: Segfault when scheduled provisional keepalive is handled

Summary: ASTERISK-20551: Segfault when scheduled provisional keepalive is handled - dialog has already been destroyed

Reporter: David Brillert (aragon)
Labels:
Date Opened: 2012-10-10 18:57:41
Date Closed: 2012-11-07 11:49:59.000-0600
Priority: Major
Regression? No
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions: 1.8.17.0
Frequency of Occurrence: Occasional
Related Issues: is related to ASTERISK-22079 Segfault: INTERNAL_OBJ (user_data=0x6374652f) at astobj2.c:120
Environment: SVN-branch-1.8-r374802
Attachments: ( 0) cli.txt
( 1) gdb_bt.txt
( 2) segfault_ref_counts.rar

Description: While debugging a ref count issue for another bug report, I ran into a segfault.
Ref count debugging was enabled at the time, so I am attaching:
gdb backtrace
CLI output before the crash
ref count log

Comments: By: Matt Jordan (mjordan) 2012-10-15 20:48:55.709-0500

The backtrace indicates that the ao2 dialog pointer passed to the scheduler was pointing to garbage when the scheduler serviced the item. At first glance, this indicates an imbalance in the ref counting on the dialog: somewhere, something decreased the ref count on the scheduled dialog one more time than it should have, so when the scheduler serviced the request, the pointer it held was no longer valid.

Unfortunately, a dig through the ref count log doesn't really indicate this - everything that was scheduled during a provisional keep alive was properly ref bumped prior to the keep alive being scheduled, and none of them were destroyed before the system crashed.
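For readers unfamiliar with the pattern, the suspected imbalance can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct dialog`, `dialog_ref`, `dialog_unref`, `keepalive_fire`, and `simulate_keepalive` are hypothetical names, not the astobj2 or chan_sip API, and the sketch sets a `destroyed` flag instead of freeing the object so that it never actually reads freed memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a ref-counted dialog; this is NOT the astobj2 API. */
struct dialog {
    int refcount;
    int destroyed; /* flag instead of free() so the sketch never touches freed memory */
};

static struct dialog *dialog_alloc(void)
{
    struct dialog *d = calloc(1, sizeof(*d));
    assert(d != NULL);
    d->refcount = 1; /* the creator's reference */
    return d;
}

static void dialog_ref(struct dialog *d)
{
    d->refcount++;
}

static void dialog_unref(struct dialog *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0) {
        d->destroyed = 1; /* real code would run the destructor and free here */
    }
}

/* Scheduler callback: returns 1 if the dialog was still alive, 0 if it was
 * already destroyed (the situation that would crash in real code). */
static int keepalive_fire(struct dialog *d)
{
    if (d->destroyed) {
        return 0;
    }
    dialog_unref(d); /* release the reference held for the scheduler */
    return 1;
}

/* Runs one schedule/hangup/fire cycle with 0 or 1 stray unrefs. */
static int simulate_keepalive(int extra_unrefs)
{
    struct dialog *d = dialog_alloc();
    int alive, i;

    dialog_ref(d);               /* bump for the scheduled keepalive */
    for (i = 0; i < extra_unrefs; i++) {
        dialog_unref(d);         /* the hypothesized stray unref */
    }
    dialog_unref(d);             /* owner drops its reference on hangup */
    alive = keepalive_fire(d);   /* scheduler services the item */
    free(d);
    return alive;
}
```

With balanced counting, `simulate_keepalive(0)` returns 1: the scheduler's own reference keeps the dialog alive until the callback runs. With one stray unref, `simulate_keepalive(1)` returns 0: the dialog is already gone when the scheduler fires, which is the condition the backtrace suggests.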

The only indication as to what happened prior to the crash in the logs is this:

{noformat}
[2012-10-10 19:22:20] WARNING[31843]: app_dial.c:2341 dial_exec_full: Unable to create channel of type 'SIP' (cause 20 - Subscriber absent)
{noformat}

Unfortunately, this too feels like a bit of a red herring - (1) there are a lot of them in your ref logs, indicating that you have some peers that we can't create an address for, and (2) tracking the objects in the ref log at what appears to be the crash, none of them appear to have been scheduled at any point in time.

And everything points to the crash occurring when the scheduler fires the provisional keep alive on a dialog that has already been disposed of.

This is quite odd.

So... hm. At this point it feels like there's something we're still missing here. If you can reproduce this crash, can you see if you can provide a DEBUG log with 'sip set debug on' as well? I'm sure it will be quite large - so you may need to rotate the log file and only include the portion that leads up to the actual crash itself.

By: David Brillert (aragon) 2012-11-07 11:11:28.495-0600

I haven't been able to reproduce this since I opened the report.

By: David Brillert (aragon) 2012-11-07 11:49:59.506-0600

I think we were able to fix this with some changes to our all-hangup AGI script. The problem seemed to occur when Asterisk hung up a channel prematurely and we didn't detect that hangup in our all-hangup AGI script.