Summary:ASTERISK-21064: Crash when handling ACK on dialog that has no channel
Reporter:Colin Cuthbertson (colinc)Labels:
Date Opened:2013-02-11 10:21:45.000-0600Date Closed:2013-08-27 10:54:01
Versions: Frequency of
is duplicated byASTERISK-24305 Crash with astobj2.c: user_data is NULL
Environment:CentOS release 5.5 (Final) Intel(R) Xeon(R) CPU E5420 @ 2.50GHz 2 GB RAM VM Ware ESXI Attachments:( 0) AST-2013-004-1.8.15-cert.diff
( 1) AST-2013-004-1.8.diff
( 2) AST-2013-004-11.2-cert.diff
( 3) AST-2013-004-11.diff
( 4) backtrace_20130211.txt
( 5) backtrace_20130212.txt
( 6) debug_20130212.tar.gz
( 7) issueA21064_fix.patch
( 8) issueA21064_late_ack_of_death.xml
( 9) pcap_20130212.pcap
Description:Asterisk is frequently segfaulting (several times per week) in __ast_queue_frame at at channel.c:1450.

This is a fairly busy call routing server with up to 120 calls at peak times. Calls come in on a SIP channel and are routed straight back out on another SIP channel. directmedia=yes.

There are about eleven lines of dialplan for each call:
Set("SIP/xxxx-0000044c", "TAG=xxxxx")
NoOp("SIP/xxxx-0000044c", "Call from x to y")
Gosub("SIP/xxxx-0000044c", "sub-set-group,s,1(xxxxx)
Set("SIP/xxxx-0000044c", "GROUP()=xxxxx")
NoOp("SIP/xxxx-0000044c", "xxxxx: Call 27")
Return("SIP/xxxx-0000044c", "")
Set("SIP/xxxx-0000044c", "CDR(accountcode)=xxxx")
Gosub("SIP/xxxx-0000044c", "sub-set-cli,xxx,1")
Set("SIP/xxxx-0000044c", "CALLERID(num)=nnnnnnnnnnnn")
Return("SIP/xxxx-0000044c", "")
Dial("SIP/xxxx-0000044c", "SIP/trunk/yyyy,300,C")

The problem does not appear to be load related. The segfault has happened with only 26 calls in progress.

The segfault has also happened after only 30 minutes of uptime and then not again for a few days.

We have seen the problem on three different VM's.

I have a feeling that the problem is only happening with calls from one source SIP trunk but I have been unable to prove this.

Program terminated with signal 11, Segmentation fault.
#0  0x000000000046a961 in __ast_queue_frame (chan=0x0, fin=0x41d72f60, head=0, after=0x0) at channel.c:1450
1450            cur = AST_LIST_LAST(&chan->readq);

I have attached a backtrace. Please let me know if any further information is required.
Comments:By: Matt Jordan (mjordan) 2013-02-11 10:32:31.087-0600

For some reason, there isn't a channel in place when Asterisk handles this particular ACK.

Can you provide a DEBUG log, as well as a pcap showing the message traffic leading up to the crash? There may be a deeper issue here where the channel that is having direct media negotiated on it is getting disposed of improperly.

By: Joshua C. Colp (jcolp) 2013-02-18 06:31:49.132-0600

ASTERISK-21112 is an internal Digium task, any updates will be posted here.

By: Colin Cuthbertson (colin_c) 2013-05-31 11:01:23.445-0500

Has there been any progress made on this issue or ASTERISK-21112 ?

Our customer is still experiencing the segfaults. If you need any further debug information from me, please let me know.

By: Matt Jordan (mjordan) 2013-06-06 09:14:31.275-0500

Your issue is in queue, please be patient, and we will get to it as time permits and developer resources become available.

By: Elwin Formsma (formsmae) 2013-06-21 04:24:42.198-0500

We are experiencing this problem as well. Any sight on when you will look at this?

By: Michael (talkttcom) 2013-07-30 13:19:24.538-0500

This problem happens 3-4 times a day. We really appreciate someone can take a look at it and provide a fix or workaround. Thanks.

By: Walter Doekes (wdoekes) 2013-08-19 04:50:03.526-0500

Is that ACK (from readbuf in {{backtrace_20130212.txt}}) supposed to be in the pcap ({{pcap_20130212.pcap}}), because I don't see it.

You might set directmedia to no, to avoid the issue.

By: Walter Doekes (wdoekes) 2013-08-20 06:31:23.643-0500

$ sudo asterisk -V
Asterisk SVN-branch-1.8-r396745

$ cat /etc/asterisk/sip.conf

$ cat /etc/asterisk/extensions.conf
exten => answer,1,Answer()
same => n,Hangup()

$ sudo gdb asterisk
(gdb) run -c

$ sipp -m 1 -sf issueA21064_late_ack_of_death.xml -key tel answer

# OBSERVE: you may need to tweak the pause in the sipp scenario a bit
# In my case 510ms was 100% reliable. Asterisk had started the Hangup but
# hadn't sent the BYE yet.

[Aug 20 10:23:04] DEBUG[23620]: chan_sip.c:3946 retrans_pkt: ** SIP timers: Rescheduling retransmission 2 to 1000 ms (t1 500 ms (Retrans id #2))
Retransmitting #1 (NAT) to
SIP/2.0 200 OK
   -- Executing [answer@default:2] Hangup("SIP/", "") in new stack
 == Spawn extension (default, answer, 2) exited non-zero on 'SIP/'
[Aug 20 10:23:04] DEBUG[23628]: chan_sip.c:6682 sip_hangup: Hangup call SIP/, SIP callid 1-23627@
Scheduling destruction of SIP dialog '1-23627@' in 32000 ms (Method: INVITE)

<--- SIP read from UDP: --->
ACK sip:answer@ SIP/2.0
Found audio description format PCMA for ID 8
[Aug 20 10:23:04] DEBUG[23620]: chan_sip.c:9617 process_sdp: Processing media-level (audio) SDP a=rtpmap:8 PCMA/8000... OK.
Found audio description format PCMU for ID 0
[Aug 20 10:23:04] DEBUG[23620]: chan_sip.c:9617 process_sdp: Processing media-level (audio) SDP a=rtpmap:0 PCMU/8000... OK.
Capabilities: us - 0x80000008000e (gsm|ulaw|alaw|h263|testlaw), peer - audio=0xc (ulaw|alaw)/video=0x0 (nothing)/text=0x0 (nothing), combined - 0xc (ulaw|alaw)
Non-codec capabilities (dtmf): us - 0x1 (telephone-event|), peer - 0x0 (nothing), combined - 0x0 (nothing)
Peer audio RTP is at port
[Aug 20 10:23:04] DEBUG[23620]: chan_sip.c:9871 process_sdp: We're settling with these formats: 0xc (ulaw|alaw)
[Aug 20 10:23:04] ERROR[23620]: astobj2.c:115 INTERNAL_OBJ: user_data is NULL

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff4184700 (LWP 23620)]
0x0000000000475d60 in __ast_queue_frame (chan=0x0, fin=0x7ffff4182da0, head=0, after=0x0) at channel.c:1454
1454 cur = AST_LIST_LAST(&chan->readq);

By: Walter Doekes (wdoekes) 2013-08-20 09:45:32.401-0500

I just noticed that this bug wasn't in my 10.7 version. It's relatively new:
$ svn log -r 371337
r371337 | jrose | 2012-08-16 17:46:26 +0200 (do, 16 aug 2012) | 11 lines

chan_sip: Trigger reinvite if the SDP answer is included in the SIP ACK

Under certain conditions, a SIP transaction involving directmedia wouldn't
trigger a re-invite because the SDP answer was included in an ACK instead
of in a message that we would have triggered the invite with. This patch
just queues a source change control frame if the dialog is using
directmedia when we find sdp for an ACK.

(closes issue AST-913)
Reported by: Thomas Arimont

Index: channels/chan_sip.c
--- channels/chan_sip.c (revision 371336)
+++ channels/chan_sip.c (revision 371337)
@@ -25445,8 +25445,12 @@
p->pendinginvite = 0;
acked = __sip_ack(p, seqno, 1 /* response */, 0);
if (find_sdp(req)) {
- if (process_sdp(p, req, SDP_T38_NONE))
+ if (process_sdp(p, req, SDP_T38_NONE)) {
return -1;
+ }
+ if (ast_test_flag(&p->flags[0], SIP_DIRECT_MEDIA)) {
+ ast_queue_control(p->owner, AST_CONTROL_SRCCHANGE);
+ }
} else if (p->glareinvite == seqno) {

process_sdp is sprinkled with p->owner checks. This one wasn't.

/me mumbles something about private AST-* issues.

By: Colin Cuthbertson (colinc) 2013-08-23 17:44:29.497-0500

Thanks for your work on this.

I ran your tests on two of my Asterisk servers and re-created the problem. I then applied the patch to each and was unable to re-create the segfault again.

I will keep you posted.

By: Matt Jordan (mjordan) 2013-08-27 09:36:16.613-0500

We've got everything lined up to get the security release out. These should go out today, 08/27.