ASTERISK-13876: Cause code for INVITE timeout doesn't match intent of code and semi-randomly chooses between DIALSTATUS of BUSY and CONGESTION

[Home]

Summary: ASTERISK-13876: Cause code for INVITE timeout doesn't match intent of code and semi-randomly chooses between DIALSTATUS of BUSY and CONGESTION

Reporter: David Woolley (davidw) Labels:

Date Opened: 2009-04-01 07:03:45 Date Closed: 2011-07-26 15:15:03

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments:

Description: With 1.6.0.1 we were observing a cause code of 18, and also possibly of 0, for invites that timed out because of a connection failure (we are testing a failure case in which a Cisco CCM goes down, by pulling its network cable). When we tried to do this as a more controlled test with 1.6.0.6, we found that we were always getting cause 17. Whilst, as note in ASTERISK-13761, I don't believe that 18 is an appropriate encoding of this condition, at least it doesn't clash with a code that can actually be generated by normal SIP events in Asterisk, whereas 17 is AST_CAUSE_USER_BUSY, the standard subscriber engaged code.

Moreover, we found that we sometimes got a DIALSTATUS of BUSY and sometimes of CONGESTION. The former is associated with a critical request timeout log message and the latter with an Auto-congest log message. My guess is that the latter corresponds to the cause 0 we observed for 1.6.0.1.

Examination of the no reply to critical request code path shows that it is actually trying to set a cause code of 18, but will not do so if one is already set. On that basis, I argue that the use of code 17 is a bug, as it is not what the code is trying to do.

****** ADDITIONAL INFORMATION ******

The BUSY/CONGESTION conflict seems to be because there are two mechanisms used to detect the failure of a reply to an INVITE. One is to count the number of transmission retries, and the other is to use the Timer B from the SIP RFC (counting the retransmissions doesn't seem to be sanctioned by the RFC). Both tend to result in very similar timeouts, in simple cases.

In our case, we observed the Auto-congest route initially, when we where using Dial(SIP/6906@192.168.10.10) and the critical request case when using the, more correct for our application, Dial(SIP/6906@cisco1). However, I'm not convinced that the indirection of the downstream IP address was the real cause of going through the different paths.

As far as I can tell, the Auto-congest path fails, entirely, to set the cause code, which may explain why we saw cause 0 on 1.6.0.1 (although under less controlled conditions, so I can't swear that it happened).

The critical request path does set the cause code, but only if it is not zero. From the above, I speculate that something has started pre-emptively, setting a USER_BUSY cause code, although I haven't been able to track that down in the code.

The dialplan is:

exten => 6188,1,Dial(SIP/6906@cisco1)
exten => 6188,n,Noop(${DIALSTATUS})
exten => 6188,n,Noop(${HANGUPCAUSE})

In the case of getting CONGESTION, the cisco1 was replaced by the literal IP address.

cisco1 is defined as:

[cisco1]
type=friend
context=default
host=192.168.10.10
disallow=all
;allow=alaw
allow=ulaw
canreinvite=yes
qualify=yes
call-limit=1000
dtmfmode=rfc2833

A typical congestion case trace is:

[2009-03-31 08:34:18.873] VERBOSE[2680] logger.c: -- Executing [6188@yyyyyyyyyyyyyyyy:1] Dial("SIP/xxx-messenger-090a0a78", "SIP/6906@192.168.10.10") in new stack
[2009-03-31 08:34:18.873] VERBOSE[2680] logger.c: == Using SIP RTP CoS mark 5
[2009-03-31 08:34:18.878] VERBOSE[2680] logger.c: -- Called 6906@192.168.10.10
[2009-03-31 08:34:50.878] NOTICE[31862] chan_sip.c: Auto-congesting SIP/192.168.10.10-0903fb98
[2009-03-31 08:34:50.878] VERBOSE[2680] logger.c: -- SIP/192.168.10.10-0903f
b98 is circuit-busy
[2009-03-31 08:34:50.878] VERBOSE[2680] logger.c: == Everyone is busy/congeste
d at this time (1:0/1/0)
[2009-03-31 08:34:50.878] VERBOSE[2680] logger.c: -- Executing [6188@yyyyyyyyyyyyyyyy:2] NoOp("SIP/xxx-messenger-090a0a78", "CONGESTION") in new stack
[2009-03-31 08:34:50.879] VERBOSE[2680] logger.c: -- Executing [6188@yyyyyyyyyyyyyyyy:3] NoOp("SIP/xxx-messenger-090a0a78", "17") in new stack
[2009-03-31 08:34:50.879] VERBOSE[2680] logger.c: -- Auto fallthrough, chann
el 'SIP/xxx-messenger-090a0a78' status is 'CONGESTION'

and a typical BUSY trace is:

[2009-03-31 08:45:04.965] VERBOSE[2976] logger.c: -- Executing [6188@yyyyyyyyyyyyyyyy:1] Dial("SIP/xxx-messenger-090b6c50", "SIP/6906@cisco1") in new stack
[2009-03-31 08:45:04.965] VERBOSE[2976] logger.c: == Using SIP RTP CoS mark 5
[2009-03-31 08:45:04.966] VERBOSE[2976] logger.c: -- Called 6906@cisco1
[2009-03-31 08:45:18.365] WARNING[31862] chan_sip.c: Maximum retries exceeded on
transmission 666e322873fa7966506492da5f3cbc51@192.168.130.140 for seqno 102 (Cr
itical Request) -- See doc/sip-retransmit.txt.
[2009-03-31 08:45:18.365] WARNING[31862] chan_sip.c: Hanging up call 666e322873f
a7966506492da5f3cbc51@192.168.130.140 - no reply to our critical packet (see doc
/sip-retransmit.txt).
[2009-03-31 08:45:18.365] VERBOSE[2976] logger.c: == Everyone is busy/congeste
d at this time (1:1/0/0)
[2009-03-31 08:45:18.365] VERBOSE[2976] logger.c: -- Executing [6188@yyyyyyyyyyyyyyyy:2] NoOp("SIP/xxx-messenger-090b6c50", "BUSY") in new stack
[2009-03-31 08:45:18.366] VERBOSE[2976] logger.c: -- Executing [6188@yyyyyyyyyyyyyyyy:3] NoOp("SIP/xxx-messenger-090b6c50", "17") in new stack
[2009-03-31 08:45:18.366] VERBOSE[2976] logger.c: -- Auto fallthrough, chann
el 'SIP/xxx-messenger-090b6c50' status is 'BUSY'

This is the code that attempts to set the cause in the critical request path:

2825 if (pkt->owner->owner && !pkt->owner->owner->hangupcause)
2826 pkt->owner->owner->hangupcause = AST_CAUSE_NO_USER_RESPONSE;

Comments: By: David Woolley (davidw) 2009-04-01 08:37:23

svn.digium.com has been broken for a least a day, so I can't use the web view onto the archive to definitively identify the patch, but this patch that was introduced between 1.6.0 and 1.6.0.6 is probably the cause of the BUSY path problem. I think the CONGESTION path was always broken. Comparing the 1.6.0 archive (possibly modified, but not at this point) and the 1.6.0.6 archive copies of chan_sip.c:

@@ -4270,8 +4299,11 @@

res = update_call_counter(p, INC_CALL_RINGING);

- if (res == -1)
+ if (res == -1) {
return res;
+ } else {
+ ast->hangupcause = AST_CAUSE_USER_BUSY;
+ }

p->callingpres = ast->cid.cid_pres;
p->jointcapability = ast_translate_available_formats(p->capability, p->prefcodec);

Note, except possibly for the exact left hand side line numbers, this is a difference between two versions already licensed to Digium. It is provided solely to identify the revision that has already been made.
By: David Woolley (davidw) 2009-04-06 13:03:12

The problem patch seems to have been introduced by issue ASTERISK-13041.
By: David Woolley (davidw) 2009-04-06 13:34:35

Looks like I have another issue report on this patch, although only based on code reading. It has been broken by a bad merge into 1.6, see ASTERISK-1452686. In fact, it may be this breakage that is the real problem here.
By: David Woolley (davidw) 2009-04-07 08:10:35

Applying the obvious fix to ASTERISK-1452686 reverts the behaviour to that in 1.6.0.

Maybe we can track the cause 17 part of the problem against ASTERISK-13764 (assuming that will get fixed), leaving the following issues here:

1) The somewhat indeterminate choice between the auto-congest and critical request paths (although noting that using an explicit IP address v sip.conf seems to make a fairly consistent difference);

2) The auto-congest path returning an invalid cause code of zero.

I can live with both of these because they are not mimiced by things that can happen in normal operation on a SIP trunk.

This does imply that ASTERISK-1452686 is now more severe.
By: Leif Madsen (lmadsen) 2009-05-22 12:58:10

Now that 14686 is closed, what is left to do for this issue?
By: David Woolley (davidw) 2009-06-18 06:18:44

I believe the race conditions between the two methods of timing out an INVITE still exists.
By: Russell Bryant (russell) 2011-07-26 15:14:57.128-0500

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions

If this is still an issue, please open a new issue so it can be re-triaged appropriately. Thanks!