[Home]

Summary:ASTERISK-07293: [branch] dropped just after INVITE call doesn't finishes correctly
Reporter:caspy (caspy)Labels:
Date Opened:2006-07-06 02:21:18Date Closed:2007-01-08 05:32:09.000-0600
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) full_with_cancel_fix.txt
( 1) full_with_rfc_fix.txt
( 2) sip_and_console_debug_1.2.12.1.redacted.txt
( 3) sip_cancel_patch_1.2_r44332.txt
( 4) sip_cancel_patch_1.4_r44366.txt
( 5) sip_cancel_patch.diff.txt
( 6) sip_debug_forgotten_call_1.txt
( 7) SVN-branch-1.2-r48319.log
Description:If Asterisk, for some reasons, decideded to drop/destruct a call just after sending INVITE, it does not clear it correctly, thus forcing a sip-phone to ring infinetely for already unexisting call.

Please, look for attached 'sip debug' capture.
In this log you will see ALL packets for 3fd0e0e973fae7b1001acca6356235d6 call, this is not a truncated log.

It seems to me, that Asterisk should correctly clear a call prior to destroing it.

****** ADDITIONAL INFORMATION ******

Really, this is not a 1.2.9.1, this is SVN-branch-1.2-r36377.
Comments:By: Olle Johansson (oej) 2006-07-06 08:35:01

Caspy, you should know by now that we need much more information. Set debug to 4, turn on SIP history, dumphistory, verbose to 4 and turn on SIP debug - then capture the log. As usual I state "this should not happen" even though it obviously does...

Thanks for returning promptly with the requested information!

By: caspy (caspy) 2006-07-06 09:28:11

oej, i know. but this it is not easy in this case.
i can't force this, but (debug) log from my production will be fully unreadable (too much messages cause of high load).

and why do you think, that current log not enought?
it's clear from it, that call was finished on Asterisk just after sending INVITE (looks like calling party hanged up). all next phone's replies on this INVITE was ignored. it's not ok, cause phone become unsynced with Asterisk.

By: caspy (caspy) 2006-07-06 09:30:00

anyway, i'll try to collect some more logs on morning hours, where it will be little easy to find corresponding info in log, but i can't promise it ;(

By: Olle Johansson (oej) 2006-08-07 10:34:16

caspy: Any more information for us?

By: Olle Johansson (oej) 2006-08-07 15:12:11

There's been some fixes in this area, does the problem exist with the latest trunk?

By: caspy (caspy) 2006-08-08 01:35:59

oej,
is there any backport for 1.2? i have only production now. :(

By: Serge Vecher (serge-v) 2006-08-08 08:48:38

caspy: the trunk in current form should be close to "stability" of a 1.2 release branch, while offering major improvements in certain areas (like chan_sip), which will not be backported back to 1.2. I would suggest dedicating a machine to doing a test run of trunk to, at least, see if the issues in this report are resolved for you.

By: caspy (caspy) 2006-08-15 10:13:56

it test env it is unreal to reproduce.
i'll try later with trunk, when second server will become available.

By: Serge Vecher (serge-v) 2006-08-25 11:29:55

any luck?

By: Curt Moore (jcmoore) 2006-09-04 01:12:29

I too experienced this exact issue in both the 1.2 branch and SVN trunk.  This only happens whenever the source UA terminates the call before the target UA answers the call, ie. it is in the Ringing state.

After lots of reading of the SIP RFC and experimentation, it appears that this is due to some of the SIP_CAN_BYE code from a few months back.

As it turned out, there were 2 issues at play here.

1) More checks needed to be done in the 1xx cases in handle_response_invite(). Although it _should_ be impossible to get a 1xx after a dialog is established, I added the checks just to be RFC compliant.
2) The logic in sip_hangup() needed to be fixed as it made no sense.  If we have already made the determination that a CANCEL needs to be sent, why would we also need to check if we can send a BYE? It seems that this extra check in sip_hangup() was just plain wrong and needed to be removed.  See the diff for details.

I have attached a patch against SVN trunk, sip_cancel_patch.diff.txt which seems to have fixed this issue in both 1.2 and SVN trunk in all tests I have conducted thus far.

As always, please let me know if there is another or better way of doing this.

By: Serge Vecher (serge-v) 2006-09-05 10:21:24

tgrman: it would be absolutely awesome if you could show some call flows with and without your patch as per the following instructions:
1) Prepare test environment (reduce the ammount of unrelated traffic on the server);
2) Make sure your logger.conf has the following line:
  console => notice,warning,error,debug
3) restart Asterik.
4) Enable SIP transaction logging with the following CLI commands:
set debug 4
set verbose 4
sip debug
5) Save complete console log to file and _attach_ said file to the bug.


By: Curt Moore (jcmoore) 2006-09-08 10:15:19

I have attached 2 logfiles. The first, full_with_rfc_fix.txt, is with only the additional provisional response checks in handle_response_invite() which fixes the RFC noncompliance.  With this fix, it amplifies the occurance of Asterisk tearing down calls without sending a BYE/CANCEL as appropriate.

The second logfile, full_with_cancel_fix.txt, is with both the RFC noncompliance and the SIP_CAN_BYE check in sip_hangup() removed in the section where we've already determined that a CANCEL needs to be sent instead of a BYE.

Unfortunately, the error doesn't occur on demand if the extra provisional response checks are not added in handle_response_invite().

I initially saw this behavior when I was dialing multiple SIP devices simultaneously, ie. Dial(SIP/device1&SIP/device2&...).  Since implementing this patch, I have not seen any further occurances of this behavior.

Please let me know if further information is needed.

By: Olle Johansson (oej) 2006-09-09 08:49:36

Accidentally I committed something quite similar without looking into this issue. Please check if the issue still exist within svn trunk and 1.2. Thanks.

By: Curt Moore (jcmoore) 2006-09-11 08:58:28

oej, shouldn't we still be testing whether we can actually send a BYE in the 1xx cases of handle_response_invite()?  Unless I've read the RFC incorrectly, which is entirely possible, a BYE cannot be sent on a provisional response to an INVITE unless there is a tag in the To header field and the INVITE is not pending, ie. the INVITE has been been ACKd.  It appears to me that we're blindly setting SIP_CAN_BYE in the 1xx cases without doing any checks.

I'll test oej's latest changes when I'm back in the office on 9/13. Until then, any other feedback is appreciated.

By: daroz (daroz) 2006-10-03 15:41:35

I can confirm this same behavior in 1.2.12.1. Is there any hope for a patch for the 1.2 branch?

By: Serge Vecher (serge-v) 2006-10-03 16:33:12

ok, the purpoted fix mentioned by oej in 0051496 would be in 1.2.12.1 release.

daroz: can you please make a new sip debug trace?

By: daroz (daroz) 2006-10-03 18:22:50

Uploaded debug with console output (for context). File is redacted to remove phone numbers and non-private IP addresses. If the original is needed it can be sent privately.

This particular call was inbound to a ringall queue which rang 3 phones. 1 took the call, the other two continues to ring. No SIP CANCEL message was sent after the initial INVITE until the phones (Cisco 7961s) attempted to answer the call and returned a 200 on the initial INVITE which * turned around and responded with a BYE message.

This was from a verbose 10 / debug 10 console output. (No timestamps) if the portion of the 'full' logfile is needed it can be arranged as well.

By: Curt Moore (jcmoore) 2006-10-03 19:09:55

I still think that oej's fix on this issue is incomplete.  I've been running my code modifications in sip_cancel_patch.diff.txt for a month now and haven't had any further issues with CANCELs not being sent where necessary.  While my patch may not be the preferred way of handling this, it does work.

As I mentioned previously, why in sip_hangup(), do we need to check if we need to send a BYE if we've already entered an "if" statement that is "if(needcancel)"?  Shouldn't we have already figured this out that a BYE is not needed but a CANCEL is needed instead?

In addition, I still maintain that extra checks are needed in handle_response_invite() before blindly setting SIP_CAN_BYE in the 1xx cases.  The RFC is quite clear on not being able to send BYEs on INVITEs which either have not yet been ACKed or do not have a tag in the To header field.  While this is an edge case, checks should still be done.

I could be totally off in my logic but it seems to be working in my production environment with over 50,000 SIP calls per day.

oej: Could you take another look at this, please? :-)

By: Serge Vecher (serge-v) 2006-10-04 08:22:02

Daroz can you please patch your system with sip_cancel_patch.diff.txt and give some feedback on testing? Thanks.

By: daroz (daroz) 2006-10-04 08:38:14

Unfortunately I'm in a production enviroment and two of our manager apps break with the changes for 1.4 leaving moving to HEAD or the 1.4 branch impossible. This patch doesn't apply to 1.2.12.1 source, nor the top of the 1.2 branch.

By: Curt Moore (jcmoore) 2006-10-04 09:38:06

I uploaded a version of the patch, sip_cancel_patch_1.2_r44332.txt, against the latest 1.2 branch, r44332.  This should apply cleanly. If not, let me know.

By: daroz (daroz) 2006-10-04 10:04:32

tgrman - Thanks. That applies clean. I'll recompile and install at the end of the day and should have an update by this time tomorrow.

By: hristo (hristo) 2006-10-04 12:13:17

I have very similar problem with 1.4 (r44333) that involves INVITES, 183 w/SDP and CANCELs. Even though it is a bit different it may be related:

carrier1, asterisk in the middle, carrier2

carrier1 --INVITE--> asterisk --INVITE--> carrier2
carrier1 <--Trying-- asterisk <--Trying-- carrier2
carrier1 <--183/SDP-- asterisk <--183/SDP-- carrier2
carrier1 --CANCEL--> asterisk --> no message is ever sent to carrier2
carrier1 <--487-- asterisk
carrier1 <--200-- asterisk
carrier1 --ACK--> asterisk

With 180 in place of 183 everything in the above scenario works fine and both call legs are terminated properly. Similar problem is observed if timeout is specified with the Dial application - once the timeout is reached and there is no answer the call is destroyed but asterisk never sends CANCEL to the called party and it keeps on ringing. Again this only appears if asterisk receives 183/SDP provisional responce - with 180 everything works fine.

Note, that in my case 1.2 branch works as expected.

sip_cancel_patch.diff.txt fails against 1.4-r44333 and I wasn't able to test if it fixes the problem.

If you belive that the problems are related I can attach SIP debug for the call flow above or open a new bug otherwise.

By: Curt Moore (jcmoore) 2006-10-04 12:21:47

I uploaded another version of the patch, sip_cancel_patch_1.4_r44366.txt, against the latest 1.4 branch, r44366. This should apply cleanly. If not, let me know.

By: daroz (daroz) 2006-10-04 16:28:36

Confirmed: sip_cancel_patch_1.2_r44332.txt (2,097 bytes) 10-04-06 09:35 has corrected the issue in a few brief after-hours tests. I'll get more information when load rises tomorrow in the case of unintended side effects, but the primary issue is resolved for 1.2.

_Thank_You_

By: hristo (hristo) 2006-10-05 04:01:20

My problem still remains even after trying the 1.4 patch. I guess it's a different problem and will report it in new bug.

By: daroz (daroz) 2006-10-05 11:36:00

Under load we haven't identified any new issues that have arisen as a result of this patch. Thanks very much.

By: daroz (daroz) 2006-10-13 16:48:32

Quick Q: Any hope of getting this applied to 1.2/1.4/trunk SVN?

By: Curt Moore (jcmoore) 2006-10-30 21:33:21.000-0600

oej: Could you take another look at this?  If you agree, it would be nice to get this into SVN ASAP as it's a fairly significant bug for chan_sip, to the point of making it almost unusable in certain instances.

I've now been running my patch for almost 2 months in the 1.2 branch and most recently in trunk/1.4 and it seems to have resolved the issue.  If you need more info, I should be in #asterisk-dev as jcmoore.



By: Olle Johansson (oej) 2006-11-03 04:48:24.000-0600

This is high on the todo-list

By: Olle Johansson (oej) 2006-11-12 13:15:38.000-0600

Created a new branch called "invitestate" to work on a patch for this. Please check out

http://svn.digium.com/svn/asterisk/team/oej/invitestate

and test if this is an improvement. Let's work on that branch and fix this properly.



By: daroz (daroz) 2006-11-14 19:37:14.000-0600

I'll pull the branch down and recompile over the weekend. Thanks oej.

By: Serge Vecher (serge-v) 2006-11-20 08:38:47.000-0600

Daroz: any results from the weeekend?

By: daroz (daroz) 2006-11-24 00:02:28.000-0600

sorry I got derailed and didn't get the branch installed last weekend. It has, however, now been installed. I'll let you know how it turns out when it gets some load.

By: daroz (daroz) 2006-11-29 22:42:14.000-0600

Ok.. I'm not 100% but something doesn't seem to be _quite_ right with this branch. I'm gathering more debugging (sip packets) but on some inbound calls it appears as though asterisk is ignoring a response to an invite packet and continually retransmitting the already acknowledged invite - yet the call completes and is abruptly terminted some seconds later when all the retransmission attemps 'fail'.

This is with the invitestate branch r47659.

Will attach logs when I get some hard data.

By: Olle Johansson (oej) 2006-11-30 00:34:35.000-0600

Daroz: That's very valuable feedback. Looking forward to debug output so I can fix that.

By: Olle Johansson (oej) 2006-11-30 14:38:25.000-0600

Created invitestate-1.4 branch too. I need test results.

Spent a lot of time testing the 1.2 based version today, found no issues...

By: Olle Johansson (oej) 2006-12-02 14:36:28.000-0600

This was related to 183 Session progress.

Please test latest version of the branch of your choice
- invitestate for 1.2
- invitestate-1.4 for 1.4
- trunk for trunk

I believe this is fixed. Your urgent reply is really appreciated, so we can close this issue and merge
the branches. Thanks.

Many thanks to Leif Madsen (Blitzrage) who gave me access to a test system and for placing
many test calls. International bug hunting - I'm in Sweden, Leif's in Canada and the server was
located in Florida!

By: Olle Johansson (oej) 2006-12-05 12:58:31.000-0600

waiting for feedback.

By: daroz (daroz) 2006-12-05 19:36:21.000-0600

Updated to Asterisk SVN-oej-invitestate-r48214 on our production box. Will test and let you know.

By: Olle Johansson (oej) 2006-12-06 02:14:25.000-0600

To add to your list: Can you please also try with the clean 1.2 svn - latest revision? Thanks.

By: daroz (daroz) 2006-12-06 09:26:23.000-0600

Ok, we had an issue with the "#" for blindtransfer not working in the invitestate branch. I took the opportunity to switch to the 1.2 trunk and we're now testing that.

By: daroz (daroz) 2006-12-06 09:40:41.000-0600

Ok, 1.2 trunk still has the 'forever ringing' problem... Asterisk not sending CANCELs when a call is answered by another endpoint. We'll be rolling back to somewhere in the invitestate branch shortly.

Log file uploaded: SVN-branch-1.2-r48319.log
(verbose/debug 10 + sip debug)



By: daroz (daroz) 2006-12-06 11:22:38.000-0600

Ok, I rolled back to the latest invitestate branch, but I can reconfirm the problem with DTMF -- it appears as though there's some failure of asterisk to send DTMF out (inbound DTMF seemed fine).

Unfortunately we needed to get things stablized so we've moved back to the tip of the 1.2 branch and re-applied the sip_cancel_patch_1.2_r44332.txt patch which has everything working again, so far.

By: Olle Johansson (oej) 2007-01-08 05:31:52.000-0600

Seems like the issue is resolved. If you don't agree and still have the same problem as the original reporter, then re-open and contribute new facts. Thanks everyone for helping out.