ASTERISK-11539: SIP channel hung due to CANCEL ReliableXmit (ReTX)

[Home]

Summary: ASTERISK-11539: SIP channel hung due to CANCEL ReliableXmit (ReTX)

Reporter: mvf (mvf) Labels:

Date Opened: 2008-02-28 13:23:29.000-0600 Date Closed: 2009-12-08 19:03:33.000-0600

Priority: Minor Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 20080227_sip_channel_hung_at_cancel_ReTX.txt
( 1) 20080318_sip_debug_and_history.txt
( 2) hangfix.patch

Description: I've detected a problem that make asterisk keep sip channels up forever even when the call has been disconnected, the problem keep UDP ports and memory taken making asterisk to drop new calls when the sip channels hung increase.

Reviewing the "sip show history" of one of these hung channels I see that the call is cancel by calling party while is on ringing state, after that asterisk send CANCEL to the peer but it didn't get an answer (maybe because of network problems). After I get some bad responses from peer, maybe also due to network problems that asterisk reply with an ACK but the history shows a new ReliableXmit timeout message every 30 seconds.

(originating peer)
* SIP Call
1. TxReqRel INVITE / 102 INVITE - -UNKNOWN-
2. Rx SIP/2.0 / 102 INVITE / 100 Trying
3. Rx SIP/2.0 / 102 INVITE / 100 Trying
4. Rx SIP/2.0 / 102 INVITE / 180 Ringing
5. Rx SIP/2.0 / 102 INVITE / 180 Ringing
6. Cancel Cause Normal Clearing
7. SchedDestroy 32000 ms
8. TxReqRel CANCEL / 102 CANCEL - -UNKNOWN-
9. SchedDestroy 32000 ms
10. ReTx 1000 CANCEL sip:17898702397@20.20.20.25 SIP/2.0
11. ReTx 2000 CANCEL sip:17898702397@20.20.20.25 SIP/2.0
12. Rx SIP/2.0 / 102 CANCEL / 100 Trying
13. ReliableXmit timeout
14. Rx SIP/2.0 / 102 INVITE / 180 Ringing
15. Rx SIP/2.0 / 102 INVITE / 486 Busy Here
16. TxReq ACK / 102 ACK - -UNKNOWN-
17. ReliableXmit timeout
18. ReliableXmit timeout
19. ReliableXmit timeout
20. ReliableXmit timeout
21. ReliableXmit timeout
22. ReliableXmit timeout
23. ReliableXmit timeout
24. ReliableXmit timeout
25. ReliableXmit timeout
26. ReliableXmit timeout
27. ReliableXmit timeout
28. ReliableXmit timeout
29. ReliableXmit timeout
30. ReliableXmit timeout
31. ReliableXmit timeout
32. ReliableXmit timeout
33. ReliableXmit timeout
34. ReliableXmit timeout
35. ReliableXmit timeout
36. ReliableXmit timeout
37. ReliableXmit timeout
38. ReliableXmit timeout
39. ReliableXmit timeout
40. ReliableXmit timeout
41. ReliableXmit timeout
42. ReliableXmit timeout
43. ReliableXmit timeout
44. ReliableXmit timeout
45. ReliableXmit timeout
46. ReliableXmit timeout
47. ReliableXmit timeout
48. ReliableXmit timeout
49. ReliableXmit timeout
50. ReliableXmit timeout

I got a capture for the sip transaction messages that generate these hung sip channels, please take a look at it in the attached file including two captures (20080227_sip_channel_hung_at_cancel_ReTX.txt), note that there is no sip messages sent after the last ACK, the "ReliableXmit timeout" messages appear in the history of the call filling the 50 slot buffer and even deleting the old history messages. The "sip show channels" command shows the channel hung forever.

20.20.20.25 1789870239 185d051b629 00102/00000 0x0 (nothing) No (d) Tx: ACK

Comments: By: Olle Johansson (oej) 2008-03-18 01:50:48

Seems like the CANCEL doesn't get received by the end point, which continues the call. That resets some counters in our sip stack, which should not happen. Wonder who sends the "100 trying" after the cancel, it indicates that someone received the CANCEL though.

The instructions for the bug tracker clearly instructs you to take an Asterisk debug log with sip debugging and debug and verbose set to 4. We need to see what happens inside your Asterisk. A packet trace doesn't show that.

Thank you.
By: mvf (mvf) 2008-03-18 16:04:18

Sorry for the lack of correct debug, I'm attaching it now. Call ID 085357c64acd9560021e7b542852562d correspond to the outgoing leg of a call that generate a hung channel. The history for this channel is not in the debug log because the channel is active forever. This is the history of the channel obtained from CLI:

B2BUA*CLI> sip show history 085357c64ac
B2BUA*CLI>
* SIP Call
1. NewChan Channel SIP/PEER_DID_01-0a25b420 - from 085357c64acd9560021e7b
2. TxReqRel INVITE / 102 INVITE - -UNKNOWN-
3. Rx SIP/2.0 / 102 INVITE / 100 Trying
4. Rx SIP/2.0 / 102 INVITE / 100 Trying
5. Rx SIP/2.0 / 102 INVITE / 180 Ringing
6. Rx SIP/2.0 / 102 INVITE / 180 Ringing
7. Cancel Cause Normal Clearing
8. SchedDestroy 32000 ms
9. TxReqRel CANCEL / 102 CANCEL - -UNKNOWN-
10. SchedDestroy 32000 ms
11. ReTx 1000 CANCEL sip:5625709200@20.88.6.25 SIP/2.0
12. ReTx 2000 CANCEL sip:5625709200@20.88.6.25 SIP/2.0
13. Rx SIP/2.0 / 102 CANCEL / 100 Trying
14. Rx SIP/2.0 / 102 INVITE / 180 Ringing
15. ReliableXmit timeout
16. Rx SIP/2.0 / 102 INVITE / 486 Busy Here
17. TxReq ACK / 102 ACK - -UNKNOWN-
18. ReliableXmit timeout
19. ReliableXmit timeout
20. ReliableXmit timeout
21. ReliableXmit timeout
22. ReliableXmit timeout
23. ReliableXmit timeout
24. ReliableXmit timeout
25. ReliableXmit timeout
26. ReliableXmit timeout
27. ReliableXmit timeout
28. ReliableXmit timeout
29. ReliableXmit timeout
30. ReliableXmit timeout
31. ReliableXmit timeout
32. ReliableXmit timeout
33. ReliableXmit timeout

The incoming leg of the call (Call ID d7c7df47-96c6-a73a-fa14-0002a40217d0) is properly destroy.

By: Joshua C. Colp (jcolp) 2008-03-19 11:10:44

oej: The CANCEL request is being retained because of the 100 Trying thus when the dialog destruction code is executed (because it was scheduled) it sees that packet in there and continues to schedule over... and over... and over... any thoughts on what we should do?
By: Emmanuel BUU (neutrino88) 2008-03-19 12:09:49

I suggest that we define a global SIP parameter that is the maximum duration for outgoing transactions. If a transaction is not closed after this duration it should be:

- automatically cancelled (if not done already)
- destroyed and from the outside point of view, it would be like the transaction failed with an 5xx error.
By: mvf (mvf) 2008-03-19 13:45:24

AFAIK * is keeping a channel in Trying status forever, is understable that the destroy would be delayed a few times because of the current state of the channel but this re scheduling can't go forever due to the UDP nature of SIP messages. I think that * should destroy channels that seems hung on Trying state, maybe a global timeout for Trying is a good solution to identify hung channels.
By: Emmanuel BUU (neutrino88) 2008-03-28 18:03:28

I suggest the following patch in chan_sip.c

Index: channels/chan_sip.c
===================================================================
--- channels/chan_sip.c (révision 111858)
+++ channels/chan_sip.c (copie de travail)
@@ -2078,10 +2078,11 @@
}

/* If there are packets still waiting for delivery, delay the destruction */
- if (p->packets) {
+ if (p->packets && !ast_test_flag(&p->flags[0], SIP_NEEDDESTROY)) {
if (option_debug > 2)
ast_log(LOG_DEBUG, "Re-scheduled destruction of SIP call %s\n", p->callid ? p->callid : "<unknown>");
append_history(p, "ReliableXmit", "timeout");
+ ast_set_flag(&p->flags[0], SIP_NEEDDESTROY);
return 10000;
}

Can MVK test this (as the condition is difficult to reproduce) ?
By: Emmanuel BUU (neutrino88) 2008-03-28 18:08:51

Here is a refined patch:

Index: channels/chan_sip.c
===================================================================
--- channels/chan_sip.c (révision 111858)
+++ channels/chan_sip.c (copie de travail)
@@ -2078,10 +2078,12 @@
}

/* If there are packets still waiting for delivery, delay the destruction */
- if (p->packets) {
+ if (p->packets && !ast_test_flag(&p->flags[0], SIP_NEEDDESTROY)) {
if (option_debug > 2)
ast_log(LOG_DEBUG, "Re-scheduled destruction of SIP call %s\n", p->callid ? p->callid : "<unknown>");
append_history(p, "ReliableXmit", "timeout");
+ if (p->method == SIP_CANCEL || p->method == SIP_BYE)
+ ast_set_flag(&p->flags[0], SIP_NEEDDESTROY);
return 10000;
}
By: Raj Jain (rjain) 2008-03-30 06:12:28

This is a non-INVITE client transaction timeout scenario. It is described in section 17.1.2.2 in RFC 3261 (state machine copied below). Basically, a timer called Timer F (64*T1; be default 32 secs) is what triggers the client transaction to come out of the Trying state if no final response is received.

<pre>
Timer F 64*T1 Section 17.1.2.2 non-INVITE transaction
timeout timer

|Request from TU
|send request
Timer E V
send request +-----------+
+---------| |-------------------+
| | Trying | Timer F |
+-------->| | or Transport Err.|
+-----------+ inform TU |
200-699 | | |
resp. to TU | |1xx |
+---------------+ |resp. to TU |
| | |
| Timer E V Timer F |
| send req +-----------+ or Transport Err. |
| +---------| | inform TU |
| | |Proceeding |------------------>|
| +-------->| |-----+ |
| +-----------+ |1xx |
| | ^ |resp to TU |
| 200-699 | +--------+ |
| resp. to TU | |
| | |
| V |
| +-----------+ |
| | | |
| | Completed | |
| | | |
| +-----------+ |
| ^ | |
| | | Timer K |
+--------------+ | - |
| |
V |
NOTE: +-----------+ |
| | |
transitions | Terminated|<------------------+
labeled with | |
the event +-----------+
over the action
to take
</pre>

By: mvf (mvf) 2008-03-31 10:37:26

neutrino88, thanks for your patch. I just installed it in one of my machines. I will be monitoring and will let you know the results.
By: mvf (mvf) 2008-04-07 15:13:14

Ok, after one week under relative high load (the calls per second that result in about 40-50 simultaneuos calls) the patched server has zero hung channels :D
Lets see what the admins say about it, thank you again.
By: Patrick DT (dd1916) 2008-05-09 11:31:49

Hi

Can we get an update on this bug, we seem to have the same issue here.

Will the patch be added in the next official release ?

Can you make the patch for the latest official release ?
By: mvf (mvf) 2008-05-09 14:33:47

I can add that I have my asterisk running with this patch for 5 weeks and 4 days and there are no peers hung due to this cause.
It seems that the case status is confirmed but there isn't an admin assigned yet.
By: Patrick DT (dd1916) 2008-05-30 15:07:32

I tried the patch on our 1.4.17 but I still have several stuck channels

here is the sip show channel

* SIP Call
Curr. trans. direction: Outgoing
Call-ID: 94b138f0-a06000e-13c4-45026-11f95f-536b76fa-11f95f
Owner channel ID: <none>
Our Codec Capability: 4
Non-Codec Capability (DTMF): 1
Their Codec Capability: 4
Joint Codec Capability: 4
Format: 0x0 (nothing)
MaxCallBR: 384 kbps
Theoretical Address: 10.6.0.14:5060
Received Address: 10.6.0.14:5060
SIP Transfer mode: open
NAT Support: No
Audio IP: 172.16.100.101 (local)
Our Tag: as32f92656
Their Tag: 94b16bc8-a06000e-13c4-45026-11f95f-723aba38-11f95f
SIP User agent: ARRIS-TM402P release v.05.02.0X SN/0013113FC54F
Username: 4185558698
Peername: 4185558698
Original uri: sip:4185558698@10.6.0.14:5060
Caller-ID: 4185558698
Need Destroy: 2
Last Message: Rx: BYE
Promiscuous Redir: No
Route: sip:4185558698@10.6.0.14:5060
DTMF Mode: rfc2833
SIP Options: replaces replace 100rel timer join

here is the sip show history

1. Rx INVITE / 1 INVITE / sip:5550927@here.com:5060
2. AuthChal Auth challenge sent for - nc 0
3. TxRespRel SIP/2.0 / 1 INVITE - 407 Proxy Authentication Required
4. SchedDestroy 32000 ms
5. Rx ACK / 1 ACK / sip:5550927@here.com:5060
6. Rx INVITE / 2 INVITE / sip:5550927@royaume.com:5060
7. CancelDestroy
8. Invite New call: 94b138f0-a06000e-13c4-45026-11f95f-536b76fa-11f95f
9. AuthOK Auth challenge succesful for 4185558698
10. NewChan Channel SIP/4185558698-0194b850 - from 94b138f0-a06000e-13c4-45
11. TxResp SIP/2.0 / 2 INVITE - 100 Trying
12. TxResp SIP/2.0 / 2 INVITE - 180 Ringing
13. TxRespRel SIP/2.0 / 2 INVITE - 200 OK
14. Rx ACK / 2 ACK / sip:5550927@172.16.100.101
15. ReInv Re-invite sent
16. TxReqRel INVITE / 102 INVITE - -UNKNOWN-
17. Rx SIP/2.0 / 102 INVITE / 100 Trying
18. Rx SIP/2.0 / 102 INVITE / 200 OK
19. TxReq ACK / 102 ACK - -UNKNOWN-
20. ReInv Re-invite sent
21. TxReqRel INVITE / 103 INVITE - -UNKNOWN-
22. Rx SIP/2.0 / 103 INVITE / 100 Trying
23. Rx BYE / 3 BYE / sip:5550927@172.16.100.101
24. RTCPaudio Quality:ssrc=1371236592;themssrc=25523736;lp=0;rxjitter=0.01273
25. TxResp SIP/2.0 / 3 BYE - 200 OK
26. Hangup Cause Normal Clearing
By: David Brillert (aragon) 2008-05-30 15:23:33

Possibly related to bug 12603 12584 ? But this is fixed in 1.4.20 official release
Give 1.4.20 a try

------------------------------------------------------------------------
r116039 | russell | 2008-05-13 16:14:28 -0500 (Tue, 13 May 2008) | 32 lines

Merged revisions 116038 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4 [^]

........
r116038 | russell | 2008-05-13 16:17:23 -0500 (Tue, 13 May 2008) | 24 lines

Fix a deadlock involving channel autoservice and chan_local that was debugged
and fixed by mmichelson and me.

We observed a system that had a bunch of threads stuck in ast_autoservice_stop().
The reason these threads were waiting around is because this function waits to
ensure that the channel list in the autoservice thread gets rebuilt before the
stop() function returns. However, the autoservice thread was also locked, so
the autoservice channel list was never getting rebuilt.

The autoservice thread was stuck waiting for the channel lock on a local channel.
However, the local channel was locked by a thread that was stuck in the autoservice
stop function.

It turned out that the issue came down to the local_queue_frame() function in
chan_local. This function assumed that one of the channels passed in as an
argument was locked when called. However, that was not always the case. There
were multiple cases in which this channel was not locked when the function was
called. We fixed up chan_local to indicate to this function whether this channel
was locked or not. The previous assumption had caused local_queue_frame() to
improperly return with the channel locked, where it would then never get unlocked.

(closes issue 0012584)
(related to issue 0012603)
By: Olle Johansson (oej) 2008-07-01 08:06:27

This bug report needs some action. First, can someone please confirm that it still exists after Russell's change?
By: mvf (mvf) 2008-07-01 08:33:43

Hi Oej, I didn't try 1.4.20 because the patch submitted by neutrino88 applied to 1.4.18 totally solve the "ReliableXmit timeout" problem that I reported. Reading the Russell's patch description I can't tell if is actually related to this specific problem.
If you think that is related please let me know to give 1.4.20 a try in one of our servers and check if "ReliableXmit timeout" is still there.
Regards.

By: Emmanuel BUU (neutrino88) 2008-07-01 08:38:01

The first analysis of the bug was that Astersik SIP "stack" would not release the dialog at all if a first 1xx reply at has been received.

I have trouble to see how chan_local is involved here. I would say the from a logical point of view, the bug is still here. I'll try to fire a SIP simulator to confirm this.
By: Olle Johansson (oej) 2008-07-01 08:41:29

Looking forward to your replies before I spend time on this. Thanks.
By: mvf (mvf) 2008-07-24 16:35:58

I installed 1.4.21.2 version (without patch) and I can verify that the problem is still there. After 1 day I already have 14 channels hung. Please let me know if you need something else.
By: urzedo (urzedo) 2008-08-05 06:42:33

Hi all,

I also have this problem with 1.4.21.2. Has anyone successfully tested the suggested patch with that version?

Regards
By: mvf (mvf) 2008-08-08 15:35:51

I'm running 1.4.21.2 with neutrino88's patch and by now I have zero hung channels:

System uptime: 5 days, 5 hours, 54 minutes, 22 seconds
By: Tilghman Lesher (tilghman) 2008-09-11 17:59:56

I believe this should be fixed with the linked issue. Could any of you test the 1.4.22 release candidate and confirm?
By: urzedo (urzedo) 2008-09-11 18:53:24

Hi Corydon76,

At least for me, this issue is not always reproduceable. I will try the release candidate, but I will not be 100% sure the problem is solved.

What I can tell you is that after I patched my 1.4.21.2 system as suggested by neutrino88, the issue never happened again.

Regards!
By: Steve Murphy (murf) 2008-09-12 15:17:16

OK, no, I don't think this bug is actually closely related to 13235;
except that maybe 13235 involves a channel being infinitely rescheduled
for destruction and it appears that this is also the issue for this
bug. In 13235, the problem is overcome by using the __pretend_ack (sic)
routine in the BYE handler. But, it looks like neutron88's fix probably
would have short-circuited that problem as well.

But the site that experienced the problem in 13235 also, it appears, is seeing this problem as well, where a malfunctioning cisco box sending INVITE's to asterisk, starts cancelling them after it sends them out, and the channels get zombified. We caught one such conversation with sip history:

1. Rx INVITE / 101 INVITE /
sip:1707xxxxxxx@66.xx.xx.xx:5060
2. NewChan Channel SIP/67.xx.xx.xx-e540baa0 - from
91A3F741-7EAD11DD-9E
3. TxResp SIP/2.0 / 101 INVITE - 100 Trying
4. Rx CANCEL / 101 CANCEL /
sip:1707xxxxxxx@66.xx.xx.xx:5060
5. TxRespRel SIP/2.0 / 101 INVITE - 487 Request Terminated
6. TxResp SIP/2.0 / 101 CANCEL - 200 OK
7. Rx ACK / 101 ACK / sip:1707xxxxxxx@66.xx.xx.xx:5060

(in the above, the ips & numbers have been changed, of course.)
That site may see other zombie-channel problems occasionally, but
exactly how many problems are being bumped into, we won't know until
we can trace them out.

I'm going to apply neutron88's patch in the code that site is running,
and see if it calms things down there.
By: Steve Murphy (murf) 2008-09-12 16:02:45

Ok, I'm taking on this bug. I've inserted neutron88's patch into an asterisk
that is frequently suffering from the bloating-zombie-death.

If's it reaches a virtual mem size equilibrium, and holds that over the weekend into Monday or Tues, I'll commit the above patch, and close this bug.
By: Steve Murphy (murf) 2008-09-15 12:22:51

OK, the server under exam with this patch installed is in the act of blowing up again.

I have over 4000 channels; I have one channel with history:

* SIP Call
1. Rx INVITE / 101 INVITE / sip:1706xxxxxx@66.28.xxx.xxx:5060
2. NewChan Channel SIP/67.xxx.xxx.xxx-c7fe50c0 - from 6F1C306C-828111DD-8D
3. TxResp SIP/2.0 / 101 INVITE - 100 Trying
4. Rx CANCEL / 101 CANCEL / sip:1706xxxxxxx@66.28.xxx.xx:5060
5. TxRespRel SIP/2.0 / 101 INVITE - 487 Request Terminated
6. TxResp SIP/2.0 / 101 CANCEL - 200 OK
7. Rx ACK / 101 ACK / sip:1706xxxxxx@66.28.xxx.xxx:5060

By: Steve Murphy (murf) 2008-09-15 16:50:42

Aha!

We were able to determine the problem I recorded in the last note. It's not
a SIP driver problem at all. What's happening, is the dialplan is hanging
in a FUNC_ODBC call, and the app call never returns. So, the channel goes
into limbo, and eventually, the other end will send a CANCEL, to which the driver will respond appropriately, but the channel will not be destroyed. The pbx_start thread is still running, and thousands of these build up, hang onto memory, file handles, etc.

We are theorizing why the func_odbc call is hanging, but it's a connection
to sql server, and our guess is that the sql server is dropping the connection. Tilghman supplied us a patch; it's not involving this issue, so I'll make no more mention of it than this: if you see lots of hung channels, check to see if you also have hung threads. If you have both, it's probably not bug 12101, but this other problem.
By: Steve Murphy (murf) 2008-09-17 15:44:05

OK, guys, I've attached "hangfix.patch" for 1.4;
It's neutrino88's fix plus some code to provide
"sip show hangfix", which will show a count of the
number of times neutrino88's fix was triggered.

So, if *anyone* can verify that they've run this
patch, and have no buildup of dead channels,
and the "sip show hangfix" command yields a non-
zero number (the bigger, the better), then,
I'll commit this fix to 1.4 on up. Deal?
By: urzedo (urzedo) 2008-09-17 16:19:26

Hi murf,

I will give it a try. Please allow some days 'till my feedback.

Thanks for your effort!
By: Steve Murphy (murf) 2008-09-23 10:35:41

OK, I've determined that, on the site with the zombie-bloating death-hung channel problem, that this problem is NOT one of its problems. I had this patch on those servers for over a week now, and not once was hangfix greater than zero.

We also had an ODBC problem with cdr_odbc, which uses, in 1.4, its own connection and methods to pump CDR records to the db. If the connection gets dropped on the server side, the interface never recovered. As of 143864, this problem should be solved. We tested, and it's already long solved in trunk (and 1.6.x, I'm sure).

So, as a reminder: anybody out there seen hangfix greater than zero yet? Have the other fixes solved this problem, or do we need to commit it?
By: Thiago (thiagofernandes) 2008-09-24 07:51:23

Hi murf,

urzedo and me applied the patch, and in two weeks we have 292 channels destroyed, and the problem does not happen anymore.

Thanks
By: Digium Subversion (svnbot) 2008-09-25 11:02:25

Repository: asterisk
Revision: 144420

U branches/1.4/channels/chan_sip.c

------------------------------------------------------------------------
r144420 | murf | 2008-09-25 11:02:24 -0500 (Thu, 25 Sep 2008) | 25 lines

(closes issue ASTERISK-11539)
Reported by: MVF
Tested by: neutrino88, urzedo, murf, thiagofernandes

Many thanks to neutrino88 for this patch, which
solves a problem whereby channels get a CANCEL
request, respond to it properly, but end up
in a hung state, infinitely being rescheduled.
This fix is a bit crude, in that catches the
problem at a rather late phase, but it may
prevent infinite rescheduling problems that
might still arise.

It might have been better to find out why,
in the course of protocol handling, the channel
was not destroyed, but we leave that to
future generations.

Many thanks to urzedo and thiagofernandes for
their work in verifying that the patch code
indeed is being executing, and averting the
problem.

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=144420