Summary: | ASTERISK-11539: SIP channel hung due to CANCEL ReliableXmit (ReTX) | ||
Reporter: | mvf (mvf) | Labels: | |
Date Opened: | 2008-02-28 13:23:29.000-0600 | Date Closed: | 2009-12-08 19:03:33.000-0600 |
Priority: | Minor | Regression? | No |
Status: | Closed/Complete | Components: | Channels/chan_sip/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) 20080227_sip_channel_hung_at_cancel_ReTX.txt ( 1) 20080318_sip_debug_and_history.txt ( 2) hangfix.patch | |
Description: | I've detected a problem that make asterisk keep sip channels up forever even when the call has been disconnected, the problem keep UDP ports and memory taken making asterisk to drop new calls when the sip channels hung increase. Reviewing the "sip show history" of one of these hung channels I see that the call is cancel by calling party while is on ringing state, after that asterisk send CANCEL to the peer but it didn't get an answer (maybe because of network problems). After I get some bad responses from peer, maybe also due to network problems that asterisk reply with an ACK but the history shows a new ReliableXmit timeout message every 30 seconds. (originating peer) * SIP Call 1. TxReqRel INVITE / 102 INVITE - -UNKNOWN- 2. Rx SIP/2.0 / 102 INVITE / 100 Trying 3. Rx SIP/2.0 / 102 INVITE / 100 Trying 4. Rx SIP/2.0 / 102 INVITE / 180 Ringing 5. Rx SIP/2.0 / 102 INVITE / 180 Ringing 6. Cancel Cause Normal Clearing 7. SchedDestroy 32000 ms 8. TxReqRel CANCEL / 102 CANCEL - -UNKNOWN- 9. SchedDestroy 32000 ms 10. ReTx 1000 CANCEL sip:17898702397@20.20.20.25 SIP/2.0 11. ReTx 2000 CANCEL sip:17898702397@20.20.20.25 SIP/2.0 12. Rx SIP/2.0 / 102 CANCEL / 100 Trying 13. ReliableXmit timeout 14. Rx SIP/2.0 / 102 INVITE / 180 Ringing 15. Rx SIP/2.0 / 102 INVITE / 486 Busy Here 16. TxReq ACK / 102 ACK - -UNKNOWN- 17. ReliableXmit timeout 18. ReliableXmit timeout 19. ReliableXmit timeout 20. ReliableXmit timeout 21. ReliableXmit timeout 22. ReliableXmit timeout 23. ReliableXmit timeout 24. ReliableXmit timeout 25. ReliableXmit timeout 26. ReliableXmit timeout 27. ReliableXmit timeout 28. ReliableXmit timeout 29. ReliableXmit timeout 30. ReliableXmit timeout 31. ReliableXmit timeout 32. ReliableXmit timeout 33. ReliableXmit timeout 34. ReliableXmit timeout 35. ReliableXmit timeout 36. ReliableXmit timeout 37. ReliableXmit timeout 38. ReliableXmit timeout 39. ReliableXmit timeout 40. ReliableXmit timeout 41. ReliableXmit timeout 42. ReliableXmit timeout 43. ReliableXmit timeout 44. ReliableXmit timeout 45. ReliableXmit timeout 46. ReliableXmit timeout 47. ReliableXmit timeout 48. ReliableXmit timeout 49. ReliableXmit timeout 50. ReliableXmit timeout I got a capture for the sip transaction messages that generate these hung sip channels, please take a look at it in the attached file including two captures (20080227_sip_channel_hung_at_cancel_ReTX.txt), note that there is no sip messages sent after the last ACK, the "ReliableXmit timeout" messages appear in the history of the call filling the 50 slot buffer and even deleting the old history messages. The "sip show channels" command shows the channel hung forever. 20.20.20.25 1789870239 185d051b629 00102/00000 0x0 (nothing) No (d) Tx: ACK | ||
Comments: | By: Olle Johansson (oej) 2008-03-18 01:50:48 Seems like the CANCEL doesn't get received by the end point, which continues the call. That resets some counters in our sip stack, which should not happen. Wonder who sends the "100 trying" after the cancel, it indicates that someone received the CANCEL though. The instructions for the bug tracker clearly instructs you to take an Asterisk debug log with sip debugging and debug and verbose set to 4. We need to see what happens inside your Asterisk. A packet trace doesn't show that. Thank you. By: mvf (mvf) 2008-03-18 16:04:18 Sorry for the lack of correct debug, I'm attaching it now. Call ID 085357c64acd9560021e7b542852562d correspond to the outgoing leg of a call that generate a hung channel. The history for this channel is not in the debug log because the channel is active forever. This is the history of the channel obtained from CLI: B2BUA*CLI> sip show history 085357c64ac B2BUA*CLI> * SIP Call 1. NewChan Channel SIP/PEER_DID_01-0a25b420 - from 085357c64acd9560021e7b 2. TxReqRel INVITE / 102 INVITE - -UNKNOWN- 3. Rx SIP/2.0 / 102 INVITE / 100 Trying 4. Rx SIP/2.0 / 102 INVITE / 100 Trying 5. Rx SIP/2.0 / 102 INVITE / 180 Ringing 6. Rx SIP/2.0 / 102 INVITE / 180 Ringing 7. Cancel Cause Normal Clearing 8. SchedDestroy 32000 ms 9. TxReqRel CANCEL / 102 CANCEL - -UNKNOWN- 10. SchedDestroy 32000 ms 11. ReTx 1000 CANCEL sip:5625709200@20.88.6.25 SIP/2.0 12. ReTx 2000 CANCEL sip:5625709200@20.88.6.25 SIP/2.0 13. Rx SIP/2.0 / 102 CANCEL / 100 Trying 14. Rx SIP/2.0 / 102 INVITE / 180 Ringing 15. ReliableXmit timeout 16. Rx SIP/2.0 / 102 INVITE / 486 Busy Here 17. TxReq ACK / 102 ACK - -UNKNOWN- 18. ReliableXmit timeout 19. ReliableXmit timeout 20. ReliableXmit timeout 21. ReliableXmit timeout 22. ReliableXmit timeout 23. ReliableXmit timeout 24. ReliableXmit timeout 25. ReliableXmit timeout 26. ReliableXmit timeout 27. ReliableXmit timeout 28. ReliableXmit timeout 29. ReliableXmit timeout 30. ReliableXmit timeout 31. ReliableXmit timeout 32. ReliableXmit timeout 33. ReliableXmit timeout The incoming leg of the call (Call ID d7c7df47-96c6-a73a-fa14-0002a40217d0) is properly destroy. By: Joshua C. Colp (jcolp) 2008-03-19 11:10:44 oej: The CANCEL request is being retained because of the 100 Trying thus when the dialog destruction code is executed (because it was scheduled) it sees that packet in there and continues to schedule over... and over... and over... any thoughts on what we should do? By: Emmanuel BUU (neutrino88) 2008-03-19 12:09:49 I suggest that we define a global SIP parameter that is the maximum duration for outgoing transactions. If a transaction is not closed after this duration it should be: - automatically cancelled (if not done already) - destroyed and from the outside point of view, it would be like the transaction failed with an 5xx error. By: mvf (mvf) 2008-03-19 13:45:24 AFAIK * is keeping a channel in Trying status forever, is understable that the destroy would be delayed a few times because of the current state of the channel but this re scheduling can't go forever due to the UDP nature of SIP messages. I think that * should destroy channels that seems hung on Trying state, maybe a global timeout for Trying is a good solution to identify hung channels. By: Emmanuel BUU (neutrino88) 2008-03-28 18:03:28 I suggest the following patch in chan_sip.c Index: channels/chan_sip.c =================================================================== --- channels/chan_sip.c (révision 111858) +++ channels/chan_sip.c (copie de travail) @@ -2078,10 +2078,11 @@ } /* If there are packets still waiting for delivery, delay the destruction */ - if (p->packets) { + if (p->packets && !ast_test_flag(&p->flags[0], SIP_NEEDDESTROY)) { if (option_debug > 2) ast_log(LOG_DEBUG, "Re-scheduled destruction of SIP call %s\n", p->callid ? p->callid : "<unknown>"); append_history(p, "ReliableXmit", "timeout"); + ast_set_flag(&p->flags[0], SIP_NEEDDESTROY); return 10000; } Can MVK test this (as the condition is difficult to reproduce) ? By: Emmanuel BUU (neutrino88) 2008-03-28 18:08:51 Here is a refined patch: Index: channels/chan_sip.c =================================================================== --- channels/chan_sip.c (révision 111858) +++ channels/chan_sip.c (copie de travail) @@ -2078,10 +2078,12 @@ } /* If there are packets still waiting for delivery, delay the destruction */ - if (p->packets) { + if (p->packets && !ast_test_flag(&p->flags[0], SIP_NEEDDESTROY)) { if (option_debug > 2) ast_log(LOG_DEBUG, "Re-scheduled destruction of SIP call %s\n", p->callid ? p->callid : "<unknown>"); append_history(p, "ReliableXmit", "timeout"); + if (p->method == SIP_CANCEL || p->method == SIP_BYE) + ast_set_flag(&p->flags[0], SIP_NEEDDESTROY); return 10000; } By: Raj Jain (rjain) 2008-03-30 06:12:28 This is a non-INVITE client transaction timeout scenario. It is described in section 17.1.2.2 in RFC 3261 (state machine copied below). Basically, a timer called Timer F (64*T1; be default 32 secs) is what triggers the client transaction to come out of the Trying state if no final response is received. <pre> Timer F 64*T1 Section 17.1.2.2 non-INVITE transaction timeout timer |Request from TU |send request Timer E V send request +-----------+ +---------| |-------------------+ | | Trying | Timer F | +-------->| | or Transport Err.| +-----------+ inform TU | 200-699 | | | resp. to TU | |1xx | +---------------+ |resp. to TU | | | | | Timer E V Timer F | | send req +-----------+ or Transport Err. | | +---------| | inform TU | | | |Proceeding |------------------>| | +-------->| |-----+ | | +-----------+ |1xx | | | ^ |resp to TU | | 200-699 | +--------+ | | resp. to TU | | | | | | V | | +-----------+ | | | | | | | Completed | | | | | | | +-----------+ | | ^ | | | | | Timer K | +--------------+ | - | | | V | NOTE: +-----------+ | | | | transitions | Terminated|<------------------+ labeled with | | the event +-----------+ over the action to take </pre> By: mvf (mvf) 2008-03-31 10:37:26 neutrino88, thanks for your patch. I just installed it in one of my machines. I will be monitoring and will let you know the results. By: mvf (mvf) 2008-04-07 15:13:14 Ok, after one week under relative high load (the calls per second that result in about 40-50 simultaneuos calls) the patched server has zero hung channels :D Lets see what the admins say about it, thank you again. By: Patrick DT (dd1916) 2008-05-09 11:31:49 Hi Can we get an update on this bug, we seem to have the same issue here. Will the patch be added in the next official release ? Can you make the patch for the latest official release ? By: mvf (mvf) 2008-05-09 14:33:47 I can add that I have my asterisk running with this patch for 5 weeks and 4 days and there are no peers hung due to this cause. It seems that the case status is confirmed but there isn't an admin assigned yet. By: Patrick DT (dd1916) 2008-05-30 15:07:32 I tried the patch on our 1.4.17 but I still have several stuck channels here is the sip show channel * SIP Call Curr. trans. direction: Outgoing Call-ID: 94b138f0-a06000e-13c4-45026-11f95f-536b76fa-11f95f Owner channel ID: <none> Our Codec Capability: 4 Non-Codec Capability (DTMF): 1 Their Codec Capability: 4 Joint Codec Capability: 4 Format: 0x0 (nothing) MaxCallBR: 384 kbps Theoretical Address: 10.6.0.14:5060 Received Address: 10.6.0.14:5060 SIP Transfer mode: open NAT Support: No Audio IP: 172.16.100.101 (local) Our Tag: as32f92656 Their Tag: 94b16bc8-a06000e-13c4-45026-11f95f-723aba38-11f95f SIP User agent: ARRIS-TM402P release v.05.02.0X SN/0013113FC54F Username: 4185558698 Peername: 4185558698 Original uri: sip:4185558698@10.6.0.14:5060 Caller-ID: 4185558698 Need Destroy: 2 Last Message: Rx: BYE Promiscuous Redir: No Route: sip:4185558698@10.6.0.14:5060 DTMF Mode: rfc2833 SIP Options: replaces replace 100rel timer join here is the sip show history 1. Rx INVITE / 1 INVITE / sip:5550927@here.com:5060 2. AuthChal Auth challenge sent for - nc 0 3. TxRespRel SIP/2.0 / 1 INVITE - 407 Proxy Authentication Required 4. SchedDestroy 32000 ms 5. Rx ACK / 1 ACK / sip:5550927@here.com:5060 6. Rx INVITE / 2 INVITE / sip:5550927@royaume.com:5060 7. CancelDestroy 8. Invite New call: 94b138f0-a06000e-13c4-45026-11f95f-536b76fa-11f95f 9. AuthOK Auth challenge succesful for 4185558698 10. NewChan Channel SIP/4185558698-0194b850 - from 94b138f0-a06000e-13c4-45 11. TxResp SIP/2.0 / 2 INVITE - 100 Trying 12. TxResp SIP/2.0 / 2 INVITE - 180 Ringing 13. TxRespRel SIP/2.0 / 2 INVITE - 200 OK 14. Rx ACK / 2 ACK / sip:5550927@172.16.100.101 15. ReInv Re-invite sent 16. TxReqRel INVITE / 102 INVITE - -UNKNOWN- 17. Rx SIP/2.0 / 102 INVITE / 100 Trying 18. Rx SIP/2.0 / 102 INVITE / 200 OK 19. TxReq ACK / 102 ACK - -UNKNOWN- 20. ReInv Re-invite sent 21. TxReqRel INVITE / 103 INVITE - -UNKNOWN- 22. Rx SIP/2.0 / 103 INVITE / 100 Trying 23. Rx BYE / 3 BYE / sip:5550927@172.16.100.101 24. RTCPaudio Quality:ssrc=1371236592;themssrc=25523736;lp=0;rxjitter=0.01273 25. TxResp SIP/2.0 / 3 BYE - 200 OK 26. Hangup Cause Normal Clearing By: David Brillert (aragon) 2008-05-30 15:23:33 Possibly related to bug 12603 12584 ? But this is fixed in 1.4.20 official release Give 1.4.20 a try ------------------------------------------------------------------------ r116039 | russell | 2008-05-13 16:14:28 -0500 (Tue, 13 May 2008) | 32 lines Merged revisions 116038 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 [^] ........ r116038 | russell | 2008-05-13 16:17:23 -0500 (Tue, 13 May 2008) | 24 lines Fix a deadlock involving channel autoservice and chan_local that was debugged and fixed by mmichelson and me. We observed a system that had a bunch of threads stuck in ast_autoservice_stop(). The reason these threads were waiting around is because this function waits to ensure that the channel list in the autoservice thread gets rebuilt before the stop() function returns. However, the autoservice thread was also locked, so the autoservice channel list was never getting rebuilt. The autoservice thread was stuck waiting for the channel lock on a local channel. However, the local channel was locked by a thread that was stuck in the autoservice stop function. It turned out that the issue came down to the local_queue_frame() function in chan_local. This function assumed that one of the channels passed in as an argument was locked when called. However, that was not always the case. There were multiple cases in which this channel was not locked when the function was called. We fixed up chan_local to indicate to this function whether this channel was locked or not. The previous assumption had caused local_queue_frame() to improperly return with the channel locked, where it would then never get unlocked. (closes issue 0012584) (related to issue 0012603) By: Olle Johansson (oej) 2008-07-01 08:06:27 This bug report needs some action. First, can someone please confirm that it still exists after Russell's change? By: mvf (mvf) 2008-07-01 08:33:43 Hi Oej, I didn't try 1.4.20 because the patch submitted by neutrino88 applied to 1.4.18 totally solve the "ReliableXmit timeout" problem that I reported. Reading the Russell's patch description I can't tell if is actually related to this specific problem. If you think that is related please let me know to give 1.4.20 a try in one of our servers and check if "ReliableXmit timeout" is still there. Regards. By: Emmanuel BUU (neutrino88) 2008-07-01 08:38:01 The first analysis of the bug was that Astersik SIP "stack" would not release the dialog at all if a first 1xx reply at has been received. I have trouble to see how chan_local is involved here. I would say the from a logical point of view, the bug is still here. I'll try to fire a SIP simulator to confirm this. By: Olle Johansson (oej) 2008-07-01 08:41:29 Looking forward to your replies before I spend time on this. Thanks. By: mvf (mvf) 2008-07-24 16:35:58 I installed 1.4.21.2 version (without patch) and I can verify that the problem is still there. After 1 day I already have 14 channels hung. Please let me know if you need something else. By: urzedo (urzedo) 2008-08-05 06:42:33 Hi all, I also have this problem with 1.4.21.2. Has anyone successfully tested the suggested patch with that version? Regards By: mvf (mvf) 2008-08-08 15:35:51 I'm running 1.4.21.2 with neutrino88's patch and by now I have zero hung channels: System uptime: 5 days, 5 hours, 54 minutes, 22 seconds By: Tilghman Lesher (tilghman) 2008-09-11 17:59:56 I believe this should be fixed with the linked issue. Could any of you test the 1.4.22 release candidate and confirm? By: urzedo (urzedo) 2008-09-11 18:53:24 Hi Corydon76, At least for me, this issue is not always reproduceable. I will try the release candidate, but I will not be 100% sure the problem is solved. What I can tell you is that after I patched my 1.4.21.2 system as suggested by neutrino88, the issue never happened again. Regards! By: Steve Murphy (murf) 2008-09-12 15:17:16 OK, no, I don't think this bug is actually closely related to 13235; except that maybe 13235 involves a channel being infinitely rescheduled for destruction and it appears that this is also the issue for this bug. In 13235, the problem is overcome by using the __pretend_ack (sic) routine in the BYE handler. But, it looks like neutron88's fix probably would have short-circuited that problem as well. But the site that experienced the problem in 13235 also, it appears, is seeing this problem as well, where a malfunctioning cisco box sending INVITE's to asterisk, starts cancelling them after it sends them out, and the channels get zombified. We caught one such conversation with sip history: 1. Rx INVITE / 101 INVITE / sip:1707xxxxxxx@66.xx.xx.xx:5060 2. NewChan Channel SIP/67.xx.xx.xx-e540baa0 - from 91A3F741-7EAD11DD-9E 3. TxResp SIP/2.0 / 101 INVITE - 100 Trying 4. Rx CANCEL / 101 CANCEL / sip:1707xxxxxxx@66.xx.xx.xx:5060 5. TxRespRel SIP/2.0 / 101 INVITE - 487 Request Terminated 6. TxResp SIP/2.0 / 101 CANCEL - 200 OK 7. Rx ACK / 101 ACK / sip:1707xxxxxxx@66.xx.xx.xx:5060 (in the above, the ips & numbers have been changed, of course.) That site may see other zombie-channel problems occasionally, but exactly how many problems are being bumped into, we won't know until we can trace them out. I'm going to apply neutron88's patch in the code that site is running, and see if it calms things down there. By: Steve Murphy (murf) 2008-09-12 16:02:45 Ok, I'm taking on this bug. I've inserted neutron88's patch into an asterisk that is frequently suffering from the bloating-zombie-death. If's it reaches a virtual mem size equilibrium, and holds that over the weekend into Monday or Tues, I'll commit the above patch, and close this bug. By: Steve Murphy (murf) 2008-09-15 12:22:51 OK, the server under exam with this patch installed is in the act of blowing up again. I have over 4000 channels; I have one channel with history: * SIP Call 1. Rx INVITE / 101 INVITE / sip:1706xxxxxx@66.28.xxx.xxx:5060 2. NewChan Channel SIP/67.xxx.xxx.xxx-c7fe50c0 - from 6F1C306C-828111DD-8D 3. TxResp SIP/2.0 / 101 INVITE - 100 Trying 4. Rx CANCEL / 101 CANCEL / sip:1706xxxxxxx@66.28.xxx.xx:5060 5. TxRespRel SIP/2.0 / 101 INVITE - 487 Request Terminated 6. TxResp SIP/2.0 / 101 CANCEL - 200 OK 7. Rx ACK / 101 ACK / sip:1706xxxxxx@66.28.xxx.xxx:5060 By: Steve Murphy (murf) 2008-09-15 16:50:42 Aha! We were able to determine the problem I recorded in the last note. It's not a SIP driver problem at all. What's happening, is the dialplan is hanging in a FUNC_ODBC call, and the app call never returns. So, the channel goes into limbo, and eventually, the other end will send a CANCEL, to which the driver will respond appropriately, but the channel will not be destroyed. The pbx_start thread is still running, and thousands of these build up, hang onto memory, file handles, etc. We are theorizing why the func_odbc call is hanging, but it's a connection to sql server, and our guess is that the sql server is dropping the connection. Tilghman supplied us a patch; it's not involving this issue, so I'll make no more mention of it than this: if you see lots of hung channels, check to see if you also have hung threads. If you have both, it's probably not bug 12101, but this other problem. By: Steve Murphy (murf) 2008-09-17 15:44:05 OK, guys, I've attached "hangfix.patch" for 1.4; It's neutrino88's fix plus some code to provide "sip show hangfix", which will show a count of the number of times neutrino88's fix was triggered. So, if *anyone* can verify that they've run this patch, and have no buildup of dead channels, and the "sip show hangfix" command yields a non- zero number (the bigger, the better), then, I'll commit this fix to 1.4 on up. Deal? By: urzedo (urzedo) 2008-09-17 16:19:26 Hi murf, I will give it a try. Please allow some days 'till my feedback. Thanks for your effort! By: Steve Murphy (murf) 2008-09-23 10:35:41 OK, I've determined that, on the site with the zombie-bloating death-hung channel problem, that this problem is NOT one of its problems. I had this patch on those servers for over a week now, and not once was hangfix greater than zero. We also had an ODBC problem with cdr_odbc, which uses, in 1.4, its own connection and methods to pump CDR records to the db. If the connection gets dropped on the server side, the interface never recovered. As of 143864, this problem should be solved. We tested, and it's already long solved in trunk (and 1.6.x, I'm sure). So, as a reminder: anybody out there seen hangfix greater than zero yet? Have the other fixes solved this problem, or do we need to commit it? By: Thiago (thiagofernandes) 2008-09-24 07:51:23 Hi murf, urzedo and me applied the patch, and in two weeks we have 292 channels destroyed, and the problem does not happen anymore. Thanks By: Digium Subversion (svnbot) 2008-09-25 11:02:25 Repository: asterisk Revision: 144420 U branches/1.4/channels/chan_sip.c ------------------------------------------------------------------------ r144420 | murf | 2008-09-25 11:02:24 -0500 (Thu, 25 Sep 2008) | 25 lines (closes issue ASTERISK-11539) Reported by: MVF Tested by: neutrino88, urzedo, murf, thiagofernandes Many thanks to neutrino88 for this patch, which solves a problem whereby channels get a CANCEL request, respond to it properly, but end up in a hung state, infinitely being rescheduled. This fix is a bit crude, in that catches the problem at a rather late phase, but it may prevent infinite rescheduling problems that might still arise. It might have been better to find out why, in the course of protocol handling, the channel was not destroyed, but we leave that to future generations. Many thanks to urzedo and thiagofernandes for their work in verifying that the patch code indeed is being executing, and averting the problem. ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=144420 |