Summary: | ASTERISK-19992: SIP re-INVITEs have no transaction timeout | ||||||
Reporter: | Steve Davies (one47) | Labels: | |||||
Date Opened: | 2012-06-13 10:51:42 | Date Closed: | 2012-07-03 05:56:20 | ||||
Priority: | Major | Regression? | No | ||||
Status: | Closed/Complete | Components: | Channels/chan_sip/General | ||||
Versions: | 1.8.11.1 | Frequency of Occurrence | Occasional | ||||
Related Issues: |
| ||||||
Environment: | SIP with a mis-behaving UA | Attachments: | ( 0) reinvite_mk2_optional.diff ( 1) reinvite_mk2.diff ( 2) reinvite_tweak.diff ( 3) trace.txt ( 4) uac.xml ( 5) uas.xml | ||||
Description: | I believe that this issue exists through all Asterisk versions as it appears to be an omission in the SIP RFCs (I am happy to be corrected if that is not true). In a "normal" INVITE, we have a global timeout that is specified by the Dial command's ring duration, such that if all else fails, app_dial or equivalent will stop the channel from existing beyond its natural lifespan. With a Re-INVITE, this is not the case, and I have a scenario where Asterisk leaks RTP ports: - Inbound call (any channel tech) to SIP endpoint A. - Original call on-hold by SIP A - SIP A calls SIP B, and direct-media call is set up. - SIP A 'REFER's the call to bridge the inbound call to SIP B. - In response to the REFER, Asterisk Re-INVITEs the SIP A and SIP B RTP audio back to Asterisk - SIP A has a bug that means it responds "100 Trying" to the Re-INVITE, but then nothing more. The 100 Trying clears SIP Timer B, and no other SIP events will occur to progress the Re-INVITE. At this point, Asterisk has a channel that will never timeout, and SIP A considers the channel finished with. This results in us leaking any RTP ports opened for that channel as they are never cleaned up. This leaves a potential DoS where the original SIP A RTP ports are leaked, and over a period of time can prevent Asterisk and/or the host from working. | ||||||
Comments: | By: Steve Davies (one47) 2012-06-13 10:55:42.882-0500 Thoughts: If the SIP A channel sends a 1xx or 18x response to a Re-INVITE packet (A media update, COLP update INVITE, or any other non-initial INVITE), then perhaps we should reset the PVT destroy timer rather than clearing it in handle_response_invite() ? A non-initial INVITE ought to be completed in sub-1 second, so I think allowing the default Timer-B 32 seconds would be plenty? Could this change be applied to any SIP channel that is in AST_STATE_UP ? Or is that not a safe way to identify a Re-INVITE rather than an initial INVITE? By: David Woolley (davidw) 2012-06-14 05:31:43.774-0500 We have found that Philips IS3000 can take several seconds to respond to some re-invites. By: Terry Wilson (twilson) 2012-06-21 17:43:12.403-0500 Can't set the private destroy timer because a failed reinvite, per the RFC, should not result in termination of the call. EDIT: For reference it is in 4 Overview of Operaion. Search for re-invite. By: David Woolley (davidw) 2012-06-22 04:32:50.177-0500 I think failing to send a response to a re-invite constitutes a protocol violation, not a failure of the re-invite. I would consider an explicit rejection to be a failure. By: Terry Wilson (twilson) 2012-06-27 16:41:11.058-0500 To my knowledge, it is perfectly allowable to have an arbitrarily long amount of time between an INVITE (re-INVITE or otherwise) and a final response. I think the biggest issue (though we need to handle this situation due to possible attack) is that endpoints shouldn't even bother sending a provisional response on a re-INVITE. It isn't like there is user interaction required. I have applied a patch that cleans up after a failed re-invite. It does not tear down the call, as I still believe that that would be an error. By: Steve Davies (one47) 2012-06-28 08:44:53.020-0500 I just grabbed r369436 from svn - Does this change risk making Asterisk do the wrong thing more often than not? If during a REFER, Asterisk does a Re-INVITE to grab the media stream, and assuming it is also going to send a BYE on that channel once the media is restored, then with this patch, Asterisk is not going to wait for the Re-INVITE to complete before sending the BYE, which I would assume is "illegal". What if that new behaviour causes Asterisk to break the UA? I plan to set this test up and if I can prove or deny this theory, will feed it back, but sending a BYE mid INVITE feels wrong. By: Mark Michelson (mmichelson) 2012-06-28 11:01:21.076-0500 I'm having a difficult time parsing the situation you're describing. "If during a REFER, Asterisk does a Re-INVITE to grab the media stream" Does "If during a REFER" mean a transfer? During a transfer, Asterisk does not typically send any BYEs. When a phone sends a REFER in order to transfer a call, an implicit subscription is started, and Asterisk will send NOTIFYs to the transferer to indicate the progress of the transfer. Typically, once the transferer has received a NOTIFY indicating that the call to the transfer target is progressing or that the call to the transfer target has been answered, then the transferer will send a BYE to Asterisk. Asterisk will only send a BYE to the transferer if, within a certain period of time (32 seconds by default), the transferer does not send a BYE to Asterisk. So basically, your concern should not happen because Asterisk won't send a BYE until 32 seconds have passed without Asterisk receiving a BYE. If the reinvite transaction has not completed in that amount of time, I feel it is safe to send a BYE to the phone without worry of it being premature. With regards to your concerns of sending a BYE before a reinvite completes as being illegal, it's hard to say for sure because RFC 3261 doesn't always do a good job of clarifying whether certain rules apply only to dialog-creating INVITEs or to mid-dialog INVITEs as well. This sip-implementors mailing list post (thanks Matt Jordan for pointing me to it) argues that sending a BYE while a reinvite transaction is pending should be acceptable: https://lists.cs.columbia.edu/pipermail/sip-implementors/2007-March/015994.html By: Terry Wilson (twilson) 2012-06-28 11:11:27.885-0500 chan_sip doesn't really have a transaction layer, so I doubt any solution will be perfect. I would be interested in seeing if you can come up with some way to break things with this patch. We have to send a BYE at some point because we have a dialog established with an INVITE. Although there is this: {noformat} If a UA receives a non-2xx final response to a re-INVITE, the session parameters MUST remain unchanged, as if no re-INVITE had been issued. Note that, as stated in Section 12.2.1.2, if the non-2xx final response is a 481 (Call/Transaction Does Not Exist), or a 408 (Request Timeout), or no response at all is received for the re- INVITE (that is, a timeout is returned by the INVITE client transaction), the UAC will terminate the dialog. {noformat} which makes it sound like it might be Ok to just hang up the call on a timeout. Of course there *is* no timeout specified for an INVITE that has gotten a provisional response. And since we have sent a re-INVITE, and only gotten a provisional response (which could come from a proxy and not the UA) the endpoint is still very possibly going to be expecting a BYE. In fact, in 15 Terminating a session: {noformat} The notion of "hanging up" is not well defined within SIP. It is specific to a particular, albeit common, user interface. Typically, when the user hangs up, it indicates a desire to terminate the attempt to establish a session, and to terminate any sessions already created. For the caller's UA, this would imply a CANCEL request if the initial INVITE has not generated a final response, and a BYE to all confirmed dialogs after a final response. {noformat} Since we already have a confirmed dialog, a BYE seems appropriate. By: Steve Davies (one47) 2012-06-29 09:07:36.898-0500 bq. Does "If during a REFER" mean a transfer? Yes, but specifically an attended transfer initiated by a SIP REFER bq. During a transfer, Asterisk does not typically send any BYEs Half right. With an attended transfer, Asterisk waits for a BYE on the REFERing channel, but only Asterisk can safely send BYE on the "target" channel, as the UA does not want to destroy the channel before the transfer/masquerade is complete. On this 2nd channel, Asterisk will (with the r369436 patch applied) send a re-INVITE, and then an immediate BYE. I have traced this on a couple of handsets, and it is messy, causing one UA to send 487 messages to Asterisk until it times out! Calls do appear to be cleared down correctly, but it now makes Asterisk look like it is the "bad actor" IMHO. I agree that the RFCs are not clear, so I would prefer to be cautious, as Asterisk always has been. bq. We have to send a BYE at some point because we have a dialog established with an INVITE I 100% agree, but as per above would prefer to give the UA a chance to close-off the outstanding re-INVITE. Regarding r369436, the "p->ongoing_reinvite = 0" code (below) is probably never called: {noformat} /* Final response, clear out pending invite */ if ((resp == 200 || resp >= 300) && p->pendinginvite && seqno == p->pendinginvite) { p->pendinginvite = 0; if (reinvite) { p->ongoing_reinvite = 0; } } {noformat} this is because I believe that in the scenarios we are describing, either the REFER call is flagged for a delayed-bye, so "p->owner" can be null so "reinvite" is false, or __sip_ack() can be called by handle_response(), so p->pendinginvite is already 0 before this code is reached. I fully expect you to be able to punch holes through them, but I am attaching 2 patches. One to update your work with an alternative approach, and a 2nd optional piece which I think might be useful but might not. By: Terry Wilson (twilson) 2012-07-02 12:59:58.052-0500 Steve, when I tried your patch I seemed to get some spurious retransmissions of some INVITEs. Can I get you to look at/try reinvite_take_2.diff? It is basically the original patch, but instead of treating a call with an ongoing reinvite as one that doesn't have one, it instead uses the PENDING_BYE stuff and schedules a call to check_pendings after 4 seconds to process the BYE later. I picked 4 seconds as a time since if 1) one side of the call is already hung up and 2) it has been 4 seconds and we still haven't gotten a re-invite it seems like that should be plenty of time to delay. We could always bump it up or make it configurable if we had to. By: Terry Wilson (twilson) 2012-07-02 13:18:54.156-0500 Actually, the reinvite_timeout shouldn't be cancelled on a final response because it is only scheduled from sip_hangup. If it is scheduled, it *has* to run to tear down the call properly. By: Terry Wilson (twilson) 2012-07-02 17:22:19.369-0500 Steve, I changed things around once again. Could you just check out the patch on reviewboard? https://reviewboard.asterisk.org/r/2009/ We need to get a security release out, so if you get a chance please take a look. The most recent change addresses the issue you warned me about and I just completely missed (the ongoing_reinvite=0 not getting called) and cleans up immediately after getting a final response to a reinvite after we've already called sip_hangup so we don't have to rely on the timeout occurring. By: Steve Davies (one47) 2012-07-03 04:37:14.163-0500 I will test this properly today - Sorry about the delay, being on UK time does not help me :) I had one point about the clearing of ongoing_reinvite, but looking at the reviewboard patch, I think you've addressed that already. I want to follow the teardown process through to my satisfaction for all of the different cases. Broken UA and working UA. By: Steve Davies (one47) 2012-07-03 05:53:23.674-0500 This looks good to me... My personal preference (that is all it is) would be to re-use sip_scheddestroy(), rather than creating the new reinvite_timeout() and reinviteid couplet. Call it paranoia, but my reasoning is that long-lasting chan_sip code ensures that any left-over autokillid references are cleaned up, and I would not want to leak a scheduler entry... Let me see if I can invent a scenario: - In sip_hangup(), line 6432, autokillid is cleared and then reinviteid is set. - check_pendings() is called by an interim response (183 perhaps?) from the UA. This sets autokillid. - UA stops sending after this, so we have 2 timers scheduled Neither sched stops the other. pvt ref counts will prevent this causing a crash-and-burn, so I guess even this paranoid case is safe. I still plan to backport this for our older 1.6 systems, and will test it again there - I assume 1.6.2 is sufficiently dead that this will not be done by Digium? By: Terry Wilson (twilson) 2012-07-03 08:46:35.797-0500 Correct. As per https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions 1.6.2 is EOL since April. By: Steve Davies (one47) 2012-07-03 09:56:36.843-0500 No problem on 1.6.2 - That is as expected, but I just spotted a double-BYE on the latest version of the patch for 1.8. It should be fixed by removing check_pendings(p) from chan_sip.c line:30895 (see block below) {noformat} if ((resp >= 200 && reinvite)) { p->ongoing_reinvite = 0; if (p->reinviteid > -1) { AST_SCHED_DEL_UNREF(sched, p->reinviteid, dialog_unref(p, "unref dialog for reinvite timeout because of a final response")); /* Since we got a final response to the reinvite, but were relying on the reinvite_timeout * function to clean up after the reinvite, we need to make sure and call check_pendings */ check_pendings(p); } } {noformat} The reason for this... If we receive a "200 OK" for the reINVITE, then SIP_PENDINGBYE will have been set, and when we call check_pendings(p) as above a BYE is sent and the flag cleared. The execution then continues, to line 20367, where SIP_PENDINGBYE is set, then on line 20406, check_pendings(p) is called again, causing a duplicate BYE. Sorry I missed that earlier. By: Terry Wilson (twilson) 2012-07-03 10:10:21.804-0500 Hmm, can't just remove the check_pendings() or the call won't get torn down if we don't get a response. I didn't see a duplicate BYE in my testing, so I'll re-check. By: Terry Wilson (twilson) 2012-07-03 10:12:07.874-0500 It looks like check_pendings should clear the PENDING_BYE so we shouldn't get two BYEs? By: Steve Davies (one47) 2012-07-03 10:16:06.716-0500 Let me re-check my line numbers. I think I might have them a bit mixed up in that previous comment. The check_pending that I was suggesting that we delete is the new one in the codepath where the reINVITE is being naturally cleared down. By: Terry Wilson (twilson) 2012-07-03 10:21:47.328-0500 Did you actually see an extra BYE in a capture, or did you deduce that that was what was going to happen from looking at the code? By: Steve Davies (one47) 2012-07-03 10:22:37.047-0500 Sorry, there was indeed a typo in my previous post. I meant to remove the call to check_pendings() on line 20126 - This is a call that is added as part of this patch, and it is within handle_response_invite(). handle_response_invite() will call check_pendings() already, after it sends the final ACK, so the sequence of calls I get is: chan_sip.c(20126): check_pendings() <-- Send BYE, clear flag chan_sip.c(20367): If we have no owner, set SIP_PENDINGBYE chan_sip.c(20406): check_pendings() <-- Send BYE, clear flag The test on line 20367 is "true" in my test case of a SIP REFER attended transfer, so I get the duplicate BYE. By: Steve Davies (one47) 2012-07-03 10:25:01.577-0500 I saw the double-BYE on a "sip debug ip x.x.x.x" By: Terry Wilson (twilson) 2012-07-03 10:49:16.243-0500 If I remove the new check_pendings(), but leave the rest of the patch alone I don't even get a BYE at all with a normal directmedia+atxfer call. Does removing the line actually work for you when you test it? By: Terry Wilson (twilson) 2012-07-03 10:50:38.071-0500 My situation SIP/1003 calls SIP/1001. SIP/1001 atxfers SIP/1003 to SIP/1002. Later, SIP/1003 hangs up. By: Steve Davies (one47) 2012-07-03 10:58:03.297-0500 Perhaps my patch is not clean. I'll checkout 1.8 branch and re-patch it. By: Terry Wilson (twilson) 2012-07-03 11:29:28.369-0500 In fact, even without either patch applied, waiting a minute after hanging up an attended transfer w/ directmedia and running sip show channels shows a dialog that never gets destroyed: {noformat} *CLI> sip show channels Peer User/ANR Call ID Format Hold Last Message Expiry Peer 172.16.61.1 1001 3ab934f2309e359 0x0 (nothing) Yes Tx: ACK 1001 {noformat} It looks like Eyebeam, which I'm using as my tranferer, ends up sending back a 488 to the re-INVITE that Asterisk sends right before hang up. So my testing platform may be weird. With the patch, things get torn down but Eyebeam still acts a little weird. I'll try doing the transfer with something else. By: Steve Davies (one47) 2012-07-03 11:54:20.025-0500 I have experienced Eyebeam weirdness previously. It does attended transfers "backwards" for want of a better description. Attached is a SIP DEBUG output using a freshly patched/rebuilt tree (the version number in the packets is faked, this IS branch/1.8) By: Terry Wilson (twilson) 2012-07-03 11:54:43.205-0500 Ok, switching from Eyebeam seems to give me the results that you show. Eyebeam is just behaving badly on the re-INVITE in general which was totally screwing up my results. Removing the check_pendings() does seem to be the appropriate action. I shall do so. Thanks! By: Terry Wilson (twilson) 2012-07-03 12:07:58.141-0500 I have committed yet again, removing the check_pendings(). Please, oh please let this be the last issue with this issue. :) By: Terry Wilson (twilson) 2012-07-03 12:35:05.109-0500 I would also like to point out how humorous it is that the only thing standing between me and taking off early for the US Independence Day holiday is a person from the UK. :) By: Steve Davies (one47) 2012-07-04 10:43:54.741-0500 Guess what! See the attached patch. I have included explanations in the patch for these tweaks. On 3rd July, I was concentrating on the working UA, and we managed to leave the actual fix for a broken UA behind as a result. I have tested this latest patch with both a broken and working UA... so far so good! By: Terry Wilson (twilson) 2012-07-05 08:32:15.348-0500 It currently works for the broken case that my tests cover. I have sipp scenarios which I will attach showing what I am testing. Basically a UAC that does INVITE, get 180/200, send ACK, receive reINVITE, send 100, pause send BYE recv 200. On the UAS send 180/200, recv ACK, recv reINVITE, send 100. After 32-seconds one side is torn down. After another few seconds the other side is torn down. What is the case that doesn't work for you? By: Terry Wilson (twilson) 2012-07-05 09:27:33.876-0500 I've also tried where the UAS sends the BYE after a pause. Both dialogs still get torn down in my test. One, a few seconds after the other. By: Steve Davies (one47) 2012-07-05 10:09:17.396-0500 The currently checked-in patch had 2 issues in my testing. I hope the following makes sense. Basically I added about 20 extra warning messages into the timer code to see what was happening. 1) For the Okay UA case where no 1xx packet sent (INVITE/OK/ACK): This works 100% fine. My trace shows: {noformat} > reINVITE < 200 OK (invite) > ACK > BYE < 200 OK (bye) {noformat} 2) For the Okay UA case, 1xx packet sent (INVITE/Trying/OK/ACK): If a "100 Trying" is sent in response to the reINVITE, then check_pendings() is called when it arrives, and because we are an "ongoing_reinvite" we allow the BYE to be sent mid-transaction. This BYE is sent even though the UA is about to send a "200 OK". This "early BYE" is one of the symptoms we are trying to avoid, or I was anyway :) This early-BYE does upset our UA, which does send a "100 Trying". The reINVITE 200 OK and ACK do happen, so the pvt is cleared down fine. The trace I got was: {noformat} > reINVITE < 100 Trying (invite) > BYE <--- too soon, UA gets confused. < 200 OK (invite) > ACK < 487 Cancel (invite) <--- Perhaps UA sees "BYE" instead of "ACK" ? < 487 Cancel (invite) < 487 Cancel (invite) < 487 Cancel (invite) etc. {noformat} 3) For the Broken UA case (INVITE/Trying... Dead air): It starts as per 2) above,... Asterisk receives the "100 Trying", and immediately sends "BYE" but because the 200 OK and ACK never happen, when __sip_autodestroy() is eventually called for final cleanup pvt->packets is non-empty, and the last method is stuck on INVITE, so __sip_autodestroy just retries forever. At least that is what happened here. {noformat} > reINVITE < 100 Trying (invite) > BYE <---- THIS is the difference! < 200 OK (bye) {noformat} "sip show channels" showed a stuck channel still. Debug showed __sip_autodestroy being called every 10 seconds on this pvt. NOTE: I just added the "THIS is the difference!" comment. If the UA still responds to the BYE, then the pvt lastmsg is changed from BYE back to the outstanding INVITE, and __sip_autodestroy gets stuck. With my PROPOSED patch I get: 1) Same result. {noformat} > reINVITE < 200 OK (invite) > ACK > BYE < 200 OK (bye) {noformat} 2) BYE waits until after reINVITE {noformat} > reINVITE < 100 Trying (invite) < 200 OK (invite) <--- Clears reinvite timer > ACK > BYE <--- Happens here naturally. < 200 OK (bye) {noformat} 3) reINVITE timer expires, BYE is sent, all cleans up. {noformat} > reINVITE < 100 Trying (invite) (pause for new reinvite timer here) > BYE < 200 OK (bye) (pause for __sip_autodestroy here) {noformat} "sip show channels" shows clean. As always I fully expect you to be able to find a flaw in all this ;) By: Terry Wilson (twilson) 2012-07-05 10:23:01.405-0500 I'll try to find time sometime to look at this, but mmichelson may have to take. I'm not really working today or tomorrow and am about to hop on a motorcycle and ride about 250 miles to get home. So this is still a re-INVITE in the context of an attended transfer that you are seeing issues? By: Steve Davies (one47) 2012-07-05 10:39:42.668-0500 Yes, the context is the same as originally, so there are 2 legs. 1st leg handles the "REFER/Accept/NOTIFY/BYE/OK" normally, then the second leg gets the reINVITE sequence above, and proceeds well or badly depending on nastyness of the UA. FYI for the purposes of reproducing the issue, I am testing this with: Bad UA = Aastra 55i, 2.6.0 firmware Good UA = Aastra 55i, 3.2.2 firmware So I am not just making this stuff up :) By: Mark Michelson (mmichelson) 2012-07-05 11:11:34.122-0500 I think the reason why Terry wasn't seeing the bad behavior is that the SIPp scenarios he is using are not testing transfers so much as the general reinvite case. In Steve's case, the SIP_PENDINGBYE flag is being set and in Terry's, it's not. This means that when the 100 Trying is received, Steve sees Asterisk behave badly by sending a BYE out immediately whereas Terry does not. I'm going to have a look at Steve's patch in more detail to see if it's the right way to move forward. I suspect it is though. By: Mark Michelson (mmichelson) 2012-07-05 11:38:45.370-0500 I think Steve's latest patch has it right. If the reinvite times out with no pending BYE, things just move on. If the reinvite times out with a BYE pending, though, then Asterisk will send a BYE. |