Summary: ASTERISK-13728: [patch] Asterisk should transform SIP 503 code to SIP 500
Reporter: Iñaki Baz Castillo (ibc)
Labels:
Date Opened: 2009-03-11 11:08:38
Date Closed: 2009-08-26 12:21:00
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Channels/chan_sip/Interoperability
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 14644_v2.patch
( 1) 14644_v3.patch
( 2) 14644.patch

Description:

Hi, in the following simple dialplan:

 exten => _X.,1,Dial(SIP/trunk1/${EXTEN})
 exten => _X.,n,Hangup

In case the trunk1 replies "SIP/2.0 503 Service Unavailable" Asterisk uses the same SIP code to reply upstream. Asterisk shouldn't do it and MUST convert that 503 into 500.

503 means that a client receiving it should try the same request against an alternate server (obtained via DNS SRV and so on).

This is "clearly" defined in RFC 3261:

-----------------------
21.5.4 503 Service Unavailable
  [...]
  A client (proxy or UAC) receiving a 503 (Service Unavailable) SHOULD
  attempt to forward the request to an alternate server.  It SHOULD NOT
  forward any other requests to that server for the duration specified
  in the Retry-After header field, if present.
-----------------------

Since Asterisk keeps the 503 and relays it to the client, it breaks the SIP failover mechanism: it forces the client to contact an alternate server when that is not needed at all (Asterisk is still alive and working).

The correct behaviour is easy: When Asterisk receives a 503 from leg_B it must convert it to 500 in leg_A.
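
The proposed rule can be sketched in C as follows. This is only an illustration of the behaviour requested in this report, not code from chan_sip: the function name `upstream_status` is hypothetical, and the sketch assumes every other final response is relayed unchanged.

```c
/* Hypothetical helper (not part of chan_sip): pick the status code to
 * send on leg A given the final response received on leg B. */
int upstream_status(int leg_b_status)
{
    /* A 503 from downstream describes the downstream server, not this
     * one, so report a generic server error to the caller instead. */
    if (leg_b_status == 503) {
        return 500;
    }
    /* Assumption of this sketch: all other codes pass through as-is. */
    return leg_b_status;
}
```

With this rule a caller that uses DNS SRV failover would not abandon a working Asterisk just because one downstream trunk is unavailable.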

****** ADDITIONAL INFORMATION ******

Also, Asterisk replies 503 when a Dial fails and there is no more dialplan (not even a "Hangup"). That should also be avoided because of what a 503 means.
But this should be a new bug report.
Comments:

By: Mark Michelson (mmichelson) 2009-03-11 16:26:11

Asterisk doesn't exactly pass the 503 along to the other side. It appears there are several response codes which Asterisk translates into an AST_CONTROL_CONGESTION frame and then queues onto the channel. When the other side reads this AST_CONTROL_CONGESTION frame, if it is a SIP channel, it will send a 503.

I'm pretty sure that the interpretation of "congestion" frames into a 503 response is also the cause of the issue you mentioned in the additional information section.

I wonder if the proper fix would be to change the interpretation of congestion frames in chan_sip into 500 instead of 503...

By: Mark Michelson (mmichelson) 2009-03-11 16:47:53

I've uploaded 14644.patch, which does as I stated in my first note.

I don't really have confidence that this is the most correct route to take; however, for your bug report here, it should do exactly what you want.

By: Iñaki Baz Castillo (ibc) 2009-03-12 04:01:20

@mmichelson: Thanks a lot for the patch, I'll try it ASAP (however, I don't use 1.6, so if it doesn't work in 1.4.23 I'll have to adapt it).

About the confidence in this behaviour, I can assure you that it's the correct one; it is also implemented in some proxies such as OpenSer/Kamailio/OpenSips (a 503 from downstream is converted to 500 before routing upstream).
The point here is that a *working* PBX/proxy shouldn't reply 503 to the caller, since 503 forces the caller to try an alternate server (obtained previously via DNS SRV).

About the AST_CONTROL_CONGESTION reaction on Asterisk, I agree with you: 500 should be replied instead of 503. At least 500 would be a better response. I will open a new bug for it.

Thanks a lot.

By: Iñaki Baz Castillo (ibc) 2009-03-12 05:52:41

@mmichelson: I'm testing the patch. It seems to work partially:


case 1)

 exten => _XXX.,1,Dial(SIP/some_server/${EXTEN})

"some_server" returns 503 and with your patch Asterisk replies 500 to the caller (OK).


case 2)

 exten => _XXX.,1,Dial(SIP/some_server/${EXTEN})
 exten => _XXX.,n,Hangup

"some_server" returns 503 and with your patch Asterisk replies 503 to the caller (Wrong).


I've noticed many cases in chan_sip.c that reply 503. I think it would be really safe to replace all of them with 500. Asterisk should *never* reply 503 except when Asterisk *does* know it is not working properly, and AFAIK there is no way for Asterisk to know that.


Regards.

By: Iñaki Baz Castillo (ibc) 2009-03-12 06:19:38

I've opened a new bug, ASTERISK-13736, to discuss generic AST_CONTROL_CONGESTION and "SIP 503" replies in chan_sip.

By: Mark Michelson (mmichelson) 2009-03-12 14:18:34

Hmm, I see several cases of Asterisk responding with a 503, but the two cases you presented should have both resulted in Asterisk responding with a 500. There's obviously something subtle that I am missing while reading the code.

In both cases, the Dial application should have hung up the incoming channel when the 503 was received. With the patch I supplied, this should have resulted in Asterisk sending a 500 to the calling side. As I said, there must be something subtle I'm missing in the code. I'll set up a SIPp server to respond to my INVITEs with a 503 so I can test this. Thanks!

By: Mark Michelson (mmichelson) 2009-03-13 10:49:01

I see what the issue was now. New patch uploaded. I tested in the two scenarios you presented before. Let me know how the new patch works for you.

By: Amilcar S Silvestre (amilcar) 2009-03-14 09:13:03

I have the same issue here, but with other consequences.

If I make a call from a softphone (X-Lite or Eyebeam, in my case) to a number that is going to use a DAHDI PRI link, for example, most failure causes are translated into AST_CONTROL_CONGESTION. Look at the code below:

 case PRI_CAUSE_CALL_REJECTED:
 case PRI_CAUSE_NETWORK_OUT_OF_ORDER:
 case PRI_CAUSE_NORMAL_CIRCUIT_CONGESTION:
 case PRI_CAUSE_SWITCH_CONGESTION:
 case PRI_CAUSE_DESTINATION_OUT_OF_ORDER:
 case PRI_CAUSE_NORMAL_TEMPORARY_FAILURE:
     pri->pvts[chanpos]->subs[SUB_REAL].needcongestion = 1;
     break;

And, as it is now, if chan_dahdi returns AST_CONTROL_CONGESTION, chan_sip will then return "503 Service unavailable" to the softphone. That's an absolutely wrong answer for any of these PRI causes.

Well, after this answer (503), the softphone can't dial anywhere for the next 60 seconds (according to the RFC: "It SHOULD NOT forward any other requests to that server for the duration specified in the Retry-After header field, if present."). The Retry-After header is not present, and X-Lite seems to have a default value of 60 seconds.

Conclusion: the phone will be stuck and can't dial anywhere else (other good extensions, for example) after any PRI cause that generates an AST_CONTROL_CONGESTION in reply.

Just changing 503 to 500 doesn't solve the problem. My solution is exactly the same patch that mmichelson has uploaded (v2), but instead of using "500 Service unavailable" I use "480 Temporarily unavailable", and everything works fine (I can make any calls after those PRI causes). After all, the PRI cause that chan_dahdi received is about THAT particular call, not every other call, nor is it a problem in Asterisk. And I think it fixes the reporter's case too.



By: Amilcar S Silvestre (amilcar) 2009-03-14 09:31:26

About why 480 instead of 503 or 500:

Both 500 (Server Internal Error) and 503 (Service Unavailable) relate to SERVER status, not destination status. Returning either code for the PRI causes above is incorrect.

DESTINATION status, as in this particular case, is handled with 404 (Not Found), 413 (Request Entity Too Large), 480 (Temporarily Unavailable), 486 (Busy Here), 600 (Busy Everywhere), or 603 (Decline).

By: Iñaki Baz Castillo (ibc) 2009-03-16 09:18:42

@amilcar: Your solution (using 480) is only valid for your case, and it's not a correct approach since it *breaks* SIP DNS failover.
If Asterisk has to dial via PRI, "Dial" is the last application, and all the PRI channels are in use, then Asterisk MUST reply 503 so the UA tries another server (if one is present in the SRV resolution). That's the correct behaviour. Yours is like a hack for your specific case (no DNS SRV failover and a client wrongly assuming 60 seconds of waiting after a 503 with no reason for that).

480 means "User not available now". This reply often occurs when the callee rejects the call or has DND enabled. It's also sent by gateways when the callee doesn't answer a call within 60 seconds. But in this report we are speaking about SIP SRV failover and the fact that Asterisk *cannot* process the call due to an internal failure/issue (as having all the PRI channels busy is).

I think SIP specifications and standards should be above particular scenarios. Regards.

By: Amilcar S Silvestre (amilcar) 2009-03-17 06:09:33

@ibc: You missed the whole point, I guess. I'm not saying that it is correct to use 480 in all cases. What I'm saying is that, because of the architecture of Asterisk, all calls that end in a CHANUNAVAIL or CONGESTION status will return 503 to the endpoint. That's incorrect, and I used the chan_dahdi example to show a very common case in which many hangup causes are simply transformed into a 503.

There are many other examples: try to dial a non-existent SIP channel. Asterisk will return cause "20 - absent", and 503 to the endpoint. Try calling with libpri a number that returns causes 1, 3 or 20, for example... In all cases, the endpoint will receive 503. That's clearly incorrect. According to RFC 3398, cause 21 should return 403, causes 1 and 3 should return 404, cause 20 should return 480, etc. And you're wrong about 480 meaning only that the user rejected the call or has DND enabled.
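
To make the per-cause mapping concrete, here is a minimal C sketch restricted to the cause values cited in this thread from RFC 3398. The function name `isup_cause_to_sip` is hypothetical, and the 500 fallback for unlisted causes is an assumption of the sketch rather than something taken from Asterisk.

```c
/* Hypothetical helper: map an ISUP/Q.850 cause value to a SIP status
 * code, covering only the causes cited in this thread from RFC 3398. */
int isup_cause_to_sip(int cause)
{
    switch (cause) {
    case 1:  /* unallocated number */
    case 3:  /* no route to destination */
        return 404;
    case 20: /* subscriber absent */
        return 480;
    case 21: /* call rejected */
        return 403;
    case 34: /* no circuit available */
    case 38: /* network out of order */
    case 41: /* temporary failure */
    case 42: /* switching equipment congestion */
    case 47: /* resource unavailable */
        return 503;
    default:
        /* Assumption of this sketch: anything not listed becomes 500. */
        return 500;
    }
}
```

The point of contention is visible here: only the "resource unavailable" group ends up as 503, while a single generic congestion indication cannot preserve the distinction between these groups.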

I'm not saying that is correct using 480 in all AST_CONTROL_CONGESTION cases. All I'm saying is that is not correct returning 503 to all AST_CONTROL_CONGESTION cases (there are cases for 503, like the cases when SIP DNS failover applies).

And in the strict sense of the SIP specifications, your bug report is incorrect anyway. According to the RFC:

21.5.4 503 Service Unavailable

  The server is temporarily unable to process the request due to a
  temporary overloading or maintenance of the server.  The server MAY
  indicate when the client should retry the request in a Retry-After
  header field.  If no Retry-After is given, the client MUST act as if
  it had received a 500 (Server Internal Error) response.

Every 503 WITHOUT "Retry-After" MUST be treated as a 500. So, the whole point of changing 503 to 500 is useless.

By: Iñaki Baz Castillo (ibc) 2009-03-17 06:29:25

@amilcar:

> I'm not saying that is correct using 480 in all AST_CONTROL_CONGESTION cases.
> All I'm saying is that is not correct returning 503 to all
> AST_CONTROL_CONGESTION cases (there are cases for 503, like the cases when
> SIP DNS failover applies).

That's *exactly* what I mean. I opened a new bug for it: ASTERISK-1449653.


> And you're wrong about 480 meaning only that the user rejected the call or
> has DND enabled.

480 can mean:
- The user is not registered (reply from its proxy).
- The user rejects the call (by pressing "Reject" or with DND enabled).
- A gateway terminates a ringing call after 60 seconds since in that time the callee didn't answer, so replies 480 upstream (I've seen it in some softswitches).
Probably there are more cases for 480 (one of the worst specifications in RFC 3261), but I think this is not the point here.


> Every 503 WITHOUT "Retry-After" MUST be treated as a 500. So, the whole
> point of changing 503 to 500 is useless.

That's clearly another bad and ambiguous specification in RFC 3261, and I agree that it can cause confusion.

RFC 3263 says clearly:
-----------
4.3 Details of RFC 2782 Process
  [...]
  For SIP requests, failure occurs if the transaction layer reports a
  503 error response or a transport failure of some sort (generally,
  due to fatal ICMP errors in UDP or connection failures in TCP).
  Failure also occurs if the transaction layer times out without ever
  having received any response, provisional or final (i.e., timer B or
  timer F in RFC 3261 [1] fires).  If a failure occurs, the client
  SHOULD create a new request, which is identical to the previous, but
  has a different value of the Via branch ID than the previous (and
  therefore constitutes a new SIP transaction).  That request is sent
  to the next element in the list as specified by RFC 2782.
-----------


Also RFC 3261 says:
------------
21.5.4 503 Service Unavailable
  [...]
  A client (proxy or UAC) receiving a 503 (Service Unavailable) SHOULD
  attempt to forward the request to an alternate server.  It SHOULD NOT
  forward any other requests to that server for the duration specified
  in the Retry-After header field, if present.
------------

I understand what you mean, it also says "If no Retry-After is given, the client MUST act as if it had received a 500 (Server Internal Error) response." But I think it doesn't invalidate the next paragraphs about failover.

I'm opening a thread in sip-implementors about it:
 https://lists.cs.columbia.edu/pipermail/sip-implementors/2009-March/022102.html

However I think we agree on the basics: Asterisk shouldn't reply 503 in lots of cases :)

Thanks for your comments.
PS: Perhaps we could continue this discussion in bug ASTERISK-1449653.

By: Iñaki Baz Castillo (ibc) 2009-03-23 09:43:29

In the thread opened on sip-implementors, the overall conclusion is that, even if a 503 has no "Retry-After" header, it must force the client to try an alternate server (obtained previously using DNS SRV).
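
A minimal C sketch of that client-side reading, based on the RFC 3263 section 4.3 text quoted earlier in this thread. The struct and function names are hypothetical, and treating a missing Retry-After as "no hold-off period" is an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical summary of how one SIP client transaction ended. */
struct txn_result {
    int  status;         /* final SIP status, 0 if none was received */
    bool transport_err;  /* e.g. fatal ICMP error or TCP connect failure */
    bool timed_out;      /* timer B or timer F fired without a response */
    int  retry_after;    /* Retry-After in seconds, -1 if header absent */
};

/* RFC 3263 section 4.3: a 503, a transport failure or a timeout counts
 * as failure, and the request is retried on the next SRV target.
 * Retry-After plays no part in this decision. */
bool should_try_next_server(const struct txn_result *r)
{
    return r->status == 503 || r->transport_err || r->timed_out;
}

/* How long to avoid the failed server. Assumption: 0 seconds when the
 * 503 carried no Retry-After header. */
int avoid_server_seconds(const struct txn_result *r)
{
    if (r->status == 503 && r->retry_after >= 0) {
        return r->retry_after;
    }
    return 0;
}
```

Under this reading a 500 never triggers failover, which is why the choice between 500 and 503 matters to a caller that uses DNS SRV.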

By: David Vossel (dvossel) 2009-08-25 11:29:19

I'm going to post here exactly what I did for issue ASTERISK-13736.

"I understand the issue, and I understand where your concern is coming from, but changing all 503 errors to 500 errors in chan_sip is not a good idea... If it was only SIP that we were concerned about this might be different, but it's not.  SIP talks to ISDN and other channels, and in many cases a congestion frame should be translated into a 503 error and sent out.  There is a dialplan solution for this however.  If you understand your setup enough to know a 503 error should not be forwarded, changing the hangup cause in the dialplan, ${HANGUPCAUSE}, from AST_CAUSE_CONGESTION or AST_CAUSE_SWITCH_CONGESTION to AST_CAUSE_FAILURE will convert the 503 to a 500 error."

By: Iñaki Baz Castillo (ibc) 2009-08-25 11:52:48

Please don't close the report so fast; I'd like to reply to your last comment:

"SIP talks to ISDN and other channels, and in many cases a congestion frame should be translated into a 503 error and sent out."

Could you please tell me a case in which an ISDN error should be translated into a 503 and couldn't be translated into a 500?

RFC 3398 defines the following ISUP->SIP (503) translations:


* Resource unavailable:

  This kind of cause value indicates a temporary failure.  A 'Retry-
  After' header MAY be added to the response if appropriate.

  ISUP Cause value                        SIP response
  ----------------                        ------------
  34 no circuit available                 503 Service unavailable
  38 network out of order                 503 Service unavailable
  41 temporary failure                    503 Service unavailable
  42 switching equipment congestion       503 Service unavailable
  47 resource unavailable                 503 Service unavailable

For all these cases, a 503 is good since it gives the client an opportunity to call another server (if SRV is present), as *any* call is impossible using this server.



* Service or option not available:

  This kind of cause value indicates that there is a problem with the
  request, rather than something that will resolve itself over time.

  ISUP Cause value                        SIP response
  ----------------                        ------------
  58 bearer capability not presently      503 Service unavailable
     available

The client's request is wrong or not allowed/supported, so *any* call would also fail. Then 503 is also suitable.


* Invalid message:

  ISUP Cause value                        SIP response
  ----------------                        ------------
  88 incompatible destination             503 Service unavailable

No idea what this code means.


But in the example I gave, using 503 is incorrect:

 exten => _X.,1,Dial(SIP/trunk1/${EXTEN})
 exten => _X.,n,Hangup

In case the trunk1 replies "SIP/2.0 503 Service Unavailable" Asterisk uses the same SIP code to reply upstream. Asterisk shouldn't do it and MUST convert that 503 into 500.

If Asterisk passes the 503 through, the caller will understand that Asterisk is unable to forward calls and, according to the RFCs, will try another server (SRV). This is wrong, since a single call to a single destination should not produce a 503 from Asterisk.

So, not all the 503 should be translated into 500, but some should (IMHO).

By: David Vossel (dvossel) 2009-08-26 12:20:59

"So, not all the 503 should be translated into 500, but some should (IMHO)."

I agree, and I'm not saying you're wrong.

What we're dealing with here is a side effect of Asterisk being a multi-protocol system.  Every error has to go from protocol specific, into generic Asterisk land, and then back to protocol specific. That way, it doesn't matter what is on each end.  The problem here is that lots of errors in all kinds of channel drivers can get lumped into the generic CONGESTION frame.  chan_sip has no idea why CONGESTION was sent to it, it just knows that should result in a 503 response, and in most cases this is probably correct, but you have found a situation in which it is not.  We understand this and that's why ISDN cause codes and HANGUP cause codes are available in the dialplan.  You can change the generic mapping to be anything you want via the dialplan.  If you want all AST_CAUSE_CONGESTION causes to be mapped to AST_CAUSE_FAILURE, this will force sip to respond with a 500 instead of a 503.

So here's the breakdown.
When sip gets a 503 it maps that to CONGESTION, it doesn't know what the other side is.  When sip gets CONGESTION from Asterisk it maps that to a 503, and it doesn't know what the other side is.  If we changed that mapping to be anything else, it would mess up how sip talks with other channel drivers, but would resolve the single issue you are having.  We simply can not do this, especially when there are valid workarounds in the dialplan.
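
The round trip described above can be sketched in C as follows. All names are hypothetical stand-ins and none of this is Asterisk code: the enumerators only model the generic congestion and failure causes mentioned in this thread, and `dialplan_override` models the dialplan workaround as a plain function.

```c
/* Hypothetical generic causes; the values are arbitrary, not Asterisk's. */
enum cause {
    CAUSE_CONGESTION,
    CAUSE_FAILURE
};

/* Where the failure really came from. The generic layer drops this. */
enum origin {
    ORIGIN_SIP_503,        /* a downstream SIP trunk answered 503 */
    ORIGIN_PRI_CONGESTION  /* an ISDN switch reported congestion */
};

/* Inbound side: protocol-specific error -> generic cause. Both origins
 * collapse into the same value, so the origin is lost from here on. */
enum cause to_generic(enum origin o)
{
    switch (o) {
    case ORIGIN_SIP_503:
    case ORIGIN_PRI_CONGESTION:
        return CAUSE_CONGESTION;
    }
    return CAUSE_FAILURE; /* fallback for values outside the enum */
}

/* Outbound side: generic cause -> SIP status, as described above
 * (congestion gives 503, failure gives 500). */
int to_sip(enum cause c)
{
    return c == CAUSE_CONGESTION ? 503 : 500;
}

/* The workaround: an administrator who knows the setup rewrites
 * congestion to failure before the outbound mapping runs. */
enum cause dialplan_override(enum cause c)
{
    return c == CAUSE_CONGESTION ? CAUSE_FAILURE : c;
}
```

Because `to_sip` sees only the generic cause, it cannot tell the two origins apart; only a step that knows the deployment, such as the dialplan, can decide that a particular congestion should be reported as a 500.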

I'm closing the issue, but that doesn't mean it should not be discussed further.  We're open for proposals for architecture changes on how this stuff works.  If you have any ideas, please pursue them on the -dev list.