Summary: ASTERISK-17117: [patch] IAX2 Retry Time Review
Reporter: Leo Brown (netfuse)
Labels: patch
Date Opened: 2010-12-16 09:41:38.000-0600
Date Closed:
Priority: Minor
Regression? No
Status: Open/New
Components: Channels/chan_iax2
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) iax2.c.patch, (1) iax2.c.updated.patch
Description:

Guys,

The IAX2 channel driver has a number of timeout settings for communication. I have found that customers with network problems are not particularly benefitted by these timeouts because they are so incredibly high.

If you look at the maths, it will be at least 30 seconds with absolutely no traffic before a DIAL on an IAX2 host fails. However, a SIP peer with qualify=yes defaults to 2000ms. I recommend that the 2 seconds timeout should be applied to dialling on IAX trunks also.

Patch attached for your consideration. I have tested this in production use with good results.

Cheers
Leo
Comments:

By: Leif Madsen (lmadsen) 2010-12-16 11:03:34.000-0600

I don't think the qualify option has any benefit when Dial()ing with a SIP channel either. It would actually time out after 63 seconds (if session-timers are not enabled).

By: Leo Brown (netfuse) 2010-12-16 11:10:00.000-0600

Sorry, I guess you misunderstand. My point is that latency over 2 seconds is clearly not suitable for general communication - hence why 2000ms is the default qualify value for SIP.

By this thinking, we should see similar timeouts on other technologies.

You could argue that a custom (non-Asterisk) IAX server would defer a call acceptance until it'd done some background checks (looked up the Caller ID, say) but this should still send an ACK and this would prevent the Dial() being aborted.

Adding these reduced timeouts means that you can have meaningful route prioritisation. Take the example of FreePBX "Outbound Routes". You can add as many routes as you want and set their priority, but if the first one cannot be reached, it will be a long time before the next route is attempted. These modified values suddenly make that route order useful again!

Anyway, I'm sure someone will disagree with my values, but we've used the first version (500ms and 1 retry) in production for a good while and it works well.

By: David Vossel (dvossel) 2011-05-06 16:57:15

The initial retry time is 2 times the last ping pong round trip time.  So changing the default retry time shouldn't matter if qualifies are being used... At least I think qualifies do the ping pong request, I can't remember.

The RFC has this to say about retransmission timers.
7.2.1. Retransmission Timer


  The message retransmission procedures are described in Section 7.  On
  each call, there is a timer for how long to wait for an
  acknowledgment of a message.  This timer starts at twice the measured
  Round-Trip Time from the last PING/PONG command.  If a retransmission
  is needed, it is exponentially increased until it meets a boundary
  value.  The maximum retry time period boundary is 10 seconds.



So, it just needs to "exponentially" increase. It should be safe to just double it on each retry; I believe that is what the SIP retransmission timer does.
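As a rough illustration of the doubling idea, here is a minimal sketch. The helper name `next_retry_time` is hypothetical and does not come from chan_iax2.c; the 10 second cap is taken from the RFC text quoted above, and times are assumed to be in milliseconds.

```c
/* Upper boundary from the RFC text quoted above: 10 seconds, in ms. */
#define MAX_RETRY_TIME 10000

/*
 * Hypothetical helper (not part of chan_iax2.c): returns the next
 * retry interval by doubling the current one, capped at MAX_RETRY_TIME.
 */
static int next_retry_time(int retrytime_ms)
{
    /* Check against half the cap first so the doubling cannot overshoot it. */
    if (retrytime_ms >= MAX_RETRY_TIME / 2)
        return MAX_RETRY_TIME;
    return retrytime_ms * 2;
}
```

Starting from 200 ms (twice a 100 ms round trip), repeated calls would give 400, 800, 1600, 3200, 6400 and then 10000 ms.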

By: Leo Brown (netfuse) 2011-05-09 03:56:23

Hi

Great, thanks for your feedback.

I feel we're close - here are some relevant elements from the chan_iax2.c source:

/* Don't retry more frequently than every 10 ms, or less frequently than every 5 seconds */
#define MIN_RETRY_TIME 100
#define MAX_RETRY_TIME 10000

/* Retry after 2x the ping time has passed */
 fr->retrytime = pvt->pingtime * 2;
 if (fr->retrytime < MIN_RETRY_TIME)
   fr->retrytime = MIN_RETRY_TIME;
 if (fr->retrytime > MAX_RETRY_TIME)
   fr->retrytime = MAX_RETRY_TIME;

So this part behaves according to the RFC. The bit that doesn't, however, is this:

 /* Attempt transmission */
 send_packet(f);
 f->retries++;
 /* Try again later after 10 times as long */
 f->retrytime *= 10;
 if (f->retrytime > MAX_RETRY_TIME)
   f->retrytime = MAX_RETRY_TIME;
 /* Transfer messages max out at one second */
 if (f->transfer && (f->retrytime > 1000))
   f->retrytime = 1000;

So there is an embedded multiplier which makes the retry time *10 times longer* after each packet that is not acknowledged, up to a maximum of 10 seconds. The default max qualify time is 2 seconds before the host is considered unreachable, so the initial retry time can be as high as 4 seconds, and there are up to 4 retries. This means the transmission fail time would be 4s + 40s + 400s + 4000s = 4444 seconds, but it is instead limited by MAX_RETRY_TIME to 4s + 10s + 10s + 10s = 34 seconds.

Consider the case without the multiplier: the worst case would be 4s * 4 = 16s for a really lagged host, or, say, 200ms * 4 = 0.8s for a host which was previously only 100ms away.

Basically, there is no good reason for the retry time to be multiplied - it works as hoped without this retry timer being multiplied. The multiplier is not extracted as a constant and is not required by the RFC.
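The arithmetic above can be checked with a small sketch. It assumes the model used in this comment: four waits in total, the first being twice the ping time (clamped to the MIN/MAX values quoted from chan_iax2.c), and each later wait multiplied and capped. The helper name `total_wait_ms` and its parameters are hypothetical, and the real driver's scheduling may differ in detail.

```c
/* Values as quoted from chan_iax2.c above, in milliseconds. */
#define MIN_RETRY_TIME 100
#define MAX_RETRY_TIME 10000

/*
 * Hypothetical helper: sums `waits` consecutive retry intervals.
 * The first interval is 2x the ping time, clamped to the limits above;
 * every following interval is the previous one times `multiplier`,
 * capped at MAX_RETRY_TIME.
 */
static long total_wait_ms(long pingtime_ms, int multiplier, int waits)
{
    long retrytime = pingtime_ms * 2;
    long total = 0;
    int i;

    if (retrytime < MIN_RETRY_TIME)
        retrytime = MIN_RETRY_TIME;
    if (retrytime > MAX_RETRY_TIME)
        retrytime = MAX_RETRY_TIME;

    for (i = 0; i < waits; i++) {
        total += retrytime;          /* wait out the current interval */
        retrytime *= multiplier;     /* grow it for the next attempt */
        if (retrytime > MAX_RETRY_TIME)
            retrytime = MAX_RETRY_TIME;
    }
    return total;
}
```

Under these assumptions, `total_wait_ms(2000, 10, 4)` gives 34000 ms with the current x10 multiplier, `total_wait_ms(2000, 1, 4)` gives 16000 ms without it, and `total_wait_ms(100, 1, 4)` gives 800 ms for a host that was 100 ms away.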

I have a dozen boxes in production with this patched, and they work as expected. If an IAX trunk goes down, the call to be placed to it doesn't hang for 30+ seconds!

Leo

By: Leo Brown (netfuse) 2011-05-12 05:59:50

Hi

Re: the exponential timer, I do think 1.5x would be more appropriate if we're retrying 4 times.

The question is, where does the choice of 4 retries come from?

It doesn't make sense if you think about real world scenarios. We would never keep trying on a dead call for 30 seconds!

Leo

By: David Vossel (dvossel) 2011-05-12 09:20:59

The same thing happens in SIP as well. INVITEs retransmit for 32 seconds by default. SIP's timer by default increments by 2x until it reaches 4 seconds, I believe.

By: Leo Brown (netfuse) 2011-05-12 09:24:23

So based on this, it's impossible to detect a dead trunk on the fly and fail over to another while the user waits. This is a problem, right?