Summary:ASTERISK-04514: [patch] after running for a while, chan_sip permanently stops registration attempts on the first failure
Reporter:Steve Davies . (stevedavies)
Labels:
Date Opened:2005-07-03 13:28:04
Date Closed:2008-01-15 15:39:50.000-0600
Versions:
Frequency of
Environment:
Attachments:( 0) asterisk-resetregattempts.patch
Description:The patch in bug 3850 (applied 06/03) added code intended to permanently stop registration attempts for a target after too many failures.

Unfortunately, the code increments regattempts on every registration attempt but never resets the counter to 0, so regattempts only ever increases.  Once it exceeds the maximum, the very first registration failure makes chan_sip turn off that registration permanently.

To duplicate the fault, simply:
 - set up 2 Asterisk boxes, one registering to the other, with a fairly short registration interval.
 - leave them running for a while until the regattempts counter is >10.
 - take down the target Asterisk box.
 - on the very next registration attempt you'll see Asterisk permanently stop trying that registration.

I attach a small patch that adds the missing regattempts = 0 and fixes the issue.

The patch also fixes one or two spelling mistakes and adds a clear log message so you know when chan_sip decides to give up on a registration.
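
The failure mode and the fix described above can be modelled in a few lines of C. This is only a sketch, not the chan_sip code: the struct, the function names and the reset_on_success switch are made up for illustration, and only the regattempts counter and the default maximum of 10 come from this report.

```c
#include <stdio.h>
#include <stdbool.h>

#define MAX_REGATTEMPTS 10  /* default maximum mentioned in this report */

/* Hypothetical stand-in for chan_sip's per-registration state. */
struct registration {
    int regattempts;  /* attempts counted so far */
    bool gave_up;     /* true once retries are permanently stopped */
};

/* One registration cycle. 'success' says whether the registrar accepted it.
 * 'reset_on_success' toggles the fix: clear the counter after a success. */
static void register_cycle(struct registration *r, bool success,
                           bool reset_on_success)
{
    if (r->gave_up)
        return;
    r->regattempts++;
    if (success) {
        if (reset_on_success)
            r->regattempts = 0;
        return;
    }
    if (r->regattempts > MAX_REGATTEMPTS) {
        r->gave_up = true;
        printf("giving up on registration after %d attempts\n",
               r->regattempts);
    }
}

/* Run 'good' successful cycles followed by a single failure and
 * report whether the registration was abandoned. */
static bool gives_up_on_first_failure(int good, bool reset_on_success)
{
    struct registration r = { 0, false };
    for (int i = 0; i < good; i++)
        register_cycle(&r, true, reset_on_success);
    register_cycle(&r, false, reset_on_success);
    return r.gave_up;
}
```

In this model, gives_up_on_first_failure(20, false) returns true (the old behaviour: one failure after a long healthy run is fatal), while gives_up_on_first_failure(20, true) returns false.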

Comments:By: Olle Johansson (oej) 2005-07-03 13:39:45

Good catch! Thanks.

By: Michael Jerris (mikej) 2005-07-03 18:00:16

Passed functionality tests based upon dev list responses on this.

By: capouch (capouch) 2005-07-03 18:07:57

My SIP registration had formerly been failing within an hour or so of bringing up a new server instance.

Now, several hours later, things are still fine.  So I think that was the source of my problems and it is now fixed.

By: Steve Davies . (stevedavies) 2005-07-04 04:34:41


My test bed runs 5000 SIP registrations from one system to the other (chan_sip needs some other changes to be able to send so many).

Looking at that test, it's my opinion that the default regattempts max of 10 is still much too low.  At a 20-second timeout per attempt, the target of the registration only has to be unreachable for 200 seconds and we give up forever.

If you are running a network of servers and have a "hub" go down, now not only do you have to get it up again, you also have to go to every peer box and reload or whatever to get the registrations going again.

So I'd propose that we either:
 A) increase the default of 10 to more like 100, or
 B) have a steady backoff of attempt frequency if an attempt is failing, so we "use up" our tries less quickly.

How long should a target be down before we give up on it?  Surely it should be closer to an hour - perhaps even longer to allow for a host going down overnight and being fixed the next morning...
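
To put rough numbers on options A and B, here is a small sketch in C. The doubling schedule and the 600-second cap are assumptions chosen purely for illustration (the proposal only asks for "a steady backoff"), and both function names are hypothetical.

```c
/* Seconds until the attempt budget is used up at a fixed retry interval. */
static long fixed_giveup_seconds(int max_attempts, long interval)
{
    return (long)max_attempts * interval;
}

/* Same budget, but the wait doubles after every failed attempt,
 * capped at 'cap' seconds (assumed schedule, not from chan_sip). */
static long backoff_giveup_seconds(int max_attempts, long first_wait, long cap)
{
    long total = 0;
    long wait = first_wait;
    for (int i = 0; i < max_attempts; i++) {
        total += wait;
        wait = (wait * 2 > cap) ? cap : wait * 2;
    }
    return total;
}
```

With these numbers, fixed_giveup_seconds(10, 20) is the 200 seconds mentioned above, fixed_giveup_seconds(100, 20) is 2000 seconds (about 33 minutes), and backoff_giveup_seconds(10, 20, 600) is 3620 seconds, i.e. about an hour from the same 10 attempts.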

By: Olle Johansson (oej) 2005-07-04 08:30:01

I think we might have to add a new option for setting a restart timer - when something fails, should we restart it at all and how long should we wait? The restart only needs to happen in certain cases, not when we get error messages from the other end.

This is however an addition that requires another issue report.

By: Steve Davies . (stevedavies) 2005-07-04 14:18:05


Is the regattempts max only supposed to be triggered when we get a clear reject from the other end?  At the moment, a "no response" also counts as a failure.  Perhaps I should adjust things so that a "no response" doesn't count as a failure?
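
A minimal sketch of what that adjustment might look like, assuming the outcome of each attempt can be classified into three cases. The enum, struct and function names are hypothetical and do not come from chan_sip.

```c
#include <stdbool.h>

/* Hypothetical classification of a registration attempt's outcome. */
enum reg_result { REG_OK, REG_REJECTED, REG_NO_RESPONSE };

struct reg_state {
    int regattempts;  /* failures charged against the budget */
    bool gave_up;     /* true once retries are permanently stopped */
};

/* Only an explicit reject counts towards the maximum; a timeout
 * leaves the counter alone so retries continue indefinitely. */
static void note_result(struct reg_state *s, enum reg_result res,
                        int max_attempts)
{
    if (s->gave_up)
        return;
    switch (res) {
    case REG_OK:
        s->regattempts = 0;
        break;
    case REG_REJECTED:
        if (++s->regattempts > max_attempts)
            s->gave_up = true;
        break;
    case REG_NO_RESPONSE:
        /* unreachable target: keep trying, do not count it */
        break;
    }
}
```

Under this scheme an unreachable hub never exhausts the budget, while a peer that keeps answering with a reject is still abandoned once the maximum is exceeded.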


By: Kevin P. Fleming (kpfleming) 2005-07-05 11:09:41

I've applied this patch to CVS HEAD; discussion of a more flexible backoff algorithm can happen on asterisk-dev or in a new bug. Thanks!

By: Digium Subversion (svnbot) 2008-01-15 15:39:50.000-0600

Repository: asterisk
Revision: 6023

U   trunk/channels/chan_sip.c

r6023 | kpfleming | 2008-01-15 15:39:50 -0600 (Tue, 15 Jan 2008) | 2 lines

reset regattempts counter after successful registration (bug ASTERISK-4514)