Summary:ASTERISK-16792: Reregistration with small (20 sec) Expiry fails
Reporter:Timo Teräs (fabled)Labels:
Date Opened:2010-10-12 03:36:55Date Closed:2012-01-18 10:30:51.000-0600
Versions:Frequency of
Environment:Attachments:( 0) chan_sip.c.patch-issue18119
( 1) chan_sip.c.patch-issue18119v2
Description:Apparently the following happens:

1. Asterisk register succesfully, 20s Expiry -> reregister in 15s
  - destruction of sip_pvt scheduled with timeout > 15s
2. When reregister triggers:
  - sip_registry->call is NULL, because of successful registration
  - transmit_register builds new sip_pvt with same call ID as #1
3. Server replies
4. Asterisk rejects registration reply because:
  - the sip_pvt from #1 has not been yet destroyed
  - and asterisk routes the reply to sip_pvt from #1 due to call-id
  - this is indicated by log messages such as:
[Oct 11 21:34:26] DEBUG[16751] chan_sip.c: Ignoring out of order
response 476 (expecting 475)
  - until we get timer_t1*64 ms later the message
[Oct 11 21:34:42] DEBUG[16751] chan_sip.c: Auto destroying SIP dialog
'768de833447ed7e558ee6dcc44f94821 at ...'
  - immediately after which the registration succeeds again

Olle suggested on mailing list:
I think it clearly has to do with the t1 timer for #1. It propably hangs around for more than 20 secs. We should set max
T1 for registry transactions to be the expiry time minus some small unit.

And yes, with default global_t1 the timeout would be around 30 seconds.
Comments:By: Stefan Schmidt (schmidts) 2010-10-12 04:15:39

i have looked into the rfc to see if something is in there to make sure we will do the right thing.
part 10.3.7 says:
If the Call-ID value in the existing binding differs from the
Call-ID value in the request, the binding MUST be removed if
the expiration time is zero and updated otherwise.  If they are
the same, the registrar compares the CSeq value.  If the value
is higher than that of the existing binding, it MUST update or
remove the binding as above.  If not, the update MUST be
aborted and the request fails.

could you please add a sip trace for this issue so we can ensure asterisk does it right.

IMHO asterisk should reply with an 423 Interval too short message if someone tries to register with an expiry value lower than T1 * 64 to prevent such behavior cause i think there is no need to reregister every 15 seconds but i may be wrong on this ;)

thank you!

By: Timo Teräs (fabled) 2010-10-12 05:18:40

Sorry for not mentioning: Asterisk is the client sending SIP registration messages out here.

Since the server (non-Asterisk) replies with 200 OK and small "expires" in Contact header, we cannot possibly respond to reply anymore. My quick read of RFC would indicate that it's OK for the server to send this small 20s expires. Minimum expires setting is server-only apparently.

Asterisk is doing the correct thing according to RFC. The problem becomes in the fact that there exists multiple sip_pvt for same Call-ID which confuses asterisk internal message passing code.

It seems that Call-ID must match for the reregister to update binding properly, so we have some options to fix this:
1. Force sip_pvt of REGISTER to expire faster for small expires intervals as suggested
2. Fix registration code to reuse sip_pvt if it has not been expired yet
3. Fix registration code to not delete sip_pvt at all and reuse it always

By: Stefan Schmidt (schmidts) 2010-10-12 07:13:54

No way on 3. IMO this could cause many dead dialogs in there.
To number 2 i am not sure if this really would be the best way but thats just a personal opinion.

i would prefer #1 cause i was thinking about the other way, that asterisk was the registrar not the registering client. In this case it would be ok to set down the timeout when this sip_pvt should be destroyed.

BTW who force clients to reregister in 20 seconds and WHY? thats some kind of nat keep alive for one who doesnt know a better solution ;)

By: Timo Teräs (fabled) 2010-10-12 07:27:13

Yes, I kinda figured 2 & 3 are not too good. Probably 2 is hard to implement as it would imply that we need to keep weak reference. Or maybe it just needs a new scheduler function to clear the sip_registry stuff when the dialogue is destroyed. But it would need more care to make sure we don't hit strange race conditions.

So yes, fix no 1 is seems simplest.

And yes, it really smells like a NAT keepalive. The server does that only when it detects NAT. If externip is set correctly, the Expires it sends is several minutes. This is done by Finnish service provider Saunalahti on their Nettipuhelin service. It is a consumer targeted POTS to SIP gateway service.

By: Stefan Schmidt (schmidts) 2010-10-12 08:31:45

please try the attached patch. its not very nice code but i want to know if this works as expected. if yes i make it more readable ;)

By: Stefan Schmidt (schmidts) 2010-10-12 08:39:39

ok i forget the small amount of safety for the expires_ms value please just add this whatever you think would be ok ;)
something like this:
s/expires_ms/(expires_ms - EXPIRY_GUARD_MIN)/ on the latest expires_ms

By: Timo Teräs (fabled) 2010-10-12 09:05:14

I did have some failures after starting the new binary, but that was probably just me doing something wrong.

Now it's been running on short update and consecutive reregistrations succeed.

So yes, looks like this fixes the issue.

By: Stefan Schmidt (schmidts) 2010-10-12 16:11:49

i have updated the patch. it still does the same but now its easier to read ;)

By: Timo Teräs (fabled) 2010-10-12 23:55:53

The functionality differs, if:
- p->timer_t1 is set and large enough to not cause clipping
- and global_t1 is small enough to cause clipping
the expiry timeout is clipped according to global_t1 which was not the original functionality. The pedantically correct thing is to use only p->timer_t1 if it is set.

Also, if expires_ms - EXPIRY_GUARD_MIN becomes negative the default timeout gets used again. But then again, if we get smaller expires_ms than EXPIRY_GUARD_MIN something is very wrong.

By: Stefan Schmidt (schmidts) 2010-10-13 03:49:19

you are right fable i have turned this round so i first check global_t1 and only if p->t1 is 0 and then check p->t1. thank you for this catch.
look at the new file.

it could not happen that expires_ms will be below 0 only equal cause the expires header is set in seconds and with 500 ms substracted its the value for expires_ms. If there would be a expire header with only 1 second reregister timeout expire_ms would be 500 ms and the value i use for sched destroy would be 0 and this value is ok cause it would not be replaced with the T1 timer in this case.

By: Timo Teräs (fabled) 2010-10-13 03:55:48

Technically the new patch looks correct. Though, I would have probably merged the p->t1 checks to "parent" if. That would avoid at least one compare on runtime. The semantics of C require the compiler to always evaluate the first expression of &&.

And you are right on expires_ms, I forgot that on wire it is in seconds.

So looks good to me.

By: Stefan Schmidt (schmidts) 2010-10-13 14:09:11

i have used the most common case as first break condition and this would be expires_ms is greater than global T1. the else if condition is allways wrong only if p->t1 is set so the check if p->t1 is not equal to 0 is useless on the second condition.

By: Joshua C. Colp (jcolp) 2012-01-18 10:26:39.092-0600

After testing I have confirmed that this issue is actually resolved in Asterisk 1.8 and above.