[Home]

Summary:ASTERISK-11734: [patch] DNS SRV lookups causing re-registration problems
Reporter:J.R. (jrast)Labels:
Date Opened:2008-03-26 17:38:44Date Closed:2010-02-01 09:46:52.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/Registration
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 12312.patch
( 1) 20080109_issue12312_trunk_3.diff
( 2) 20080626_issue12312_srv_rereg.diff
( 3) 20080708_issue12312_14branch.diff
( 4) 20080708_issue12312_trunk_2.diff
Description:SIP peer uses DNS SRV records which list multiple entries for _sip._udp.domain.  

Asterisk is selecting one and successfully registering with it.  When attempting to re-register, Asterisk is not selecting the same host.  Instead it is ALWAYS selecting another host, sending seven SIP REGISTER requests to it, then failing over to the next (correct) host, at which point the re-register succeeds.  However, this process exceeds the registration expiry and the registration is timing out on every re-registration cycle.

Peer in question is callcentric.com.
Comments:By: J.R. (jrast) 2008-03-26 19:49:57

Problem seems to be that Asterisk is performing a new SRV lookup on every re-register.  Callcentric's SRV entries are equally weighted, so Asterisk is selecting from them in a round-robin fashion.  Asterisk's attempt to re-register to a server other than that which it is registered to is failing.

The following hack stores the host looked up during the initial register's SRV lookup and re-uses it for the re-registers.  This solves the problem for Callcentric.

Hopefully someone more familiar with the workings of the code can change this hack into a proper patch.



By: Olle Johansson (oej) 2008-03-27 03:06:17

We should really stay with the first choice until it fails, not select another one.

By: Dennis DeDonatis (dennisd) 2008-04-07 10:21:25

I can confirm there is an issue with callcentric.com and asterisk 1.6.0 beta 7.1 which has srvlookup=yes turned on by default.  Setting srvlookup=no in sip.conf makes the registration problem go away (it doesn't fix it, obviously).

By: Brett Bryant (bbryant) 2008-06-26 09:59:07

jrast, please submit patches as files attached to a bug so we can keep track of license agreements. We don't look at any code that's not uploaded.

I've attached a patch for this issue against trunk. It keeps track of the number of SRV records for each registration attempt, and continues to use the current one as long as it succeeds and then checks for a new one if some kind of error has occurred.

Please test and tell me if this fixes your problem.

By: Olle Johansson (oej) 2008-06-26 10:15:56

Please use doxygen format for comments, bbryant.

By: Olle Johansson (oej) 2008-06-26 10:16:47

Bbryant: Can you please explain this comment, I don't understand what you're doing here:

/* when a 400 type response is sent that doesn't
+ * mean an unauthorized response that could be to an
+ * initial registration, set a flag on the registration
+ * structure to look for a new srv record from dns ...
+ */

By: Olle Johansson (oej) 2008-06-26 10:18:10

The issue here is that a transmit fails. If we're getting a response AT ALL, we should stay with that server. I think you're fixing the wrong issue with this patch (only a quick review though).

By: Brett Bryant (bbryant) 2008-06-26 10:30:29

oej, yes, I'll update to use doxygen formats.

That line of code decides a new SRV record should be used if handle_response_register got any response code in the 400 range that wasn't a 401 or 407 which could be a response to an initial register.

I made an assumption in this patch that connection issues would be handled by this area of the code receiving a timeout, but I could be wrong.

I wasn't sure if it would be worth attempting a new SRV record when one failed for something besides a connection problem, but if this section of code does get a timeout for connection issues this would be easy.

By: Olle Johansson (oej) 2008-07-01 11:17:58

You should not pick another SRV host when you get a reply - any reply. You should only change when you have a connection issue or no reply - not even a 100 trying.

By: Brett Bryant (bbryant) 2008-07-08 17:31:16

I've uploaded two new patches against 1.4 and trunk that incorporate oej's suggestions. jrast, please test and see if they fix the issue you're having.

Some more improvements can be done to the way SRV records are fetched in asterisk. The problem is now a srv record that has a high weight but causes a connection problem still has the same chance to be used that it did the first time (asterisks picks the first record and srv records are sorted by chance, according to a weight).



By: J.R. (jrast) 2008-08-01 09:33:49

Folks, thanks for your efforts in working on a proper fix for this.

Unfortunately at the moment I cannot easily test it.  The system in question is in production use, with my original hack that fixed the problem.  At the moment, I am personally preparing for an extended trip out of town so do not have time to set up another system.  Because of that trip, it is unlikely I will be back and able to look at this much before November or December!

I wanted to provide you this info so you didn't think I was rudely ignorning you all and say thanks, again, for the efforts.  Perhaps someone else with a Callcentric account can look at this in the meantime?

By: Leif Madsen (lmadsen) 2008-11-20 09:14:25.000-0600

jrast: it's nearing the end of November.. suppose you're not back and able to test this?

By: Leif Madsen (lmadsen) 2008-11-20 09:15:18.000-0600

Would this be a good candidate for reviewboard since it is unlikely anyone is going to test this in the near future?

By: Leif Madsen (lmadsen) 2009-01-09 07:49:20.000-0600

putnopvut: I've assigned this to you, but that's just mostly because I'd like to get this up on reviewboard, checked, and committed. Let me know if you can't do that! Thanks!

By: Leif Madsen (lmadsen) 2009-01-09 07:51:23.000-0600

Last call for some testing on this issue. Either we deem it to be worthy of commit, or we deem it as functionality no one really needs and is sent to the great dust bin in the sky (i.e. suspended)

By: Leif Madsen (lmadsen) 2009-01-09 08:23:38.000-0600

I've updated the patch so it applies cleanly to trunk recent. It is marked at 20080109_issue12312_trunk_3.diff

By: J.R. (jrast) 2009-01-09 08:24:58.000-0600

Please do not discard this work!  A fix for this problem is essential for peering with Callcentric, and by extension, with any other peer with multiple SRV hosts.

Unfortunately, to test this, I would have to build a completely new Asterisk test platform since my existing one remains in production use.  My schedule has not allowed me the time to do all that would be necessary for this.  I had thought I would be able to by now, but in fact, further trips away appear likely for the time being.

I did include a patch for a hack that does solve the problem, so very specific info is available about this.  Surely someone more familiar with the code than I can compare that and the new patch to verify that the new one should also work.  Failing that, perhaps someone else can set up a Callcentric account (they are free) and try out the new patch?

This is an essential patch for anyone wishing to use Callcentric with Asterisk.

By: Mark Michelson (mmichelson) 2009-01-09 08:56:25.000-0600

I'll go ahead and place the patch on reviewboard. If I have comments about the patch, I'll make them there.

By: Mark Michelson (mmichelson) 2009-01-09 09:02:51.000-0600

By the way, has anyone actually done any testing of the final patch uploaded here? I don't see any mention in the bugnote history of anyone actually trying the patch out. I don't know if bbryant did some local tests while writing the patch or not...

By: Mark Michelson (mmichelson) 2009-01-09 09:06:13.000-0600

The patch has been placed on reviewboard at the following URL:

http://reviewboard.digium.com/r/124/

I will now take a look through and see if there are any obvious problems with the patch.

By: Mark Michelson (mmichelson) 2009-01-09 11:06:02.000-0600

I have added my reaction to the provided patch at the reviewboard page now.

By: Mark Michelson (mmichelson) 2009-01-09 15:45:48.000-0600

After reviewing the patch and investigating more regarding SRV handling in Asterisk, my suspicions were confirmed, which are:

1) We never actually will fail over to a lower priority or weighted SRV record, ever.
2) We don't actually look them up in a round-robin fashion as assumed in the bug description. We just use the order of the records returned to us by whatever resolver is used.

So here's something that doesn't make sense to me. If there are SRV records set up at callcentric for SIP, and they are equally prioritized and weighted, why is it that a REGISTER sent to one works properly but a REGISTER sent to the other does not? This seems like a misconfiguration on the end of the provider and not something that Asterisk is doing wrong.

Furthermore, because of point 2) above, it's not Asterisk's fault that whatever DNS resolver is in use is returning the records in a different order than they previously were. As far as Asterisk is concerned, the only attributes of the records that matter when sorting them is the weight and priority, we do not go to the effort of making sure that records of the same weight and priority are in the same order as our cached list because that should not matter.

Now, I'm not trying to say there's not a bug here, because as oej pointed out in his first note, we should continue using the first SRV record until we have a failure. If Asterisk is really performing a new SRV lookup on every re-registration, then that behavior is incorrect. The existing trunk code appears to handle this correctly, but it looks as though 1.4 may have issues. I will look at bbryant's 1.4 patch to see if it handles things properly (I've just been looking at the trunk patch up to this point).

By: Olle Johansson (oej) 2009-01-09 15:50:28.000-0600

We never had proper SRV support as you say, so that part is a feature request. The problem here is that we're changing server for a RE-register. The second server sees the To: tag and assumes an existing session, which only exists with the first server. If we change server because the first one fails, it's no longer a re-register, it's a new dialog. We should not change server for a re-register.

I have code for handling failures like that in the peer-failover branch, but that's another issue.

By: Mark Michelson (mmichelson) 2009-01-09 16:00:53.000-0600

That makes good sense. Thanks for the clarification :)

By: Mark Michelson (mmichelson) 2009-01-09 16:49:50.000-0600

Having a look at bbryant's 1.4 patch, he only appears to do DNS lookups at three times:

1) The first time we register
2) If sendto() fails at some point
3) If we get no response to our REGISTER and thus time out.

I think this is exactly what is needed here and will fix the issue. As far as I'm concerned this patch is a go. Of course, it would be just amazing if there is anyone out there that could give us a positive test result on it. I would hate to commit code that looks correct but ends up having some small problem.

I'm fairly certain the trunk patch is actually a step backwards since it removes the dnsmgr functionality and doesn't actually change the operations performed.

By: Leif Madsen (lmadsen) 2009-01-10 09:21:46.000-0600

I have setup a callcentric account, so will attempt to try the patch as soon as I reproduce the issue.

By: Leif Madsen (lmadsen) 2009-01-10 09:46:05.000-0600

FYI, this may not be an issue at all on trunk anymore, as I cannot seem to get it to fail in registration on that version. I am going to try on the latest 1.4 branch and see if it is an issue there or not. It seems the ast_get_srv function is doing something right, and never causing the registration to file form what I can see.

I'm re-registering every 60 seconds, so I've been monitoring it for a bit now, and have seen no failures.

By: Leif Madsen (lmadsen) 2009-01-10 09:56:02.000-0600

Is it possible callcentric fixed something on their end? I cannot get this to fail at all. I see registrations going to separate IPs each time a re-registration happens:

For example:

[Jan 10 06:45:51] NOTICE[25009]: chan_sip.c:7588 sip_reregister:    -- Re-registration for  1777289XXXX@callcentric.com
[Jan 10 06:45:51] DEBUG[25009]: chan_sip.c:7817 transmit_register:    >>> Re-using Auth data for 1777289XXXX@callcentric.com
REGISTER 13 headers, 0 lines
Reliably Transmitting (no NAT) to 204.11.192.23:5080:


Then next time...

[Jan 10 06:46:35] NOTICE[25009]: chan_sip.c:7588 sip_reregister:    -- Re-registration for  1777289XXXX@callcentric.com
[Jan 10 06:46:35] DEBUG[25009]: chan_sip.c:7817 transmit_register:    >>> Re-using Auth data for 1777289XXXX@callcentric.com
REGISTER 13 headers, 0 lines
Reliably Transmitting (no NAT) to 204.11.192.37:5080:



So I'm seeing it change the IP address, but I never get a failure.

Sorry, so I'm not sure how to reproduce this in order to determine if the patch helps or hurts :)

I can apply the patch if requested and try it to see if it has any negative effects I suppose. Just let me know!

By: J.R. (jrast) 2009-01-11 10:58:57.000-0600

Well, you are seeing the same IP address change as I had, so that is demonstrating that Asterisk is switching SRVs.

When I reported this, getting on for 10 months ago now, I did also open a ticket with Callcentric.  Their response at the time was that it was wrong for Asterisk to switch SRVs which is why I went on to patch Asterisk.  Since I solved the problem that way quite easily, I closed the Callcentric ticket.

It is quite possible that, in the intervening months, Callcentric has made changes at their end so that a re-register to the other IP now also works.  I have not heard if that's the case, though.  It certainly looks that way.  The re-register to the other IP address that you are seeing is where it would have failed before.

Since Callcentric does appear to have made a change, it is now arguable as to whether this patch is still needed.  I would say yes... to me it seems that a re-register should be that: register again to the SAME server.  We should not be going back to the DNS SRV lookup just to re-register.  We should only look up the DNS SRV records again if a (re-)registration fails.

Thanks very much for taking the trouble to test it out, by the way.



By: Leif Madsen (lmadsen) 2009-01-11 11:07:47.000-0600

No problem. Thank you for continuing to follow up even though you do not need the patch supplied.

I am going to see what putnopvut thinks in terms of this being a bug, and whether this should be committed or not.

I will probably still need to test the patch to see if it has any adverse affects on the callcentric account to make sure it doesn't break things :)

By: Mark Michelson (mmichelson) 2009-01-12 09:24:52.000-0600

blitzrage, definitely test the 1.4 version of the patch on this issue and be sure that we send to the same IP for each re-registration.

My thoughts are that the 1.4 version of the patch should go in because there's really no reason to do a refresh of our cached SRV records just because we're re-registering.

By: Olle Johansson (oej) 2009-01-12 09:26:27.000-0600

To be more precise: There's no need to re-resolve until we have an issue.

Cached SRV records should be updated as soon as the TTL indicates it, but that is a different issue.

By: Leif Madsen (lmadsen) 2009-01-12 10:03:10.000-0600

OK, I will test the 1.4 version that is attached to this issue. Hopefully it'll apply mostly clean :)

By: Leif Madsen (lmadsen) 2009-01-12 10:19:24.000-0600

Seems to fail for me on 1.4:

*CLI> Really destroying SIP dialog '6c82c48c3fa9c718646841dc1ce8a34a@192.168.128.50' Method: REGISTER
[Jan 12 07:06:05] NOTICE[12445]: chan_sip.c:7597 sip_reregister:    -- Re-registration for  1777289xxxx@callcentric.com
[Jan 12 07:06:05] DEBUG[12445]: chan_sip.c:7829 transmit_register:    >>> Re-using Auth data for 1777289xxxx@callcentric.com
REGISTER 13 headers, 0 lines
Reliably Transmitting (no NAT) to 0.0.0.0:5080:
REGISTER sip:callcentric.com:5080 SIP/2.0
Via: SIP/2.0/UDP 127.0.0.1:5060;branch=z9hG4bK265c1070;rport
From: <sip:1777289xxxx@>;tag=as4a8b9251
To: <sip:1777289xxxx@>
Call-ID: 6c82c48c3fa9c718646841dc1ce8a34a@192.168.128.50
CSeq: 104 REGISTER
User-Agent: Asterisk PBX
Max-Forwards: 70
Authorization: Digest username="1777289xxxx", realm="callcentric.com", algorithm=MD5, uri="sip:callcentric.com", nonce="fccff47271971947b6663cc222d66786", response="b2cabadb1aa107d7b2ffc46996b8dff2"
Expires: 60
Contact: <sip:1777289xxxx@127.0.0.1>
Event: registration
Content-Length: 0

By: Leif Madsen (lmadsen) 2009-01-12 10:26:46.000-0600

Looking at this a bit more, this is what happens when not using the patch:

Register --> 407 --> Register --> 200 OK

upon re-registration, if we hit a new IP:

Register --> 407 --> Register --> 200 OK

or upon re-registration with an IP we've successfully reg'd with:

Register --> 200 OK


With the patch, it seems to try sending to 0.0.0.0 as I posted above.

By: Mark Michelson (mmichelson) 2009-01-12 12:50:18.000-0600

blitzrage, was the 0.0.0.0 target occurring only on re-registrations or also on initial registrations?

By: Leif Madsen (lmadsen) 2009-01-12 12:56:16.000-0600

Sorry, wish I had made that more clear.

It was only happening on re-registration. Original registration would work correctly.

By: Mark Michelson (mmichelson) 2009-01-12 17:38:11.000-0600

I see why the 0.0.0.0 is occurring, and I know of a quick way to fix it. The result of my quick fix would result in another SRV lookup, though, which is exactly what we're trying to avoid. I only have about 5 more minutes at work today, so I'll think about it at home and hopefully come up with a better solution tomorrow.

By: Leif Madsen (lmadsen) 2009-01-14 14:38:06.000-0600

Switching back to confirmed as we're not really 'Ready for Review' or commit at this time :)

By: Mark Michelson (mmichelson) 2009-01-15 16:08:22.000-0600

blitzrage: I've uploaded a new patch to attempt using. Please see if the desired behavior occurs. Thanks!

By: Leif Madsen (lmadsen) 2009-01-26 08:44:25.000-0600

This looks great to me!  Registrations work every time, and no matter how many times I reload, this seems to go to the same IP address.

By: Digium Subversion (svnbot) 2009-02-05 17:19:17.000-0600

Repository: asterisk
Revision: 173770

U   branches/1.4/channels/chan_sip.c

------------------------------------------------------------------------
r173770 | mmichelson | 2009-02-05 17:19:17 -0600 (Thu, 05 Feb 2009) | 20 lines

Fix logic regarding when to perform an SRV lookup for outgoing REGISTER requests

With this fix, we only will perform an SRV lookup at the following times:

* The first time we register with a remote registrar
* If we send a REGISTER but do not receive a response
* If the sendto() function returns an error

While I wrote the patch that fixes this issue, a huge amount of credit is due
to Brett Bryant, who wrote the initial patch on which I based this one.

(closes issue ASTERISK-11734)
Reported by: jrast
Patches:
     12312.patch uploaded by putnopvut (license 60)
Tested by: blitzrage

Review: http://reviewboard.digium.com/r/132/


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=173770

By: Digium Subversion (svnbot) 2009-02-05 17:19:52.000-0600

Repository: asterisk
Revision: 173771

_U  trunk/

------------------------------------------------------------------------
r173771 | mmichelson | 2009-02-05 17:19:52 -0600 (Thu, 05 Feb 2009) | 27 lines

Blocked revisions 173770 via svnmerge

........
r173770 | mmichelson | 2009-02-05 17:19:16 -0600 (Thu, 05 Feb 2009) | 20 lines

Fix logic regarding when to perform an SRV lookup for outgoing REGISTER requests

With this fix, we only will perform an SRV lookup at the following times:

* The first time we register with a remote registrar
* If we send a REGISTER but do not receive a response
* If the sendto() function returns an error

While I wrote the patch that fixes this issue, a huge amount of credit is due
to Brett Bryant, who wrote the initial patch on which I based this one.

(closes issue ASTERISK-11734)
Reported by: jrast
Patches:
     12312.patch uploaded by putnopvut (license 60)
Tested by: blitzrage

Review: http://reviewboard.digium.com/r/132/


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=173771

By: Digium Subversion (svnbot) 2009-02-05 17:20:26.000-0600

Repository: asterisk
Revision: 173772

_U  branches/1.6.1/

------------------------------------------------------------------------
r173772 | mmichelson | 2009-02-05 17:20:25 -0600 (Thu, 05 Feb 2009) | 34 lines

Blocked revisions 173771 via svnmerge

................
r173771 | mmichelson | 2009-02-05 17:19:51 -0600 (Thu, 05 Feb 2009) | 27 lines

Blocked revisions 173770 via svnmerge

........
r173770 | mmichelson | 2009-02-05 17:19:16 -0600 (Thu, 05 Feb 2009) | 20 lines

Fix logic regarding when to perform an SRV lookup for outgoing REGISTER requests

With this fix, we only will perform an SRV lookup at the following times:

* The first time we register with a remote registrar
* If we send a REGISTER but do not receive a response
* If the sendto() function returns an error

While I wrote the patch that fixes this issue, a huge amount of credit is due
to Brett Bryant, who wrote the initial patch on which I based this one.

(closes issue ASTERISK-11734)
Reported by: jrast
Patches:
     12312.patch uploaded by putnopvut (license 60)
Tested by: blitzrage

Review: http://reviewboard.digium.com/r/132/


........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=173772