|Summary:||ASTERISK-19106: SIP registration fails after temporary dns failure|
|Reporter:||Timo Teräs (fabled)||Labels:|
|Date Opened:||2011-12-26 03:26:08.000-0600||Date Closed:||2012-01-30 17:27:01.000-0600|
|Environment:||Attachments:||( 0) fix-sip-registration-after-dns-fail.patch|
( 1) fix-sip-registration-after-dns-fail2.txt
( 2) fix-sip-registration-after-dns-fail3.txt
( 3) fix-sip-registration-after-dns-fail4.txt
|Description:||I have sip registrations using DNS name, and dnsmgr enabled. If I get temporary DNS lookup failure at asterisk startup, the SIP registrations will fail (until DNS works and sip reload is done).|
There seems to be two separate problems:
1. ast_sockaddr loses the port number (set when loading configuration) because dnsmgr sets the sockaddr to AST_AF_UNSPEC on failure. later on when the lookup succeeds, the port is reset to zero when the code in dnsmgr_refresh() tries to salvage the port from previous lookup.
2. the registry related sip peer is not updated properly: sip_find_peers() used to update the entry is different form create_addr() calls peer name.
See attached patch that fixes it.
|Comments:||By: Timo Teräs (fabled) 2011-12-26 03:28:49.075-0600|
Regarding patch, the hunk #1 just removes the redundant sip peer update code. It's unneeded as transmit_register() does the same thing.
In hunk #2 we have two fixes:
- the r->us port number is updated always if it's zero (due to the dnsmgr bug) -- it anyway does not make sense to register to port zero
- sip_find_peer() call arguments are updated to reflect how create_addr() is called, so we actually find the correct sip peer that needs the address updated
By: Matt Jordan (mjordan) 2011-12-27 07:43:41.669-0600
Is this also a problem in 1.8?
By: Timo Teräs (fabled) 2011-12-27 07:50:17.425-0600
Looking at the code for 1.8 branch it appears to suffer from the same issue. I have not explicitly tested it, but the relevant parts of the code are identical (no changes between 1.8 to 1.10). So yes, I would say 1.8 is affected too.
By: Terry Wilson (twilson) 2012-01-18 11:15:36.583-0600
I have been unable to reproduce this issue. My steps to try to reproduce:
(which I uncomment after load to allow DNS lookups)
register => x@testx
Both with and without the patch, I get proper behavior of the registration working after the refreshinterval. I like the way the patch looks, I would just be more comfortable making a change (changes in chan_sip can almost always cause unintended regressions), if I could demonstrate that it was fixing a specific issue. Can you give me a set of configs and actions that will absolutely reproduce this issue? (btw, I'm running from the 10.0 branch).
By: Terry Wilson (twilson) 2012-01-18 11:17:59.897-0600
Actually, I just ran the test again with your patch applied and it *started* failing where it didn't before.
By: Timo Teräs (fabled) 2012-01-19 00:07:14.057-0600
Your test case is not correct. My problem occurs only on _dns failure_. Since you provide an alternate DNS address in hosts file this does not trigger the problem.
1. Have any valid configuration where it registers to a SIP server
2. Make DNS resolving attempts _FAIL_. It means, have your resolv.conf empty, or unplug the machine from network. That is dig / host tool must return TEMPORARY ERROR.
3. Start asterisk. And wait it to be fully booted so the first registration attempt happens.
4. Reconnect network, or add back the name server.
5. See with 'asterisk -r' that the SIP registration never happens.
As long as the DNS service (or host file) returns an IP-address for Asterisk things will work. But if you get a "temporary error" at any give time, the dnsmgr code will fail.
Could you please describe what kind of problems you started to have with the patch applied?
By: Timo Teräs (fabled) 2012-01-19 00:14:11.373-0600
Actually, your test case might be enough. I was too hasty comment about that, sorry.
The only difference is, that I'm not using explicit "register" lines in [general], but I'm instead using the [extension] block with "callbackextension=foo" that will generate the registration entry for me. Given how those two cases have certain different code paths, this could be a prerequisite for the problem.
By: Terry Wilson (twilson) 2012-01-19 22:57:28.368-0600
Please verify that the patch I just added still fixes the issue for you. It fixes my tests. The changes are 1) remove setting r->portno to the default as it will always be set to a defined port or default for sip/sips and 2) There were cases were peer->addr would be set, but have no port. So, in that case go ahead and copy over the address from r->us as well.
By: Timo Teräs (fabled) 2012-01-20 03:04:25.197-0600
Your patch seems to fix the registration issue. It successfully registers with it. Tested with 10.0.1 + patch.
However, it's not still working as expected.
I'm not sure if this happens with both patches, or only with yours. But it seems that after the temporary error which is followed by a registration, the incoming calls do not still work properly.
This happens because the incoming call does not match the peer entry. This is very likely because it ends up doing search by IP-address for the peer entry. Since the peer's by-IP hash changes, but "peers_by_ip" ao2 container is not updated accordingly. We should unlink/link to peers_by_ip when ever the ip changes.
By: Terry Wilson (twilson) 2012-01-23 23:15:54.322-0600
Ok, I'm attaching a new (more radical) patch. dnsmgr updating the address without letting the channel driver know that it has seems like a bad idea since *any* time there is an update (not just at registration), we need to know about it so we can take the appropriate action (like re-linking a peer in peers_by_ip).
This patch adds and update callback to dnsmgr, and (hopefully) properly tracks refcounts for items using dnsmgr (before, it would have been possible for a refresh thread to update the addr while/after the object's destructor was called.
I've tested a lot on 1.8, but went ahead and made an Asterisk 10 version for you to test with. Please let me know if it works for you.
By: Terry Wilson (twilson) 2012-01-24 18:35:28.096-0600
This patch remembers to set *->dnsmgr = NULL after calling ast_dnsmgr_release().
By: Timo Teräs (fabled) 2012-01-25 04:27:45.162-0600
I tested with 10.0.1 and backported patch (one trivial change in one hunk to make it apply).
This indeed seems to fix all problems for me. Thank you.