Summary: ASTERISK-21378: chan_sip completely blocks on DNS lookups
Reporter: Jaco Kroon (jkroon)
Labels:
Date Opened: 2013-04-03 14:25:12
Date Closed: 2013-05-20 07:38:23
Versions: 11.3.0 Frequency of
duplicates ASTERISK-14675 chan_sip lockup by DNS and connect timeouts
duplicates ASTERISK-18930 Asterisk stops responding to SIP devices if it loses Internet Access (DNS)
duplicates ASTERISK-03638 DNS Error prevents SIP module from functioning correctly.
duplicates ASTERISK-17214 DNS Lookup blocking registration
is related to ASTERISK-17722 SIP SRV lookups for registration discard the port when dnsmgr disabled (the default)
Environment: Gentoo Linux, asterisk 11.3.0
Attachments:
Description: One of the bigger ISPs in South Africa decided to blow up their entire network today.  Our setup has quite a number (16 to be exact) of register lines of the form:

register => 2787....:secret@sip.iburst.co.za/087....

As soon as they decided to press the big red button to take down their network and we hit a SIP reload ... *boom* - we could for the life of us not get asterisk back up into a working state.  We have a sip peer looking like this:

host = sip.iburst.co.za

Knowing that iBurst had gone down, spotting this log entry suggested that DNS was to blame:

[Apr  3 19:36:12] ERROR[27636] netsock2.c: getaddrinfo("sip.iburst.co.za", "(null)", ...): Name or service not known

So I commented out the register lines and, lo and behold, it takes about 20 seconds longer than usual for asterisk to start servicing the :5060 udp socket (normally a watch netstat -nulp won't ever show the Recv-Q being anything other than 0; currently it keeps climbing for around 20 seconds before dropping back down to zero).

With the register lines uncommented you can forget about sane operation.  It will not happen.  In fact, the only way for me to recover is to kill -9 asterisk.

I currently have dnsmgr disabled, even though I can see (from the code) that the handling differs with dnsmgr enabled, and it does make more sense for me to have it enabled anyway.

I'm not sure what the best way would be to handle this, but I suspect that registrations need to happen in a separate thread, and DNS lookups should probably happen without any locks held in chan_sip.
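To illustrate the two ideas in the previous paragraph (registration work on its own thread, and no lock held while the resolver runs), here is a minimal Python sketch. It is not chan_sip code: the `peers` table, `peers_lock`, `refresh_peer` and `start_registration` are all hypothetical names invented for the example, and the resolver is passed in so the slow part can be swapped out.

```python
import socket
import threading

peers = {}                     # hypothetical table: peer name -> resolved IP
peers_lock = threading.Lock()  # protects `peers` only


def refresh_peer(name, host, resolver=socket.gethostbyname):
    """Resolve `host` with no lock held, then update the table under the lock."""
    try:
        ip = resolver(host)    # slow, may block for seconds: lock is NOT held here
    except OSError:
        return                 # lookup failed: keep whatever address we had
    with peers_lock:           # lock is held only for the quick dictionary update
        peers[name] = ip


def start_registration(name, host, resolver=socket.gethostbyname):
    """Run the lookup on a separate thread so the caller is never blocked."""
    worker = threading.Thread(
        target=refresh_peer, args=(name, host, resolver), daemon=True
    )
    worker.start()
    return worker
```

With this shape, a thread that only needs to read `peers` waits at most for a dictionary update, never for a DNS timeout.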

For the moment (since none of the peers I need to peer with use SRV records, and their DNS should not change that often) I might be better off performing the DNS lookups outside of asterisk and just hard-coding the IPs into the config.  From a rudimentary test this seems to work quite well (asterisk is back to normal behaviour of starting up chan_sip in a VERY short time frame).

A quick test with dnsmgr enabled, but utilizing DNS names again instead of IP addresses results in completely broken behaviour again.
Comments:

By: Jaco Kroon (jkroon) 2013-04-03 15:58:30.914-0500

After discussion in #asterisk on IRC a few things became clear:

* Fixing this issue is invasive to the current chan_sip.
* The new design for sip in asterisk-12 should (will?) not suffer the same problems (also referring to other configuration reload issues).
* Chances of this bug getting fixed pre-12 are slim to none (which is sad).

So here are a few suggestions to mitigate the risk of this problem striking you and to improve performance in general (I must point out that NONE of this fixes the problem, as I'll explain later).

1.  You should list all your local IPs (as shown by "ifconfig" or "ip ad sh") in /etc/hosts - this is reasonable as most systems do this anyway.  If you have one or two dynamic IPs, however, this becomes trickier.  In my case above I don't.

2.  Run a local DNS cache and have /etc/resolv.conf point to that.  I *always* run djb's dnscache on all my machines anyway, it's fast and reliable (http://cr.yp.to/ - it's old though).  Having the cache local reduces latency on successive DNS lookups; in my *normal* case above this saves around 60 odd DNS queries from leaving the machine (search and domain lines in /etc/resolv.conf often cause more harm than good; fortunately the authoritative servers for my search lines are in the same cabinet and have a response time of <1ms).

This will unfortunately only improve chan_sip load times if there are no serious external problems.  In a case like the one above, where sip.iburst.co.za cannot be resolved at all and all the authoritative name servers for iburst.co.za are gone from the face of the earth, you're still stuffed if you don't have a locally cached record.  And to make matters worse - you're waiting for two DNS timeouts, first for SRV _sip._udp.${sipdomain} and then for A ${sipdomain}.  Should the SRV record resolve and list 10 other names to be looked up, all of which fail, the problem becomes even worse.  Consider for example a SRV record for _sip._udp.me.co.za that lists (sip1.iburst.co.za, sip2.iburst.co.za ... sip10.iburst.co.za): you then wait for all 10 of those lookups to time out.
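To put rough numbers on how those timeouts stack up, here is a small back-of-the-envelope sketch. The figures are assumptions, not measurements from this report: it uses the glibc resolver defaults documented for /etc/resolv.conf (timeout:5, attempts:2), a single nameserver, no caching, and lookups performed strictly one after another. The function name is made up for the example.

```python
def worst_case_block_seconds(failing_lookups, timeout=5.0, attempts=2, nameservers=1):
    """Rough upper bound on time spent waiting for lookups that never get an answer."""
    return failing_lookups * timeout * attempts * nameservers


# One unresolvable host: an SRV query followed by an A query.
one_host = worst_case_block_seconds(2)

# 16 register lines pointing at that host, each looked up separately.
sixteen_register_lines = worst_case_block_seconds(16 * 2)

# An SRV answer that resolves but lists 10 targets whose A lookups all time out.
srv_with_ten_targets = worst_case_block_seconds(10)

print(one_host, sixteen_register_lines, srv_with_ten_targets)
```

Under those assumptions the three cases come to 20, 320 and 100 seconds respectively, all spent with the calling thread blocked.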

There are *risky* ways to mitigate the risk further.  Specifically, you can "replicate" the external zones into /etc/hosts (this won't work with SRV records in the mix).  For example, let's say sip.iburst.co.za normally (when it works) resolves to a particular IP address; you can then add sip.iburst.co.za with that address into /etc/hosts, and disable SRV lookups in sip.conf (srvlookup=no).  This obviously won't work if you need SRV records.

As another mechanism, if you use a config generator, whenever you place a hostname into the config file, look up the desired IP in the config generator and put the IP into the asterisk config instead.  This prevents the need for making DNS lookups in chan_sip, preventing chan_sip from needlessly blocking.  This suffers similar risks to the above /etc/hosts solution.  If desired, store the looked up RR somewhere on disk in a text file so that you can re-use the lookup again at a later stage if a newer lookup fails.
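A minimal sketch of that config-generator idea in Python, assuming a JSON file as the on-disk store. The function name, the cache file format and the injectable `resolver` parameter are all choices made for this example, not part of any existing tool.

```python
import json
import socket
from pathlib import Path


def resolve_for_config(hostname, cache_file, resolver=socket.gethostbyname):
    """Return an IP for `hostname`, falling back to the last stored answer.

    `resolver` is injectable so the function can be exercised without DNS.
    """
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    try:
        ip = resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        if hostname not in cache:
            raise    # never resolved before: nothing safe to write out
        return cache[hostname]  # newer lookup failed: re-use the stored record
    cache[hostname] = ip
    cache_path.write_text(json.dumps(cache, indent=2))
    return ip
```

The generator would then emit something like `host = ` followed by the returned address instead of the hostname. The stored address can of course go stale if the provider renumbers, which is the same risk as with the /etc/hosts approach.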

Another suggestion was to see if we cannot perhaps localize any changes to dnsmgr.  The changes that were mentioned specifically were as follows:

1. Alter the scheduler to refresh on DNS TTL values.
2. Coalesce lookups for the same host and type (currently multiple register lines as per above will still result in multiple DNS queries being generated).
3. It's unclear whether SRV lookups are being handled by dnsmgr or not at this point.
4. Configurable DNS timeout failure (eg, normally my lookups succeed in <5ms, so set failure time to 50ms)
5. Re-use stale records in case of DNS failure.
6. Store DNS lookups into astdb to cache over asterisk restarts.

I seriously doubt all of these changes are required, however, from a quick scan we will need at least (4) and (5).  If (3) is of such a nature that SRV records are dealt with by DNSMGR then it's sufficient, otherwise, SRV support in chan_sip should be disabled to ensure that this issue won't strike.
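To make items (2), (4) and (5) concrete, here is a small Python sketch of a lookup cache with a short configurable timeout and stale-record re-use. It is not dnsmgr code: the class and its parameters are hypothetical, the resolver is any callable taking `(host, rrtype)`, and a fixed `ttl` stands in for the per-record TTL that item (1) would use.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout


class StaleOkCache:
    """Cache answers per (host, type); on failure or timeout, serve the old answer."""

    def __init__(self, resolver, ttl=300.0, timeout=0.05, clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl            # fixed TTL; a real version would use the record's TTL
        self._timeout = timeout    # item 4: give up quickly instead of blocking
        self._clock = clock
        self._entries = {}         # (host, rrtype) -> (answer, expires_at)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def lookup(self, host, rrtype="A"):
        key = (host, rrtype)
        cached = self._entries.get(key)
        if cached and cached[1] > self._clock():
            return cached[0]       # item 2: repeated lookups share one cached entry
        future = self._pool.submit(self._resolver, host, rrtype)
        try:
            answer = future.result(timeout=self._timeout)
        except (LookupTimeout, OSError):
            if cached:
                return cached[0]   # item 5: a stale record beats no record
            raise
        self._entries[key] = (answer, self._clock() + self._ttl)
        return answer
```

Note that a timed-out resolver call keeps running on its worker thread; the caller simply stops waiting for it. Persisting `_entries` across restarts (item 6) is left out.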

By: Rusty Newton (rnewton) 2013-04-03 17:15:02.815-0500

Assigning this to Matt to make sure he gets a chance to look at it.

By: Rusty Newton (rnewton) 2013-04-03 17:16:06.827-0500

Linked all issues which appear to be reports of the same issue in the past. Some of them have discussion and reasoning on the issue and why it hasn't been fixed so far.

By: Jacek Konieczny (jkonieczny) 2013-05-20 02:49:55.925-0500

The problem may be much more serious, as a single misconfigured SIP device may trigger a blocking DNS lookup, causing chan_sip lock-ups and log entries like these:

Apr 24 13:06:47 pbx asterisk[2094]: ERROR[2254]: netsock2.c:263 in ast_sockaddr_resolve: getaddrinfo("broken", "5060", ...): Name or service not known
Apr 24 13:06:47 pbx asterisk[2094]: WARNING[2254]: chan_sip.c:16041 in check_via: Could not resolve socket address for 'broken:5060'

In this case the problem should probably be mitigated by https://reviewboard.asterisk.org/r/2400/ , but there may be other cases where a client request triggers a DNS lookup.

By: Matt Jordan (mjordan) 2013-05-20 07:38:06.683-0500

I know Olle is looking into this problem and may be doing some work in this area. He may have some insight and/or work that will alleviate this problem. I'm not sure how far down the DNS rabbit hole he's going however, so it may not fix all of the problems you've alluded to here.

There are really two problems at play here:
# DNS is not asynchronous. Performing a DNS lookup blocks the calling thread.
# {{chan_sip}} is single threaded.

Both of these items would require massive rewrites of {{chan_sip}}.

For DNS to be asynchronous, {{chan_sip}} would have to have a callback for each DNS lookup. Whenever a DNS lookup occurs, {{chan_sip}} would have to return from that point and resume with its state restored when the DNS lookup completes. I can't even begin to scope out the scale of this work. {{chan_sip}} is structured in such a way that once a request/response handling begins, it is expected to run to completion for that request/response. There is no way to defer additional handling for later processing or for another thread.
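As a generic illustration of the "return at the lookup and resume when it completes" pattern described above, here is a small Python asyncio sketch. It says nothing about how chan_sip or the new driver is actually implemented: `register`, `service_socket` and `blocking_lookup` are hypothetical stand-ins, and the blocking resolver is pushed onto a worker thread via `run_in_executor`.

```python
import asyncio
import socket


def blocking_lookup(host):
    """Stand-in for any resolver call that blocks its calling thread."""
    return socket.gethostbyname(host)


async def register(host, lookup=blocking_lookup, timeout=2.0):
    """Suspend at the lookup; resume with local state intact once it completes."""
    loop = asyncio.get_running_loop()
    try:
        ip = await asyncio.wait_for(loop.run_in_executor(None, lookup, host), timeout)
    except (OSError, asyncio.TimeoutError):
        return f"{host}: lookup failed, will retry later"
    return f"{host}: would send REGISTER to {ip}"


async def service_socket(ticks, interval=0.01):
    """Stand-in for the loop that keeps draining the SIP socket."""
    handled = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        handled += 1
    return handled


async def main(lookup=blocking_lookup):
    # Both run concurrently: a slow lookup does not stop the socket being serviced.
    return await asyncio.gather(register("sip.iburst.co.za", lookup), service_socket(5))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The coroutine gets the resume-with-state behaviour for free from the language runtime; in a C code base written to run each request to completion, that state would have to be saved and restored by hand, which is the scale of work being described.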

Similarly, {{chan_sip}} being made multi-threaded would invalidate the threading model (such as it is) in {{chan_sip}}. Often, we "know" that only a single thread is processing requests/responses, and certain operations take place sequentially because of it. Opening this up to multiple threads would, again, invalidate much of the structure in {{chan_sip}}.

There is no way to address these problems in a release branch without:
* Consuming significant developer resources
* Injecting a huge amount of risk into release branches
* Requiring a substantial testing effort from the entire Asterisk user community

So what about trunk?

This is why we wrote a new SIP channel driver.
# It is multi-threaded. Its entire design assumes multiple threads servicing requests/responses and processing them in a well defined stack.
# It uses asynchronous DNS.

I cannot foresee a general effort attempting to resolve this problem in {{chan_sip}}.

Now, all of that being said, Olle does good work. He may have a solution that you can try and that may also be generally applicable to release branches of Asterisk. It would be a good idea to contact him and see if you can assist with his development and testing efforts.

By: Jaco Kroon (jkroon) 2013-05-20 09:18:25.686-0500

Olle is dealing with SRV records.  His work will in a manner depend on this being fixed, or this will need to be fixed separately.

By: Olle Johansson (oej) 2013-08-26 08:56:02.915-0500

Yes, I'm focusing only on SRV record support right now, not the async part. Install a caching DNS resolver like bind on the same host and most cases will be fine. I don't believe that Matt is completely correct in what he writes: we should be able to improve the situation a lot for starts and reloads just by being a bit clever and using an async DNS library - but I think he is correct for calls. If we fix peers and reloads we have come a long way.