
Summary: ASTERISK-14675: chan_sip lockup by DNS and connect timeouts
Reporter: muenning (muenning)
Labels:
Date Opened: 2009-08-18 09:33:42
Date Closed: 2011-06-07 14:00:46
Priority: Major
Regression? No
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions:
Frequency of Occurrence:
Related Issues: is duplicated by ASTERISK-21378 chan_sip completely blocks on DNS lookups
Environment:
Attachments: (0) debug.txt
(1) history.txt
(2) valgrind17082009.txt
Description: This seems to be related to issue ASTERISK-9259, but as that issue is closed I have to open a new bug. This is a major issue: the loss of the internet connection caused a complete service disruption, which did not happen with the Asterisk 1.2.x I was using until a few weeks ago, for which it seems to have been fixed according to the issue above.

I am now using asterisk-1.6.1.1-r1 on Gentoo with several external SIP accounts and a local mISDN line. Last night my internet connection went down, and as a result all SIP phones went offline because they could not register to Asterisk. I assume the reason is that chan_sip was stuck in DNS lookups while sending registration requests to external SIP providers and was not responding to SIP requests from the phones. Neither SIP debug nor a higher verbosity setting showed anything from the phones (probably because chan_sip did not process the debug request), but tcpdump showed that the requests were being sent. The only messages appearing were like:

[Aug 18 09:56:27] NOTICE[13801] chan_sip.c:    -- Registration for 'xxxx@nikotel-zyxel' timed out, trying again (Attempt #4)
[Aug 18 09:56:27] WARNING[13801] chan_sip.c: Probably a DNS error for registration to xxxx@nikotel-zyxel, trying REGISTER again (after 20 seconds)

I had to kill Asterisk, as issuing a restart gracefully just hung the CLI, probably because chan_sip was not responding to that either. When Asterisk was restarted, the above warning started appearing again, but no CLI sip channel commands were possible (the word sip was not recognized), probably because the channel was not yet initialized. After I added the domain names Asterisk had to look up to the local hosts file, the SIP channel came up and the phones registered, so at least local and ISDN connections were working again. After the internet was up again I could delete the hosts lines and everything was OK again.

So chan_sip should not block in such cases (as was probably fixed with the issue mentioned above), so that other communication can continue in case of an internet/DNS failure.
Comments:

By: David Brillert (aragon) 2009-08-18 10:00:58

Interesting...

I was debugging bug 15109 with valgrind running and noticed some funky dns stuff
valgrind17082009.txt attached

Asterisk 1.4 revision 211807



By: Olle Johansson (oej) 2009-09-03 14:08:55

If DNS breaks, Asterisk will stop. This is an old issue that requires significant coding. If you have unreliable DNS, please run a local DNS resolver, like BIND, on the same host as Asterisk. That way, Asterisk will get proper DNS replies all the time.

By: muenning (muenning) 2009-09-03 17:33:36

It's not about unreliable DNS. As I wrote, it's about internet connectivity, and a loss of connectivity can happen at any time. I am running a local DNS, but that does not help when the internet is down: external addresses still must be resolved. But this is the second failure. The first is when external SIP providers are not reachable (because the internet is down), so the whole SIP stack hangs and all SIP devices go down (including the local ones), as I wrote above. The DNS problem appears after restarting Asterisk.

A local DNS server could only help if it holds copies of all external SIP provider domains, which makes as much sense as using numeric IP addresses instead of provider DNS names. But this is another issue.

I don't know how much coding a fix requires but it is worth it as this flaw literally prevents using (external) VoIP for a setup which _has_ to be operational. Which means _all_ asterisk setups I am using, have installed or am maintaining. If you have another resolution, please let me know.

By: Olle Johansson (oej) 2009-09-04 01:09:29

A local DNS helps in that most of them are asynchronous and will always deliver a DNS response to Asterisk. That is not the case when Asterisk, which does not have asynchronous DNS, has to get its responses directly over the unreliable network. Believe me, a local DNS resolver will help you.

If your Asterisk is really using a local resolver, then this should not happen. If it does, it's probably a bug that has to be traced by adding extra debug code in chan_sip and the DNS modules in Asterisk so we can see where it actually hangs. Possibly debug with GDB as well.
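
To illustrate the difference between a blocking and an asynchronous lookup discussed above, here is a minimal sketch in Python. It is not Asterisk code and not how chan_sip resolves names; it only shows the general idea of handing a lookup to a worker thread and giving up after a deadline so the caller can carry on. The function name resolve_with_timeout, the pool size, and the injectable lookup parameter are assumptions made for this example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shared pool: a slow lookup ties up a worker thread,
# not the thread that is handling SIP traffic.
_resolver_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(hostname, port, timeout=2.0, lookup=socket.getaddrinfo):
    """Return a list of (ip, port) tuples, or None on failure or timeout.

    `lookup` defaults to the blocking socket.getaddrinfo; it is a parameter
    only so the sketch can be exercised without a network.
    """
    future = _resolver_pool.submit(lookup, hostname, port, 0, socket.SOCK_DGRAM)
    try:
        results = future.result(timeout=timeout)
    except FutureTimeout:
        # The worker is still blocked in the lookup, but the caller is free.
        return None
    except socket.gaierror:
        # The resolver answered, but with an error (e.g. unknown host).
        return None
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first two fields of sockaddr are address and port.
    return [info[4][:2] for info in results]
```

A caller would treat None as "no answer yet, try this registration again later" and keep serving other requests. Note the limitation: the worker thread itself stays blocked until the underlying lookup returns, so a small pool can still fill up during a long outage.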

By: muenning (muenning) 2009-09-04 10:29:04

As I wrote, I am using a local DNS (bind-9.4.3_p3 on Gentoo), but this didn't help. And how would a local DNS help when chan_sip tries to connect to a host (with a cached DNS entry or numeric address) that is not reachable (as the internet is down)? This was, as I explained, the first failure, which made me restart Asterisk. While chan_sip is waiting for a timeout on the connection attempt, everything is stuck, as I explained already.

So, if a good fix requires much coding (didn't asterisk have a threaded design?) what about a "quick and dirty" fix, for example like this:

When a connect attempt fails (having blocked for more than, let's say, 5 seconds), wait some reasonable time (maybe 60 seconds is OK) before trying the next one, so other SIP communication, CLI sip commands etc. have a chance to be processed. This would still cause some hiccups in processing phone calls, but the system would work. Wouldn't this be better for a start?
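
The proposal above can be sketched roughly as follows. This is a Python illustration of the idea only, not a patch for chan_sip: one connect attempt is capped at 5 seconds, and after a failure the host is left alone for 60 seconds by remembering a "next attempt" time instead of sleeping. The names RetryGate and try_connect, and the injectable clock and connect parameters, are assumptions made for this example.

```python
import socket
import time

CONNECT_TIMEOUT = 5.0   # give up on a single connect attempt after 5 seconds
RETRY_DELAY = 60.0      # leave a failed host alone for 60 seconds


class RetryGate:
    """Remembers when each host may next be tried, instead of sleeping."""

    def __init__(self, retry_delay=RETRY_DELAY, clock=time.monotonic):
        self._retry_delay = retry_delay
        self._clock = clock
        self._next_attempt = {}

    def may_try(self, host):
        # Hosts that never failed are always allowed.
        return self._clock() >= self._next_attempt.get(host, float("-inf"))

    def record_failure(self, host):
        self._next_attempt[host] = self._clock() + self._retry_delay

    def record_success(self, host):
        self._next_attempt.pop(host, None)


def try_connect(gate, host, port, timeout=CONNECT_TIMEOUT,
                connect=socket.create_connection):
    """Return a connected socket, or None if the host is backed off or unreachable."""
    if not gate.may_try(host):
        return None  # still inside the back-off window: return immediately
    try:
        sock = connect((host, port), timeout)
    except OSError:  # includes timeouts and "connection refused"
        gate.record_failure(host)
        return None
    gate.record_success(host)
    return sock
```

With this shape, an unreachable host costs the caller at most one 5-second stall per minute, which matches the "some hiccups, but the system works" trade-off described above. The connect call itself still blocks for up to the timeout; avoiding that entirely would need non-blocking sockets or a separate thread.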

By: Olle Johansson (oej) 2009-09-04 11:04:19

You have filed this under DNS and I responded. Let's skip DNS and move on to connect if you believe DNS is working for you.

As stated in the bug guidelines, we need sip debug output whenever this happens. In this case, I also need to know more about your configuration - sip.conf entries in general and peer section for this, as well as Linux/unix platform. Thanks for your help.

By: Dan Radio (whys) 2009-09-08 10:34:07

Could this be related to: sip peer qualified failed, asterisk lock?

  https://issues.asterisk.org/view.php?id=13136#110073

I am not sure if qualify was used when this bug presented, but I have similar errors with/without qualify.  If a peer is down or not answering, asterisk locks trying to connect.

By: Leif Madsen (lmadsen) 2009-09-15 13:03:14

Actually I'm going to set this status to Feedback as we're going to wait on the SIP debug and history per the guidelines. Thanks!

By: Dan Radio (whys) 2009-09-17 17:03:46

Submitted history and debug files from 1.6.2.0-rc1.

Call tried on a channel that's not answering.  Takes 3 min 45 seconds to timeout. The machine is up and reachable but not responding to ACKs on port 5067.


from extensions.conf:
exten => 999999,n,Dial(SIP/exchange2/${EXCHUM})


from sip.conf:
[exchange2]
type=peer
host=exch08.uwec.edu
qualify=no
transport=tcp
context=internal
nat=no
;canreinvite=yes
insecure=port,invite
port=5067
disallow=all
allow=ulaw
allow=alaw

By: Leif Madsen (lmadsen) 2010-06-23 11:51:39

Can the reporter verify if this is still an issue on the latest 1.6.2 release?

By: Paul Belanger (pabelanger) 2010-06-30 08:55:24

Suspended due to lack of activity. Please request a bug marshal in #asterisk-bugs on the IRC network irc.freenode.net to reopen the issue should you have the additional information requested.

Further information can be found at http://www.asterisk.org/developers/bug-guidelines

By: Jeremy Visser (jeremy23) 2011-09-28 07:34:28.829-0500

Ping. I am getting this issue on Asterisk 1.8.7.0-1digium1~lucid as far as I can tell.