ASTERISK-03854: Register hangs using SRV records when host is down

[Home]

Summary: ASTERISK-03854: Register hangs using SRV records when host is down

Reporter: tootai (tootai) Labels:

Date Opened: 2005-04-04 07:09:07 Date Closed: 2005-08-15 12:00:39

Priority: Minor Regression? No

Status: Closed/Complete Components: Channels/chan_sip/Registration

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) crash.gz
( 1) messages.lastrun

Description: When using several register and host sip peers, the sip reload command take a while to reload or hangup on DNS lookups. An information "Previous sip reload not yet done" is sometime sended. A workaround is to put IP address instead of FQDNs which is not a solution for register statement dixit OEJ.

****** ADDITIONAL INFORMATION ******

Seems that more you add register statements/SIP peers, more the problem will appear.

Comments: By: Mark Spencer (markster) 2005-04-04 09:23:08

If your DNS is slow, don't use it.
By: tootai (tootai) 2005-04-04 12:17:57

Don't think it's a DNS server problem:

1. he's on the same lan
2. it only happens after a period without call
3. if I ping FQDN SIP host during this failure I have no problem
4. a restart is making thinks working again
5. no problem with IAX/DNSs register and peers
6. other software don't have this problem.

Daniel
By: Mark Spencer (markster) 2005-04-11 00:21:42

You can supply a backtrace if you think it's stuck somewhere. Just attach gdb as though this were a deadlock.
By: Olle Johansson (oej) 2005-04-17 09:32:27

How many register= statements do you got?

Is this problem still there?
By: tootai (tootai) 2005-04-17 10:48:56

I try to track back with gdb like asked by Mark. Since 4 april I didn't had any problem. I Have 3 register statement.
By: tootai (tootai) 2005-04-21 01:26:18

I think I found a begining of explanation: it's not a 100% DNS problem it's a register problem which could be related with DNS or internet routing.

I had the problem this morning (few times) and my asterisk couldn't register with FWD and e164.org (WARNING: No such host: fwd.pulver.com and WARNING: No such host: tx.voip.e164.org) From this moment asterisk hungs and registering to other providers is delayed.

If I try to ping those host, no reply from e164.org huge delay from pulver.com Few seconds after it's ok and asterisk starts to work. As I told on previous message, I also use IAX registration and have no problem with them, nor with the others SIP register. Was always with those two hosts.

From FWD, even registered, I always get a lot of NOTICE: peer 'fwd-gw' is now UNREACHABLE! Last qualify: 146 and some times after NOTICE: peer 'fwd-gw' is now REACHABLE (146ms / 1600ms) Delay between those states is minimum 10s, up to more than 40s

I have an 1024/128 ADSL line, asterisk box is a RH73 biproc Celeron Medocino 400 Mhz with 440 MB RAM
By: tootai (tootai) 2005-04-25 06:58:46

My version is now CVS-HEAD-04/21/05.

From this afternoon, one registration failed: sip registry gave me "Auth. send" as State. Having waiting 1/2h, I run sip reload: nothing, no message on CLI about the reloaded files (this registrar is the last entry in sip.conf). After sometime I see the IAX registration message but nothing concerning sip.

So I run gdb program <asterisk pid> and got, after all symbols stuff readed and loaded

0x400f6136 in sigsuspend () from /lib/libc.so.6
(gdb)

I then ask gdb to run the program again, asterisk restarted and everything is fine again.
By: Olle Johansson (oej) 2005-05-01 15:26:41

If you get "no such host" now and then, it's clearly a DNS problem in your network. Ask you ISP for more reliable DNS servers...

Kevin has started to add new DNS support for Asterisk, that will propably take care of this.

Currently, I see no potential bug in Asterisk in this bug report. Ok to close?
By: tootai (tootai) 2005-05-02 02:22:16

Hmmh. If it where a DNS problem:

1. why only with SIP register (Never had problem with IAX register)?
2. when I had the problem with fwd and e164 hosts, others registers where ok.

My opinion: it's a timeout registration problem (broken route, server down, ...) not a DNS one.

I updated to CVS-HEAD-05/01/05. Will see what happends with the new implementation. If you want to close this bug, feel free to do it.
By: tootai (tootai) 2005-05-05 02:31:30

This morning, no audio on calls. I check in CLI and see registration timed out for tx.voip.e164.org. I ping the host, no response. On the 3 others registry I'm well registred. For e164.org, I have "Request Send". IAX2, I'm register to 2 off 3 hosts.

Now I run a SIP reload command by calling an dialplan extension from my phone: audio is coming back but still registration timed out with tx.voip.e164.org

So DNS is ok as host is *really* not responding on ping, even on IP ping. And on IAX2, the one host who couldn't be reached don't create such kind of problem.
By: Michael Jerris (mikej) 2005-05-05 05:32:10

can you verify if this has anything to do with using SRV records for real? Does this behavior happen against down hosts where you do not use SRV records?

For example, if you null route one of the fwd hosts, I am guessing this happens, if you null route to other boxes that are listed by IP, does it happen?

By: Olle Johansson (oej) 2005-05-05 07:34:06

Check bug ASTERISK-3760 where I actually give up on registrations that doesn't work, instead of hammering the server endlessly. New updated patch for latest cvs head coming soon, working with some details.
By: tootai (tootai) 2005-05-05 11:36:04

Mike,

as pointed by OEJ before I open the bug, replacing FQDN by IP is not a solution. On the other hand, it could help to determine where exactly is the problem. But remain that each time problem appear, one or two hosts, always the same, FWD and e164.org, are unreachable. And I don't have any problem with my IAX registers.

OEJ, my asterisk is now patched with newregister.txt I use default value. FYI, I know have lots of

chan_sip.c:8217 handle_response: Got authentication request (401) on unknown REGISTER to 'sip:<user>@<domain>;tag=blabla'

for *each* register I have. When I call through one of those register, I have

chan_sip.c:8211 handle_response: Failed to authenticate on INVITE to 'sip:<user>@<domain>;tag=blabla'

and when I hangup,

chan_sip.c:8217 handle_response: Got authentication request (401) on unknown BYE to 'sip:<user>@<domain>'
By: Michael Jerris (mikej) 2005-05-05 11:49:46

I was not suggesting you change the FQDN to IP, I was suggesting you simulate the failure using both IP and FQDN by nullrouting the host on the box to see if the behavior is the same.
By: tootai (tootai) 2005-05-05 12:25:39

OEJ,

in now around 1/2h running with the patch, asterisk crashed 2 times. First time few calls where passed, second time no one. This is the CLI part:

Ouch ... error while writing audio data: : Broken pipe
Ouch ... error while writing audio data: : Broken pipe

From message log file:

To: <sip:xxxxxxx@sipgate.de>;tag=b11cb9bb270104b49a99a995b8c68544.b81c
Call-ID: 004b08cf1298afd5512f2a740ceb87d8@192.168.10.250
CSeq: 128 REGISTER
Contact: <sip:xxxxxxx@xxx.xxx.xxx.xxx>;q=0.00;expires=120
Server: sipgate ser
Content-Length: 0
Warning: 392 217.10.79.9:5060 "Noisy feedback tells: pid=11716 req_src_ip=xxx.xxx.xxx.xxx req_src_port=5060 in_uri=sip:sipgate.de out_uri=sip:sipgate.de via_cnt==1"

May 5 19:14:12 VERBOSE[18719] logger.c: --- (10 headers 0 lines)May 5 19:14:12 VERBOSE[18719] logger.c: --- (10 headers 0 lines)---

Asterisk is CVS-HEAD from 05/01/05. I remove the patch.
By: Olle Johansson (oej) 2005-05-06 16:14:29

There seems to be timing problems. We're getting 401 messages long time after we've cancelled the call. I need to see a FULL debug (SIP debug AND verbose=4 AND debug=4) and also know how long latency you have between you a sipgate.
By: tootai (tootai) 2005-05-07 09:16:09

Here you have. Verbose is set to 5 , debug to 4 and sip debug activated. The uploaded file (crash.gz) contain all the messages from the time as asterisk start to the crash. No call was passed. This is an mtr result immediately after the crash.

Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. kat-ru3.neo.tpnet.pl 0% 48 48 11 11 13 22
2. z.kat_ru3.do.kat-r2.tpnet.pl 69% 15 48 10 9 13 40
3. z.kat-r2.do.war-r3.tpnet.pl 54% 22 48 16 14 17 38
4. do-war-tr3.tptransit.pl 11% 43 48 45 14 18 65
5. So4-1-0.WRSBB1.Warsaw.opentransit.net 0% 48 48 17 15 24 142
6. So2-2-0.FFTCR1.Frankfurt.opentransit.net 0% 48 48 39 37 50 194
7. P1-0.FFTCR3.Frankfurt.opentransit.net 0% 48 48 39 38 42 123
8. Opentransit.FRA-8-eth210-101.de.lambdanet.net 0% 48 48 39 38 40 58
9. DUS-2-pos720.de.lambdanet.net 0% 47 47 42 41 42 57
10. Netzquadrat-DUS.de.lambdanet.net 0% 47 47 42 41 42 44
11. sipgate.de 0% 47 47 42 41 42 44
By: Olle Johansson (oej) 2005-05-08 03:24:43

The registration is actually successful here. Can you produce a trace of the core file (gdb/bt full) so we see where it crashes? We get an extra 200 OK on the registration, but that should not be a problem...
By: tootai (tootai) 2005-05-08 10:46:32

Running from gdb the last outputs are:

--- (7 headers 0 lines)---

<-- SIP read from 213.91.9.221:5060:
SIP/2.0 200 OK
Call-ID: 542e12842f028d5d04ad5a5c3de18e44@192.168.10.250
Contact: <sip:0369570063@83.30.65.163>;expires=120
CSeq: 106 REGISTER
From: <sip:wgUserName@len1.host.wengo.fr>;tag=as2d87c6a2
To: <sip:wgUserName@len1.host.wengo.fr>;tag=00000000000000000000156f43b8
Via: SIP/2.0/UDP 83.30.65.163:5060;received=83.30.65.163;branch=z9hG4bK41e77b4a
Content-Length: 0

--- (8 headers 0 lines)---

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 114696 (LWP 18368)]
handle_response_register (p=0x81aa840, resp=200, rest=0x4068ada0 "OK", req=0x4068ab7c, ignore=0,
seqno=106) at chan_sip.c:8014
8014 if (r->expire)
(gdb)

And result of bt full is

(gdb) bt full
#0 handle_response_register (p=0x81aa840, resp=200, rest=0x4068ada0 "OK", req=0x4068ab7c,
ignore=0, seqno=106) at chan_sip.c:8014
r = (struct sip_registry *) 0x0
#1 0x08154a80 in ?? ()
No symbol table info available.
(gdb)
By: Olle Johansson (oej) 2005-05-08 11:43:29

Guess it's the extra registration... Give me a while to dig into this.
By: Olle Johansson (oej) 2005-05-21 15:00:38

Please check the patch in ASTERISK-3760 and report back. /O
By: tootai (tootai) 2005-05-22 08:11:49

I applied the newregister0521.txt patch to a fresh CVS update

Asterisk CVS-HEAD-05/22/05-14:20:56 built by root@dual.tootai.net on a i686
running Linux

In 30mn of running without any call asterisk died twice without any information. In debug mode, this are the latest logs. I remove the patch and everything is fine.

[root@dual asterisk]# tail -20 messages
Content-Length: 0

May 22 14:00:37 VERBOSE[9930] logger.c: --- (8 headers 0 lines)May 22 14:00:37 VERBOSE[9930] logger.c: --- (8 headers 0 lines)---
May 22 14:00:37 NOTICE[9930] chan_sip.c: Failed to authenticate on REGISTER to 'UUUUUU@len1.host.wengo.fr' (Tries 2)
May 22 14:00:37 VERBOSE[9930] logger.c:
<-- SIP read from 69.90.155.70:5060:
SIP/2.0 200 OK
Via: SIP/2.0/UDP 83.30.66.247:5060;branch=z9hG4bK7a50403a
From: <sip:NNNNNN@fwd.pulver.com>;tag=as506715a8
To: <sip:NNNNNN@fwd.pulver.com>;tag=cb2000b247d89723001a836145f3b053.bde8
Call-ID: 07659e43776477945a8e5a33133c49d5@192.168.10.250
CSeq: 108 REGISTER
Contact: <sip:NNNNNN@iax.fwdnet.net>;q=1.00;expires=973094431
Contact: <sip:NNNNNN@83.30.66.247>;q=0.00;expires=120
Server: Sip EXpress router (0.8.14 (i386/linux))
Content-Length: 0

May 22 14:00:37 VERBOSE[9930] logger.c: --- (10 headers 0 lines)May 22 14:00:37 VERBOSE[9930] logger.c: --- (10 headers 0 lines)---
[root@dual asterisk]#
By: Olle Johansson (oej) 2005-05-22 13:16:06

Ok, so feedback on the patch is bad, but I will need more of the SIP debug in order to see what's wrong. I've been running the patch for a long time without problems... Strange.
By: tootai (tootai) 2005-05-22 15:25:08

Upload file is the complete output logs from the second run in debug mode
By: Smith (eng1neer) 2005-07-02 05:37:56

I am running CVSHEAD 07/01/2005 I have the same problem I have currently rolled back to stable. I have DSL, whenever network drops and reregistration time lapses (105 seconds for all but 1 of my 7 outbound reg accts) system will drop the registration to fail. Once in this state.. system will never attempt to reregister. If U sip reload, the system re registers clients.. Also log shows could be DNS issue then shows a 20,000 second retry timer. I modified the source code to test shorter time. IN CVShead, is 20 * 1000 changed to 20 * 10 and system still will not reregister after showing "State failed" Can be duplicated by taking system off line (downing dsl) waiting for rereg timer to expire allowing system to attempt rereg until failure then reconnecting dsl connection.
By: Smith (eng1neer) 2005-07-02 06:06:59

Tested with stable 1.0.9 same test as above. (I see why the code change) system hammers and hammers but does reconnect upon network reconnect. Also as tootai said, when I have the failure on the cvshead code, iax registry has no problem whatsoever with recovery. tootai also using dsl most likely pppoe as is mine.
By: Smith (eng1neer) 2005-07-03 20:36:35

Patch in 4634 seems to have solved issue
By: Enzo Michelangeli (enzo michelangeli) 2005-07-18 10:54:58

What I have noticed is that (in 1.0.7 at least) when a DNS resolution is waiting for a reply from the DNS server (e.g. because the Internet connection has been lost, or the DNS servers are not answering for some reason) a LOT of functionality of Asterisk is frozen; once the resolver gets over it, the situation returns to normal. I haven't yet inspected the code, so for the time being I can't suggest fixes, but this is definitely what happens.
By: Olle Johansson (oej) 2005-07-18 14:35:46

Yes, we do need asynchronus DNS like Squid and many web stats packages.
By: Enzo Michelangeli (enzo michelangeli) 2005-07-18 21:23:20

> Yes, we do need asynchronus DNS like Squid and many web stats packages.

But wait a sec: we should have it already. In utils.c, ast_gethostbyname() maps to gethostbyname_r(), which is re-entrant and therefore doesn't block other threads (only under *BSD is gethostbyname_r() is internally provided, around line 67 of the same file, with an MT-safe but NOT reentrant - and therefore blocking - implementation, incidentally written by myself!). So why does a pending DNS resolution for even only one host STILL block the whole business??

The only other place of DNS blockage I can think about is ast_search_dns(), which is indeed blocking (it calls ast_mutex_lock()) and is called by enum.c (rarely used) and ast_get_srv(). The latter could perhaps block chan_sip, but only if srvlookup=yes in sip.conf, and the default is "off".

Or perhaps the blockage depends on the program's logic, which I don't know well enough: if e.g. only one thread handles sequentially all the registrations, once that thread it's stuck on one of them, all the other registrations are blocked as well. In that case, more than asynchronous DNS, we just need a a shorter DNS timeout, e.g. with appropriate options in /etc/resolv.conf ...

By: Olle Johansson (oej) 2005-07-20 14:00:58

Check if the patch in ASTERISK-3843 will improve the situation. Thank you!
By: Olle Johansson (oej) 2005-08-15 12:00:21

Problem seems solved with recent patches in other bug reports.