Summary: ASTERISK-09595: Deadlocks cause SIP clients to stop responding
Reporter: furiousgeorge (furiousgeorge)
Labels:
Date Opened: 2007-06-05 17:27:15
Date Closed: 2007-06-27 13:16:49
Versions:
Frequency of
Environment:
Attachments: ( 0) backtrace_sip.conf_and_extensions.conf
( 1) parking_MRE_errors.txt
Description: Seemingly at random, Asterisk will deadlock such that:

1>  SIP users cannot make outbound calls

2>  Users cannot answer incoming calls.  Their phone isn't connected when they try, and other phones continue to ring.

3>  CLI behavior is strange

4>  The process must be killed with kill -9 in order to restart Asterisk.

5>  This happens anywhere from once a week to once a day.

Things I have tried to resolve this:

1>  New motherboard purchased 3/07

2>  New enclosure purchased 3/07 (thought the old one might be improperly grounded)

3>  New PSU purchased 3/07

4>  New ECC Corsair memory 05/07


I called this a deadlock, but since upgrading to 1.4.4 last week, the logs no longer actually say "Avoiding initial deadlock".  Now they say:

[Jun  5 16:31:30] NOTICE[16219] chan_zap.c: Got event 18 (Ring Begin)...
[Jun  5 16:31:32] NOTICE[16219] chan_zap.c: Got event 2 (Ring/Answered)...
[Jun  5 16:31:33] WARNING[16219] chan_sip.c: No such host: RemoteBrian
[Jun  5 16:31:33] WARNING[16219] app_dial.c: Unable to create channel of type 'sip' (cause 3 - No route to destination)
[Jun  5 16:34:15] WARNING[13616] channel.c: Channel allocation failed: Refusing due to active shutdown
[Jun  5 16:34:15] WARNING[13616] chan_zap.c: Cannot allocate new structure on channel 7
[Jun  5 16:34:21] WARNING[13616] channel.c: Channel allocation failed: Refusing due to active shutdown
[Jun  5 16:34:21] WARNING[13616] chan_zap.c: Cannot allocate new structure on channel 7

claudia tmp # svn info /usr/src/asterisk-1.4.4/
Path: /usr/src/asterisk-1.4.4
URL: http://svn.digium.com/svn/asterisk/tags/1.4.4
Repository Root: http://svn.digium.com/svn/asterisk
Repository UUID: 65c4cc65-6c06-0410-ace0-fbb531ad65f3
Revision: 66977
Node Kind: directory
Schedule: normal
Last Changed Author: russell
Last Changed Rev: 62252
Last Changed Date: 2007-04-27 18:24:33 -0400 (Fri, 27 Apr 2007)

claudia tmp # uname -a
Linux claudia 2.6.18-gentoo-r6 #2 SMP Wed May 9 22:02:20 EDT 2007 x86_64 Dual Core AMD Opteron(tm) Processor 165 AuthenticAMD GNU/Linux

Comments:

By: furiousgeorge (furiousgeorge) 2007-06-05 18:19:43

I forgot to mention, I also replaced 2 x TDM400P 4x3FXO with 1 x Sangoma A200 4x4FXO in 5/07.

By: Joshua C. Colp (jcolp) 2007-06-06 08:42:08

The backtrace is useless, but that's okay.

I notice that the RemoteBrian peer is not defined in sip.conf. If so, chan_sip might be trying to do a DNS lookup on it, which could cause issues if the DNS server is unreachable or slow.

By: furiousgeorge (furiousgeorge) 2007-06-06 14:58:20

Right, I took it out of sip.conf because I read that unattached SIP devices could cause that "Maximum retries exceeded" error, but I forgot to remove the entry from the dial command.

That's actually the account I use when I want to log in, which is why I have it there to begin with.

Do you think that could be causing my issues?  I've taken it out, and I'll let you know if it deadlocks again.
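
For anyone checking their own configuration for the same mistake, here is a rough sketch (not part of Asterisk) that lists names dialed as SIP/<name> in extensions.conf that have no matching [section] in sip.conf. The function names and the sample config text are hypothetical, and the parsing is deliberately simple: it ignores #include files, realtime peers, and dialplan variables, and it will also flag real hostnames that are meant to be resolved through DNS.

```python
import re


def sip_peers(sip_conf_text):
    """Return the [section] names found in sip.conf text, excluding [general]."""
    peers = set()
    for line in sip_conf_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop ; comments
        match = re.match(r"\[([^\]]+)\]", line)
        if match and match.group(1).lower() != "general":
            peers.add(match.group(1))
    return peers


def dialed_sip_targets(extensions_conf_text):
    """Return the peer or host part of every SIP/... target in extensions.conf text."""
    targets = set()
    for line in extensions_conf_text.splitlines():
        line = line.split(";", 1)[0]
        for target in re.findall(r"SIP/([^&,|)\s/]+)", line, flags=re.IGNORECASE):
            if "$" in target:
                continue  # dialplan variable, cannot be checked statically
            # In SIP/exten@peer the peer name follows the @
            targets.add(target.rsplit("@", 1)[-1])
    return targets


def undefined_targets(sip_conf_text, extensions_conf_text):
    """Dial targets that chan_sip cannot match to a configured peer."""
    return sorted(dialed_sip_targets(extensions_conf_text) - sip_peers(sip_conf_text))


if __name__ == "__main__":
    # Hypothetical config text; in practice read /etc/asterisk/*.conf instead.
    sip_conf = "[general]\ncontext=default\n\n[desk1]\ntype=friend\n"
    extensions_conf = "exten => s,1,Dial(SIP/desk1&SIP/RemoteBrian,20)\n"
    print(undefined_targets(sip_conf, extensions_conf))
```

Anything this reports is a name that will not match a configured peer, so it is a candidate for the "No such host" warning in the log above.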

By: Joshua C. Colp (jcolp) 2007-06-07 12:14:25

If Asterisk has to do a DNS lookup, it blocks until the result is returned. For hosts that do not exist it may block for a while and cause strange, bad, or deadlock-like behavior.
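
To get a feel for how long such a lookup can hold up the calling thread, here is a minimal sketch that times a name resolution from the Asterisk host. It is an illustration only, not how chan_sip performs lookups: the helper name `timed_lookup` and its `resolver` parameter are made up for this example, and the resolution runs in a worker thread so the caller can stop waiting after a timeout.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def timed_lookup(host, timeout=2.0, resolver=socket.getaddrinfo):
    """Resolve host in a worker thread; return (status, seconds the caller waited)."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(resolver, host, None)
    try:
        future.result(timeout=timeout)
        status = "resolved"
    except FutureTimeout:
        # The lookup is still running in the worker; only the caller gave up.
        status = "timed out"
    except socket.gaierror:
        status = "no such host"
    finally:
        # Do not wait for a slow lookup here. The interpreter may still wait
        # for the worker thread when the program exits.
        pool.shutdown(wait=False)
    return status, time.monotonic() - start


if __name__ == "__main__":
    status, waited = timed_lookup("RemoteBrian")
    print(f"{status} after {waited:.2f}s")
```

If the status is "timed out", or the wait is more than a fraction of a second, a thread doing the same lookup without a timeout would be stuck for at least that long.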

By: Russell Bryant (russell) 2007-06-18 18:16:30

First off, give 1.4.5 a try.  There have been a lot of these types of issues fixed.

If you still have an issue, please rebuild the system with debug stuff enabled.  Run "make menuselect", go to "Compiler Flags", enable DONT_OPTIMIZE and DEBUG_THREADS.  Then, rebuild and reinstall.

Then, would you be willing to let me log in to diagnose the issue?  Deadlocks are very hard to diagnose without access to the system using gdb.  If this isn't possible, I can provide you with some additional instructions to get the information I need, but it will just take longer.

By: furiousgeorge (furiousgeorge) 2007-06-26 00:40:30

I'm still using 1.4.4, but I will upgrade ASAP.

File's advice seems to have helped, but has not totally solved the issue.

On 6/25 I experienced the strange deadlock-like behavior again.  It went almost 3 weeks without this issue, and that is an improvement, but obviously I'd want to get it totally resolved.

I might have narrowed down the cause:

Parking a call using a Snom's ParkOrbit programmable button causes the same "Maximum Retries Exceeded" error which seems to coincide with past deadlock-like behavior.  (In that case it was usually caused by dialing a peer that wasn't currently registered.)

Looking at the SIP debug output (which I barely comprehend at all), I see that for some reason the phone is sending a 487 cancellation request after the call has already been parked.  I don't know if this has anything to do with anything.

I've uploaded a snippet of the SIP debug output. I started the capture after completing the call but before parking it, because otherwise the output quickly grows to thousands of lines.

As to logging into my server and checking it out, I'm glad you offered.  I'll see you on IRC and talk to you about it further, if that's OK.

By: Russell Bryant (russell) 2007-06-27 12:41:15

Once you are running the latest version, it would be nice to see if you are still able to reproduce the deadlock.  If so, I would be happy to log in to the machine.

By: Russell Bryant (russell) 2007-06-27 13:16:48

After a discussion with the reporter on IRC, it doesn't sound like the problem exists using the latest version.  However, if the problem comes back up, please report it and I will gladly look into it.  Thanks!