[Home]

Summary:ASTERISK-13712: Random deadlocks leading to lockup on 1.4.23.1
Reporter:Alistair Cunningham (acunningham)Labels:
Date Opened:2009-03-08 19:36:55Date Closed:2011-06-07 14:03:20
Priority:MajorRegression?No
Status:Closed/CompleteComponents:General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20090325__debug_bug14625.diff.txt
( 1) 20090326__debug_bug14625.diff.txt
( 2) 20090330__debug_bug14625.diff.txt
( 3) fd1.txt
( 4) fd2.txt
( 5) fd3.txt
( 6) gdb1.txt
( 7) gdb2.txt
( 8) gdb3.txt
( 9) gdb4.txt
(10) ps1.txt
Description:Random channels seem to be giving deadlocks on a 1.4.23.1 system:

# tail -f /var/log/asterisk/full | grep -i deadlock
[Mar  8 20:35:37] DEBUG[4249] channel.c: Avoiding initial deadlock for channel '0x8779688'
[Mar  8 20:35:37] DEBUG[4249] channel.c: Avoiding initial deadlock for channel '0x87ff698'
[Mar  8 20:35:42] DEBUG[4249] channel.c: Avoiding initial deadlock for channel '0x8314468'
[Mar  8 20:35:46] DEBUG[4249] channel.c: Avoiding initial deadlock for channel '0x861ddd0'


I've also seen "Avoiding deadlock" as well as "Avoiding initial deadlock", though these are less common. The system is continually busy at medium load (tens of calls). Over a period of days, the problem gets worse and worse until Asterisk stops handling calls and needs to be killed with "kill -9". Compiling with DEBUG_THREADS and running "core show locks" always shows:

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=======================================================================

with no further output. I've tried the patch in ticket ASTERISK-12416 but this doesn't help.

I'm not sure what debugs would be useful, but if someone can advise what would be useful I'm happy to take them.
Comments:By: Tilghman Lesher (tilghman) 2009-03-08 20:26:20

You'll need to run:  asterisk -rx 'core show locks' when the lockup actually occurs.  Running it prematurely won't help.  In terms of the DEBUG messages, these can safely be ignored (they are DEBUG, after all, and not at level ERROR or WARNING).

By: Alistair Cunningham (acunningham) 2009-03-08 20:39:02

Unfortunately when Asterisk has locked up, it no longer responds to the console either. Console commands hang forever.

By: Joshua C. Colp (jcolp) 2009-03-09 08:41:38

Attaching gdb to the process and getting thread apply all bt would provide some insight to what is happening inside of Asterisk.

By: David Woolley (davidw) 2009-03-09 14:00:33

Often when it locks up, the current console is locked, but you can still start a new console to just run core show locks.

By: Alistair Cunningham (acunningham) 2009-03-10 04:17:55

Attached is gdb1.txt, which is taken at the quietest time of day (only 5 calls active). It doesn't mean much to me. Does it give you any useful information?

By: Alistair Cunningham (acunningham) 2009-03-10 04:19:21

By the way, Asterisk had only been running for around 24 hours so there probably aren't many locks built up. We can wait for a week or so to let plenty build up if you like.

By: Joshua C. Colp (jcolp) 2009-03-10 08:28:05

Unfortunately this isn't very useful as a long of details on the threads themselves are missing - was this unoptimized?

By: Alistair Cunningham (acunningham) 2009-03-13 23:32:27

No, it wasn't. I've recompiled Asterisk with DONT_OPTIMIZE, restarted it, waited for around 72 hours, then run gdb again. The output is in attached file gdb2.txt. At the time this was taken, there were 15 active calls and 29 active channels.

By: Alistair Cunningham (acunningham) 2009-03-17 18:03:07

Does gdb2.txt show anything suspicious?

By: Tilghman Lesher (tilghman) 2009-03-18 12:01:17

No, it doesn't.  It does not appear that there's even a deadlock at all, since 'core show locks' shows nothing.

By: Alistair Cunningham (acunningham) 2009-03-18 18:35:44

I guess we'll have to wait until the next complete lock-up and run traces then. I'll update the ticket when this happens.

By: Alistair Cunningham (acunningham) 2009-03-23 18:49:11

Asterisk stopped accepting connections on the console:

asterisk0:~# asterisk -rx 'core show locks'
asterisk0:~# asterisk -r
Asterisk 1.4.23.1, Copyright (C) 1999 - 2008 Digium, Inc. and others.
Created by Mark Spencer <markster@digium.com>
Asterisk comes with ABSOLUTELY NO WARRANTY; type 'core show warranty' for details.
This is free software, with components licensed under the GNU General Public
License version 2 and other licenses; you are welcome to redistribute it under
certain conditions. Type 'core show license' for details.
=========================================================================
Connected to Asterisk asterisk/asterisk.ctl currently running on No more connections allowed
(pid = 0)
No more connections allowed
*CLI>
Disconnected from Asterisk server

Asterisk needed to be killed with "kill -9". It didn't respond to a plain kill. During the problem, calls continued to be handled correctly but it's my suspicion (based on previous instances) that had we let it run longer call processing would have stopped.

I will shortly be uploading 3 files taken before restarting asterisk:

1. gdb3.txt: Output of gdb with "thread apply all bt".

2. ps1.txt: Output of "ps -ef".

3. fd1.txt: Output of "ls -l" in /proc/<asterisk pid>/fd.

By: Alistair Cunningham (acunningham) 2009-03-23 18:54:57

The files have been uploaded. ps1.txt had some minor editing to remove 12 enswitch_routed lines that revealed confidential information.

By: Tilghman Lesher (tilghman) 2009-03-24 10:18:47

Approximately how many calls were you running on your Asterisk instance when this lockup occurred?  Also, do you process your calls with AGI or do you process them in-dialplan?

By: Tilghman Lesher (tilghman) 2009-03-24 10:38:22

Oh, and BTW, this isn't actually a deadlock situation.  You're running out of file descriptors, so a quick fix for this issue would be to run:  'ulimit -n 32768' prior to starting up Asterisk.  We still need to figure out where this file descriptor leak is coming from, but at least that will allow you to maintain a longer uptime.

By: Alistair Cunningham (acunningham) 2009-03-24 17:55:27

Thank you for the prompt follow-up.

We can't actually tell how many calls there were as we were unable to connect to the console, but at that time time of day there are typically between 5 and 20 calls.

All calls are processed using FastAGI after a little manipulation of called/calling numbers in extensions.conf.

We already have the following set in /etc/security/limits.conf:

* hard nofile 65535
* soft nofile 65535
* hard nproc 65535
* soft nproc 65535

so I don't think we can increase the number of file descriptors beyond what's already set.

By: Alistair Cunningham (acunningham) 2009-03-24 21:05:02

The problem has just happened again. I did some testing of the ulimits, and see that nproc for the Asterisk process was actually 1024 rather than 65535. It seems that the limits.conf setting is not being applied. I've therefore forced it on the command line when restarting Asterisk. We'll see if this helps.

By: Abhay Gupta (agupta) 2009-03-25 03:22:16

We are also having this issue . I have updated the ulimits on my server and will let you all know the status in a few hours time .

By: Abhay Gupta (agupta) 2009-03-25 11:00:39

The status is same even after setting ulimits .

By: Tilghman Lesher (tilghman) 2009-03-25 17:51:31

Okay, I have a patch ready for you, for debugging this issue.  After applying this patch, run 'make menuselect' and under Compiler Options, enable 'DEBUG_FD_LEAKS', then recompile.

During a production run, run the command 'core show fd', capture the output, and upload the result AS A FILE to the file upload area on this issue.

By: Alistair Cunningham (acunningham) 2009-03-25 19:22:16

The problem has occurred again, and looks more like deadlocks than file descriptors.

The CPU load on the machine was rising over time, and when I loaded the Asterisk console and ran "core show locks" it produced no output (not even the headers). I ran it again and the Asterisk console locked up. Loading a new shell and running "asterisk -rx 'core show locks'" locked up as well. It took a "kill -9" on the main Asterisk process to free them. Attached in the next minute or so will be gdb4.txt from just before killing Asterisk.

We do regularly connect to the manager interface. Perhaps the file descriptors problem is caused by the deadlocks problem? My theory (and it is just a theory) is that when a process connects to the manager interface it hangs due to the deadlocks problem. This then consumes a file descriptor. A minute later another connection is made, consuming another file descriptor, and so on.

I'll try your DEBUG_FD_LEAKS patch after hours tonight.

By: Alistair Cunningham (acunningham) 2009-03-26 01:02:29

I've applied the patch and when re-running "make", I get:

Generating embedded module rules ...
make[1]: Entering directory `/usr/src/asterisk-1.4.23.1/utils'
  [CC] astman.c -> astman.o
  [CC] md5.c -> md5.o
  [LD] astman.o md5.o -> astman
astman.o: In function `login':
/usr/src/asterisk-1.4.23.1/utils/astman.c:636: undefined reference to `__ast_fdleak_socket'
collect2: ld returned 1 exit status
make[1]: *** [astman] Error 1
make[1]: Leaving directory `/usr/src/asterisk-1.4.23.1/utils'
make: *** [utils] Error 2

By: GPSH (gaurav6628) 2009-03-26 01:49:10

Same has happened with Asterisk-1.4.24 after enabling Debug FD leaks in menuselect.
Error attached below

astman.o: In function `login':
/usr/src/asterisk/asterisk-1.4.24/utils/astman.c:636: undefined reference to `__ast_fdleak_socket'
collect2: ld returned 1 exit status
make[1]: *** [astman] Error 1
make[1]: Leaving directory `/usr/src/asterisk/asterisk-1.4.24/utils'
make: *** [utils] Error 2

By: Tilghman Lesher (tilghman) 2009-03-26 10:53:05

Okay, compile error fixed.  You may need to 'make clean' to get astman.o to recompile.

By: Alistair Cunningham (acunningham) 2009-03-26 17:51:02

Thank you, will try this tonight.

By: Alistair Cunningham (acunningham) 2009-03-29 03:28:36

Uploaded is fd2.txt which is taken at the quietest time in the middle of the night with only 2 calls active. I will try again get one during the day too.

By: Alistair Cunningham (acunningham) 2009-03-29 17:49:06

Uploaded is fd3.txt which was taken with 107 calls active. At this point, the system was running healthily.

By: Alistair Cunningham (acunningham) 2009-03-30 11:15:39

Asterisk has locked up again. No console commands produce any output:

Connected to Asterisk 1.4.23.1 currently running on asterisk0 (pid = 2277)
asterisk0*CLI> core show fd
asterisk0*CLI> core show channels
asterisk0*CLI> quit

Running "asterisk -rx 'core show fd'" also produces no output.

By: Tilghman Lesher (tilghman) 2009-03-30 12:13:33

acunningham:  are you still running with the increased fd limit (32768) or did you revert back to 1024 for this?

By: Tilghman Lesher (tilghman) 2009-03-30 12:27:04

Also, fd3 indicates that at the time you had 107 calls running, there were actually 209 channels (which consume 418 file descriptors for the channel structure alone, not counting RTP streams) active.  Given that each call is typically composed of 2 channels, this is not out of line with the expected results (if all calls were connected, you would have consumed 214 channels or 428 file descriptors).  Indeed, you had 210 RTP sockets at the time, which is also not out of line with expected results.  I also see 105 FDs for UDPTL (used for T.38 setup).  Just there alone, you would be consuming 743 file descriptors for calls alone.  Do you see how close you are getting to the 1024 maximum limit?

With about 30 calls more, you'll run into that limit, probably causing a cascade failure, seeing exactly the symptoms you're witnessing.

By: Alistair Cunningham (acunningham) 2009-03-30 13:21:23

It should currently be 65535. I've added this line to /etc/init.d/asterisk:

ulimit -n 65535

Is there a way to tell the actual value in use from within Asterisk? Failing that, do you know of a way to find it using /proc or gdb?

By: Tilghman Lesher (tilghman) 2009-03-30 14:48:27

This latest version of the patch includes the total number of available file descriptors in the output.

By: Alistair Cunningham (acunningham) 2009-03-30 17:24:35

Thank you, I'll try this tonight.

By: Abhay Gupta (agupta) 2009-03-30 21:53:55

Hi are we sure that it is a fd issue . We are only having a single E1 and still facing the same and then to test we reverted to asterisk 1.4.17 and no problem for the last 3 days .

By: Alistair Cunningham (acunningham) 2009-03-31 00:51:08

I'm not sure if it's the same problem, but Asterisk has just segfaulted with:

Program terminated with signal 11, Segmentation fault.
#0  0x0809a294 in cli_next (i=0x43eb2980) at cli.c:1549
1549                            i->helpers = AST_LIST_NEXT(e, list);
(gdb) bt
#0  0x0809a294 in cli_next (i=0x43eb2980) at cli.c:1549
#1  0x0809a3a9 in find_cli (cmds=0x43eb29c0, match_type=0) at cli.c:1569
#2  0x0809b622 in ast_cli_command (fd=1024, s=0x43eb2d29 "show channel SIP/5060-08350eb0") at cli.c:1987
#3  0x080ca589 in action_command (s=0x8522430, m=0x43eb31a4) at manager.c:1762
#4  0x080cc5ba in process_message (s=0x8522430, m=0x43eb31a4) at manager.c:2225
ASTERISK-1  0x080ccbaa in do_message (s=0x8522430) at manager.c:2321
ASTERISK-2  0x080cccb7 in session_do (data=0x8522430) at manager.c:2337
ASTERISK-3  0x0811bf45 in dummy_start (data=0x85f5c68) at utils.c:856
ASTERISK-4  0x40028240 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
ASTERISK-5  0x4017a49e in clone () from /lib/tls/i686/cmov/libc.so.6

By: Tilghman Lesher (tilghman) 2009-03-31 09:53:44

agupta:  given that you have not provided the requested output, I'm inclined to suggest that either a) you have a problem completely unrelated to this report and you should open your own report, or b) your problem is related to this report, but given the lack of debugging information, there is no way for me to confirm that.

By: Tilghman Lesher (tilghman) 2009-03-31 09:54:58

acunningham:  that appears to be memory corruption.  I would suggest running under valgrind, if you can reproduce that crash.

By: Alistair Cunningham (acunningham) 2009-03-31 16:36:54

Thank you, testing does show the machine to have memory problems. It's going to be replaced. I'll let you know whether this affects the original problem. This may take a day or two however.

By: Alistair Cunningham (acunningham) 2009-03-31 22:33:51

A new machine has been installed, and the patch applied. "core show fd" now shows:

Current maxfiles: unlimited

Let's see whether this helps.

By: Tilghman Lesher (tilghman) 2009-04-08 11:50:47

It's been a week.  Anything to report?

By: Alistair Cunningham (acunningham) 2009-04-08 18:30:36

We haven't had any reports of problems since the new machine was installed. I'm happy for the ticket to be closed if we can re-open it should the problem occur again.

By: Digium Subversion (svnbot) 2009-04-08 23:31:39

Repository: asterisk
Revision: 187300

U   branches/1.4/agi/Makefile
U   branches/1.4/build_tools/cflags.xml
U   branches/1.4/include/asterisk.h
U   branches/1.4/main/Makefile
U   branches/1.4/main/asterisk.c
U   branches/1.4/main/file.c
U   branches/1.4/utils/Makefile

------------------------------------------------------------------------
r187300 | tilghman | 2009-04-08 23:31:39 -0500 (Wed, 08 Apr 2009) | 3 lines

Add debugging mode for diagnosing file descriptor leaks.
(Related to issue ASTERISK-13712)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=187300

By: Digium Subversion (svnbot) 2009-04-08 23:59:05

Repository: asterisk
Revision: 187302

_U  trunk/
U   trunk/agi/Makefile
U   trunk/build_tools/cflags.xml
U   trunk/include/asterisk.h
U   trunk/main/Makefile
U   trunk/main/asterisk.c
A   trunk/main/astfd.c
U   trunk/main/file.c
U   trunk/utils/Makefile

------------------------------------------------------------------------
r187302 | tilghman | 2009-04-08 23:59:05 -0500 (Wed, 08 Apr 2009) | 14 lines

Merged revisions 187300-187301 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
 r187300 | tilghman | 2009-04-08 23:31:38 -0500 (Wed, 08 Apr 2009) | 3 lines
 
 Add debugging mode for diagnosing file descriptor leaks.
 (Related to issue ASTERISK-13712)
........
 r187301 | tilghman | 2009-04-08 23:32:40 -0500 (Wed, 08 Apr 2009) | 2 lines
 
 Oops, missed this file in the last commit.
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=187302

By: Digium Subversion (svnbot) 2009-04-09 00:12:07

Repository: asterisk
Revision: 187303

_U  branches/1.6.2/
U   branches/1.6.2/agi/Makefile
U   branches/1.6.2/build_tools/cflags.xml
U   branches/1.6.2/include/asterisk.h
U   branches/1.6.2/main/Makefile
U   branches/1.6.2/main/asterisk.c
A   branches/1.6.2/main/astfd.c
U   branches/1.6.2/main/file.c
U   branches/1.6.2/utils/Makefile

------------------------------------------------------------------------
r187303 | tilghman | 2009-04-09 00:12:07 -0500 (Thu, 09 Apr 2009) | 21 lines

Merged revisions 187302 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r187302 | tilghman | 2009-04-08 23:59:05 -0500 (Wed, 08 Apr 2009) | 14 lines
 
 Merged revisions 187300-187301 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r187300 | tilghman | 2009-04-08 23:31:38 -0500 (Wed, 08 Apr 2009) | 3 lines
   
   Add debugging mode for diagnosing file descriptor leaks.
   (Related to issue ASTERISK-13712)
 ........
   r187301 | tilghman | 2009-04-08 23:32:40 -0500 (Wed, 08 Apr 2009) | 2 lines
   
   Oops, missed this file in the last commit.
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=187303

By: Digium Subversion (svnbot) 2009-04-09 00:16:07

Repository: asterisk
Revision: 187304

_U  branches/1.6.0/
U   branches/1.6.0/agi/Makefile
U   branches/1.6.0/build_tools/cflags.xml
U   branches/1.6.0/include/asterisk.h
U   branches/1.6.0/main/Makefile
U   branches/1.6.0/main/asterisk.c
A   branches/1.6.0/main/astfd.c
U   branches/1.6.0/main/file.c
U   branches/1.6.0/utils/Makefile

------------------------------------------------------------------------
r187304 | tilghman | 2009-04-09 00:16:07 -0500 (Thu, 09 Apr 2009) | 21 lines

Merged revisions 187302 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r187302 | tilghman | 2009-04-08 23:59:05 -0500 (Wed, 08 Apr 2009) | 14 lines
 
 Merged revisions 187300-187301 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r187300 | tilghman | 2009-04-08 23:31:38 -0500 (Wed, 08 Apr 2009) | 3 lines
   
   Add debugging mode for diagnosing file descriptor leaks.
   (Related to issue ASTERISK-13712)
 ........
   r187301 | tilghman | 2009-04-08 23:32:40 -0500 (Wed, 08 Apr 2009) | 2 lines
   
   Oops, missed this file in the last commit.
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=187304

By: Digium Subversion (svnbot) 2009-04-09 00:16:26

Repository: asterisk
Revision: 187305

_U  branches/1.6.1/
U   branches/1.6.1/agi/Makefile
U   branches/1.6.1/build_tools/cflags.xml
U   branches/1.6.1/include/asterisk.h
U   branches/1.6.1/main/Makefile
U   branches/1.6.1/main/asterisk.c
A   branches/1.6.1/main/astfd.c
U   branches/1.6.1/main/file.c
U   branches/1.6.1/utils/Makefile

------------------------------------------------------------------------
r187305 | tilghman | 2009-04-09 00:16:26 -0500 (Thu, 09 Apr 2009) | 21 lines

Merged revisions 187302 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r187302 | tilghman | 2009-04-08 23:59:05 -0500 (Wed, 08 Apr 2009) | 14 lines
 
 Merged revisions 187300-187301 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r187300 | tilghman | 2009-04-08 23:31:38 -0500 (Wed, 08 Apr 2009) | 3 lines
   
   Add debugging mode for diagnosing file descriptor leaks.
   (Related to issue ASTERISK-13712)
 ........
   r187301 | tilghman | 2009-04-08 23:32:40 -0500 (Wed, 08 Apr 2009) | 2 lines
   
   Oops, missed this file in the last commit.
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=187305

By: Tilghman Lesher (tilghman) 2009-04-09 00:24:09

Resolved by reporter with hardware upgrade.