Summary:ASTERISK-18543: Apparent Deadlock in chan_sip continues, even after repeated efforts.
Reporter:Ernie Dunbar (ernied)
Labels:
Date Opened:2011-09-13 18:43:02
Date Closed:2011-09-22 06:59:38
Versions: Frequency of
Environment:Debian Squeeze, 2.40 GHz Pentium 4, 2 GB RAM, Wildcard TE410P/TE405P (1st Gen) PRI card, DAHDI 2.4.0, libpri
Attachments:deadlocks.txt
Description:This issue appears similar to previous issues with SIP deadlocks. Asterisk stops responding to *new* SIP logins, but continues to process calls properly until the SIP users have to reconnect. We detect this condition with `netstat -anp |grep 5060`: if the Recv-Q is greater than 6000, we assume the server is deadlocked and kill the process with signal 6.
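
For reference, the check described above could be sketched roughly as follows. This is only an illustration, not the reporter's actual script: the function names `max_sip_recvq` and `sip_recvq_exceeds` are hypothetical, and the sketch assumes the usual netstat column order (Proto, Recv-Q, Send-Q, Local Address, ...).

```shell
#!/bin/sh
# Hypothetical helper: reads `netstat -an`-style lines on stdin and prints
# the largest Recv-Q value found on a socket bound to local port 5060.
# Assumed columns: $1=Proto $2=Recv-Q $3=Send-Q $4=Local-Address ...
max_sip_recvq() {
    awk '($4 ~ /:5060$/) && (($2 + 0) > max) { max = $2 + 0 }
         END { print max + 0 }'
}

# Hypothetical helper: succeeds (exit status 0) when the largest Recv-Q
# read from stdin is strictly greater than the threshold given as $1.
sip_recvq_exceeds() {
    threshold=$1
    [ "$(max_sip_recvq)" -gt "$threshold" ]
}
```

It would be used as `netstat -anp | sip_recvq_exceeds 6000 && echo "Recv-Q above threshold"`. A large Recv-Q only shows that packets are queued on the socket; on its own it does not prove that a thread is deadlocked.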

As has been previously suggested, we have stopped loading the modules res_timing_timerfd.so and res_timing_pthread.so and only use res_timing_dahdi.so, as well as upgrading to version, but these deadlocks continue. This only happens when the server is under significant load. Testing with dozens of clients connected does not show any problems at all, yet when our full complement of SIP users is logged in, we see these failures. After a hardware reboot, the problem typically does not recur for at least one or two days, but once it starts happening, restarting Asterisk becomes a frequent occurrence (say, every 15 to 30 minutes) until the hardware is rebooted again.

Attachments for 'core show locks' and gdb will be attached shortly, but gdb is having some trouble reading the process? All I get is this:

gdb asterisk 17320 |tee /tmp/backtrace.txt
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /usr/sbin/asterisk...done.
Attaching to program: /usr/sbin/asterisk, process 17320
ptrace: No such process.
/root/17320: No such file or directory.
(gdb) bt
No stack.
Comments:
By: Leif Madsen (lmadsen) 2011-09-14 07:56:10.839-0500

From this page:


The command for attaching to the running process is like the following:

gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk `pidof asterisk` > /tmp/backtrace-threads.txt
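
The "ptrace: No such process" message in the description suggests that the PID given to gdb was no longer running by the time gdb tried to attach (this is an inference, not something confirmed in this thread). A minimal sketch of a guard around the command above, where `pid_is_alive` is a hypothetical helper name and not part of Asterisk or gdb:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a process with the given PID exists and
# can be signalled by the current user. `kill -0` sends no signal; it only
# performs the existence and permission check.
pid_is_alive() {
    [ -n "$1" ] && kill -0 "$1" 2>/dev/null
}

# Example use (assumes a single asterisk process; run as root):
#   pid=$(pidof asterisk)
#   if pid_is_alive "$pid"; then
#       gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk "$pid" \
#           > /tmp/backtrace-threads.txt
#   else
#       echo "asterisk is not running" >&2
#   fi
```

Looking up the PID immediately before attaching avoids using a stale number from an earlier `ps` listing or from a process that a watchdog script has already killed.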

By: Leif Madsen (lmadsen) 2011-09-14 07:56:20.997-0500

Requesting feedback from reporter.

By: Ernie Dunbar (ernied) 2011-09-14 10:03:00.930-0500

Ugh. This is not actually a bug, or at least not a deadlock. The method we used to detect these deadlocks (which were definitely present in 1.8.5) was producing false alarms in the script we use to keep Asterisk alive. Now, even when the Recv-Q shows excessive packets queued, Asterisk continues to allow new SIP connections.

I don't know if this affects call quality, but it doesn't kick people's ATAs off our system.

By: Leif Madsen (lmadsen) 2011-09-14 11:23:19.399-0500

I see you're using res_timing_pthread -- you should probably avoid that module as it has known issues. If you're using anything before you should use res_timing_dahdi only. If you are using or later, then res_timing_timerfd should be fine as there was some work recently that should have fixed the issues it was having.

By: Ernie Dunbar (ernied) 2011-09-14 11:35:08.501-0500

It should be ok to close this issue.

By: Leif Madsen (lmadsen) 2011-09-22 06:59:38.117-0500

res_timing_timerfd should now be fixed in Asterisk Please test the current release candidate, and if you continue to have problems with res_timing_timerfd, please open a new issue. Thanks!