Summary: | ASTERISK-17436: random deadlocks - SIP messages not being processed | ||||
Reporter: | Henry Fernandes (usinternet) | Labels: | |||
Date Opened: | 2011-02-18 15:28:13.000-0600 | Date Closed: | 2013-04-19 11:00:56 | ||
Priority: | Major | Regression? | No | ||
Status: | Closed/Complete | Components: | Channels/chan_sip/General Resources/res_timing_pthread | ||
Versions: | Frequency of Occurrence | ||||
Related Issues: |
| ||||
Environment: | Attachments: | ( 0) backtrace-threads.txt ( 1) backtrace-threads-11021911:55:58.txt ( 2) backtrace-threads-11021912:05:44.txt ( 3) core-show-locks.txt ( 4) core-show-locks-11021911:55:58.txt ( 5) core-show-locks-11021912:05:44.txt | |||
Description: | Just upgraded to Asterisk 1.8.2 for production and we're getting random deadlocks. When this happens, it looks like Asterisk is not able to process new SIP messages. Any INVITEs or REGISTERs that arrive are not being outputted to the console or logs. This seems to happen randomly. I have not been able to reproduce it at will although it does happen about 5-10 times per day. The server is handling about 2000 calls/day, so it's not really that busy. There are about 75 phones connected to the server. Please see attached backtrace and lock output for more information. It shows the deadlocks but I'm not sure which is the root issue here. ****** ADDITIONAL INFORMATION ****** I've been able to reproduce this on Asterisk 1.8.2.2 and 1.8.2.3. OS is CentOS release 5.5 (Final). | ||||
Comments: | By: Damian Minkov (damencho) 2011-02-19 13:55:13.000-0600 We are experiencing the same locks on asterisk 1.8.2.3, we experienced it and before using 1.8.0, but rarely than after upgrading to latest. We've seen it on Ubuntu Server 10.10 and Ubuntu Server 10.04. After not able to process the INVITES and all lock ends and normal operation resumes, fantom calls are experienced, the INVITES that were not processed during lock are than processed. The duration of the locks vary between 5 and 10 minutes. Lately when trying to produce backtraces and list of locks, this immediately ends the lock period and normal operation continues. I'm attaching two such logs from today. By: Luke H (luckman212) 2011-02-20 21:45:49.000-0600 I believe I'm also experiencing this. CentOS 5.5, very small PBX (Atom230 cpu) and only a handful of phones. Some phones are using TCP/TLS to register. Deadlocks seem to be most common during REGISTER or INVITE. I'm curious, what hardware timer (clocksource) are you using on these machines? e.g. HPET, ACPI_PM, TSC, etc. Also, what timer modules in Asterisk? (pthread, dahdi, timerfd) I think this may be key. By: Damian Minkov (damencho) 2011-02-21 01:33:22.000-0600 Most of the software that connects to the machine is using TCP. In asterisk we use the timer module timerfd. *CLI> timing test Attempting to test a timer with 50 ticks per second. Using the 'timerfd' timing module for this test. It has been 1000 milliseconds, and we got 50 timer ticks. The hardware clocksource I beleave its tsc: # cat /sys/devices/system/clocksource/clocksource0/current_clocksource tsc # cat /sys/devices/system/clocksource/clocksource0/available_clocksource tsc hpet acpi_pm By: Henry Fernandes (usinternet) 2011-02-21 09:32:50.000-0600 Our phones connect via UDP. Our timing source is DAHDI. CLI> timing test Attempting to test a timer with 50 ticks per second. Using the 'DAHDI' timing module for this test. It has been 1019 milliseconds, and we got 51 timer ticks By: Casey Morford (cmorford) 2011-02-21 11:00:53.000-0600 I can confirm we are having this issue with 1.8.2.3 as well. I doesn't seemed to crash for us a frequently, but it is definitely in deadlock. Out of curiosity, are you seeing a unusually high number of chan_sip __sip_xmit warnings? By: Luke H (luckman212) 2011-02-21 11:04:51.000-0600 Hmm ok, I'm using acpi_pm on my particular box, so I guess it's probably not related to the 'clocksource' -- but I do believe it is possibly a bug in pthreads. I am in the process of swapping some things around to try to eliminate pthreads completely & just use dahdi timing as a test. But I believe this completely KILLS music on hold, at least on 1.8.x By: Luke H (luckman212) 2011-02-22 11:16:55.000-0600 I can confirm that re-enabling res_timing_dahdi has stopped the deadlocks on my PBX. MOH still seems to hook into res_timing_pthreads for something, my 'module show like timing' looks like this pbx*CLI> module show like timing Module Description Use Count res_timing_dahdi.so DAHDI Timing Interface 1 res_timing_pthread.so pthread Timing Interface 2 2 modules loaded anyway, as I expected, this has killed my MOH. MOH is not working correctly with res_timing_dahdi enabled for some reason. By: cmaj (cmaj) 2011-02-22 13:13:48.000-0600 luckman212, I'm running only res_timing_dahdi for timing, and MOH works fine. This is on 1.8.3-rc2 and a Sangoma card - couple T1's clocked from the telco - with ulaw across the board. As for the last core show locks, core-show-locks-11021912:05:44.txt, that looks like Mysql lock contention, one lock being held to the Mysql backend for both CDR insertion and realtime config. If whoever posted that could try turning off Mysql CDR logging for a bit, then it might help them. Or try setting "batchmode=yes" in cdr.conf so you can reduce some contention for that lock (this would cause a post of the CDR in a separate thread from the channel Hangup routine.) By: isrl (isrl) 2011-02-22 13:27:13.000-0600 Luckyman212: i also have a problem with moh and the dahdi timer but only if i use a external source like madplay for moh I tried debugging that but i havent found anything yet and when going to pthreads i have every few hours a deadlock which i reported at https://issues.asterisk.org/view.php?id=18867 By: Luke H (luckman212) 2011-02-22 13:33:04.000-0600 Yes I am using mpg123 for my moh source to stream, so I guess that's the issue. Has a separate bug been filed for that? or is it fixed in SVN-- do you know? By: isrl (isrl) 2011-02-22 13:53:38.000-0600 nope i tried from 1.8.1 all the way up to svn plus a few open patches in the bug tracker and the problem still exsits i havent open'd a bug report as i didnt know how to debug the problem now we know thats its when using a external program to stream moh By: Damian Minkov (damencho) 2011-02-24 04:58:28.000-0600 We've disabled cdr_mysql.so and again we had the lock today. I had to restart it to get it fixed. The interesting was: CLI> core show channels Channel Location State Application(Data) 0 active channels 1 active call CLI> sip show channels I had one channel in BYE And when trying CLI> sip show channel + typing <tab> and the console locked. Unfortunately I didn't manage to get backtrace and lock dumps cause we moved back the asterisk installation on other machine. And this one was not compiled with debug information. Will recompile now if happens again to trace it. By: Luke H (luckman212) 2011-02-24 07:52:34.000-0600 How do we even begin to debug why using mpg123 for a MOH stream + using res_timing_dahdi causes no sound? I wouldn't even begin to know where to start looking. Because it's not crashing or coredumping. By: Ronald Chan (loloski) 2011-02-24 13:12:01.000-0600 From luckman212: Yes I am using mpg123 for my moh source to stream, so I guess that's the issue. Has a separate bug been filed for that? or is it fixed in SVN-- do you know? Yes please file a separate bug for this one :D By: isrl (isrl) 2011-02-24 13:30:00.000-0600 i file'd a bug yesterday https://issues.asterisk.org/view.php?id=18877 By: Luke H (luckman212) 2011-02-24 13:49:21.000-0600 whoops I just filed one too https://issues.asterisk.org/view.php?id=18884 By: Matt Jordan (mjordan) 2013-04-19 10:56:48.584-0500 There is a patch on ASTERISK-21389 that should prevent res_timing_pthread from blocking callers. That should resolve this issue. If you test with that patch and this issue is still a problem, please let a bug marshal know in #asterisk-bugs and we will be happy to reopen this issue. Thanks! |