Summary: | ASTERISK-06820: MixMonitor() causes Asterisk to deadlock under good call volume | ||
Reporter: | Edward Eastman (whisk) | Labels: | |
Date Opened: | 2006-04-20 17:43:53 | Date Closed: | 2006-12-05 09:01:47.000-0600 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Applications/app_mixmonitor |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) asterisk_bt.txt ( 1) asterisk_bt_full_-_post_patch.txt ( 2) asterisk_bt_full_-_post_patch_2.txt ( 3) asterisk_bt_full_-_post_patch_3.txt ( 4) asterisk_problem_extensions_excerpt.txt | |
Description: | We have a couple asterisk box acting as sip gateways/call recorders (using mixmonitor) for a call centre seeing c. 4000 calls/day. After a while I see an 'Avoided initial deadlock' message every time I do 'show channels' even once the server has no calls on it (and there is one channel shown even with 0 calls). Once asterisk gets into this state only a restart will fix it. Sometimes it seems happy like that for some time, others it crashes. We're seeing the warnings on show channels within 15-30 mins after a restart if there's 30+channels in use, but the crash is much more variable - probably once every 1-2 days on average. I haven't been able to reproduce this if i take out the mixmonitor recording although I haven't tested it like this for too long. I've attached a 'thread apply all bt' from an affected asterisk together with relevant portions of extensions.conf | ||
Comments: | By: Edward Eastman (whisk) 2006-04-20 17:49:50 Oh, few more details - the servers are accepting sip calls from another asterisk server, recording them to local hdds and passing them on to a tdm switch/acd system over pri. I've also noticed this problem with earlier 1.2s and trunk from a couple of weeks ago (haven't tried latest trunk) By: Tilghman Lesher (tilghman) 2006-04-20 18:23:03 Do all of the channels connect and bridge, or do any of them get hung up before bridging? By: Edward Eastman (whisk) 2006-04-20 18:42:09 Yes I'm pretty sure they're all bridging By: Tilghman Lesher (tilghman) 2006-04-20 18:46:40 Please test this patch to see if it makes a difference. By: Edward Eastman (whisk) 2006-04-21 16:15:12 Thanks for you help with this - I've tried the patch on one of our servers today, with mainly good results - we didn't have any crashes, and we didn't have any lost calls, but the warning still comes up, although I think it took longer to show than normal. Here it is btw: CLI> show channels Channel Location State Application(Data) 0 active channels 10 active calls Apr 21 21:58:17 WARNING[16609]: channel.c:787 channel_find_locked: Avoided initial deadlock for '0xb5b1c270', 10 retries! It was a fairly busy day today, and I'd have expected some problems (crashes/lost calls) based on past form, but it seemed better. I'm pretty sure (although I wouldn't swear) that the show channels prior to this patch only showed 1 active call rather than 10. I've just done a thread apply all bt on the running asterisk and this time there are only 29 threads - i've attached it although i'm not sure how helpful it will be as it's from a normal asterisk make (no valgrind etc). I'll try and get a non-optimized one as well. By: Tilghman Lesher (tilghman) 2006-04-21 17:00:30 Actually 29 threads sounds about right, considering how many monitor threads we have in various modules. I wouldn't worry about the warning message. It happens, even on perfectly functioning systems (it's a side effect of having devicestate). By: Edward Eastman (whisk) 2006-04-22 17:40:06 This's been running today, and the problem hasn't reoccurred since a restart last night. Todays traffic has been much less (generally around 15 concurrent calls rather than 40-50) but I'm pretty sure that before this patch it would've reoccurred. I did do a test last night and ran a load of calls through and it did reoccur - there's a bt attached (post patch 2) which does show a large number of threads again - so I think while the patch has improved matters it hasn't completely fixed it? By: Edward Eastman (whisk) 2006-04-24 16:51:18 Been running it for another day (c. 2600 calls), and again the warning took much longer to show up than before the patch, but it still did appear - I'll attach another bt. By: Edward Eastman (whisk) 2006-04-26 16:07:15 I'm afraid I think this definately hasn't fixed the problem - we had a higher call volume and had a couple of full deadlocks (couldn't receive any incoming calls) - these seem to occur when the box is high on cpu. I did get a bt but typically my pc crashed before I saved it :/ Any other ideas? By: Serge Vecher (serge-v) 2006-05-03 09:58:00 Whisk: there was a fix committed to 1.2 yesterday that may impact this issue. Can you please download the latest 1.2 branch (1.2.8 which will contain it, has not been released yet). Thanks. By: Edward Eastman (whisk) 2006-05-03 11:02:26 OK - I'll test later on today - should I test this without Corydon's additional patch or with? By: Serge Vecher (serge-v) 2006-05-03 11:17:33 first without, if the problem is still there, then with the patch. By: Edward Eastman (whisk) 2006-05-05 17:14:21 I've been using 1.2 branch for the last few days and the problem still persists. It looks like Corydon's patch is already in there. By: BJ Weschke (bweschke) 2006-05-05 17:33:57 Whisk: current 1.2 branch does not yet contain Corydon's patch. Pls apply the patch and try and see if you're having the problem. By: Edward Eastman (whisk) 2006-05-06 04:19:23 Are you sure about that? I've just done a fresh "svn co http://svn.digium.com/svn/asterisk/branches/1.2 asterisk" and then tried the patch: # patch -p0 < /root/patch.mixmonitor patching file channel.c Reversed (or previously applied) patch detected! Assume -R? [n] n Apply anyway? [n] n Skipping patch. 2 out of 2 hunks ignored -- saving rejects to file channel.c.rej When I look at channel.c it looks very much like it's been applied already. By: BJ Weschke (bweschke) 2006-05-06 08:32:00 Whisk: not sure why you're having a problem with patch but Corydon76's patch is definitely not in the current 1.2 branch. I will try and take a look later today if you're still having problem integrating the code. By: Tilghman Lesher (tilghman) 2006-05-06 09:04:49 You may need to 'svn update'. The patch was in the tree briefly for half an hour two weeks ago. By: Edward Eastman (whisk) 2006-05-06 16:38:41 Sorry, /me hits self with the stupid stick (I was applying a completely wrong patch) I'll apply Corydon's patch to latest 1.2 and see if that improves matters. By: Vahram Igityan (vx) 2006-05-07 15:41:53 I am running Asterisk 1.2.7.1, using only chan_h323 + agi script for managing passed traffic. Every day I am getting mentioned warning while doing show channels, but calls are passing ok, but I cannot stop asterisk asterisk -rx 'stop now' from the cli any more, only kill is helping. Hope it will help. By: BJ Weschke (bweschke) 2006-05-09 19:33:10 I've got a /trunk system now exhibiting the same behavior. I'm trying to see what it takes to reproduce as this is a fairly high volume system and I've got the bottle held out to try and catch the bolt of lightning. :-/ By: Serge Vecher (serge-v) 2006-05-10 14:18:11 vx: do you also use mixmonitor like whisk? If not, your issue maybe different. if mixmonitor is used, please try Corydon76's patch posted earlier 20060420__bug7004.diff.txt Thanks, By: Edward Eastman (whisk) 2006-05-10 14:22:38 I've been running the latest 1.2 branch for the last few days - and doesn't seem to make any difference - again it seems a little better with corydon's patch, but still the same problems surface - in summary the latest changes to 1.2 branch seem to have made no difference to the problem. By: Vahram Igityan (vx) 2006-05-10 14:50:09 vechers: no, I am not using spy monitor. I guess the source of problems is not a monitor, imho it is global deadlock caused by chan_h323 problem. I will install last trunk and check with new patches (there are a lot of changes for chan_h323), if the problem with nonstopable asterisk and deadlock message appears I will create a separate issue with backtraces. Thank You. By: Serge Vecher (serge-v) 2006-05-22 13:28:31 vx: I believe that your problem is not related to this (7004) bug, but to 5982. Please post your findings in that bug report. Thank you. By: davet (davet) 2006-05-24 04:07:24 I have seen exactly the same problem and spent some time checking the configuration files and reloading zaptel drivers and constant restarts (had to "kill -9" since "stop now" didn't work) before taking a step back and looking for a simpler explanation. The symptoms were exactly as you describe. "show channels" doesn't return useful channel information except errors like: May 22 21:18:33 WARNING[29559]: channel.c:787 channel_find_locked: Avoided deadlock for '0x82a00c8', 10 retries! I eventually traced it to a full /var partition. I am running MySQL with its tables in /var/lib/mysql and CDRs were being written to /var/log/asterisk. I don't know whether the MySQL module was causing this blocking or the generic CDR CSV code but clearing space on /var and a restart fixed it. Where are you writing your temporary -in and -out mix files to? Is this an explanation for you? By: Serge Vecher (serge-v) 2006-05-26 12:34:29 whisk: please checkout the latest 1.2 branch code (>rev. 30546). A fix was just committed that should resolve the locking issues in app_queue. Please report and test back -- thank you! By: Edward Eastman (whisk) 2006-05-26 16:37:09 We aren't using app_queue I'm afraid - just straight pushing of calls from SIP out to PRI with mixmonitor (extensions excerpt is in the attached files). Just to recap this problem is definately linked to mixmonitor - if I remove mixmonitor it goes away - and while Corydon's broadcast patch does seem to help, it doesn't resolve the issue. By: BJ Weschke (bweschke) 2006-05-31 05:56:49 Whisk: Going back in your logs right around when this happens, do you see messages to the extent of "Dropping voice frame because we're still dialing" on a zap channel that has the lock? By: Serge Vecher (serge-v) 2006-06-08 15:39:16 whisk: could you please try 1.2.9.1 to see if that makes a difference? There is a major security update incorporated into it, so you should upgrade regardless. Thanks By: Edward Eastman (whisk) 2006-06-09 07:12:06 Bweschke: I can't see any messages like that in the logs - any special settings needed to see them? Vechers - I'll give it a go and let you know. By: Serge Vecher (serge-v) 2006-06-09 08:47:41 whisk: make sure your logger.conf has the following line: console => notice,warning,error,debug Restart Asterisk and at CLI, type set debug 4, set verbose 4. Then will log very detailed debugging information. Thanks. By: Serge Vecher (serge-v) 2006-06-19 13:17:01 ok, so what happened here? By: Edward Eastman (whisk) 2006-06-19 16:39:33 I haven't had a chance to install 1.2.9.1 until tonight - so i'll let you know how it goes tomorrow. By: Serge Vecher (serge-v) 2006-06-19 20:52:07 whisk: actually please update to the latest stable 1.2 revision. Joshnet/kpfleming have done some important code rewrites in r33725/r33842 for app_mixmonitor. These will be in 1.2.10 when that is released. Thanks. By: Edward Eastman (whisk) 2006-06-20 17:01:14 1.2.9.1 was no different, but i'll try 1.2 branch latest and see what that does - should I try corydon's patch with this version as well? By: Tilghman Lesher (tilghman) 2006-06-20 17:09:53 As that patch did not conclusively fix anything, I am removing it from this bug. You should not apply it going forward. By: Edward Eastman (whisk) 2006-06-20 17:41:30 Latest 1.2 branch crashes fairly firmly very quickly, i've had enough of tonight but i'll post a proper bug report tomorrow :) By: Serge Vecher (serge-v) 2006-06-27 14:42:07 Actually: hold of on that patch until it is rewritten. In either case, please monitor 7230 as the solution there will probably impact your case also. Thanks Whisk: could you please try the patch 2006-06-27-channel.c-M7230.patch in Bug ASTERISK-7043. There is a good chance that it will fix your problem also. Thanks. By: Edward Eastman (whisk) 2006-07-24 04:42:39 OK - as suggested in bug 7230 I've commented out the ast_channel_spy_remove in stopmon - the reasoning being (as i understand it) that it's called from channel.c on channel hangup anyway, and this can cause a race/deadlock condition. Can anyone confirm that this is a safe/sensible thing to do? It's been running for a day so far without any problems, but that was with fairly low call volume so I'll report back in a day or so. By: Matt Riddell (zx81) 2006-08-20 11:30:24 We're getting the same thing on SIP Calls to MaxTNT with a DS3. The patch in 7230 doesn't help us. The only thing that helps is removing MixMonitor. By: Michael Toop (mmmmmtoop) 2006-09-01 04:17:34 Getting exactly the same symptoms by NOT using MixMonitor. We stopped using MixMonitor when it caused a complete deadlock and eventual crash. Now it looks 'partial' with Monitor, Asterisk continues running but the 'show channels' command displays: Avoided deadlock and only shows a few channels. By: Serge Vecher (serge-v) 2006-09-01 09:25:34 ok, we need to know if this is observed on 1.2.11 or trunk or what? By: Serge Vecher (serge-v) 2006-09-05 15:25:21 actually, make that revision 42054 (or higher) of 1.2 branch. By: Michael Toop (mmmmmtoop) 2006-09-06 15:28:00 Sorry...Observed on SVN-tag-1.2.11-r40846 By: Joshua C. Colp (jcolp) 2006-09-06 15:31:19 That is a check out of 1.2.11 which DOES NOT have my latest fixes in it, you need to check out the 1.2 branch from SVN to get them. By: Joshua C. Colp (jcolp) 2006-09-06 15:32:08 Misread your statement as it was in reply to serge - so please check out the 1.2 branch to get my fixes. By: Joshua C. Colp (jcolp) 2006-09-18 10:23:49 No responses people, but that's okay. If an issue still exists please reopen this bug. (Yes I know I have just copy/pasted this) By: Jared Smith (jsmith) 2006-12-05 09:01:47.000-0600 This problem may be related ot 8376, according to a note in bug 8376. |