ASTERISK-06820: MixMonitor() causes Asterisk to deadlock under good call volume

[Home]

Summary: ASTERISK-06820: MixMonitor() causes Asterisk to deadlock under good call volume

Reporter: Edward Eastman (whisk) Labels:

Date Opened: 2006-04-20 17:43:53 Date Closed: 2006-12-05 09:01:47.000-0600

Priority: Major Regression? No

Status: Closed/Complete Components: Applications/app_mixmonitor

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) asterisk_bt.txt
( 1) asterisk_bt_full_-_post_patch.txt
( 2) asterisk_bt_full_-_post_patch_2.txt
( 3) asterisk_bt_full_-_post_patch_3.txt
( 4) asterisk_problem_extensions_excerpt.txt

Description: We have a couple asterisk box acting as sip gateways/call recorders (using mixmonitor) for a call centre seeing c. 4000 calls/day. After a while I see an 'Avoided initial deadlock' message every time I do 'show channels' even once the server has no calls on it (and there is one channel shown even with 0 calls). Once asterisk gets into this state only a restart will fix it. Sometimes it seems happy like that for some time, others it crashes.

We're seeing the warnings on show channels within 15-30 mins after a restart if there's 30+channels in use, but the crash is much more variable - probably once every 1-2 days on average.

I haven't been able to reproduce this if i take out the mixmonitor recording although I haven't tested it like this for too long.

I've attached a 'thread apply all bt' from an affected asterisk together with relevant portions of extensions.conf

Comments: By: Edward Eastman (whisk) 2006-04-20 17:49:50

Oh, few more details - the servers are accepting sip calls from another asterisk server, recording them to local hdds and passing them on to a tdm switch/acd system over pri.

I've also noticed this problem with earlier 1.2s and trunk from a couple of weeks ago (haven't tried latest trunk)
By: Tilghman Lesher (tilghman) 2006-04-20 18:23:03

Do all of the channels connect and bridge, or do any of them get hung up before bridging?
By: Edward Eastman (whisk) 2006-04-20 18:42:09

Yes I'm pretty sure they're all bridging
By: Tilghman Lesher (tilghman) 2006-04-20 18:46:40

Please test this patch to see if it makes a difference.
By: Edward Eastman (whisk) 2006-04-21 16:15:12

Thanks for you help with this - I've tried the patch on one of our servers today, with mainly good results - we didn't have any crashes, and we didn't have any lost calls, but the warning still comes up, although I think it took longer to show than normal. Here it is btw:

CLI> show channels
Channel Location State Application(Data)
0 active channels
10 active calls
Apr 21 21:58:17 WARNING[16609]: channel.c:787 channel_find_locked: Avoided initial deadlock for '0xb5b1c270', 10 retries!

It was a fairly busy day today, and I'd have expected some problems (crashes/lost calls) based on past form, but it seemed better. I'm pretty sure (although I wouldn't swear) that the show channels prior to this patch only showed 1 active call rather than 10.

I've just done a thread apply all bt on the running asterisk and this time there are only 29 threads - i've attached it although i'm not sure how helpful it will be as it's from a normal asterisk make (no valgrind etc). I'll try and get a non-optimized one as well.
By: Tilghman Lesher (tilghman) 2006-04-21 17:00:30

Actually 29 threads sounds about right, considering how many monitor threads we have in various modules. I wouldn't worry about the warning message. It happens, even on perfectly functioning systems (it's a side effect of having devicestate).
By: Edward Eastman (whisk) 2006-04-22 17:40:06

This's been running today, and the problem hasn't reoccurred since a restart last night. Todays traffic has been much less (generally around 15 concurrent calls rather than 40-50) but I'm pretty sure that before this patch it would've reoccurred. I did do a test last night and ran a load of calls through and it did reoccur - there's a bt attached (post patch 2) which does show a large number of threads again - so I think while the patch has improved matters it hasn't completely fixed it?
By: Edward Eastman (whisk) 2006-04-24 16:51:18

Been running it for another day (c. 2600 calls), and again the warning took much longer to show up than before the patch, but it still did appear - I'll attach another bt.
By: Edward Eastman (whisk) 2006-04-26 16:07:15

I'm afraid I think this definately hasn't fixed the problem - we had a higher call volume and had a couple of full deadlocks (couldn't receive any incoming calls) - these seem to occur when the box is high on cpu. I did get a bt but typically my pc crashed before I saved it :/

Any other ideas?
By: Serge Vecher (serge-v) 2006-05-03 09:58:00

Whisk: there was a fix committed to 1.2 yesterday that may impact this issue. Can you please download the latest 1.2 branch (1.2.8 which will contain it, has not been released yet). Thanks.
By: Edward Eastman (whisk) 2006-05-03 11:02:26

OK - I'll test later on today - should I test this without Corydon's additional patch or with?
By: Serge Vecher (serge-v) 2006-05-03 11:17:33

first without, if the problem is still there, then with the patch.
By: Edward Eastman (whisk) 2006-05-05 17:14:21

I've been using 1.2 branch for the last few days and the problem still persists. It looks like Corydon's patch is already in there.
By: BJ Weschke (bweschke) 2006-05-05 17:33:57

Whisk: current 1.2 branch does not yet contain Corydon's patch. Pls apply the patch and try and see if you're having the problem.
By: Edward Eastman (whisk) 2006-05-06 04:19:23

Are you sure about that? I've just done a fresh "svn co http://svn.digium.com/svn/asterisk/branches/1.2 asterisk" and then tried the patch:

# patch -p0 < /root/patch.mixmonitor
patching file channel.c
Reversed (or previously applied) patch detected! Assume -R? [n] n
Apply anyway? [n] n
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file channel.c.rej

When I look at channel.c it looks very much like it's been applied already.
By: BJ Weschke (bweschke) 2006-05-06 08:32:00

Whisk: not sure why you're having a problem with patch but Corydon76's patch is definitely not in the current 1.2 branch. I will try and take a look later today if you're still having problem integrating the code.
By: Tilghman Lesher (tilghman) 2006-05-06 09:04:49

You may need to 'svn update'. The patch was in the tree briefly for half an hour two weeks ago.
By: Edward Eastman (whisk) 2006-05-06 16:38:41

Sorry, /me hits self with the stupid stick (I was applying a completely wrong patch)

I'll apply Corydon's patch to latest 1.2 and see if that improves matters.
By: Vahram Igityan (vx) 2006-05-07 15:41:53

I am running Asterisk 1.2.7.1, using only chan_h323 + agi script for managing passed traffic.

Every day I am getting mentioned warning while doing show channels, but calls are passing ok, but I cannot stop asterisk asterisk -rx 'stop now' from the cli any more, only kill is helping.
Hope it will help.
By: BJ Weschke (bweschke) 2006-05-09 19:33:10

I've got a /trunk system now exhibiting the same behavior. I'm trying to see what it takes to reproduce as this is a fairly high volume system and I've got the bottle held out to try and catch the bolt of lightning. :-/
By: Serge Vecher (serge-v) 2006-05-10 14:18:11

vx: do you also use mixmonitor like whisk? If not, your issue maybe different.

if mixmonitor is used, please try Corydon76's patch posted earlier 20060420__bug7004.diff.txt

Thanks,
By: Edward Eastman (whisk) 2006-05-10 14:22:38

I've been running the latest 1.2 branch for the last few days - and doesn't seem to make any difference - again it seems a little better with corydon's patch, but still the same problems surface - in summary the latest changes to 1.2 branch seem to have made no difference to the problem.
By: Vahram Igityan (vx) 2006-05-10 14:50:09

vechers: no, I am not using spy monitor.
I guess the source of problems is not a monitor, imho it is global deadlock caused by chan_h323 problem.
I will install last trunk and check with new patches (there are a lot of changes for chan_h323), if the problem with nonstopable asterisk and deadlock message appears I will create a separate issue with backtraces.
Thank You.
By: Serge Vecher (serge-v) 2006-05-22 13:28:31

vx: I believe that your problem is not related to this (7004) bug, but to 5982. Please post your findings in that bug report. Thank you.
By: davet (davet) 2006-05-24 04:07:24

I have seen exactly the same problem and spent some time checking the configuration files and reloading zaptel drivers and constant restarts (had to "kill -9" since "stop now" didn't work) before taking a step back and looking for a simpler explanation.

The symptoms were exactly as you describe. "show channels" doesn't return useful channel information except errors like:
May 22 21:18:33 WARNING[29559]: channel.c:787 channel_find_locked: Avoided deadlock for '0x82a00c8', 10 retries!

I eventually traced it to a full /var partition.

I am running MySQL with its tables in /var/lib/mysql and CDRs were being written to /var/log/asterisk. I don't know whether the MySQL module was causing this blocking or the generic CDR CSV code but clearing space on /var and a restart fixed it.

Where are you writing your temporary -in and -out mix files to? Is this an explanation for you?
By: Serge Vecher (serge-v) 2006-05-26 12:34:29

whisk: please checkout the latest 1.2 branch code (>rev. 30546). A fix was just committed that should resolve the locking issues in app_queue. Please report and test back -- thank you!
By: Edward Eastman (whisk) 2006-05-26 16:37:09

We aren't using app_queue I'm afraid - just straight pushing of calls from SIP out to PRI with mixmonitor (extensions excerpt is in the attached files).

Just to recap this problem is definately linked to mixmonitor - if I remove mixmonitor it goes away - and while Corydon's broadcast patch does seem to help, it doesn't resolve the issue.
By: BJ Weschke (bweschke) 2006-05-31 05:56:49

Whisk: Going back in your logs right around when this happens, do you see messages to the extent of "Dropping voice frame because we're still dialing" on a zap channel that has the lock?
By: Serge Vecher (serge-v) 2006-06-08 15:39:16

whisk: could you please try 1.2.9.1 to see if that makes a difference? There is a major security update incorporated into it, so you should upgrade regardless.

Thanks
By: Edward Eastman (whisk) 2006-06-09 07:12:06

Bweschke: I can't see any messages like that in the logs - any special settings needed to see them?

Vechers - I'll give it a go and let you know.
By: Serge Vecher (serge-v) 2006-06-09 08:47:41

whisk: make sure your logger.conf has the following line:
console => notice,warning,error,debug

Restart Asterisk and at CLI, type set debug 4, set verbose 4.

Then will log very detailed debugging information. Thanks.
By: Serge Vecher (serge-v) 2006-06-19 13:17:01

ok, so what happened here?
By: Edward Eastman (whisk) 2006-06-19 16:39:33

I haven't had a chance to install 1.2.9.1 until tonight - so i'll let you know how it goes tomorrow.
By: Serge Vecher (serge-v) 2006-06-19 20:52:07

whisk: actually please update to the latest stable 1.2 revision. Joshnet/kpfleming have done some important code rewrites in r33725/r33842 for app_mixmonitor. These will be in 1.2.10 when that is released. Thanks.
By: Edward Eastman (whisk) 2006-06-20 17:01:14

1.2.9.1 was no different, but i'll try 1.2 branch latest and see what that does - should I try corydon's patch with this version as well?
By: Tilghman Lesher (tilghman) 2006-06-20 17:09:53

As that patch did not conclusively fix anything, I am removing it from this bug. You should not apply it going forward.
By: Edward Eastman (whisk) 2006-06-20 17:41:30

Latest 1.2 branch crashes fairly firmly very quickly, i've had enough of tonight but i'll post a proper bug report tomorrow :)
By: Serge Vecher (serge-v) 2006-06-27 14:42:07

Actually: hold of on that patch until it is rewritten. In either case, please monitor 7230 as the solution there will probably impact your case also. Thanks

Whisk: could you please try the patch 2006-06-27-channel.c-M7230.patch in Bug ASTERISK-7043. There is a good chance that it will fix your problem also. Thanks.

By: Edward Eastman (whisk) 2006-07-24 04:42:39

OK - as suggested in bug 7230 I've commented out the ast_channel_spy_remove in stopmon - the reasoning being (as i understand it) that it's called from channel.c on channel hangup anyway, and this can cause a race/deadlock condition. Can anyone confirm that this is a safe/sensible thing to do? It's been running for a day so far without any problems, but that was with fairly low call volume so I'll report back in a day or so.
By: Matt Riddell (zx81) 2006-08-20 11:30:24

We're getting the same thing on SIP Calls to MaxTNT with a DS3.

The patch in 7230 doesn't help us.

The only thing that helps is removing MixMonitor.

By: Michael Toop (mmmmmtoop) 2006-09-01 04:17:34

Getting exactly the same symptoms by NOT using MixMonitor. We stopped using MixMonitor when it caused a complete deadlock and eventual crash. Now it looks 'partial' with Monitor, Asterisk continues running but the 'show channels' command displays: Avoided deadlock and only shows a few channels.
By: Serge Vecher (serge-v) 2006-09-01 09:25:34

ok, we need to know if this is observed on 1.2.11 or trunk or what?
By: Serge Vecher (serge-v) 2006-09-05 15:25:21

actually, make that revision 42054 (or higher) of 1.2 branch.
By: Michael Toop (mmmmmtoop) 2006-09-06 15:28:00

Sorry...Observed on SVN-tag-1.2.11-r40846
By: Joshua C. Colp (jcolp) 2006-09-06 15:31:19

That is a check out of 1.2.11 which DOES NOT have my latest fixes in it, you need to check out the 1.2 branch from SVN to get them.
By: Joshua C. Colp (jcolp) 2006-09-06 15:32:08

Misread your statement as it was in reply to serge - so please check out the 1.2 branch to get my fixes.
By: Joshua C. Colp (jcolp) 2006-09-18 10:23:49

No responses people, but that's okay. If an issue still exists please reopen this bug. (Yes I know I have just copy/pasted this)
By: Jared Smith (jsmith) 2006-12-05 09:01:47.000-0600

This problem may be related ot 8376, according to a note in bug 8376.