Summary: ASTERISK-15611: Heavy locking in manager.c results in eventual crash and loss of CLI commands and intense CPU load
Reporter: David Brillert (aragon)
Labels:
Date Opened: 2010-02-11 08:29:50.000-0600
Date Closed: 2010-10-04 13:43:34
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Core/ManagerInterface
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 10032010gdbasterisk-pid_threadapplyallbt.txt
( 1) asteriskclicoreshowlocks.txt
( 2) gdbthreadapplyallbt.txt
( 3) valgrind.txt
Description: We run some scripts that connect to the Asterisk AMI to display realtime Asterisk statistics. When our AMI scripts are connected to Asterisk versions 1.4.25 through 1.4.30rc2, there is heavy locking and CPU load spiking, which degrades voice quality and kills Asterisk CLI commands such as core show locks and core show channels. I have seen the load in htop reach 300 until Asterisk is forcibly restarted. Asterisk 1.4.24 is not affected, so we don't think it is our scripts.

The output of core show locks and of gdb asterisk --pid=`pidof asterisk` 'thread apply all bt' is attached as text files.
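For anyone who needs to collect the same kind of trace while the box is barely responsive, here is a minimal sketch of running that gdb invocation non-interactively and saving the output to a file. The function names and the output path are hypothetical, and it assumes gdb is installed and that you already know the Asterisk PID (for example from pidof asterisk).

```python
import subprocess


def build_gdb_backtrace_cmd(pid, binary="asterisk"):
    """Return the argv for a non-interactive 'thread apply all bt' dump."""
    return [
        "gdb", binary,
        "--pid={}".format(int(pid)),   # attach to the running process
        "-batch",                      # exit once the -ex commands have run
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def capture_backtrace(pid, out_path, binary="asterisk"):
    """Attach gdb to a live process and write the backtraces to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(
            build_gdb_backtrace_cmd(pid, binary),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,  # keep whatever gdb printed even if it exits non-zero
        )
```

Calling capture_backtrace(pid, "gdbthreadapplyallbt.txt") would produce a file similar to the attachments on this report. Note that the process is paused while gdb is attached.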

****** ADDITIONAL INFORMATION ******

This bug is the root cause of another bug report I have open, ASTERISK-15564.
Comments:

By: Tilghman Lesher (tilghman) 2010-02-11 12:01:13.000-0600

This would be fixed by the following patch on reviewboard:

https://reviewboard.asterisk.org/r/219/

By: David Brillert (aragon) 2010-02-11 12:39:15.000-0600

Thanks tilghman: Should I wait until that patch on the reviewboard is submitted?

By: David Brillert (aragon) 2010-02-11 13:05:56.000-0600

tilghman: Gadzooks, the last change to the reviewboard link you posted https://reviewboard.asterisk.org/r/219/ was Mark Michelson 10 months ago (April 16th, 2009, 9:41 a.m.) and it seems to have died in the asterisk-dev list too.  Last update to http://lists.digium.com/pipermail/asterisk-dev/2009-March/037448.html was Tue Mar 24 15:44:39 CDT 2009.
Also, there is much controversy over whether this patch should be submitted, and it doesn't appear to be sanctioned by anybody...

Methinks I have a long time to wait :'(
Is anyone going to pick this up and run with it?

By: David Brillert (aragon) 2010-02-12 08:47:02.000-0600

https://reviewboard.asterisk.org/r/219/
This change has been discarded.

When running Asterisk 1.4.29 or 1.4.30rc2, Asterisk locks up and the load spikes as soon as I execute the core show channels command. The gdb traces are the same as those already uploaded.

Is somebody investigating a new patch?
I am very willing to test any new patch on my test rig under high load.

By: David Brillert (aragon) 2010-02-16 14:46:24.000-0600

It appears that https://reviewboard.asterisk.org/r/219/ was discarded in favour of a patch available at ASTERISK-14038
ASTERISK-14038 has been ready for review since 2009-09-18

Will this patch be reviewed shortly and/or made available for Asterisk 1.4.x?

By: David Brillert (aragon) 2010-02-17 19:55:02.000-0600

I've been talking to wedhorn @ ASTERISK-14038 and there is an update http://lists.digium.com/pipermail/asterisk-dev/2010-February/042425.html
Since the fix tilghman has suggested will likely only be available in trunk, I am humbly requesting that a developer take a look at ASTERISK-15611 and provide an SVN patch to fix the locking I reported.

By: David Brillert (aragon) 2010-03-10 15:24:07.000-0600

It happened again now using 1.4.30rc3
CPU load so high it took minutes to login to Asterisk with SSH.
After some patience I was able to attach with gdb asterisk -pid=
Uploading 10032010gdbasterisk-pid_threadapplyallbt.txt although I don't know if this will be helpful because Asterisk does not deadlock completely.  Everything just times out because the CPU is so overloaded by what appears to be a very long lock which spikes the CPU.
After I disconnected gdb the CPU load went back to normal and core show locks did not show any lock.

Peers are monitored with qualify=yes so when CPU maxes out I lose all my peers.

By: David Brillert (aragon) 2010-03-11 11:52:27.000-0600

I can't afford to wait for this to be fixed in trunk. What are my options to get this moving forward in 1.4?
Is there enough debug info here to move this along?
If not what else is required of me?

By: David Brillert (aragon) 2010-03-15 13:41:30

uploaded valgrind.txt after latest Asterisk crash today.

By: Jason Parker (jparker) 2010-09-27 12:51:40

Is this still an issue with 1.4.36?  Would it be possible to get a new "core show locks" when this occurs?

What manager commands are you executing?

By: David Brillert (aragon) 2010-09-27 14:32:10

Qwell:
There is a bug in Asterisk somewhere regarding the locking.
But this could only be reproduced using AGI scripts.
My team fixed the locking problem by using FastAGI instead of AGI.
Since this issue no longer affects the reporter (me), I authorize you to close this report.

By: Jason Parker (jparker) 2010-09-28 16:48:46

Well, if there's a bug, I'd like to fix it.  I'm a bit confused though.  Initially, you said that AMI was causing (near-)deadlocks.  The related issue is a crash in voicemail.  You also mentioned that you've seen crashes related to this.  Now you're saying that the fix was to use FastAGI instead of AGI.

By: David Brillert (aragon) 2010-09-28 17:04:09

Each call would invoke an AGI command, but AGI takes a long time to start and delays call processing. The long delay would cause excessive locking, and the more calls were processed, the more the CPU would spike, causing even longer call-processing delays and, in turn, even longer locks.
I don't know why Leif related this issue to a crash in voicemail...

The fix was to replace AGI with FastAGI to remove the delay caused by AGI startup times.

AMI would monitor things like agent status, extension busy status, etc...
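For readers unfamiliar with how a monitoring script talks to AMI: the protocol is plain text over TCP, with each message made of "Key: Value" lines and terminated by a blank line. The sketch below shows one way to build an action and to split a receive buffer into complete messages. It is not the reporter's script; the function names, the Ping action, and the sample event fields are only illustrative.

```python
def format_action(name, **fields):
    """Serialize an AMI action: 'Key: Value' lines ending with a blank line."""
    lines = ["Action: {}".format(name)]
    lines += ["{}: {}".format(key, value) for key, value in fields.items()]
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_messages(buffer):
    """Split buffered text into complete messages.

    Returns (messages, remainder), where remainder is the trailing partial
    message that should be kept until more data arrives.
    """
    *complete, remainder = buffer.split("\r\n\r\n")
    messages = []
    for block in complete:
        message = {}
        for line in block.split("\r\n"):
            key, sep, value = line.partition(":")
            if sep:
                # A repeated key overwrites the earlier value in this sketch.
                message[key.strip()] = value.strip()
        if message:
            messages.append(message)
    return messages, remainder
```

A client would typically send format_action(...) over one long-lived socket and feed everything it receives through parse_messages, carrying the remainder into the next read.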

I'd like to help you fix this too, but I couldn't wait for a developer to get around to it, and now that this is fixed with FastAGI I cannot reliably reproduce the problem. Therefore, feel free to use any existing debug info attached to this report to fix the bug on your own, or close the ticket. But I do think this issue is closely related to ASTERISK-14994, which I can reliably reproduce.
Both issues are related to slow AGI startup times, since I have resolved both issues by dumping AGI in favor of FastAGI.
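To illustrate the difference described above: plain AGI launches a new script process for every call, while FastAGI speaks the same protocol over a TCP connection to a server that is already running, so there is no per-call startup cost. Below is a minimal sketch of such a server, not the reporter's actual code. The handler, function names, and VERBOSE message are hypothetical; port 4573 is the usual FastAGI default, and the dialplan would point at it with something like AGI(agi://127.0.0.1/stats).

```python
import socketserver


def read_agi_env(rfile):
    """Read the 'agi_*: value' header block that Asterisk sends first.

    The block ends with a blank line; anything after it is left unread.
    """
    env = {}
    for raw in rfile:
        line = raw.decode("utf-8", "replace").strip()
        if not line:
            break
        key, _, value = line.partition(":")
        env[key.strip()] = value.strip()
    return env


class FastAGIHandler(socketserver.StreamRequestHandler):
    """Handle one FastAGI session per incoming connection."""

    def handle(self):
        env = read_agi_env(self.rfile)
        request = env.get("agi_request", "unknown")
        # Send a single AGI command; a real script would do its work here.
        command = 'VERBOSE "fastagi request for {}" 1\n'.format(request)
        self.wfile.write(command.encode("utf-8"))
        self.rfile.readline()  # Asterisk replies with one status line


def serve(host="127.0.0.1", port=4573):
    """Run the server until interrupted; one thread per connection."""
    with socketserver.ThreadingTCPServer((host, port), FastAGIHandler) as server:
        server.serve_forever()
```

Calling serve() once at boot keeps the interpreter and any shared state warm, which is the startup delay the reporter says was removed by moving off AGI.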

By: Jason Parker (jparker) 2010-10-04 13:43:34

Closing per reporter.