[Home]

Summary:ASTERISK-13723: Asterisk CPU usage 100% (deadlock?) when using Queue() with Local channels
Reporter:Miguel Molina (coolmig)Labels:
Date Opened:2009-03-10 12:26:33Date Closed:2011-06-07 14:07:48
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_local
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) ExtensionStateCapture.txt
Description:I was testing a dialer I'm making, and found this situation: when I dial more than one call to an agent using Local channels (busy with one call and one or more in the queue), Asterisk goes 100% CPU all the time until the queued calls get answered or hangup.

Core show locks shows this:

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3077847952 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (chan_agent.c): MUTEX 2252 __login_exec &p->app_lock 0x93d59d4 (1)
=== --- ---> Locked Here: chan_agent.c line 1080 (agent_new)
=== -------------------------------------------------------------------
===
=== Thread ID: 3073665936 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_agent.c): MUTEX 1080 agent_new &p->app_lock 0x93d59d4 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 3075632016 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Tried and failed to get Lock #0 (chan_local.c): MUTEX 186 local_queue_frame (channel lock) 0x9464470 (0)
=== -------------------------------------------------------------------
===
=======================================================================

If I hangup the initial agent call, if takes a long time for the Queue to deliver the other call to the agent, while the agent is with no call, core show locks shows this:

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3077847952 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_agent.c): MUTEX 2252 __login_exec &p->app_lock 0x93d59d4 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 3076860816 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Tried and failed to get Lock #0 (chan_local.c): MUTEX 186 local_queue_frame (channel lock) 0x945d820 (0)
=== -------------------------------------------------------------------
===
=======================================================================

When the queued call finally is delivered to the agent, core show locks shows this, and the 100% usage stops:

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3077847952 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (chan_agent.c): MUTEX 2252 __login_exec &p->app_lock 0x93d59d4 (1)
=== --- ---> Locked Here: chan_agent.c line 1080 (agent_new)
=== -------------------------------------------------------------------
===
=== Thread ID: 3075632016 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_agent.c): MUTEX 1080 agent_new &p->app_lock 0x93d59d4 (1)
=== -------------------------------------------------------------------
===
=======================================================================

****** ADDITIONAL INFORMATION ******

Dialplan involved: (the calls are generated using AMI)

;Context where I generate the calls using Local Channels
[marcador-prueba_outbound-gencall]
exten => _X.,1,Wait(10)
exten => _X.,n,Answer()
exten => _X.,n,MusiconHold(default)

;exten => _X.,1,Dial(SIP/TRUNK/${EXTEN}) ;Untested scenario
;exten => _X.,n,Hangup() ;Untested scenario

;Context where I connect the generated calls when they answer
[marcador-prueba_outbound]
exten => s,1,Answer()
exten => s,n,Queue(prueba_outbound,ht)
exten => s,n,Hangup()

The 1.4.22.2 version I am using has a patch applied from (http://bugs.digium.com/view.php?id=13676) that introduces a DEADLOCK_AVOIDANCE on this type of channels. I don't know if this happens when I dial outside using the same method.

Any additional information please let me know.
Comments:By: Miguel Molina (coolmig) 2009-03-10 15:14:20

I tested this using the local channel to dial a SIP or IAX2 extension (the final use it will have), with the following simple dialplan:

;Context where I generate the calls using Local Channels
[marcador-prueba_outbound-gencall]
exten => _X.,1,Wait(10)
exten => _X.,n,Dial(SIP/100)
;exten => _X.,n,Dial(IAX2/100)

;Context where I connect the generated calls when they answer
[marcador-prueba_outbound]
exten => s,1,Answer()
exten => s,n,Queue(prueba_outbound,ht)
exten => s,n,Hangup()

And the issue does not reproduce this way, so it happens only with pure Local channels.

By: Miguel Molina (coolmig) 2009-06-24 15:22:57

Testing on a similar scenario, this issue doesn't happen in asterisk 1.6.0.10.

By: Paul D Hales (pdhales) 2009-07-01 20:03:54

I have a similar lockup with Asterisk 1.4.25.1 - locks up under high load (approx 3000 calls per day). Slightly sanitised code follows:

exten => _XXXXXXXX,1,Set(__target=${EXTEN})
exten => _XXXXXXXX,n,Noop(${target})
exten => _XXXXXXXX,n,Queue(loop|ri|||60)
exten => _XXXXXXXX,n,Playback(greeting1)
exten => _XXXXXXXX,n,Queue(loop|i|||90)
exten => _XXXXXXXX,n(loopstart),Playback(holdmsg1)
exten => _XXXXXXXX,n,Queue(loop|i|||120)
exten => _XXXXXXXX,n,Playback(holdmsg2)
exten => _XXXXXXXX,n,Queue(loop|i|||90)
exten => _XXXXXXXX,n,Goto(loopstart)


[lqueue]
exten => s,1,Noop
exten => s,n,Dial(DAHDI/G2/${target})
exten => s,n,Congestion

[loop]
joinempty = yes
leavewhenempty = no
ringinuse = yes
timeout=60
retry=20
strategy=random
timeoutrestart = yes

member => Local/s@lqueue

By: Paul Smith (pscs) 2009-07-24 09:08:28

I have a similar problem with Asterisk 1.6.0.9-samy-r27
(This setup is on a Intel quad core 2.33GHz, 4GB RAM, with about 12 extensions defined)

In my case, Queues simply don't work if there is more than one static member with the 'ringall' type set. If 'roundrobin' is set, then they work fine (but not as we want).

After investigation it seems as if the dialparties.agi script is failing because the 'ExtensionState' API is taking too long to return.

I hacked dialparties.agi to put some debug logging around the ExtensionState call in the 'is_ext_available' function. It logs 'iea: 0 <extnum>' just before it calls ExtensionState and 'iea: 1 <extnum>' just after it.

Then, see the log extract below. You can see 'iea: 0 807' at 14:29:58 just before the state of extension 807 is checked. Then, just after that you see 'iea: 0 808' as the state of extension 808 is checked.
Then, 13 seconds later, you see 'iea: 1 807' as the state of 807 is returned (idle). But, you don't see the state of extension 808 being returned.
Shortly after that you see 'Nobody picked up in 15000 ms' then the dialparties.agi scripts abort/are aborted, and return -1. Neither extension 807 or 808 are called.

If you have a queue with just one member, then the ExtensionState still takes a long time, but since it is only called once, it finishes just in time (after a long delay), so the queue works, but with only one member.

The queue 'timeout' here is set to 15 seconds. If I increase that to 60 seconds, then, eventually, the phones do ring, but only after about 20+ seconds of being in the queue.

Also, if the queue is on roundrobin, then each agent's phone actually only rings for about 2 or 3 seconds if the timeout is set to 15 seconds, as most of the timeout is being spent in the ExtensionState API.

[Jul 24 14:29:58] VERBOSE[4226] logger.c:  dialparties.agi: Caller ID name is '111Paul Smith' number is '804'
[Jul 24 14:29:58] VERBOSE[4225] logger.c:  dialparties.agi: Methodology of ring is  'none'
[Jul 24 14:29:58] VERBOSE[4225] logger.c:     -- dialparties.agi: Added extension 807 to extension map
[Jul 24 14:29:58] VERBOSE[4225] logger.c:     -- dialparties.agi: Extension 807 cf is disabled
[Jul 24 14:29:58] VERBOSE[4225] logger.c:     -- dialparties.agi: Extension 807 do not disturb is disabled
[Jul 24 14:29:58] VERBOSE[4225] logger.c:        > dialparties.agi: extnum 807 has:  cw: 1; hascfb: 0 [] hascfu: 0 []
[Jul 24 14:29:58] VERBOSE[4225] logger.c:        > dialparties.agi: MM: 0
[Jul 24 14:29:58] VERBOSE[4225] logger.c:        > dialparties.agi: iea: 0 807
[Jul 24 14:29:58] VERBOSE[4226] logger.c:        > dialparties.agi: USE_CONFIRMATION:  'FALSE'
[Jul 24 14:29:58] VERBOSE[4226] logger.c:        > dialparties.agi: RINGGROUP_INDEX:   ''
[Jul 24 14:29:58] VERBOSE[4226] logger.c:  dialparties.agi: Methodology of ring is  'none'
[Jul 24 14:29:58] VERBOSE[4226] logger.c:     -- dialparties.agi: Added extension 808 to extension map
[Jul 24 14:29:58] VERBOSE[4226] logger.c:     -- dialparties.agi: Extension 808 cf is disabled
[Jul 24 14:29:58] VERBOSE[4226] logger.c:     -- dialparties.agi: Extension 808 do not disturb is disabled
[Jul 24 14:29:58] VERBOSE[4226] logger.c:        > dialparties.agi: extnum 808 has:  cw: 1; hascfb: 0 [] hascfu: 0 []
[Jul 24 14:29:58] VERBOSE[4226] logger.c:        > dialparties.agi: MM: 0
[Jul 24 14:29:58] VERBOSE[4226] logger.c:        > dialparties.agi: iea: 0 808
[Jul 24 14:30:11] VERBOSE[4225] logger.c:        > dialparties.agi: iea: 1 807
[Jul 24 14:30:11] VERBOSE[4225] logger.c:        > dialparties.agi: ExtensionState: 0
[Jul 24 14:30:11] VERBOSE[4225] logger.c:     -- dialparties.agi: dbset CALLTRACE/807 to 804
[Jul 24 14:30:11] VERBOSE[4225] logger.c:     -- dialparties.agi: Filtered ARG3: 807
[Jul 24 14:30:11] VERBOSE[4225] logger.c:        > dialparties.agi: NODEST: 200 blkvm enabled macro already in dialopts: trM(auto-blkvm)
[Jul 24 14:30:11] VERBOSE[4225] logger.c:        > dialparties.agi: NODEST: 200 blkvm enabled macro already in dialopts: trM(auto-blkvm)
[Jul 24 14:30:11] VERBOSE[4225] logger.c:        > dialparties.agi: DS: SIP/807
[Jul 24 14:30:11] VERBOSE[4231] logger.c:   == Manager 'admin' logged off from 127.0.0.1
[Jul 24 14:30:13] VERBOSE[4224] logger.c:     -- Nobody picked up in 15000 ms
[Jul 24 14:30:13] VERBOSE[4224] logger.c:     -- Nobody picked up in 15000 ms
[Jul 24 14:30:13] VERBOSE[4225] logger.c:     -- <Local/807@from-internal-1e89;2>AGI Script dialparties.agi completed, returning -1
[Jul 24 14:30:13] VERBOSE[4232] logger.c:   == Manager 'admin' logged off from 127.0.0.1
[Jul 24 14:30:13] VERBOSE[4226] logger.c:     -- <Local/808@from-internal-2f12;2>AGI Script dialparties.agi completed, returning -1

Ring Groups don't seem to have the problem because they don't seem to call the ExtensionState API.

This would also be why the SIP/num members don't have a problem, as I presume that ExtensionState isn't called for those either.



By: Paul Smith (pscs) 2009-07-24 12:23:26

After more diagnostics the problem seems to be that the call to the ExtensionState API returns nice and quickly, BUT it has a lot of data. In the capture I have here, the response was about 800 lines long, with over 100 events in it. Each of these events had to be processed in PHP, so I guess that is why it takes a long time.

This is where I get stuck because I don't know why those events are being generated, or whether they need to be (or how to stop them from being generated)

I have uploaded a copy of the response from the ExtensionState that happens on my PC.

[Edit]
The problem is the 'log' function in phpagi_asmanager.php.
This logs lots of stuff, and seems to take about 0.1 seconds per line being logged when the PHP 'error_log' setting is left blank or points to a file - so when 100 lines are being logged (1 per event) that takes about 10 seconds.
Changing the log function to use the pgpagi.php 'verbose' function instead of the php 'error_log' function solves the problem. I'm not sure if this is an acceptable solution or not though.

I can't think why the log function uses error_log, as it is not logging errors, just debug entries, and I don't know why error_log is so much slower than the verbose function, but it is, at least here. Also the 'log' function logs the text to error_log whatever the 'level' value is, so it looks broken from that point of view as well.

So, my proposed fix is, in phpagi-asmanager.php, change the log function to:

-------------cut-here---------
  /*
   * Log a message
   *
   * @param string $message
   * @param integer $level from 1 to 4
   */
   function log($message, $level=1)
   {
     global $AGI;
     $AGI->verbose($message, level);
     return;
   }
---------------cut here---------------

[Edit 2]
Actually this looks like a bug in FreePBX, so I've reported it there
http://www.freepbx.org/trac/ticket/3792



By: Leif Madsen (lmadsen) 2009-09-18 08:00:04

I'm closing this issue as it appears to be an issue with a FreePBX AGI file.