[Home]

Summary:ASTERISK-19279: Asterisk stops processing Local channels - CLI is full of messages "Exceptionally long voice queue length queuing to Local/XXX"
Reporter:Alex (alexrixhardson)Labels:
Date Opened:2012-01-31 06:13:37.000-0600Date Closed:2012-03-20 12:37:43
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_local
Versions:1.8.9.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:OS: Red Hat Enterprise Linux 5.5, Server: Dell PowerEdge T710Attachments:( 0) putty1.log
( 1) putty2.log
Description:Asterisk would occasionally stop processing Local channels and starts reporting "Exceptionally long voice queue length queuing to Local/XXX" in the CLI. The problem usually appears out of the blue. I have however once witnessed it to happen immediately after "reload" command is issued via CLI. Most of the times, however, it happens without anyone issuing the "reload" or any other command via CLI.

This problem affects several of my production servers. It happens cca. once per week per server. Each server handles few thousands calls per day and cca. 150 concurrent calls during the work hours.

To reproduce the problem (the dialplan is bit more complex, because as I said it is part of production server and it has its own meaning):

1. Get two Asterisk boxes,
2. Add the following in the dialplan on the first server (replace the [SECOND-SERVER-IP] with the IP of a second server:

[call-to-agent]
exten => 111,1,Set(CALLERID(num)=asterisk)
exten => 111,n,Set(CALLERID(name)=asterisk)
exten => 111,n,Dial(SIP/123@[SECOND-SERVER-IP])

[outbound-calls]
exten => 111,1,Macro(PauseAgent,${EXTEN:6:3})
exten => 111,n,Set(CALLERID(num)=219999999)
exten => 111,n,Set(CALLERID(rdnis)=219999999)
exten => 111,n,Set(CALLERID(name)='PINK')
exten => 111,n,Monitor(wav,${EXTEN:13}_${UNIQUEID}_${EXTEN:9:4}_wav128,mb)
exten => 111,n,Goto(outbound-routing,${EXTEN},1)
exten => 111,n,StopMonitor()
exten => 111,n,HangUp()

[outbound-routing]
exten => 111,1,Dial(SIP/124@[SECOND-SERVER-IP],,tT)

[macro-PauseAgent]
exten => s,1,Set(GROUP()=${ARG1}@agents)
exten => s,n,Set(SPYGROUP=_agentsspying:_agentspying${ARG1}:_queuespying${ARG2})

3. Add the following in the diaplan of the second server in the context, where SIP inbound calls are processed:

exten => 123,1,Wait(2)
exten => 123,n,Answer()
exten => 123,n,Playback(demo-congrats)
exten => 123,n,Playback(demo-congrats)
exten => 123,n,Playback(demo-congrats)
exten => 123,n,Playback(demo-congrats)
exten => 123,n,Playback(demo-congrats)
exten => 123,n,HangUp()

exten => 124,1,Ringing()
exten => 124,n,Wait(15)
exten => 124,n,Answer()
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,Playback(demo-congrats)
exten => 124,n,HangUp()

4. Create this outbound call file on the first server and name it /test/test.call:

Channel: Local/111@call-to-agent/n
MaxRetries: 1
RetryTime: 300
WaitTime: 300
Context: outbound-calls
Extension: 111
Priority: 1

5. Create this bash script on the first server and name it /test/stresstest:

#!/bin/bash

for ((x=1;x<=1200;x+=1));
do

rm -f /var/spool/asterisk/outgoing/*.*
touch /test/test.call

for ((i=1;i<=40;i+=1));
do
cp /test/test.call /var/spool/asterisk/outgoing/test_$i.call
done
sleep 20

done

6. Run the /test/stresstest script on the first server and wait for the deadlock.

I will attach two "core show locks" outputs from the CLI, that I managed to capture while Asterisk was reporintg "Exceptionally long voice queue length queuing to Local/XXX" error.

Please, let me know, shall anybody need any additional information. This issue is quite urgent. Thank you!
Comments:By: Matt Jordan (mjordan) 2012-01-31 08:19:17.805-0600

Are you running Asterisk in production with DEBUG_THREADS turned on?  If so, do you have this issue with DEBUG_THREADS disabled?

As it is, your 'core show locks' outputs do not appear to show a deadlock.  It would be useful if you compiled Asterisk with DONT_OPTIMIZE when you produce that output however, as the output will contain more useful information.

By: Alex (alexrixhardson) 2012-01-31 09:07:49.128-0600

Thank you Matt for your quick response.

I don't use DEBUG_THREADS in production.

I enabled DONT_OPTIMIZE and DEBUG_THREADS in my test environment only to get the "core show locks" which I attached. Those two *.log files are not from the production environment.

Production servers which are experiencing these problems, do not have DONT_OPTIMIZE or DEBUG_THREADS enabled.

I am now running a stress test in a test environment without DONT_OPTIMIZE and DEBUG_THREADS, but so far it hasn't crashed yet.

I am worried that this might be quite a complex problem that's harder to reproduce in the test environment than on a production server.

Do you think there is anything I could do on production servers to get more info what's going on in the code, without affecting call centers that work on the servers?

By: Matt Jordan (mjordan) 2012-02-09 15:54:03.760-0600

We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

In particular, try to get the information leading up to the excessive queuing statements.  Once a local channel goes into that state, we shouldn't need anymore logging statements.



By: Alex (alexrixhardson) 2012-02-22 02:27:09.195-0600

Just to let you know that I have enabled debugging on the production machines which are experiencing this problem but I am still waiting for the issue to occur. Sometimes it takes two or three weeks to repeat, other times only few days. I will post here as soon as the issue repeats.

By: Matt Jordan (mjordan) 2012-03-20 12:37:37.552-0500

Suspended due to lack of activity. Please request a bug marshal in #asterisk-bugs on the IRC network irc.freenode.net to reopen the issue should you have the additional information requested.  Further information can be found at http://www.asterisk.org/developers/bug-guidelines