
Summary: ASTERISK-09337: When Using Multiple Queues, Agent Status Gets Stuck As "Busy" In Select Queues
Reporter: jiddings (jiddings)
Labels:
Date Opened: 2007-04-27 16:04:47
Date Closed: 2007-07-06 10:11:49
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Applications/app_queue
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) backtrace.txt, (1) deadlock.txt
Description:

Scenario:

Agent/1000 is logged in using AgentLogin() via a SIP phone. Agent/1000 is a member of 15 queues, all using "leastrecent" as the strategy and "ringinuse=no". Agent/1000 will randomly stop getting calls for select queues, and is shown as "busy" in queue X but available in queue Y. The busy state for queue X will not change until the agent logs out and back in again.

****** ADDITIONAL INFORMATION ******

Per the documentation, ringinuse for SIP devices requires call-limit; as far as I'm aware, I am using that properly.

sip.conf

[general]
limitonpeers = yes

[1101]
username=1101
type=friend
secret=xxxx
qualify=yes
nat=never
host=dynamic
context=trusted
canreinvite=no
callerid=Client Centre A <1101>
call-limit=10
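
For reference, a queue definition matching the scenario described might look like the sketch below. The queue name, wrapuptime value, and member line are illustrative assumptions, not taken from the reporter's actual configuration:

```
; queues.conf -- illustrative sketch only
[QUEUE1]
strategy = leastrecent   ; ring the member who has been idle longest
ringinuse = no           ; skip members already on a call
wrapuptime = 5           ; seconds of rest after each answered call
member => Agent/1000
```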
Comments:

By: jiddings (jiddings) 2007-05-02 09:26:23

First of all, I must have been sleeping when I made this bug report. I mentioned call-limit, but that has nothing to do with AgentLogin(); it only applies to the callback login function.

The actual problem is much more consistent: agents' statuses are getting stuck as busy. Sometimes it clears itself up, but sometimes the agent must log out.

By: jiddings (jiddings) 2007-05-02 10:34:23

http://pastebin.ca/467489

This link demonstrates the behavior I'm describing. You'll see agents available, idle, and ready to take calls for other queues, but not available for "QUEUE1".

A couple of notes on this example: QUEUE1 is weighted lower than QUEUE2 and QUEUE3. Interestingly, QUEUE1 does seem to be the queue that locks up the most; could this have something to do with weights? Then again, QUEUE1 would also be the most active of all the queues.

Or could it be that the wrapuptime is being ignored or misinterpreted?

By: jiddings (jiddings) 2007-05-02 16:14:03

I was able to reproduce this multiple times on every 1.4 server I tried it on.

On each server I set up a basic queue with one agent logged in via AgentLogin(). I then dialed into the queue multiple times (via Zap and SIP) and hung up after a few seconds each time. After a random number of calls, the agent status would get stuck as busy, but 'agents show online' would show the agent as idle.

We've been able to reproduce this error on CentOS 4.4 and CentOS 5, using the latest default kernel for both versions.
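
The reproduction steps above amount to a dialplan along these lines. The extension numbers and context name are illustrative assumptions, not from the reporter's setup:

```
; extensions.conf -- illustrative sketch of the reproduction setup
[trusted]
; the agent logs in from their phone
exten => 700,1,AgentLogin()

; test callers dial in, reach the queue, and hang up after a few seconds
exten => 701,1,Answer()
exten => 701,n,Queue(QUEUE1)
```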



By: jiddings (jiddings) 2007-05-02 17:45:27

Below is the console debug information when the problem occurs... I hope, hope, hope this will give you the clue you need.

[May  2 18:34:32] ERROR[5106]: /usr/src/asterisk-1.4-svn/include/asterisk/lock.h:244 __ast_pthread_mutex_lock: chan_agent.c line 2095 (__login_exec): Deadlock? waited 10 sec for mutex '&p->app_lock'?
[May  2 18:34:32] ERROR[5106]: /usr/src/asterisk-1.4-svn/include/asterisk/lock.h:247 __ast_pthread_mutex_lock: chan_agent.c line 971 (agent_new): '&p->app_lock' was locked here.
(the two messages above repeat several more times with the same timestamp)

By: jiddings (jiddings) 2007-05-03 13:33:46

A bit more info..

This problem does not seem to exist with Asterisk 1.2, though more extensive testing would be needed to verify that with certainty.

I set up Asterisk 1.4-svn on a Fedora Core 6 test box with the same debug flags as my CentOS 4.4 box. The behavior on the FC6 box is exactly the same as on the CentOS box, but we get a different error, not a deadlock. The error generated on the FC6 box is:

"WARNING[622]: channel.c:2081 ast_waitfordigit_full: Unexpected control subclass '17'"

Interestingly enough, it does deadlock on a CentOS 5 installation as well, which is based mostly on FC6.

The battle continues.

By: jiddings (jiddings) 2007-05-03 14:00:29

I've added a backtrace which I snagged during the...

"WARNING[622]: channel.c:2081 ast_waitfordigit_full: Unexpected control subclass '17'"

... errors on the Fedora Core 6 installation.

By: Roberto Lopes (rflopes3) 2007-06-06 20:20:05

1.2 still has this issue, but only under really heavy load.

By: Roberto Lopes (rflopes3) 2007-06-06 20:22:33

Also, I found something that may be useful. When the wrapuptime values in agents.conf and queues.conf are equal, if the agent is still in wrapup and no other agent is free, the agent gets stuck Busy until a 'reload app_queue.so' is done. I solved this by setting the agent's wrapuptime 1 second lower than the wrapuptime in queues.conf.
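
The workaround above might look like the following sketch. The specific values are illustrative assumptions; note that agents.conf takes wrapuptime in milliseconds while queues.conf takes it in seconds:

```
; agents.conf -- illustrative values only
[agents]
wrapuptime=4000        ; milliseconds: 1 second lower than the queue's value

; queues.conf
[QUEUE1]
wrapuptime = 5         ; seconds
```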

By: Eric Kreckel (ekreckel) 2007-06-22 15:19:03

I too am having the same problem. I have figured out a workaround hack that will keep the agents' status up to date without having them log out and back in to clear the busy lock.

Make sure agents.conf has:

persistentagents=yes

Issuing the command 'asterisk -rx "module reload chan_agent.so"' will reinitialize their status and channel bindings.

To make this happen automatically, I have added the following to my /etc/crontab file:

0-59 8-21 * * 1-5 root asterisk -rx "module reload chan_agent.so" > /dev/null

This makes my server keep the agent's status refreshed during the operating hours of our call center.

By: Dmitry Andrianov (dimas) 2007-06-29 15:16:41

jiddings,
First, can you please confirm that "the console debug information when the problem occurs" actually gets printed only when the problem really occurs?
As far as I can tell from the sources, this "error" gets printed any time one thread waits for a mutex for more than 5 seconds, and the message is repeated every 5 seconds.
One of chan_agent's mutexes is kept locked for the whole duration of a call while the thread handling the AgentLogin application tries to lock it (generating the message).
That is why I do not see any problem with this particular error message, and I do not think it relates to the problem at all.

(Well, to me, a condition variable should be used there instead of a mutex to prevent that error-message flood, but that is a completely different story.)

Second, can you set the debug level to at least 3 (with 'core set debug 3') and wait until the problem reproduces? I'm interested in seeing whether app_queue properly sees devstate changes for your channels or not.



By: Digium Subversion (svnbot) 2007-07-06 10:10:01

Repository: asterisk
Revision: 73674

------------------------------------------------------------------------
r73674 | mmichelson | 2007-07-06 10:10:00 -0500 (Fri, 06 Jul 2007) | 5 lines

Fixed a bug wherein agents get stuck busy. (issue 9618, reported by jiddings, patched by moi)

closes issue ASTERISK-9337


------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-07-06 10:10:50

Repository: asterisk
Revision: 73675

------------------------------------------------------------------------
r73675 | mmichelson | 2007-07-06 10:10:49 -0500 (Fri, 06 Jul 2007) | 13 lines

Merged revisions 73674 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.2

........
r73674 | mmichelson | 2007-07-06 10:26:40 -0500 (Fri, 06 Jul 2007) | 5 lines

Fixed a bug wherein agents get stuck busy. (issue 9618, reported by jiddings, patched by moi)

closes issue ASTERISK-9337


........

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-07-06 10:11:49

Repository: asterisk
Revision: 73676

------------------------------------------------------------------------
r73676 | mmichelson | 2007-07-06 10:11:48 -0500 (Fri, 06 Jul 2007) | 21 lines

Merged revisions 73675 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

................
r73675 | mmichelson | 2007-07-06 10:27:28 -0500 (Fri, 06 Jul 2007) | 13 lines

Merged revisions 73674 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.2

........
r73674 | mmichelson | 2007-07-06 10:26:40 -0500 (Fri, 06 Jul 2007) | 5 lines

Fixed a bug wherein agents get stuck busy. (issue 9618, reported by jiddings, patched by moi)

closes issue ASTERISK-9337


........

................

------------------------------------------------------------------------