
Summary: ASTERISK-03771: Agent & queue locks if using local channels
Reporter: Matteo Brancaleoni (mbrancaleoni)
Labels:
Date Opened: 2005-03-25 17:17:15.000-0600
Date Closed: 2011-06-07 14:00:47
Priority: Blocker
Regression?: No
Status: Closed/Complete
Components: Applications/app_queue
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) agent_lock.txt
( 1) agent_lock-20050326.txt
( 2) thread_18_noopt.txt
( 3) thread_18_opt.txt
( 4) threadjustafterlogin.txt
Description: Originating a call (either via the manager or via spool files) to a local channel, then connecting it to an extension that executes a queue with agents, locks up the agents, the queue, and Asterisk itself. There is no CPU load; the whole process simply stops responding, and a kill and restart is required. This happens when calls (to a local channel) are waiting in the queue for an agent to become free.
Happens both on CVS HEAD and CVS STABLE.

****** STEPS TO REPRODUCE ******

Just set up a queue with agents and log in at least one agent. Then generate calls via the manager or a spool file, like:

Channel: Local/number@dialout
Extension: myqueue
Priority: 1
Context: myqueues

Generate calls in order to have them waiting in the queue.
As soon as the agent finishes the first call (hangup by remote, for example), * locks up: agents and queue stop working, and the show agents and show queues commands no longer respond on the * CLI.

Depending on the system, it can take a few calls passed to the agent before it happens, but it happens :) On my bigger box it happens right after the first one.

****** ADDITIONAL INFORMATION ******

gdb output attached.
In this example I logged in as an agent from the console and connected to a local extension, as a test.
The same happened (when I experienced the problem for the first time) on agents with SIP tech and using IAX uplink as outgoing trunks.

What is interesting (perhaps):
chan_agent.c line 728 (agent_hangup): Error releasing mutex: Operation not permitted
when the agent finishes the call.

Then I switched to gdb; the output of thread apply all bt is included, along with a bt full of the thread managing chan_local.

Comments:

By: Matteo Brancaleoni (mbrancaleoni) 2005-03-25 17:22:12.000-0600

just 2 notes:
* happens when generating sync calls and async (from manager) ones
* the log below was compiled with thread debug and docrash

By: Mark Spencer (markster) 2005-03-25 17:55:59.000-0600

Please perform:

rm -rf /usr/lib/asterisk/modules

make clean ; make install

And see if you can duplicate this.  Thanks.

By: Matteo Brancaleoni (mbrancaleoni) 2005-03-26 02:45:58.000-0600

Already done :)
But to be sure I did it again, and also tried on a freshly installed box (no Asterisk there before); the results are exactly the same.
Attached the log.

By: Mark Spencer (markster) 2005-03-26 11:02:40.000-0600

Well perhaps someone can help you debug it.

By: Mark Spencer (markster) 2005-03-26 11:14:46.000-0600

Let's talk on IRC...

By: Matteo Brancaleoni (mbrancaleoni) 2005-03-28 03:54:41.000-0600

OK, together with Mark we found out that something goes wrong when chan_local transfers the call (optimizes itself out of the path). Using a dial string with the '/n' syntax makes it work.

BTW, I did some other tests under gdb and found that the thread holding the agent login "loses" control of the channel when the Local channel does the transfer. As a result, control is not handed back to the agentlogin app at hangup, which blocks the application. (Not the exact words, but this is the idea.)
When using Local with the '/n' flag, that doesn't happen.

See thread_18_opt.txt and thread_18_noopt.txt, where you can see the difference: at hangup, when Local is optimized, thread 18 is dead :/
threadjustafterlogin.txt shows the situation just after agent login.
Thread 18 is the one holding the agentlogin app. When using standard Local, thread 18 loses every reference to the agentlogin app and to the calling channel.
With '/n', everything is as it should be.

edited on: 03-28-05 04:01

By: Matteo Brancaleoni (mbrancaleoni) 2005-04-03 03:39:00

I'm not able to resolve it (I'm not very experienced with thread debugging), but for sure the problem is that app_lock is not released when the channel gets transferred by chan_local to another thread, so at hangup we cannot release it:
chan_agent.c line 729 (agent_hangup): Error releasing mutex: Operation not permitted

Anyone willing to help with this?

By: Matteo Brancaleoni (mbrancaleoni) 2005-04-20 00:55:02

--- reminder, this is still a bug to be fixed ---

By: Jon Gabrielson (gabriels) 2005-05-23 11:58:20

I am also having this same problem.
Here is what showed up in my log at about the same time it froze.

May 23 09:57:01 NOTICE[26090] app_queue.c: Caller was about to talk to agent on Agent/102 but the caller hungup.
May 23 10:01:23 WARNING[26090] channel.c: Avoided initial deadlock for 'Agent/106', 10 retries!
May 23 10:14:39 WARNING[26090] chan_local.c: Local/4107@localext-e1cf,1 wasn't locked while sending 1/35

I also have a gdb dump if anyone wants it, email me at
jonasterisk@directfreight.com and I will forward it.

By: Matthew Fredrickson (mattf) 2005-05-26 16:26:29

Can you try updating to CVS as of this time?  Mark just found a bug in chan_agent.c that was causing a deadlock and could be related to this.  Thanks.

By: Matteo Brancaleoni (mbrancaleoni) 2005-05-27 02:34:06

Hi.

I've just tested the very latest CVS, and the issue still remains.
(Done with a clean install on a freshly installed machine.)

By: Mark Spencer (markster) 2005-05-30 10:29:59

Can you get me an updated dump, please?  Thanks.

By: lters (lters) 2005-06-10 13:17:29

With CVS of 6/02 and with CVS HEAD of 6/10, I get this message:

channel.c: Avoided deadlock for 'Local/122@csr-acd-b454,2', 10 retries!

Here is the dialplan we use:
http://pastebin.ca/13901

The queue is trying to send the call to an agent that has logged in with AgentLoginCallBack.

By: Michael Jerris (mikej) 2005-07-06 11:07:31

We need updated traces on this in order to proceed.  Is anyone able to produce these?

By: outtolunc (outtolunc) 2005-07-07 16:45:51

I was having the same issue, and finally decided to track it down today.

channel.c: Avoided deadlock for 'Local/122@csr-acd-b454,2', 10 retries! With cvs of 6/02 and 6/10 cvs head, I get this message.

In your example above, you can see that the channel name has two '-' characters in it; so did mine. As a quick test I changed all my contexts (in your case, context [csr-acd] needs the '-' removed, giving [csracd], with the dial string changed to match). This quick test worked great: no more '10 retries!' messages, only a handful of the other one.

The problem, and the fix, should be in channel.c:
int ast_parse_device_state(char *device)

I just don't have time to figure it out now.

I hope this helps (it's a fairly painless band-aid).
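
To illustrate why an extra '-' in a context name could matter, here is a small sketch. It rests on an assumption that this thread does not confirm: that the device name is derived from the channel name by cutting at a '-' separator. It does not reproduce the body of ast_parse_device_state; the two helper names below are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the channel name and cut it at the FIRST '-'.
 * With a context such as csr-acd, this also cuts off part of the context. */
void device_cut_first_dash(const char *channel, char *out, size_t outlen)
{
    char *dash;

    if (outlen == 0)
        return;
    snprintf(out, outlen, "%s", channel);
    dash = strchr(out, '-');
    if (dash)
        *dash = '\0';
}

/* Hypothetical helper: copy the channel name and cut it at the LAST '-',
 * so only the trailing unique suffix is removed. */
void device_cut_last_dash(const char *channel, char *out, size_t outlen)
{
    char *dash;

    if (outlen == 0)
        return;
    snprintf(out, outlen, "%s", channel);
    dash = strrchr(out, '-');
    if (dash)
        *dash = '\0';
}
```

For the channel name 'Local/122@csr-acd-b454,2' quoted above, cutting at the first '-' leaves Local/122@csr, while cutting at the last '-' leaves Local/122@csr-acd. Under the stated assumption, only the second matches the device that was actually dialed, which would be consistent with the workaround of removing '-' from context names.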

By: Olle Johansson (oej) 2005-07-31 15:24:06

Is this still an issue that needs to be solved? Any updated files for Mark?

/Housekeeping

By: Mark Spencer (markster) 2005-08-02 23:10:33

I'm suspending this until we get more input.  We've had a great many changes in CVS head to fix some deadlocks related to state changes, so it would be useful to have another backtrace if the problem is still there.