Summary: ASTERISK-05989: Possible locking problem with agents
Reporter: Johann Hoehn (johann)    Labels:
Date Opened: 2006-01-05 13:26:53.000-0600    Date Closed: 2006-05-02 09:57:20
Versions:    Frequency of
Environment:    Attachments: ( 0) asterisk_channel_lockup_gdb_2006-01-14.txt
( 1) asterisk.log
( 2) asterisk.wedge.20060118.2.txt
( 3) asterisk.wedge.backtrace.20060118
( 4) ML060424144650_000000033183798_debug.txt
( 5) sabbathbh_asterisk_agent_lockup_debug_20060203.txt
( 6) sabbathbh_asterisk_agent_lockup_gdb_20060203.txt
Description: The PRI line channels are getting out of sync with our provider.  The problem occurred when we were using 1.0.8 and has just occurred again while using 1.2.1.  In the past we have contacted the PRI provider and had them run diagnostics from their end, and they show everything fine.  Asterisk just stops picking up incoming calls because it thinks they are still in use.  There is no way, short of killing Asterisk and restarting it, to free up those channels either.

Based on various emails on the users list, it seems that others have had the problem (at least with 1.0.x).  We are using SIP channels, callback agents, queues, and blind transfers within Asterisk.  Asterisk's main use is to operate the queues, and this occurs 24/7.

When the error occurred this time, as in the past, using any queue/agent-related command in the CLI would result in no output and would prevent the CLI from giving feedback to any further command.  Prior to using queue/agent-related commands, others such as "show channels" would return output.  However, the returned output would never change: channels that had been hung up were still shown.  The use of "soft hangup" on any of the channels had no effect.

Existing calls over the PRI line continue to function normally; however, new calls are blocked.

Currently the only solution is to pkill asterisk and restart it.  Any suggestions on a possible fix?  Since the problem seems strongly related to queues, agents, and the PRI line (zap channels), I'm not entirely sure where to place it in the bug tracker.  Please move it if another category is more fitting.


Jan  5 13:59:05 WARNING[1253]: chan_zap.c:8360 pri_dchannel: Ring requested on channel 0/2 already in use on span 1.  Hanging up owner.
Jan  5 13:59:11 WARNING[1253]: chan_zap.c:8360 pri_dchannel: Ring requested on channel 0/2 already in use on span 1.  Hanging up owner.
Jan  5 13:59:13 WARNING[1253]: chan_zap.c:8360 pri_dchannel: Ring requested on channel 0/2 already in use on span 1.  Hanging up owner.
Jan  5 13:59:14 WARNING[1253]: chan_zap.c:8360 pri_dchannel: Ring requested on channel 0/5 already in use on span 1.  Hanging up owner.
Jan  5 13:59:15 WARNING[1253]: chan_zap.c:8360 pri_dchannel: Ring requested on channel 0/2 already in use on span 1.  Hanging up owner.
Comments:By: Tilghman Lesher (tilghman) 2006-01-05 13:45:28.000-0600

Please upload a file containing the output of the Asterisk console after doing a "pri intense debug span 1" (or whatever span it is) while the problem is occurring.

By: Matthew Fredrickson (mattf) 2006-01-05 13:54:26.000-0600

Are you sure that your system is not deadlocked on something else?  That's usually what is happening when this occurs.  Also, are you running your system with queues and agents?

By: Matthew Fredrickson (mattf) 2006-01-05 13:55:57.000-0600

Ok, just looked over the rest of the bugnote.  It's probably as I suspected, a bug with queues and agents.   When it locks up, get a backtrace and post it to the bugnote to see if there is any more information about what is occurring.

By: Felix Deierlein (deierlein) 2006-01-05 14:49:22.000-0600


We have the same problem. We are also using Agents and AgentCallbackLogin. We had several Asterisk crashes today :-(.
I have uploaded a log from "pri intense debug". I hope that helps.


Felix Deierlein

By: Matthew Fredrickson (mattf) 2006-01-05 15:26:30.000-0600

No, you don't understand.  I don't think that a PRI log has anything to do with this.  This looks like a bug with agents.  If it locks up, don't post a PRI log, post a gdb `thread apply all bt` to get the state of Asterisk.

By: Felix Deierlein (deierlein) 2006-01-05 15:41:55.000-0600

Okay, understood. The problem is, Asterisk did not really crash. It just did not respond on any channel, but I could log in via asterisk -r. I will monitor that tomorrow.
I have also noticed the following warning, which I did not have before:

Jan  5 23:34:08 WARNING[6371]: channel.c:784 channel_find_locked: Avoided initial deadlock for '0x826d700', 10 retries!

Our Agents log in via AgentCallbackLogin using a normal inbound context. They are called via a macro which simultaneously rings some SIP and PRI channels (we have a Siemens HICOM PBX in use as a cheap phone replacement, but our agents also have a SIP softphone).

By: Mark Spencer (markster) 2006-01-07 17:07:59.000-0600

We need a backtrace of the system when it is stuck in this state.  See http://www.digium.com/index.php?menu=bugguidelines for details...

By: Russell Bryant (russell) 2006-01-08 22:25:29.000-0600

There is also doc/README.backtrace.


By: Johann Hoehn (johann) 2006-01-09 08:05:27.000-0600

I don't believe this is related to bug 6159, as calls waiting in the queue are lost when this occurs.  Anyone waiting in the queue is effectively gone; only calls that were sent to agents before it happened, or channels existing by other means (like SIP-to-SIP calls, etc.), survive.  The usual way we know the problem has occurred is when an agent attempts to log in or log out: they get no feedback at all.  Basically, all queue- and agent-related functions completely stop working.  In our production box the max number of calls in the queue would be 23 due to line restrictions.

I'll attach gdb to the asterisk process when it occurs again.  When we ran 1.0.8, it occurred maybe once or twice a week, although sometimes it would occur a lot sooner.  I've only seen it once while on 1.2.1, but we had other stability issues that might have hidden the problem.

By: Johann Hoehn (johann) 2006-01-18 08:23:17.000-0600

The problem occurred and I attached gdb to it.  However, I was only notified over 30 minutes after it occurred :(  There were 252 threads at the time and I didn't get a backtrace of each of those (needed to get the system back up in production ASAP).  Hopefully this may help.  Next time it occurs, I should have more time to get the needed information.

By: Alan Ferrency (ferrency) 2006-01-18 08:37:48.000-0600

Johann contacted me because we were having this same issue in the 1.0 branch.  Although I thought the issue disappeared in the 1.2 branch, this is not the case: we are experiencing it again, though less frequently.

We also use agentcallbacklogin and queues.  Our agents are always logging in from local SIP phones, and are almost always taking calls exclusively from PRI channels.

Johann and I noticed we have similar hardware specs.  Our specs are:
Fedora Core 3
Motherboard: Supermicro P4SCE, BIOS Rev 1.3
PRI card: Digium T100P

Another thing I'll note: when we had this problem in the 1.0 branch, it was reproducible with at least a 50% success rate whenever our agents performed SIP native call transfers (SIP REFER).  This is not the case in the 1.2 branch.  However, I wonder if there is any other behavior by the SIP agents that this problem correlates with.

I will also paste a backtrace here the next time this happens for us.


Alan Ferrency
pair Networks, Inc.

By: Johann Hoehn (johann) 2006-01-18 09:04:31.000-0600

Hardware information:

Supermicro 5013G-i (http://www.supermicro.com/products/system/1U/5013/SYS-5013G-i.cfm)
BIOS Version: Supermicro P4SGR/P4SGE BIOS Rev. 1.1C
Processor: Pentium 4
Digium Card: T100P

By: Alan Ferrency (ferrency) 2006-01-18 10:37:17.000-0600

I just added text file asterisk.wedge.backtrace.20060118.  This has a bt and bt full for all threads, in the midst of this problem.  The problem had only been going on for a few minutes at the time this backtrace was performed.

Unfortunately I am not currently running an unoptimized version of asterisk; hopefully the backtrace is at least a little bit useful.

This problem is not related to high call volume, in our case.  It can happen when there are only a few calls going on.

By: Alan Ferrency (ferrency) 2006-01-18 10:56:50.000-0600

One more note: The backtrace I posted above was from Asterisk 1.2-beta1, not 1.2.1.  We had unrelated issues upgrading to 1.2.1 and have not revisited this since the post-1.2.1 patch for our problem became available.  Sorry for the inconvenience.

By: Alan Ferrency (ferrency) 2006-01-18 13:46:17.000-0600

I just uploaded a second backtrace of the issue, which again happened today.  This is the last bt I'll submit with optimizations turned on, but there was no zero-call-volume time between the two crashes to restart asterisk with the unoptimized version.

Anyway, after seeing that Asterisk was in a mutex_lock somewhere deep within action_agents in the first backtrace, I asked Johann if they also use the Manager interface, and they do.  My second backtrace also shows a mutex lock within action_agents.

My current working theory is that this is related to the use of the manager interface, specifically action_agents.  I am going to see if it's okay to test without the manager clients running, but this may be difficult in this production service.

By: Alan Ferrency (ferrency) 2006-01-18 15:21:09.000-0600

After looking at the chan_agent.c source code, I believe this is a race condition/deadlock issue, related to the order in which mutex locking is performed in various cases.

In my specific case, I see "action_agents" waiting for a lock in the backtraces I submitted.  That function locks in the following order (comments are mine):

       ast_mutex_lock(&agentlock); // chan_agent lock
       ast_mutex_lock(&s->lock);   // manager session lock
               ast_mutex_lock(&p->lock);  // per-agent lock, one for each agent

However, there are other methods which call manager_event while they are holding an individual agent mutex lock.  One example, which I believe may be the source of our problems, is agent_hangup.  The relevant mutex locks are:

// no agentlock
                        manager_event(EVENT_FLAG_AGENT, "Agentcallbacklogoff",
>       ast_mutex_lock(&sessionlock);
>                       ast_mutex_lock(&s->lock);
>                       ast_mutex_unlock(&s->lock);
>       ast_mutex_unlock(&sessionlock);

The race condition is:
A thread running action_agents is interrupted while it holds the mutexes agentlock and s->lock (the manager session lock).  Then another thread running agent_hangup runs: it grabs the p->lock for an agent, but blocks when it attempts to get the s->lock inside manager_event.  Control passes back to action_agents, which blocks when it attempts to get the lock for the agent that agent_hangup currently has locked: deadlock.

This is consistent with the backtraces I submitted, but it's impossible to tell from those backtraces whether this is the specific problem, because my optimized bt's don't show agent_hangup specifically, though they do show other chan_agent code blocking while waiting for a mutex.  I don't see any action_agents in Johann's code, so it might not be the same problem.

I will go look for other cases where mutex locking may be causing problems.

For the specific case of action_agents vs. agent_hangup, it may be adequate to have agent_hangup get a lock on &agentlock prior to locking p->lock.  This should prevent any execution of agent_hangup until action_agents is complete, which may cause other problems I'm not aware of.


By: Alan Ferrency (ferrency) 2006-01-19 09:35:34.000-0600

I apologize: the problem I spotted yesterday was fixed in chan_agent.c in September.  I'll spend more time looking at newer code from now on.

By: Leonardo Gomes Figueira (sabbathbh) 2006-02-03 10:43:07.000-0600


I just uploaded 2 files:

sabbathbh_asterisk_agent_lockup_gdb_20060203.txt GDB
sabbathbh_asterisk_agent_lockup_debug_20060203.txt Asterisk Log with debug enabled.

We have this problem with Agents and Queues locking up very often. We use the manager interface to control the Agents (Originate, Redirect, QueueStatus). We are using Agents logged in with MOH.

I can't duplicate the bug by doing something specific; it just locks up after several calls with Agents (Queue to Agent, Agent to Zap, Agent to SIP, Agent to Agent). I couldn't find any pattern in the lockups: sometimes it takes hours, other times only a few minutes.

We are using Asterisk 1.0.10 and planning the upgrade to 1.2, but it seems like the bug exists in 1.2 too, since it was reported by a 1.2 user.

Can I help with more information?



By: Serge Vecher (serge-v) 2006-02-03 11:24:27.000-0600

sabbathbh: if you read ferrency's notes carefully, you will notice that the problem was with Asterisk 1.2-beta1. Please update to 1.2.4 and report if there is a problem.

By: Mark Edwards (edwar64896) 2006-02-09 05:20:56.000-0600

Am experiencing similar behaviour on my customer's PBX. We have a 10-channel E1 PRI in an outbound call centre: Agents, Queues, and SIP devices. Over time, the agents start to experience CHANUNAVAIL on outbound dialing. Also, when the PRI resets periodically, not all channels are getting reset. An attempt to dial out on any of the channels that are not being reset results in congestion. It's almost as if the channel hasn't been hung up. If I restart Asterisk, then all is fine. This is a pretty serious issue at the moment, as the problem is occurring half-way through a shift and we have to kick any remaining agents off the system to reset it. We're currently on Asterisk 1.2.4 and Zapata 1.2.3. In our configuration, the queue itself dials out (using a Local channel via the AMI).

Will monitor issue and attempt to provide feedback to assist with resolution.

By: Bartosz Jozwiak (bartekj) 2006-02-09 17:30:29.000-0600

Hi guys,

We have a very similar problem running Asterisk 1.2.4 with Zaptel 1.2.3.
We're using a T1 card with FXO signaling to our channel bank, to which
analog phones are connected.

WARNING[4225]: channel.c:787 channel_find_locked: Avoided deadlock for '0x9655d18', 10 retries!

Almost every day we are experiencing a deadlock on ZAP channels (one or two per day). It usually happens while we have a call waiting on an active ZAP channel: pressing the flash button to pick up the call waiting, then pressing the flash button again to retrieve the call on hold, is followed by a deadlock. Afterwards (though not always) the ZAP channel is not usable anymore until we restart Asterisk.

By: Mark Edwards (edwar64896) 2006-02-09 18:40:52.000-0600

When debugging this issue, I am noticing that p->call (PRI structure) remains set on channels that are 'locked' (i.e. exhibiting this behaviour). This flag remains set with no way to clear it. Resetting the PRI span does not help because any channel with p->call=true is ignored. chan_zap.c:available returns false for any zap channel with p->call=true, which might also explain why the channel indicates CHANUNAVAIL when trying to dial out through it.

The question for me is: what is causing the PRI "Call" flag not to clear on hangup?

Still looking...

By: Glenn Stephens (condor) 2006-02-10 01:53:29.000-0600

Reading these comments and looking at our systems, I suspect this is either more channel-generic or Agent/Queue-related, but triggered by high system loads, especially where there are a lot of hardware interrupts.

Unfortunately we have only experienced this on production servers; we have been unable to replicate it on our test servers.

Long version:

We have a two-server setup, both running Asterisk 1.2.4 on P4 3.0GHz, Fedora Core 3 machines. One server with a PRI (Digium 405P) answers calls and forwards them via IAX to a second server; the second server handles the SIP clients, Agents and queues, monitor, hold music... The second server also has a Voicetronix Openswitch12 card, but there is only one line in sporadic use.

Over the last couple of weeks we have had several times where the call load has been higher than usual and we get very similar symptoms to those mentioned throughout this report: CLI channel list static, agent/queue commands non-responsive, no new IAX/SIP channels created, sometimes calls actually dropped. The only cure seems to be to restart the server. But it only happens on the second server, NOT the server with the PRI cards.

We have also seen instances when the CLI becomes unresponsive for a few minutes, some calls are dropped, and no new channels are created for a time, but most calls that were already in progress are unaffected; then the system load suddenly drops for no apparent reason, new CLI sessions respond, and new calls are taken. (I had already closed the old CLI sessions.)

The problem _seems_ to be triggered by a high system load, a loadavg of around 2-3. While there have been instances of problems without calls appearing on the Voicetronix interfaces, many of the problem periods were preceded by calls on the Voicetronix channels.

We have noticed that the second server, even with 0 channels in use, was running with a loadavg of 0.3-0.5, where the first server has 0.00-0.02; both machines have identical hardware and software[1] except the Voicetronix/Digium cards. Using top, the server with the Digium card also has a consistent 3-4% hi CPU value, whereas the Voicetronix server has 20-30% even at 0 channels. I strongly suspect the hi CPU is time spent on hardware interrupts, but I've been unable to confirm that this is what top is telling me.

This morning we disabled the Voicetronix interfaces and unloaded the kernel module; the machine is now running much lower load averages, with 0.1% hi CPU occasionally (usually 0.0%), and we have not had any crashes today, even when the call loads get up to the same as yesterday. (2 crashes yesterday, 1 the day before, 2 the day before that.)

So while this bug report suggests that the problem is not fixed in 1.2.4 (we saw symptoms with 1.2.4), we have been able to remove the main trigger for it: lots of hardware interrupts. Hopefully this will give the people who understand the channel/agent/queue code more help in finding the problem.

[1] both machines were purchased and installed at the same time, basically the purchase order said 2x everything from the one vendor.

By: j (j) 2006-02-28 08:40:54.000-0600

I also have this exact problem.
We've had it for quite a while. If the above is correct, it might be due to too many channels being thrown around while agents take calls from queues and transfer said calls around.

We have a setup where an agent will get a call from a queue and possibly transfer that call back into another queue. The transfer is initiated by a button on the phone, not using Asterisk's '#' function within the Dial command.

 The problem occurs often, usually more than twice a day on normal days, or much much more on busy days.

 If there is anything, anything at all I can do to help solve this bug, *please* let me know.

By: Trevor Hammonds (trevmeister) 2006-03-14 10:41:13.000-0600

I am experiencing this issue, too -- at least once per day.  Running latest 1.2 branch (updated last night).  Has there been any progress on this issue?

Mar 14 07:08:14 WARNING[4938]: chan_zap.c:8392 pri_dchannel: Ring requested on channel 0/23 already in use on span 2.  Hanging up owner.

By: Alan Ferrency (ferrency) 2006-03-14 10:53:52.000-0600

Important Notes:

For everyone who is experiencing the symptom of "pri_dchannel: Ring requested on channel 0/23 already in use" errors in the Asterisk messages log: this is most likely a red herring.  The problem is most often not in the PRI/zaptel system.  In all cases I know of, which have been found and fixed already, this is a consequence of a mutex deadlock happening in code unrelated to the PRI system.

What this means to you:

- Please upgrade to the latest Asterisk release (1.2.5 at the time of writing).  Deadlock bugs are known to exist in Asterisk versions at least up to 1.2.0.
- If your problem persists, please follow the instructions, found elsewhere on this bug report, for producing a complete backtrace of your running Asterisk when it is experiencing this problem.  
- If you can rebuild Asterisk with "valgrind" turned on (see links earlier in this ticket) that will greatly assist debugging the backtraces.

Both my problems and Johann's (the original bug submitter's) seem to have gone away after upgrading to Asterisk version 1.2.3 or later.  I am glad to have a much greater understanding of the general problem now, which will help me debug it if it crops up again.  I will happily look into the code if/when someone can produce a backtrace of a failure in adequately new code.


By: j (j) 2006-03-17 08:25:59.000-0600

I just wanted to confirm that I am indeed having the same issue with asterisk 1.2.5 and all the latest software.
We did an upgrade last night and we got the issue this morning.

 I'll try to follow the debugging steps outlined to get a nice dump/backtrace.
 Is there a way to force asterisk to segfault via the CLI or command line in order to generate a core file?


By: Leonardo Gomes Figueira (sabbathbh) 2006-03-20 14:00:02.000-0600


I upgraded to Asterisk 1.2.5 and the problem still happens.

The server has no Zaptel cards (so there is no relation to PRI).

It happens at random, even when there is only 1 call (1 SIP caller to a queue; when the queue tries to ring the Agent, it locks). So I think the problem is not related to high call volume.

The only pattern I could find is that it happens pretty fast when I generate calls to the Agent channel.

Sample output after the deadlock:

   -- Executing Dial("SIP/3052-1028", "Agent/8031") in new stack
gateway*CLI> show channels
Channel              Location             State   Application(Data)
Agent/8031           s@3050_in:1          Down    (None)
Local/3050@3050_in-c 3050@3050_in:1       Ring    (None)
Local/3050@3050_in-c s@3050_in:1          Down    (None)
SIP/3052-1028        8031@3052_out:1      Ring    Dial(Agent/8031)
4 active channels
1 active call
gateway*CLI> show agents

I will provide a backtrace soon.


By: jgoerzen2 (jgoerzen2) 2006-03-23 10:20:51.000-0600

FWIW, we are also seeing messages like:

Mar 23 10:14:42 WARNING[17104]: channel.c:787 channel_find_locked: Avoided initial deadlock for '0x8640228', 10 retries!
Mar 23 10:14:42 WARNING[17104]: channel.c:787 channel_find_locked: Avoided initial deadlock for '0x8640228', 10 retries!


Don't know what to do if second ROSE component is of type 0x6
Don't know what to do if second ROSE component is of type 0x6
Don't know what to do if second ROSE component is of type 0x6
Mar 23 10:16:47 WARNING[17119]: chan_zap.c:8514 pri_dchannel: Ring requested on channel 0/4 already in use on span 1.  Hanging up owner.
Don't know what to do if second ROSE component is of type 0x6
Mar 23 10:16:57 WARNING[17119]: chan_zap.c:8514 pri_dchannel: Ring requested on channel 0/4 already in use on span 1.  Hanging up owner.

We are not using agents.  We are using queues, but they are low volume.

By: Alan Ferrency (ferrency) 2006-03-24 10:39:19.000-0600

A short update for those who need assistance debugging deadlock issues:

The previous links in this ticket describing how to debug Asterisk seem to be broken after recent site renovations.  Here is a link on voip-info.org which contains the most important information:


By: Jesus Mogollon (globalip) 2006-04-07 15:27:01


 We are experiencing exactly the same problem. We have a combination of static queue members and remote agents using AgentCallbackLogin. Every couple of days we start getting messages like:

pri_dchannel: Ring requested on channel 0/2 already in use on span 1. Hanging up owner.


channel_find_locked: Avoided initial deadlock for '0x826d700', 10 retries!

Incoming calls start to drop because Asterisk thinks these channels are being used. We are using the latest stable release (1.2.6). Next time it happens, I'll generate the debugging information as described by others in this thread.

By: David J Craigon (superdjc) 2006-04-10 11:22:01

We too have had the same problem for some time.

In our system the problem only occurs when people log in or log out. I have caught Asterisk in a state, before the queue module has crashed, where I can't do "show queue status" in the Asterisk console.

We've reduced how often people log in and out here, and this has helped our problem.

As for the speculation about load: we've had it do this with only one call on the system.

By: caspy (caspy) 2006-04-20 03:22:23

We have absolutely the same problem.
Asterisk 1.2.7: incoming calls from the PRI go to a queue, and agents (logged in with AgentCallBackLogin) take the calls.

By: Alex Richardson (alexrch) 2006-04-24 12:43:07

We're having the very same problem. First we get the "channel_find_locked: Avoided initial deadlock for '0x826d700', 10 retries!" messages, then the "Don't know what to do if second ROSE component is of type 0x6" messages, and finally, with the message "We're not yet handling hanging up when our state is XXXXXXXX, contact support@digium.com, ourstate Unknown, peerstate Unknown", Asterisk stops handling incoming or outgoing calls or even responding to CLI commands, and a restart is needed.

In our case Asterisk is mostly handling toll-free calls, which are then handled by a number of queues in which agents log in using AgentCallBackLogin. Because of the size of the call centre, the AgentCallBackLogin command is executed frequently, and the volume of incoming calls is quite high.

By: caspy (caspy) 2006-05-02 01:46:07

I have rewritten the agents in the dialplan using Local/ channels (without using chan_agent): the same trouble. After 4 days of working, * died.
I restarted * without loading chan_agent.so at all; I will keep watching it.

By: Serge Vecher (serge-v) 2006-05-02 09:38:30

caspy: if the problem persists, please do not forget to get a backtrace from a non-optimized build.

alexrch: your problem may be different. I suggest that you contact Digium Technical Support at support@digium.com as indicated in the console messages.

By: Matthew Fredrickson (mattf) 2006-05-02 09:56:40

Gentlemen, I don't believe this bugnote is going anywhere, and I don't believe that all of the issues mentioned here are related.  I am going to close this bugnote; if you have any more issues, please open an individual bug note for your particular problem.