
Summary: ASTERISK-13246: [patch] Thread deadlock causes Asterisk to stop routing calls to agents, agents unable to change status
Reporter: Sebastian (sroberts)
Labels:
Date Opened: 2008-12-19 07:16:44.000-0600
Date Closed: 2011-06-07 14:03:01
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Channels/chan_agent
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) 1.4.22.2-agentLock.patch
( 1) 14112.patch
( 2) Asterisk_load_testing.txt
( 3) core_locks1.log
( 4) gdb_threads
( 5) gdb_threads_2
( 6) gdb_threads_3.log
( 7) gdb_threads_4.zip
( 8) locks2.log
( 9) locks3.log
(10) locks4
(11) locks5
(12) locks6
Description: We run several call centres on Asterisk. Our queue servers are using Asterisk 1.4.19 on CentOS 4.6. The busier sites can take anywhere between 5000 and 10000 calls per day.

We are experiencing a problem whereby occasionally Asterisk will stop routing calls to agents. If one opens the console and issues a "show channels" command, the channel list does not finish displaying; only a portion of the channels is shown. The console then becomes unresponsive. Calls continue to build up in the queue, and only restarting Asterisk fixes it.

I have attached a dump of "core show locks". I could not get any further info this time, as Asterisk must generally be restarted immediately when this happens to keep the call centre operating. If you look for Thread id 3006892960 you will see that this thread has locked the list of agents while a whole host of other threads are waiting for this mutex to be released. I ran "core show locks" 4 times about 2 seconds apart before restarting and in each case that lock was still being held with many other threads waiting for it.

The lock in question is in agent_logoff in chan_agent.c. It seems that the list of agent channels is locked; the function then matches the agent channel it is looking for and falls into one of the following two loops:

while (p->owner && ast_channel_trylock(p->owner)) {
   ast_mutex_unlock(&p->lock);
   usleep(1);
   ast_mutex_lock(&p->lock);
}

or

while (p->chan && ast_channel_trylock(p->chan)) {
   ast_mutex_unlock(&p->lock);
   usleep(1);
   ast_mutex_lock(&p->lock);
}


It seems as though it is unable to obtain the lock for p->owner or p->chan, so it keeps looping, trying to get the lock. The problem is that the number of threads waiting for the agent list to be unlocked keeps growing as more and more calls pour into the queue. Obviously, because the list of agents is locked, calls can't be routed.
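
For context, the surrounding structure of agent_logoff looks roughly like the following (a paraphrased sketch rather than verbatim chan_agent.c source); the point is that the agents list stays locked for the entire time the inner loop spins:

AST_LIST_LOCK(&agents);   /* every other thread needing the agent list now queues behind this */
AST_LIST_TRAVERSE(&agents, p, list) {
   ast_mutex_lock(&p->lock);
   if (!strcasecmp(p->agent, agent)) {   /* 'agent' is the agent being logged off */
      /* Spin until the owner channel can be locked.  Note that the agents
         list lock is still held while this sleeps and retries. */
      while (p->owner && ast_channel_trylock(p->owner)) {
         ast_mutex_unlock(&p->lock);
         usleep(1);
         ast_mutex_lock(&p->lock);
      }
      /* ... mark the agent logged off, unlock p->owner ... */
   }
   ast_mutex_unlock(&p->lock);
}
AST_LIST_UNLOCK(&agents);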

This problem seems to happen every 2 to 3 weeks and generally occurs only when the box is receiving a lot of calls.

Any suggestions/advice would be greatly appreciated.





****** ADDITIONAL INFORMATION ******

Asterisk v 1.4.19
Asterisk-addons 1.4.7

Agents and queues are defined in a MySQL database using Asterisk realtime
Comments:

By: Sebastian (sroberts) 2008-12-19 07:28:46.000-0600

I have Asterisk compiled with DEBUG_CHANNEL_LOCKS, DEBUG_THREADS and LOADABLE_MODULES selected. I'm going to try DONT_OPTIMIZE tonight and see if it has any effect.

By: Mark Michelson (mmichelson) 2008-12-23 16:54:16.000-0600

1.4.19 was released over 5 months ago and numerous changes have been introduced since then. Given that you are supporting sites with such high load, I imagine it's not convenient to upgrade to a newer version just to test. However, my gut reaction on seeing this report is to try a newer version if at all possible, to verify whether the problem still exists.

I can see that thread 3007384480 is holding the lock on the channel that all the other threads are trying to get, but that thread isn't actually waiting for a lock itself, so it's unclear where the actual deadlock is just from the core show locks output. If you can attach to Asterisk with gdb when the problem occurs and upload the output of 'thread apply all bt' *and* get core show locks output from that particular failure, then it will be possible to use those two pieces of information to see what thread is holding the channel lock and to see what it is "stuck" doing.

In order to get usable output from gdb, it will be important that DONT_OPTIMIZE is enabled. I also recommend disabling the DEBUG_CHANNEL_LOCKS flag because it actually doesn't really help any for this case.



By: Sebastian (sroberts) 2008-12-24 05:38:44.000-0600

Thanks for the feedback. I certainly will. The only reason I couldn't attach GDB this time is that I only have a very short time window in which I can gather info before I have to restart.

When it happens again, I will try to get the full thread backtrace using GDB.

By: Sebastian (sroberts) 2009-02-25 01:43:39.000-0600

Hi,

So I managed to get the go-ahead and upgraded to 1.4.22. Unfortunately, it seems that there is now a lockup occurring elsewhere.

If you look at threads 3085781920 and 3008134048, it seems to me that an agent channel is being hung up, yet the hangup doesn't complete as it can't get hold of a lock. It then also locks up in agent_devicestate_cb, which locks the agent list.

Now we use call files to log agents in and out and to pause/unpause them. So what happens when one of these deadlocks occurs is that agents start trying to log off and log in again, which causes even more threads to wait on the agent list. I've attached two more "core show locks" dumps, taken about 30 seconds apart, as well as a gdb backtrace of the threads.

Unfortunately, I can't get authorization to compile Asterisk with optimizations off, as my superiors are worried about performance implications. What, if any, are the performance implications of compiling with optimizations off? I'd like to do this, but without some motivation as to why it won't affect performance I can't get it approved.

I have applied the patch from ASTERISK-11692 to one of our servers; hopefully this will solve the issue, but I thought I'd post this as well.



By: snuffy (snuffy) 2009-02-25 04:42:24.000-0600

Basically, DONT_OPTIMIZE shouldn't hurt too much performance-wise, especially on a modern PC (P4 2.4GHz+).

Also, just check the load via top; if it's less than 1.0 you should be fine.

By: Sebastian (sroberts) 2009-03-02 15:49:00.000-0600

Hi, the lockup occurred again today, and on Friday. I managed to get a gdb backtrace, and this Asterisk was compiled with optimizations off. Unfortunately, I didn't get all the threads, as there were more lines than my console window could hold. I didn't get the chance to do it again, as there were already 150 calls stuck in the queue and more pouring in. If it happens again I'll get the full trace.

That being said, it's exactly the same lockup now: always agent_devicestate_cb waiting for a lock which is held by ast_hangup. While it waits, it holds the lock for the list of agents, which prevents calls from being routed to agents.

If you look at gdb_threads_2, it looks like thread 182 is the culprit here. In ast_hangup there is an ast_channel_lock(chan); statement. Now, in __ast_pthread_mutex_lock() there is a do/while loop that will seemingly wait until the lock is acquired. I may be on completely the wrong track here, but is it not possible that two threads are trying to update the agent channel and hang it up at the same time?

The agent_devicestate_cb watcher may be attempting to update the status of an Agent channel that is being hung up at that same moment?



By: Sebastian (sroberts) 2009-03-02 17:07:46.000-0600

At the end of the day, what is really killing me here is the fact that agent_devicestate_cb holds the lock on the list of agents while it waits. This prevents calls being routed to any other agents, which leaves around 300 call centre agents sitting around twiddling their thumbs.

I was therefore wondering whether I could modify agent_devicestate_cb so that at least it won't hold the agent list lock indefinitely. In other words, if it cannot get the lock on an agent channel, it should just carry on iterating through the list. Once it is done traversing the list of agents, if the agent in question has been updated we can finish; otherwise, unlock the list of agents, sleep for a while and then try again. This at least frees the agent list for other threads, meaning that should one agent become locked or stuck waiting, calls can still be routed to the other agents. My idea for changing agent_devicestate_cb would be something as follows:

static int agent_devicestate_cb(const char *dev, int state, void *data)
{
    int res, i;
    int stateUpdated = 0;
    struct agent_pvt *p;
    char basename[AST_CHANNEL_NAME], *tmp;

    /* Skip Agent status */
    if (!strncasecmp(dev, "Agent/", 6)) {
        return 0;
    }

    while (!stateUpdated) {
        /* Try to be safe, but don't block forever on the agents list */
        for (i = 0; i < 10; i++) {
            if ((res = AST_LIST_TRYLOCK(&agents)) == 0) {
                break;
            }
        }
        if (res) {
            return -1;
        }

        AST_LIST_TRAVERSE(&agents, p, list) {
            /* Skip, rather than block on, any agent whose pvt is held elsewhere */
            res = ast_mutex_trylock(&p->lock);
            if (!res) {
                if (p->chan) {
                    ast_copy_string(basename, p->chan->name, sizeof(basename));
                    if ((tmp = strrchr(basename, '-'))) {
                        *tmp = '\0';
                    }
                    if (strcasecmp(p->chan->name, dev) == 0 || strcasecmp(basename, dev) == 0) {
                        p->inherited_devicestate = state;
                        ast_device_state_changed("Agent/%s", p->agent);
                        stateUpdated = 1;
                    }
                }
                ast_mutex_unlock(&p->lock);
            }
        }
        AST_LIST_UNLOCK(&agents);
        if (!stateUpdated) {
            /* Nothing matched or it was skipped; pause briefly before retrying */
            usleep(200);
        }
    }
    return 0;
}

Any feedback on this would be greatly appreciated, as I'm really quite desperate...

By: Sebastian (sroberts) 2009-03-04 04:41:36.000-0600

Hi, please see the attached gdb_threads_3.log (full gdb backtrace, with compiler optimizations off) and locks5 (core show locks output).

By: Sebastian (sroberts) 2009-03-04 08:43:45.000-0600

Hi again,

I think I have this figured out now.

When a hangup occurs, agent_devicestate_cb is called to update the state of the agent. This obtains a lock on the agent list and then begins its traversal of the list, blocking on each agent until it gets that agent's lock - ast_mutex_lock(&p->lock).
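
For reference, the shape of the stock callback being described is roughly the following (a paraphrased sketch based on the behaviour above, not verbatim 1.4 source); the list lock is held across the whole traversal and each agent's pvt lock is taken with a blocking ast_mutex_lock():

static int agent_devicestate_cb(const char *dev, int state, void *data)
{
   struct agent_pvt *p;

   AST_LIST_LOCK(&agents);          /* held for the entire traversal */
   AST_LIST_TRAVERSE(&agents, p, list) {
      ast_mutex_lock(&p->lock);     /* blocks if this agent's pvt is held elsewhere */
      /* ... compare p->chan->name with dev and update p->inherited_devicestate ... */
      ast_mutex_unlock(&p->lock);
   }
   AST_LIST_UNLOCK(&agents);
   return 0;
}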

What happens at the same time is that ast_hangup is being called, and one of the functions in that call stack is local_hangup. Inside local_hangup is a lockup waiting to happen:

while (ast_mutex_trylock(&p->lock)) {
   ast_channel_unlock(ast);
   usleep(1);
   ast_channel_lock(ast);
}

In the case of what happened to me today, this loop now sits waiting for a lock held in ast_indicate_data, while ast_indicate_data is waiting for a lock held by local_hangup.

A quick glance at thread IDs 3004504992 and 2981165984 in locks5 shows this.

Thread 194 in the gdb trace shows this as well.

So while these two threads are waiting for each other to release their respective locks, agent_devicestate_cb waits for this to finish while holding its lock on the agent list. If you look at the locks5 file, you can see hundreds of other threads blocked waiting for the agent list to be released.

Since I haven't had any response and I have to act, I am going to implement the above-mentioned change, as well as change the while loop I mentioned in this post to use the DEADLOCK_AVOIDANCE macro. Hopefully this will at least stop the entire call centre grinding to a halt in the event of such lockups. If it is successful, I will post a patch here in a couple of days' time.
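
For anyone following along: DEADLOCK_AVOIDANCE is the helper macro from include/asterisk/lock.h that packages up the same unlock/sleep/relock back-off used in the hand-rolled loops quoted above. With thread debugging disabled it expands to roughly the following (an approximation; the exact definition differs between versions and when DEBUG_THREADS is enabled):

#define DEADLOCK_AVOIDANCE(lock) \
   do { \
      ast_mutex_unlock(lock); \
      usleep(1); \
      ast_mutex_lock(lock); \
   } while (0)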



By: Mark Michelson (mmichelson) 2009-03-04 09:35:42.000-0600

I think using DEADLOCK_AVOIDANCE inside local_indicate is probably the way to go here. I'll put a patch up here for that.

By: Mark Michelson (mmichelson) 2009-03-04 09:54:00.000-0600

Patch 14112.patch posted.

By: Sebastian (sroberts) 2009-03-04 23:53:12.000-0600

Thanks a mil!

Just applied it now. I'll let you know how it goes.

By: Sebastian (sroberts) 2009-03-09 04:13:31

Hi,

Just had the same lockup happen on one of our servers (see attached locks6 and gdb_threads_4). I've been testing a patch of mine since last week Thursday and it still seems to be working fine, so I'm going to try it out this afternoon.

I've uploaded the patch - 1.4.22.2-agentLock.patch.

By: Sebastian (sroberts) 2009-03-09 11:15:45

OK, so it didn't seem to like my licence. I've filled out the licence again, but it won't let me upload the patch again. I don't want to just rename it, as I don't want multiple attachments for the same thing. How do I delete the last one? Or will it come right when my licence gets accepted?

By: Mark Michelson (mmichelson) 2009-03-09 11:17:00

It should show up correctly once your license is approved. I'm looking forward to seeing it :)

By: David Woolley (davidw) 2009-03-09 13:31:52

I think your latest lockup is actually a duplicate of ASTERISK-12877.

By: Mark Michelson (mmichelson) 2009-03-09 15:00:11

I think davidw is correct about your latest lockup being a duplicate. I'm going to mark the issues as related for now.

By: Sebastian (sroberts) 2009-03-10 00:49:24

Yup, that looks like it. I'll have to apply it this afternoon, but I'll keep you posted

By: Sebastian (sroberts) 2009-03-10 01:49:21

OK, so ignore my patch. It seemed like a good idea at the time, but failed horribly this morning.

By: David Woolley (davidw) 2009-03-12 10:47:15

Is there any chance of having a 1.6 port of 14112.patch?

We've been having deadlocks involving agents and local channels, on 1.6.0.1, which we have definitely matched to issue ASTERISK-12877 and to revision 155861 (no corresponding issue) and we suspect it can't be long before we hit the current issue.

The problem is that they all involve a race condition, so they aren't easy to reproduce.

By: Leif Madsen (lmadsen) 2009-05-20 08:04:09

Any movement on this issue at all?

By: Amilcar S Silvestre (amilcar) 2009-05-23 05:55:11

I've been using 14112.patch for more than a month without any issues (on both 1.4.22 and 1.4.23.1).

By: Ronald Chan (loloski) 2009-06-04 14:33:00

Tested this in production with 15 agents; it works nicely.

By: Sebastian (sroberts) 2009-06-18 08:17:12

I've recently encountered a new deadlock: a call file being hung up locks up in local_hangup (line 515 of chan_local.c).

Now, we change agents' state by using call files (creating local channels which execute dialplan logic). What seems to be happening is that when a call file finishes execution, local_hangup tries to grab the lock on the channel's pvt structure; however, it appears that this channel has already been destroyed, so it sits waiting for a lock which no longer exists. This causes a deadlock and sometimes even a seg-fault. The biggest issue with this problem is that when the deadlock occurs, if I try to run "core show locks" it causes Asterisk's CLI to freeze up completely, stopping all routing of calls.

I have managed to create a setup in our lab environment in which I can recreate this bug. What I'm doing is creating 1000 queue members and assigning them to a queue. I then start pumping calls into the queue at the rate of about 1/sec. I've then created a script which logs in all 1000 agents, waits 60s then unpauses them all, waits 60s then pauses them all, waits another 60s and then unpauses them all and finally after another 60s logs them all out and starts over again. The machine with the queue on it then routes the calls sitting in the queue to another box, which answers and plays tt-monkeys a couple of times before hanging up.

Under a sustained load of about 200-300 concurrent calls this causes Asterisk to deadlock in under 5 minutes. Now, you might think that this is a bit extreme, but our call centres are pushing 10k calls per day and we are getting one or two of these lockups per week; the test just seems to accelerate the occurrence of the deadlock.

Upon further investigation, I realised where the problem was; looking at the GDB stack trace I noticed the following:

#5  0x008e400e in __ast_pthread_mutex_lock (filename=0xb1f2bf80 "chan_local.c", lineno=515, func=0xb1f2bf30 "local_hangup", mutex_name=0xb1f2bee0 "&p->lock", t=0x92db838) at /usr/src/asterisk-1.4.22.2-VITEL2-090611/include/asterisk/lock.h:406
res = -1309491552
canlog = 0
__PRETTY_FUNCTION__ = "__ast_pthread_mutex_lock"
#6  0x008e5e67 in local_hangup (ast=0x9535868) at chan_local.c:549
__func = "local_hangup", '\0' <repeats 48 times>, "5?S\000\001\000\000\000\024`B\000?????@\021\b"
__filename = "chan_local.c\000?S\000\001\000\000\000?-\t????@\021\b0?t\t?U\021\b???Q[\021\b??M\t?\216S\000???\001\000\000\000?-\t?\216S\000????hXS\t"
__mutex_name = "&p->lock", '\0' <repeats 71 times>
__lineno = 515
__res = 0
p = (struct local_pvt *) 0x92db838
isoutbound = 0
f = {frametype = AST_FRAME_CONTROL, subclass = 1, datalen = 0, samples = 0, mallocd = 0, mallocd_hdr_len = 0, offset = 0, src = 0x0, data = 0x0, delivery = {tv_sec = 0, tv_usec = 0}, frame_list = {next = 0x0}, flags = 0, ts = 0, len = 0, seqno = 0}
ochan = (struct ast_channel *) 0x0
glaredetect = 0
res = 0
__PRETTY_FUNCTION__ = "local_hangup"


What I noticed here was that it was attempting to get p->lock at lines 515 AND 549 of local_hangup. Upon examining the function, I noted something.

At the beginning of local_hangup, it acquires a lock on p->lock at line 515:

while (ast_mutex_trylock(&p->lock)) {
   ast_channel_unlock(ast);
   usleep(1);
   ast_channel_lock(ast);
}

It then enters the if (isoutbound) {} else {} section. The thing to note here is that in the if (isoutbound) branch it will unlock p->lock at line 527 and then re-lock it at line 535. This is fine; however, the else block is where I see the problem. The following spin lock attempts to re-lock p->lock:

while (p->chan && ast_channel_trylock(p->chan)) {
   DEADLOCK_AVOIDANCE(&p->lock);
}

However p->lock is already held at this point, so this is the second time it is being locked!

Looking further down, you can see that in the if (!p->owner && !p->chan) block p->lock is released and then the function returns. If it does not enter this block, we have the following:

if (!res)
   ast_mutex_unlock(&p->lock);

What I'm getting at here is that under certain circumstances p->lock is being locked twice and only unlocked once! My guess is that it gets locked twice, and then when another thread needs to obtain that lock it blocks waiting for it, while the channel gets destroyed, leaving it to sit there forever waiting for a non-existent lock.

I have fixed this by commenting out the lines:

while (p->chan && ast_channel_trylock(p->chan)) {
   DEADLOCK_AVOIDANCE(&p->lock);
}

I don't see the need for this, as the lock is already guaranteed to be acquired since the code executed almost immediately prior to it was also a spin lock which grabs the mutex.

With this code removed, I have now been running the same simulation for over 3 hours with no lockups whatsoever at a sustained load of 300 concurrent calls. Bear in mind that with that code in, Asterisk deadlocks in less than 5 minutes.

Now, I've created a patch for this which removes that code (I'm running 1.4.22.2, but I see it is exactly the same even in 1.4.25). I don't see any harm in removing the code, and my testing shows that it should work fine; however, I would like to know whether I will be breaking anything by doing this. As I said, I don't see how, but then again that code must surely be there for a reason? If it does need to be there, is there no better way to do this?

I'm going to apply this patch to one of our quieter servers tonight, while leaving the heavy load test running overnight to see how things work out. I'd appreciate some feedback on this, as I would like to roll it out to our busier servers as soon as I can; these deadlocks are a major headache.

Thanks!

By: Miguel Molina (coolmig) 2009-06-18 23:30:27

That's an interesting queue load test. Could you please share your test scripts? They would be very useful. Looking forward to your results and some developer comments on your findings.

By: Ronald Chan (loloski) 2009-06-19 02:16:57

coolmig,

This is fairly easy to do; even bash can do it. The following adds 10 members to agent_queue; look for the other commands to pause and unpause members.

#!/bin/sh

for i in `seq 1 10`
do
       asterisk -rx "queue add member Sip/$i to agent_queue"
       sleep .5
done

By: Ronald Chan (loloski) 2009-06-19 02:22:47

Well, the attached script is using dynamic members for queues, which I think doesn't involve chan_agent.

By: Sebastian (sroberts) 2009-06-25 02:20:47

coolmig,

I've been a bit busy this week, deadlines and all :( I'll post a doc outlining all the scripts and dialplans I use for testing by Monday or Tuesday next week.

So far the change I've made seems to be working fine on one rather quiet box, as well as on our 4th busiest box. If all goes well today I'm gonna put it on one of the really busy boxes tonight, as it locked up yesterday. I'll post again with some feedback on how it's working out, along with the test scripts etc.

By: Leif Madsen (lmadsen) 2009-06-26 10:29:32

dvossel: assigned to you because this is an RR issue that may be Ready for Review. Let me know!

By: David Vossel (dvossel) 2009-06-26 16:41:45

sroberts, does the 14112.patch seem to resolve the problem, or are you using a different fix? I'm trying to bring this issue to completion, but I'm having a little trouble following the history of this issue.

By: Sebastian (sroberts) 2009-06-27 03:11:00

dvossel,

I am using the 14112 patch and it does not solve the problem. The deadlocks are definitely still occurring in local_hangup though. The fix I proposed above has been working fine for two days in production now with no issues whatsoever. I will be rolling it out to some of our bigger sites next week. I will also try and post my test scripts tomorrow or on Monday.

As a side note (I haven't really had a chance to test this completely), the 14112 patch seems to reduce the frequency of deadlocks in Asterisk. Without it, Asterisk would lock up far more frequently so I think it does solve one issue. I suspect that the problem I'm having is different and related to the number of call files being generated.

By: Sebastian (sroberts) 2009-07-03 04:37:02

Hi,

As per coolmig's request, I've uploaded a doc briefly outlining how I set up my test environment, including scripts for generating call files, changing agent state, etc.

My previous comments on locking in local_hangup were on totally the wrong track. Removing that section of code introduces a definite lockup situation (although, interestingly, Asterisk will lock up less frequently with that code omitted than with it in).

Leaving the following code as it is:

while (p->chan && ast_channel_trylock(p->chan)) {
   DEADLOCK_AVOIDANCE(&p->lock);
}

Allows for the situation whereby p gets destroyed or changed, meaning that the p->lock that Asterisk is trying to grab no longer exists. This leaves this section of code stuck. Of course, the problem with this is that the channel itself is still locked, so tons of threads build up trying to access the channel or parts of it. The solution I'm currently testing is to lock p->chan first, then lock p->lock, and it seems to be working pretty well. I realise there is a risk of p->lock being grabbed while trying to lock p->chan, but I don't know a way around this. You need to get and keep locks on ast, p->lock and p->chan while ensuring that p is not null and p->chan is not null. Not an easy task!
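
To make the idea concrete, the reordering being tested might look something like the sketch below (an illustration of the idea only, not a committed fix, and it carries the caveat above: p and p->chan can still change while p->lock is not held):

/* Take the other leg of the Local channel first ... */
while (p->chan && ast_channel_trylock(p->chan)) {
   /* Back off the lock we do hold (the ast channel) so whoever owns
      p->chan can finish, then retry. */
   ast_channel_unlock(ast);
   usleep(1);
   ast_channel_lock(ast);
}
/* ... and only then take the pvt lock. */
ast_mutex_lock(&p->lock);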

By: David Vossel (dvossel) 2009-07-09 17:31:37

If a lock is attempted on p->lock when it doesn't exist I'd expect Asterisk to crash, not hang.

I've looked through all your threadlock output files and I'm seeing several different issues here.  It appears you have updated since this was reported in 1.4.19, so I'm not entirely sure which of these issues are still prevalent.  I need to know what version you are currently using and what log output is associated with that.  It would help if you could update to the latest 1.4 release and reproduce this.  That way I wouldn't have to determine if the issue has been resolved in a later version or not.

By: Leif Madsen (lmadsen) 2009-07-24 10:19:15

Ping!

Any chance for information from an updated version of Asterisk?

By: Leif Madsen (lmadsen) 2009-07-27 15:21:44

I'm closing this due to lack of response from reporter. However, if the reporter has the information requested, please do feel free to reopen the issue and comment. Thanks!