|Summary:||ASTERISK-08146: [patch] ast_channel_walk_locked or channel_find_locked can "terminate early"|
|Reporter:||Steve Davies (one47)||Labels:|
|Date Opened:||2006-11-16 16:48:21.000-0600||Date Closed:||2007-07-11 19:59:23|
|Environment:||Attachments:||( 0) chan.1.2.15.patch|
( 1) chan.1.2HEAD.patch2
( 2) chan.1.4HEAD.patch
( 3) chan.patch.txt
( 4) chan.patch2.txt
( 5) chan-1.2HEAD.patch.txt
|Description:||'ast_channel_walk_locked' calls 'channel_find_locked' with *prev set|
to the current list position. 'channel_find_locked' then steps forward
through the list by one position, and tries to lock the next channel
list entry. If the lock fails 10 times, then a warning is
printed, and NULL is returned.
Assuming that some code is looping through repeated calls to
'ast_channel_walk_locked' until NULL is returned (I found at least 6
cases of this in the 1.2.13 codebase), returning NULL at this point
implies that the end of the list has been reached, which may not be
true! It is also conceivable that the deadlock will not clear until
some processing on a channel later in the list has been completed.
This processing can now never happen.
I am attaching a patch which causes the deadlocked channel to be
skipped rather than returning NULL in this case.
****** ADDITIONAL INFORMATION ******
Disclaimer to follow.
|Comments:||By: Steve Davies (one47) 2006-11-17 06:06:11.000-0600|
Following a nudge from K Flemming on the -dev list, I have looked more deeply at what chan_find_locked is trying to do, and noticed a couple of things, such as a reference into the channel list after the list lock is freed, and the fact that the channel list scan tests for name/id matches even before the start point specified in 'prev' is reached.
Based on this I have reworked a slightly more significant patch and will attach it now.
(Disclaimer has also been faxed)
By: Kevin P. Fleming (kpfleming) 2006-11-17 14:55:33.000-0600
The comment you made on the list about the name/exten checking only occurring when no 'prev' channel was specified is correct; this behavior had already been corrected in the 1.4 branch, and I've backported it to 1.2.
This has the consequence of also fixing one of your other points here (doing name/exten checking before finding prev->next). I don't see any possible way of referencing the channel list after the lock is freed... if you could point that out it'd be much appreciated.
Sorry for making your patch no longer apply, but it seemed to make sense to bring down the fixes that already existed in the 1.4 branch :-)
By: Steve Davies (one47) 2006-11-17 15:36:13.000-0600
No problem on bringing down the backports. That makes a lot of sense. I will look at the 1.2 BRANCH version and see what is left to patch (I assume that is where I will find it?)
As far as the reference to a list item outside the lock is concerned, I think I meant the printing of the log message at the very end of the function - There is probably only an utterly small, totaly ignorable chance of this causing a problem in reality, but in this case we hold neither the list NOR the channel locks...
By: Steve Davies (one47) 2006-11-17 16:19:47.000-0600
Updated patch to apply to 1.2 HEAD, svn revision 47810.
This version is a lot slimmer than my last effort too.
By: Steve Davies (one47) 2006-11-29 11:21:59.000-0600
So far from my low-call volume testing, I can report no problems. I did manage to cause a deadlock case (using a WiFi phone on a bad connection) and this also caused no problems - In fact the deadlock did not persist, which is unusual.
Of course I have no way of knowing whether it would have persisted without the patch :)
(Please also update to show disclaimer on-file - Thanks.)
By: Edward Eastman (whisk) 2006-12-04 17:04:08.000-0600
this looks like the problem we have, documented here: http://bugs.digium.com/view.php?id=7004
If so, i'm happy to do some testing
By: Steve Davies (one47) 2006-12-13 04:57:16.000-0600
An "Initial Deadlock" (as mentioned in bugid=7004) occurs when the channel_find_locked method is called with a NULL parameter in *prev. This means "lock and give me the head of the channel list". Because the request is very specifically for the list-head, the current patch does not chage the resultant behaviour.
It is POSSIBLE that the patch will allow that lock to clear more readily, but I would assume that getting the list-head will always be necessary before we can ask for a subsequent channel-list entry - If this first channel is deadlocked, we are stuck until a new channel is created (inserted at the top) to un-glue the deadlock.
Perhaps the patch should be extended to find the first lockable channel if prev == NULL ?
By: Steve Davies (one47) 2006-12-13 05:09:09.000-0600
chan.1.2HEAD.patch2 uploaded. This makes the proposed change in my note above so trying to get the channel list-head now gets the first lockable channel instead of failing if the head channel cannot be locked (after 10 retries)
By: Serge Vecher (serge-v) 2006-12-13 08:47:04.000-0600
Whisk: I think you wanted to do some testing ;) ?
By: Joshua C. Colp (jcolp) 2007-01-18 00:00:09.000-0600
Any comments on http://svn.digium.com/svn/asterisk/team/file/chanlist? Revision 51219 specifically.
By: Steve Davies (one47) 2007-01-18 04:15:16.000-0600
Yes, this is 99% of the way there (and is much nicer code :) )but there is one backward step that I would like to point out in case it is important.
Where you loop through the retries 10 times in channel_find_locked, you are not releasing the channel-list lock. It looks like you are only holding a read-lock, but it is possible that the deadlock can only be cleared by allowing another thread to obtain a write-lock (perhaps to hangup and destroy a channel???).
The old code's retry would loop basically to the start of the channel_find_locked function in order to be able to unlock the channel-list (it was even mentioned in a comment IIRC) and I think this was the correct behaviour, even if it is somewhat "heavy". After all, it only happens once a deadlock of some sort has occured.
(As a datapoint, I am running a bristuffed version of 1.2.14 with modified "chan.1.2HEAD.patch2" on a number of live servers, and have seen no issues yet. I cannot positively say it fixes anything though because none of those servers had problems in the first place.)
By: Joshua C. Colp (jcolp) 2007-01-18 10:20:14.000-0600
Done in revision 51223 :D
By: Steve Davies (one47) 2007-01-18 11:47:00.000-0600
In this version, 'c' can become invalid while the channel-list is unlocked.
I assume that the do...while loop condition:
c = AST_RWLIST_NEXT(c, chan_list)
ensures that c is always revalidated (I could not see the definition of the macro.) In which case, this looks pretty good to me.
By: Joshua C. Colp (jcolp) 2007-01-18 11:50:18.000-0600
Would it be possible for you to jump on IRC so we can discuss this in realtime? irc.freenode.net #asterisk-bugs if possible.
By: Marcus Spurkel (mspurkel) 2007-02-06 01:54:53.000-0600
What I'd like to know is what is causing a channel to be locked for so "long" in the first place. We run a call center with a quad card (4 Ts = 96 lines) for incoming calls and about 20 to 40 operators. We have two queues and during busy times the queues can have 30 or more people on hold. I'm on 1.2.14 and I'm seeing a TON of "Avoided (initial) deadlock ..." messages which really means that Asterisk failed to obtain a lock. Granted, that many lines, queues and operators probably makes for some serious channel locking/unlocking but I'm amazed that the failure rate is so high. Often times a doezn or more per minute, EVERY minute (during peak hours)
The result is that the queue erroneously sends calls to operators when they are already on a call. This is because of the "channel_find_locked aborts and returns NULL when it can't obtain a lock" issue which sometimes tricks GROUP_COUNT into thinking that the extension is available.
I'll try applying the HEAD.patch2 and see if that makes a difference. I like the idea of skipping the channel rather than returning NULL, but what if that channel happens to be the one that ast_channel_walk_locked is looking for?
By: Steve Davies (one47) 2007-02-12 05:45:03.000-0600
Looking at 1.2.15, it seems the rewrite of this code has reverted it functionally to the pre- revision 51223 state. ie. the loop will still stop at the first locked channel, even if there is a good match later in the list.
I will upload a patch against 1.2.15 in case it is of use to people. (It compiles, but is as yet untested)
By: Serge Vecher (serge-v) 2007-03-07 11:51:54.000-0600
what's the status with 1.2.16?
By: Steve Davies (one47) 2007-03-08 04:03:22.000-0600
This is merged in 1.2.16, but AFAIK is not present in 1.4.1.
I do not yet use 1.4.1, and am not familiar with the new list walking macros yet. I could try to duplicate the patch for 1.4.1 if that would be helpful.
By: Jared Smith (jsmith) 2007-03-17 22:56:21
Yes, I think it would be helpful for 1.4.1, if it's not too much trouble.
By: Steve Davies (one47) 2007-03-21 16:55:40
Just a note to say that this is not forgotten. Work commitments are slowing me down a little though.
By: Steve Davies (one47) 2007-04-02 05:32:07
As promised, I have finally uploaded a patch for 1.4 (trunk in fact)
This is 100% untested (it does compile) as I have not made the move to 1.4, but I did offer to provide this patch, so...
Feel free to let me know if I've messed something up and I'll do what I can.
By: Zsolt Varga (redax) 2007-04-22 07:53:04
I have the same problem with asterisk-1.2.17,
seems like these patches not in the actual releases...
added manually, will see what happens,
By: Joshua C. Colp (jcolp) 2007-06-11 09:35:47
And this bug goes bye bye... I've put this into 1.2 as of revision 68682, 1.4 as of revision 68683, and trunk as of revision 68684. If it goes kaboom I'll take the heat :D