Summary: | ASTERISK-04165: [patch] better ast_channel_walk_locked() and ast_get_channel_by_name_locked() | ||
Reporter: | Luigi Rizzo (rizzo) | Labels: | |
Date Opened: | 2005-05-14 08:55:55 | Date Closed: | 2008-01-15 15:37:19.000-0600 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) locked.diff ( 1) walk-function | |
Description: | (disclaimer on file) 1) The current implementation of ast_channel_walk_locked() is totally uncommented and has a long section of code repeated twice. 2) ast_get_channel_by_name_locked() is even worse because it loops around ast_channel_walk_locked() resulting in O(N^2) complexity while it could be done in O(N). I consider this part a rather severe performance bug and this is why i list the severity as 'major' 3) Finally, many source files reimplement their own ast_get_channel_by_name_locked() inline, again with O(N^2) complexity. This is also a quite severe performance bug. The attached patch: - rewrites both functions as wrappers around a common function, ast_channel_find_locked(), which covers both cases in O(N) and without code duplication. - provides extensive comments on the behaviour of the function and warns about the potential performance issues. - makes use of the nice interface provided by the *_walk_* function by writing loops as chan = NULL; while ( (chan = ast_channel_walk_locked(chan)) != NULL) { ... body ... } rather than replicating the *walk*() call twice. On passing: - staticise 'backends' and 'channels' in channel.c and add a comment that both lists are protected by 'chlock' - replace magic numbers with a macro in cli.c:handle_debugchan() and friends (I had to touch those function for the *walk* calls); ****** ADDITIONAL INFORMATION ****** The patch touches several files, and unfortunately some pieces of it might have a large offset because i have many other reported but uncommitted changes in my source tree. PLEASE BEAR WITH THIS - i only have limited time to work on this project, and anyone who decides to commit it has to scrutinize it anyways. Note however that With the exception of the main functioni (which i am also providing completely in a separate file for review), changes are really trivial for the most part and easy to apply by hand. NOTE: as noted in the comments in the code, the mechanism used to avoid deadlocks between the global and the individual channel locks is extremely critical with respect to timing and prone to report a locking failure. I don't have a better solution at the moment, but we surely should work on designing one, perhaps involving refcounts if nothing better comes up. | ||
Comments: | By: Tilghman Lesher (tilghman) 2005-05-14 14:25:14 Unfortunately, you've removed the safe locking of the walk function. Consider that for a time, the previous channel is unlocked before the next channel is locked. That means you've introduced a race condition where if the previous is unlocked, then removed (i.e. the channel is hung up) before your function has time to retrieve its position, you could have the channel walk fail (or worse, point to some unrelated structure). It's not critical; your patches could be fixed to keep the safe locking of the current _walk_ functions, but as it stands, you have introduced a race condition. By: Luigi Rizzo (rizzo) 2005-05-14 14:35:40 are you sure about your comment ? because i cannot find where my code fails. my code (see the 'walk-function' file) scans the list fromm the head while holding chlock, and it does not release it until after ast_mutex_trylock(&c->lock). 'prev' is never dereferenced unless we find it on the chain while we hold chlock, which means noone else can delete it. By: Tilghman Lesher (tilghman) 2005-05-14 15:53:27 Yes, it's a potential race condition: Current code: prev->lock held chlock taken next->lock taken prev->lock released chlock released Your code: prev->lock released (not even inside the function!) chlock taken next->lock taken chlock released By: Luigi Rizzo (rizzo) 2005-05-14 16:09:39 the description you give of the old code is incorrect; you enter ast_channel_walk_locked(prev) without holding prev->lock and you never release it inside the function. you have to tell me where you see the behaviour you mention in note 27245, because i see something different, e.g. channel.c::ast_get_channel_by_name_locked() chan = ast_channel_walk_locked(NULL); while(chan) { if (!strcasecmp(chan->name, channame)) return chan; ast_mutex_unlock(&chan->lock); chan = ast_channel_walk_locked(chan); } same lock/unlock sequence in channel.c::ast_parse_device_state(), all the sequences in cli.c and so on. And you don't need to hold prev->lock when you enter ast_channel_walk_locked() because you never dereference prev -- it's just a search key. You do dereference it only when you find a matching entry in the 'channels' list, which as i said happens when you are holding chlock so nobody can delete the entry. please double check and let me know if there is a different segment of code that does as you say. By: Tilghman Lesher (tilghman) 2005-05-14 16:39:30 You're right; I was misinformed. However, it's still a case that we should take into account. Otherwise, ast_channel_walk_locked() may, under certain race conditions, not iterate over all channels in the list, returning NULL early. By: Luigi Rizzo (rizzo) 2005-05-14 17:16:08 As i mentioned in the NOTE in the original post, the locking scheme in use has severe issues that my patch leaves unchanged. Once again, there is not a single comment in the code or documentation telling how the locking is meant to work, so i cannot fix the code until I know what it is meant to do. One possibility would be the introduction of refcounts in the the ast_channel objects, protected by either chan->lock or (better) by 'chlock' so that chan->lock can be left to its intended (i presume) purpose of implementing 'ownership' of the object. Then we would have the following interface, a lot simpler and without risks of deadlock. /* increments the refcount to create a new reference */ ast_channel_addref(chan) { ast_mutex_lock(&chlock); chan->refcount++; ast_mutex_unlock(&chlock); } /* decrement the refcount to return a reference */ ast_channel_remref(chan) { ast_mutex_lock(&chlock); i = --chan->refcount; ast_mutex_unlock(&chlock); if (i == 0) ast_channel_free(chan); /* delete on last ref. */ } /* optimized walk function, returns a new reference */ ast_channel_walk(prev) { ast_mutex_lock(&chlock); if (prev == NULL) /* we want the first item */ c = channels; else { /* make sure 'prev' still on the list */ for (c = channels; c && c != prev; c = c->next) ; if (c) /* yes, so get the next element */ c = c->next; } if (c) /* the object exists, increment the refcount */ c->refcount++; ast_mutex_unlock(&chlock); return c; } ast_channel_find_by_name(name) is similarly simple. It is up to the caller to keep or release the reference on 'prev'. It can release it even before calling ast_channel_walk() because once again, prev is never dereferenced. By: Brian Degenhardt (bmdhacks) 2005-05-14 17:18:48 I may be wrong here but I believe the method mentioned above by Corydon76 is prone to deadlocks: > Current code: > > prev->lock held > chlock taken > next->lock taken > prev->lock released > chlock released Imagine two threads walking the chanel list: One holds next->lock and is blocking on the chlock Other holds prev->lock and chlock and is blocking on next->lock By: Mark Spencer (markster) 2005-05-14 17:48:55 This is largely a duplication of the concepts of astobj. The much more logical approach would be to apply the astobj model to this. Moreover, this is "feature" not "major" as per the bug guidelines, especially since this isn't in and of itself a bug, but a requested performance inhancement. By: Michael Jerris (mikej) 2005-05-14 17:50:35 rizzo. you need to speak to kpfleming about the new (yet to be posted) astobj stuff as it addresses much of the locking issues you are seeing here. By: Luigi Rizzo (rizzo) 2005-05-14 18:09:24 ok reading astobj.h now makes a lot more sense and i agree that applying it ultimately makes sense. I wonder though how extensive are the changes needed, but i guess the only way is try and find out. As to calling this report "feature" or "major", apart from the obvious consideration that 'the boss is always right' :) , in my view there are two issues that are rather serious, namely the deadlock avoidance mechanism that might cause transient failures, coupled with the poor implementation of ast_get_channel_by_name_locked() which in addition to the O(N^2) cost also has to run through the deadlock avoidance mechanism O(N) times, increasing the chance of failures. By: Kevin P. Fleming (kpfleming) 2005-05-14 20:52:21 rizzo and I have spoken off-line about the refcount aspects of his comments, and he has agreed to wait on that subject until I can get to where I'm working on Asterisk full-time and the ASTOBJ stuff can get merged and tested. However, I do see some value in making these changes at this time, since changing the channels/channel list over to ASTOBJ will probably have to wait until we have _really_ shaken out the code in other modules. You comment in ast_channel_find_locked that we need to hold the channel's lock to access c->name, in case it disappears. However, at that moment the code is holding chlock, so the channel cannot disappear, can it? Minor note: functions that are static scope and not intended to be API shouldn't be named 'ast_<foo>', they should just be '<foo>'. I know there are lots of instances of that already, but let's not make it worse... Unless anyone else has objection, I think this patch is pretty much safe to apply. (Corydon, will Mantis let you revert your 'fail' review, since you later learned the code was correct?) By: Tilghman Lesher (tilghman) 2005-05-14 21:21:17 bmdhacks: it's not a deadlock situation, as ast_channel_walk_locked() only tries to get next->lock with ast_mutex_trylock(). If it can't get the lock, there's a failure condition, but not a deadlock. By: Olle Johansson (oej) 2005-06-05 17:01:04 Where are we with this issue? /Housekeeping By: Kevin P. Fleming (kpfleming) 2005-06-05 22:27:06 Committed to CVS HEAD with some slight mods... thanks for the excellent work! By: Digium Subversion (svnbot) 2008-01-15 15:37:19.000-0600 Repository: asterisk Revision: 5853 U trunk/app.c U trunk/apps/app_groupcount.c U trunk/apps/app_setcdruserfield.c U trunk/apps/app_softhangup.c U trunk/apps/app_zapscan.c U trunk/channel.c U trunk/cli.c U trunk/include/asterisk/channel.h U trunk/manager.c U trunk/pbx.c U trunk/res/res_agi.c U trunk/res/res_features.c U trunk/res/res_monitor.c U trunk/res/res_musiconhold.c ------------------------------------------------------------------------ r5853 | kpfleming | 2008-01-15 15:37:18 -0600 (Tue, 15 Jan 2008) | 2 lines more efficient (and understandable) ast_channel_walk_locked, and vastly more efficient ast_channel_by_name_locked (bug ASTERISK-4165) ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=5853 |