[Home]

Summary:ASTERISK-04165: [patch] better ast_channel_walk_locked() and ast_get_channel_by_name_locked()
Reporter:Luigi Rizzo (rizzo)Labels:
Date Opened:2005-05-14 08:55:55Date Closed:2008-01-15 15:37:19.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) locked.diff
( 1) walk-function
Description:(disclaimer on file)

1) The current implementation of ast_channel_walk_locked() is totally
  uncommented and has a long section of code repeated twice.
2) ast_get_channel_by_name_locked() is even worse because it
  loops around ast_channel_walk_locked() resulting in O(N^2)
  complexity while it could be done in O(N).
  I consider this part a rather severe performance bug and this is why
  i list the severity as 'major'
3) Finally, many source files reimplement their own
  ast_get_channel_by_name_locked() inline, again with O(N^2) complexity.
  This is also a quite severe performance bug.

The attached patch:
- rewrites both functions as wrappers around a common function,
  ast_channel_find_locked(), which covers both cases in O(N)
  and without code duplication.
- provides extensive comments on the behaviour of the function
  and warns about the potential performance issues.
- makes use of the nice interface provided by the *_walk_* function by
  writing loops as
         chan = NULL;
         while ( (chan = ast_channel_walk_locked(chan)) != NULL) {
               ... body ...
         }
  rather than replicating the *walk*() call twice.

On passing:
 - staticise 'backends' and 'channels' in channel.c and add a comment that
   both lists are protected by 'chlock'
 - replace magic numbers with a macro in cli.c:handle_debugchan()
   and friends (I had to touch those function for the *walk* calls);
         

****** ADDITIONAL INFORMATION ******

The patch touches several files, and unfortunately some pieces of
it might have a large offset because i have many other reported but
uncommitted changes in my source tree. PLEASE BEAR WITH THIS - i only
have limited time to work on this project, and anyone who decides to commit
it has to scrutinize it anyways. Note however that
With the exception of the main functioni (which i am also providing
completely in a separate file for review), changes
are really trivial for the most part and easy to apply by hand.

NOTE: as noted in the comments in the code, the mechanism used to avoid
deadlocks between the global and the individual channel locks
is extremely critical with respect to timing and prone to report
a locking failure.
I don't have a better solution at the moment, but we surely should work
on designing one, perhaps involving refcounts if nothing better comes up.
Comments:By: Tilghman Lesher (tilghman) 2005-05-14 14:25:14

Unfortunately, you've removed the safe locking of the walk function.  Consider that for a time, the previous channel is unlocked before the next channel is locked.  That means you've introduced a race condition where if the previous is unlocked, then removed (i.e. the channel is hung up) before your function has time to retrieve its position, you could have the channel walk fail (or worse, point to some unrelated structure).

It's not critical; your patches could be fixed to keep the safe locking of the current _walk_ functions, but as it stands, you have introduced a race condition.

By: Luigi Rizzo (rizzo) 2005-05-14 14:35:40

are you sure about your comment ? because i cannot find where my code fails.

my code (see the 'walk-function' file) scans the list fromm
the head while holding chlock, and it does not release it
until after ast_mutex_trylock(&c->lock).
'prev' is never dereferenced unless we find it on the chain
while we hold chlock, which means noone else can delete it.

By: Tilghman Lesher (tilghman) 2005-05-14 15:53:27

Yes, it's a potential race condition:

Current code:

prev->lock held
chlock taken
next->lock taken
prev->lock released
chlock released

Your code:
prev->lock released (not even inside the function!)
chlock taken
next->lock taken
chlock released

By: Luigi Rizzo (rizzo) 2005-05-14 16:09:39

the description you give of the old code is incorrect;
you enter ast_channel_walk_locked(prev) without holding prev->lock
and you never release it inside the function.

you have to tell me where you see the behaviour you mention in note 27245, because i see something different, e.g.
channel.c::ast_get_channel_by_name_locked()


       chan = ast_channel_walk_locked(NULL);
       while(chan) {
               if (!strcasecmp(chan->name, channame))
                       return chan;
               ast_mutex_unlock(&chan->lock);
               chan = ast_channel_walk_locked(chan);
       }

same lock/unlock sequence in channel.c::ast_parse_device_state(),
all the sequences in cli.c and so on.
And you don't need to hold prev->lock when you enter ast_channel_walk_locked()
because you never dereference prev -- it's just a search key.
You do dereference it only when you find a matching entry in the
'channels' list, which as i said happens when you are holding
chlock so nobody can delete the entry.

please double check and let me know if there is a different segment
of code that does as you say.

By: Tilghman Lesher (tilghman) 2005-05-14 16:39:30

You're right; I was misinformed.  However, it's still a case that we should take into account.  Otherwise, ast_channel_walk_locked() may, under certain race conditions, not iterate over all channels in the list, returning NULL early.

By: Luigi Rizzo (rizzo) 2005-05-14 17:16:08

As i mentioned in the NOTE in the original post, the locking scheme
in use has severe issues that my patch leaves unchanged.

Once again, there is not a single comment in the code or documentation
telling how the locking is meant to work, so i cannot fix the code
until I know what it is meant to do.

One possibility would be the introduction of refcounts in the
the ast_channel objects, protected by either chan->lock        
or (better) by 'chlock' so that chan->lock can be left to
its intended (i presume) purpose of implementing 'ownership'
of the object.

Then we would have the following interface, a lot simpler and
without risks of deadlock.

   /* increments the refcount to create a new reference */
   ast_channel_addref(chan) {
       ast_mutex_lock(&chlock);
       chan->refcount++;
       ast_mutex_unlock(&chlock);
   }

   /* decrement the refcount to return a reference */
   ast_channel_remref(chan) {
       ast_mutex_lock(&chlock);
       i = --chan->refcount;
       ast_mutex_unlock(&chlock);
       if (i == 0)  
           ast_channel_free(chan); /* delete on last ref. */
   }

   /* optimized walk function, returns a new reference */
   ast_channel_walk(prev) {
       ast_mutex_lock(&chlock);
       if (prev == NULL) /* we want the first item */
           c = channels;
       else {
           /* make sure 'prev' still on the list */
           for (c = channels; c && c != prev; c = c->next)  
               ;
           if (c) /* yes, so get the next element */
               c = c->next;
       }
       if (c)  /* the object exists, increment the refcount */
           c->refcount++;
       ast_mutex_unlock(&chlock);
       return c;
   }      

ast_channel_find_by_name(name) is similarly simple.

It is up to the caller to keep or release the reference on 'prev'.
It can release it even before calling ast_channel_walk() because
once again, prev is never dereferenced.

By: Brian Degenhardt (bmdhacks) 2005-05-14 17:18:48

I may be wrong here but I believe the method mentioned above by Corydon76 is prone to deadlocks:

> Current code:
>
> prev->lock held
> chlock taken
> next->lock taken
> prev->lock released
> chlock released

Imagine two threads walking the chanel list:
One holds next->lock and is blocking on the chlock
Other holds prev->lock and chlock and is blocking on next->lock



By: Mark Spencer (markster) 2005-05-14 17:48:55

This is largely a duplication of the concepts of astobj.  The much more logical approach would be to apply the astobj model to this.  Moreover, this is "feature" not "major" as per the bug guidelines, especially since this isn't in and of itself a bug, but a requested performance inhancement.

By: Michael Jerris (mikej) 2005-05-14 17:50:35

rizzo.  you need to speak to kpfleming about the new (yet to be posted) astobj stuff as it addresses much of the locking issues you are seeing here.

By: Luigi Rizzo (rizzo) 2005-05-14 18:09:24

ok reading astobj.h now makes a lot more sense and i agree that
applying it ultimately makes sense.
I wonder though how extensive are the changes needed, but
i guess the only way is try and find out.

As to calling this report "feature" or "major", apart from the obvious
consideration that 'the boss is always right' :) , in my view there
are two issues that are rather serious, namely the deadlock avoidance
mechanism that might cause transient failures, coupled with the poor
implementation of ast_get_channel_by_name_locked() which in addition
to the O(N^2) cost also has to run through the deadlock avoidance mechanism
O(N) times, increasing the chance of failures.

By: Kevin P. Fleming (kpfleming) 2005-05-14 20:52:21

rizzo and I have spoken off-line about the refcount aspects of his comments, and he has agreed to wait on that subject until I can get to where I'm working on Asterisk full-time and the ASTOBJ stuff can get merged and tested.

However, I do see some value in making these changes at this time, since changing the channels/channel list over to ASTOBJ will probably have to wait until we have _really_ shaken out the code in other modules.

You comment in ast_channel_find_locked that we need to hold the channel's lock to access c->name, in case it disappears. However, at that moment the code is holding chlock, so the channel cannot disappear, can it?

Minor note: functions that are static scope and not intended to be API shouldn't be named 'ast_<foo>', they should just be '<foo>'. I know there are lots of instances of that already, but let's not make it worse...

Unless anyone else has objection, I think this patch is pretty much safe to apply. (Corydon, will Mantis let you revert your 'fail' review, since you later learned the code was correct?)

By: Tilghman Lesher (tilghman) 2005-05-14 21:21:17

bmdhacks:  it's not a deadlock situation, as ast_channel_walk_locked() only tries to get next->lock with ast_mutex_trylock().  If it can't get the lock, there's a failure condition, but not a deadlock.

By: Olle Johansson (oej) 2005-06-05 17:01:04

Where are we with this issue?

/Housekeeping

By: Kevin P. Fleming (kpfleming) 2005-06-05 22:27:06

Committed to CVS HEAD with some slight mods... thanks for the excellent work!

By: Digium Subversion (svnbot) 2008-01-15 15:37:19.000-0600

Repository: asterisk
Revision: 5853

U   trunk/app.c
U   trunk/apps/app_groupcount.c
U   trunk/apps/app_setcdruserfield.c
U   trunk/apps/app_softhangup.c
U   trunk/apps/app_zapscan.c
U   trunk/channel.c
U   trunk/cli.c
U   trunk/include/asterisk/channel.h
U   trunk/manager.c
U   trunk/pbx.c
U   trunk/res/res_agi.c
U   trunk/res/res_features.c
U   trunk/res/res_monitor.c
U   trunk/res/res_musiconhold.c

------------------------------------------------------------------------
r5853 | kpfleming | 2008-01-15 15:37:18 -0600 (Tue, 15 Jan 2008) | 2 lines

more efficient (and understandable) ast_channel_walk_locked, and vastly more efficient ast_channel_by_name_locked (bug ASTERISK-4165)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=5853