
Summary: ASTERISK-04922: [patch] ast_mutex_lock() in ast_hangup() causes race/deadlock grief for channel drivers
Reporter: knielsen (knielsen)
Labels:
Date Opened: 2005-08-29 04:15:53
Date Closed: 2005-09-07 22:32:15
Priority: Minor
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) ast_hangup_patch_1.txt
Description: In channel.c, the ast_hangup() function locks the mutex chan->lock in struct ast_channel *chan, then proceeds to clean up and deallocate chan.

I propose removing this locking of chan->lock, since it does no good: if another thread is waiting to lock the chan->lock mutex, it is doomed anyway, because by the time it obtains the lock it will be accessing freed memory.

Even worse, because the channel driver's hangup callback is called with chan->lock held, the driver is prone to races or deadlocks (due to inverted locking order) if it also needs to access the chan pointer from a monitor thread.

Just removing the useless locking of chan->lock in ast_hangup() will solve these issues, so I think it should be done. If nobody opposes, I will be happy to supply a patch, including any necessary modifications to the hangup callbacks in the channel drivers in the channels/ subdirectory.


****** ADDITIONAL INFORMATION ******

Here is a mail I sent to asterisk-dev, further explaining the problem:

The root of the problem is in ast_hangup(), which essentially does this:

   int ast_hangup(struct ast_channel *chan)
   {
       int res = 0;

       ast_mutex_lock(&chan->lock);
       if (chan->tech->hangup)
           res = chan->tech->hangup(chan);  /* driver callback runs with chan->lock held */
       ast_mutex_unlock(&chan->lock);
       ast_channel_free(chan);              /* chan is freed; chan->lock no longer exists */
       return res;
   }

The code apparently tries to avoid races by locking chan->lock, but that
is totally bogus, IMO. If another thread should be waiting for the lock,
it will immediately die when ast_hangup() unlocks chan->lock and calls
ast_channel_free().

But worse, the locking actually works to prevent the channel drivers
from properly implementing tech->hangup(). The correct way for a monitor
or other thread in a driver to access struct ast_channel is to keep a
mapping from circuit number to struct ast_channel * for the active
circuits. This mapping must be protected by some lock (gmutex, say), so
that the monitor thread may atomically look up a circuit, and
ast_hangup() may atomically remove a circuit from the mapping.

But this will lead to a deadlock, since now ast_hangup(chan) will first
lock chan->lock, and then gmutex. The monitor thread will first lock
gmutex (to find the struct ast_channel *), and then lock chan->lock. The
result is a deadlock.
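
To make the lock-order inversion concrete, here is a minimal sketch of such a driver (gmutex is the name used above; circuits[], struct mydrv_pvt and mydrv_hangup() are made-up names for illustration, not code from any existing driver):

    /* Driver-global mapping from circuit number to owning channel. */
    AST_MUTEX_DEFINE_STATIC(gmutex);
    static struct ast_channel *circuits[32];

    struct mydrv_pvt {
        int circuit;
        /* ... other per-call state ... */
    };

    /* Monitor thread path: takes gmutex first, then chan->lock
     * (ast_queue_frame() locks chan->lock internally). */
    static void monitor_incoming_frame(int circuit, struct ast_frame *f)
    {
        ast_mutex_lock(&gmutex);
        if (circuits[circuit])
            ast_queue_frame(circuits[circuit], f);
        ast_mutex_unlock(&gmutex);
    }

    /* Hangup path: ast_hangup() already holds chan->lock when it calls
     * this, so here the order is chan->lock first, then gmutex.  The two
     * paths take the same two locks in opposite order -> deadlock. */
    static int mydrv_hangup(struct ast_channel *chan)
    {
        struct mydrv_pvt *pvt = chan->tech_pvt;

        ast_mutex_lock(&gmutex);
        circuits[pvt->circuit] = NULL;
        ast_mutex_unlock(&gmutex);
        return 0;
    }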

One solution to this is for the tech->hangup() function to unlock
chan->lock (which doesn't cause any further problems since holding the
lock is worthless anyway, as described above). But that is just ugly, I
think; the correct way would be to simply remove the
locking/unlocking of chan->lock from the ast_hangup() function.
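
For concreteness, that workaround would look something like this in a driver's hangup callback (a sketch only, reusing the hypothetical mydrv_hangup()/gmutex names from above):

    /* Workaround: temporarily drop the lock that ast_hangup() took, so we
     * can take gmutex without inverting the monitor thread's
     * gmutex -> chan->lock order. */
    static int mydrv_hangup(struct ast_channel *chan)
    {
        struct mydrv_pvt *pvt = chan->tech_pvt;

        ast_mutex_unlock(&chan->lock);

        ast_mutex_lock(&gmutex);
        circuits[pvt->circuit] = NULL;
        ast_mutex_unlock(&gmutex);

        /* Re-acquire before returning, since ast_hangup() unlocks it again. */
        ast_mutex_lock(&chan->lock);
        return 0;
    }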

Comments? If not, I will open a bug in Mantis with a suitable patch to
this effect.

If anyone needs further convincing, just look at this mess from
chan_zap.c; I think it was done to prevent exactly this issue:

   static void zap_queue_frame(struct zt_pvt *p, struct ast_frame *f, struct zt_pri *pri)
   {
       /* We must unlock the PRI to avoid the possibility of a deadlock */
       ast_mutex_unlock(&pri->lock);
       for (;;) {
           if (p->owner) {
               if (ast_mutex_trylock(&p->owner->lock)) {
                   ast_mutex_unlock(&p->lock);
                   usleep(1);
                   ast_mutex_lock(&p->lock);
               } else {
                   ast_queue_frame(p->owner, f);
                   ast_mutex_unlock(&p->owner->lock);
                   break;
               }
           } else
               break;
       }
       ast_mutex_lock(&pri->lock);
   }

This is essentially a busy wait with an attempt to yield the CPU with
usleep(1). Yuck.
Comments:

By: crich (crich) 2005-08-29 16:21:31

I'm sure this is true; have a look at http://bugs.digium.com/view.php?id=4025, where I'm trying to track down this issue.

I had a nasty feeling that the ast_hangup function had some funny locking, and you really explained my thoughts very well.

chan_misdn directly causes a segfault because of this issue (in Asterisk stable). I always wondered why only chan_misdn causes this segfault...

By: Tilghman Lesher (tilghman) 2005-08-29 18:00:15

Aren't you ignoring the fact that only a channel itself may ever call ast_hangup(), not any other thread or any other channel? All other channels, if they want a particular channel to hang up, must call ast_softhangup(), which signals the channel to initiate its own hangup.
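
In other words, a foreign thread is expected to do something like this rather than calling ast_hangup() itself (a sketch; some_chan is just a placeholder):

    /* Ask the channel's owning thread to hang it up; that thread will see
     * the soft-hangup flag and call ast_hangup() itself. */
    ast_softhangup(some_chan, AST_SOFTHANGUP_EXPLICIT);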

By: crich (crich) 2005-08-29 18:05:58

Funny, I never knew that.

By: Tilghman Lesher (tilghman) 2005-08-29 18:06:15

And you're also ignoring the code in ast_channel_free(), which first removes the channel from the list of channels, then runs a lock/unlock to ensure nothing else still has the lock before it deallocates it.  If some other resource is holding the channel without holding the channel lock, it has no other code to blame but itself for accessing freed memory.
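
Paraphrased, the ordering being described is (not the literal ast_channel_free() source, just the sequence of operations):

    /* Paraphrase of the ast_channel_free() sequence described above
     * (chlock is the global channel-list lock). */
    ast_mutex_lock(&chlock);
    /* ... unlink chan from the global channel list ... */

    /* Lock and unlock once, to be sure nobody still holds chan->lock. */
    ast_mutex_lock(&chan->lock);
    ast_mutex_unlock(&chan->lock);

    /* ... free chan's data, destroy chan->lock, free(chan) ... */
    ast_mutex_unlock(&chlock);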

By: Tilghman Lesher (tilghman) 2005-08-29 18:13:51

Unless you can give me a backtrace of a crash due to this supposed race (that frankly, I don't see at all), I'm going to close this as not a bug.

By: knielsen (knielsen) 2005-08-30 02:44:56

First, the issue isn't a particular Asterisk crash/deadlock; the issue is that the current ast_hangup() code makes it unnecessarily difficult to write a channel driver _without_ introducing a race or deadlock (hence the "feature" severity). So no backtrace. Sorry if that wasn't clear.

Second, I should have shown the patch I am proposing; I will attach it now. The point is that the chan->lock mutex must not be held when calling chan->tech->hangup():

Index: channel.c
===================================================================
RCS file: /usr/cvsroot/asterisk/channel.c,v
retrieving revision 1.235
diff -u -u -r1.235 channel.c
--- channel.c  30 Aug 2005 02:12:09 -0000  1.235
+++ channel.c  30 Aug 2005 07:19:01 -0000
@@ -1022,14 +1022,15 @@
         if (!ast_test_flag(chan, AST_FLAG_ZOMBIE)) {
                 if (option_debug)
                         ast_log(LOG_DEBUG, "Hanging up channel '%s'\n", chan->name);
+                ast_mutex_unlock(&chan->lock);
                 if (chan->tech->hangup)
                         res = chan->tech->hangup(chan);
         } else {
                 if (option_debug)
                         ast_log(LOG_DEBUG, "Hanging up zombie '%s'\n", chan->name);
+                ast_mutex_unlock(&chan->lock);
         }
 
-        ast_mutex_unlock(&chan->lock);
         manager_event(EVENT_FLAG_CALL, "Hangup",
                 "Channel: %s\r\n"
                 "Uniqueid: %s\r\n"
The code in ast_channel_free() is actually a very good example of exactly the issue I am pointing out. Note that ast_channel_free() is called _without_ the chan->lock held. This is necessary for exactly the same reason that the lock must not be held across chan->tech->hangup(): ast_channel_free() needs to lock the global chlock mutex that protects the global channel list _before_ locking chan->lock, or it will deadlock with other threads that lock them in the opposite order. Since many/most channel drivers keep their own private channel lists, they have the same problem.

Thus, the problem is not another thread calling ast_hangup() (which I agree it should not do). The problem is when the monitor thread in the channel driver wants to call, for example, ast_queue_frame() simultaneously with the channel thread calling ast_hangup(). It is the job of the channel driver to protect itself against this, typically with a global mutex on its private channel list, exactly like ast_channel_free() does. This is the job we should make easier.

Of course I could just unlock chan->lock in chan->tech->hangup(), problem solved, or busy-loop with ast_mutex_trylock() as is done in chan_zap. However, that seems kind of stupid since, as I explained in the original description, the locking of chan->lock in ast_hangup() has no useful effect anyway. Hence my proposal that the locking simply be removed from ast_hangup() instead.

Anyway, I just wanted to point this out. If the consensus is that this is not a problem and/or should not be fixed, I'll shut up and put the ast_mutex_unlock(&chan->lock) into my channel drivers. Close the bug, and apologies for the disturbance :-)

By: knielsen (knielsen) 2005-08-31 03:14:19

I found a simple example in the Alsa driver (channels/chan_alsa.c) of a potential deadlock caused by this problem.

If console_answer() and ast_hangup() are called at the same time, we may get this:

1. ast_hangup() locks chan->lock
2. console_answer() locks the global alsalock
3. ast_hangup() calls alsa_hangup(), which sleeps on alsalock
4. console_answer() calls ast_queue_frame(), which sleeps on chan->lock -> deadlock.

If we remove the locking of chan->lock in ast_hangup() as I propose, there is no deadlock, since console_answer() and alsa_hangup() correctly synchronize on the global alsalock.

Granted, this is unlikely to occur in practice, but it would be better to solve this problem correctly, in my opinion.
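
Condensed, the two code paths look something like this (heavily simplified, not the literal chan_alsa.c code; the _sketch names are mine):

    /* Path A: channel thread.  ast_hangup() takes chan->lock and then
     * calls the driver's hangup callback, which needs alsalock. */
    static int alsa_hangup_sketch(struct ast_channel *c)
    {
        ast_mutex_lock(&alsalock);
        /* ... clear the owner, shut down the console ... */
        ast_mutex_unlock(&alsalock);
        return 0;
    }

    /* Path B: console thread.  console_answer() takes alsalock and then
     * queues an ANSWER frame; ast_queue_frame() needs chan->lock.
     * Opposite lock order from path A -> potential deadlock. */
    static void console_answer_sketch(struct ast_channel *owner)
    {
        struct ast_frame f = { AST_FRAME_CONTROL, AST_CONTROL_ANSWER };

        ast_mutex_lock(&alsalock);
        ast_queue_frame(owner, &f);
        ast_mutex_unlock(&alsalock);
    }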



By: knielsen (knielsen) 2005-08-31 04:02:37

I went carefully through the hangup callback code in all the channel drivers in the channels/ subdirectory. I found that none of them assume that the struct ast_channel * is locked on entry to the callback, so the attached patch should be sufficient and safe to apply.

By: Kevin P. Fleming (kpfleming) 2005-09-01 19:08:33

I think this patch is correct, but I'll need to run it past Mark and Matt to see what their thoughts are. More updates tomorrow :-)

By: Kevin P. Fleming (kpfleming) 2005-09-07 22:31:39

After much discussion with Mark and Matt, we've come to the conclusion that the current behavior is the way it was designed to be... any background (monitor) threads that are running are supposed to provisionally take the private lock before attempting to take the channel lock, then release the private lock if taking the channel lock was unsuccessful.

For now, keep with the techniques used in chan_zap and other drivers to avoid deadlocks, until we can engineer a new locking system to deal with this.
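
The pattern being described looks roughly like this in a driver's monitor/queueing path (a sketch of the chan_zap-style technique; struct mydrv_pvt and its members are placeholders, not any driver's actual code):

    /* Deadlock-avoidance pattern: hold the driver-private lock, try the
     * channel lock; if it is busy, back off by releasing the private lock,
     * yield briefly, and retry. */
    static void mydrv_queue_frame(struct mydrv_pvt *pvt, struct ast_frame *f)
    {
        for (;;) {
            ast_mutex_lock(&pvt->lock);
            if (!pvt->owner) {
                ast_mutex_unlock(&pvt->lock);
                break;                      /* channel is already gone */
            }
            if (!ast_mutex_trylock(&pvt->owner->lock)) {
                ast_queue_frame(pvt->owner, f);
                ast_mutex_unlock(&pvt->owner->lock);
                ast_mutex_unlock(&pvt->lock);
                break;
            }
            /* Channel lock was busy: release the private lock so the
             * other thread can finish, then retry. */
            ast_mutex_unlock(&pvt->lock);
            usleep(1);
        }
    }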