[Home]

Summary:ASTERISK-14998: Crash in __ast_pthread_mutex_unlock chan_sip after park
Reporter:Michael Cramer (micc)Labels:
Date Opened:2009-10-16 13:09:41Date Closed:2009-12-21 16:52:47.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) gdb.txt
( 1) gdb2.txt
( 2) gdb3.txt
Description:Looking at the back trace, it looks like someone was parking and hung up too soon or something.

****** ADDITIONAL INFORMATION ******

We are using the multi-tenant parking patch
Comments:By: Leif Madsen (lmadsen) 2009-10-26 08:33:23

What is the issue number of the patch you're running?

By: Michael Cramer (micc) 2009-10-26 13:43:39

We are running the multi-tenant parking patch from issue 0015538.

By: Michael Cramer (micc) 2009-10-26 13:45:09

I had another crash this weekend that I think is related. I'll upload the back trace later today.

By: Leif Madsen (lmadsen) 2009-10-26 15:59:14

Just to confirm, you're not able to reproduce this issue without the patch? I'm just trying to narrow down what could be causing it, and if there is an issue with the multi-tenant code from that issue you mentioned.

By: Michael Cramer (micc) 2009-10-26 23:49:40

gdb2 is a bt of another crash I had that I believe is related. I don't know if this is related to the parking patch or not.



By: Michael Cramer (micc) 2009-10-29 01:41:26

Here is another bt. Looks like the same thing. I don't have the option of trying it without the patch, but I will take a look at the patch and see why this might be happening. A crash in an unlock would seem like its being passed an invalid mutex handle or something. I'll try to spend a little time in the code tonight and see what I find.

By: Michael Cramer (micc) 2009-10-29 02:11:36

I really don't see those 2 ast_string_field_set calls from the patch causing this problem, unless it changes the code path somewhere else or changes the timing somehow and causes another bug to rear its head.

By: Michael Cramer (micc) 2009-10-29 02:47:57

Apparently someone has already done a lot of thought about this and has some very interesting code in sip_hangup() in chan_sip.c. The crash is happening in a part of the code that has a comment above that says, "This may get hairy..."
Whoever wrote that must have built their thought pyramid around what is happening here and I would love to understand what this is trying to solve. The crash is in a call to a macro called CHANNEL_DEADLOCK_AVOIDANCE(chan), which is in a loop that is locking the owner of the channel. I think its best if I just include the code snippet here so we can all see it easily.

/* We need to get the lock on bridge because ast_rtp_set_vars will attempt
* to lock the bridge. This may get hairy...
*/
while (bridge && ast_channel_trylock(bridge)) {
 struct ast_channel *chan = p->owner;
 sip_pvt_unlock(p);
 do {
   /* Use chan since p->owner could go NULL on us
    * while p is unlocked
    */
    CHANNEL_DEADLOCK_AVOIDANCE(chan);
 } while (sip_pvt_trylock(p));

 bridge = p->owner ? ast_bridged_channel(p->owner) : NULL;
}

So, in my opinion, this code could be more clear about what is happening inside the deadlock avoidance macro. Because this is such a "hairy" situation it might be best to bring the code out of the macro and just include it here. I don't see in the macro where its calling ast_pthread_mutex_unlock. I see the call to lock and ast_channel_lock, which may call it, but I would excpect to see that in the back trace.

So, lets just take a look at line 671 of lock.h like the backtrace says.
int canlog = strcmp(filename, "logger.c") & t->tracking;
And here is the line from the backtrace:
#0  0x027a37d6 in __ast_pthread_mutex_unlock (filename=0x280d054 "chan_sip.c", lineno=5342,
   func=0x28101f4 "sip_hangup", mutex_name=0x281047b "&chan->lock_dont_use", t=0x98)
   at /usr/src/asterisk/asterisk-1.6.1.6/include/asterisk/lock.h:671

This is the first time I've used gdb to do any real debugging, so I might be all wrong here, but it looks to me like some things got a bit mixed up. The name of the mutex isn't suppose to be "&chan->lock_dont_use" is it? But I don't think thats why its crashing here. t is supposed to be a pointer to ast_mutex_t, but if I read gdb correctly here it says t is 0x98, which is not large enough to be a pointer. It looks like memory was overwritten here.

So the question is could the memory be overwritten by the new calls to ast_string_field_set? I would hope not, but someone else can maybe check into that for me. I think more likely the owner channel was killed and the thing this is trying to prevent is not actually working correctly.

So I'm going to try to make a few changes and upgrade to 1.6.1.8. I'm going to put the code directly into the loop instead of the macro so I can get a better back trace hopefully. I hope I've shed enough light on this issue that maybe someone else can see the problem now. Or maybe I don't know how to read a backtrace and use gdb so I might be completely off here.

By: David Brillert (aragon) 2009-10-29 08:02:28

Any relation to ASTERISK-14304 ?

By: Michael Cramer (micc) 2009-10-30 03:20:07

Maybe, I'm betting that local_hangup uses the same deadlock_avoidance macro. I'd like to understand what that deadlock avoidance is trying to prevent. Is there a diagram somewhere that explains the problem? If not, can someone do a simple drawing or something to illustrate the different possible timing/locking issues that could happen here?

By: David Brillert (aragon) 2009-10-30 08:25:48

I have no idea but you could test the patch in 15314...
The reporter tested the patch with good results.
But the bug notes leave some unanswered questions
https://issues.asterisk.org/view.php?id=15314#109350

By: Michael Cramer (micc) 2009-11-02 13:00:10.000-0600

If it really is causing crashing in deadlock avoidance than this fix should probably be applied to many areas of asterisk.

By: Michael Cramer (micc) 2009-11-19 17:30:01.000-0600

No good. The patch doesn't help this problem. Its consistently crashing in sip_hangup in a call to pthread_mutex_lock about every 3 to 5 days.

By: Michael Cramer (micc) 2009-11-19 18:25:15.000-0600

I think this might be related to 0016106.



By: Leif Madsen (lmadsen) 2009-11-20 07:03:37.000-0600

You mean you think this issue might be related to itself?

By: Michael Cramer (micc) 2009-11-23 17:46:46.000-0600

No, my mistake, I think it might be related to the hangup being called twice bug, but I'll have to find the issue number... 0016106

Also, I don't think this has anything to do with parking or the bug that it is currently shown as related to. This is a mutex/lock timing issue.



By: Michael Cramer (micc) 2009-12-21 13:47:56.000-0600

Seems to be fixed in 1.6.1.10 and 1.6.1.12.

By: Leif Madsen (lmadsen) 2009-12-21 16:52:47.000-0600

Closed per the reporter. Thanks for the feedback!