[Home]

Summary:ASTERISK-11990: Asterisk locks up and cannot process incoming SIP messages on 1.4.20-rc1
Reporter:Julian Yap (jyap)Labels:
Date Opened:2008-05-07 15:53:12Date Closed:2008-06-17 12:58:29
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) gdb_thread_20080528.txt
( 1) locks20080507.txt
( 2) partial_locks20080505.txt
Description:This may be related to: ASTERISK-11973 or ASTERISK-11983

Filing this bug separately as I'm not sure.

Attached will be outputs from 'core show locks'.
Comments:By: Julian Yap (jyap) 2008-05-07 15:57:44

Both occasions required and Asterisk restart.

File partial_locks20080505.txt shows the output of 'core show locks' where are only the locks are shown.

File locks20080507.txt shows the full output of 'core show locks'.

By: Julian Yap (jyap) 2008-05-07 16:00:48

It looks like locks occur throughout the day but most of the time free themselves up:
# grep lock.h /var/log/asterisk/full
[May  7 05:37:52] ERROR[5024] /usr/src/asterisk-1.4.20-rc1/include/asterisk/lock.h: channel.c line 4549 (ast_channel_unlock): mutex '&chan->lock' freed more times than we've locked!
[May  7 05:37:52] ERROR[5024] /usr/src/asterisk-1.4.20-rc1/include/asterisk/lock.h: channel.c line 4549 (ast_channel_unlock): Error releasing mutex: Operation not permitted
[May  7 05:38:06] ERROR[5024] /usr/src/asterisk-1.4.20-rc1/include/asterisk/lock.h: channel.c line 1258 (ast_channel_free): Error destroying mutex &chan->lock: Device or resource busy


On occasions where it totally locks up, there is no 'ast_channel_free' line:
[May  7 09:14:37] ERROR[993] /usr/src/asterisk-1.4.20-rc1/include/asterisk/lock.h: channel.c line 4549 (ast_channel_unlock): mutex '&chan->lock' freed more times than we've locked!
[May  7 09:14:37] ERROR[993] /usr/src/asterisk-1.4.20-rc1/include/asterisk/lock.h: channel.c line 4549 (ast_channel_unlock): Error releasing mutex: Operation not permitted

By: Russell Bryant (russell) 2008-05-08 10:36:50

give rc2 a try

By: Julian Yap (jyap) 2008-05-09 15:48:17

I have upgraded to rc2 and lock errors still occur in /var/log/asterisk/messages.

The server has not yet fully locked up but it probably will.

# grep lock.h /var/log/asterisk/messages
[May  9 10:48:41] ERROR[28657] /usr/src/asterisk-1.4.20-rc2/include/asterisk/lock.h: channel.c line 4549 (ast_channel_unlock): mutex '&chan->lock' freed more times than we've locked!
[May  9 10:48:41] ERROR[28657] /usr/src/asterisk-1.4.20-rc2/include/asterisk/lock.h: channel.c line 4549 (ast_channel_unlock): Error releasing mutex: Operation not permitted
[May  9 10:49:05] ERROR[28657] /usr/src/asterisk-1.4.20-rc2/include/asterisk/lock.h: channel.c line 1258 (ast_channel_free): Error destroying mutex &chan->lock: Device or resource busy

By: Doug (doug) 2008-05-12 04:33:15

We still get the same result on 1.4.20 rc1 and rc2 and on 1.4.19 all versions. Ours seems to always be related to chan_local but i woul dthink it is the same problem as jyap as we have the same result when the lock makes the system unresponsive.

[May 12 11:06:23] ERROR[14085]: /dar/build/asterisk-1.4.19/include/asterisk/lock.h:479 __ast_pthread_mutex_unlock: chan_local.c line 188 (local_queue_frame): mutex '&us->lock' freed more times than we've locked!
[May 12 11:06:23] ERROR[14085]: /dar/build/asterisk-1.4.19/include/asterisk/lock.h:496 __ast_pthread_mutex_unlock: chan_local.c line 188 (local_queue_frame): Error releasing mutex: Operation not permitted

By: Mark Michelson (mmichelson) 2008-05-12 12:22:35

jyap: If possible, could you upload a core show locks output without DEBUG_CHANNEL_LOCKS enabled? my note(~86727) on issue ASTERISK-11973 will help explain why this is helpful. Thanks.

By: Julian Yap (jyap) 2008-05-12 13:16:07

putnopvut: OK, I'll recompile Asterisk _without DEBUG_CHANNEL_LOCKS compiler option tonight.

By: Russell Bryant (russell) 2008-05-12 17:10:15

jyap, what is the last version you ran where you did not have this problem?

By: Russell Bryant (russell) 2008-05-12 17:18:22

Also, can you please describe what is happening when this occurs?  It looks like it could be a transfer.  If so, what type of transfer is it?  Can you also provide a SIP trace of the call that causes this to occur?

By: Julian Yap (jyap) 2008-05-12 21:59:38

Russell, I can't be certain what was the version we ran which did not have this issue.

We ran 1.4.17 for a while but it did not have debugging enabled.

Can we send you a trace privately as it would contain confidential information?

By: Julian Yap (jyap) 2008-05-13 01:42:54

Asterisk 1.4.20-rc2 has been rebuilt on our Asterisk servers without DEBUG_CHANNEL_LOCKS compiler option.

I guess we'll wait for it to lock up.

By: Russell Bryant (russell) 2008-05-13 10:03:15

Yes, feel free to email traces privately if you would like.

russell(AT)digium.com
mmichelson(AT)digium.com  (putnopvut)

By: Digium Subversion (svnbot) 2008-05-13 16:13:41

Repository: asterisk
Revision: 116038

U   branches/1.4/channels/chan_local.c

------------------------------------------------------------------------
r116038 | russell | 2008-05-13 16:12:25 -0500 (Tue, 13 May 2008) | 24 lines

Fix a deadlock involving channel autoservice and chan_local that was debugged
and fixed by mmichelson and me.

We observed a system that had a bunch of threads stuck in ast_autoservice_stop().
The reason these threads were waiting around is because this function waits to
ensure that the channel list in the autoservice thread gets rebuilt before the
stop() function returns.  However, the autoservice thread was also locked, so
the autoservice channel list was never getting rebuilt.

The autoservice thread was stuck waiting for the channel lock on a local channel.
However, the local channel was locked by a thread that was stuck in the autoservice
stop function.

It turned out that the issue came down to the local_queue_frame() function in
chan_local.  This function assumed that one of the channels passed in as an
argument was locked when called.  However, that was not always the case.  There
were multiple cases in which this channel was not locked when the function was
called.  We fixed up chan_local to indicate to this function whether this channel
was locked or not.  The previous assumption had caused local_queue_frame() to
improperly return with the channel locked, where it would then never get unlocked.

(closes issue ASTERISK-11973)
(related to issue ASTERISK-11990)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=116038

By: Digium Subversion (svnbot) 2008-05-13 16:14:29

Repository: asterisk
Revision: 116039

_U  trunk/
U   trunk/channels/chan_local.c

------------------------------------------------------------------------
r116039 | russell | 2008-05-13 16:14:28 -0500 (Tue, 13 May 2008) | 32 lines

Merged revisions 116038 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r116038 | russell | 2008-05-13 16:17:23 -0500 (Tue, 13 May 2008) | 24 lines

Fix a deadlock involving channel autoservice and chan_local that was debugged
and fixed by mmichelson and me.

We observed a system that had a bunch of threads stuck in ast_autoservice_stop().
The reason these threads were waiting around is because this function waits to
ensure that the channel list in the autoservice thread gets rebuilt before the
stop() function returns.  However, the autoservice thread was also locked, so
the autoservice channel list was never getting rebuilt.

The autoservice thread was stuck waiting for the channel lock on a local channel.
However, the local channel was locked by a thread that was stuck in the autoservice
stop function.

It turned out that the issue came down to the local_queue_frame() function in
chan_local.  This function assumed that one of the channels passed in as an
argument was locked when called.  However, that was not always the case.  There
were multiple cases in which this channel was not locked when the function was
called.  We fixed up chan_local to indicate to this function whether this channel
was locked or not.  The previous assumption had caused local_queue_frame() to
improperly return with the channel locked, where it would then never get unlocked.

(closes issue ASTERISK-11973)
(related to issue ASTERISK-11990)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=116039

By: Digium Subversion (svnbot) 2008-05-13 16:14:42

Repository: asterisk
Revision: 116040

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_local.c

------------------------------------------------------------------------
r116040 | russell | 2008-05-13 16:14:41 -0500 (Tue, 13 May 2008) | 40 lines

Merged revisions 116039 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r116039 | russell | 2008-05-13 16:18:55 -0500 (Tue, 13 May 2008) | 32 lines

Merged revisions 116038 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r116038 | russell | 2008-05-13 16:17:23 -0500 (Tue, 13 May 2008) | 24 lines

Fix a deadlock involving channel autoservice and chan_local that was debugged
and fixed by mmichelson and me.

We observed a system that had a bunch of threads stuck in ast_autoservice_stop().
The reason these threads were waiting around is because this function waits to
ensure that the channel list in the autoservice thread gets rebuilt before the
stop() function returns.  However, the autoservice thread was also locked, so
the autoservice channel list was never getting rebuilt.

The autoservice thread was stuck waiting for the channel lock on a local channel.
However, the local channel was locked by a thread that was stuck in the autoservice
stop function.

It turned out that the issue came down to the local_queue_frame() function in
chan_local.  This function assumed that one of the channels passed in as an
argument was locked when called.  However, that was not always the case.  There
were multiple cases in which this channel was not locked when the function was
called.  We fixed up chan_local to indicate to this function whether this channel
was locked or not.  The previous assumption had caused local_queue_frame() to
improperly return with the channel locked, where it would then never get unlocked.

(closes issue ASTERISK-11973)
(related to issue ASTERISK-11990)

........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=116040

By: Nicolas Ross (rossnick) 2008-05-28 14:50:42

gdb_thread_20080528.txt uploaded. I have the same problem. I was running asterisk inside gdb to capture this... I didn't try to do a "Core show locks", but last time it did this the commande simply said nothing, not even a single line...

By: Mark Michelson (mmichelson) 2008-05-28 14:58:31

rossnick: Thanks for the backtrace. I'll have a look soon to see if I can figure out what's up. "core show locks" only works if you have compiled Asterisk with DEBUG_THREADS selected in Menuselect, so I'm not sure if that would have affected your ability to use the command.

It may be necessary to get core show locks output for this, but I won't know until I've looked at the backtrace further. Thanks!

By: Nicolas Ross (rossnick) 2008-05-28 15:02:34

putnopvut: yes, I did compile with DEBUG_THREADS, and in "normal" operations it shows something like :

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=======================================================================

When aterisk is jammed, it simply does nothing at all.

By: Mark Michelson (mmichelson) 2008-05-28 15:58:15

Gotcha. Okay, from the backtrace you provided, Asterisk is not attempting to acquire any locks, so it's no surprise that the core show locks output is empty. As for what's causing the holdup, I'm not sure just yet as all threads involved are in states that seem normal.

The one exception is thread 101 (the topmost thread in the backtrace provided). It's either "stuck" in the fork() system call (or your backtrace just happened to be taken as the thread was forking). I'd be interested to know if after waiting a few seconds that thread is still attempting to fork. If so, it would appear that something is happening within the fork system call that is causing it not to complete. I did some quick Googling and cannot seem to find a condition by which fork() would fail to return.

I could, of course, be chasing a red herring with this fork() idea, but if the problem occurs again, it would be interesting to see if after waiting several seconds between grabbing backtraces, the thread is still stuck forking.

---
By the way, this problem you're experiencing is not related to the original bug reported here. I suggest opening a new bug report and attaching the backtrace from here to there. I'll copy the main text of this note to the new bug as well. Thanks!

By: Julian Yap (jyap) 2008-06-17 01:38:04

This issue has not occurred since the patch was released in Asterisk 1.4.20-rc3.  Not issues for over a month.

You can go ahead and close this bug.

By: Mark Michelson (mmichelson) 2008-06-17 12:58:24

Closing since the problem has not occurred again.