ASTERISK-10043: [patch] Asterisk stops processing calls

[Home]

Summary: ASTERISK-10043: [patch] Asterisk stops processing calls

Reporter: callguy (callguy) Labels:

Date Opened: 2007-08-08 11:44:06 Date Closed: 2007-10-23 11:13:11

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 20071005__bug10406.diff.txt
( 1) 20071008__bug10406__3.diff.txt
( 2) 20071008__bug10406.diff.txt
( 3) asterisk-running-bt.txt
( 4) bt.txt
( 5) bt-1.4.12.1-crash.txt
( 6) bt-1.4.12.1-crash2.txt
( 7) core_show_channels_1200250907.txt
( 8) core_show_locks_07-10-01.txt
( 9) core_show_locks_071005
(10) core_show_locks_1200250907.txt
(11) core_show_threads.txt
(12) coreshowlocks.txt
(13) deadlock-10082007.txt
(14) dealock.10172007.txt
(15) messages_grep_ERROR
(16) showchannels.txt
(17) sip_show_peers.txt
(18) thread-all-bt.txt

Description: Asterisk stops processing calls in the sip channel at random intervals. This appears to be related to reloading chan_sip.so. When this happens the console partially locks up (show channels becomes unresponsive) and shortly after sip processing ceases to function.

Running bt attached.

Comments: By: Eliel Sardanons (eliel) 2007-08-08 12:31:40

Please upload a 'thread apply all bt full' or read the backtrace upload for deadlocks on voip-info.org.
By: callguy (callguy) 2007-08-09 14:21:34

thread apply all bt full uploaded, along with a brief snapshot of show channels. The behavior we see is that show channels drops into an infinite loop, and the server stops processing calls. It appears that call processing is hanging (and that's what's causes show channels to go crazy).

this bt is before call processing was complete dead - but well on it's way.
By: Eliel Sardanons (eliel) 2007-08-09 16:45:43

Do you have call limit on peers?
By: callguy (callguy) 2007-08-09 16:55:45

Yes - we have:

call-limit=99

on all of these peers, and limitonpeers=yes in sip.conf. This was required in order for the buddy watch features to work.
By: callguy (callguy) 2007-08-18 14:04:58

We have narrowed this somewhat. The peers that are causing this behavior are polycom 601 phones which have multiple sidecars. We are seeing hung notify messages, related to the call-limit and hints.

We had originally thought the behavior was transferring a local channel to a queue, then perhaps just using the local channel.

We have modified our dialplan to use neither queues or local channels when calling to these handsets, but have experienced the same behavior. The only common thread is that these handsets have a significant amount of notify traffic going to them for the hints, and from time to time the hints seem to cause an issue getting a lock elsewhere, causing a race condition.

Aside from the backtrace, is there additional data we can obtain the next time it happens to help narrow down the exact problem?
By: Russell Bryant (russell) 2007-08-19 23:36:37

Build Asterisk with DEBUG_THREADS enabled. Then, when the system locks up, grab the output of the "core show locks" Asterisk CLI command.
By: callguy (callguy) 2007-08-30 10:58:47

russell - we are currently testing your patch from 9724, as it seems the issue is related. We have generally narrowed down that the issue is triggered when calls are sent via local channels (which appeared to be similar in some of the other scenarios).

We have DEBUG_THREADS enabled, so as soon as we experience again we will submit the output.
By: Derrick Bradbury (snapple42) 2007-09-07 14:49:11

I can't compile/run any debug as it's a critical production box.. but I'm running into this problem as well. I've had the console (and ssh) lock up as well, but system was still processing calls. AND it seems to only happen when a call is answered from the Queue.

I've found that IAX channels are still working, but any sip calls will ring, but can't be answered.

We are running 1.2.14, upgrading to 1.2.24 tomorrow.

(Can't upgrade to 1.4 trunk as we haven't tested it)

I'm going to be watching this one!

By: Russell Bryant (russell) 2007-09-07 15:42:23

I really need the "core show locks" output to fix this. Please consider planning to move to 1.4 if you'd like to get this issue fixed.
By: Ted Brown (ted brown) 2007-09-23 17:04:15

Same problem here with a 1.4.11 server, using softphones (Eyebeam) and queues. We've got two SIP lockups per day during last week.
By: Russell Bryant (russell) 2007-09-23 17:57:14

Could you please try to get that "core show locks" output? That would make debugging and fixing this a lot easier. You just have to build with DEBUG_THREADS enabled for this command to be present.
By: Ted Brown (ted brown) 2007-09-25 05:42:42

Hi,

we've just had another SIP lockup, all phones couldn't registered to the Asterisk machine. Asterisk was compiled with the following flags: DEBUG_THREADS,
DONT_OPTIMIZE and LOADABLE_MODULES.

Please see attached the following results of CLI commands:

- core show channels
- core show threads
- core show locks
- sip show peers

RAM level compsumtion was also increasing at a nearly constant rate of 3% more each hour, at the moment of the lockup there was around 180MB of free RAM memory. The system was running for almost 24 hours before the lockup.

Please ask for more information in case it helps.
By: Ted Brown (ted brown) 2007-09-25 09:24:33

Hi,

after a deeper analyis of log files, I've encoutered several ERROR MESSAGES (in chan_local and chan_agent and) which seem to be related with our problem:

ERROR[8079] /usr/src/PBX/asterisk-1.4.11/include/asterisk/lock.h: chan_local.c line 180 (local_queue_frame): mutex '&us->lock' freed more times than we've locked!

ERROR[8079] /usr/src/PBX/asterisk-1.4.11/include/asterisk/lock.h: chan_local.c line 180 (local_queue_frame): Error releasing mutex: Operation not permitted

ERROR[31787] /usr/src/PBX/asterisk-1.4.11/include/asterisk/lock.h: chan_agent.c line 829 (agent_hangup): mutex '&p->app_lock' freed more times than we've locked!

ERROR[31787] /usr/src/PBX/asterisk-1.4.11/include/asterisk/lock.h: chan_agent.c line 829 (agent_hangup): Error releasing mutex: Operation not permitted

I've submitted the output of "grep 'ERROR' /var/log/asterisk/messages" which gives a lot of ERROR messages like the ones stated above.

In case it helps, we noticed the SIP lockup at [Sep 25 11:53:00].
By: Tilghman Lesher (tilghman) 2007-09-26 11:21:58

Let us know after a week of using this patch if the deadlock does not reoccur (or if it does, we'll need another 'core show locks').
By: Digium Subversion (svnbot) 2007-09-28 00:12:05

Repository: asterisk
Revision: 84049

------------------------------------------------------------------------
r84049 | tilghman | 2007-09-28 00:12:05 -0500 (Fri, 28 Sep 2007) | 3 lines

Avoid a deadlock with ALL of the locks in the masquerade function, not just the
pairs of channels. (Closes issue ASTERISK-10043)

------------------------------------------------------------------------
By: Digium Subversion (svnbot) 2007-09-28 00:18:11

Repository: asterisk
Revision: 84050

------------------------------------------------------------------------
r84050 | tilghman | 2007-09-28 00:18:11 -0500 (Fri, 28 Sep 2007) | 11 lines

Merged revisions 84049 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r84049 | tilghman | 2007-09-28 00:30:22 -0500 (Fri, 28 Sep 2007) | 3 lines

Avoid a deadlock with ALL of the locks in the masquerade function, not just the
pairs of channels. (Closes issue ASTERISK-10043)

........

------------------------------------------------------------------------
By: Ted Brown (ted brown) 2007-10-01 11:40:27

i have just had a deadlock with the patch applied. Uploaded 'core show locks'
By: Tilghman Lesher (tilghman) 2007-10-01 14:59:20

Okay, this SECOND deadlock was exposed when we fixed the FIRST deadlock, so both patches must be applied. Note that the first patch has already been applied to 1.4 SVN, so if you run SVN, you'll only need to apply the second patch.

Basically what we've done is to remove the recursive use of ast_write, doing deadlock avoidance on the two channels involved in a proxy, which should permit locks to pass in the future.

I believe this only occurred when you performed an attended transfer to an Agent channel.
By: Ted Brown (ted brown) 2007-10-02 09:23:31

thank you. i'll be testing this tomorrow

"I believe this only occurred when you performed an attended transfer to an Agent channel.": could this issue then be related to bug 10809?
By: Tilghman Lesher (tilghman) 2007-10-02 10:12:56

Ted Brown: probably not related, as this issue is one of a deadlock and that issue is one of a crash. I know the description sounds similar, but we group bugs based upon the interaction of code and the result, not on description (indeed, we have bugs which manifest themselves in completely different ways, but have a common fix).
By: Ted Brown (ted brown) 2007-10-05 03:42:46

Hi again

i bring you a fresh 'core show locks' after a new deadlock. the symptoms are the same that we always have. The version is branch 1.4 84291 with the second patch applied.
By: Tilghman Lesher (tilghman) 2007-10-05 11:35:05

Third deadlock... Okay, new patch. This one should apply cleanly to the latest SVN 1.4 branch and incorporates the last change that has not yet been applied.
By: callguy (callguy) 2007-10-06 21:57:33

FYI - I don't know if it's just me losing my mind, but when I apply the latest patch asterisk crashes on any attempt to make any call. I just tried compiling against a clean 1.4.12.1 tree and had the behavior, then compiled without the patch it runs fine. the bt's all show memory corruption in channel.c but are all slightly different, i'm going to try to clean them up and post something useful - Ted Brown, have you tested this one yet?
By: Tilghman Lesher (tilghman) 2007-10-07 00:16:11

callguy: please generate a resulting stack backtrace from a 'DONT_OPTIMIZE' built binary and upload it to the file area of this issue.
By: callguy (callguy) 2007-10-07 10:43:15

Corydon76- I uploaded the backtraces from two crashes. I can reproduce at will, so if you need any additional debugs just let me know.
By: Ted Brown (ted brown) 2007-10-08 03:46:54

I have installed latest SVN 1.4, clean installation then this patch. Same behavior as callguy's (callguy: you are not losing your mind :)
By: Tilghman Lesher (tilghman) 2007-10-08 11:55:44

Aha. Okay, new patch uploaded.
By: callguy (callguy) 2007-10-08 13:38:03

We just experienced another deadlock (though we aren't testing with the current patch - but the deadlock doesn't appear to be the same). Output of core show locks is uploaded.
By: callguy (callguy) 2007-10-08 14:23:17

Corydon76: also wanted to let you know that I captured a thread apply all bt to the running process. Not sure if you need it or it helps - but if it is useful let me know and I will clean it up and post.
By: Tilghman Lesher (tilghman) 2007-10-08 16:02:49

callguy: I'm working on a modification to "core show locks" that will help me figure out this one, as it's not exactly obvious where the deadlock is.
By: Tilghman Lesher (tilghman) 2007-10-08 16:38:39

Okay, new patch uploaded. This should expand the debugging available in the output of 'core show locks', and should hopefully help me diagnose your latest deadlock, callguy.
By: Russell Bryant (russell) 2007-10-08 16:52:24

Tilghman: Just one comment on your latest changes to "core show locks" ...

Perhaps in the mark_lock_failed() function we should also decrement the times_locked counter. Then, if the lock is eventually successfully acquired, and then later unlocked, the times_locked counter will accurately reflect 0, and be removed from the lock information list.
By: Tilghman Lesher (tilghman) 2007-10-08 16:58:16

Russell: Good point. Patch updated.
By: Ted Brown (ted brown) 2007-10-09 05:34:07

Tested latest patch with latest SVN version. No more crashes with normal calls, but with it applied we've got again crashes after attended transfers to queues with the eyebeam (bug 10809), which had become somehow random.

I won't be able to test it in my production machine for a few days, so I can't say if it really fixes all deadlocks. Hope callguy can test it
By: callguy (callguy) 2007-10-09 10:43:55

We are testing this today under load, we'll report back later.
By: callguy (callguy) 2007-10-09 16:21:54

We ran the most recent diff with about 100 concurrent calls all day today, without any deadlocks. That said, I don't think our most recent deadlock was related to what you fixed here - we'll continue to run with this patch until we have something to report.
By: Digium Subversion (svnbot) 2007-10-09 16:34:35

Repository: asterisk
Revision: 85158

U branches/1.4/include/asterisk/lock.h
U branches/1.4/main/channel.c
U branches/1.4/main/utils.c

------------------------------------------------------------------------
r85158 | tilghman | 2007-10-09 16:34:34 -0500 (Tue, 09 Oct 2007) | 5 lines

This commit fixes the following issues:
- Deadlock in ast_write (issue ASTERISK-10043)
- Deadlock in ast_read (issue ASTERISK-10043)
- Possible mutex initialization error in lock.h (issue ASTERISK-10177)

------------------------------------------------------------------------
By: Digium Subversion (svnbot) 2007-10-09 17:02:01

Repository: asterisk
Revision: 85176

_U trunk/
U trunk/include/asterisk/lock.h
U trunk/main/channel.c
U trunk/main/utils.c

------------------------------------------------------------------------
r85176 | tilghman | 2007-10-09 17:02:00 -0500 (Tue, 09 Oct 2007) | 13 lines

Merged revisions 85158 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r85158 | tilghman | 2007-10-09 16:55:06 -0500 (Tue, 09 Oct 2007) | 5 lines

This commit fixes the following issues:
- Deadlock in ast_write (issue ASTERISK-10043)
- Deadlock in ast_read (issue ASTERISK-10043)
- Possible mutex initialization error in lock.h (issue ASTERISK-10177)

........

------------------------------------------------------------------------
By: callguy (callguy) 2007-10-09 21:58:32

We just experienced a new deadlock. The person who triggered it said that they issued a "dialplan show" for a specific context, but that doesn't appear to be fully related. Interestingly, after a few minutes of being hung, asterisk seemed to sort out and start responding again. The output of core show locks and a thread apply all from during the hang bt are both uploaded.

This was running the latest diff.
By: Ted Brown (ted brown) 2007-10-10 06:09:58

I'm testing now latest svn branch (85199). When I make a SIP call to a queue, THERE IS NO AUDIO when the agent answers it. Besides, i've seen the following behaviors in transfers A->B->C to queue (A=linksys spa941, B=X, C=linksys spa941):

X=eyebeam 1.5.16.1: Attended transfer: crash
Blind transfer: no audio A<-->C
X=zoiper 2.09(latest) Attended transfer: crash
Blind transfer: no audio A<-->C
X=linksys spa941 Attended transfer: no audio A<-->C
Blind transfer: no audio A<-->C

All transfers, doing the same but calling directly to the C extension, work fine.

This behavior is caused by latest patch, as I've tested svn 85057 with the patch and i saw the same. Without it, only attended transfers with eyebeam would crash

You can use any of the configurations I have already uploaded in bug 10809 to reproduce this

The truth is i'm very surprised nobody had noticed the no audio issue...
By: callguy (callguy) 2007-10-17 11:40:17

We just had a new deadlock occur on 1.4.13. Output of core show locks attached (deadlock.10172007).
By: Tilghman Lesher (tilghman) 2007-10-17 11:52:43

callguy: this does not appear to be a deadlock. Could you describe what you're seeing?
By: callguy (callguy) 2007-10-17 11:57:37

The SIP channel stopped processing calls, it was similar behavior as we've seen during other deadlocks (even though I understand it doesn't necessarily point to that).
By: callguy (callguy) 2007-10-23 07:23:20

FYI - we are seeing the same behavior as Ted Brown - no audio when SIP calls into a queue.
By: Tilghman Lesher (tilghman) 2007-10-23 11:13:11

I'm going to close out this bug, because we have identified and fixed several deadlocks reported here. This issue ID is not the end-all be-all tracker for all deadlocks, so if you have additional problems, please report them as separate issues, not attached to this one.