ASTERISK-08705: [patch] Channel/Thread deadlock with heavy bi-directional traffic on PRI

[Home]

Summary: ASTERISK-08705: [patch] Channel/Thread deadlock with heavy bi-directional traffic on PRI - GLARE!

Reporter: jerjer (jerjer) Labels:

Date Opened: 2007-01-31 12:32:33.000-0600 Date Closed: 2007-02-14 11:16:58.000-0600

Priority: Critical Regression? No

Status: Closed/Complete Components: Channels/chan_zap

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) backtrace.txt
( 1) chan_zap_donthangupowner.diff
( 2) console.txt
( 3) lotsodebug.txt
( 4) mighthaveitlicked.txt

Description: We are repeatedly generating calls in and out of two PRIs that we have on two separate machines, specifically creating serious GLARE to force this lingering bug to present itself.

We have two machines, Intel P4 and a Mini-ITX which are connected via crossover cables to two PRIs using TE2XXP cards

After about 100 calls in each direction asterisk spits out the following warning:

Jan 31 13:18:35 WARNING[25500]: chan_zap.c:8535 pri_dchannel: Ring requested on channel 0/11 already in useon span 1. Hanging up owner.

...to which we have specifically setup asterisk to crash AFTER printing that message, which is attached as backtrace.txt.

****** ADDITIONAL INFORMATION ******

This bug is going to be related to a few other reports of this same sitaution.

We have limited access a non-production installation to solve this bug once and for all. Lets use it while we have. Anyone needing access can get it, just find me.

Comments: By: jerjer (jerjer) 2007-01-31 12:37:35.000-0600

If we set the test up to only send inbound or outbound call this bug never presents itself. We pushed over 10,000 inbound calls without so much as a hiccup.

However the instant we mix inbound and outbound calls on the same PRI we hit this problem.

By forcing a serious glare situation we can reproduce this situation in a matter of seconds.
By: jmls (jmls) 2007-02-05 09:29:03.000-0600

we are also receiving an uncomfortable number of these "Ring requested on channel 0/xx already in useon span 1" on inbound calls.

We are making outbound from ASTERISK-26 downwards, and receiving inbounds on #1 upwards on a BT EuroISDN-30 with a sangoma card
By: BJ Weschke (bweschke) 2007-02-06 01:29:08.000-0600

Hey JerJer - try applying this patch. It's not going to fix your problem, but I'd be interested to see what the values of these flags are in the pvt struct when this occurs. As we discussed in #asterisk-dev, I don't think it's appropriate to hangup on the owner when we've got a true case of "glare".
By: BJ Weschke (bweschke) 2007-02-06 01:31:28.000-0600

oh - that patch is for a 1.4 chan_zap file. If you need one for 1.2 lemme know, but I think you said this was also reproducible for 1.4/trunk quicker than 1.2. If that's the case, and the fix follows my theory on the situation, we can produce a patch for 1.2/1.4/trunk.
By: jmls (jmls) 2007-02-06 01:35:02.000-0600

bweschke, do you have an idea on what the problem is (is there a fix in sight) or are you just looking at data right now ? We have this problem as well - I can apply the patch to our system tonight and send in the data if it happens again
By: BJ Weschke (bweschke) 2007-02-06 01:41:57.000-0600

jmls: My theory is that if this is a true case of glare (inbound call requested on the same channel of which an outbound call is in a progress state but not yet setup), then we probably don't want to try to do what the error msg is telling us it's doing, "Hanging up owner". I think instead we probably want to reject the inbound call attempt at the PRI level and allow the outbound call to proceed.
The problem we have here is that there's plenty of other things that could cause this error msg that haven't anything to do with glare (eg - extended func_curl, app_curl calls, etc) and in those cases, the action in the error is appropriate as we do want to try and clear the channel/application on the channel that's got the timeslot's state out of sync with what PRI is signaling.
So, the more data we can get, probably the better.
By: jerjer (jerjer) 2007-02-06 14:03:20.000-0600

After multiple crashes all we ever get is:

Feb 6 15:02:34 WARNING[25563]: chan_zap.c:8537 pri_dchannel: ZAP Channel Progress Flag: 0
Feb 6 15:02:34 WARNING[25563]: chan_zap.c:8538 pri_dchannel: ZAP Channel Proceeding Flag: 0
Feb 6 15:02:34 WARNING[25563]: chan_zap.c:8539 pri_dchannel: ZAP Channel Setup/ACK Flag: 0
Feb 6 15:02:34 WARNING[25563]: chan_zap.c:8540 pri_dchannel: ZAP Channel Dialing Flag: 0
!! Forcing immediate crash a-la abort !!
By: jerjer (jerjer) 2007-02-06 14:12:06.000-0600

Oddly enough when I setup a test to hammer on the same channel on each box this situation has (so far) not occured.

This may not specifically be related glare now - perhaps just a timing thing
By: Matthew Fredrickson (mattf) 2007-02-06 14:38:00.000-0600

What we need to do is figure out what the exact sequence of events is that causes this to occur, if it is indeed a bug in chan_zap and not because some other application or part of asterisk has got the channel hung up.
By: BJ Weschke (bweschke) 2007-02-06 21:51:49.000-0600

JerJer - I'm somewhat surprised all these flags are 0. Next step I guess is to provide a VERBOSE/DEBUG log with maybe two mins of activity leading up to the crash. Hopefully, this should allow us to get a better idea of rebuilding the events in play.
By: jerjer (jerjer) 2007-02-07 10:30:06.000-0600

mattf: We even tried

exten => 3001,1,Answer
exten => 3001,2,Wait(5)
exten => 3001,3,Hangup

...and was still able to reproduce the Ring Requested situation in less than 3 minutes with very high usage (in and out) on the 2 PRIs.

Again, if we only have inbound calls we could not duplicate this in under 10,000 calls, but if we add in outbound calls to the mix this happens in a handful of calls.
By: Serge Vecher (serge-v) 2007-02-07 14:44:34.000-0600

JerJer: any luck with that verbose/debug log?
By: jerjer (jerjer) 2007-02-08 15:38:14.000-0600

Attached console.txt, however this time it did not last 2 minutes
3 seconds to crash - that might be a new record.

Test was setting up and receiving 46 simultaneous calls via 2 PRI
By: Matthew Fredrickson (mattf) 2007-02-10 12:16:06.000-0600

Can you get the "pri debug span x" output as well as the console output when this happens? The console doesn't tell quite enough about what is going on.
By: jerjer (jerjer) 2007-02-10 12:43:36.000-0600

As requested, sir
By: Matthew Fredrickson (mattf) 2007-02-10 14:44:01.000-0600

Please try the attached patch. I think hanging up the original owner is probably wrong now that channel negotiation works properly.
By: jerjer (jerjer) 2007-02-10 19:37:09.000-0600

Patch applied - received this seg:

#0 0xb759e93e in pri_dchannel (vpri=0xb75b4540) at chan_zap.c:8402
8402 if (pri->pvts[chanpos]->realcall)
(gdb) bt
#0 0xb759e93e in pri_dchannel (vpri=0xb75b4540) at chan_zap.c:8402
#1 0xb7f42f46 in start_thread () from /lib/tls/i686/libpthread.so.0
#2 0xb7e3741a in clone () from /lib/tls/i686/libc.so.6

-- Playing 'vm-intro' (language 'en')
Feb 10 20:33:28 DEBUG[3068]: chan_zap.c:8399 pri_dchannel: Ring requested on channel 0/19 already in use or previously requested on span 1. Attempting to renegotiating channel.
Segmentation fault (core dumped)
By: jerjer (jerjer) 2007-02-10 19:39:03.000-0600

oh - the patch didn't cleanly apply - Hunk #2 failed
By: jerjer (jerjer) 2007-02-10 20:04:30.000-0600

Been slamming these systems with calls for calls for about 45 minutes now - You may have solved it man.
By: jerjer (jerjer) 2007-02-10 20:15:57.000-0600

new file uploaded - after sifting thru all kinds of verboseness i found how this situation is now being dealt with. Seems to handle the situation now - YAY
By: Steve Davies (one47) 2007-02-12 04:42:45.000-0600

One of the problems often associated with this error message (at least in my experience) is channel(s) which lock up and never go away.

If I understand this patch, it works around the issue by allowing the renegotiation of the channel to another not-locked/in-use channel on the Zap/ side, so that the call may proceed instead of being hung-up.

Will this not just leave an increasing number of locked channels behind until the system runs out of resources? Is this something that is seen by JerJer? I still like the improved behaviour BTW :)
By: BJ Weschke (bweschke) 2007-02-12 06:13:53.000-0600

one47: In this case, the channel is locked for a valid reason. There's an outbound call in the process of setting up on it. This is a condition known as Glare and we're now appearing to handle it properly with Cresl1n's patch.
However, if your system is producing a growing number of channels that are locked and never return from this state, you may have another scenario where you've got an application that is hanging on the channel and this is why it is staying locked. In that scenario, this behavior is inappropriate, and you need to find out what application is hanging the channel and deal with that to alleviate the problems.
By: Matthew Fredrickson (mattf) 2007-02-12 10:51:29.000-0600

Yeah, we need to fix this one bug at a time. Not to repeat what has already been said, but if there is another bug that is causing channels to hang, that bug needs to be fixed, not band-aided over.
By: Serge Vecher (serge-v) 2007-02-12 10:57:14.000-0600

Small text comment:
"Attempting to renegotiating channel" -> "Attempting to renegotiate channel"
By: Steve Davies (one47) 2007-02-12 11:15:49.000-0600

I had not realised that this was confirmed as a specifically Glare related issue. Sorry.

Nonetheless, the same code path is used if a call arrives on a channel which is stuck due to a misbehaving app/locaked-up channel, so this patch should leave a system in a workable state for longer. Thanks :)

I will be combining this with loadtesting of bugASTERISK-8146 as I think they will be most complimentary.
By: jerjer (jerjer) 2007-02-12 12:58:14.000-0600

Mattf's patch does fix the crash situation. We just experienced the Ring Requested situation on a production call. With the new patch installed, Asterisk did not crash and the call was successfully transfered to another known idle channel.
By: seb7 (seb7) 2007-02-12 13:19:10.000-0600

I managed to reproduce (100%) this Ring requested problem with a single call up (on a PRI), although looking at this bug, I reckon I need to submit a new issue but don't have time today...

Briefly though, this is what happened (Asterisk SVN-branch-1.4-r53881):
1. Incoming PRI call is answered.
2. Wait 2 seconds.
3. Call Hangup from the dialplan
4. Move to the h extension
5. Do a couple of NoOp and GotoIfs.
6. Do another GotoIf, which evaluates false and has no labeliffalse and get stuck (Goto does not execute). The actual line in extensions.conf is: exten => h,n,GotoIf($["${CDR(answer)}" = ""]?50)
7. Wait a couple of minutes (my call is still connected).
8. Disconnect the call from my phone's end.
9. Try to place another call into Asterisk on the same number and get:
[Feb 12 18:04:58] WARNING[4014]: chan_zap.c:8593 pri_dchannel: Ring requested on channel 0/1 already in use on span 4. Hanging up owner.
10. Try a few more times and get the same thing.
11. Do a 'show channels' and see that the original call is still there even though I disconnected long ago.
12. Use 'stop now' to stop Asterisk - only cure I know of. [EDIT: without using this patch.]

By: Matthew Fredrickson (mattf) 2007-02-12 13:22:48.000-0600

That's totally another issue. Open up another issue for that. Although your steps for producing it look like they would be deliberately cooking up some trouble.
By: Matthew Fredrickson (mattf) 2007-02-14 11:16:23.000-0600

Fixed in 1.2, 1.4, and trunk.