Summary: ASTERISK-13417: chan_dahdi segfaulting (may be related to Bridge() application).
Reporter: Matt King, M.A. Oxon. (kebl0155)
Labels:
Date Opened: 2009-01-21 08:26:29.000-0600
Date Closed: 2009-03-02 10:49:03.000-0600
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Channels/chan_dahdi
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) fixupBroke.txt.gz
( 1) FixupGDB.txt
Description:
Hi,

We changed one of our apps to use the new Bridge() command today.  Since then we've had two segfault/core dumps in chan_dahdi.


****** ADDITIONAL INFORMATION ******

I have gdb'd the core dumps:

Program terminated with signal 11, Segmentation fault.
#0  do_monitor (data=0x0) at chan_dahdi.c:7739
7739                            if ((i->subs[SUB_REAL].dfd > -1) && i->sig && (i->radio)) {

Both core dumps so far have failed at this same line in chan_dahdi.c.

I'm not sure whether this is related to our use of Bridge(). However, the crashes coincided with our adoption of this command, and there have been no other changes.  I also can't find any error messages in the asterisk logs or manager event stream that would indicate an alternative cause.

In any case, this line probably shouldn't be causing core dumps, so I thought I would let you know...
Comments:

By: Matt King, M.A. Oxon. (kebl0155) 2009-01-21 11:07:14.000-0600

I have been able to associate this fault with the following set of error messages:

[Jan 21 10:25:16] VERBOSE[23296] logger.c:     -- Executing [9400@orderlyq:5] Bridge("Local/6400@orderlyq-603f;1", "DAHDI/35-1") in new stack
[Jan 21 10:25:16] WARNING[23292] channel.c: Fixup failed on channel DAHDI/11-1<MASQ>, strange things may happen.
[Jan 21 10:25:16] WARNING[23292] channel.c: Hangup failed!  Strange things may happen!
[Jan 21 10:25:16] WARNING[23292] channel.c: Failed to perform masquerade
[Jan 21 10:25:16] WARNING[23292] channel.c: Channel 'DAHDI/11-1' may not have been hung up properly


We've had one of these within 30 seconds of every core dump today (and only at these times).  We did occasionally see some of these when we were using ParkedCall() and ChannelRedirect() to connect the calls, so this may not be specific to the Bridge() application after all.
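
The correlation described above (a "Fixup failed" warning within 30 seconds of each core dump) can be checked mechanically. The following is a minimal Python sketch and not part of the original report: the function names, the `year` argument (the log lines carry no year), and any crash time passed in are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches log lines in the format quoted above, e.g.
# [Jan 21 10:25:16] WARNING[23292] channel.c: Fixup failed on channel ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>\w+)\[(?P<tid>\d+)\] (?P<src>[\w.]+): (?P<msg>.*)$"
)
MARKER = "Fixup failed on channel "


def find_fixup_failures(lines, year):
    """Return a list of (timestamp, channel) for each 'Fixup failed' warning."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or MARKER not in m.group("msg"):
            continue
        # The log omits the year, so the caller has to supply it.
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        channel = m.group("msg").split(MARKER, 1)[1].split(",", 1)[0]
        hits.append((ts, channel))
    return hits


def near_crash(hits, crash_time, window_seconds=30):
    """Keep only the warnings logged within window_seconds of crash_time."""
    window = timedelta(seconds=window_seconds)
    return [hit for hit in hits if abs(hit[0] - crash_time) <= window]
```

The crash time would typically come from the modification time of the core file. An empty result for a given core dump would count against the theory that the failed masquerade and the segfault are linked.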

By: Leif Madsen (lmadsen) 2009-01-26 13:49:11.000-0600

Just wanted to see if you have any additional information you could provide on this issue?

In addition, I would like to see the backtrace from this crash. Please compile Asterisk with DONT_OPTIMIZE enabled under the Compiler Flags section of menuselect, then attach the resulting backtrace to this issue.

Once you have gotten the coredump, then please output the text from gdb to a text file and upload it to this bug. Follow the instructions in doc/backtrace.txt of your Asterisk source directory.
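
The gdb step requested above can be scripted so that the output lands in a text file ready for upload. This is a minimal Python sketch, assuming gdb is installed: the function names and paths are hypothetical, and the three gdb commands are the ones commonly requested for crash reports. doc/backtrace.txt in your source tree remains the authoritative procedure.

```python
import subprocess


def gdb_backtrace_argv(binary, core):
    """Build a non-interactive gdb command line that prints the backtraces."""
    argv = ["gdb", "-batch"]
    for command in ("bt", "bt full", "thread apply all bt"):
        argv += ["-ex", command]
    return argv + [binary, core]


def save_backtrace(binary, core, out_path):
    """Run gdb on the core file and write its combined output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(
            gdb_backtrace_argv(binary, core),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )


# Example call (both paths are placeholders for your own system):
# save_backtrace("/usr/sbin/asterisk", "/tmp/core.12345", "backtrace.txt")
```

If the resulting file shows values as "optimized out", the binary was most likely not built with DONT_OPTIMIZE.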

Thanks!

By: Matt King, M.A. Oxon. (kebl0155) 2009-01-26 16:42:03.000-0600

Hello,

Since posting the bug I've been able to produce a work-around by adding a one second delay between the Bridge() command, and the ensuing Hangup() command that is run when one or other of the callers hangs up the phone, like this:

exten => _9XXX,n,Bridge(${CALLERCHANNEL})
exten => _9XXX,n,Wait(1)
exten => _9XXX,n,Hangup

This has solved the problem in that we're no longer getting the core dumps or Fixup Failed messages.

I think the problem was happening because Hangup() was being executed before the Bridge mechanism had completely finished unlinking the channels.

Unfortunately I can't compromise the customer's call centre by removing the work-around as this is a production system (i.e. they'll shout at me...), so I am unable to produce the further core dumps requested, as I'm sure you can appreciate!

None the less I have uploaded a full gdb output from the last core dump - I hope it is of some use.

By: Leif Madsen (lmadsen) 2009-01-27 14:33:07.000-0600

Darn... the values are optimized out so the backtrace isn't very useful.

Can you provide a simple call flow that might exhibit this issue so I can attempt to reproduce it?

(i.e. what would ${CALLERCHANNEL} normally contain, and what technologies are involved here?)

Thanks!

By: Matt King, M.A. Oxon. (kebl0155) 2009-01-27 14:44:40.000-0600

Hi Blitzrage,

I'm sorry the gdb output isn't as helpful as you would wish.  All the core dumps are consistently segfaulting at the same line in chan_dahdi.c

In answer to your question, we're essentially doing a custom ACD (without using the Queue() application).  Typically, a caller goes into a FastAGI using the AGI() application, where they get messages and music on hold.

A Probe call is launched on behalf of the caller to find an agent, by using the manager Originate action with a Local channel.  ${CALLERCHANNEL} is set on this Probe call to equal the caller's original channel.

When the agent answers the call, the dialplan automatically executes the following on the 'free' end of the local channel:

exten => _9XXX,1,Answer
exten => _9XXX,n,Bridge(${CALLERCHANNEL})
exten => _9XXX,n,Hangup

The Bridge() command will then automatically bridge the agent and the caller.
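
For readers unfamiliar with the Probe call described above, the manager Originate action might look roughly like the message built below. This is a sketch under assumptions: the helper name is hypothetical, and the channel, context, extension, and CALLERCHANNEL values are inferred from the log lines quoted earlier in this thread rather than confirmed by the reporter.

```python
def ami_originate(channel, context, exten, priority, variables, action_id=None):
    """Build the text of an AMI Originate action as a single string."""
    lines = [
        "Action: Originate",
        f"Channel: {channel}",    # dialled first; here the Local channel to the agent
        f"Context: {context}",    # where the other end continues once answered
        f"Exten: {exten}",
        f"Priority: {priority}",
        "Async: true",            # return immediately rather than wait for answer
    ]
    if action_id:
        lines.append(f"ActionID: {action_id}")
    for name, value in variables.items():
        lines.append(f"Variable: {name}={value}")
    # AMI headers end with CRLF and a blank line terminates the action.
    return "\r\n".join(lines) + "\r\n\r\n"


# Illustrative values only, taken from the quoted log output.
probe = ami_originate(
    channel="Local/6400@orderlyq",
    context="orderlyq",
    exten="9400",
    priority=1,
    variables={"CALLERCHANNEL": "DAHDI/35-1"},
)
```

The string would be written to an authenticated manager connection. The channel variable set here is what the Bridge(${CALLERCHANNEL}) line reads when the agent answers.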

Execution continues after the Bridge() if the caller hangs up before the agent, so we have a Hangup statement as the next line.  Adding a one second delay after the Bridge() command seems to have alleviated the problem, like this:

exten => _9XXX,1,Answer
exten => _9XXX,n,Bridge(${CALLERCHANNEL})
exten => _9XXX,n,Wait(1)
exten => _9XXX,n,Hangup

We did formerly use a more complicated(!) scheme using ChannelRedirect() and ParkedCall(), which intermittently had the same problem, so I don't think this is specific to the Bridge() command after all - rather, this seems to be a problem when the Hangup is executed while the channels are still unlinking from each other at the end of the call.

Hope this helps, and best of luck!

By: Leif Madsen (lmadsen) 2009-01-27 16:04:02.000-0600

Darn... I can't seem to reproduce this. I've run calls through the Local channel to try and get it to die, but no luck thus far. I wonder if it has something to do with masquerading from a DAHDI channel...

Here is a simple dialplan I've been working with. Not sure if this is really the "right" way of trying to reproduce it, but it was kinda what I figured out based on what was happening in this bug report.

;
exten => 555,1,MusicOnHold()

exten => 666,1,Dial(Local/777@default)

exten => 777,1,Answer()
exten => 777,n,Bridge(SIP/110-1871cf00)
exten => 777,n,Hangup()
;

Extension 110 dials 555, which then becomes an active call. I then update the Bridge() application to have the correct channel name, reload the dialplan, and dial 666 from extension 111 to see what happens. I tried hanging up 111 first and 110 first, but neither resulted in any crashes or warnings.

By: Matt King, M.A. Oxon. (kebl0155) 2009-01-30 04:18:44.000-0600

Hi Blitzrage,

I can think of three reasons why you may not be able to reproduce this...

1)  It only happens in 1 in 100 calls (approx).

2)  I *think* it only happens if the channel that calls Bridge() continues execution (i.e. the other party has to hang up first).

3)  The segfaults all happened in chan_dahdi.c, so you might not be able to reproduce this with SIP at all...
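
Point 1 has a practical consequence for reproduction attempts. Assuming, purely for illustration, that each call fails independently at the approximate 1-in-100 rate given above, the number of test calls needed to see at least one failure with a given confidence is the smallest n satisfying 1 - (1 - p)^n >= confidence. A small Python sketch (the function name is hypothetical):

```python
import math


def calls_needed(failure_rate, confidence):
    """Smallest number of independent calls giving at least `confidence`
    probability of observing one or more failures."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


print(calls_needed(0.01, 0.95))  # prints 299
```

So a handful of manual test calls is unlikely to trigger the fault even if the test setup is otherwise correct. A load generator would be the more realistic way to reach that call volume.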

Please let me know if I can be of any further assistance.

By: Matt King, M.A. Oxon. (kebl0155) 2009-01-30 04:21:11.000-0600

...though the error message that precedes the segfault is from line 3931 of main/channel.c, so who knows...

By: Leif Madsen (lmadsen) 2009-01-30 08:33:09.000-0600

Ya, if it happens infrequently, I might need to load something up with SIPp and reproduce it that way.

If that doesn't work... then I might have no choice but to ping someone at Digium to try and reproduce this in a lab since I have no hardware here. But that might be easier said than done :)

By: Leif Madsen (lmadsen) 2009-02-13 13:39:00.000-0600

Do you happen to be using any custom patches, or perhaps using a packaged version from a Linux distribution such as Debian? I've had a developer look over the code of where this is happening, and he is confused as to how you would even be able to get that message in this scenario.

Thanks!

By: Joshua C. Colp (jcolp) 2009-02-13 14:09:02.000-0600

After looking at this further I think it may actually be the chan_local optimization that is causing another masquerade to happen. Can you post the *complete* console output and potentially the actual dialplan?

By: Matt King, M.A. Oxon. (kebl0155) 2009-02-23 07:26:03.000-0600

Hi there,

We're not using any custom patches, and we're compiling from scratch on Debian Linux.

As I said, adding 1 second delay between the Bridge() and the Hangup() seems to have fixed it.

I have attached a 10 minute section of log file as requested - please let me know if you need more.

The dialplan is quite complicated.  We use Originate to launch a local channel into the queue, which is routed through the Queue app to agents (also over a local channel).  When the call is answered, we use Bridge() to connect the agent to the original caller (who is listening on a different channel).

By: David Vossel (dvossel) 2009-02-23 09:18:17.000-0600

Hey, I committed a possible fix for this earlier, but I had a typo in the commit message that prevented it from being posted here.  Check out main/features.c rev177227 or above from 1.6.0 branch to get the fix.  Let me know if this helps.  The issue is nearly impossible for me to reproduce in my office.

By: Leif Madsen (lmadsen) 2009-02-24 11:40:11.000-0600

Status changed to feedback as we're looking for someone to test per dvossel's note and report back if the issue is resolved or not. Thanks!

By: Matt King, M.A. Oxon. (kebl0155) 2009-02-24 11:49:22.000-0600

I'll ask my customer if they'll allow us to test this patch.

By: Leif Madsen (lmadsen) 2009-02-24 11:54:34.000-0600

OK, keep us posted. If not, then we may just have to close this for now as unreproducible and you can re-open in the future should you have the same issue upon system upgrade.

Thanks!
Leif.

By: Matt King, M.A. Oxon. (kebl0155) 2009-03-02 08:10:15.000-0600

Hi, I'm sorry, but my customer has refused permission to try the patch now that a workaround is in place.  Please understand that each call is worth hundreds of dollars to this customer.

I'm really sorry I can't do more on this - we don't use Zap/Dahdi on our test rig here at the office so I'm unable to run test calls for you.

Hopefully this bug thread will be of some use if someone else runs into the same issue...

Once again, I'm sorry I can't do more.

Matt.

By: Leif Madsen (lmadsen) 2009-03-02 10:48:35.000-0600

kebl0155: thanks for following up and trying to see if you could test the latest SVN

However I'm going to close this issue out for now. There is a change in the code which may have solved this, so in the future if you end up upgrading the customer and the issue comes back, just re-open this issue.

If someone else is reading this, just go to #asterisk-bugs on the IRC network irc.freenode.net and ask a bug marshal to reopen this issue for you. Thanks!

By: Leif Madsen (lmadsen) 2009-03-02 10:49:03.000-0600

Potentially already fixed. Issue now closed.