DAHLIN-00071: sending a large number of calls to app

Summary: DAHLIN-00071: sending a large number of calls to app_meetme causes kernel panic

Reporter: Terry Whelan (terrywhelan)
Labels:

Date Opened: 2009-01-06 13:19:30.000-0600
Date Closed: 2009-02-03 11:20:11.000-0600

Priority: Critical
Regression? No

Status: Closed/Complete
Components: wct4xxp

Versions: 2.1.0.3
Frequency of Occurrence:

Related Issues:

Environment:
Attachments: ( 0) astcrash.txt
( 1) ringringscreen.png

Description: On machine 1, create a dial file that calls machine 2, plays about 30 seconds of test messages, and hangs up. Create one of these every 5 seconds. On machine 2, put all incoming calls into a meetme conference. After a couple of hundred calls, machine 2 kernel panics. I have dialed in using SIP and on the PRI.

****** ADDITIONAL INFORMATION ******

Hardware: Dell PowerEdge SC440, TE405P quad T1
OS: CentOS 5.0
Dahdi Linux 2.1.0.3
One PRI connected.

extensions.conf:
[general]
static=yes
writeprotect=no
autofallthrough=yes

[test]
exten => s,1,Answer()
exten => s,2,NoOp(exten:${EXTEN} callerid:${CALLERID(NUMBER)})
exten => s,3,Playback(tt-allbusy)
exten => s,4,Wait(3.0)
exten => s,5,Playback(tt-allbusy)
exten => s,6,Wait(3.0)
exten => s,7,Playback(tt-allbusy)
exten => s,8,Wait(3.0)
exten => s,9,Hangup()

[newcall]
exten => test,1,MeetMe(test,dp(#*)q)
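
The load generator on machine 1 could be sketched roughly as follows. This is an illustration, not the reporter's actual script: it assumes the standard Asterisk call-file fields (Channel, Context, Extension, Priority) and the default spool directories, and the channel string `SIP/machine2/test` and all function names are hypothetical.

```python
import os
import tempfile
import time

# Assumed default Asterisk spool locations; adjust for the local install.
SPOOL_DIR = "/var/spool/asterisk/outgoing"
STAGING_DIR = "/var/spool/asterisk/tmp"


def build_call_file(channel, context="test", extension="s", priority=1):
    """Return the text of a call file that dials `channel` and, once the
    call is answered, runs the local dialplan at context/extension/priority."""
    return (
        f"Channel: {channel}\n"
        f"Context: {context}\n"
        f"Extension: {extension}\n"
        f"Priority: {priority}\n"
    )


def drop_call_file(text, spool_dir=SPOOL_DIR, staging_dir=STAGING_DIR):
    """Write the call file in a staging directory, then rename it into the
    spool so Asterisk never sees a half-written file. Both directories must
    be on the same filesystem for the rename to work."""
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.rename(tmp_path, final_path)
    return final_path


def generate_load(channel, count, interval=5.0):
    """Queue `count` calls, one every `interval` seconds."""
    for _ in range(count):
        drop_call_file(build_call_file(channel))
        time.sleep(interval)
```

Calling `generate_load("SIP/machine2/test", 300)` would queue one call every 5 seconds, which matches the rate described above; the peer name and the count of 300 are placeholders.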

Comments: By: Terry Whelan (terrywhelan) 2009-01-07 17:10:14.000-0600

Changed the test to bring all the calls in on a TDM T1, so far no crashes.
By: Terry Whelan (terrywhelan) 2009-01-08 15:36:34.000-0600

Identical behaviour in asterisk 1.6.0.3-RC1, with dahdi 2.1.0.3
Problem absent in asterisk 1.4.11 using zaptel 1.4.5.1
By: Leif Madsen (lmadsen) 2009-01-09 08:53:32.000-0600

Can you attach the kernel panic as a file to the bug tracker so a developer can see what is happening?
By: Terry Whelan (terrywhelan) 2009-01-09 15:45:00.000-0600

I am not a kernel/device developer, so I am not sure what file you want or how to get it. Since the kernel halts on panic, I don't know how to capture the output. I am happy to do significant work to debug this problem, but I need some guidance on how to provide the information.
By: Terry Whelan (terrywhelan) 2009-01-09 19:18:00.000-0600

Added a serial console and captured the output of the panic. The code for this version was asterisk trunk compiled --with-pri=no along with dahdi trunk.
By: Leif Madsen (lmadsen) 2009-01-10 09:00:55.000-0600

Hey Terry,

Thanks for the file. I'm not a kernel dev either, so I'm not sure where to point you. I had assumed a kernel panic would create some sort of log, but maybe that's not true.

That may or may not be enough information, but we'll see what one of the other developers has to say.

Thanks!
By: Terry Whelan (terrywhelan) 2009-01-13 15:03:19.000-0600

Tried various versions of asterisk with (dahdi|zaptel). Running with zaptel there were no problems, including with asterisk 1.4.22.1 (I believe this is the highest version of asterisk that supports zaptel). Using the same asterisk version with dahdi-linux trunk from svn caused the crash. I will try some older versions of dahdi from svn to see where the problem was introduced.

Also tried CentOS 5.2 with same results.

By: Leif Madsen (lmadsen) 2009-01-14 13:43:31.000-0600

I've moved this to the DAHDI project as it appears the reporter has narrowed the issue down to some version of DAHDI, and not Asterisk itself.

Thanks for the feedback!
By: Dave Miller (justdave) 2009-01-15 14:18:21.000-0600

We've hit this crash ourselves twice in the last 3 days.

Asterisk 1.4.22
dahdi-linux-2.1.0.3

I have dahdi-linux-2.0.0 already compiled and on the box (upgraded to 2.1.0.3 this last week, which seems to imply it's new since 2.0.0). I'm planning to revert to 2.0.0 tonight after hours.

By: Shaun Ruffell (sruffell) 2009-01-15 14:43:19.000-0600

justdave: Does t4_interrupt_gen2 show up in the Call Trace for you as well?
By: Dave Miller (justdave) 2009-01-15 14:57:59.000-0600

We don't have a serial console set up at the moment. Attached is a screenshot of what was visible on the console in the last crash.
By: Shaun Ruffell (sruffell) 2009-01-15 15:06:27.000-0600

No need to set up a serial console... your screenshot answered my question. I'll need to try to reproduce this here, but from the two data points it doesn't look like anything in the board-specific drivers; it looks related to one of the changes in dahdi-base.c.

By: Terry Whelan (terrywhelan) 2009-01-15 18:18:04.000-0600

The problem manifests itself when app_meetme uses the pseudo dahdi device. I commented out every line in /etc/dahdi/modules and have an empty system.conf, so there is probably no board specific code running. The panic still occurs. I am in the process of working through each version of dahdi 2.1.0. I will update this note as I find out more.
By: Shaun Ruffell (sruffell) 2009-01-15 18:30:21.000-0600

My hunch is that it has something to do with this commit (http://svn.digium.com/view/dahdi?view=revision&revision=5275). I have not yet been able to reproduce it in my lab, though; I'm looping calls from a single-span T1 to the first span of a quad-span card on a single machine.

Could you try just revision 5275 and, if it crashes there, try the previous revision?
By: Terry Whelan (terrywhelan) 2009-01-16 11:49:04.000-0600

To reproduce the problem you need to bring the calls onto the box using non-dahdi channels; I use SIP from another box. app_meetme needs a dahdi channel: if the call arrives on dahdi then that channel is used, otherwise the pseudo device is opened.

So far versions 2.1.0-rc5, rc4 and rc3 have all demonstrated the problem. These all have revisions higher than 5275. Rev 5275 crashed. rc2 (rev 5239) is running now.
Rev 5239 does not appear to display the problem.
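
The narrowing-down above amounts to a bisection between a known-good and a known-bad revision. A minimal sketch, assuming a hypothetical `crashes(rev)` callback that builds that dahdi-linux revision and runs the load test, and assuming every revision after the first bad one also crashes:

```python
def first_bad_revision(good, bad, crashes):
    """Return the lowest revision in (good, bad] for which crashes(rev)
    is True.

    `good` must be a revision known not to crash and `bad` one known to
    crash; `crashes` is a caller-supplied (hypothetical) test function.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if crashes(mid):
            bad = mid    # problem already present at mid
        else:
            good = mid   # problem introduced after mid
    return bad
```

With the data points in this comment, the call would be `first_bad_revision(5239, 5275, crashes)`; each step halves the remaining range, so at most six builds would be needed.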

By: Shaun Ruffell (sruffell) 2009-01-20 09:44:50.000-0600

An update for people who are following this. I was able to reproduce this, as TerryWhelan said, by using sipp to pump a bunch of calls into MeetMe. However, I don't yet have a fix. I can't just revert the change where the problem was introduced, because that was a fix for another issue caused by the new pluggable echo canceler architecture. Essentially, hw_echocan_off can't be called in atomic context, which means that functions that were previously called with interrupts disabled can no longer be. I'm still tracking down exactly what the source of the oops is, to make sure that everything that needs to be protected by the various locks in play here is.

You know, in thinking about it, I guess I could always punt and make sure that the echocanwith_params functions assume they are called in atomic context... although not ideal...
By: Digium Subversion (svnbot) 2009-01-26 01:19:12.000-0600

Repository: dahdi
Revision: 5811

U linux/trunk/drivers/dahdi/dahdi-base.c

------------------------------------------------------------------------
r5811 | sruffell | 2009-01-26 01:19:12 -0600 (Mon, 26 Jan 2009) | 7 lines

Ensure the channel is in a good state before placing it on the chans arrays.
Also ensure that dahdi_receive holds the chan_lock while iterating over the
chans array to prevent channels from entering or leaving the array while the
interrupt handler is running.

Related to issue DAHLIN-71 .

------------------------------------------------------------------------

http://svn.digium.com/view/dahdi?view=rev&revision=5811
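
The pattern described in the commit message (only publish a channel once it is fully set up, and hold the same lock while walking the array as while changing it) can be illustrated with a user-space analogy. This is not DAHDI code: `ChannelTable` and its methods are hypothetical names, and `threading.Lock` merely stands in for the kernel's `chan_lock`.

```python
import threading


class ChannelTable:
    """User-space analogy of a shared channel array guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for chan_lock
        self._chans = []

    def register(self, chan):
        # Refuse to publish a channel that is not fully initialised, so
        # the walker never sees a half-constructed entry.
        if not chan.get("ready"):
            raise ValueError("channel is not initialised")
        with self._lock:
            self._chans.append(chan)

    def unregister(self, chan):
        with self._lock:
            self._chans.remove(chan)

    def receive(self):
        # Hold the lock for the whole walk so that no channel can enter
        # or leave the array while it is being iterated.
        with self._lock:
            return [chan["name"] for chan in self._chans]
```

In the kernel the walker is an interrupt handler, so the real code has constraints (no sleeping while the lock is held, interrupt-safe lock variants) that this sketch does not model.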
By: Shaun Ruffell (sruffell) 2009-01-26 01:51:46.000-0600

Revision 5811 of dahdi/linux/trunk resolved the issue on my system in the lab. If possible, could someone else give it a try before I call this issue closed?
By: Terry Whelan (terrywhelan) 2009-01-26 16:40:31.000-0600

I have started my usual test with this revision. I should know in about an hour.

12 hours and some regression testing later all looks good. I tested with a couple of versions of asterisk 1.6, most importantly the 1.6.0.3 release. I always get the crash with the release version of dahdi, but never with this revision.

By: Shaun Ruffell (sruffell) 2009-01-27 14:09:35.000-0600

TerryWhelan: thanks for testing. I'm going to go ahead and close this now; if someone else tests and has a problem, we can reopen it. The next release of dahdi is due to enter the release candidate stage in the next few days, so this fix will be in dahdi-linux 2.2.0.
By: Digium Subversion (svnbot) 2009-01-27 20:54:53.000-0600

Repository: dahdi
Revision: 5865

_U linux/tags/2.1.0.4/
U linux/tags/2.1.0.4/drivers/dahdi/dahdi-base.c
U linux/tags/2.1.0.4/drivers/dahdi/dahdi_dynamic.c
U linux/tags/2.1.0.4/drivers/dahdi/tor2.c
U linux/tags/2.1.0.4/drivers/dahdi/wct1xxp.c
U linux/tags/2.1.0.4/drivers/dahdi/wct4xxp/base.c
U linux/tags/2.1.0.4/drivers/dahdi/wcte11xp.c
U linux/tags/2.1.0.4/drivers/dahdi/wcte12xp/base.c
U linux/tags/2.1.0.4/include/dahdi/kernel.h

------------------------------------------------------------------------
r5865 | sruffell | 2009-01-27 20:54:51 -0600 (Tue, 27 Jan 2009) | 28 lines

Merged revisions 5590,5811,5819 via svnmerge from
https://origsvn.digium.com/svn/dahdi/linux/trunk

........
r5590 | tzafrir | 2008-12-19 04:39:31 -0800 (Fri, 19 Dec 2008) | 4 lines

Fix the safety check in tor2 to be for SPANS_PER_CARD

Thanks to Eugene Teo, in a from issue DAHLIN-62 .

........
r5811 | sruffell | 2009-01-25 23:19:47 -0800 (Sun, 25 Jan 2009) | 7 lines

Ensure the channel is in a good state before placing it on the chans arrays.
Also ensure that dahdi_receive holds the chan_lock while iterating over the
chans array to prevent channels from entering or leaving the array while the
interrupt handler is running.

Related to issue DAHLIN-71 .

........
r5819 | sruffell | 2009-01-26 11:44:36 -0800 (Mon, 26 Jan 2009) | 3 lines

Manipulate the REGISTERED flag with atomic bitops now since the bit is set
outside the protection of any locks.

........

------------------------------------------------------------------------

http://svn.digium.com/view/dahdi?view=rev&revision=5865