Summary: | DAHLIN-00071: sending a large number of calls to app_meetme causes kernel panic | ||
Reporter: | Terry Whelan (terrywhelan) | Labels: | |
Date Opened: | 2009-01-06 13:19:30.000-0600 | Date Closed: | 2009-02-03 11:20:11.000-0600 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | wct4xxp |
Versions: | 2.1.0.3 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ( 0) astcrash.txt ( 1) ringringscreen.png | |
Description: | On machine 1, create a dial file that calls machine 2, plays about 30 seconds of test messages, and hangs up. Create one of these every 5 seconds. On machine 2, put all incoming calls into a meetme conference. After a couple of hundred calls, machine 2 kernel panics. I have dialed in using SIP and on the PRI.

****** ADDITIONAL INFORMATION ******

Hardware: Dell PowerEdge SC440, TE405P quad T1
OS: CentOS 5.0
Dahdi Linux 2.1.0.3
One PRI connected.

extensions.conf:

[general]
static=yes
writeprotect=no
autofallthrough=yes

[test]
exten => s,1,Answer()
exten => s,2,NoOp(exten:${EXTEN} callerid:${CALLERID(NUMBER)})
exten => s,3,Playback(tt-allbusy)
exten => s,4,Wait(3.0)
exten => s,5,Playback(tt-allbusy)
exten => s,6,Wait(3.0)
exten => s,7,Playback(tt-allbusy)
exten => s,8,Wait(3.0)
exten => s,9,Hangup()

[newcall]
exten => test,1,MeetMe(test,dp(#*)q)
| ||
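The description gives the load generator on machine 1 only in prose. Below is a minimal sketch of one way to build it, assuming Asterisk call files are dropped into the outgoing spool directory. The channel string `SIP/machine2/test` (a hypothetical peer name), the mapping of the `[test]` context and `s` extension to the local leg, and all function names are assumptions and are not taken from the report. Each file is written in a staging directory and then renamed into the spool so that Asterisk does not pick up a half-written file; the staging directory must be on the same filesystem as the spool.

```python
import os
import tempfile
import time


def render_call_file(channel, context="test", extension="s", priority=1):
    """Return the text of a call file that dials `channel` and, once answered,
    runs the given context/extension/priority on the local side."""
    return (
        f"Channel: {channel}\n"
        f"Context: {context}\n"
        f"Extension: {extension}\n"
        f"Priority: {priority}\n"
    )


def drop_call_file(spool_dir, staging_dir, text):
    """Write the call file in staging_dir, then rename it into spool_dir."""
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.replace(tmp_path, final_path)  # atomic when both dirs share a filesystem
    return final_path


def generate_calls(spool_dir, staging_dir, channel, count,
                   interval=5.0, sleep=time.sleep):
    """Drop `count` call files, pausing `interval` seconds between them."""
    paths = []
    for index in range(count):
        text = render_call_file(channel)
        paths.append(drop_call_file(spool_dir, staging_dir, text))
        if index < count - 1:
            sleep(interval)
    return paths
```

A run matching the report's "one every 5 seconds" could look like `generate_calls("/var/spool/asterisk/outgoing", "/var/spool/asterisk/staging", "SIP/machine2/test", count=300)`, where the staging path and peer name are again placeholders.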
Comments: | By: Terry Whelan (terrywhelan) 2009-01-07 17:10:14.000-0600
Changed the test to bring all the calls in on a TDM T1, so far no crashes.

By: Terry Whelan (terrywhelan) 2009-01-08 15:36:34.000-0600
Identical behaviour in asterisk 1.6.0.3-RC1, with dahdi 2.1.0.3
Problem absent in asterisk 1.4.11 using zaptel 1.4.5.1

By: Leif Madsen (lmadsen) 2009-01-09 08:53:32.000-0600
Can you attach the kernel panic as a file to the bug tracker so a developer can see what is happening?

By: Terry Whelan (terrywhelan) 2009-01-09 15:45:00.000-0600
I am not a kernel/device developer. Not sure what file you want or how to get it. Since on panic the kernel halts not sure how to get the file. I am happy to do significant work to debug this problem, but need some guidance on how to provide the information.

By: Terry Whelan (terrywhelan) 2009-01-09 19:18:00.000-0600
Added a serial console and captured the output of the panic. The code for this version was asterisk trunk compiled --with-pri=no along with dahdi trunk.

By: Leif Madsen (lmadsen) 2009-01-10 09:00:55.000-0600
Hey Terry,
Thanks for the file. I'm not a kernel dev. either, so I'm not sure where to point you, but I suppose I assumed a kernel panic would create some sort of log, but that's maybe not true.
That may or may not be enough information, but we'll see what one of the other developers has to say. Thanks!

By: Terry Whelan (terrywhelan) 2009-01-13 15:03:19.000-0600
Tried various versions of the asterisk&(dahdi|zaptel) running with zaptel there were no problems, including using asterisk 1.4.22.1 (I believe this is the highest version of asterisk that supports zaptel). Using the same asterisk version and dahdi-linux trunk from svn caused the crash. I will try some older versions of dahdi from svn see where the problem was introduced.
Also tried CentOS 5.2 with same results.

By: Leif Madsen (lmadsen) 2009-01-14 13:43:31.000-0600
I've moved this to the DAHDI project as it appears the reporter has narrowed the issue down to some version of DAHDI, and not Asterisk itself. Thanks for the feedback!

By: Dave Miller (justdave) 2009-01-15 14:18:21.000-0600
We've hit this crash ourselves twice in the last 3 days.
Asterisk 1.4.22
dahdi-linux-2.1.0.3
I have dahdi-linux-2.0.0 already compiled and on the box (upgraded to 2.1.0.3 this last week, which seems to imply it's new since 2.0.0). I'm planning to revert to 2.0.0 tonight after hours.

By: Shaun Ruffell (sruffell) 2009-01-15 14:43:19.000-0600
justdave: Does t4_interrupt_gen2 show up in the Call Trace for you as well?

By: Dave Miller (justdave) 2009-01-15 14:57:59.000-0600
We don't have a serial console set up at the moment. Attached is a screenshot of what was visible on the console in the last crash.

By: Shaun Ruffell (sruffell) 2009-01-15 15:06:27.000-0600
No need to setup a serial console....your screen shot answered my question. I'll need to try and reproduce this here, but from the two data points, it doesn't look like anything in the board specific drivers, but related to one of the changes in dahdi-base.c.

By: Terry Whelan (terrywhelan) 2009-01-15 18:18:04.000-0600
The problem manifests itself when app_meetme uses the pseudo dahdi device. I commented out every line in /etc/dahdi/modules and have an empty system.conf, so there is probably no board specific code running. The panic still occurs. I am in the process of working through each version of dahdi 2.1.0. I will update this note as I find out more.

By: Shaun Ruffell (sruffell) 2009-01-15 18:30:21.000-0600
My hunch is that it has something to do with this: http://svn.digium.com/view/dahdi?view=revision&revision=5275 commit. But I've yet to be able to reproduce it in my lab, but I'm looping calls from a single span t1 to the first span on a quad span on a single machine in the lab.
Maybe could you try just revisions 5275 and if it crashes there, try the previous revision?

By: Terry Whelan (terrywhelan) 2009-01-16 11:49:04.000-0600
to reproduce the problem you need to bring the calls onto the box using non dahdi channels, I use sip from another box. A feature of app_meetme is a need to use a dahdi channel, if the call appears from dahdi then that is used otherwise the pseudo device is opened.
So far versions 2.1.0-rc5,rc4,rc3 have all demonstrated the problem. These all have revisions higher than 5275.
rev 5275 crashed.
rc2 (rev 5239) is running now.
rev 5239 does not appear to display the problem.

By: Shaun Ruffell (sruffell) 2009-01-20 09:44:50.000-0600
An update for people who are following this. I was able to reproduce this as TerryWhelan said by using sipp to pump a bunch of calls into Meetme. However I don't yet have a fix. I can't just revert the change where the problem was introduced because that was a fix to another issue caused by the new pluggable echo canceler architecture. Essentially, hw_echocan_off can't be called in atomic context, so that means that functions that were previously called with interrupts disabled can no longer be. I'm still tracking down exactly what the source of the oops is to make sure that everything that needs to be protected by the various locks in play here are.
You know, in thinking about it, I guess I could always punt and make sure that the echocanwith_params functions assume they are called in atomic context... although not ideal...

By: Digium Subversion (svnbot) 2009-01-26 01:19:12.000-0600
Repository: dahdi
Revision: 5811

U linux/trunk/drivers/dahdi/dahdi-base.c

------------------------------------------------------------------------
r5811 | sruffell | 2009-01-26 01:19:12 -0600 (Mon, 26 Jan 2009) | 7 lines

Ensure the channel is in a good state before placing it on the chans arrays.

Also ensure that dahdi_receive holds the chan_lock while iterating over the chans array to prevent channels from entering or leaving the array while the interrupt handler is running.

Related to issue DAHLIN-71.
------------------------------------------------------------------------

http://svn.digium.com/view/dahdi?view=rev&revision=5811

By: Shaun Ruffell (sruffell) 2009-01-26 01:51:46.000-0600
Revision 5811 of dahdi/linux/trunk resolved the issue on my system in the lab. If possible, could someone else give it a try before I call this issue closed?

By: Terry Whelan (terrywhelan) 2009-01-26 16:40:31.000-0600
I have started my usual test with this revision. I should know in about an hour.
12 hours and some regression testing later all looks good. I tested with a couple of versions of asterisk 1.6, most importantly the 1.6.0.3 release. I always get the crash with the release version of dahdi, but never with this revision.

By: Shaun Ruffell (sruffell) 2009-01-27 14:09:35.000-0600
TerryWhelan: thanks for testing. I'm going to go ahead and close this then now, and if someone else tests and have a problem potentially reopen. The next release of dahdi is due to enter release candidate stage here in the next few days, so this fix will be in dahdi-linux 2.2.0.

By: Digium Subversion (svnbot) 2009-01-27 20:54:53.000-0600
Repository: dahdi
Revision: 5865

_U linux/tags/2.1.0.4/
U linux/tags/2.1.0.4/drivers/dahdi/dahdi-base.c
U linux/tags/2.1.0.4/drivers/dahdi/dahdi_dynamic.c
U linux/tags/2.1.0.4/drivers/dahdi/tor2.c
U linux/tags/2.1.0.4/drivers/dahdi/wct1xxp.c
U linux/tags/2.1.0.4/drivers/dahdi/wct4xxp/base.c
U linux/tags/2.1.0.4/drivers/dahdi/wcte11xp.c
U linux/tags/2.1.0.4/drivers/dahdi/wcte12xp/base.c
U linux/tags/2.1.0.4/include/dahdi/kernel.h

------------------------------------------------------------------------
r5865 | sruffell | 2009-01-27 20:54:51 -0600 (Tue, 27 Jan 2009) | 28 lines

Merged revisions 5590,5811,5819 via svnmerge from
https://origsvn.digium.com/svn/dahdi/linux/trunk

........
r5590 | tzafrir | 2008-12-19 04:39:31 -0800 (Fri, 19 Dec 2008) | 4 lines

Fix the safety check in tor2 to be for SPANS_PER_CARD

Thanks to Eugene Teo, in a from issue DAHLIN-62.
........
r5811 | sruffell | 2009-01-25 23:19:47 -0800 (Sun, 25 Jan 2009) | 7 lines

Ensure the channel is in a good state before placing it on the chans arrays.

Also ensure that dahdi_receive holds the chan_lock while iterating over the chans array to prevent channels from entering or leaving the array while the interrupt handler is running.

Related to issue DAHLIN-71.
........
r5819 | sruffell | 2009-01-26 11:44:36 -0800 (Mon, 26 Jan 2009) | 3 lines

Manipulate the REGISTERED flag with atomic bitops now since the bit is set outside the protection of any locks.
........
------------------------------------------------------------------------

http://svn.digium.com/view/dahdi?view=rev&revision=5865 |
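The r5811 commit message names two rules: a channel is fully set up before it is placed on the chans array, and the receive path holds the lock for its whole pass over that array. The sketch below is a userspace analogy of those two rules, not the driver code. The class and method names are hypothetical, and a Python mutex only stands in for whatever kernel lock `chan_lock` actually is; an interrupt handler in the kernel cannot sleep the way a Python thread can.

```python
import threading


class Channel:
    """Hypothetical channel with state that must exist before it is used."""

    def __init__(self, number):
        self.number = number
        self.ready = False
        self.buffer = None

    def initialize(self):
        self.buffer = bytearray(8)
        self.ready = True

    def teardown(self):
        self.ready = False
        self.buffer = None


class ChannelTable:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for chan_lock
        self._chans = []               # stands in for the chans array

    def register(self, chan):
        # Rule 1: finish setup first, then publish the channel under the lock.
        chan.initialize()
        with self._lock:
            self._chans.append(chan)

    def unregister(self, chan):
        # Remove under the lock; tear down only once no pass can see it.
        with self._lock:
            self._chans.remove(chan)
        chan.teardown()

    def receive(self):
        """Rule 2: hold the lock for the entire pass, so channels cannot
        enter or leave the table while it is being walked."""
        processed = 0
        with self._lock:
            for chan in self._chans:
                if not chan.ready:
                    raise RuntimeError("saw a half-initialized channel")
                chan.buffer[0] = (chan.buffer[0] + 1) % 256
                processed += 1
        return processed
```

With both rules in place, a pass that runs while other threads register and unregister channels sees each channel either fully usable or not at all, which mirrors the pseudo-channel churn that app_meetme produced in the reports above.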