Summary: ASTERISK-00957: Kernel panic when running meetme with 30 legs
Reporter: zalex (zalex)
Labels:
Date Opened: 2004-01-30 16:30:57.000-0600
Date Closed: 2004-09-25 02:55:35
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) webley_oops.1.out
(1) webley_oops.2.out
Description: 30 legs in meetme (coming and going) crash the box within 15-20 min. Today's zaptel update (01/30/04).
ksymoops output attached.

****** ADDITIONAL INFORMATION ******

Dual Xeon 2.8 GHz, 2 TE410P, RH9, 2.4.22 #4 SMP,
MMX enabled, hyperthreading disabled.
Comments:

By: Brian West (bkw918) 2004-01-30 17:09:52.000-0600

Happen to have an intel e1000 in that box?

By: zalex (zalex) 2004-01-30 17:55:48.000-0600

No, it's Broadcom tigon3.

eth0: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:0b:db:92:50:3c
eth1: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:0b:db:92:50:3d

Let me point out that this happens only under meetme specific load.
With zap->sip, zap->ivr, sip->ivr calls - no crashes (even with many
more simultaneous calls).

By: Brian West (bkw918) 2004-01-31 14:20:36.000-0600

Have you tried forcing it to 100 Mbit?  I get this strange feeling about the gigabit network cards... I know it's proven that the e1000 will freak a box out when used with a 410p card.

By: Brian West (bkw918) 2004-02-01 18:45:32.000-0600

Can you try these steps:

Turn off SMP

Try to crash it.

Next turn off MMX

Try to crash.

bkw

By: Mark Spencer (markster) 2004-02-01 19:00:44.000-0600

Also, does it happen on pure zap conferencing or only when zap + something else?  Can you tell if it happens on entry or exit?

By: Matt Florell (mflorell) 2004-02-02 09:13:12.000-0600

I just submitted a similar bug:
http://bugs.digium.com/bug_view_page.php?bug_id=0000980

I am experiencing the kernel panic on the T400P as well as the TE410P, on an Athlon dual MP platform as well as the Intel P4 platform. I am also using RedHat 9.0. I will try compiling/running under non-SMP and post my results.

The similarities between the machines are Zaptel/RedHat/SMP; it's gotta be one of those.

By: bdolljr (bdolljr) 2004-02-02 12:13:55.000-0600

I have experienced the same behavior on my system.  I am testing Meetme using only a TE410P.  I have a system (IVR) connected to the TE410P (d4/ami/fxoks) and it is currently scripted to call into a single Meetme conference on the asterisk server via a single span on the TE410P.  I am currently dialing in on all 24 channels in the span. The IVR script dials into the Meetme conference, plays a 10 second recording, hangs up, and waits 5 seconds before dialing into the conference again.  After approx 3-5 minutes my asterisk server reboots.
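To put rough numbers on this test: each channel spends about 10 s in the conference and 5 s idle per cycle. Ignoring dial and answer time (an assumption, since the post does not give it), the small Python sketch below estimates an upper bound on how many joins, and equally many leaves, the conference sees before the reboot. The function name is hypothetical.

```python
def conference_churn(legs, talk_s, idle_s, minutes):
    """Upper-bound count of joins (and, equally, leaves) over the given time.

    Assumes every leg repeats a talk_s + idle_s cycle with zero call-setup time.
    """
    cycle_s = talk_s + idle_s
    joins_per_leg = (minutes * 60) // cycle_s
    return legs * joins_per_leg


if __name__ == "__main__":
    # 24 channels, 10 s recording, 5 s pause, reboot after 3 to 5 minutes.
    for minutes in (3, 5):
        print(minutes, conference_churn(24, 10, 5, minutes))
```

Under these assumptions the reboot arrives after at most roughly 288 to 480 joins and as many leaves.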

My configuration does not use anything other than the TE410P for the conference.  No SIP, IAX, etc.

My machines do not have an e1000, but do have dual embedded e100's.

I am not running SMP.

Not sure if it occurs on entry or exit.

I was waiting for the fix for bug 680 before I retested this, since I thought it could be related.  I now have a reprogrammed card and will retest tonight.

If it still occurs you are welcome to observe this behavior on my system.  I can provide SSH access.  Like I said, it only takes approx 3-5 min to reproduce.

By: Matt Florell (mflorell) 2004-02-02 12:21:18.000-0600

to bdolljr:
when you say your "asterisk server reboots", do you mean that it literally restarts itself (soft reboot), or does it kernel panic so that you have to manually power cycle the machine?
Also, what operating system are you running?

By: bdolljr (bdolljr) 2004-02-02 13:08:11.000-0600

It actually reboots (soft reboot).  I am not an expert with linux so I am not sure if my hw/sw is configured to reboot on a kernel panic (if that is even possible).  I know with windows :{ it can halt on a bsod or reboot automatically.  I am running RH8 2.4.18-14.

By: zalex (zalex) 2004-02-02 18:26:16.000-0600

We've tried several more things:

1. The box crashes with 30 sip calls in meetme
2. The last thing we can see before the crash is some leg exits from the conference
3. There are NO crashes with 30+ zap calls in meetme
4. We haven't tried the mixture of calls (zap, sip) yet.

Our next step is forcing NICs to 100 Mb, as recommended. Will add the bugnote as soon as results are available. Then - SMP and MMX.

BTW, about SMP - are you suggesting that we just rebuild zaptel without SMP (SMP commented out in Makefile) and leave TE410P # 1, TE410P # 2, eth0, eth1 interrupts distributed across CPU0 and CPU1? Or do you want all the interrupts to be processed by the same CPU?

By: Brian West (bkw918) 2004-02-02 18:33:56.000-0600

1) Try with complete UP boot
2) Try with IRQ affinity
try to break it on Zap with lots and lots of calls, coming and going

Then report your findings.
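
On Linux, IRQ affinity is set by writing a hexadecimal CPU bitmask to /proc/irq/<N>/smp_affinity (bit 0 is CPU0, bit 1 is CPU1). A minimal Python sketch of how the masks for a two-CPU box could be computed follows. The IRQ numbers and the device-to-CPU layout are hypothetical, not taken from zalex's machine; the real IRQ numbers would come from /proc/interrupts.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for a list of CPU indices (bit n = CPU n)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_commands(assignments):
    """Build (path, mask) pairs; writing mask to path as root pins that IRQ."""
    return [
        (f"/proc/irq/{irq}/smp_affinity", affinity_mask(cpus))
        for irq, cpus in sorted(assignments.items())
    ]


# Hypothetical layout: one TE410P per CPU, a NIC sharing CPU1.
example = {
    16: [0],  # TE410P #1 -> CPU0 (IRQ number is made up)
    17: [1],  # TE410P #2 -> CPU1 (IRQ number is made up)
    18: [1],  # eth0      -> CPU1 (IRQ number is made up)
}

if __name__ == "__main__":
    for path, mask in affinity_commands(example):
        print(f"echo {mask} > {path}")
```

A running irqbalance daemon may rewrite these masks, so it would likely need to be stopped for a pinning test to mean anything.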

By: zalex (zalex) 2004-02-02 19:45:52.000-0600

Forcing NICs to 100 Mb didn't help - crash in 10 min.
Zaptel rebuilt without SMP - didn't help - crash in 15 min.

By: Matt Florell (mflorell) 2004-02-02 21:30:08.000-0600

After 2 kernel panics today (each an hour apart) I recompiled everything under non-SMP, turned off HyperThreading, and started Linux with the non-SMP kernel. After doing so it has been 5 hours of regular use and I have not had a kernel panic yet. I will report back tomorrow and let everyone know if it is still not panicking.

By: zalex (zalex) 2004-02-03 13:07:41.000-0600

Disabling MMX also didn't help (though the test lasted a little longer) - crash in 40 min.

Our plan is to run 2 TE410P in a dual CPU box with IRQ affinity,
so a non-SMP kernel doesn't look like a reasonable option.

Would appreciate any suggestions on how to collect more valuable
info for ksymoops or locate the problem another way.

I've seen notes about TE410P firmware upgrade. Can it be applicable in
our case?

By: bdolljr (bdolljr) 2004-02-03 14:06:17.000-0600

I just retested a TE410P with upgraded firmware.  I ran about 15 minutes before my machine rebooted.  So it doesn't appear to be firmware related.  My test was with 24 ZAP legs coming and going.

BTW...  I am running with everything compiled for non-SMP and am running the non-SMP kernel.  How do you turn off hyperthreading?  I could try that as well and report my findings.

Thanks.

By: bdolljr (bdolljr) 2004-02-03 15:47:01.000-0600

Oh...  I see.  Hyperthreading is a Xeon thing.  Well my machine is just a PIII 1 Ghz, so I don't think the problem has anything to do with hyperthreading.

By: Matt Florell (mflorell) 2004-02-03 15:52:11.000-0600

No kernel panics today at all. I was running up to 36 concurrent meetme sessions with no problems at all.

The problem seems to be the SMP kernel in my case. That kind of stinks because I now have a nice new dual Athlon MP 2800+ system that can't run as SMP for my Asterisk box.

On my P4 system turning HT off and changing to a non-SMP kernel also worked to stop the meetme-induced kernel panics.

I am assuming that the problem is not hardware related since I had kernel panics with both the T400P and TE410P cards.

I've never had a problem with an SMP system kernel panicking like this before. What is it in the meetme code that has problems with SMP?

I think that someone who is very familiar with the meetme code needs to take a serious look at it and find out why it panics SMP machines. I do NOT consider changing all of my machines to non-SMP as a solution to this bug, it is just a quick desperate fix, and this bug should not be closed until the code is fixed.

Note to bdolljr: To turn off Hyperthreading you should just need to change it from ENABLED to DISABLED in your motherboard's BIOS at startup.

By: Brian West (bkw918) 2004-02-03 19:05:55.000-0600

So this has to be a threading issue with SMP.  Hmm, interesting.

By: sbisker (sbisker) 2004-02-04 09:22:25.000-0600

I'm having a similar problem with random kernel panics.  Dual Xeon 2.6Ghz.  2gig memory.  LSI Megaraid controller.  Two T400P cards.  It has dual on board e1000, but I removed those modules and put in a tulip card to eliminate the e1000 from the picture, to no avail.  

Specs:
Redhat 8.0
Kernel:  2.4.20-28.8smp
Zaptel:  0.8.0
LibPRI:  0.5.1
Asterisk:  0.7.1
Hyperthreading - Off
irqbalance - On
Meetme:  Enabled and module loaded, but no conferences in progress
15 SIP phones:  All Cisco 7960 and Polycom IP500

Spans 1-6 are Adtran 750 channel banks with FXS cards. Signalled fxo_ls
Span 7 e&m_w voice T1 from telco
Span 8 PRI from telco

I have a serial console on it now to capture the next panic.  Here's the abbreviated call trace from the last panic.  Thought it may have been astman triggered, so I haven't used astman in 2 days, but it just happened after about 3 hours of use.

EIP is at zt_receive

Process astman (PID: 2711)

Call Trace:   kfree_skbmem
tcp_recvmsg
tor2_intr
inet_recvmsg
datxlt_t1
sys_select
handleirq_event

edited on: 02-04-04 09:23

By: zalex (zalex) 2004-02-05 20:11:17.000-0600

The workaround provided by Mark (app_meetme.c) didn't work; the
ksymoops output is uploaded. The code change applies the same logic
to zap and non-zap channels - as opposed to what was there before (!?).

By: Matt Florell (mflorell) 2004-02-06 09:49:15.000-0600

I put the latest CVS on my dual Athlon test server and now I'm getting kernel panics after about 30 calls go in and out of a meetme room; not 30 concurrent, just 30 separate meetme calls over 15 minutes. It always seems to panic when a call leaves the meetme room, and it doesn't seem to matter if it is a Zap, SIP or Local channel. This is a very ugly bug. Could it be in any way related to RedHat? Are people with other distros running SMP having the same problem?

Side note: my other machine has not kernel panicked once since I switched it to a non-SMP kernel 4 days ago.

edited on: 02-06-04 09:49

By: bdolljr (bdolljr) 2004-02-06 13:34:59.000-0600

Would anybody be willing to point me to a procedure on how to run ksymoops?  I would like to compare my system crashes to the files posted here.  The reason for this is that my system REALLY does panic whether it's in SMP or non-SMP mode.

Again, I am running RH8 2.4.18-14 SMP and non-SMP kernels.
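
For readers with the same question: ksymoops reads the raw oops text (copied from the console, a serial console, or syslog) and resolves the addresses to symbol names using the System.map of the kernel that crashed. The Python sketch below only assembles the command line and does not run it. The -m option (System.map) and the trailing oops file argument are standard ksymoops usage; the /boot/System.map-<release> path is an assumption based on the usual Red Hat layout, and oops.txt is a hypothetical file holding the saved panic text.

```python
import shlex


def ksymoops_argv(oops_file, kernel_release, system_map=None):
    """Build an argv list for ksymoops; nothing is executed here."""
    if system_map is None:
        # Assumed location: Red Hat installs System.map-<release> under /boot.
        system_map = f"/boot/System.map-{kernel_release}"
    return ["ksymoops", "-m", system_map, oops_file]


if __name__ == "__main__":
    argv = ksymoops_argv("oops.txt", "2.4.18-14")
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The decoded trace is only meaningful when the System.map and the loaded modules match the kernel that actually crashed, so the command is best run on the same machine, booted into the same kernel.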

By: Paul Cadach (pcadach) 2004-02-07 23:18:08.000-0600

Not exactly on your subject, but for information.

I've been working a lot with H.323 under Asterisk (I have 3 boxes: one dual P-III/800 MHz, one dual Xeon 3.06 GHz, and one Celeron 800 MHz under VMware) and found the following relation. When I boot the system with a UP kernel (i.e. the kernel doesn't know anything about additional processors, hyperthreading, etc.), callgen323 (part of the OpenH323 project, used to generate/receive batches of H.323 calls) works fine for about 30000 calls within 2-3 hours, but with an SMP-aware kernel (2.4.20) it fails randomly after 500-2000 calls. On the single-CPU system (Celeron 800 MHz) with a non-SMP kernel everything works fine too. So I concluded that these problems look like incorrect thread synchronization (because on SMP systems threads run on different CPUs with different caches and can run simultaneously, not sequentially as on single-CPU systems). But MySQL, for example, has worked nicely for years on dual-CPU systems without any problems....

Where is the bottleneck for SMP-aware threads? Bad application code, bugs in the pthread library, or in kernels?

By: Mark Spencer (markster) 2004-02-08 23:45:01.000-0600

I've made another zaptel update which prevents things from getting opened, closed, conferenced, or unconferenced during the critical section in which conferencing is handled.  Please cvs update and let me know if this makes any difference.  Unfortunately, since I can't duplicate your problem locally I have to take these stabs in the dark.
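
The idea described here is that membership changes (open, close, conference, unconference) must not interleave with the code that walks the conference members to mix audio. A minimal user-space Python sketch of that pattern follows. It is not the zaptel change itself, which lives in kernel C and would rely on kernel locking primitives rather than threading.Lock; the class and function names are hypothetical.

```python
import threading


class Conference:
    """Toy conference whose member table is guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._members = {}  # channel name -> latest audio sample (int)

    def join(self, channel, sample=0):
        with self._lock:  # cannot run while mix() holds the lock
            self._members[channel] = sample

    def leave(self, channel):
        with self._lock:
            self._members.pop(channel, None)

    def mix(self):
        """Sum all samples; the lock keeps the member set stable meanwhile."""
        with self._lock:
            return sum(self._members.values())


def churn(conf, channel, rounds):
    """Simulate one leg repeatedly entering and leaving the conference."""
    for _ in range(rounds):
        conf.join(channel, 1)
        conf.leave(channel)


if __name__ == "__main__":
    conf = Conference()
    legs = [threading.Thread(target=churn, args=(conf, f"leg{i}", 1000))
            for i in range(30)]
    for t in legs:
        t.start()
    for _ in range(1000):
        conf.mix()  # runs concurrently with legs coming and going
    for t in legs:
        t.join()
    print(conf.mix())
```

The design choice is one lock shared by the mixer and by every membership change, so the mixer never sees a half-updated member table.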

By: bdolljr (bdolljr) 2004-02-09 13:24:31.000-0600

Updated to CVS 02/09/04.
Running RH8.
Upgraded kernel and kernel sources to 2.4.20-28.8
Running with UP kernel.  Rebuilt zaptel, libpri, asterisk under UP kernel.
Began 24 leg ZAP / Meetme test.  Machine panic'd after about 10 minutes.

My machine is available with ssh access for someone (mark?) to debug this problem.  If this is of interest, please let me know.

I will reboot under SMP kernel, rebuild and retest.  I will post my results, however, I believe they will be the same.

By: Mark Spencer (markster) 2004-02-09 13:49:44.000-0600

bdolljr: Can you absolutely confirm that you are not using MMX?  The problem as witnessed by zalex does not manifest itself on UP as far as I understand it, so I think your problem is not related.  If you can take the panic output and feed it through ksymoops, that may provide some useful information.  Can you also confirm there is only one zaptel interface in your machine?  Can you also confirm whether you have only zap devices participating or a mixture of zap and non-zap devices?

By: bdolljr (bdolljr) 2004-02-09 14:23:01.000-0600

In /usr/src/zaptel/ztconfig.h the line CONFIG_ZAPTEL_MMX is commented out.

I can confirm that there is only one zaptel interface in my machine.  If I lsmod I see only wct4xxp and zaptel, which appears to be using wct4xxp.  Note: Upon reboot of my machine I see wct4xxp and wcusb and zaptel.  I have set up my machine to autoload zaptel by running make config in /usr/src/zaptel.  Not sure why it seems to find wcusb every time I boot up.  However, I have been using rmmod wcusb before the tests I have been running, so I only have wct4xxp and zaptel loaded.

During my tests I have my IVR system (connected to ZAP d4 / ami / fxoks channels) dialing into and out of conference channel 0.  I have been monitoring the test by dialing into conference channel 1 with a SIP Cisco 7960.  Since there is MOH on my conference channels, I hear MOH on conf channel 1 until the machine panics.

I agree that my UP panics may be a different problem.  It doesn't appear as though anybody else in this thread is having a UP problem.  Maybe it's CPU utilization related?  My machine is a Dual - PIII 800 MHz.  Is running 24 conf channels too much for my machine in UP mode?  I have now rebooted with SMP kernel, rebuilt zaptel, libpri, and asterisk.  I verified only wct4xxp and zaptel loaded and I've been running approx 15 minutes. (no SIP monitoring on conf channel 1 this time.)

As I stated on 02/06/04 I would be happy to feed the panic output through ksymoops, however, I have never done that before and don't know how.  I've tried searching the web for instructions, but haven't found any.  Can you tell me where I would find the panic output file and what the command line would be to feed it through ksymoops?  You can respond here or find me on AIM @ bjrbigsky.

Thanks for your help.

By: philipp2 (philipp2) 2004-02-09 15:07:40.000-0600

Hi there, I am not sure this is at all related, but since the latest CVS upgrade I also experience a problem in that Asterisk simply "disappears"; the rest of the system keeps running though. The box is at a remote location so I cannot easily follow what is going on.

RedHat 7.3
Athlon XP 1.8 GHz
no MMX optimization, compiled for i386
CVS-02/06/04-11:46:21 started with safe_asterisk
ztdummy only, no zaptel hardware
AVM Fritz! PCI and chan_capi 0.3.1
MeetMe and MOH enabled (but probably not used)
Daily CRON "restart now" and "restart when convenient"

Feb  9 10:19:16 WARNING[7176]: Got 200 OK on REGISTER that isn't a register
Feb  9 10:27:16 NOTICE[9226]: Peer 'server1p' is now TOO LAGGED (2504 ms)!
Feb  9 10:27:26 NOTICE[9226]: Peer 'server1p' is now REACHABLE!
Feb  9 10:35:31 NOTICE[9226]: Peer 'server1p' is now TOO LAGGED (3173 ms)!
Feb  9 10:35:41 NOTICE[9226]: Peer 'server1p' is now REACHABLE!
Feb  9 10:36:55 WARNING[21519]: Don't know how to indicate condition 14
Feb  9 10:38:28 WARNING[22543]: Don't know how to indicate condition 14
Feb  9 10:56:41 WARNING[24591]: Don't know how to indicate condition 14
Feb  9 11:17:11 WARNING[26639]: Don't know how to indicate condition 14
--> Asterisk gone at some point between above and below <--
Feb  9 12:50:06 NOTICE[1024]: ast_capi_pvt(022452300,022456260,022154068,022154735,22452300,22154068,22154735,remote,0x2,2) (1,2,64) (0)(0.800000/0.800000) 0
Feb  9 12:50:06 NOTICE[1024]: ast_capi_pvt(022452300,022456260,022154068,022154735,22452300,22154068,22154735,remote,0x2,2) (1,2,64) (0)(0.800000/0.800000) 0
Feb  9 12:50:06 NOTICE[1024]: this box has 1 capi controller(s)
Feb  9 12:50:07 NOTICE[8201]: Removing message from aaln/1@192.168.23.18-1 tansaction 3
Feb  9 12:50:07 NOTICE[8201]: Removing message from aaln/1@192.168.23.17-1 tansaction 2
Feb  9 12:50:07 NOTICE[8201]: Got response back on aaln/1@192.168.23.18-1 for transaction 3 we aren't sending? (current = 4)
Feb  9 12:50:07 NOTICE[8201]: Got response back on aaln/1@192.168.23.18-1 for transaction 2 we aren't sending? (current = 4)
Feb  9 12:50:07 WARNING[1024]: Ignoring port for now



Feb  8 20:13:18 NOTICE[7176]: Failed to authenticate on REGISTER to '<sip:siprob@somehost.com>;tag=as7efed8e7'
Feb  8 20:13:20 WARNING[28687]: Don't know how to indicate condition 14
Feb  8 20:13:33 NOTICE[7176]: Registration for 'siprob@11.22.33.44' timed out, trying again
Feb  8 20:14:33 WARNING[30735]: Don't know how to indicate condition 14
Feb  8 20:16:44 WARNING[31759]: Don't know how to indicate condition 14
Feb  8 20:30:47 ERROR[4101]: received a call waiting CONNECT_IND
Feb  8 20:35:12 WARNING[36879]: Don't know how to indicate condition 14
--> Asterisk gone at some point between above and below <--
Feb  8 23:41:37 NOTICE[1024]: ast_capi_pvt(022452300,022456260,022154068,022154735,22452300,22154068,22154735,remote,0x2,2) (1,2,64) (0)(0.800000/0.800000) 0
Feb  8 23:41:37 NOTICE[1024]: ast_capi_pvt(022452300,022456260,022154068,022154735,22452300,22154068,22154735,remote,0x2,2) (1,2,64) (0)(0.800000/0.800000) 0
Feb  8 23:41:37 NOTICE[1024]: this box has 1 capi controller(s)
Feb  8 23:41:38 NOTICE[8201]: Removing message from aaln/1@192.168.23.17-1 tansaction 2
Feb  8 23:41:38 NOTICE[8201]: Removing message from aaln/1@192.168.23.18-1 tansaction 3
Feb  8 23:41:38 NOTICE[8201]: Got response back on aaln/1@192.168.23.18-1 for transaction 2 we aren't sending? (current = 5)
Feb  8 23:41:38 NOTICE[8201]: Got response back on aaln/1@192.168.23.18-1 for transaction 3 we aren't sending? (current = 5)
Feb  8 23:41:38 WARNING[1024]: Ignoring port for now

edited on: 02-10-04 06:52

By: zalex (zalex) 2004-02-09 16:52:27.000-0600

1. Updated zaptel from the cvs (02/09/04)
2. Disabled MMX
3. Eliminated 'workaround' in app_meetme.c
4. The same load test that was causing kernel panic in 10-15 min is up for 3+ hours now
5. No matter what happens next it's a big step ahead
6. Will keep testing
7. Thank you, Mark

By: woofie (woofie) 2004-02-09 20:46:48.000-0600

I am adding this information to this bug at the request made on the mailing list today under the topic of System Freeze. While I initially did not think my issues were related, they may be, so I will post all the info I can; hope it helps.

Problem:  Periodic complete server freeze: no network, no console, nothing; it takes the reset button and a reboot for the system to become operational. Review of the system afterwards shows absolutely nothing in any logs about a problem.

When Happens?

First two times happened during or soon after a call transfer from one Zap channel to another.  Thought this was something to do with call transfers and/or a phone plugged into extension immediately before call.
System up for 5 days before first hang,  second hang 23.5 hours later.
System up for 4 more days. then:
System hung this morning around 4:00AM,  only idle asterisk activity (No calls for hours, Main activity may be IAX2 Qualify and thread that looks for messages for VMWI).  Only other thing that may have been running is the default cron.daily job that runs at 4:00AM everyday. So, it may not be related to call transfer at all. No backtraces or anything else available during these events.
Have had some thread hangs but posted that as a different bug, 986.

I know this is only general info, hope it helps



Software Info:
RedHat 9  Kernel 2.4.20-28.9, SMP
Asterisk CVS-01/19/04-09:31:47
Asterisk Mods:
enabled OLD_DSP_ROUTINES
Mod logger.c to flush log entries to disk
Mod chan_iax2.c to use iax2.conf file
enabled SMP in zaptel makefile
enable 686 and MMX optimizations in Asterisk makefile
Custom Triple ring cadence in chan_zap.c

Running Zap cards mentioned in Hardware and have Voicepulse service using IAX2 via the network.  IAX2 config is set to qualify, no trunking.

No Meetme used, No astman, only calls via X100P, IAX2, call transfer, call parking, three-way via zap and iax2, voicemail, etc.

Hardware Info:
Dual Pentium III 600MHz (Coppermine)
4 X 256MB SDRAM
Chipset Intel Corp. 440BX Based (Yes I know PCI2.1 but it works?)
2 X  13 GB IDE Drives
Voodoo 3000 AGP Video Card for Console, Hardware Accel (Console rarely used)
2 X TDM400P  Full (TDM40B) Rev F (I think) 7 of 8 ports in use
1 X X100P
1 Compaq Quad Ethernet Card in 32Bit slot
 1 Port Active  4 X  Intel Corp. 82557/8/9 [Ethernet Pro 100] (#4) (rev 8)    
 Using CONFIG_EEPRO100=y  in Kernel

By: zoa (zoa) 2004-02-10 04:04:37.000-0600

zalex, how is it going today ?

By: zalex (zalex) 2004-02-10 11:12:46.000-0600

The test was running overnight, no problem. From our standpoint
the bug is fixed.