
Summary: ASTERISK-08823: Asterisk crashes randomly under heavy load with GSM->G.729 conversion
Reporter: Marco Cintolesi (dartvader)
Labels:
Date Opened: 2007-02-16 17:06:22.000-0600
Date Closed: 2007-06-18 18:27:57
Priority: Critical
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) bt
( 1) bt2
( 2) bt3
Description: Hi, we recently moved our Asterisk to 1.4.
Our central machine (we are an ITSP) receives a very large call volume (around 30000 calls/day).
Asterisk crashes randomly (within 1-2 days).
The OS is CentOS 4.4 x86_64 on a 4-Xeon machine (Dell 6850) with 4 GB RAM.
No interface cards, only SIP and IAX; Zaptel (ztdummy) provides timing for the IAX2 trunks.
Asterisk is compiled for x86_64.
CPU usage is very high from transcoding GSM (from SIP and IAX) to G.729 (to the Media Gateway); at peak times more than 300 channels are up.
I'd like to attach the core file, but it is very large (over 300 megabytes).

****** ADDITIONAL INFORMATION ******

Asterisk SVN-branch-1.4-r53821 built by root @ xxxx on a x86_64 running Linux on 2007-02-16 16:12:16 UTC
Comments:
By: Russell Bryant (russell) 2007-02-16 17:08:03.000-0600

Instead of attaching the core file, just attach a backtrace from it.  See doc/backtrace.txt for more information.

By: Marco Cintolesi (dartvader) 2007-02-17 06:15:26.000-0600

OK, thanks.
I updated Asterisk today (SVN branch 1.4), and GDB now complains that the binary differs from the one that produced the core file. I will post the BT when I have a new core file from this version (I hope never, but....)
Bye

By: Brandon Kruse (bkruse) 2007-02-18 22:32:09.000-0600

First off, I would like to see whether this still happens
with a true timing source, not ztdummy.

Second, raise your ulimit values (including ulimit -n)
and your IAX thread counts.

(maxiaxthreads=highnumber)

It will default to the highest number(200 i believe)

You will not run into a problem having this many threads,
because you have a lot of RAM and some beefy processors.


Give it a shot :]

Also, just a thought: maybe try out
a TC400B, a new transcoder card Digium just released.

http://www.digium.com/en/products/hardware/tc400b.php

It will make your cpu load go way down, and help probe
for possible causes.

The translation "cost" for ulaw/alaw-><-g729/g723.1 is 1.

96 g729 channels, check it out :]

-bkruse

By: Marco Cintolesi (dartvader) 2007-02-19 05:12:02.000-0600

Of course I increased the ulimit values (-n open files at 8192).
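
The limit that matters is the one the running process actually inherits, which can differ from what an interactive shell reports. As a minimal sketch (the helper name nofile_limit_ok is hypothetical and is not part of Asterisk), a process can compare its own soft limit on open files against a required value using the POSIX getrlimit() call:

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Hypothetical helper, not an Asterisk function.
 * Returns 1 if the soft limit on open file descriptors is at least
 * `needed`, 0 if it is lower, and -1 if the limit cannot be read.
 */
static int nofile_limit_ok(rlim_t needed)
{
    struct rlimit rl;

    assert(needed > 0);
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;                 /* errno describes the failure */
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;                  /* no soft limit at all */
    return rl.rlim_cur >= needed ? 1 : 0;
}
```

Calling nofile_limit_ok(8192) early at startup would show whether the value set with ulimit -n really reached the process. It only reports the limit; it does not raise it.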

In iax.conf i have set

iaxthreadcount=100
iaxmaxthreadcount = 1000

I don't know if the problem comes from a lack of free threads... I think Asterisk should log something if it runs out of free threads... these things LACK documentation...

We are testing the TC400B, but we are waiting to hear from Digium whether it is possible to use more than one card in the same machine (96 channels is 1/3 of our current traffic).

Still waiting for the new core file....

By: Marco Cintolesi (dartvader) 2007-02-19 12:02:52.000-0600

OK, here's the latest backtrace. Asterisk crashed after 22 hours of HARD work.

Note: Asterisk compiled without optimizations uses much more CPU time on this machine... under full load it goes from 55% (optimized) to 85% (dont_optimize).

Waiting for some info....

By: Marco Cintolesi (dartvader) 2007-02-19 12:16:39.000-0600

Reading again... this may be the same problem as ASTERISK-8118175... same OS, same kernel, same glibc, same CPU... Now I will try to disable Hyperthreading in the kernel.

By: Marco Cintolesi (dartvader) 2007-02-19 12:58:12.000-0600

New crash after only one hour... BT attached (bt2).
glibc version is: glibc-2.3.4-2.19

By: Brandon Kruse (bkruse) 2007-02-19 14:11:44.000-0600

Yes, I personally tested the Card.

The cards are stackable.

You CAN use 3 for 276 g729 channels.

-bkruse

By: Marco Cintolesi (dartvader) 2007-02-19 19:00:20.000-0600

Tonight I disabled Hyperthreading (it needed a reboot...). Now /proc/cpuinfo shows only 4 CPUs.
I also updated SVN to rev. 55435 and recompiled with no_optimization and debug_threads.

Can anyone take a look at the BT? Any clue?
Let's see... I think more BTs are on the way...

By: Tony Plack (plack) 2007-02-19 21:47:30.000-0600

Not sure that your issue is mine or not, but I have a simple single XEON CPU which is crashing as well.  I just set the box up with 55219 and I believe I have had 4 crashes.  The first 3, I was not sure if I just shut it down or not, but thought it strange.

Tonight, it looks like it had problems just after a DNS failure on an IAX peer.

Your backtrace shows Segmentation fault.

SIG 11, Segmentation fault could be bad memory or CPU, but I don't have any other problems on the box.

Interesting that your bt trace is occurring in the IAX peer function on your box as well.

However, on file bt2, it is faulting on freeing memory from a channel.

I am not an expert in the code (yet), but I would say there might be two issues here...



By: Marco Cintolesi (dartvader) 2007-02-20 10:54:24.000-0600

Another BT, as expected... Hyperthreading disabled.
Now I think there are three different problems:
the first BT seems to come from an IAX problem,
the second BT seems related to http://bugs.digium.com/view.php?id=9103 in free(),
the third ????? Sorry, but I'm not a C guru :-)

The problem is becoming URGENT. Can anyone raise the priority of this??

Please help.....



By: Serge Vecher (serge-v) 2007-03-07 11:56:54.000-0600

what's the status with 1.4.1 installed from tarball?

By: Marco Cintolesi (dartvader) 2007-03-07 12:10:30.000-0600

Hi, same problems with the 1.4.1 tag, installed 2 days ago: 1 or 2 crashes a day, with various backtraces (the same as those I sent) showing the same problems.
I can't send other BTs because Asterisk is now compiled optimized (it's not possible for us to run unoptimized... too much CPU usage).
Thanks

By: Marco Cintolesi (dartvader) 2007-03-07 12:19:46.000-0600

Last BackTrace

#0  0x000000335b807acc in pthread_mutex_lock () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x0000002a9f5a13a2 in retrans_pkt (data=Variable "data" is not available.
) at /usr/src/digium/asterisk/tags/1.4.1/include/asterisk/lock.h:532
       pkt = (struct sip_pkt *) 0x1b8abd0
       prev = Variable "prev" is not available.
(gdb) bt full
#0  0x000000335b807acc in pthread_mutex_lock () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x0000002a9f5a13a2 in retrans_pkt (data=Variable "data" is not available.
) at /usr/src/digium/asterisk/tags/1.4.1/include/asterisk/lock.h:532
       pkt = (struct sip_pkt *) 0x1b8abd0
       prev = Variable "prev" is not available.
(gdb) bt
#0  0x000000335b807acc in pthread_mutex_lock () from /lib64/tls/libpthread.so.0
#1  0x0000002a9f5a13a2 in retrans_pkt (data=Variable "data" is not available.
) at /usr/src/digium/asterisk/tags/1.4.1/include/asterisk/lock.h:532
#2  0x0000000000498979 in ast_sched_runq (con=0x911f30) at sched.c:359
#3  0x0000002a9f5e2526 in do_monitor (data=Variable "data" is not available.
) at chan_sip.c:15006
#4  0x00000000004a428e in dummy_start (data=Variable "data" is not available.
) at utils.c:545
#5  0x000000335b80610a in start_thread () from /lib64/tls/libpthread.so.0
#6  0x000000335afc68b3 in clone () from /lib64/tls/libc.so.6
#7  0x0000000000000000 in ?? ()

By: timrobbins (timrobbins) 2007-05-01 20:19:10

I'm also seeing crashes in IAXPEER(CURRENTCHANNEL) under moderate load - practically the same as your 1st backtrace.

It looks like function_iaxpeer() should be locking iaxsl[callno] before accessing iaxs[callno], but I'm still seeing the same crashes even after fixing that.
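
The locking pattern described above can be sketched roughly as follows. This is an illustration only, not the chan_iax2 code: call_locks and calls are hypothetical stand-ins for iaxsl and iaxs, and struct call_info, MAX_CALLS, call_set, and call_peer_name are made-up names. The idea is that the slot's mutex is held across both the NULL check and the read, so another thread cannot tear the call down in between.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_CALLS 8  /* hypothetical size, for illustration only */

/* Hypothetical stand-in for per-call state; not the real chan_iax2 struct. */
struct call_info {
    char peer[32];
};

/* Slot i of call_locks guards slot i of calls, mirroring the
 * iaxsl[callno] / iaxs[callno] pairing named in the comment above. */
static pthread_mutex_t call_locks[MAX_CALLS];
static struct call_info *calls[MAX_CALLS];

static void call_table_init(void)
{
    for (size_t i = 0; i < MAX_CALLS; i++) {
        pthread_mutex_init(&call_locks[i], NULL);
        calls[i] = NULL;
    }
}

/* Install or clear (ci == NULL) a call while holding its slot lock. */
static void call_set(size_t callno, struct call_info *ci)
{
    assert(callno < MAX_CALLS);
    pthread_mutex_lock(&call_locks[callno]);
    calls[callno] = ci;
    pthread_mutex_unlock(&call_locks[callno]);
}

/* Copy the peer name into buf while the slot lock is held.
 * Returns 0 on success, -1 if the slot is empty. */
static int call_peer_name(size_t callno, char *buf, size_t len)
{
    int res = -1;

    assert(callno < MAX_CALLS && buf != NULL && len > 0);
    pthread_mutex_lock(&call_locks[callno]);
    if (calls[callno] != NULL) {
        snprintf(buf, len, "%s", calls[callno]->peer);
        res = 0;
    }
    pthread_mutex_unlock(&call_locks[callno]);
    return res;
}
```

The reader copies the data out instead of returning a pointer, because a pointer into the call structure would stop being safe as soon as the lock is released. As noted in the comment, this kind of locking did not make the crashes go away, so it should be read as a necessary precaution and not as the fix for this issue.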

By: timrobbins (timrobbins) 2007-05-02 19:10:54

Ok, your second backtrace (bt2) is the same as issue 9103.

By: Tilghman Lesher (tilghman) 2007-06-11 17:50:46

There have been massive improvements since 1.4.1.  Could one of you who has been having these issues update to the current 1.4 SVN to check if this is still an issue (and perhaps give us some more recent backtraces)?

By: Russell Bryant (russell) 2007-06-18 18:27:55

It looks like this one has already been fixed.  Please reopen this bug if you still have a problem with the latest version.  Thanks!