ASTERISK-07454: Segfault under moderate load on a 64-bit platform

Summary: ASTERISK-07454: Segfault under moderate load on a 64-bit platform

Reporter: colin westlake (colinwes)
Labels:
Date Opened: 2006-08-03 09:31:43
Date Closed: 2011-06-07 14:01:07
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) Asterisk_SVN-trunk-r38476M_gdb_trace_2nd_Aug.txt
(1) Trunk-r39295_gdb_traces.txt

Description: In a dialer application:
Calls are originated via AMI and sent out over Zap (quad-span Digium card) by way of Local channels.
Calls that successfully pass AMD are then sent to a queue on a remote server (1.2.7.1) over IAX.
At a sustained dial rate of ~1 cps (using < 40 channels), a crash normally occurs within half an hour. At higher dial rates, crashes become more frequent.
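
For readers unfamiliar with the setup described above, here is a minimal sketch of what a paced AMI Originate looks like on the wire. It is not the reporter's dialer: the channel string `Local/5551234@dialout`, the context `amd-check`, and the helper names `build_originate` and `paced` are all hypothetical, and the sketch only builds the action text and the pacing loop, leaving out the AMI login and socket handling.

```python
import time


def build_originate(channel, context, exten, priority=1, action_id=None):
    """Return one AMI Originate action as a CRLF-delimited text block.

    AMI actions are "Key: Value" lines ending in CRLF, with an empty
    line marking the end of the action.
    """
    headers = [
        ("Action", "Originate"),
        ("Channel", channel),
        ("Context", context),
        ("Exten", exten),
        ("Priority", str(priority)),
        ("Async", "true"),  # do not block the manager session per call
    ]
    if action_id is not None:
        headers.append(("ActionID", action_id))
    return "".join(f"{key}: {value}\r\n" for key, value in headers) + "\r\n"


def paced(items, calls_per_second, sleep=time.sleep):
    """Yield items no faster than calls_per_second (simple fixed interval)."""
    interval = 1.0 / calls_per_second
    for index, item in enumerate(items):
        if index:  # no delay before the first call
            sleep(interval)
        yield item


if __name__ == "__main__":
    # Hypothetical numbers and dialplan names, for illustration only.
    for number in paced(["5551234", "5555678"], calls_per_second=1):
        print(build_originate(f"Local/{number}@dialout", "amd-check", "s"))
```

A real dialer would write each block to an authenticated manager connection instead of printing it. A fixed interval like this keeps the offered rate near the ~1 cps mentioned in the report, but it does not check whether a Zap channel is actually free before originating.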

****** ADDITIONAL INFORMATION ******

Not definitive, but I strongly suspect that originating more calls when all Zap channels are already in use precipitates a crash within a couple of minutes.

Running on Fedora 2.6.11-1.1369_FC4smp, twin-processor 3 GHz Xeon.

Other dial servers connected to the same queue server and handling exactly the same call pattern (calls are round-robin between servers), but running Asterisk 1.2.0-beta1 on identical hardware, crash very infrequently (once every 1-2 days).

Comments:

By: Serge Vecher (serge-v) 2006-08-03 09:36:51

what mods have been done to the source?
By: colin westlake (colinwes) 2006-08-03 09:52:23

Nothing other than one column added to the CDRs written by app_addon.mysql, but I don't think this can be relevant, as it has also been done on all the other servers, and in any case not loading this module fails to cure the problem.
By: Serge Vecher (serge-v) 2006-08-03 09:58:59

anything "interesting" reported on the console prior to the crash?
By: colin westlake (colinwes) 2006-08-03 10:02:06

Whoops - typo - I meant cdr_addon_mysql.c!
By: colin westlake (colinwes) 2006-08-03 10:11:31

well, under light load there is nothing to see... it's typically calling app_AMD with default parameters. (But no error messages)

When under heavy load (I suspect when being asked to originate with no Zap channels available) I have noticed this sort of thing in the message log:
[Jul 31 14:58:02] ERROR[30906] chan_zap.c: !! Got reject for frame 103, retransmitting frame 103 now, updating n_r!
[Jul 31 14:58:02] ERROR[30906] chan_zap.c: !! Got reject for frame 103, retransmitting frame 104 now, updating n_r!
[Jul 31 14:58:02] WARNING[13448] app_dial.c: Unable to create channel of type 'Zap' (cause 34 - Circuit/channel congestion)
[Jul 31 14:58:02] WARNING[13464] app_dial.c: Unable to create channel of type 'Zap' (cause 34 - Circuit/channel congestion)
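
To see at a glance which modules were complaining in the run-up to a crash, log lines like the ones above can be tallied by level and source file. The following is a rough sketch rather than an Asterisk tool: the regular expression is inferred only from the log lines quoted in this thread (both the short `app_dial.c:` form and the longer `chan_iax2.c:890 __schedule_action:` form), and the names `LOG_LINE`, `parse_line`, and `tally` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the lines quoted in this report; other Asterisk
# versions or logger settings may format lines differently.
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+)\[(?P<tid>\d+)\]:? "
    r"(?P<file>[\w.]+)(?::(?P<line>\d+) (?P<func>\w+))?: (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def tally(lines):
    """Count log lines per (level, source file) pair, skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["level"], record["file"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[Jul 31 14:58:02] ERROR[30906] chan_zap.c: !! Got reject for frame 103, "
        "retransmitting frame 103 now, updating n_r!",
        "[Jul 31 14:58:02] WARNING[13448] app_dial.c: Unable to create channel of "
        "type 'Zap' (cause 34 - Circuit/channel congestion)",
    ]
    for key, count in tally(sample).items():
        print(key, count)
```

Counting per minute instead of per file would be a small change (group on a prefix of the `ts` field) and might make it easier to line the congestion warnings up against the time of the crash.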
By: Russell Bryant (russell) 2006-08-05 16:08:05

Is this code built with or without optimizations?

I would be curious to know if running with DONT_OPTIMIZE enabled in the "Compiler Flags" section of "make menuselect" would make any difference.
By: colin westlake (colinwes) 2006-08-06 02:17:11

The particular crash in the attached backtrace was on a build with DONT_OPTIMIZE enabled, but it happens just as often with optimised code. I have also tried changing the logging verbosity, but that makes no difference either.
By: Tilghman Lesher (tilghman) 2006-08-06 21:20:26

Can you install a debug-enabled version of glibc? I'm curious to see just where _int_malloc is crashing.

Also, is this crash caused by a segfault (signal 11) or an abort (signal 6)? The gdb output from the beginning has been removed in your file upload.

It's worth noting that we've seen several mysterious crashes inside the glibc memory routines on 64-bit only. I'm beginning to suspect something is wrong with the glibc implementation on this platform.
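
The signal 11 versus signal 6 question matters because the two usually point in different directions: a segfault (SIGSEGV) means an invalid memory access was hit directly, while an abort (SIGABRT) often means glibc itself noticed something wrong, such as heap corruption, and called abort(). As a small sketch, assuming Asterisk is launched from a supervising Python script (which is not something this report describes), the terminating signal can be read from the child's return code. The helper name `describe_exit` is hypothetical.

```python
import signal


def describe_exit(returncode):
    """Describe a subprocess return code in terms of exit status or signal.

    On POSIX, Python's subprocess module reports death by signal N as a
    return code of -N.
    """
    if returncode >= 0:
        return f"exited normally with status {returncode}"
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"killed by {name} (signal {signum})"


if __name__ == "__main__":
    for code in (0, -6, -11):
        print(code, "->", describe_exit(code))
```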
By: colin westlake (colinwes) 2006-08-07 04:37:31

OK - will do. In the meantime I tried 1.2.0-beta1 (so as to be the same as the other 3 production dial servers), but compiled on 64-bit (all the others are 32-bit), and guess what... all machines passed 65,000 calls, but the 64-bit build crashed on signal 11 after 8 hours. Unfortunately the trace will not be much good, as optimization was on... I'll experiment more with a debug-enabled glibc as you suggest.
By: wins (winson) 2006-08-09 21:33:52

I'm having a similar problem on my 64-bit FC4.
I'm using Asterisk to dial out using Zap channels. When it reaches capacity, it seems to crash with a segfault (signal 11).
gdb /usr/sbin/asterisk -c core.xxxx
also segfaults

GNU gdb Red Hat Linux (6.3.0.0-1.21rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".

Core was generated by `asterisk'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /usr/lib64/libncurses.so.5...done.
Loaded symbols for /usr/lib64/libncurses.so.5
..............................
Reading symbols from /usr/lib/asterisk/modules/app_cdr.so...done.
Loaded symbols for /usr/lib/asterisk/modules/app_cdr.so
Reading symbols from /usr/lib/asterisk/modules/app_macro.so...done.
Loaded symbols for /usr/lib/asterisk/modules/app_macro.so
Reading symbols from /usr/lib/asterisk/modules/format_wav_gsm.so...done.
Loaded symbols for /usr/lib/asterisk/modules/format_wav_gsm.so
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Segmentation fault

Did it die in glibc?

I tried updating glibc to 2.3.6-3; it still crashes.
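
For capturing backtraces from a core file without an interactive session, on a machine where gdb itself runs cleanly, the invocation can be scripted. This is a sketch under assumptions: it relies on gdb's `--batch` and `-ex` options, which the gdb 6.3 build shown above may be too old to support, and the helper name `gdb_backtrace_argv` is hypothetical. The core file name `core.xxxx` is the placeholder from the comment above, not a real file.

```python
import shlex


def gdb_backtrace_argv(executable, core_file):
    """Build a gdb command line that prints backtraces and then exits."""
    return [
        "gdb",
        "--batch",                      # run the commands, then quit
        "-ex", "bt full",               # crashing thread, with locals
        "-ex", "thread apply all bt",   # every thread, for context
        executable,
        core_file,
    ]


if __name__ == "__main__":
    argv = gdb_backtrace_argv("/usr/sbin/asterisk", "core.xxxx")
    # Print the command for inspection; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```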
By: colin westlake (colinwes) 2006-08-15 10:24:54

See new trace - this time on 32-bit (2.6.12-1.1378_FC3smp), SVN-trunk-r39295.

This time it exited on signal 9; otherwise it seems the same as before, i.e. it falls over at very moderate loads (2 setups per second, 60 Zap channels). Other physically identical servers running 1.2.0-beta1 (2 on 32-bit, 1 on 64-bit) never fall over with the exact same call loading pattern (round-robin between servers).

Only other difference is that the other three are using app_machinedetect instead of app_AMD

I just had another crash running in console mode - this time there was some stuff of interest in the CLI:
loads of: [Aug 15 17:57:50] WARNING[32635]: chan_iax2.c:7511 socket_process: Received mini frame before first full voice frame

.... followed by quite a few:
[Aug 15 17:57:50] NOTICE[32640]: chan_iax2.c:890 __schedule_action: Out of idle IAX2 threads for scheduling!
[Aug 15 17:57:51] NOTICE[32641]: chan_iax2.c:6216 socket_read: Out of idle IAX2 threads for I/O, pausing!

followed by:
Ouch ... error while writing audio data: : Broken pipe
Warning, flexibel rate not heavily tested!
Segmentation fault

By: Joshua C. Colp (jcolp) 2006-08-16 12:35:23

Would it be possible for me to gain access so I could look at the core dump and try to track down what happened?
By: colin westlake (colinwes) 2006-08-16 13:29:31

Sorry for the delay - just saw your note. I need to get home now but will arrange some access later tonight or tomorrow morning. Could you email me directly with your IP (if you have a fixed address) and I'll ping you back with a login: c o l i n at s y n t e c co uk
By: Joshua C. Colp (jcolp) 2006-09-06 11:15:13

I'm going to suspend this for now. A lot of things have been isolated down to app_amd, including this issue. I am going to look further into the code for that module to see if it is doing anything bad.