Summary: ASTERISK-14566: random crashes
Reporter: Richard Odekerken (rgj)
Labels:
Date Opened: 2009-07-30 14:26:14
Date Closed: 2013-01-14 14:27:02.000-0600
Priority: Critical
Regression? No
Status: Closed/Complete
Components: General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) crash4.txt
( 1) gdb.txt
( 2) log.txt
( 3) secondcrash-bt.txt
( 4) thirdcrash-bt.txt
Description:
Our production Asterisk server crashes randomly, about once every two days.
GDB shows random locations for the crash. The only thing they have in common is that it's something near malloc, but it's being called from really different locations each time.




****** ADDITIONAL INFORMATION ******

We've been experiencing the crashes since v1.4.24 and were hoping that v1.4.26 would solve the problem. It didn't.

Since we suspected memory issues, we've moved to another server as well. This didn't help.

Server is CentOS 4.7 Linux g1 2.6.9-78.0.8.ELsmp #1 SMP Wed Nov 19 20:05:04 EST 2008 i686 i686 i386 GNU/Linux
Asterisk has been compiled on the same box

We're a contact center with 3000 agents.
The server isn't busy at the moment of a crash, though; it sometimes crashes with only 10 or 15 simultaneous calls.
Uptime is not a factor: crashes are sometimes 5 days apart, sometimes 10 minutes.
Comments:

By: Raimund Sacherer (hatrix) 2009-08-10 02:03:29

Hi rgj,

I have random segfaults as well, and after searching around I came across this:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498399

I am not sure if it still applies to the current version of Asterisk, but after disabling my other registrations my system appears more stable.

hope it helps,
best
ray

By: Richard Odekerken (rgj) 2009-08-10 15:56:27

Thanks. Unfortunately this doesn't apply to our situation.
There is only one registration on the entire box...

By: Richard Odekerken (rgj) 2009-08-10 15:58:58

BTW another crash, this time not in malloc() or free() but clearly corrupted memory

(gdb) frame 0
#0  0x0809ddce in ast_readaudio_callback (s=0x93fae00) at file.c:737
737                     if (s->orig_chan_name && strcasecmp(s->owner->name, s->orig_chan_name))
(gdb) print s->orig_chan_name
$1 = 0x80008 <Address 0x80008 out of bounds>
(gdb) print s->owner->name
$2 = 0x0

rest of the stuff in attachment 'thirdcrash'

By: Richard Odekerken (rgj) 2009-08-12 03:08:13

Same crash today


[Aug 12 09:24:23] VERBOSE[25118] logger.c:     -- Executing Queue("SIP/10.32.128.131-b7757540", "13032")
[Aug 12 09:24:23] DEBUG[25118] res_config_mysql.c: MySQL RealTime: Everything is fine.
[Aug 12 09:24:23] DEBUG[25118] res_config_mysql.c: MySQL RealTime: Retrieve SQL: SELECT * FROM queues WHERE name = '13032'
[Aug 12 09:24:23] DEBUG[25118] res_config_mysql.c: MySQL RealTime: Everything is fine.
[Aug 12 09:24:23] DEBUG[25118] res_config_mysql.c: MySQL RealTime: Retrieve SQL: SELECT * FROM queue_members WHERE interface LIKE '%' AND queue_name = '13032' ORDER BY interface
[Aug 12 09:24:23] VERBOSE[25118] logger.c:     -- Started music on hold, class 'silence', on SIP/10.32.128.131-b7757540
[Aug 12 09:24:23] VERBOSE[25118] logger.c:     -- Stopped music on hold on SIP/g2-08d90738
[Aug 12 09:24:23] VERBOSE[25118] logger.c:     -- agent_call, call to agent '3032' call on 'SIP/g2-08d90738'
[Aug 12 09:24:23] VERBOSE[25118] logger.c:     -- <SIP/g2-08d90738> Playing 'beep' (language 'en')

(gdb) frame 0
#0  0x0809ddce in ast_readaudio_callback (s=0x8aaec70) at file.c:737
737                     if (s->orig_chan_name && strcasecmp(s->owner->name, s->orig_chan_name))
(gdb) print s->orig_chan_name
$1 = 0x80008 <Address 0x80008 out of bounds>
(gdb) Quit
(gdb) quit



By: Amilcar S Silvestre (amilcar) 2009-08-12 18:23:57

I think that the segfaults here are related to (if not the same as) the segfaults described in #0015109.

By: Richard Odekerken (rgj) 2009-08-22 05:34:21

The circumstances seem to be the same as you describe in https://issues.asterisk.org/view.php?id=15109#109257

The crash indeed seems to occur at a transfer. We're using queues and agents in the same manner.

But... we're not using non-files MOH at all.

By: Richard Odekerken (rgj) 2009-08-25 09:12:01

Another one

#0  0x00bea7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00c2b825 in raise () from /lib/tls/libc.so.6
#2  0x00c2d289 in abort () from /lib/tls/libc.so.6
#3  0x00c5fcda in __libc_message () from /lib/tls/libc.so.6
#4  0x00c66fe5 in _int_malloc () from /lib/tls/libc.so.6
#5  0x00c687c9 in calloc () from /lib/tls/libc.so.6
#6  0x080fc19e in newpvt (t=0xd30820) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/utils.h:359
#7  0x080fc6ab in ast_translator_build_path (dest=64, source=3) at translate.c:293
#8  0x0807fe53 in set_format (chan=0xb55240c0, fmt=6, rawformat=0xb5524448, format=0xb5524438, trans=0xb5524444, direction=0) at channel.c:3082
#9  0x0808052e in ast_channel_make_compatible (chan=0xb55240c0, peer=0x938fbd8) at channel.c:3092
#10 0x007f0310 in try_calling (qe=0xb59e0840, options=Variable "options" is not available.
) at app_queue.c:3081
#11 0x007f4265 in queue_exec (chan=0xb55240c0, data=0xb59e0c70) at app_queue.c:4077
#12 0x080c30ef in pbx_exec (c=0xb55240c0, app=0x8def7e8, data=0xb59e0c70) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/strings.h:36
#13 0x00122462 in realtime_exec (chan=0xb55240c0, context=0xb5524240 "default", exten=0xb5524290 "14111", priority=2,
   callerid=0xb77f8488 "0402740644", data=0x8de4cdd "@") at pbx_realtime.c:216
#14 0x080caae8 in pbx_extension_helper (c=0xb55240c0, con=Variable "con" is not available.
) at pbx.c:1874
#15 0x080cf246 in __ast_pbx_run (c=0xb55240c0) at pbx.c:2283
#16 0x080d118e in pbx_thread (data=0xb55240c0) at pbx.c:2599
#17 0x08102755 in dummy_start (data=0x6) at utils.c:856
#18 0x00d653cc in start_thread () from /lib/tls/libpthread.so.0
#19 0x00ccf96e in clone () from /lib/tls/libc.so.6

By: Jason Parker (jparker) 2009-08-25 15:18:53

Are you using MP3s (via format_mp3) for your Music on Hold?  Your thirdcrash-bt.txt looks like it has a cause very similar to ASTERISK-14129

By: Richard Odekerken (rgj) 2009-08-25 15:37:37

No - we're not. We got rid of mp3's as soon as we suspected it to be a possible cause.

As suggested in 0015109, we'll make sure that format_mp3.so is not being loaded at all.
EDIT: no, we patched it, so we can see if the patch makes a difference.



By: Richard Odekerken (rgj) 2009-08-26 10:30:08

Just got a crash, so the patch doesn't work for this issue.
Maybe it does work for the 'third crash' you refer to; I've filed this as a separate issue, since I think it is something different. You can find it at ASTERISK-14656

#0  0x00bea7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00c2b825 in raise () from /lib/tls/libc.so.6
#2  0x00c2d289 in abort () from /lib/tls/libc.so.6
#3  0x00c5fcda in __libc_message () from /lib/tls/libc.so.6
#4  0x00c66fe5 in _int_malloc () from /lib/tls/libc.so.6
#5  0x00c687c9 in calloc () from /lib/tls/libc.so.6
#6  0x080fc19e in newpvt (t=0xd30820) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/utils.h:359
#7  0x080fc6ab in ast_translator_build_path (dest=64, source=3) at translate.c:293
#8  0x0807fe53 in set_format (chan=0xb7639818, fmt=6, rawformat=0xb7639ba0, format=0xb7639b90, trans=0xb7639b9c, direction=0) at channel.c:3082
#9  0x0808052e in ast_channel_make_compatible (chan=0xb7639818, peer=0x8d096b8) at channel.c:3092
#10 0x009e6310 in try_calling (qe=0xb6e61840, options=Variable "options" is not available.
) at app_queue.c:3081
#11 0x009ea265 in queue_exec (chan=0xb7639818, data=0xb6e61c70) at app_queue.c:4077
#12 0x080c30ef in pbx_exec (c=0xb7639818, app=0x8c4c2f8, data=0xb6e61c70) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/strings.h:36
#13 0x002a8462 in realtime_exec (chan=0xb7639818, context=0xb7639998 "default", exten=0xb76399e8 "14887", priority=2, callerid=0xb7627300 "0650565556",
   data=0x8b9a47d "@") at pbx_realtime.c:216
#14 0x080caae8 in pbx_extension_helper (c=0xb7639818, con=Variable "con" is not available.
) at pbx.c:1874
#15 0x080cf246 in __ast_pbx_run (c=0xb7639818) at pbx.c:2283
#16 0x080d118e in pbx_thread (data=0xb7639818) at pbx.c:2599
#17 0x08102755 in dummy_start (data=0x6) at utils.c:856
#18 0x00d653cc in start_thread () from /lib/tls/libpthread.so.0
#19 0x00ccf96e in clone () from /lib/tls/libc.so.6



By: Richard Odekerken (rgj) 2009-09-15 11:45:57

Got one of these again

#0  0x00bea7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00c2b825 in raise () from /lib/tls/libc.so.6
#2  0x00c2d289 in abort () from /lib/tls/libc.so.6
#3  0x00c5fcda in __libc_message () from /lib/tls/libc.so.6
#4  0x00c66fe5 in _int_malloc () from /lib/tls/libc.so.6
#5  0x00c687c9 in calloc () from /lib/tls/libc.so.6
#6  0x080fc20e in newpvt (t=0xd30820) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/utils.h:359
#7  0x080fc71b in ast_translator_build_path (dest=64, source=3) at translate.c:293
#8  0x0807fe53 in set_format (chan=0xb5c73160, fmt=6, rawformat=0xb5c734e8, format=0xb5c734d8, trans=0xb5c734e4, direction=0) at channel.c:3082
#9  0x0808052e in ast_channel_make_compatible (chan=0xb5c73160, peer=0x9ac1228) at channel.c:3092
#10 0x00789310 in try_calling (qe=0xb5777840, options=Variable "options" is not available.
) at app_queue.c:3081
#11 0x0078d265 in queue_exec (chan=0xb5c73160, data=0xb5777c70) at app_queue.c:4077
#12 0x080c315f in pbx_exec (c=0xb5c73160, app=0x985d1c8, data=0xb5777c70) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/strings.h:36
#13 0x001b1462 in realtime_exec (chan=0xb5c73160, context=0xb5c732e0 "default", exten=0xb5c73330 "14923", priority=2, callerid=0xb5374150 "Unknown", data=0x984d1f5 "@") at pbx_realtime.c:216
#14 0x080cab58 in pbx_extension_helper (c=0xb5c73160, con=Variable "con" is not available.
) at pbx.c:1874
#15 0x080cf2b6 in __ast_pbx_run (c=0xb5c73160) at pbx.c:2283
#16 0x080d11fe in pbx_thread (data=0xb5c73160) at pbx.c:2599
#17 0x081027c5 in dummy_start (data=0x6) at utils.c:856
#18 0x00d653cc in start_thread () from /lib/tls/libpthread.so.0
#19 0x00ccf96e in clone () from /lib/tls/libc.so.6


(gdb) frame 6
#6  0x080fc20e in newpvt (t=0xd30820) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/utils.h:359
359     AST_INLINE_API(
(gdb) print t
$1 = (struct ast_translator *) 0xd30820
(gdb) print *t
$2 = {name = "\002\000\000\000I", '\0' <repeats 43 times>, "PÔï\t\210f\226\t\000\000\000\000\000\000\000\000ð¤\214\t\000 í¶¸¡¤\tH\000\232\t", srcfmt = 161417200, dstfmt = 161417200, newpvt =

0x99eac40,
 framein = 0x9a90730, frameout = 0xd30878 <main_arena+88>, destroy = 0xd30878 <main_arena+88>, sample = 0x9a5f4c8, buffer_samples = 161888184, buf_size = 13830280, desc_size = 13830280,
 plc_samples = 160341736, useplc = 161275944, native_plc = 163849624, module = 0x99508c0, cost = 161085440, active = 165780744, list = {next = 0xd308a8}}

By: Tilghman Lesher (tilghman) 2009-09-15 11:47:17

Please see doc/valgrind.txt.  This is the only information which is likely to be helpful.  Please do not upload any more backtraces.

By: Richard Odekerken (rgj) 2009-09-15 11:49:57

Valgrind imposes too much load on our system.
Any ideas or other options?

By: Tilghman Lesher (tilghman) 2009-09-15 14:28:43

What you have is memory corruption.  Valgrind is the only general approach that will help track this down.

By: Richard Odekerken (rgj) 2009-09-15 14:49:36

We know.
And since it's not reproducible and valgrind is too heavy, we're stuck.

By: Tilghman Lesher (tilghman) 2009-09-15 16:27:30

Well, there's something that one or more of your agents are doing that is probably causing this crash (not their fault, ours), and you'll need to narrow down what that special operation is.  Is it possible to segment out a section of your call center to run on a debug server, so that this could be tracked down?

Unfortunately, without knowing your application and what exactly your agents are doing, it's incredibly difficult to figure out the source of the memory corruption.

By: Leif Madsen (lmadsen) 2009-09-30 10:01:08

Just pinging this issue. I don't want to close this one out too soon since I want to give you some time to try and narrow down the scenario that is causing this.

By: David Brillert (aragon) 2009-09-30 10:12:36

I just uploaded valgrind dump to ASTERISK-14558 since these bug reports are/should be related.
https://issues.asterisk.org/file_download.php?file_id=24012&type=bug

By: Tilghman Lesher (tilghman) 2010-02-01 13:45:25.000-0600

One thing that MAY help track this down is my malloc_hold branch:
http://svn.digium.com/svn/asterisk/team/tilghman/malloc_hold (branch of 1.4)

Once downloaded, 'make menuselect', enable Compiler Flags/MALLOC_DEBUG, then define in individual files "#define MALLOC_HOLD 1" on the very top line of each file you want to debug (don't do main/frame.c if you can help it), then compile and run.

By: YvesGael (hurdman) 2010-05-21 02:39:55

Hi,
This seems to be the same bug, doesn't it?

This is all I get from my core dump:
#0  0x0000000000481668 in ast_readaudio_callback (s=0x2aabd4075098) at file.c:762
762 if (s->owner->timingfd > -1) {

and Asterisk crashes with status 138.

(I'm keeping my core dump in case you want more info.)

Is there a patch?

I use asterisk-1.6.1.18.

Thanks !

By: Tilghman Lesher (tilghman) 2010-05-21 10:17:55

hurdman:  There is no way to tell if your crash is related or not, and it most likely is not.  Please file a SEPARATE issue with your crash, but please also upgrade to 1.6.2, first.

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions, maintenance (bug) support for the 1.6.0 and 1.6.1 branches has ended. For continued maintenance support, please move to the 1.6.2 branch.

More information on this change can be found in the release announcement: http://www.asterisk.org/node/49924

By: Matt Jordan (mjordan) 2013-01-14 14:26:53.204-0600

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions, maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support, please move to the 1.8 branch, which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions.  After testing with Asterisk 1.8, if you find this problem has not been resolved, please open a new issue against Asterisk 1.8.