[Home]

Summary:ASTERISK-08451: Frequent seg fault in ast_cdr_alloc() at cdr.c:438
Reporter:Ravi Sakaria (rsakaria)Labels:
Date Opened:2006-12-28 13:48:58.000-0600Date Closed:2007-03-14 08:44:35
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) bug8689.gdb.txt
Description:Seg fault during ast_cdr_alloc().  Happens once every 2-3 hours on box with light traffic (approximately 100 simultaneous calls).

****** ADDITIONAL INFORMATION ******

* Fedora Core 5 w/ stock kernel

We have three core dumps, all crashed as follows:

(gdb) bt
#0  0x0050418e in malloc_consolidate () from /lib/libc.so.6
#1  0x0050635a in _int_malloc () from /lib/libc.so.6
#2  0x005079a2 in calloc () from /lib/libc.so.6
#3  0x080ec689 in _ast_calloc (num=1, len=1004, file=0x812300b "cdr.c",
   lineno=438, func=0x81232dd "ast_cdr_alloc")
   at /usr/local/src/asterisk-1.4.0.vp/include/asterisk/utils.h:352
#4  0x08075392 in ast_cdr_alloc () at cdr.c:438
ASTERISK-1  0x080b480d in __ast_pbx_run (c=0xb791ed38) at pbx.c:2302
ASTERISK-2  0x080b582e in pbx_thread (data=0xb791ed38) at pbx.c:2556
ASTERISK-3  0x080ed68a in dummy_start (data=0xb793d480) at utils.c:545
ASTERISK-4  0x006363b6 in start_thread () from /lib/libpthread.so.0
ASTERISK-5  0x0056a33e in clone () from /lib/libc.so.6
Comments:By: Serge Vecher (serge-v) 2006-12-28 14:12:52.000-0600

100 simultaneous calls would certainly qualify as 'high' load ;) Is this backtrace   with 'DONT OPTIMIZE' flag on in menuselect, if not, please recompile and redo the by. Also, need to see the output of "thread apply all bt full"

By: Ravi Sakaria (rsakaria) 2006-12-28 14:18:15.000-0600

This is a backtrace with 'DONT OPTIMIZE' set.

I can't post the output of 'thread apply all bt full' because it will show confidential information, but I can make that and remote access to the box available to a Digium employee if one would like to call me.

In the meantime, here's the next best thing (bt full without 'thread apply'):

(gdb) bt full
#0  0x0050418e in malloc_consolidate () from /lib/libc.so.6
No symbol table info available.
#1  0x0050635a in _int_malloc () from /lib/libc.so.6
No symbol table info available.
#2  0x005079a2 in calloc () from /lib/libc.so.6
No symbol table info available.
#3  0x080ec689 in _ast_calloc (num=1, len=1004, file=0x812300b "cdr.c",
   lineno=438, func=0x81232dd "ast_cdr_alloc")
   at /usr/local/src/asterisk-1.4.0.vp/include/asterisk/utils.h:352
       p = (void *) 0x0
       __PRETTY_FUNCTION__ = "_ast_calloc"
#4  0x08075392 in ast_cdr_alloc () at cdr.c:438
       __PRETTY_FUNCTION__ = "ast_cdr_alloc"
ASTERISK-1  0x080b480d in __ast_pbx_run (c=0xb791ed38) at pbx.c:2302
       found = 0
       res = 0
       autoloopflag = -1217514552
       error = 0
       __PRETTY_FUNCTION__ = "__ast_pbx_run"
ASTERISK-2  0x080b582e in pbx_thread (data=0xb791ed38) at pbx.c:2556
       c = (struct ast_channel *) 0xb791ed38
ASTERISK-3  0x080ed68a in dummy_start (data=0xb793d480) at utils.c:545
       __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {6561780, 0,
       0, -1217514392, -1105725510, 24581600}, __mask_was_saved = 0}},
---Type <return> to continue, or q <return> to quit---
 __pad = {0xb76e34a0, 0x0, 0x0, 0x0}}
       __cancel_routine = (void (*)(void *)) 0x8067f85 <ast_unregister_thread>
       __cancel_arg = (void *) 0xb76e3ba0
       not_first_call = 0
       ret = (void *) 0x5a595e
       a = {start_routine = 0x80b5817 <pbx_thread>, data = 0xb791ed38,
 name = 0xb7945e10 "pbx_thread", ' ' <repeats 11 times>, "started at [ 2580] pbx.c ast_pbx_start()"}
ASTERISK-4  0x006363b6 in start_thread () from /lib/libpthread.so.0
No symbol table info available.
ASTERISK-5  0x0056a33e in clone () from /lib/libc.so.6
No symbol table info available.



By: Ravi Sakaria (rsakaria) 2006-12-29 11:00:17.000-0600

In ast_pbx_start(), is it true that we're instantiating a local structure 'pthread_t t', and then using it to create a detached thread... and then returning from ast_pbx_start() thereby causing 't' to go out of scope while the detached thread is potentially still running?

Or more likely, there must be something I'm not understanding...

Here is the section of code from pbx.c that I'm referring to:

enum ast_pbx_result ast_pbx_start(struct ast_channel *c)
{
       pthread_t t;
       pthread_attr_t attr;

       if (!c) {
               ast_log(LOG_WARNING, "Asked to start thread on NULL channel?\n");
               return AST_PBX_FAILED;
       }

       if (increase_call_count(c))
               return AST_PBX_CALL_LIMIT;

       /* Start a new thread, and get something handling this channel. */
       pthread_attr_init(&attr);
       pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
       if (ast_pthread_create(&t, &attr, pbx_thread, c)) {
               ast_log(LOG_WARNING, "Failed to create new channel thread\n");
               return AST_PBX_FAILED;
       }

       return AST_PBX_SUCCESS;
}

By: Tilghman Lesher (tilghman) 2006-12-29 15:55:31.000-0600

rsakaria:  yes, that's true, but that value is only used prior to the ast_pthread_create returning.  It does not need to be kept in memory past the time of the creation of the thread, especially since the thread is created as DETACHED.

By: Tilghman Lesher (tilghman) 2006-12-29 16:06:23.000-0600

rsakaria:  Is this core dump caused by a SIGSEGV or a SIGABORT?  (It's shown as one of the first lines in the core dump).

Actually, it would be better if you uploaded the entire gdb output in a file to the file upload area, since there is additional information we can glean from the output.

By: Ravi Sakaria (rsakaria) 2006-12-29 16:13:56.000-0600

Corydon76 - Thanks for the clarification on the local variable.  

Here is the full output: (moved to file upload area)



By: Tilghman Lesher (tilghman) 2006-12-29 17:09:42.000-0600

rsakaria:  this is a bug in Fedora.  Please contact them, as we are unable to do anything further on this.

By: Eric Romang (zataz) 2006-12-30 03:40:54.000-0600

Hello,

If this is a bug into Fedora, it should be also avaible for RedHat, Centos, and all other existing RedHat fork. Personnaly, I am astonished by this answer. Is this not more a bug in asterisk how don't support RedHat and they fork ?

By: Tilghman Lesher (tilghman) 2006-12-30 07:57:53.000-0600

Let me be a little more clear as to what is happening here.

During the allocation, you have essentially run out of memory (I'm oversimplifying for clarity), so malloc() internally is reorganizing memory to get more sequential chunks together, so it can give you the size of memory that you have requested.  During this reorganization, malloc() steps on an unallocated page, which causes a segmentation fault.

Generally, memory corruption is detected by malloc() which would result in abort(3) called.  This is why I asked to see if it was a SIGSEGV or a SIGABRT.  Since it is a SIGSEGV, this is a bug in the particular version of glibc that was packaged by Fedora Core 5.

By: Ravi Sakaria (rsakaria) 2007-01-02 09:06:49.000-0600

Corydon76 -

Your explanation makes sense -- thanks for that.  We've just updated glibc on that box from RedHat's 2.4-4 package to 2.4-11.  We're testing again, and I'll report back later today.

By: Ravi Sakaria (rsakaria) 2007-01-02 14:26:49.000-0600

Corydon76 -

Same result with the latest version of libc available from RedHat.  I understand that this doesn't necessarily rule out the possibility that the problem is with RedHat's libc distribution, but the thing that leaves me wondering is this:  We have quite a few other boxes (more than 100) running Asterisk versions prior to 1.4, and all of them are using Fedora Core 5 w/ the same libc.  None of them exhibit this same seg-fault in malloc_consolidate(), and they all have higher call volumes than this one that we're testing with.

Is it possible that some other thread is stepping on memory somewhere, and that the seg faults in malloc_consolidate() are a result of that?

By: Serge Vecher (serge-v) 2007-02-26 12:38:15.000-0600

is this still an issue with the latest glibc package / 1.4 svn?

By: Serge Vecher (serge-v) 2007-03-14 08:44:34

ok, no response for over 2 weeks. If you are able to reproduce this problem with 1.4.1, please reopen the bug with a new backtrace.