|Summary:||ASTERISK-04949: Asterisk segfaults with signal 11 from within logger.c:850|
|Reporter:||Daniel W. Halverson (danh_tbc_net)||Labels:|
|Date Opened:||2005-08-31 15:13:49||Date Closed:||2011-06-07 14:01:06|
|Environment:||Attachments:||( 0) output.txt|
( 1) output-9-14-05.txt
( 2) output-9-15-05.txt
( 3) output-9-6-05.txt
|Description:||We have been having an problem where Asterisk crashes about 2-3 times a day. Backtrace is attached. CVS-Head 6-1-05 does not exhibit this bug.|
#3 0x08057137 in ast_verbose (fmt=0x1 <Address 0x1 out of bounds>)
stuff = " -- Executing \033[1;36;40mResponseTimeout\033[0;37;40m(\"\033[1;35;40mZap/7-1\033[0;37;40m\", \"\033[1;35;40m7\033[0;37;40m\") in new stack\n\000w stack\n\000n new stack\n\000m\") in new stack\n\000tack\n", '\0' <repeats 3927 times>
Seems to indicate a pointer overflow.
|Comments:||By: Kevin P. Fleming (kpfleming) 2005-08-31 15:19:39|
We may be able to find the problem using this backtrace, but it will be much more helpful if you can provide a backtrace from a non-optimized build of Asterisk.
By: Daniel W. Halverson (danh_tbc_net) 2005-08-31 15:26:32
I compiled with "make valgrind" this morning, so it should be non-optimized already. (If I understood the instructions correctly.) If not, I can recompile and resubmit.
By: Kevin P. Fleming (kpfleming) 2005-08-31 15:33:46
OK, it appeared that your backtrace was from an optimized build, but it's not always obvious :-) I'll look into it deeper.
By: Daniel W. Halverson (danh_tbc_net) 2005-09-01 09:42:21
Two more crashes this morning, same basic core dump. Seems to happen more often when the system is busy.
By: Michael Jerris (mikej) 2005-09-02 22:24:05
I am having trouble lining up line numbers and making sense of this. The call from the line numbers has a value in a call that has a hardcoded NULL. Can you give us another bactrace (just like that one) but from most current head and comment the exact time\date you checked it out please.
By: Tilghman Lesher (tilghman) 2005-09-03 15:43:58
Given that this is crashing on a strdup(), during a memory allocation, and the item it is allocating memory for is quite obviously not NULL (stuff), this is most likely your memory beginning to fail. My recommendation is to replace the memory in this machine and try again.
By: Daniel W. Halverson (danh_tbc_net) 2005-09-04 13:52:43
The only thing is that I can revert back to the 6-1-05 CVS-Head, and it will run for a week at a time.
With the current CVS-Head, the segfault is always at the same point. I would expect it to happen at different points otherwise.
The system is an Intel Xeon server with ECC Memory.
Tuesday I'll run it with the current CVS-Head. I will also try to run some memory testing utilities between now and then.
By: Daniel W. Halverson (danh_tbc_net) 2005-09-06 09:28:20
Updated bt from crash this morning. This is based on CVS downloaded 9-6-05, Show version follows:
Asterisk CVS HEAD built by root@***** on a i686 running Linux on 2005-09-06 12:40:30 UTC
By: Mark Spencer (markster) 2005-09-12 00:32:46
Is there a specific sequence which dupes this?
By: Daniel W. Halverson (danh_tbc_net) 2005-09-14 09:25:30
I have not been able to come up with a specific set of steps. I've updated to CVS as of this morning, and see what I can find.
By: Daniel W. Halverson (danh_tbc_net) 2005-09-14 11:23:36
Just uploaded the segfault from todays CVS build. It almost looks like it's an overload of too many logging messages. "stuff" always repeats right around 3930 times. It does seem to happen when the phone system is busy.
By: Kevin P. Fleming (kpfleming) 2005-09-14 19:59:32
I have done a minor cleanup on the ast_verbose() function, but I didn't really see anything in there that would lead to this problem (there was a small buglet that would have kept the message list from growing to its maximum length under severe memory pressure, but that is all).
Do you have the 'timestamp' option turned on (either in asterisk.conf or via the -T parameter to asterisk)?
By: Kevin P. Fleming (kpfleming) 2005-09-14 20:02:24
That 'repeats 3930 times' is referring the ending NULL character, not the entire string. Since the buffer is 4096 bytes long, that seems reasonable.
By: Daniel W. Halverson (danh_tbc_net) 2005-09-15 09:32:15
I'm not running the timestamp option. I installed from CVS-Head this morning, and will monitor the operation today. First thing that I notice is logging is significantly more verbose to syslog. Specifically, I see newlines being posted, so a single log line may take up 5 or 6 physical screen lines. Not a complaint, just an observation.
I appreciate the time. Thanks again - Dan
By: Daniel W. Halverson (danh_tbc_net) 2005-09-15 10:05:53
Had another crash with coredump. It doesn't appear to have crashed at the same logger.c line. I'm still reviewing the dump.
By: Mark Spencer (markster) 2005-09-21 08:39:00
This sounds like it is somewhere else in Asterisk. I can go ahead and tell you this is likely to be a difficult bug to track down. Here's the order of events I think we need to try to track it down:
1. Trying to do this under valgrind and see if you can trip it up with normal operation.
2. Split up the operation of the system into multiple systems. Move the operation of items slowly from one system to the other until the problem moves from the first system to the second, and we can more narrowly identify what is causing the problem.
I know these are somewhat time consuming, but without a more narrow focus, it's very hard to debug this problem.
By: Tilghman Lesher (tilghman) 2005-09-21 12:09:35
Have you tried replacing the physical memory in this machine yet?
By: Matt O'Gorman (mogorman) 2005-10-04 11:54:41
Did new ram fix the problem, if not please reply. Otherwise can we close said bug.
By: Daniel W. Halverson (danh_tbc_net) 2005-10-05 10:32:31
Have not had a chance to test new RAM yet. Reverted back to 6-1-05 version and have run 2 weeks with no crashes.
By: Clod Patry (junky) 2005-10-05 11:57:06
dan: any place to try changing the ram or could we consider that bug fixed, maybe just suspended.
By: twisted (twisted) 2005-10-13 16:36:00
Can we get an update on this? Is this for sure a memory issue, or does this crash still happen with the new ram?
By: Daniel W. Halverson (danh_tbc_net) 2005-10-17 11:31:08
Let's throw the bug on hold for the moment. I'm redoing the configuration files with IP Manager, and will test with CVS and those files.