|Summary:||ASTERISK-20227: Segfault (possible memory corruption?)|
|Reporter:||Jared Smith (jsmith)||Labels:|
|Date Opened:||2012-08-13 17:22:35||Date Closed:||2012-12-19 18:36:09.000-0600|
|Environment:||Linux||Attachments:||( 0) another_backtrace.20120820|
( 1) asterisk_backtrace_09032012.txt
( 2) asterisk_configs.tgz
( 3) backtrace_20227.txt
( 4) backtrace.3975
( 5) malloc_backtrace.txt
( 6) malloc-enhancements-188.8.131.52.diff
( 7) test_configs.tgz
|Description:||Another segfault I'm seeing (not the same one as ASTERISK-20226). Opening this bug at the request of mjordan.|
[Edit by Rusty Newton - removed older backtrace *from description* and attached as backtrace_20227.txt]
|Comments:||By: Jared Smith (jsmith) 2012-08-13 17:47:24.059-0500|
This is an updated backtrace, with more debugging symbols for glibc installed.
By: Rusty Newton (rnewton) 2012-08-16 18:24:36.587-0500
Jared, I see a lot of values optimized out. Can you get another backtrace with Asterisk compiled with DONT_OPTIMIZE and BETTER_BACKTRACES?
By: Jared Smith (jsmith) 2012-08-20 16:47:02.803-0500
This is another backtrace with a very similar crash.
By: Matt Jordan (mjordan) 2012-08-21 11:15:23.361-0500
Please make sure you hit the Send Back button once you've provided feedback - otherwise it may drift off the Triage radar.
Are they using chan_agent by any chance?
By: Rusty Newton (rnewton) 2012-09-19 20:32:08.877-0500
Jared, were you able to obtain a backtrace after recompiling with the options mentioned above?
By: Jared Smith (jsmith) 2012-09-20 02:34:09.786-0500
Sure, I have a bunch of them. I've attached another example here, this one should have both DONT_OPTIMIZE and BETTER_BACKTRACES turned on.
For the record, we're in the process of swapping out the hardware as well, just to verify that it's not a hardware issue.
By: Matt Jordan (mjordan) 2012-10-17 08:35:17.371-0500
Hey Jared -
I'm pretty sure this is a memory corruption of some sort. Can you provide the .conf files for the system(s) affected?
Ideally if this can be reproduced in a lab environment, a valgrind trace would also be hugely useful.
By: Matt Jordan (mjordan) 2012-10-17 08:45:31.031-0500
(Also: please remember to hit "Send Back" when you've provided feedback, otherwise it doesn't always show up in the Triage filters)
By: Jared Smith (jsmith) 2012-10-17 15:27:17.629-0500
I've attached my configs. It should be a fairly ordinary Asterisk install (using FreePBX as the front-end to generate the configs).
The only thing unusual about this server is the high number of queues in use on this system. Last I checked, there were somewhere around 180 queues in use at any given time on this system.
By: Rusty Newton (rnewton) 2012-10-19 10:20:57.042-0500
Thanks for the additional info. Will you be able to get valgrind output?
By: Jared Smith (jsmith) 2012-10-19 10:35:59.330-0500
No, I won't be able to get valgrind output. This is a production system handling tens of thousands of calls per day. Sorry :-(
By: Deniz (deniz) 2012-11-06 06:10:34.338-0600
We are seeing the same segfault running 1.8.17 in production.
By: Richard Mudgett (rmudgett) 2012-11-06 16:56:17.365-0600
The reviewboard patch for MALLOC_DEBUG enhancements should help locate the possible memory corruption.
MALLOC_DEBUG logs its output to stderr and to the /var/log/asterisk/mmlog file by default.
By: Matt Jordan (mjordan) 2012-11-08 11:12:43.533-0600
Any luck on running the production system with Richard's patches?
By: Matt Jordan (mjordan) 2012-11-08 15:17:33.678-0600
Attaching a patch (malloc-enhancements-184.108.40.206.diff) that provides Richard's MALLOC_DEBUG enhancements for this version.
There are two ways to use this patch:
1) Enable MALLOC_DEBUG. This will create an mmlog file that records information about any memory corruption it detects, which will be useful if one occurs.
2) Along with MALLOC_DEBUG, enable DO_CRASH. This will cause Asterisk to immediately crash when a memory corruption is detected, as opposed to waiting for something to access the now corrupted memory. If you can tolerate what may potentially be a 'quicker' crash, this would help as well.
By: Jared Smith (jsmith) 2012-11-08 17:52:40.876-0600
We tested the patch in the lab today, and were easily able to crash the lab system. I think the patch does more harm than help.
I'll post the backtrace from the crash on the lab system here shortly.
By: Jared Smith (jsmith) 2012-11-08 18:00:25.389-0600
This is the backtrace from a crash on the very latest 1.8 from SVN (revision 376029) on my lab system.
By: Matt Jordan (mjordan) 2012-11-08 20:59:59.073-0600
Nothing logged to mmlog?
By: Jared Smith (jsmith) 2012-11-09 08:33:13.286-0600
Nothing interesting logged to mmlog -- it looks like this:
1352409852 - New session
1352409921 - New session
1352409989 - New session
1352411167 - New session
1352417026 - New session
1352419089 - New session
1352420047 - New session
1352428444 - New session
1352428951 - New session
By: Richard Mudgett (rmudgett) 2012-11-09 16:35:54.610-0600
The new backtrace does not show a crash that I would expect with MALLOC_DEBUG enabled. The MALLOC_DEBUG code assumes that all allocations go through it. I would expect to see a __ast_alloc_region() in that backtrace.
The MALLOC_DEBUG code wipes the contents of a released block with the 0xdeaddead value and delays actually freeing the memory.
The debug code will prevent memory corruption writes from causing a crash because the freeing of a block is delayed. When the block is rotated back to the heap, it is checked to see if the memory has been changed from 0xdeaddead.
The debug code should cause a crash if code dereferences a pointer read from a released block, because a released block is wiped with the 0xdeaddead value. Therefore, such a dereference will attempt to access the address 0xdeaddead, which is usually an invalid memory address.
If you also enable the DO_CRASH option, a crash will be forced if an assertion fails or MALLOC_DEBUG reports a warning.
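For readers unfamiliar with the technique described in this comment, here is a minimal sketch of poison-on-free with delayed release. It is not the Asterisk MALLOC_DEBUG code: the names `debug_free`, `still_poisoned`, `demo_late_write`, and the four-slot quarantine are all hypothetical, and the caller passes the block size explicitly to keep the example short.

```c
#include <stdint.h>
#include <stdlib.h>

#define POISON 0xdeaddeadu      /* fill value named in the comment above */
#define QUARANTINE_SLOTS 4      /* hypothetical: how many frees are delayed */

struct held_block {
    uint32_t *data;             /* NULL while the slot is empty */
    size_t words;               /* payload length in 32-bit words */
};

static struct held_block quarantine[QUARANTINE_SLOTS];
static size_t next_slot;
static int corruption_reports;

/* Returns 1 if every word of the held block still equals POISON. */
static int still_poisoned(const struct held_block *b)
{
    for (size_t i = 0; i < b->words; i++) {
        if (b->data[i] != POISON) {
            return 0;
        }
    }
    return 1;
}

/* Poison the block, park it, and really free the oldest parked block. */
static void debug_free(uint32_t *data, size_t words)
{
    struct held_block *slot = &quarantine[next_slot];

    if (slot->data) {
        /* The oldest block rotates back to the heap: check it first. */
        if (!still_poisoned(slot)) {
            corruption_reports++;   /* something wrote after release */
        }
        free(slot->data);
    }
    for (size_t i = 0; i < words; i++) {
        data[i] = POISON;
    }
    slot->data = data;
    slot->words = words;
    next_slot = (next_slot + 1) % QUARANTINE_SLOTS;
}

/* Simulates a write through a stale pointer and returns the report count. */
int demo_late_write(void)
{
    uint32_t *stale = malloc(8 * sizeof *stale);
    if (!stale) {
        return -1;
    }
    debug_free(stale, 8);
    stale[3] = 42;  /* buggy write; the memory is still held in quarantine */

    /* Release enough other blocks to push the stale one out. */
    for (int i = 0; i < QUARANTINE_SLOTS; i++) {
        uint32_t *p = malloc(4 * sizeof *p);
        if (!p) {
            return -1;
        }
        debug_free(p, 4);
    }
    return corruption_reports;
}
```

The late write does not crash because the block has not yet been returned to the heap; it is only noticed when the block rotates out of the quarantine, which matches the behavior described above. A pointer read out of the poisoned block would hold 0xdeaddead and would normally fault when dereferenced.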
By: Jared Smith (jsmith) 2012-11-09 18:05:35.630-0600
Right -- I understand that the patch wouldn't catch this type of crash. What I'm saying is that the patch appears to be *causing* this type of crash, at least in our lab testing. With the patch, we can easily crash the system with backtraces similar to the latest one attached to this ticket -- without it, we can go much longer without a segfault.
By: Jared Smith (jsmith) 2012-11-13 14:51:18.486-0600
After chatting with mjordan on IRC, he asked that I attach the configs from the test run with 1.8 (from SVN) where we were seeing problems with the patch. I've attached the relevant configs -- everything else is a stock config from "make samples" in Asterisk.
By: Matt Jordan (mjordan) 2012-11-15 16:47:01.727-0600
Did some testing of this tonight by logging in two agents (jared/chris) and using a third SIP phone to dial into Queue 302. No crashes or memory reports kicked back yet.
Does this typically crash quickly, or do you usually script something to simulate a large number of calls?
By: Jared Smith (jsmith) 2012-11-15 21:36:02.271-0600
I could usually trigger the crash with a few dozen calls. I'll keep pounding on it in the lab and get some additional backtraces, if that's helpful. If I can get it to reliably crash again in the lab, I'll give you access to the box and let you work your magic.
By: Matt Jordan (mjordan) 2012-11-16 08:47:37.369-0600
Little bit more info on what I was testing:
I changed jared/chris into two local D40 SIP phones that I have (digium01/digium02). Otherwise, the dialplan/config is the same:
agent => digium01,4321,Jared Smith
agent => digium02,4321
musicclass = default
;member => Agent/chris
;member => Agent/larry
;member => Agent/paul
;member => Agent/patrick
;member => Agent/anthony
;member => Agent/derek
;member => Agent/jared
;member => Agent/olle
member => Local/digium01@agents
member => Local/digium02@agents
I then used call files to spam calls into the sales queue (extension 300) - the bash script creates 10 calls at a time. Randomly, at each phone I either ignore the call (which puts it back into the Queue) or Answer it, wait a bit, and hang up. So far I've processed about 100 calls without a crash.
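The bash script itself was not attached to the ticket. As a rough illustration of the same idea, here is a sketch in C (not the original script) that builds a batch of call files and moves them into a spool directory. The function names, the `load-N.call` file naming, and every channel, context, and directory value passed in are assumptions for illustration; only the `Channel`, `Context`, `Extension`, and `Priority` keys come from the Asterisk call-file format.

```c
#include <stdio.h>
#include <string.h>

/* Fills buf with call-file text; returns 0 on success, -1 if it does not fit. */
int format_call_file(char *buf, size_t len, const char *channel,
                     const char *context, const char *exten)
{
    int n = snprintf(buf, len,
                     "Channel: %s\n"
                     "Context: %s\n"
                     "Extension: %s\n"
                     "Priority: 1\n",
                     channel, context, exten);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Writes `count` identical call files into stage_dir, then renames each one
 * into spool_dir so the spool never sees a half-written file. Both
 * directories must be on the same filesystem for rename() to succeed.
 * Returns the number of files delivered.
 */
int spool_batch(const char *stage_dir, const char *spool_dir, int count,
                const char *channel, const char *context, const char *exten)
{
    char body[512], tmp[512], dst[512];
    int done = 0;

    if (format_call_file(body, sizeof body, channel, context, exten) != 0) {
        return 0;
    }
    for (int i = 0; i < count; i++) {
        snprintf(tmp, sizeof tmp, "%s/load-%d.call", stage_dir, i);
        snprintf(dst, sizeof dst, "%s/load-%d.call", spool_dir, i);

        FILE *f = fopen(tmp, "w");
        if (!f) {
            continue;
        }
        fputs(body, f);
        if (fclose(f) == 0 && rename(tmp, dst) == 0) {
            done++;
        }
    }
    return done;
}
```

Calling `spool_batch` with a count of 10 would mirror the "10 calls at a time" batches mentioned above, with the spool directory pointing at wherever the test system picks up outgoing call files.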
By: Rusty Newton (rnewton) 2012-12-19 18:17:30.955-0600
Jared, can you give us any further guidance on reproducing the crash?
By: Jared Smith (jsmith) 2012-12-19 18:34:11.647-0600
I think this bug can be safely closed. After applying the patch in ASTERISK-20226 (which I didn't think was related to this issue), we haven't had any more crashes in the past 3 weeks, 6 days, 19 hours, 10 minutes, 33 seconds. During that time, we've put 922,685 calls through the system.
By: Jared Smith (jsmith) 2012-12-19 18:36:09.073-0600
Assuming the patch from bug ASTERISK-20226 is added to the 220.127.116.11 release, I don't see any reason to keep this bug open. The problem doesn't seem to happen any more after applying the patch.
By: Matt Jordan (mjordan) 2012-12-20 08:28:58.258-0600
Yup, that patch is in 18.104.22.168-rc1. I'll close this out for now and if it rears its ugly head again, we'll reopen.