[Home]

Summary:ASTERISK-11169: Segmentation fault / Memory corruption with intensive AMI
Reporter:Eduard M. (callmewind)Labels:
Date Opened:2008-01-07 04:19:10.000-0600Date Closed:2008-01-31 15:35:27.000-0600
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20080109__bug11698.diff.txt
( 1) 20080110__bug11698__patch2.diff.txt
( 2) bt.txt
( 3) btfull.txt
( 4) console.txt
( 5) malloc_debug.txt
( 6) malloc_debug.txt.gz
( 7) malloc_debug3.txt
( 8) malloc_debug4.txt
( 9) malloc_debug5.txt
(10) threadapplyallbt
(11) valgrind.txt
(12) valgrind2.txt
(13) valgrind3.txt
(14) valgrind4.txt
(15) valgrind5.txt
Description:We've developed an application which uses intensively AMI, specially Redirect actions and UserEvents. We are using only SIP channels.
Asterisk crashes randomly, one time at day, with 10-15 calls.
So we build some kind of test system with 2 Asterisk, one making calls to the other, and it crashes at 30 minutes or so.

****** ADDITIONAL INFORMATION ******

Asterisk is compiled with DEBUG_THREADS, DETECT_DEADLOCKS and DONT_OPTIMIZE. Below are attached the console output, and backtraces. I also have the core file (30 MB).
Comments:By: Tilghman Lesher (tilghman) 2008-01-07 15:37:31.000-0600

Okay, I need you to run by following the instructions in doc/valgrind.txt, and upload the resulting files EVEN IF IT DOESN'T CRASH when running under valgrind.

By: Eduard M. (callmewind) 2008-01-08 04:02:08.000-0600

Ok, files uploaded but I couldn't exectue valgrind with --log-file-exactly option, it returns 'valgrind: Bad option '--log-file-exactly=valgrind.txt'; aborting.'. I executed valgrind with --log-file option instead.
There is another core file, but it's 8.2Mb gzipped.

By: Tilghman Lesher (tilghman) 2008-01-08 14:40:53.000-0600

I think I just fixed that problem with the recent commit for issue ASTERISK-10897.  Please upgrade to the latest 1.4 SVN and try again.

By: Eduard M. (callmewind) 2008-01-09 05:11:42.000-0600

Bad news,
today I upgraded to last 1.4 svn and it crashed again.
I've tested three times:
-The first time asterisk didn't crash but stopped accepting new calls and registration from the other asterisk of the test.
-Second time it crashed with a segmentation fault
-Third time I re-ran asterisk with valgrind until crashed.
I will upload valgrind outputs from the last crash and, if you want, I can upload backtraces too.

By: Tilghman Lesher (tilghman) 2008-01-09 11:21:58.000-0600

Please try this patch.  It should solve the race condition that you are presently experiencing which is causing a crash.

By: Eduard M. (callmewind) 2008-01-10 03:03:31.000-0600

No luck :(
It crashed again with the patch. New valgrind3.txt and malloc_debug3.txt files.

As a comment, I've tried a similar test but using AMI only to generate calls and hangup some of them, and the rest of the logic implemented via fastAgi (basically play tt-monkeys in both ways of the call) and it crashed too.

Another comment: if we run the test with 2 asterisk in the local network it crashes around ten minutes; and if we run it across internet it can take several hours to crash.



By: Tilghman Lesher (tilghman) 2008-01-10 12:18:36.000-0600

Okay, second patch for a *different* problem uploaded.  Please note that you need *both* patches.

By: Digium Subversion (svnbot) 2008-01-10 17:06:30.000-0600

Repository: asterisk
Revision: 97973

U   branches/1.4/channels/chan_sip.c
U   branches/1.4/main/translate.c

------------------------------------------------------------------------
r97973 | tilghman | 2008-01-10 17:06:30 -0600 (Thu, 10 Jan 2008) | 6 lines

1) When we get a translated frame out, clone it, because if the
translator pvt is freed before we use the frame, bad things happen.
2) Getting a failure from ast_sched_delete means that the schedule
ID is currently running.  Don't just ignore it.
(Closes issue ASTERISK-11169)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=97973

By: Digium Subversion (svnbot) 2008-01-10 17:38:08.000-0600

Repository: asterisk
Revision: 97978

_U  trunk/
U   trunk/channels/chan_sip.c
U   trunk/main/translate.c

------------------------------------------------------------------------
r97978 | tilghman | 2008-01-10 17:38:07 -0600 (Thu, 10 Jan 2008) | 14 lines

Merged revisions 97973 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r97973 | tilghman | 2008-01-10 17:08:36 -0600 (Thu, 10 Jan 2008) | 6 lines

1) When we get a translated frame out, clone it, because if the
translator pvt is freed before we use the frame, bad things happen.
2) Getting a failure from ast_sched_delete means that the schedule
ID is currently running.  Don't just ignore it.
(Closes issue ASTERISK-11169)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=97978

By: Eduard M. (callmewind) 2008-01-11 03:09:09.000-0600

New crash with both patches.
With valgrind asterisk freezes a pair of minutes before crashing, is this normal?
Whatever, I upload new valgrind4.txt and malloc_debug4.txt files.
And thank you for your fast responses!


By: Eduard M. (callmewind) 2008-01-11 06:46:19.000-0600

Maybe is useful:
A month ago, we have a machine with asterisk running very stable. There was sip terminals attached and a digium card making and receiving external calls throught primarys.
Segmentation fault issues started when we unplugged the digium card and started making external calls throught a VOIP provider, so the problem seems related to SIP to SIP calls.

By: Digium Subversion (svnbot) 2008-01-14 15:19:27.000-0600

Repository: asterisk
Revision: 98853

U   team/russell/translator_frame_fix-1.4/channels/chan_iax2.c
U   team/russell/translator_frame_fix-1.4/codecs/codec_zap.c
U   team/russell/translator_frame_fix-1.4/include/asterisk/frame.h
U   team/russell/translator_frame_fix-1.4/include/asterisk/translate.h
U   team/russell/translator_frame_fix-1.4/main/abstract_jb.c
U   team/russell/translator_frame_fix-1.4/main/frame.c
U   team/russell/translator_frame_fix-1.4/main/rtp.c
U   team/russell/translator_frame_fix-1.4/main/translate.c

------------------------------------------------------------------------
r98853 | russell | 2008-01-14 15:19:23 -0600 (Mon, 14 Jan 2008) | 17 lines

The bug being addressed here is related to issue ASTERISK-11169.

The issue is that it is possible for a translation frame to still be in use
when the translation path gets destroyed.  Specifically, this can happen if
a masquerade happens in between a call to ast_read() and ast_write(), since
the channel is not locked between these two operations.

This hack of a fix is written in such a way that it is API and ABI compatible
with previous versions of Asterisk 1.4.

First, the "has_timing_info" field in ast_frame was change to be a flags field.
A flag was added that gets set when a translator outputs an internal frame.
Then, when a frame gets free'd, if it has this flag set, a hint is given to
the translation core.  So, if the translator was requested to be destroyed while
this frame was still in use, it can be destroyed once the frame is sent to
ast_frame_free(), meaning that it is no longer needed.

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=98853

By: Eduard M. (callmewind) 2008-01-16 02:38:31.000-0600

How can I test the fix? Actually I can't checkout branch 1.4 and I can't find your 'translator_frame_fix-1.4' directory. It seems like someone had rollback the svn repository from two years.

#svn checkout http://svn.digium.com/svn/asterisk/branches/1.4 asterisk-1.4
svn: El URL 'http://svn.digium.com/svn/asterisk/branches/1.4' no existe
(That URL doesn't exist).

In http://svn.digium.com/view/asterisk/branches/ I only see 1.0, 1.2 and 1.2-netsec branches.
Is there some change in svn repository? Or something gone wrong??

By: Michiel van Baak (mvanbaak) 2008-01-16 12:00:53.000-0600

Right now the svn server is undergoing maintenance. It's syncing with the masters so give it some time to get up-to-date. Hopefully today or tomorrow things will be back to normal again.

By: Eduard M. (callmewind) 2008-01-17 05:34:53.000-0600

Ok, finally I could checkout the new code; I ran it but it crashed again. With debugger it crashed but no core dump was generated. New valgrind5.txt and malloc_debug5.txt uploaded.

By: Russell Bryant (russell) 2008-01-19 03:26:03.000-0600

I'm curious, are you using realtime for your SIP configuration?

By: Eduard M. (callmewind) 2008-01-19 05:27:34.000-0600

Yes, I'm using realtime with mysql. Since the test only uses a pair of SIP accounts maybe today or tomorrow I can run the test without it.

By: Eduard M. (callmewind) 2008-01-21 04:20:29.000-0600

Good news!
I'm running the test against the last 1.4 trunk but this time without realtime and  it's all ok at the momment. It's running for 1 hour and 15 minutes, previously tests crashed at 20 minutes or so.
I will let running the test more time, stay tuned :)

By: Clod Patry (junky) 2008-01-24 15:57:29.000-0600

do you have any update here?

By: Eduard M. (callmewind) 2008-01-25 02:50:03.000-0600

Sorry for the lack of response, I'm still making tests because I have no conclusive results:
Originally, we discovered the problem in a production system with asterisk 1.2. Since it is a bit outdated, we built the test with latest 1.4 version (considering updating the production system if we resolve the bug).
Now, without realtime we have two different scenarios:
-test with 1.2 still crashing, but the uptime is increased from 20 minutes to 3-4 hours
-test with 1.4 SVN don't crash at all, but something weird happens with AMI because at some point (4-5 hours working) it stops processing calls. I want to simplify more the test to make sure that the fault is in my code (it's written using asterisk-java) and not asterisk's. Eventually I start getting a java SocketException: broken pipe in the client side.

My objective is to run the test without crashing for 12 hours or so. Since for now I can't reproduce the same bug in 1.4, feel free to close the bug if you want (but it's still here with realtime). If I can reproduce it again without realtime I will reopen it, and if I'm still having problems with AMI but not segmentation faults I will open another bug.

By: Russell Bryant (russell) 2008-01-31 15:35:24.000-0600

I'm closing out this issue.  Please file another report if you determine that you still have a problem, as it sounds like what you're looking at now would be something different.