Summary:ASTERISK-14656: [patch] Crash in ast_readaudio_callback
Reporter:Richard Odekerken (rgj)Labels:
Date Opened:2009-08-14 01:57:02Date Closed:2011-06-07 14:01:07
Versions:Frequency of
Environment:Attachments:( 0) bt.txt
( 1) bt-1.4.27.txt
( 2) bt2.txt
( 3) crash-090909.txt
( 4) issue_15719-debug1.diff.txt
Description:We're experiencing almost-daily crashes in ast_readaudio_callback.
There seems to be corruption in the ast_filestream structure.

At first we thought it was just another appearance of issue 15616 (random crashes) but currently we're experiencing the same crash at the same point.

Every time it's a different agent, and a different extension. We have been completely unable to reproduce this behaviour. It occurs at seemingly random moments.


#0  0x0809ddce in ast_readaudio_callback (s=0x93fae00) at file.c:737
737                     if (s->orig_chan_name && strcasecmp(s->owner->name, s->orig_chan_name))
(gdb) bt
#0  0x0809ddce in ast_readaudio_callback (s=0x93fae00) at file.c:737
#1  0x0809df1e in ast_fsread_audio (data=0x80008) at file.c:791
#2  0x080f510a in ast_sched_runq (con=0x944ac78) at sched.c:363
#3  0x0809f192 in waitstream_core (c=0x9385538, breakon=Variable "breakon" is not available.
) at file.c:1289
#4  0x0809f74c in ast_waitstream (c=0x80008, breakon=0x93fae00 "8\025\r") at file.c:1305
ASTERISK-1  0x004b3738 in agent_call (ast=0x9486348, dest=0xb628a496 "4094", timeout=0) at chan_agent.c:764
ASTERISK-2  0x0808030e in ast_call (chan=0x9486348, addr=0x80008 <Address 0x80008 out of bounds>, timeout=524296) at channel.c:3407
ASTERISK-3  0x0054726b in ring_entry (qe=0xb628c840, tmp=0x9605ea8, busies=0xb628a7a4) at app_queue.c:1985
ASTERISK-4  0x00547a9c in ring_one (qe=0xb628c840, outgoing=0x9605ea8, busies=0xb628a7a4) at app_queue.c:2063
ASTERISK-5  0x00548d45 in try_calling (qe=0xb628c840, options=Variable "options" is not available.
) at app_queue.c:2954
ASTERISK-6 0x0054e265 in queue_exec (chan=0xb77cb4c0, data=0xb628cc70) at app_queue.c:4077
ASTERISK-7 0x080c30ef in pbx_exec (c=0xb77cb4c0, app=0x90bd058, data=0xb628cc70) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/strings.h:36
ASTERISK-8 0x00142462 in realtime_exec (chan=0xb77cb4c0, context=0xb77cb640 "default", exten=0xb77cb690 "14094", priority=2, callerid=0xb7717fd8 "0302753322", data=0x90b271d "@") at pbx_realtime.c:216
ASTERISK-9 0x080caae8 in pbx_extension_helper (c=0xb77cb4c0, con=Variable "con" is not available.
) at pbx.c:1874
ASTERISK-10 0x080cf246 in __ast_pbx_run (c=0xb77cb4c0) at pbx.c:2283
ASTERISK-11 0x080d118e in pbx_thread (data=0xb77cb4c0) at pbx.c:2599
ASTERISK-12 0x08102755 in dummy_start (data=0x93fae00) at utils.c:856
ASTERISK-13 0x00d653cc in start_thread () from /lib/tls/libpthread.so.0
ASTERISK-14 0x00ccf96e in clone () from /lib/tls/libc.so.6
(gdb) frame 0
#0  0x0809ddce in ast_readaudio_callback (s=0x93fae00) at file.c:737
737                     if (s->orig_chan_name && strcasecmp(s->owner->name, s->orig_chan_name))
(gdb) print s->orig_chan_name
$1 = 0x80008 <Address 0x80008 out of bounds>
(gdb) print s->owner->name
$2 = 0x0
Comments:By: Russell Bryant (russell) 2009-08-25 14:50:54

I have posted a patch on issue 15109 which should address this issue.  Please give it a try!


By: Richard Odekerken (rgj) 2009-08-25 14:59:21

We're not using format_mp3.... we switched to gsm and wav as soon as we suspected that mp3 could be an issue.

By: Jason Parker (jparker) 2009-08-25 17:18:18

I would say that this backtrace is related to ASTERISK-14129, and that your other issue (which would not be fixed by this) is not.  Are you able to reproduce a crash that shows this same backtrace?

By: Richard Odekerken (rgj) 2009-08-26 00:37:19

We're not able to reproduce.

The patch has been installed and we did experience another ASTERISK-14566 crash since then, so that one is definitely not solved by the patch.

By: David Brillert (aragon) 2009-08-26 13:01:20

I think I have an almost identical bt and valgrind debug over on ticket 15544
No memory abort in the backtrace but the bt looks very similar.
I'm not using chan_agent or realtime.

The similarities are here (my backtrace includes elements of sched.c and utils.c):
_0  0x00227432 in __ast_pthread_mutex_lock (filename=0x281b74 "chan_sip.c",
   lineno=1954, func=0x2820fc "retrans_pkt",
   mutex_name=0x282108 "&pkt->owner->lock", t=0x6e6e6f43)
   at /root/asterisksvn/ast14/include/asterisk/lock.h:354
_1  0x002267ab in retrans_pkt (data=0x99b1330) at chan_sip.c:1954
_2  0x081113c5 in ast_sched_runq (con=0x929b980) at sched.c:363
_3  0x0027294c in do_monitor (data=0x0) at chan_sip.c:16784
_4  0x081230d6 in dummy_start (data=0x92a8420) at utils.c:856
_5  0x008b249b in start_thread () from /lib/libpthread.so.0
_6  0x0080942e in clone () from /lib/libc.so.6

By: David Brillert (aragon) 2009-08-26 15:39:48

The runq stuff in 15544's valgrind backtrace is downright spammy...

Here is the link to the valgrind stuff

By: shabbir hussain (shabbir52) 2009-08-27 02:44:53

just check your config files like if you use realtime exten then you should valid mysql drivers.

By: Richard Odekerken (rgj) 2009-08-27 04:35:53

We are using valid mysql drivers. This is not the issue at all.

By: Richard Odekerken (rgj) 2009-08-31 10:20:17

We have experienced another crash, so the #0015109 patch is not working for this issue. Backtrace is attached.

By: Russell Bryant (russell) 2009-09-01 16:38:58

I have uploaded a patch (issue_15719-debug1.diff.txt) to test a theory about what is going on here.  Basically, I'd like to know if the problem is a problem similar to what we just fixed in format_mp3.

Please apply this patch and let me know if you still get the crashes, or if you see some new ERROR messages showing up in your log from this patch.

By: Richard Odekerken (rgj) 2009-09-02 11:09:25

We'll apply the patch this evening and let you know the results.
BTW in a crash that happened today, I inspected orig_chan_name, but it's clearly not overwriting anything. See bt2.txt

By: Richard Odekerken (rgj) 2009-09-03 15:09:01

System has been running with the patch for 14 hours now. No crash so far (but sometimes they're 5 days apart). No corruption in the extra memory block either.

I'll keep you posted.

By: David Brillert (aragon) 2009-09-10 09:36:40


Another core dump (averaging one per day) running SVN r217156 including patch issue_15179.debug1.diff.txt with patched Asterisk-addons revision 1027
New backtrace core.5134.txt uploaded to ASTERISK-14558
My latest bt with your debug patch can be found here on my open ticket ASTERISK-14558

By: Richard Odekerken (rgj) 2009-09-10 14:18:17

We had another abort yesterday, patch issue_15719-debug1.diff.txt was applied
Nothing was logged since we started using the patched version

[root@g1 core]# grep "Found a non-zero" full-20090909
[root@g1 core]#

(gdb) frame 7
ASTERISK-3  0x0809de53 in ast_readaudio_callback (s=0x9ad9750) at file.c:752
752                     fr = s->fmt->read(s, &whennext);
(gdb) print *s
$3 = {fmt = 0x99158a8, flags = 0, mode = 0, filename = 0x0, realfilename = 0x0, vfs = 0x0, trans = 0x0, tr = 0x0,
 lastwriteformat = 0, lasttimeout = 160, owner = 0x9a85990, f = 0x9baf9d8, fr = {frametype = AST_FRAME_VOICE,
   subclass = 64, datalen = 320, samples = 160, mallocd = 0, mallocd_hdr_len = 0, offset = 64, src = 0x99158a8 "wav",
   data = 0x9ada80c, delivery = {tv_sec = 0, tv_usec = 0}, frame_list = {next = 0x0}, flags = 0, ts = 0, len = 0,
   seqno = 0}, buf = 0x9ada7cc "", {_private = 0x9ada94c, private = 0x9ada94c}, orig_chan_name = 0x0,
 big_waste_of_memory = '\0' <repeats 4095 times>}

See attachment crash-090909.txt for more details

By: David Brillert (aragon) 2009-09-10 14:45:51

The runq stuff looks exactly like a backtrace with valgrind I have posted to bug ASTERISK-14506

By: Richard Odekerken (rgj) 2009-09-10 14:49:14

I think most calls go through runq so it's logically to show up...

By: David Brillert (aragon) 2009-09-10 14:50:21

in my valgrind trace it looks like runq is leaking

By: Joshua C. Colp (jcolp) 2009-11-04 15:25:54.000-0600

Can you please retest using the latest version from SVN? Thanks.

By: Richard Odekerken (rgj) 2009-11-04 19:20:11.000-0600

On holiday right now, will be able to check in 2 weeks...

By: Leif Madsen (lmadsen) 2009-11-17 07:30:33.000-0600

Have you had a chance to re-test this issue with the latest release candidates? Thanks! (hope you had a great vacation! :))

By: Richard Odekerken (rgj) 2009-11-18 10:23:59.000-0600

Had a great vacation! Do you want me to check with a version from SVN or is 1.4.27-rc5 also good to test this?

By: Leif Madsen (lmadsen) 2009-11-18 10:38:39.000-0600

I'd say 1.4.27-rc5 is good enough, but if you can test from SVN directly, that would be the ideal as you'll have the latest commits.


By: Richard Odekerken (rgj) 2009-11-18 10:42:46.000-0600

Ok, will build tonight. Please be aware that since we cannot reproduce this issue, it occurs randomly. Sometimes it stays away for a week. We can only falsify (i.e. a crash means no success, but no crash could be 'just' luck), so a definite answer in case the issue is solved will take at least a few weeks (to make the odds for 'luck' really small)

By: Leif Madsen (lmadsen) 2009-11-18 13:31:24.000-0600

Fair enough. How about for now I close the issue as resolved, and if you end up with a crash on the latest, then you can reopen it and report back with a backtrace.


By: Leif Madsen (lmadsen) 2009-11-18 13:32:12.000-0600

Closing this issue for now as it will take several weeks to know if the latest version has resolved this issue or not.

If it has not, then the reporter will report back with a new backtrace with DONT_OPTIMIZE enabled, and we will take it from there.

By: Richard Odekerken (rgj) 2009-11-25 16:10:11.000-0600

Bad news. Just got the same crash again, running 1.4.27.
Backtrace attached, although this one hasn't been compiled with DONT_OPTIMIZE enabled. Will recompile tomorrow. Let me know if you need more info.

Got a console message "*** glibc detected *** free(): invalid pointer: 0xb61aa000 ***", seems it has to do with this crash.

EDIT please delete bt-full-1.4.27.txt attachment, it belongs to another crash.

By: Richard Odekerken (rgj) 2009-11-27 02:16:03.000-0600

And another one, this time identical as before, with a crash in file.c as shown in the top.

(the one from yesterday looked a little different, so I thought this might be relevant to note)

By: Tilghman Lesher (tilghman) 2009-11-27 09:27:18.000-0600

Given that there have been several fixes made in this code, we really would prefer if you ran the most recent release or SVN 1.4, as the changes made may already fix the issues identified here.

By: Richard Odekerken (rgj) 2009-11-27 09:31:51.000-0600

We _are_ running the most recent release, 1.4.27.

By: Tilghman Lesher (tilghman) 2009-11-27 10:05:42.000-0600

Oops.  In that case, I recommend that you follow the instructions in doc/valgrind.txt.

By: Richard Odekerken (rgj) 2009-11-28 09:01:41.000-0600

As described in ASTERISK-14566, that is not an option for us since valgrind imposes too much load on the system.

By: Tilghman Lesher (tilghman) 2009-11-29 22:57:36.000-0600

Given the crash, valgrind is required for us to proceed any further with this issue.  Can you reproduce this with a non-production or semi-production system?

By: Leif Madsen (lmadsen) 2009-12-21 10:14:15.000-0600

I'm suspending this issue right now as valgrind data is required in order to move this issue forward. If the reporter is able to reproduce on a test system and produce the valgrind data required to move this issue forward, they are welcome to reopen this issue. Thanks!