Summary: | ASTERISK-14656: [patch] Crash in ast_readaudio_callback | ||
Reporter: | Richard Odekerken (rgj) | Labels: | |
Date Opened: | 2009-08-14 01:57:02 | Date Closed: | 2011-06-07 14:01:07 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/FileFormatInterface |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) bt.txt ( 1) bt-1.4.27.txt ( 2) bt2.txt ( 3) crash-090909.txt ( 4) issue_15719-debug1.diff.txt | |
Description: | We're experiencing almost-daily crashes in ast_readaudio_callback. There seems to be corruption in the ast_filestream structure. At first we thought it was just another appearance of issue 15616 (random crashes) but currently we're experiencing the same crash at the same point. Every time it's a different agent, and a different extension. We have been completely unable to reproduce this behaviour. It occurs at seemingly random moments. ****** ADDITIONAL INFORMATION ****** #0 0x0809ddce in ast_readaudio_callback (s=0x93fae00) at file.c:737 737 if (s->orig_chan_name && strcasecmp(s->owner->name, s->orig_chan_name)) (gdb) bt #0 0x0809ddce in ast_readaudio_callback (s=0x93fae00) at file.c:737 #1 0x0809df1e in ast_fsread_audio (data=0x80008) at file.c:791 #2 0x080f510a in ast_sched_runq (con=0x944ac78) at sched.c:363 #3 0x0809f192 in waitstream_core (c=0x9385538, breakon=Variable "breakon" is not available. ) at file.c:1289 #4 0x0809f74c in ast_waitstream (c=0x80008, breakon=0x93fae00 "8\025\r") at file.c:1305 ASTERISK-1 0x004b3738 in agent_call (ast=0x9486348, dest=0xb628a496 "4094", timeout=0) at chan_agent.c:764 ASTERISK-2 0x0808030e in ast_call (chan=0x9486348, addr=0x80008 <Address 0x80008 out of bounds>, timeout=524296) at channel.c:3407 ASTERISK-3 0x0054726b in ring_entry (qe=0xb628c840, tmp=0x9605ea8, busies=0xb628a7a4) at app_queue.c:1985 ASTERISK-4 0x00547a9c in ring_one (qe=0xb628c840, outgoing=0x9605ea8, busies=0xb628a7a4) at app_queue.c:2063 ASTERISK-5 0x00548d45 in try_calling (qe=0xb628c840, options=Variable "options" is not available. ) at app_queue.c:2954 ASTERISK-6 0x0054e265 in queue_exec (chan=0xb77cb4c0, data=0xb628cc70) at app_queue.c:4077 ASTERISK-7 0x080c30ef in pbx_exec (c=0xb77cb4c0, app=0x90bd058, data=0xb628cc70) at /usr/src/asterisk/asterisk-1.4.26/include/asterisk/strings.h:36 ASTERISK-8 0x00142462 in realtime_exec (chan=0xb77cb4c0, context=0xb77cb640 "default", exten=0xb77cb690 "14094", priority=2, callerid=0xb7717fd8 "0302753322", data=0x90b271d "@") at pbx_realtime.c:216 ASTERISK-9 0x080caae8 in pbx_extension_helper (c=0xb77cb4c0, con=Variable "con" is not available. ) at pbx.c:1874 ASTERISK-10 0x080cf246 in __ast_pbx_run (c=0xb77cb4c0) at pbx.c:2283 ASTERISK-11 0x080d118e in pbx_thread (data=0xb77cb4c0) at pbx.c:2599 ASTERISK-12 0x08102755 in dummy_start (data=0x93fae00) at utils.c:856 ASTERISK-13 0x00d653cc in start_thread () from /lib/tls/libpthread.so.0 ASTERISK-14 0x00ccf96e in clone () from /lib/tls/libc.so.6 (gdb) frame 0 #0 0x0809ddce in ast_readaudio_callback (s=0x93fae00) at file.c:737 737 if (s->orig_chan_name && strcasecmp(s->owner->name, s->orig_chan_name)) (gdb) print s->orig_chan_name $1 = 0x80008 <Address 0x80008 out of bounds> (gdb) print s->owner->name $2 = 0x0 | ||
Comments: | By: Russell Bryant (russell) 2009-08-25 14:50:54 I have posted a patch on issue 15109 which should address this issue. Please give it a try! https://issues.asterisk.org/view.php?id=15109 By: Richard Odekerken (rgj) 2009-08-25 14:59:21 We're not using format_mp3.... we switched to gsm and wav as soon as we suspected that mp3 could be an issue. By: Jason Parker (jparker) 2009-08-25 17:18:18 I would say that this backtrace is related to ASTERISK-14129, and that your other issue (which would not be fixed by this) is not. Are you able to reproduce a crash that shows this same backtrace? By: Richard Odekerken (rgj) 2009-08-26 00:37:19 We're not able to reproduce. The patch has been installed and we did experience another ASTERISK-14566 crash since then, so that one is definitely not solved by the patch. By: David Brillert (aragon) 2009-08-26 13:01:20 I think I have an almost identical bt and valgrind debug over on ticket 15544 No memory abort in the backtrace but the bt looks very similar. I'm not using chan_agent or realtime. The similarities are here (my backtrace includes elements of sched.c and utils.c): _0 0x00227432 in __ast_pthread_mutex_lock (filename=0x281b74 "chan_sip.c", lineno=1954, func=0x2820fc "retrans_pkt", mutex_name=0x282108 "&pkt->owner->lock", t=0x6e6e6f43) at /root/asterisksvn/ast14/include/asterisk/lock.h:354 _1 0x002267ab in retrans_pkt (data=0x99b1330) at chan_sip.c:1954 _2 0x081113c5 in ast_sched_runq (con=0x929b980) at sched.c:363 _3 0x0027294c in do_monitor (data=0x0) at chan_sip.c:16784 _4 0x081230d6 in dummy_start (data=0x92a8420) at utils.c:856 _5 0x008b249b in start_thread () from /lib/libpthread.so.0 _6 0x0080942e in clone () from /lib/libc.so.6 By: David Brillert (aragon) 2009-08-26 15:39:48 The runq stuff in 15544's valgrind backtrace is downright spammy... Here is the link to the valgrind stuff https://issues.asterisk.org/file_download.php?file_id=23299&type=bug By: shabbir hussain (shabbir52) 2009-08-27 02:44:53 just check your config files like if you use realtime exten then you should valid mysql drivers. By: Richard Odekerken (rgj) 2009-08-27 04:35:53 We are using valid mysql drivers. This is not the issue at all. By: Richard Odekerken (rgj) 2009-08-31 10:20:17 We have experienced another crash, so the #0015109 patch is not working for this issue. Backtrace is attached. By: Russell Bryant (russell) 2009-09-01 16:38:58 I have uploaded a patch (issue_15719-debug1.diff.txt) to test a theory about what is going on here. Basically, I'd like to know if the problem is a problem similar to what we just fixed in format_mp3. Please apply this patch and let me know if you still get the crashes, or if you see some new ERROR messages showing up in your log from this patch. By: Richard Odekerken (rgj) 2009-09-02 11:09:25 We'll apply the patch this evening and let you know the results. BTW in a crash that happened today, I inspected orig_chan_name, but it's clearly not overwriting anything. See bt2.txt By: Richard Odekerken (rgj) 2009-09-03 15:09:01 System has been running with the patch for 14 hours now. No crash so far (but sometimes they're 5 days apart). No corruption in the extra memory block either. I'll keep you posted. By: David Brillert (aragon) 2009-09-10 09:36:40 Russel, Another core dump (averaging one per day) running SVN r217156 including patch issue_15179.debug1.diff.txt with patched Asterisk-addons revision 1027 New backtrace core.5134.txt uploaded to ASTERISK-14558 My latest bt with your debug patch can be found here on my open ticket ASTERISK-14558 https://issues.asterisk.org/file_download.php?file_id=23755&type=bug By: Richard Odekerken (rgj) 2009-09-10 14:18:17 We had another abort yesterday, patch issue_15719-debug1.diff.txt was applied Nothing was logged since we started using the patched version [root@g1 core]# grep "Found a non-zero" full-20090909 [root@g1 core]# (gdb) frame 7 ASTERISK-3 0x0809de53 in ast_readaudio_callback (s=0x9ad9750) at file.c:752 752 fr = s->fmt->read(s, &whennext); (gdb) print *s $3 = {fmt = 0x99158a8, flags = 0, mode = 0, filename = 0x0, realfilename = 0x0, vfs = 0x0, trans = 0x0, tr = 0x0, lastwriteformat = 0, lasttimeout = 160, owner = 0x9a85990, f = 0x9baf9d8, fr = {frametype = AST_FRAME_VOICE, subclass = 64, datalen = 320, samples = 160, mallocd = 0, mallocd_hdr_len = 0, offset = 64, src = 0x99158a8 "wav", data = 0x9ada80c, delivery = {tv_sec = 0, tv_usec = 0}, frame_list = {next = 0x0}, flags = 0, ts = 0, len = 0, seqno = 0}, buf = 0x9ada7cc "", {_private = 0x9ada94c, private = 0x9ada94c}, orig_chan_name = 0x0, big_waste_of_memory = '\0' <repeats 4095 times>} See attachment crash-090909.txt for more details By: David Brillert (aragon) 2009-09-10 14:45:51 The runq stuff looks exactly like a backtrace with valgrind I have posted to bug ASTERISK-14506 By: Richard Odekerken (rgj) 2009-09-10 14:49:14 I think most calls go through runq so it's logically to show up... By: David Brillert (aragon) 2009-09-10 14:50:21 in my valgrind trace it looks like runq is leaking By: Joshua C. Colp (jcolp) 2009-11-04 15:25:54.000-0600 Can you please retest using the latest version from SVN? Thanks. By: Richard Odekerken (rgj) 2009-11-04 19:20:11.000-0600 On holiday right now, will be able to check in 2 weeks... By: Leif Madsen (lmadsen) 2009-11-17 07:30:33.000-0600 Have you had a chance to re-test this issue with the latest release candidates? Thanks! (hope you had a great vacation! :)) By: Richard Odekerken (rgj) 2009-11-18 10:23:59.000-0600 Had a great vacation! Do you want me to check with a version from SVN or is 1.4.27-rc5 also good to test this? By: Leif Madsen (lmadsen) 2009-11-18 10:38:39.000-0600 I'd say 1.4.27-rc5 is good enough, but if you can test from SVN directly, that would be the ideal as you'll have the latest commits. Thanks! By: Richard Odekerken (rgj) 2009-11-18 10:42:46.000-0600 Ok, will build tonight. Please be aware that since we cannot reproduce this issue, it occurs randomly. Sometimes it stays away for a week. We can only falsify (i.e. a crash means no success, but no crash could be 'just' luck), so a definite answer in case the issue is solved will take at least a few weeks (to make the odds for 'luck' really small) By: Leif Madsen (lmadsen) 2009-11-18 13:31:24.000-0600 Fair enough. How about for now I close the issue as resolved, and if you end up with a crash on the latest, then you can reopen it and report back with a backtrace. Thanks! Leif. By: Leif Madsen (lmadsen) 2009-11-18 13:32:12.000-0600 Closing this issue for now as it will take several weeks to know if the latest version has resolved this issue or not. If it has not, then the reporter will report back with a new backtrace with DONT_OPTIMIZE enabled, and we will take it from there. Thanks! By: Richard Odekerken (rgj) 2009-11-25 16:10:11.000-0600 Bad news. Just got the same crash again, running 1.4.27. Backtrace attached, although this one hasn't been compiled with DONT_OPTIMIZE enabled. Will recompile tomorrow. Let me know if you need more info. Got a console message "*** glibc detected *** free(): invalid pointer: 0xb61aa000 ***", seems it has to do with this crash. EDIT please delete bt-full-1.4.27.txt attachment, it belongs to another crash. By: Richard Odekerken (rgj) 2009-11-27 02:16:03.000-0600 And another one, this time identical as before, with a crash in file.c as shown in the top. (the one from yesterday looked a little different, so I thought this might be relevant to note) By: Tilghman Lesher (tilghman) 2009-11-27 09:27:18.000-0600 Given that there have been several fixes made in this code, we really would prefer if you ran the most recent release or SVN 1.4, as the changes made may already fix the issues identified here. By: Richard Odekerken (rgj) 2009-11-27 09:31:51.000-0600 We _are_ running the most recent release, 1.4.27. By: Tilghman Lesher (tilghman) 2009-11-27 10:05:42.000-0600 Oops. In that case, I recommend that you follow the instructions in doc/valgrind.txt. By: Richard Odekerken (rgj) 2009-11-28 09:01:41.000-0600 As described in ASTERISK-14566, that is not an option for us since valgrind imposes too much load on the system. By: Tilghman Lesher (tilghman) 2009-11-29 22:57:36.000-0600 Given the crash, valgrind is required for us to proceed any further with this issue. Can you reproduce this with a non-production or semi-production system? By: Leif Madsen (lmadsen) 2009-12-21 10:14:15.000-0600 I'm suspending this issue right now as valgrind data is required in order to move this issue forward. If the reporter is able to reproduce on a test system and produce the valgrind data required to move this issue forward, they are welcome to reopen this issue. Thanks! |