ASTERISK-16406: Asterisk crashing in ast_readaudio

[Home]

Summary: ASTERISK-16406: Asterisk crashing in ast_readaudio_callback at file.c:762

Reporter: Jeff Hoppe (jhoppebugs) Labels:

Date Opened: 2010-07-22 12:43:01 Date Closed: 2010-12-06 13:45:39.000-0600

Priority: Critical Regression? No

Status: Closed/Complete Components: General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) backtrace.txt

Description:

Program terminated with signal 11, Segmentation fault.
#0 0x080d54db in ast_readaudio_callback (s=0xb358d5e8) at file.c:762
762 if (s->owner->timingfd > -1) {
#0 0x080d54db in ast_readaudio_callback (s=0xb358d5e8) at file.c:762
whennext = 160
__PRETTY_FUNCTION__ = "ast_readaudio_callback"
#1 0x080d56ef in ast_fsread_audio (data=0xb358d5e8) at file.c:789
fs = 0xb358d5e8
res = FSREAD_SUCCESS_SCHED
#2 0x08098a76 in __ast_read (chan=0xb59b72f8, dropaudio=0) at channel.c:2784
func = 0x80d56d8 <ast_fsread_audio>
data = 0xb358d5e8
res = AST_TIMING_EVENT_EXPIRED
f = 0x0
blah = -1255758064
prestate = 6
count = 0
cause = 0
__PRETTY_FUNCTION__ = "__ast_read"
#3 0x0809a539 in ast_read (chan=0xb59b72f8) at channel.c:3163
No locals.

Comments: By: David Woolley (davidw) 2010-07-23 04:46:52

We had a crash with a similar traceback, except that it went deeper, and actually crashed in ast_frdup. Unfortunately we'd accidentally stripped the main executable when creating an RPM, and also it wasn't reproducible. We were assuming a delayed result of memory corruption, as our AMI application was complaining about receiving gibberish at about the same time.

In our case, the agent was an AgentLogin one and the actual agent was a local channel connected from a SIP channel. We believe the audio file was the beep file. The agent was configured for manual answer. Ours is a modified 1.6.1.0, although I don't think there are significant modifications in this area.

By: Paul Belanger (pabelanger) 2010-07-24 16:37:38

Upload a debug log (see below) and reproduce your crash please.

---
We require a complete debug log to help triage the issue.

This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue:

http://svn.digium.com/svn/asterisk/trunk/doc/HOWTO_collect_debug_information.txt
By: David Woolley (davidw) 2010-07-26 04:48:48

If this is the problem we had, it will be almost impossible to reproduce it (without having worked out what the underlying problem is). It took several months for the first one to occur, and we have only had one. I may have a debug log available, but unlike the original reporter, I don't have a proper backtrace. I didn't report it at the time because I knew we didn't meet the reporting requirements.

There is nothing unusual in our debug log, it just shows a call being offered to the agent and then the restart.

Incidentally, the boiler plate is incorrect; you do fix non-reproducible problems if a detailed mechanism is supplied, and especially if a patch is provided.

We are taking steps to ensure we get a better backtrace next time, but it is quite likely we will not see this problem again. Although I suspect there is a memory overwrite involved, we are not going to be able to run valgrind on a customer system for months on end, and probably not at all, both for perforamnce and configuration management reasons.

Unfortunately Asterisk is highly dependent on locking and omissions in that area can produce faults that are extremely timing critical.
By: David Woolley (davidw) 2010-07-26 05:22:39

In our case, these are what I believe to be the relevant lines in the full log and the short form backtrace. I'm not convinced it is worth my time to remove sensitive data from the complete debug log given that I cannot give a backtrace with full debugging information.

Inline because very short:

[2010-07-19 09:01:58.570] VERBOSE[22896] chan_agent.c: -- agent_call, call to agent '015' call on 'Local/015@BTSAgentLogins-ae09;2'
[2010-07-19 09:01:58.571] VERBOSE[22896] file.c: -- <Local/015@BTSAgentLogins-ae09;2> Playing 'beep.slin' (language 'en')

Next line only to establish the last timestamp before the crash:
[2010-07-19 09:01:59.044] VERBOSE[22913] chan_sip.c: Reliably Transmitting (no NAT) to 10.17.50.23:5060:

[2010-07-19 09:02:03.474] VERBOSE[22937] logger.c: Asterisk Event Logger Started /var/log/asterisk/event_log

#0 0x080e2c1c in ast_frdup ()
#1 0x08095c55 in __ast_queue_frame ()
#2 0x0809602c in ast_queue_frame ()
#3 0x002b593b in local_queue_frame (p=0x97620f0, isoutbound=1, f=0x97a67dc, us=0x94ed058, us_locked=1)
at chan_local.c:245
#4 0x002b61a2 in local_write (ast=0x94ed058, f=0x97a67dc) at chan_local.c:364
ASTERISK-1 0x0809d3c7 in ast_write ()
ASTERISK-2 0x080df647 in ast_readaudio_callback ()
ASTERISK-3 0x080df88c in ast_fsread_audio ()
ASTERISK-4 0x0809a308 in __ast_read ()
ASTERISK-5 0x0809bc9b in ast_read ()
ASTERISK-6 0x003c2192 in agent_ack_sleep (data=0x94e7cf0) at chan_agent.c:953
ASTERISK-7 0x003c83d7 in login_exec (chan=0x94ed058, data=0xb7425e88) at chan_agent.c:2122
ASTERISK-8 0x0810e0ae in pbx_exec ()
ASTERISK-9 0x08116014 in pbx_extension_helper ()
ASTERISK-10 0x08117543 in ast_spawn_extension ()
ASTERISK-11 0x08117c0f in __ast_pbx_run ()
ASTERISK-12 0x08119117 in pbx_thread ()
ASTERISK-13 0x081720e9 in dummy_start ()
ASTERISK-14 0x00afb49b in start_thread () from /lib/libpthread.so.0
ASTERISK-15 0x00a5242e in clone () from /lib/libc.so.6
By: Leif Madsen (lmadsen) 2010-07-26 10:51:53

davidw: thanks for the additional information! Hopefully that can help move this issue slightly more forward :)
By: Leif Madsen (lmadsen) 2010-07-29 14:25:12

This should be fixed in Asterisk 1.6.2.10. Please open a new issue if you continue to have issues.
By: Leif Madsen (lmadsen) 2010-07-30 06:15:17

2010-07-22 Leif Madsen <lmadsen@digium.com>

* Release Asterisk 1.6.2.10

* Included a fix for res_timing_pthread per the description below:

r278465 | russell | 2010-07-21 11:15:00 -0500 (Wed, 21 Jul 2010) | 41 lines

Use poll() instead of select() in res_timing_pthread to avoid stack corruption.
This code did not properly check FD_SETSIZE to ensure that it did not try to
select() on fds that were too large. Switching to poll() removes the limitation
on the maximum fd value.