Summary:ASTERISK-20914: Segfault when iLBC voice frame is interpolated in a jitter buffer due to codec_ilbc's improper manipulation of datalen
Reporter:John McEleney (jmce)Labels:
Date Opened:2013-01-10 06:49:57.000-0600Date Closed:2013-01-21 12:29:21.000-0600
Status:Closed/CompleteComponents:Codecs/codec_ilbc Core/Jitterbuffer
Versions: 11.1.2 Frequency of
Environment:Two different systems. Asterisk on CentOS 6.3 (2.6.32-279.19.1.el6.x86_64) and Asterisk 11.1.2 on CentOS 6.3 (2.6.32-279.19.1.el6.i686). Each system has a Digium TE122 card installed. libpri-1.4.14 and dahdi 2.6.1 installed. We're using Cisco 7961 handsets set to use SIP over TCP.Attachments:( 0) ASTERISK-20914-1.8.diff
( 1) backtrace_1.8.10.1_2013-01-10T23_27_14+0000.txt
( 2) backtrace_1.8.19.1_2013-01-14T09_36_36+0000.txt
( 3) backtrace_2013-01-10T17_49_24+0000.txt
( 4) backtrace_2013-01-10T20_39_14+0000.txt
( 5) backtrace.txt
( 6) capture_ending_1.8.19.1_2013-01-14T09_36_45.pcap
Description:We've been getting a lot of Segfaults on both these versions of Asterisk, although my focus at the moment is on system. I should hopefully be able to provide backtraces from 11.1.2 later.

This is the error in syslog:
Jan 10 10:54:32 voip2 kernel: asterisk[4355]: segfault at 0 ip 00007fb8924c8122 sp 00007fb84440cf08 error 4 in libc-2.12.so[7fb89243f000+189000]

In addition to the PRI card, we also use a SIP trunk, which uses UDP. Many calls will be SIP(TCP) <=> Asterisk <=> SIP(UDP).

This problem occurs several times a day. Today it happened twice in the space of 5 minutes, but on other days it may happen between 2-10 times a day on a lightly loaded system. There tends to be no more than three concurrent calls, but the problem can occur when there is only one call. The system only ever Segfaults while someone is on a call.

Asterisk is built with compile flags DONT_OPTIMIZE, DEBUG_THREADS and MALLOC_DEBUG.

This is a backtrace:
(gdb) bt
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:191
#1  0x00000000004d333d in ast_frdup (f=0x7fb81400a428) at frame.c:537
#2  0x000000000046cad6 in __ast_queue_frame (chan=0x7fb8780b0ba0, fin=0x7fb81400a428, head=0, after=0x0) at channel.c:1483
#3  0x000000000046d125 in ast_queue_frame (chan=0x7fb8780b0ba0, fin=0x7fb81400a428) at channel.c:1555
#4  0x00007fb883b72335 in local_queue_frame (p=0x7fb8780a9ef0, isoutbound=0, f=0x7fb81400a428, us=0x7fb8780aad10, us_locked=1) at chan_local.c:447
#5  0x00007fb883b73192 in local_write (ast=0x7fb8780aad10, f=0x7fb81400a428) at chan_local.c:660
#6  0x0000000000477e0d in ast_write (chan=0x7fb8780aad10, fr=0x7fb81400a428) at channel.c:5109
#7  0x00000000004802d3 in ast_generic_bridge (c0=0x7fb81c008990, c1=0x7fb8780aad10, config=0x7fb84440f920, fo=0x7fb84440dcd8, rc=0x7fb84440dcd0)
   at channel.c:7260
#8  0x00000000004822a9 in ast_channel_bridge (c0=0x7fb81c008990, c1=0x7fb8780aad10, config=0x7fb84440f920, fo=0x7fb84440dcd8, rc=0x7fb84440dcd0)
   at channel.c:7618
#9  0x00000000004c0b23 in ast_bridge_call (chan=0x7fb81c008990, peer=0x7fb8780aad10, config=0x7fb84440f920) at features.c:4102
#10 0x00007fb852631dc6 in try_calling (qe=0x7fb84440fff0, options=0x7fb84440ff85 "", announceoverride=0x7fb84440ff87 "", url=0x7fb84440ff86 "",
   tries=0x7fb8444111e4, noption=0x7fb8444111e0, agi=0x7fb84440ff8c "", macro=0x7fb84440ff8d "", gosub=0x7fb84440ff8e "", ringing=0) at app_queue.c:5243
#11 0x00007fb852635a9c in queue_exec (chan=0x7fb81c008990, data=0x7fb844413460 "123,t,,,240,,,,,") at app_queue.c:6194
#12 0x0000000000508c8e in pbx_exec (c=0x7fb81c008990, app=0x2d50528, data=0x7fb844413460 "123,t,,,240,,,,,") at pbx.c:1447
#13 0x0000000000512ee2 in pbx_extension_helper (c=0x7fb81c008990, con=0x0, context=0x7fb81c008f00 "ext-queues", exten=0x7fb81c008f50 "123", priority=23,
   label=0x0, callerid=0x7fb878069078 "01638780352", action=E_SPAWN, found=0x7fb844415b58, combined_find_spawn=1) at pbx.c:4267
#14 0x0000000000514c01 in ast_spawn_extension (c=0x7fb81c008990, context=0x7fb81c008f00 "ext-queues", exten=0x7fb81c008f50 "123", priority=23,
   callerid=0x7fb878069078 "01638780352", found=0x7fb844415b58, combined_find_spawn=1) at pbx.c:4907
#15 0x00000000005156b7 in __ast_pbx_run (c=0x7fb81c008990, args=0x0) at pbx.c:5010
#16 0x0000000000517412 in pbx_thread (data=0x7fb81c008990) at pbx.c:5351
#17 0x000000000056ef88 in dummy_start (data=0x7fb81c004728) at utils.c:1012
#18 0x00007fb891839851 in start_thread (arg=0x7fb844416700) at pthread_create.c:301
#19 0x00007fb89252711d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Any help with this would be greatly appreciated.
Comments:By: John McEleney (jmce) 2013-01-10 07:04:04.431-0600

In addition to using outgoing SIP trunks, we also have a couple of incoming IAX trunks. One leg of the call is always SIP(TCP), but the second leg may be either IAX or SIP(UDP). I'm unsure if it makes any difference if it's a SIP or IAX trunk.

By: Matt Jordan (mjordan) 2013-01-10 08:09:33.888-0600

Two questions:
# Can you try to provide a backtrace with {{BETTER_BACKTRACES}} enabled as well?
# Since you have {{MALLOC_DEBUG}} enabled, are there any warnings or messages (other than the session started message) in the mmlog file?

By: John McEleney (jmce) 2013-01-10 08:21:50.404-0600

Unfortunately BETTER_BACKTRACES was not enabled at the time. This (attached) is the best I've got for now. I'll rebuild again shortly with this enabled.

By: John McEleney (jmce) 2013-01-10 08:28:01.590-0600

There's nothing else in /var/log/messages at the time of the segfault, and nothing out of the ordinary in /var/log/asterisk/full at that time either.

By: Matt Jordan (mjordan) 2013-01-10 08:47:27.651-0600

And nothing in {{mmlog}}? That particular file is only created when {{MALLOC_DEBUG}} is enabled and can indicate a possible memory corruption.

By: John McEleney (jmce) 2013-01-10 08:58:26.873-0600

Sorry , I'd not seen that file before. Right now, this is all that's in mmlog:
1357756969 - New session
1357759701 - New session
1357815284 - New session
1357815385 - New session
1357829420 - New session

Are the contents of that file persistent, or does it get regularly deleted? The particular file has Modify and Change times from just a few minutes ago when I restarted Asterisk.

By: John McEleney (jmce) 2013-01-10 12:32:20.770-0600

I'm pretty sure (but it might be a co-incidence), that I triggered the latest segfault (backtrace_2013-01-10T17_49_24+0000.txt). I can't trigger it every time though. I triggered it twice in a row, but then failed to do so twice more. This is what I did:

1. Establish an incoming call over an IAX trunk, picked up my a Cisco phone using SIP(TCP). (Call 1)
2. While that Call 1 was in progress establish a second incoming IAX call on a different IAX trunk, which is then forward out via a SIP(UDP) trunk to a cell phone. (Call 2)
3. Speak for a moment over Call 2 and then hang-up.
4. Speak for a moment more on Call 1 (which was still working at this point) and then hang-up.
5. Segfault!

I have no idea if any of the above is superfluous information, or how other combinations could trigger the bug. I'm just happy to have triggered it twice running.

By: John McEleney (jmce) 2013-01-10 12:36:20.652-0600

There was no further information written to mmlog.

By: Richard Mudgett (rmudgett) 2013-01-10 12:51:01.756-0600

The mmlog file is only appended to by Asterisk.  It is never deleted.

FYI: Asterisk v1.8.20.0 (at -rc2 status now) has an improved MALLOC_DEBUG that can detect more memory corruption problems.  It may help.

Some things that would help investigating the crash is to determine the channel names that are involved and what type of frame is causing the crash.  This information can be determined from the core file using gdb.  (Please do not try to attach a core file to the issue.  They are generally only usable on the machine that created them.)

Using the backtrace posted to the issue description as a guide:

1) The frame information can be obtained by printing the contents of stack frame 1 variable f:
#1 0x00000000004d333d in ast_frdup (f=0x7fb81400a428) at frame.c:537

2) The channel names can be obtained by printing the c0->name and c1->name variables in stack frame 7
#7 0x00000000004802d3 in ast_generic_bridge (c0=0x7fb81c008990, c1=0x7fb8780aad10, config=0x7fb84440f920, fo=0x7fb84440dcd8, rc=0x7fb84440dcd0)
at channel.c:7260

By: John McEleney (jmce) 2013-01-10 13:57:39.023-0600

(gdb) frame 1
#1  0x00000000004d3bad in ast_frdup (f=0x7f71340069a8) at frame.c:537
537 memcpy(out->data.ptr, f->data.ptr, out->datalen);
(gdb) p f
$1 = (const struct ast_frame *) 0x7f71340069a8
(gdb) frame 7
#7  0x0000000000480b43 in ast_generic_bridge (c0=0x7f7134017e90, c1=0x7f70fc059300, config=0x7f71476ae920, fo=0x7f71476accd8, rc=0x7f71476accd0) at channel.c:7260
7260 ast_write(other, f);
(gdb) p c0->name
$2 = (const ast_string_field) 0x7f713402939a "IAX2/gradwell-8095"
(gdb) p c1->name
$3 = (const ast_string_field) 0x7f70fc05e68c "Local/8105@from-queue-00000021;1"

IAX2/gradwell is the incoming IAX channel I referred to in "Call 1". Extension 8105 is the Cisco 7961 SIP handset that is set to use TCP.

By: John McEleney (jmce) 2013-01-10 14:51:26.213-0600

OK, I'm now running backtrace_2013-01-10T20_39_14+0000.txt (attached) was generated from a crash of with the same compiler options as before, using the same sequence of events.

Additional info:
(gdb) frame 1
#1  0x00000000004d5d61 in ast_frdup (f=0x7ff1140110f0) at frame.c:537
537 memcpy(out->data.ptr, f->data.ptr, out->datalen);
(gdb) p f
$1 = (const struct ast_frame *) 0x7ff1140110f0
(gdb) frame 7
#7  0x00000000004826c3 in ast_generic_bridge (c0=0x7ff114016798, c1=0x7ff12c6cf578, config=0x7ff13ca08920, fo=0x7ff13ca06cd8, rc=0x7ff13ca06cd0) at channel.c:7300
7300 ast_write(other, f);
(gdb) p c0->name
$2 = (const ast_string_field) 0x7ff11400cb16 "IAX2/gradwell-1818"
(gdb) p c1->name
$3 = (const ast_string_field) 0x7ff12c5b4036 "Local/8110@from-queue-0000003e;1"

One thing I should point out is that my had been patched for Cisco-BLF. This latest test with has no patches applied whatsoever.

By: John McEleney (jmce) 2013-01-10 14:55:12.392-0600

Forgot to mention. There is nothing new in mmlog.

By: Richard Mudgett (rmudgett) 2013-01-10 16:05:37.435-0600

Sorry.  I was wanting the contents of the frame pointed to by f.
p f
only gives the pointer value.
p *f
will give the contents of the frame.

I wanted to see what type of frame and if it was the same frame type each time.

By: John McEleney (jmce) 2013-01-10 16:17:05.299-0600

In case it's relevant, this is from the core file:
(gdb) p *f
$1 = {frametype = AST_FRAME_VOICE, subclass = {integer = 1024, codec = 1024}, datalen = 50, samples = 240, mallocd = 1, mallocd_hdr_len = 206, offset = 64,
 src = 0x7ff1140111a8 "IAX2 JB interpolation", data = {ptr = 0x0, uint32 = 0, pad = "\000\000\000\000\000\000\000"}, delivery = {tv_sec = 1357850352, tv_usec = 954158}, frame_list = {
   next = 0x0}, flags = 0, ts = 0, len = 0, seqno = 0}

By: Matt Jordan (mjordan) 2013-01-10 16:22:17.887-0600

Well, that's no good. A datalen of 50 and a NULL data pointer are not going to mesh well.

This appears to be an IAX2 issue, and I'm not aware of much that has changed in chan_iax2 over the past few versions. What version were you upgrading from?

By: John McEleney (jmce) 2013-01-10 16:39:49.691-0600

I'm afraid I can't really point to any "last known good" version, as I've been suffering low-level instability for over a year, and have always put that instability down to (possibly wrongly) chan_sccp_b. We're now making a fresh start with new hardware, a clean OS installation, without chan_sccp, and are sticking with stock Asterisk and chan_sip.

We previously ran CentOS 5 with Asterisk 1.6.x (not sure which revision). It could well have had the same bug, but we never properly investigated. That old system is offline at the moment, so I'm not in a position to carry out any tests on it.

By: John McEleney (jmce) 2013-01-10 17:34:12.121-0600

Just to get a handle on how old the bug might be, I installed With one inbound IAX->SIP call on the go, I didn't even get to make the second call before it segfaulted.

Core was generated by `/usr/sbin/asterisk -f -vvvg -c'.
Program terminated with signal 11, Segmentation fault.
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:191
191 movq  (%rsi), %rax
Missing separate debuginfos, use: debuginfo-install ncurses-libs-5.7-3.20090208.el6.x86_64 speex-1.2-0.12.rc1.1.el6.x86_64 sqlite-3.6.20-1.el6.x86_64
(gdb) bt
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:191
#1  0x00000000004ba980 in ast_frdup (f=0x7fef18000a20) at frame.c:528
#2  0x0000000000464e0a in __ast_queue_frame (chan=0x7fef700254e8, fin=<value optimized out>, head=0, after=0x0) at channel.c:1449
#3  0x00007fef8d4a49a8 in local_queue_frame (p=0x7fef700239b8, isoutbound=<value optimized out>, f=0x7fef18000a20, us=0x7fef70023b18, us_locked=<value optimized out>)
   at chan_local.c:438
#4  0x00007fef8d4a5854 in local_write (ast=0x7fef70023b18, f=0x7fef18000a20) at chan_local.c:608
#5  0x000000000046f949 in ast_write (chan=0x7fef70023b18, fr=0x7fef18000a20) at channel.c:5032
#6  0x0000000000475ab3 in ast_generic_bridge (c0=0x7fef24014078, c1=0x7fef70023b18, config=0x7fef41c460b0, fo=0x7fef41c444f0, rc=0x7fef41c444e8) at channel.c:7140
#7  0x0000000000476422 in ast_channel_bridge (c0=0x7fef24014078, c1=0x7fef70023b18, config=0x7fef41c460b0, fo=0x7fef41c444f0, rc=0x7fef41c444e8) at channel.c:7498
#8  0x00000000004a7346 in ast_bridge_call (chan=0x7fef24014078, peer=0x7fef70023b18, config=0x7fef41c460b0) at features.c:4017
#9  0x00007fef4fcd84d9 in try_calling (qe=0x7fef41c46340, options=<value optimized out>, announceoverride=<value optimized out>, url=0x7fef41c462e6 "", tries=<value optimized out>,
   noption=<value optimized out>, agi=0x7fef41c462ec "", macro=<value optimized out>, gosub=0x7fef41c462ee "", ringing=0) at app_queue.c:5036
#10 0x00007fef4fcdcf6d in queue_exec (chan=<value optimized out>, data=<value optimized out>) at app_queue.c:5980
#11 0x00000000004edb0b in pbx_exec (c=0x7fef24014078, app=0x2dbbc80, data=0x7fef41c49690 "123,t,,,240,,,,,") at pbx.c:1430
#12 0x00000000004fae2a in pbx_extension_helper (c=0x7fef24014078, con=0x0, context=0x7fef240145d0 "ext-queues", exten=0x7fef24014620 "123", priority=23,
   label=0x7fef240145d0 "ext-queues", callerid=0x7fef70001b60 "07748103635", action=E_SPAWN, found=0x7fef41c4bd0c, combined_find_spawn=1) at pbx.c:4249
#13 0x00000000004fd805 in ast_spawn_extension (c=0x7fef24014078, args=0x0) at pbx.c:4887
#14 __ast_pbx_run (c=0x7fef24014078, args=0x0) at pbx.c:4987
#15 0x00000000004ff1ab in pbx_thread (data=<value optimized out>) at pbx.c:5327
#16 0x000000000053853a in dummy_start (data=<value optimized out>) at utils.c:1004
#17 0x00007fef8eefe851 in start_thread (arg=0x7fef41c4c700) at pthread_create.c:301
#18 0x00007fef8fbec11d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) frame 1
#1  0x00000000004ba980 in ast_frdup (f=0x7fef18000a20) at frame.c:528
528 memcpy(out->data.ptr, f->data.ptr, out->datalen);
(gdb) p *f
$1 = {frametype = AST_FRAME_VOICE, subclass = {integer = 1024, codec = 1024}, datalen = 50, samples = 240, mallocd = 1, mallocd_hdr_len = 206, offset = 64,
 src = 0x7fef18000ad8 "IAX2 JB interpolation", data = {ptr = 0x0, uint32 = 0, pad = "\000\000\000\000\000\000\000"}, delivery = {tv_sec = 1357860433, tv_usec = 935182}, frame_list = {
   next = 0x0}, flags = 0, ts = 0, len = 0, seqno = 0}
(gdb) frame 6
#6  0x0000000000475ab3 in ast_generic_bridge (c0=0x7fef24014078, c1=0x7fef70023b18, config=0x7fef41c460b0, fo=0x7fef41c444f0, rc=0x7fef41c444e8) at channel.c:7140
7140 ast_write(other, f);
(gdb) p c0->name
$2 = (const ast_string_field) 0x7fef24005412 "IAX2/gradwell-1608"
(gdb) p c1->name
$3 = (const ast_string_field) 0x7fef70024cf2 "Local/8110@from-queue-f87a;1"

There's still nothing new in mmlog.

By: John McEleney (jmce) 2013-01-10 17:39:24.620-0600

Full backtrace for attached.

By: John McEleney (jmce) 2013-01-11 09:09:33.719-0600

We've had about 5 more segfaults today from this. Is there anything more you need from me to investigate this further?

By: John McEleney (jmce) 2013-01-14 04:38:18.857-0600

Over the weekend I simplfied our set-up a little by reducing us to just one IAX trunk, and using SIP for the second one. It turns out that all it takes is one IAX trunk to be in use to crash the system. The anatomy of the Segfault seems identical to the others.

I'm really very surprised at this. How can it be that on a freshly compiled Asterisk we're unable to maintain a single IAX trunk without getting segfaults time and time again?

I occurred to me that there might be something interesting in the payload, so I captured the latest IAX call with tcpdump using a snaplen of 110, and capturing all port 4569 traffic. This seems to have captured the whole packets. I'm attaching the tail end of that capture.

There's very obvious two-way audio traffic between the two server up until a certain point. At 1358156195.623043 the last packet to come from my server is sent. I assume that that is when Asterisk segfaulted, or more likely upon receipt of the very next packet (1358156195.636835) from the other end (Gradwell, an IAX trunk provider). Both of these timestamps coincide with the timestamp of the core file.

Just to be clear, is my Asterisk box, and is Gradwell. There is another host that sends a "POKE". This is another box that is configured to have an IAX trunk to my box, but that trunk is not configured at my end.

I'm attaching the backtrace as backtrace_1.8.19.1_2013-01-14T09_36_36+0000.txt.

By: John McEleney (jmce) 2013-01-14 04:47:07.433-0600

There's still nothing in mmlog. Further info:
(gdb) frame 1
#1  0x00000000004d3bad in ast_frdup (f=0x7f8bf400bc48) at frame.c:537
537 memcpy(out->data.ptr, f->data.ptr, out->datalen);
(gdb) p *f
$1 = {frametype = AST_FRAME_VOICE, subclass = {integer = 1024, codec = 1024}, datalen = 50, samples = 240, mallocd = 1,
 mallocd_hdr_len = 206, offset = 64, src = 0x7f8bf400bd00 "IAX2 JB interpolation", data = {ptr = 0x0, uint32 = 0,
   pad = "\000\000\000\000\000\000\000"}, delivery = {tv_sec = 1358156195, tv_usec = 657376}, frame_list = {next = 0x0}, flags = 0,
 ts = 0, len = 0, seqno = 0}
(gdb) frame 7
#7  0x0000000000480b43 in ast_generic_bridge (c0=0x7f8bec007c10, c1=0x7f8c440a3180, config=0x7f8c1882b920, fo=0x7f8c18829cd8,
   rc=0x7f8c18829cd0) at channel.c:7260
7260 ast_write(other, f);
(gdb) p c0->name
$2 = (const ast_string_field) 0x7f8bec00514e "IAX2/gradwell-1816"
(gdb) p c1->name
$3 = (const ast_string_field) 0x7f8c440a850e "Local/8105@from-queue-0000007a;1"

By: Matt Jordan (mjordan) 2013-01-14 09:13:49.743-0600

So, this is definitely strange. So far, in each of the frames that have caused a seg fault, the frame was allocated by chan_iax2 as a JitterBuffer interpolation frame. When chan_iax2 creates frames in this portion of code, there is not supposed to be any voice data. The code appears to be correct (and also hasn't changed for a very, very long time):

struct ast_frame af = { 0, };

/* create an interpolation frame */
af.frametype = AST_FRAME_VOICE;
af.subclass.codec = pvt->voiceformat;
af.samples  = frame.ms * (ast_format_rate(pvt->voiceformat) / 1000);
af.src  = "IAX2 JB interpolation";
af.delivery = ast_tvadd(pvt->rxcore, ast_samp2tv(next, 1000));

/* queue the frame:  For consistency, we would call __do_deliver here, but __do_deliver wants an iax_frame,
* which we'd need to malloc, and then it would free it.  That seems like a drag */
if (!ast_test_flag64(iaxs[callno], IAX_ALREADYGONE)) {
iax2_queue_frame(callno, &af);
/* iax2_queue_frame() could cause the call to disappear */
pvt = iaxs[callno];

After calling {{ast_queue_frame}}, the frame is duplicated using {{ast_frdup}}. This also appears to be correct, and doesn't appear to be incorrectly manipulating the datalen value (in fact, if it wasn't, we'd be getting seg faults before the bridging code attempts to process the frame).

So, something - somewhere - is probably using the samples value and setting the datalen without first checking whether or not there is any data present.

One thing that is suspicious is the codec type: iLBC. In {{codec_ilbc}}'s {{ilbctolin_framein}}, there is a place where it 'fakes out' the datalen if the datalen is currently 0:

if (f->datalen == 0) { /* native PLC, set fake f->datalen and clear plc_mode */
f->datalen = ILBC_FRAME_LEN;
f->samples = ILBC_SAMPLES;
plc_mode = 0; /* do native plc */
pvt->samples += ILBC_SAMPLES;

The fact that {{ILBC_FRAME_LEN}} is 50 - which is the datalen value in the frame when is explodes - is also suspicious.

Does this behavior occur if you aren't using the iLBC codec?

By: John McEleney (jmce) 2013-01-14 09:26:05.913-0600

Up until now, the codec is not something that I've varied. About 30 minutes ago I switched to ulaw and turned the jitter buffer off. No crashes since then, but that doesn't really answer any questions.

I can switch the JB back on after-hours and experiment with codecs then. However, it's always very hit-and-miss. Can you think of a more reliable way (than making a series of calls) of triggering bit of code to be run? Would dropping a few packets using iptables help?

By: Matt Jordan (mjordan) 2013-01-14 10:09:43.979-0600

It probably would - in general, you should get an interpolated frame whenever something attempts to obtain a frame from a jitter buffer and one isn't ready. Dropping packets should cause that.

To further aid (or hamper) your testing, I put together a patch that I *think* will fix the issue in codec_ilbc. It maintains the packet loss correction code that the original code block was attempting to force when data isn't present without actually modifying the underlying voice frame's datalen value.

By: John McEleney (jmce) 2013-01-14 10:18:07.335-0600

I turned ilbc and the JB back on.

A co-worker here made the following modification in the hope that it might help:
--- asterisk-      2013-01-02 18:50:08.000000000 +0000
+++ asterisk-      2013-01-14 15:28:44.260517356 +0000
@@ -4100,7 +4100,9 @@
               case JB_INTERP:
                       struct ast_frame af = { 0, };
+                       memset(&af, 0, sizeof(af));
                       /* create an interpolation frame */
                       af.frametype = AST_FRAME_VOICE;
                       af.subclass.codec = pvt->voiceformat;

After that, during testing I got this in my asterisk log:
[2013-01-14 16:01:07] DEBUG[3668] codec_ilbc.c: issue 16070, ILIB ERROR. data = NULL datalen = 50 src = IAX2 JB interpolation

That's the first occurrence of that message I've seen. The system did not segfault at that time, or since.

By: Matt Jordan (mjordan) 2013-01-14 11:35:13.789-0600

Your co-workers patch won't help, unless your version of gcc is borked 10 ways from Sunday. Declaring struct ast_frame af = {0, }; automatically initializes all struct members to 0, so memsetting it to 0 afterwards is redundant.

Reference C99 Standard

If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.

By: John McEleney (jmce) 2013-01-14 13:30:47.942-0600

My co-workers comment is as follows:
Asterisk doesn't run gcc with --std=gnu99 and AFAIK it defaults to the c90 standard which has substantial differences.
Looking at the C90 spec, although it also specified to treat remainder of aggregate as per those with static duration, it does not define behaviour for intialising static unions as per C99 spec. Thus according to spec there is no guarantee the data pointer would be set to NULL. However, the behaviour of gcc in its default mode might go beyond what the C90 spec says.
This is a quotation I found on the GCC site:
A new edition of the ISO C standard was published in 1999 as ISO/IEC 9899:1999, and
is commonly known as C99. GCC has incomplete support for this standard version; see
http://gcc.gnu.org/gcc-4.4/c99status.html for details. To select this standard, use
‘-std=c99’ or ‘-std=iso9899:1999’. (While in development, drafts of this standard version
were referred to as C9X.)

So, on the face of it what my co-worker has suggested is prudent, although it might turn out to be redundant if gcc exceeds the c90 specification.

By: John McEleney (jmce) 2013-01-14 13:36:13.617-0600

On the subject of the patch to ilbc, it's hard for me to say for sure, but it might have worked. All I can say is that after 30 minutes of testing I couldn't make it crash. The problem is that I've had periods of reliability in the past, ultimately followed by a segfault. Tomorrow will be the real test. If I can make it through a whole day without a crash that will be remarkable.

Thanks for your help today.

By: John McEleney (jmce) 2013-01-15 11:07:47.556-0600

Not a single crash since we applied your patch over 22 hours ago. We've been running with iLBC over IAX with jitter buffer enabled that whole time.

By: Matt Jordan (mjordan) 2013-01-15 12:40:42.989-0600

Good to hear. I think we can at least move this out of Triage :-)