[Home]

Summary:ASTERISK-14985: [patch] Seg Fault on Transfer with IAX/iLBC
Reporter:Darrin Henshaw (ayth)Labels:
Date Opened:2009-10-14 12:18:10Date Closed:2011-06-07 14:00:58
Priority:BlockerRegression?Yes
Status:Closed/CompleteComponents:Channels/chan_iax2
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) core_gdb.txt
( 1) gdb_asterisk_valgrind_core.txt
( 2) issue16070_debug.diff
( 3) Sip_Iax_Core_Debug.txt
( 4) valgrind_asterisk_log.log
Description:Hello,

We recently in our process of upgrading asterisk have run into serious problems. Our environment is as follows:

Asterisk clusters spread across three countries. IAX trunking is done between each office. Running CentOS 4.7 and Asterisk 1.4.26.

We have for the longest time used iLBC as the codec between the offices, and wish to continue, however, we are experiencing seg faults with it. See https://issues.asterisk.org/view.php?id=13116, as we are having the exact same issues, but the patch in this post is already in our code.

When using jitter buffer, with iLBC and Mixmonitor we have seg faults when a user tries to transfer the call to another party within the office. A example call is as follows:

1. Call comes into our system in one office, and is trunked via IAX to another office.
2. Call gets recorded using MixMonitor(with audiohook_inherit), and enters a queue.
3. Call is answered by one of our staff. If the call is completed with that one agent we are fine, however, if that agent needs to transfer it to another person Asterisk seg faults and dumps it's core.

I have run gdb against the core, and am posting it here, also before testing this I turned up SIP/IAX/Core debugging within the CLI. I've also attached that output.

We normally use a RPM version of Asterisk that we package ourselves, however, I'm able to reproduce this from source at will. If I remove the option for jitterbuffer, or not record the call, or change the codec I don't get the seg faults at all. However, we need them in place or else the phone quality goes down the tubes.
Comments:By: Darrin Henshaw (ayth) 2009-10-14 13:07:00

In testing I've been able to reproduce this with all recent 1.4 versions, 1.4.26.1, 1.4.26.2, and 1.4.27-rc1. Going to attempt 1.6 next, but it might require a rework of my test dialplan, that I use to generate the crashes.



By: Darrin Henshaw (ayth) 2009-10-15 07:35:31

Tried out 1.6.0.1 from source and with a little tweaking of my test diaplan(since 1.6 uses commas instead of the pipe character) I did not have the crash this time. My manager is having me look at svn now to see if we can find any info.

By: Darrin Henshaw (ayth) 2009-10-19 09:46:30

Anyone have a chance to review this?

By: Darrin Henshaw (ayth) 2009-10-23 09:56:51

I would like to see some movement on this if possible. I can duplicate the issue very simply, so if there is any specific debug info you want I can get it. I've checked and the ILBC source files have not changed from 1.6 and 1.4 at all, so that is not the problem.

It's either audiohook_inherit or jitterbuffer, one of them is doing something the other doesn't like and is causing a crash. Thanks.

By: Darrin Henshaw (ayth) 2009-10-23 14:00:20

Ok I've found the issue, but need definite guidance. Using an idea I got from aragon, I started working backwards. It turns out that 1.2.24 didn't give me problems. So I started checking out revisions of 1.4 until I had the crash and worked my way down.

This bug was introduced in revision 204012 of 1.4. Using this revision, trunking through IAX with ILBC codec, I can crash Asterisk at will, by performing a transfer.

Here is from the changelog:
2009-06-29 15:04 +0000 [r204012]  Mark Michelson <mmichelson@digium.com>

       * apps/app_mixmonitor.c: Place unlock of mutex in an else block so
         that it does not get unlocked twice. (closes issue ASTERISK-14378)
         Reported by: aragon

Here is a diff on the two app_mixmonitor.c files:
--- ./asterisk-204000/apps/app_mixmonitor.c     2009-10-23 13:40:21.000000000 -0400
+++ ./asterisk-204012/apps/app_mixmonitor.c     2009-10-23 14:03:27.000000000 -0400
@@ -35,7 +35,7 @@

#include "asterisk.h"

-ASTERISK_FILE_VERSION(__FILE__, "$Revision: 201423 $")
+ASTERISK_FILE_VERSION(__FILE__, "$Revision: 204012 $")

#include <stdlib.h>
#include <stdio.h>
@@ -273,8 +273,9 @@
                                       ast_writestream(*fs, cur);
                               }
                       }
+               } else {
+                       ast_mutex_unlock(&mixmonitor->mixmonitor_ds->lock);
               }
-               ast_mutex_unlock(&mixmonitor->mixmonitor_ds->lock);

               /* All done! free it. */
               ast_frame_free(fr, 0);

Can mmichelson possibly take a look since he coded the revision in question?

By: Darrin Henshaw (ayth) 2009-10-23 14:59:38

I've also found out that this bug shows up between 1.6.0.13 and 1.6.0.14.

By: Leif Madsen (lmadsen) 2009-10-23 15:32:45

Thanks for all the work you've done here! This should make it easier for a developer to solve once time permits. Currently the developer who made the change is out till next Tuesday, but hopefully he can take a look at it then.

By: Leif Madsen (lmadsen) 2009-10-23 16:38:16

Hey Mark, you committed the change that apparently caused this regression. If you have a chance to look at this, that'd be great, otherwise I'm sure it'll get taken care of by another developer. Thanks!

By: Darrin Henshaw (ayth) 2009-10-26 09:33:28

Okay I have to you, I made a mistake when going through the revisions. It turns out the bug was introduced in revision 201450 of 1.4, not what I previously stated.

I was running through so many revisions that I mixed myself up. Here is the info from the ChangeLog:

2009-06-17 19:59 +0000 [r201450]  Mark Michelson <mmichelson@digium.com>

* main/channel.c: Change the datastore traversal in
 ast_do_masquerade to use a safe list traversal. It is possible
 for datastore fixup functions to remove the datastore from the
 list and free it. In particular, the queue_transfer_fixup in
 app_queue does this. While I don't yet know of this causing any
 crashes, it certainly could. Found while discussing a separate
 issue with Brian Degenhardt.



By: Darrin Henshaw (ayth) 2009-10-26 09:50:56

Also, on suggestion from my manager. I took the channel.c from 1.4.25's tarball and copied it into the sources for 1.4.26.2, recompiled and reinstalled. No crashes at all.

Started to review the code and I'm way out of my depth not a lot was changed in that revision, but I can do any testing that is needed. Thanks.

By: Darrin Henshaw (ayth) 2009-10-26 13:27:17

Uploaded a log file from valgrind, I ran asterisk within valgrind and caused it to crash. I also have a core file from valgrind, but it's too large to upload. Let me know if needed.

By: Darrin Henshaw (ayth) 2009-10-26 13:30:14

Added gdb output from the valgrind core I mentioned earlier.

By: David Vossel (dvossel) 2009-11-04 18:00:24.000-0600

I just uploaded a debug patch.   This patch should prevent the system from crashing and display a LOG_WARNING message instead.  If you can, please apply the patch and provide log output of this message when it occurs.  The message should contain the source of the frame which is causing this mess to begin with.  Thanks!



By: Darrin Henshaw (ayth) 2009-11-04 20:55:38.000-0600

Thanks dvossel. I'll give this a try in the morning and let you know.

By: Darrin Henshaw (ayth) 2009-11-05 09:29:23.000-0600

Ok, so I've installed 1.4.26.2(from source, to rule out any rpm stuff), with the patch applied, and it doesn't crash, which is exactly what dvossel said, but it also doesn't provide the output it should either. I turned the verbosity way up and still nothing.

I then went back to pristine 1.4.26.2 sources, and it crashes perfectly(or not perfectly depending on your thoughts). With no change of configuration or anything if I switch between the two versions, I can crash on the unpatched version. But it's kind of weird to not get the log message I was expecting.

By: David Vossel (dvossel) 2009-11-05 09:44:33.000-0600

In logger.conf, do you have "warning" listed in the console option, or where ever you are directing output?  If it isn't crashing, that statement should be getting hit.

my logger.conf's console option looks something like this.
console => notice, warning, error, debug

By: Darrin Henshaw (ayth) 2009-11-05 10:08:38.000-0600

my logger.conf is as such:

console => notice,warning,error
messages => notice,warning,error,verbose

I thought something was up, as looking at the patch the log_warning should be run as you mention.

By: David Vossel (dvossel) 2009-11-05 13:42:04.000-0600

I have no idea why you aren't seeing the message.  I updated the patch, its kind of a laughable approach, but I don't really know what else to try.

By: Darrin Henshaw (ayth) 2009-11-05 14:43:56.000-0600

Thanks, I'll review this again and try to see what's going on.

By: David Vossel (dvossel) 2009-11-16 10:38:00.000-0600

any update on this issue now?

By: Leif Madsen (lmadsen) 2009-11-18 10:37:35.000-0600

Just pinging the reporter for some feedback. Thanks!

By: David Vossel (dvossel) 2009-11-30 10:10:05.000-0600

This issue's symptom was fixed, but the cause is still unknown.  I'm closing this issue due to inactivity from the reporter.