Summary:ASTERISK-20375: Asterisk channel reference leak when attempting to transfer a call originated to a local channel running the Echo application
Reporter:Mark Michelson (mmichelson)Labels:
Date Opened:2012-09-06 17:01:09Date Closed:2012-09-13 13:43:27
Status:Closed/CompleteComponents:Channels/chan_local Channels/chan_sip/Interoperability
Versions: Frequency of
Environment:Attachments:( 0) ami.pcap
( 1) asteriskcliwithwarnings.txt
( 2) gdbthreadapplyall_non_optimized.txt
( 3) ngrepami.txt
( 4) sip.pcap
Description:While working on a separate issue, I managed to reliably create a situation that results in a channel reference leak.

Use the following extensions.conf:
exten => 11,1,Answer()
same => n,Echo()

exten => 2301,1,Dial(Local/11@default,,t)

1) From the CLI issue the following command:

originate SIP/999 extension 2301@default

where SIP/999 is a SIP phone.

2) Pick up SIP/999 when it starts ringing.
3) Press the configured attended transfer DTMF in features.conf. You will start to hear music on hold (odd).
4) Hang up SIP/999.
5) Issue a {{core show channels}} CLI command.

You will notice that SIP/999 is still around, as are the two Local/2301 channels. Every so often, a warning message will appear

*CLI> [Sep  6 16:37:27] WARNING[610]: chan_sip.c:3918 __sip_autodestruct: Autodestruct on dialog '54200e235d36cebb6182df3d0b9ddf5f@' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms

Yay, you have leaking channel refs!
Comments:By: David M. Lee (dlee) 2012-09-10 16:30:01.117-0500

Mark - I assume there should be a {{,1,}} between the {{2301}} and {{Dial}}. Right?

By: Mark Michelson (mmichelson) 2012-09-10 16:50:25.018-0500

Yes. That was a typo.

By: David Brillert (aragon) 2012-09-11 13:28:49.834-0500

I see the Autodestruct warnings also and can reproduce with a *2 DTMF transfer.
I'll be ready and waiting to test any patch.

By: David M. Lee (dlee) 2012-09-11 15:38:00.918-0500

Fixed typo

By: Mark Michelson (mmichelson) 2012-09-11 18:01:01.835-0500

I decided to look into the transferdigittimeout to see if that might be getting screwed up on my system. I applied the following diff on my system:

root@levin:/usr/src/asterisk/tags/ svn diff
Index: main/channel.c
--- main/channel.c    (revision 372336)
+++ main/channel.c    (working copy)
@@ -3544,6 +3544,7 @@

int ast_waitfordigit_full(struct ast_channel *c, int ms, int audiofd, int cmdfd)
+    int saved_ms = ms;
    /* Stop if we're a zombie or need a soft hangup */
    if (ast_test_flag(c, AST_FLAG_ZOMBIE) || ast_check_hangup(c))
        return -1;
@@ -3557,6 +3558,10 @@
        struct ast_channel *rchan;
        int outfd=-1;

+        if (saved_ms != ms) {
+            ast_log(LOG_NOTICE, "In waitfordigitfull, ms is %d\n", ms);
+            saved_ms = ms;
+        }
        errno = 0;
        rchan = ast_waitfor_nandfds(&c, 1, &cmdfd, (cmdfd > -1) ? 1 : 0, NULL, &outfd, &ms);
Index: main/features.c
--- main/features.c    (revision 372336)
+++ main/features.c    (working copy)
@@ -2509,7 +2509,9 @@

    /* this is specific of atxfer */
+    ast_log(LOG_NOTICE, "ast_app_dtget with transferdigittimeout of %d\n", transferdigittimeout);
    res = ast_app_dtget(transferer, transferer_real_context, xferto, sizeof(xferto), 100, transferdigittimeout);
+    ast_log(LOG_NOTICE, "ast_app_dtget returned with %d\n", res);
    if (res < 0) {  /* hangup or error, (would be 0 for invalid and 1 for valid) */
        return -1;
And here is the resulting output that I see on the CLI when I run my test:
*CLI> originate SIP/2000 extension 2301@default
*CLI> [Sep 11 17:25:44] NOTICE[12863]: features.c:2514 builtin_atxfer: ast_app_dtget with transferdigittimeout of 3000
[Sep 11 17:25:44] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2980
[Sep 11 17:26:08] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2978
[Sep 11 17:27:08] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2976
[Sep 11 17:35:09] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2974
[Sep 11 17:37:34] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2973
[Sep 11 17:37:38] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2972
[Sep 11 17:37:38] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2970
[Sep 11 17:37:41] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2967
[Sep 11 17:37:41] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2966
[Sep 11 17:41:09] NOTICE[12863]: channel.c:3562 ast_waitfordigit_full: In waitfordigitfull, ms is 2964
As you can see, the proper timeout is being passed into {{ast_app_dtget()}}, but ... stuff is clearly broken here. For whatever reason, after about 15 minutes, the number of milliseconds has only decreased by 36. Furthermore, it's difficult to see a definite pattern to when the number decreases. There are two other interesting data points to bring up.

1) The change to main/channel.c initially did not have the {{saved_ms}} variable. Instead, I printed the {{ms}} value on every iteration of the while loop. This resulted in a huge flood of messages, most of which did not show any sort of change to {{ms}}. I added the saved_ms variable so that I could see when the value of ms actually changed. So for whatever reason, the underlying {{poll()}} call is constantly waking up but the {{ms}} value is not decreasing as it should.

2) If I attach to the running Asterisk process with gdb, and then exit out of gdb, this results in {{ast_app_dtget()}} returning immediately. I suspect this is because the gdb session stopped the Asterisk process for longer than three seconds. When gdb exited, the SIGCONT (or whatever signal is used to get the process going again) woke up the poll(), which properly dropped {{ms}} down by the time that it was supposed to.

I'm starting to suspect that the problem could be due to whatever high-resolution timer is being used on my system. {{poll()}} returns after a sub-millisecond time frame, thus resulting in the {{ms}} value not decreasing as necessary. This problem could be due to how {{ast_waitfor_nandfds()}} handles its timeouts, or it could be a fault of the {{poll()}} system call itself.

I wonder if perhaps the timing module in use might be playing a role in this. I'm using the {{timerfd}} module myself. If you run {{timing test}} from the CLI, which module does it say is in use?

For reference, my operating environment is an Ubuntu 10.04 system running 2.6.32-38-generic kernel. gcc is version 4.4.3. No idea if that makes a difference here or not.

By: Mark Michelson (mmichelson) 2012-09-11 18:18:55.602-0500

Scratch the timing module theory. I switched over to the {{pthread}} module and the problem still occurs for me.

By: David Brillert (aragon) 2012-09-12 09:00:33.091-0500

Hi Mark,

My site is not using any *2 transfers but a lot of AMI.
I have been watching one problem site all morning.
I ran ngrep on the AMI connection and caught all data in the Asterisk CLI when I caught the Autodestruct warnings.

My hope is that this will give insight on how to reproduce the problem or root cause.

By: David Brillert (aragon) 2012-09-12 09:23:46.524-0500

ami.pcap and sip.pcap during Autodestruct

By: David Brillert (aragon) 2012-09-12 09:29:35.932-0500

This is the output of gdb attached to Asterisk on a NON_OPTIMIZED build but without DEBUG_THREADS option

By: David M. Lee (dlee) 2012-09-12 11:54:19.155-0500

I've narrowed it down to waiting for digits during a transfer. Instead of timing out as it should, under conditions we haven't been able to determine, it will essentially never time out. It will hold a reference to the channel forever while it waits for digits, which is why you eventually see the Autodestruct warnings.

aragon - If you have an IVR or some other application which waits for digits, and the caller hangs up instead of entering digits, this might be the culprit. It's just as likely, though, that something else is happening.

I'll post a patch as soon as it's ready.

By: David Brillert (aragon) 2012-09-12 12:48:03.824-0500

Thanks David,

I've reviewed the configuration and no IVR's are used on the affected site.
All incoming lines go directly to dynamic ACD queues.
Overflow rules for each queue go to a busy tone.

But almost all outgoing calls require a PIN to be entered before a call is placed.
Also I see the Autodestructs shortly after an agent login and an agent must also enter a login password.

I've tried to reproduce this on a local system but I cannot reproduce.

I look forward to the patch ;)

By: David M. Lee (dlee) 2012-09-12 12:49:15.738-0500

Added link to review.

By: David M. Lee (dlee) 2012-09-12 12:51:22.095-0500

Aragon - feel free to [try my patch|https://reviewboard.asterisk.org/r/2109/diff/raw/]. Good luck!

By: David Brillert (aragon) 2012-09-12 14:56:05.596-0500

David - I've put your patch into a very busy production system.
The system will be very busy tomorrow morning.
Fresh SVN checkout is patched with DONT_OPTIMIZE and includes:

 r372933 | mmichelson | 2012-09-12 09:53:35 -0500 (Wed, 12 Sep 2012) | 10 lines
 Add channel name to a warning to make debugging easier.
 The "autodestruct with owner in place" message is typically
 indicative of a channel reference leak. Printing out the name
 of the channel in the message may be helpful when trying to
 debug the issue.

By: David Brillert (aragon) 2012-09-14 08:05:38.583-0500

Hi David, your patch has been in production for 2 days and it didn't help me.
Various times of day the system is unsable; cannot call between extensions or transfer calls.
I think my problem is located in acd login that requires a password.
I opened a new ticket ASTERISK-20416 to report my problem.