Summary: ASTERISK-17148: [patch] SIP REFER transfers do not work
Reporter: Kirill Katsnelson (kkm)
Labels:
Date Opened: 2010-12-21 22:24:00.000-0600
Date Closed: 2011-01-19 08:17:16.000-0600
Versions:
Frequency of
Environment:
Attachments: ( 0) 18516-kkm-maybefix-1.patch
Description: SIP blind transfer does not work: the transferee does not enter the transfer context.

****** STEPS TO REPRODUCE ******

1. Set global variable TRANSFER_CONTEXT=from-xfer
2. Set the context from-xfer to print anything:
 context from-xfer {
   _[!-z]. => Verbose(1,Transfer to ${EXTEN});
 }
3. Establish a call between device A and device B, and make B send a REFER to transfer A to C.

The Verbose() call is never reached.


This is likely not related to deadlock 18403

After some debugging, I now understand exactly why the error happens, but have only a vague idea of how to fix it. Help appreciated.

1. REFER is handled in handle_request_refer() on B's channel. Line 22223 calls ast_async_goto() on A's channel to transfer it to the transfer context. Since chan->pbx exists, ast_async_goto() calls ast_explicit_goto(), and sets AST_SOFTHANGUP_ASYNCGOTO into A's chan->_softhangup.

2. On another thread, __ast_read() is called by generic_bridge(). In line 3622, a control hangup frame is enqueued:
  if (ast_check_hangup(chan)) {
    ast_queue_control(chan, AST_CONTROL_HANGUP);

In the same function later, the frame is dequeued, and set to indicate hangup, line 3748:

if (f->frametype == AST_FRAME_CONTROL && f->subclass.integer == AST_CONTROL_HANGUP) {
 . . .
 f = NULL;

and later, line 4077,

 chan->_softhangup |= AST_SOFTHANGUP_DEV;

3. Next, on A's PBX service thread, in __ast_pbx_run() line 4724, the following condition is supposed to break out of the loop and begin processing of the extension previously set by ast_explicit_goto():

 } else if (c->_softhangup == AST_SOFTHANGUP_ASYNCGOTO) {
   c->_softhangup = 0;

but that does not happen, because _softhangup is now set to AST_SOFTHANGUP_DEV|AST_SOFTHANGUP_ASYNCGOTO. The equality test fails, so the while loop is not broken, and the PBX service ends on the next iteration because ast_check_hangup() sees a non-zero _softhangup.
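
The failure mode described in steps 1 to 3 can be reproduced in isolation. The following is only a minimal sketch: the flag names, their values, and `struct demo_chan` are hypothetical stand-ins chosen for illustration, not the definitions from Asterisk's headers, and the two "threads" are modelled as two sequential assignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the soft hangup bits; values are illustrative. */
enum {
    SOFTHANGUP_DEV       = 1 << 0,
    SOFTHANGUP_ASYNCGOTO = 1 << 1
};

/* Hypothetical, stripped-down channel holding only the flag word. */
struct demo_chan {
    int softhangup;
};

/* Equality-style check, as quoted from __ast_pbx_run() in the report:
 * it matches only when ASYNCGOTO is the sole bit set. */
static bool asyncgoto_pending_by_equality(const struct demo_chan *c)
{
    assert(c != NULL);
    return c->softhangup == SOFTHANGUP_ASYNCGOTO;
}

/* Models the order of events: the REFER handler sets ASYNCGOTO first,
 * then the read path on the bridge thread adds DEV. */
static int simulate_race(void)
{
    struct demo_chan a = { 0 };
    a.softhangup |= SOFTHANGUP_ASYNCGOTO; /* step 1: async goto requested */
    a.softhangup |= SOFTHANGUP_DEV;       /* step 2: hangup frame dequeued */
    return a.softhangup;
}
```

With both bits set, `asyncgoto_pending_by_equality()` returns false, which corresponds to the PBX loop missing the pending goto and treating the channel as hung up.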

It is a little surprising how that is supposed to work. I cannot be the only one out there hitting the race condition at this point.

As for fixing it, I need advice on the right way to do that:
1. Most radical: ignore AST_SOFTHANGUP_ASYNCGOTO bit when testing _softhangup in ast_check_hangup(). Perhaps too radical?
2. Ignore AST_SOFTHANGUP_ASYNCGOTO only in __ast_read when calling ast_check_hangup().
3. Other?
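
For discussion, option 2 might look roughly like the sketch below. It is not the attached patch and not Asterisk code: `hangup_requested()` and the flag names are hypothetical, and the real ast_check_hangup() may check conditions other than the flag word, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the soft hangup bits; values are illustrative. */
enum {
    F_DEV       = 1 << 0,
    F_ASYNCGOTO = 1 << 1
};

/* Hypothetical helper: reports a hangup request while ignoring the bits
 * in ignore_mask. Passing 0 gives the plain "any bit set" behaviour. */
static bool hangup_requested(int softhangup, int ignore_mask)
{
    assert((ignore_mask & ~(F_DEV | F_ASYNCGOTO)) == 0);
    return (softhangup & ~ignore_mask) != 0;
}
```

Under this sketch the read path would call `hangup_requested(flags, F_ASYNCGOTO)`, so a pending async goto alone would not enqueue a hangup frame, while every other caller would pass 0 and keep the existing behaviour.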
Comments:

By: Kirill Katsnelson (kkm) 2010-12-21 23:39:56.000-0600

I see the following fix in the trunk's pbx.c:4828

295867   rmudgett  } else if (c->_softhangup & AST_SOFTHANGUP_ASYNCGOTO) {
295867   rmudgett    c->_softhangup &= ~AST_SOFTHANGUP_ASYNCGOTO;

which svn praise traces back to the following changeset:

$ svn log -r295867
r295867 | rmudgett | 2010-11-22 11:42:02 -0800 (Mon, 22 Nov 2010) | 67 lines

Merged revisions 295866 via svnmerge from

 r295866 | rmudgett | 2010-11-22 13:36:10 -0600 (Mon, 22 Nov 2010) | 60 lines

 Merged revisions 295843 via svnmerge from

   r295843 | rmudgett | 2010-11-22 13:28:23 -0600 (Mon, 22 Nov 2010) | 53 lines

   Merged revisions 295790 via svnmerge from

     r295790 | rmudgett | 2010-11-22 12:46:26 -0600 (Mon, 22 Nov 2010) | 46 lines

     The channel redirect function (CLI or AMI) hangs up the call instead of redirecting the call

     To recreate the problem:
     1) Party A calls Party B
     2) Invoke CLI "channel redirect" command to redirect channel call leg
     associated with A.
     3) All associated channels are hung up.

     Note that if the CLI command were done on the channel call leg associated
     with B it works.

     This regression was a result of the fix for issue ASTERISK-15731

     The regression affects all features that use an async goto to execute the
     dialplan because of an external event: Channel redirect, AMI redirect, SIP
     REFER, and FAX detection.

     The struct ast_channel._softhangup code is a mess.  The variable is used
     for several purposes that do not necessarily result in the call being hung
     up.  I have added doxygen comments to describe how the various _softhangup
     bits are used.  I have corrected all the places where the variable was
     tested in a non-bit oriented manner.

     The primary fix is the new AST_CONTROL_END_OF_Q frame.  It acts as a weak
     hangup request so the soft hangup requests that do not normally result in
     a hangup do not hangup.

     JIRA SWP-2470
     JIRA SWP-2489

     (closes issue ASTERISK-16838)
     Reported by: SantaFox
     (closes issue ASTERISK-16847)
     Reported by: kwemheuer
     (closes issue ASTERISK-16873)
     Reported by: zahir_koradia
     (closes issue ASTERISK-16891)
     Reported by: vmarrone
     (closes issue ASTERISK-16950)
     Reported by: mbrevda
     (closes issue ASTERISK-16972)
     Reported by: nerbos

     Review:   https://reviewboard.asterisk.org/r/1013/


From looking at it, this is supposed to fix my issue as well. And that's in 1.8.2:

2010-11-22 19:36 +0000 [r295866]  Richard Mudgett <rmudgett@digium.com>
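
The bit-oriented test quoted from trunk at the top of this comment can be illustrated on its own. This is a sketch under assumptions: `consume_asyncgoto()` and the flag names are hypothetical, and the values are illustrative rather than taken from Asterisk's headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the soft hangup bits; values are illustrative. */
enum {
    FLAG_DEV       = 1 << 0,
    FLAG_ASYNCGOTO = 1 << 1
};

/* Tests the ASYNCGOTO bit with a mask and clears only that bit,
 * mirroring the shape of the two trunk lines quoted above.
 * Returns true if an async goto was pending. */
static bool consume_asyncgoto(int *softhangup)
{
    assert(softhangup != NULL);
    if (*softhangup & FLAG_ASYNCGOTO) {
        *softhangup &= ~FLAG_ASYNCGOTO;
        return true;
    }
    return false;
}
```

Unlike the equality test, this detects the pending goto even when another bit is set at the same time. Note that any other bit is left in place, which is consistent with the changeset log naming the AST_CONTROL_END_OF_Q frame, not the bit test alone, as the primary fix.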

By: Kirill Katsnelson (kkm) 2010-12-21 23:56:22.000-0600

And that changeset indeed fixes the reported issue.

By: Kirill Katsnelson (kkm) 2010-12-22 01:15:44.000-0600

And it is definitely the same issue as ASTERISK-16847. D-oh!

By: John Hass (john8675309) 2010-12-23 15:31:33.000-0600

Even after this patch, some call transfers work and others do not. I can do the redirect 10 times and it will work perfectly, but sometimes it will hang up on the 11th, and sometimes it will hang up on the first.

By: Kirill Katsnelson (kkm) 2010-12-23 22:45:49.000-0600

john8675309: could you please check if the attached patch 18516-kkm-maybefix-1.patch fixes your problem?

By: John Hass (john8675309) 2010-12-24 10:30:41.000-0600

kkm: yes, the kkm patch stops it from hanging up; however, now when doing a redirect with ExtraChannel:, the ExtraChannel is hung up on. I did a clean install with just the kkm patch.

By: Kirill Katsnelson (kkm) 2010-12-24 14:59:28.000-0600

john8675309: It looks like there are more problems to it, and I am just an Asterisk user like you, trying to come up with immediate fixes. The patch I attached is how I fixed the problem in this ticket, before deciding to go with the "official" changeset from the 1.8.2 branch.

I suggest you open a new ticket and give a reproduction for your problem, stating which fixes you tried and what they changed. Your scenario of use is clearly different than mine.

Try also 1.8.2: it is already in rc1, might be stable enough to use.

By: Leif Madsen (lmadsen) 2011-01-04 14:16:55.000-0600

So this issue can be closed as resolved then?

By: Kirill Katsnelson (kkm) 2011-01-04 14:20:26.000-0600

Yes please. The change from the 1.8.2 branch fixed it completely for me.

Also the attached "patch" may be confusing. It was a very temporary experimental fix which is incorrect. Maybe it is better to delete it, so people who search the tracker won't be confused? Up to you, anyhow.

By: gb_delti (gb_delti) 2011-01-06 07:32:24.000-0600

I have the same issue here on an Asterisk system. I could provide log files. Do I have to create a new issue for it, or will the patch fix it in the next 1.6.2.x version?

By: Malcolm Davenport (mdavenport) 2011-01-19 08:17:16.000-0600

The above-referenced patch didn't go into 1.6.2.x until