[Home]

Summary:ASTERISK-07849: Call drops on transfer when coming from queue/local channels (broken by r31520)
Reporter:Adam Goryachev (adamg)Labels:
Date Opened:2006-10-01 06:25:13Date Closed:2007-02-23 09:47:24.000-0600
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 1.2.7.1-transfer.txt
( 1) attended-transfer-sipdebug-nocolorcodes.txt
( 2) debug_agent_transfer_before_answer_1.2-r55868_logdebug.txt
( 3) full.txt
Description:Call arrives on a ZAP channel and is placed into a queue
The queue calls an agent (agentcallbacklogin) on a local channel which then calls a SIP channel (polycom IP300 software version 1.6.7.0098) which answers the call
The call is then transferred to another SIP phone using the polycom attended transfer feature, the destination phone answers and the two users can speak.
The agent then completes the transfer, and ALL channels drop including the incoming zap channel.

****** ADDITIONAL INFORMATION ******

This works perfectly if I change the version of asterisk back to 1.2.7.1 which we have been using for a long time. I keep trying to upgrade to fix (or at least test) other issues, but bigger problems like this keep us from upgrading.

Have tried various versions of software on the polycom phones starting with 1.5.2 which is what we used before, 1.6.6 and 1.6.7 which all have the same issue with 1.2.12.1 or newer SVN versions. All versions work fine when we revert asterisk back to the 1.2.7.1 version

If any more information is needed, please let me know, I'll try and be on irc as user webman for a while if you need anything....
Comments:By: Adam Goryachev (adamg) 2006-10-01 07:32:06

It seems that this situation works ok in asterisk-1.2.8 but is broken in asterisk-1.2.9 and 1.2.10 and 1.2.11 and 1.2.12 and current svn....

Any suggestions on how to narrow this down further or anything else I need to provide??

By: Adam Goryachev (adamg) 2006-10-01 09:07:04

OK, I've now tracked this down by svn revision, and it seems the problem is in r31520 which has this comment:
handle Zap transfers behind chan_agent properly so the agent doesn't get stuck waiting for the call to hang up

This looks like it is  meant to fix something for zap channels, but in this case, it breaks things for sip channels!!

By: Adam Goryachev (adamg) 2006-10-01 09:23:51

a really quick test with current svn release but manually removing the patch from svn release 31520 seems to work correctly.
This is pretty much as far as I can go with this, I have no idea about the code etc to try and fix this properly...

By: Martin Vit (festr) 2006-10-01 17:54:17

isnt this releated to ASTERISK-7682?

By: Adam Goryachev (adamg) 2006-10-01 23:10:26

I'm not sure if this is related to bug 7890, but in my case, none of the channels remain up, and there is no seg fault when doing a show channels.

Currently I've just reverted the patch from channels.c in the latest SVN version, and it seems to work OK (at least the described problem doesn't happen anymore).

Hopefully someone who understands more about what is actually happening in the code can take a look at this. I can't imagine I'm the only person having this problem, yet it seems to have been there for a very long time!

By: Leonardo Gomes Figueira (sabbathbh) 2006-10-02 09:27:43

I think it's related to 7890 but you get different results cause it's a different enviroment (Zap to SIP). On 7890 my enviroment is SIP to SIP.

I created another test enviroment like yours and got the same results as yours (hangup on all channels when Polycom transfer). On this test if I change the Agent to another phone the results are different:

Agent on Grandstream BT100: After transfer there is no audio on the channel.
Agent on Soyo G668: Nothing wrong after tranfer.

Tested on 1.2.12.1.

By: Serge Vecher (serge-v) 2006-10-02 10:25:23

alright, we need the output from the console, not the debug file. Please enable the debug for console in logger.conf and repost the output please.

By: jmls (jmls) 2006-11-02 12:17:20.000-0600

adamg, thanks very much for narrowing down the problem to a specific version. Will you be able to upload the console log as requested by serge-v ? Thanks

By: Mark B (markb) 2006-11-03 14:32:48.000-0600

Hi folks I'am running asterisk 1.2.13 w/polycom 601's - Same problem here.. I can confirm that the problem is only present with agentcallback on calls comming from the queue.

I have another queue with sip extension directly added to the queue, (without agent login) problem does not occur on those sets.

I think that this is ASTERISK-7682 and while debugging on my server, show channels caused the queues to completely lock up. (similar to that reported in ASTERISK-7261)..

my system also reports a 500 "Internal Server Error" from the polycom. As reported elsewhere.

By: Serge Vecher (serge-v) 2006-11-09 10:18:01.000-0600

Somebody still needs to produce a console debug log here ...

1) Prepare test environment (reduce the amount of unrelated traffic on the server);
2) Make sure your logger.conf has the following line:
  console => notice,warning,error,debug
3) restart Asterik.
4) Enable SIP transaction logging with the following CLI commands:
set debug 4
set verbose 4
sip debug
5) Save complete console log to file and _attach_ said file to the bug.
 i.e. asterisk -Tvvvvvdddddgc | tee /tmp/sipdebug.txt

By: Mark B (markb) 2006-11-09 11:36:51.000-0600

serge-v, Thanks for looking at this and for clarifying your requirements. My asterisk is crazy busy during the week, I will see if I can get you the logs on sunday, but I can't promise. (My wife's pregnant, was due on the third.. still waiting)..

By: Kevin McAllister (mcallist) 2006-11-10 08:21:30.000-0600

I can reproduce and will upgrade the debug log shortly.

By: Kevin McAllister (mcallist) 2006-11-10 08:42:32.000-0600

uploaded file, it still has all the console binary chars in it.  Not sure of the best way to strip them out.

Please let me know if this is enough or if you need more.

My debug file is from 1.2.13

Thanks

By: Kevin McAllister (mcallist) 2006-11-10 08:55:17.000-0600

I will redo, just noticed the asterisk CLI option to turn off colors in the console, that will make the file a little more readable.

By: Serge Vecher (serge-v) 2006-11-15 10:52:06.000-0600

mcallist: are you also able to get rid of this problem by downgrading to 1.2.7.1 as noted by adamg? If so, can you please include a SIP debug of a working call? Also, you don't have to include all of that startup information in your log ...

By: Chris B. (chrisb) 2006-11-17 04:46:37.000-0600

Calls always drop or sometimes you have audio holes? This issue is very close to the bug ASTERISK-7978 I reported.
Perhaps there's no relations but if I were you, I'd substitue mpg123 with rawplayer.

Thanks.

By: Kevin McAllister (mcallist) 2006-11-17 07:41:18.000-0600

serge-v: I will try 1.2.7.1 this morning and report the results.

ChrisB: calls always drop, immediately after the REFER is received.  Looking at a tcpdump trace there are BYE's all around.

By: Kevin McAllister (mcallist) 2006-11-17 08:18:01.000-0600

At 1.2.7.1 with the exact same configuration The transfer was successful, I have uploaded a debug log file.

By: sohosys (sohosys) 2006-11-30 08:23:22.000-0600

It has been nearly 2 weeks with no activity on this, the problem still exists. Same issue here, SIP all around, polycom phones, all channels receive BYE on completion of SIP transfer. There are circular relatinoships on this and two other bugs. Worked last in 1.2.7, related queue deadlock/segfault fixed in 1.2.13, but SIP transfer from agent (agentcallbacklogin) still fails. Has anyone found a fix?

By: Adam Goryachev (adamg) 2006-11-30 19:45:01.000-0600

Well, as I said exactly 2 months ago, the fix is to remove the patch from svn release 31520, and it 100% resolves the issue. What totally amazes me is that this major bug has made the stable release of asterisk totally un-usable since 1.2.7.1 which was released... a very long time ago - actually 13th April 2006 ... I guess that is a very long time in the asterisk world.

I don't understand why the above patch affects SIP calls in this way, but it seems strange that one of the bug watcher people can't stick this in front of the appropriate developer who wrote that patch and ask them to fix it.... It might be a difficult, or strange, bug to fix, but at this rate, it looks like it won't ever be fixed in asterisk 1.2.x. I haven't tested any of the 1.4.x versions yet, but for all I know, the problem may well be there as well.

Of course, I may be upset, and annoyed by all this, but really most of that is gone now, I've fixed the problem in the environments that I needed to, and so don't really care. I've provided as much information as I possibly can here, and there is really nothing more I can do to assist.

PS, given the tone of the above message, I thought I better add the obligatory "Asterisk is great", I love it, appreciate all the time and effort that all of the asterisk contributors have put into it, and the world would be a worse place without it.

By: sohosys (sohosys) 2006-11-30 21:09:05.000-0600

adamg,

have you been able to remove the patch from both 1.2.12 and 1.2.13? How exactly did you accomplish this; "the fix is to remove the patch from svn release 31520, and it 100% resolves the issue" My apologies, I could use a little hand holding if you have the time. As far as 1.4 goes, agentcallbacklogin is gone, so 1.4 certainly would not fix this issue!

By: Adam Goryachev (adamg) 2006-11-30 21:59:48.000-0600

I am using asterisk release SVN-branch-1.2-r43977M and as far as I can recall, I simply browsed the SVN repository http://svn.digium.com/view/asterisk/branches/1.2/channel.c?r1=27468&r2=31520

I simply deleted the section of code highlighted in green from the latest release at that time.

ie, just open the file with vi, find that exact section of code in your version, delete the few lines, and then everything works nicely.

Hope that helps.

Regards,
Adam

By: sohosys (sohosys) 2006-12-01 02:20:07.000-0600

looks like there was additional changes to the same function in 37361. Kevin Flemming was the author of both changes (31520 and 37361). The purpose appears to be to release the agent channel after transfer to fix an issue where the agent would should busy until the recipinet of the transfer hangs up. How did you handle the code changes in 37361 after you removed the addition from 31520?

I wonder if Kevin would be willing to give this a look and shed some light on what the correct fix would be?

By: sohosys (sohosys) 2006-12-01 02:28:07.000-0600

sent an email to kpflemming begging for a peek at this, hope he does not get upset :)

"Kevin,

This issue has been open for a couple of months, adamg has traced it to 1.2 revision 31520, which you authored. It would be greatly appreciated if you could find a minute to review the issue and shed some light on what you were fixing that resulted in broken SIP transfers from a callback agent.

I have concerns about the suggested fix (remove 31520) since there were further changes to the same code in 37361.

Thanks, and sorry about the direct email, just do not know where else to go.

Damon Estep (sohosys)"

By: sohosys (sohosys) 2006-12-01 11:15:31.000-0600

sent to the dev mail list on 12/1/2006

"I would like to offer a $300(US) bounty for a fix and subsequent merge for issue 8064 in 1.2.12.1 and 1.2.13.

If anyone else can add to the bounty please feel free.

I think that the source of the bug has been clearly identified on the bug tracker, including the revision and exact code that caused it. The reporters, including myself (sohosys), do not have enough experience with the code to provide a patch. We simply do not understand the reason that the code was changed in the first place so we can not be sure we do not regress. There are also other related bugs that must be considered, at one point in the related changes a deadlock in the queue app was experienced when a call was transferred and even a segfault after doing a ?show channels? after the failed transfer. The latter two issues do not seem to be present in 1.2.12.1 and 1.2.13.

Terms;

Programmer must review and understand the code added in revisions 31520 37361, the patch must resolve the new issue as well as not regress on the original need for the changes. A summary of how this was accomplished as well as the patch for at least 1.2.12.1 and 1.2.13 must be posted to the bug tracker. If the code requires a different patch for trunk it will also need to be posted.

Disclaimers must be filed, and the patch and notes must be posted on the bug tracker.

$150 will be paid via PayPal within 2 days of the delivery of a workable patch (as tested by us on several high volume production servers), and an addition $150 will be paid when (and if) the patch is merged into to code tree for the 1.2 branch, which will obviously require favorable peer review.

If this issue appears larger than a couple of hours work I welcome the feedback.

Damon Estep
damon@suburbanbroadband.net"

By: Olle Johansson (oej) 2006-12-01 12:01:55.000-0600

Moving this out of the SIP category...

By: Håkan Källberg (hk) 2006-12-04 06:06:15.000-0600

I just wanted to confirm that this has nothing with SIP or ZAP to do. We have had the same problem since we upgraded from 1.2.7.1. We use mISDN/chan_misdn and the problem is there regardless of SIP or IAX clients. We now have to use transfers as of features.conf, wich is less flexible, but works on 1.2.13. We were forced to upgrade due to the lock up problem in the queues - Bug: 0006626.

By: Serge Vecher (serge-v) 2006-12-04 09:46:29.000-0600

hk: does the solution outlined by adamg in 0052369 and confirmed by sohosys also work for your scenario with mISDN?

By: Håkan Källberg (hk) 2006-12-04 11:28:29.000-0600

Well, this is a production server - call center. I don't dare to experiment there,  but I might find another server where I could try...

By: sohosys (sohosys) 2006-12-12 15:05:43.000-0600

not sure what is going on here, 35120 went in with no reference to a documented bug, we can not get any information on what issue it fixed, and there are plenty or confirmations that it broke all types of queue transfers other than features.conf transfers. Would it not make sense to review the code for dependencies on the change and if none roll it back?

By: sohosys (sohosys) 2006-12-12 15:08:27.000-0600

adamg, still stable after removing 35120? what kind of volume? any ZAP channels on the system?

By: crich (crich) 2006-12-18 03:38:36.000-0600

I can confirm that this issue happens also with chan_misdn. It works on 1.2.7.1 but doesn't on 1.2.13 and also NOT on 1.2.14.

So it seems not to be chan_zap specific.

By: Adam Goryachev (adamg) 2006-12-18 04:03:03.000-0600

Yes, the installed system is working perfectly, with absolutely no issues. They get approx 200 calls per day.

Regards,
Adam

By: Leonardo Gomes Figueira (sabbathbh) 2006-12-28 06:24:51.000-0600

Followed the tip about removing 35120 rev changes from channel.c on 1.2.14 release and seems it really fixed the problem, got 2 days uptime and the system is stable using a lot of Queues/Agent Callback with Agents making SIP transfers.

Production system:

Asterisk 1.2.14 built by root @ pfdesenv on a i686 running Linux on 2006-12-26 11:29:00 UTC

System uptime: 2 days, 33 minutes, 11 seconds
Last reload: 1 day, 17 hours, 7 minutes, 19 seconds

1496 calls in 48 hours.

I was a little concerned about reverting this changes if they could broke something else... but it seems they are harmless. It's nice to be able to finally update from 1.2.7.1 to last release :)

The change was just this:

--- channel.c.orig      2006-12-20 17:56:27.000000000 -0200
+++ channel.c   2006-12-20 17:57:15.000000000 -0200
@@ -2803,14 +2803,6 @@
               ast_mutex_lock(&original->lock);
       }

-       /* each of these channels may be sitting behind a channel proxy (i.e. chan_agent)
-          and if so, we don't really want to masquerade it, but its proxy */
-       if (original->_bridge && (original->_bridge != ast_bridged_channel(original)))
-               final_orig = original->_bridge;
-
-       if (clone->_bridge && (clone->_bridge != ast_bridged_channel(clone)))
-               final_clone = clone->_bridge;
-
       if ((final_orig != original) || (final_clone != clone)) {
               ast_mutex_lock(&final_orig->lock);
               while(ast_mutex_trylock(&final_clone->lock)) {

By: Serge Vecher (serge-v) 2006-12-28 08:55:26.000-0600

josh: any chance of looking at this? This has been sitting around way too long :(

By: Rob M (nyt) 2006-12-29 04:17:53.000-0600

I can also confirm that removing 35120 rev changes from channel.c fixes this problem in 1.4.

Removing this also fixes another problem I was seeing while using agentcallbacklogin and autofill queues on a high volume server.  After a while no calls would go through and show channels would list a large amount of ghosted queue() entries.

By: kib (kibeki) 2007-01-04 08:56:03.000-0600

we changed to from 1.2.7.1 to 1.2.14 with removed lines in channel.c
after 3 hours we got core dumped:
#0  0x08068484 in ast_bridged_channel (chan=0xa190b58) at channel.c:3273
3273            if (bridged && bridged->tech->bridged_channel)
(gdb) bt
#0  0x08068484 in ast_bridged_channel (chan=0xa190b58) at channel.c:3273
#1  0x001a68f3 in process_sdp (p=0xa25dae0, req=0xb7cc4c00) at chan_sip.c:3770
#2  0x001ce548 in handle_request_invite (p=0xa25dae0, req=0xb7cc4c00, debug=0, ignore=0, seqno=2,
   sin=0xa20cf48, recount=0xa20cf48, e=0xa20cf48 "x??") at chan_sip.c:10583
#3  0x001d43f3 in handle_request (p=0xa25dae0, req=0xb7cc4c00, sin=0xb7cc4bf0, recount=0xa20cf48,
   nounlock=0xa20cf48) at chan_sip.c:11321
#4  0x001d57af in sipsock_read (id=0xa0b53f0, fd=14, events=1, ignore=0x0) at chan_sip.c:11464
ASTERISK-1  0x08055c1f in ast_io_wait (ioc=0xa0a1f68, howlong=169922376) at io.c:284
ASTERISK-2  0x001dbc61 in do_monitor (data=0x0) at chan_sip.c:11627
ASTERISK-3  0x00b3c371 in start_thread () from /lib/tls/libpthread.so.0
ASTERISK-4  0x00cbaffe in clone () from /lib/tls/libc.so.6

By: sohosys (sohosys) 2007-01-04 09:17:13.000-0600

This bug is not limited to 1.2.14, it is also present in 1.4. Do we need to report it again or can someone update this to reflect a 1.4 bug?

By: sohosys (sohosys) 2007-01-04 09:29:44.000-0600

kibeki, how are calls to your queues originated? are they SIP or Zaptel? what about the callback agents, are they SIP or something else? We have been running with no issues for a while with the code removed (12 agents on calls all day, tens of thousands of calls), and there was a note in 35120 that it was aimed at fixing a Zaptel issue.

By: kib (kibeki) 2007-01-04 09:46:23.000-0600

The calls are misdn or sip originated. The sip client 104 will login to the queue and is configured like member => Agent/1047.

By: Serge Vecher (serge-v) 2007-01-04 09:48:55.000-0600

sohosys: if the same solution (reverting r31520-related code) works for 1.4.0, then we don't need to change any tags, as the code needs to be fixed in 1.2.x first, then forward-ported to 1.4.x and trunk.

kibeki: how did you verify that removing r31520 is what causes the segfault? Is unpatched 1.2.14 not crashing?

By: kib (kibeki) 2007-01-04 10:28:27.000-0600

sorry, I didn't verify that because the system is a production system and if do not change the file channel.c the queue receptionst is not able to do transfer.
it is not reproducible when the server crashes.

By: Leonardo Gomes Figueira (sabbathbh) 2007-01-15 07:13:56.000-0600

kibeki, which lines did you remove exactly from channel.c ?

Your backtrace shows that the "if" instruction that segfaulted is on line 3273, on the original channel.c it's on line 3275 so it seems that you removed 2 lines from channel.c but you should remove 4 lines to revert the patch. Did you insert other lines so the lines were shifted just by 2 ? (Maybe comments ?)

On a side note, my server is running fine with the modifications for the past 20 days (the server was restarted 1 week ago to upgrade the UPS).

 Leonardo

By: Corentin Le Gall (clegall_proformatique) 2007-02-08 04:00:19.000-0600

I have carried out some tests on an Asterisk version 1.2.14, only with IAX2 and SIP channels (not ZAP). At least 2 agents are registered in a queue, and the 1st phone to be picked up attempts an attended transfer to a final one :
       incom.  agents  1stPU   finPU   -patch  +patch
       ---------------------------------------------------
       SIP     2SIP    SIP     SIP     KO      OK
       SIP     SIP+IAX SIP     IAX     KO      OK
       SIP     SIP+IAX IAX     SIP     OK      OK
       SIP     2IAX    IAX     IAX     OK      OK
       IAX2    2SIP    SIP     SIP     KO      OK
       IAX2    SIP+IAX SIP     IAX     KO      OK
       IAX2    SIP+IAX IAX     SIP     OK      OK
       IAX2    2IAX    IAX     IAX     OK      OK
Without any change ("-patch"), the connection closes every time the first to be picked up is a SIP phone (whatever the incoming and final channels are).
With the submitted patch below ("+patch"), the transfer is successful.

The reason for this patch arises from the observation that the ast_channel_masquerade() function is not symmetrical when dealing with an agent : the connection breaks down when the second argument to ast_channel_masquerade() is the one whose _bridge field is an agent.


Index: channels/chan_sip.c
===================================================================
--- channels/chan_sip.c (revision 53359)
+++ channels/chan_sip.c (working copy)
@@ -10816,7 +10816,7 @@
               if (!ignore) {
                       if (p->refer_call) {
                               ast_log(LOG_DEBUG,"202 Accepted (supervised)\n");
-                               attempt_transfer(p, p->refer_call);
+                               attempt_transfer(p->refer_call, p);
                               if (p->refer_call->owner)
                                       ast_mutex_unlock(&p->refer_call->owner->lock);
                               ast_mutex_unlock(&p->refer_call->lock);



By: Serge Vecher (serge-v) 2007-02-08 09:46:47.000-0600

leonardo, does proformatique's change work for you?

By: Leonardo Gomes Figueira (sabbathbh) 2007-02-08 10:33:59.000-0600

serge-v,

the channel.c modification already works for me. I've been using it for +40 days  and there are successful reports from other users too in this issue.

I can't test this modifications in chan_sip.c today but I will try to arrange time and test it tomorrow.

The question is: this modification in chan_sip.c does not break other SIP transfers ? This attempt_transfer() call is the same since Asterisk 1.0.X. If the bug started with channels.c r35120 and that commit is not associated with any bug report why don't just revert it ?

  Leonardo

By: Leonardo Gomes Figueira (sabbathbh) 2007-02-12 08:20:14.000-0600

serge-v,

the chan_sip.c modification (attempt_transfer) fix the Agent transfer when the transfer process is activated 'after' the destination answered the call but when the transfer is activated 'before' answer the inbound call is hungup and the transfer does not work.

Some SIP phones does not support transfer when Ringing (before answer). Tested with Polycom that supports it.

Tested with calls to queue from IAX2, SIP and Zap channels.

 Leonardo

By: Serge Vecher (serge-v) 2007-02-12 08:24:27.000-0600

clegall_proformatique: would you please file your change as a separate bug, please, so that it can be applied separately from the fix to this issue?

Update: actually, please post it in 8784, if you will, as there is a successful report of using it.



By: Joshua C. Colp (jcolp) 2007-02-16 19:24:49.000-0600

Please try 1.2 as of revision 55073, 1.4 as of revision 55086, or trunk as of revision 55087 and report back. Thanks!

By: Corentin Le Gall (clegall_proformatique) 2007-02-20 04:29:35.000-0600

Let's consider 3 given Asterisk revisions : A, B, C.
* The A revision handles the SIP transfers correctly, except in the case of a Zap channel with BRI stuff, where a seg fault occurs (the A revision would be any one prior to the r31520 revision).
* With the B revision (that would include r31520), no segfault occurs on the BRI stuff, however on an attended transfer between SIP channels, the connection is broken.
* The aim of the "C revision" I have sent recently was to handle both the failing SIP connection and the segfault on BRIstuffed hardware (which I haven't been able to test yet, actually).

I had not experienced the configuration explained by Sabbathbh/Leonardo, and
am unable to reproduce it. However I take it for granted that the patch I submitted fails, and therefore there might be no real need to submit it anywhere else.
I will inquire further on the 8064 issue.

By: Leonardo Gomes Figueira (sabbathbh) 2007-02-21 09:34:52.000-0600

file,

tested SVN-branch-1.2-r55868. Agent SIP transfer after the destination answer work fine. But transfer before answer does not work, all channels got hangup. This NOTICE shows up on console:

Feb 21 13:25:39 NOTICE[7323]: chan_sip.c:10498 attempt_transfer: Transfer attempted with no appropriate bridged calls to transfer

This kind of SIP transfer without Agent involved work fine.

 Leonardo

By: Serge Vecher (serge-v) 2007-02-21 10:14:08.000-0600

leonardo, please include a new sip verbosedebug.txt

By: Serge Vecher (serge-v) 2007-02-21 11:02:41.000-0600

Leonardo, thanks for getting back quickly, you've forgotten to set debug 4 and/or include debug for console in logger.conf (see note 0054200). Obrigado!

By: Leonardo Gomes Figueira (sabbathbh) 2007-02-21 11:32:03.000-0600

I forgot to enable debug on logger.conf on the file uploaded. Uploaded the right one now:

debug_agent_transfer_before_answer_1.2-r55868_logdebug.txt

By: Rob M (nyt) 2007-02-21 14:53:37.000-0600

I had been removing 35120 rev changes from channel.c using SVN-branch-1.4-r48435.   Everything was working properly.  My transfers didn't get dropped and my queues functioned properly (agentcallbacklogin queues).  I tried the newest SVN  Asterisk SVN-branch-1.4-r55834M and my queues were getting messed up with channels left in the Ring state about 8 hours after the server was running, but transfers seemed to work.  I had to revert back to my patched version of SVN-branch-1.4-r48435.  Incoming calls are coming from ZAP channels.

Sorry for the lack of information but its hard to debug on a production server in a NOC.  I don't see anything strange with core set verbose 4 when it happens, and when I do core show channels I see a bunch of stuff in the queue stuck in the Ring state.

By: Serge Vecher (serge-v) 2007-02-21 15:19:31.000-0600

thanks for report, the version you've reported (r55834M) suggests some mods to the code. If you patched Asterisk and then did svn update, that would preserve your modification. It's always recommended to delete patched repos and doing a clean check out. Anyway, there was another important update to chan_sip (I assume that's what you use). If you get another chance, please do a clean checkout and test the latest 1.4 svn. If the problem persists, please provide verbosedebug.txt as per instructions here, as that is what will ultimately help to track the issue down.

By: Rob M (nyt) 2007-02-21 15:58:57.000-0600

I have recompiled with a clean checkout.  I am waiting for some downtime in calls to restart the server with the new code.  I will post an update / sip log after some more testing.

By: Rob M (nyt) 2007-02-21 17:10:53.000-0600

tested with Asterisk SVN-branch-1.4-r55957

I messed up with the sipdebug log, ill try to get it up later.

looks like transfers work fine, but my queues die.

Channel              Location             State   Application(Data)
Zap/6-1              s@billing:1          Ring    Queue(billing|tnr|||45)
Zap/11-1             s@billing:1          Ring    Queue(billing|tnr|||45)
Zap/19-1             s@support:1          Ringing Queue(support|tnr|||90)
Zap/3-1              s@billing:1          Ringing Queue(billing|tnr|||45)
SIP/3029-082d7878    4122@internal-real:2 Ring    Dial(Agent/4122|45|twr)

(notice no sip/agent channels in the queue, normally looks like below)
Agent/4148           s@queue:1            Up      Bridged Call(Zap/13-1)
SIP/3008-082c1f48    s@queue:1            Up      (None)
Zap/13-1             s@support:1          Up      Queue(support|tnr|||90)


EDIT: transfers work fine, queues break.



By: Corentin Le Gall (clegall_proformatique) 2007-02-22 02:54:43.000-0600

I have been able to reproduce the "transfer before answer" feature on a Polycom, as described by sabbathbh.
I agree that the transfer fails, either with or without the patch I submitted into chan_sip.c, and  I see no obvious workaround apart from removing the "r31520" lines.
Did "sohosys" or anyone get an answer from Kevin Flemming about the reason why he added his feature in r31520 ? i.e. what would be an example why it is needed ?

Corentin

By: sohosys (sohosys) 2007-02-22 08:25:48.000-0600

I did not receive any information directly from Kevin regarding r31520, there were no useful comments in the bug tracker, no references to a bug corrected by the revision, and nobody has otherwise identified any issue that it corrected or that is created by its removal. Given the overwhelming evidence that there is a significant and repeatable bug with it, and the lack of any evidence that it is a needed patch, I am surprised that it is still in the "stable" release.

By: Serge Vecher (serge-v) 2007-02-22 09:25:18.000-0600

guys, I do not have an answer on the "r31520"-question, aside from "if kpf did it, it is for a good reason". While you are looking for an answer to that question, please also remember to help with the "mainstream" process of fixing the issue by providing updated debug logs for the latest 1.4 if the issue is not resolved by a developer's commit. Thanks!

By: sohosys (sohosys) 2007-02-22 09:58:53.000-0600

serge-v - I fully agree that if kpf did it there was a reason at the time, the problem is that the reason was not documented, which is a big no-no in the world of source code management, right? The assumption that r31520 does something useful needs to be validated; everyone makes mistakes once and a while. This issue has been open for almost 5 months, there are several very high volume users validating that the removal of the revision fixes the issue and does not cause other problems. The reason for r31520 is obviously obscure. We run about 150 concurrent SIP/PRI channels per Asterisk server (75 concurrent calls) during peak periods that span most of the business day and well into the evening, the calls are originated via a combination of PRI and SIP, and the UAs are a combination of Polycom phones and Linksys ATAs. We removed r31520 from 1.2.12.1 in early December and have not had a single issue as a result. We use callback agents, local channels, and queues extensively in our configuration. We also heavily rely on zaptel and libpri and use PRIs (kpf's forte) extensively. We could spend the rest of our lives trying to figure out what kpf?s intentions were, or he could answer the question (possible?), or we could revert it and keep an eye/ear open. Forgive my tone please, no disrespect towards anyone intended, I just think this is a classic case of believing a bad assumption without validation for too long.

By: Rob M (nyt) 2007-02-22 12:12:58.000-0600

Ok, I've tracked down the problem.  Both queues are full of agentcallbacklogin agents.

Caller calls in on zap channel to queue "support", support engineer picks up, sees caller needs to go to billing dept, blind transfers to the "billing" queue extension (4180).


[Feb 22 09:30:18]     -- outgoing agentcall, to agent '4121', on Local/3008@queue-6cde,1'
[Feb 22 09:30:18]     -- Executing [3008@queue:1] UserEvent("Local/3008@queue-6cde,2", "RECV|Agent: Agents NameEDITED|Num: 4121|CID1: 8315321346|CID2: ") in new stack
[Feb 22 09:30:18]     -- Executing [3008@queue:2] Dial("Local/3008@queue-6cde,2", "SIP/3008|15| twm") in new stack
[Feb 22 09:30:18]     -- Called 3008
[Feb 22 09:30:18]     -- Started music on hold, class 'default', on channel 'Local/3008@queue-6cde,2'
[Feb 22 09:30:18] Really destroying SIP dialog '1879a2f2650549216e983c4a2dc77218@192.168.1.1' Method: INVITE
[Feb 22 09:30:18]     -- SIP/3008-082ce0c0 is ringing
[Feb 22 09:30:25]     -- Call on SIP/3008-082ce0c0 left from hold
[Feb 22 09:30:25]     -- Stopped music on hold on Local/3008@queue-6cde,2
[Feb 22 09:30:25]     -- SIP/3008-082ce0c0 answered Local/3008@queue-6cde,2
[Feb 22 09:30:25]     -- Agent/4121 answered Zap/21-1
[Feb 22 09:30:25]   == Spawn extension (queue, 3008, 2) exited non-zero on 'Local/3008@queue-6cde,2'
[Feb 22 09:30:35]     -- Started music on hold, class 'default', on channel 'Zap/21-1'
[Feb 22 09:30:37]     -- Executing [4180@internal:1] Set("SIP/3008-082b9978", "CALLEDEXTEN=4180") in new stack
[Feb 22 09:30:37]     -- Executing [4180@internal:2] GotoIf("SIP/3008-082b9978", "1?10") in new stack
[Feb 22 09:30:37]     -- Goto (internal,4180,10)
[Feb 22 09:30:37]     -- Executing [4180@internal:10] Set("SIP/3008-082b9978", "CIDNUM=4121") in new stack
[Feb 22 09:30:37]     -- Executing [4180@internal:11] Set("SIP/3008-082b9978", "CALLERID(number)=4121") in new stack
[Feb 22 09:30:37]     -- Executing [4180@internal:12] Set("SIP/3008-082b9978", "CALLERID(name)=agents nameEDITED") in new stack
[Feb 22 09:30:37]     -- Executing [4180@internal:13] UserEvent("SIP/3008-082b9978", "DIAL|Agent: agents nameEDITED|Num: 4121|Dest: 4180") in new stack
[Feb 22 09:30:37]     -- Executing [4180@internal:14] Goto("SIP/3008-082b9978", "internal-real|4180|1") in new stack
[Feb 22 09:30:37]     -- Goto (internal-real,4180,1)
[Feb 22 09:30:37]     -- Executing [4180@internal-real:1] Goto("SIP/3008-082b9978", "billing|s|1") in new stack
[Feb 22 09:30:37]     -- Goto (billing,s,1)
[Feb 22 09:30:37]     -- Executing [s@billing:1] Queue("SIP/3008-082b9978", "billing|tnr|||45") in new stack
[Feb 22 09:30:37]     -- outgoing agentcall, to agent '4181', on 'Local/3029@queue-0e0f,1'
[Feb 22 09:30:37]     -- Executing [3029@queue:1] UserEvent("Local/3029@queue-0e0f,2", "RECV|Agent: Agents nameEDITED|Num: 4181|CID1: 4121|CID2: AgentsNAMEEDITED") in new stack
[Feb 22 09:30:37]     -- Executing [3029@queue:2] Dial("Local/3029@queue-0e0f,2", "SIP/3029|15| twm") in new stack
[Feb 22 09:30:37]     -- Called 3029
[Feb 22 09:30:37]     -- Started music on hold, class 'default', on channel 'Local/3029@queue-0e0f,2'
[Feb 22 09:30:37]     -- SIP/3029-082bd8e0 is ringing
[Feb 22 09:30:38] NOTICE[14979]: chan_sip.c:12768 attempt_transfer: SIP Transfer attempted with no appropriate bridged calls to transfer
[Feb 22 09:30:38]   == Spawn extension (billing, s, 1) exited non-zero on 'SIP/3008-082b9978'
[Feb 22 09:30:38]     -- Stopped music on hold on Local/3029@queue-0e0f,2
[Feb 22 09:30:38]   == Spawn extension (queue, 3029, 2) exited non-zero on 'Local/3029@queue-0e0f,2'


then my queues stopped functioning

I reproduced this twice within minutes.  I'm going to retest shortly with r31520 commented in the latest svn.


It seems the key to this is the error message:
[Feb 22 09:30:38] NOTICE[14979]: chan_sip.c:12768 attempt_transfer: SIP Transfer attempted with no appropriate bridged calls to transfer

It seems to be looking for another endpoint on the blind transfer when there isn't one.



By: Rob M (nyt) 2007-02-22 12:18:50.000-0600

It should also be noted that this bug is listed for 1.2, when it affects all branches.

By: Serge Vecher (serge-v) 2007-02-22 12:28:29.000-0600

nyt: see my note 0057205 for an explanation on the tags.

By: Joshua C. Colp (jcolp) 2007-02-22 12:46:51.000-0600

Okay folks... I've spent the last few days tracking this down and while removing the added code was not the answer, a minor change to it was. Please try 1.2 as of revision 56230.

By: Rob M (nyt) 2007-02-22 12:52:50.000-0600

serge, I have just tested with SVN-branch-1.4-r55957 the 35120 rev changes removed.

It looks like with all the new changes, this no longer fixes the problem (is this a 1.4 bug now?)


EDIT: my post was beaten by a few minutes, I will try the new source and report back.



By: Rob M (nyt) 2007-02-22 14:55:48.000-0600

1.4 revision 56230 fixes this issue. Hurray.

By: Serge Vecher (serge-v) 2007-02-22 15:39:46.000-0600

thanks for testing and replying back, nyt. Would like to get a confirmation from recent reporters, namely sabbathbh, loloski, clegall_proformatique, sohosys (anyone else) before closing this bug.

By: Rob M (nyt) 2007-02-23 00:25:52.000-0600

Just as a follow up, I've been running this revision without any issues all day and hundreds of calls later.  I even tried to recreate the bug with it.  Everything worked perfectly.

Thanks for the fix

By: Leonardo Gomes Figueira (sabbathbh) 2007-02-23 06:32:39.000-0600

serge-v,

Tested SVN-branch-1.2-r56279 and it' working fine. Tested both Agent transfers before and after answer. No problems detected.

Thank you file for this fix.

 Leonardo

By: Serge Vecher (serge-v) 2007-02-23 09:47:15.000-0600

thanks for testing reports and thanks to file for fixing this in 1.2 rev 56230, 1.4 r56231 and trunk 56323!