Summary:ASTERISK-13626: Queue members associated with ZOMBIE channels (breaks wrapuptime and queue_log).
Reporter:Matt King, M.A. Oxon. (kebl0155)Labels:
Date Opened:2009-02-23 06:26:39.000-0600Date Closed:2011-07-26 14:11:17
Versions:Frequency of
Environment:Attachments:( 0) agentconnectfull.txt
( 1) agentconnectSIP.txt
Description:In some circumstances, queue members become associated with a ZOMBIE channel during the answer process.

When the ZOMBIE channel is hung up under normal clearing as part of the answer process, an AgentComplete event is generated, and the wrapuptime counter is set, even though the agent is still on a call.  The queue_log entry is also written at this time with just 1 second of talk time.

No further AgentComplete event is generated when the call comes to an end.  The wrapuptime is then counted from the start of the call, not the end of the call.


This seems to happen whenever multiple levels of Local channels and/or bridging are used to connect agent and caller.  Here's an example dialplan that will reproduce this error.

exten => 1,1,Answer
exten => 1,2,Dial(Local/plainae@test)

exten => 12345,1,Dial(SIP/12345)

exten => plainae,1,Queue(PlainAE)

We have one agent as a member of the queue:

queue add member Local/12345@main to PlainAE

When a call is placed from SIP/12344 to extension 1, a local channel dials the queue.  The queue dials a second local channel to reach the agent, listening on SIP/12345.  When the agent answers, both local channels are masqueraded out.

When the agent is called, an AgentCalled event is generated:

Event: AgentCalled
Privilege: agent,all
Queue: PlainAE
AgentCalled: Local/12345@main
AgentName: Local/12345@main
ChannelCalling: Local/plainae@test-c6d0;2
DestinationChannel: Local/12345@main-cc4e;1

This is as it should be.  When the call is answered, an AgentConnect event is generated:

Event: AgentConnect
Privilege: agent,all
Queue: PlainAE
Uniqueid: 1235323799.8
Channel: Local/12345@main-cc4e;1
Member: Local/12345@main
MemberName: Local/12345@main

This is also as it should be.

As the local channels are masqueraded out, a complex sequence of Bridge, Masquerade and Rename events are generated (see attached log).  

Having had a look at the Manager stream of events, it appears that the Rename events are being followed, but not the Masquerade or Bridge events.  The agent *should* be left as associated with the SIP channel for the agent's phone (SIP/12345-0825c248 in this case).  Instead, it's eventually associated with Local/plainae@test-c6d0;1<ZOMBIE>, which is one of the leftover channels from the bridge/masquerade sequence.

So, when this channel is hung up as part of the normal clearing process, the following event is generated:

Event: AgentComplete
Privilege: agent,all
Queue: PlainAE
Uniqueid: 1235323799.8
Channel: Local/plainae@test-c6d0;1<ZOMBIE>
Member: Local/12345@main
MemberName: Local/12345@main
HoldTime: 4
TalkTime: 1
Reason: caller

This is coming way too early - we shouldn't get one of these until either the agent or caller hangs up the phone.  The wrapuptime counter is also set at this time (instead of the end of the call as it should be).  The call is also written as complete in the queue_log, with just one second of talk time.

Finally, when the call ends, we get two Hangup events:

Event: Hangup
Privilege: call,all
Channel: SIP/12345-0825c248
Uniqueid: 1235323799.7
CallerIDNum: 12345
CallerIDName: <unknown>
Cause: 16
Cause-txt: Normal Clearing

Event: Hangup
Privilege: call,all
Channel: SIP/12344-08242b40
Uniqueid: 1235323799.6
CallerIDNum: 12344
CallerIDName: 08450045413
Cause: 16
Cause-txt: Normal Clearing

This is when the AgentComplete event should be fired off, however there is no such event because the queue system thinks the call has already ended.

I have attached a full capture of the Manager events associated with this broken process.

This happens with both the 1.6 and 1.4 release groups (I haven't checked 1.2).

A quick fix would be to prevent the system from following Rename events where the Newname: field ends with <MASQ>.  This would leave the queue member associated with the correct channel at the end of the bridging process.

I did take a look at the app_queue code, but couldn't work out where to start with implementing this (sorry!)...
Comments:By: Matt King, M.A. Oxon. (kebl0155) 2009-03-03 11:23:08.000-0600

There is now a $150 bounty for a fix of this bug.


By: Matt King, M.A. Oxon. (kebl0155) 2009-03-03 11:40:42.000-0600

...This bug can be worked around using /n at the end of the Local channel dial, however we can't use this in production as SIP reinvite won't work in this case.

Also, the bug persists even if the agent is a straight SIP dial (not a local channel) - please see attached new Manager log.

By: Mark Michelson (mmichelson) 2009-03-03 16:22:15.000-0600

Allow me to explain what is happening here. The cause of this is that the incoming call to the queue is on a Local channel. When the channel is answered by a queue member, the Local channel optimizes itself out of the equation.  The queue application sees this as a hangup of the incoming channel and thus treats the situation as the end of a call.

Fixing this in the code is not going to be an easy thing at all. In fact, I would say that anyone who takes on this bounty is going to run into lots of trouble when attempting to fix this in the code. As you have found, using the /n flag at the end of a Local channel will cause the aforementioned optimization not to occur. Another possible fix you could explore is to avoid using a Local channel as the incoming channel for the queue if it can be done. I realize that what you have posted is pretty much a toy example which illustrates the point, so asking for such dialplan changes in your actual dialplan will probably be more difficult. An alternative to dialing a local channel would be to use Goto's in the dialplan if possible.

By: Matt King, M.A. Oxon. (kebl0155) 2009-03-04 11:03:54.000-0600

Hi there,

Thanks for your post.  My feeling is that the AgentComplete should not be sent until the agent is no longer speaking to the caller, regardless of what's happening internally with the Local channels.

In answer to your question, we're sending the calls to the queue using the Originate command, so I believe we have to use Local channels for this.

I can see from the Manager trace that the channel the Queue system associates with the agent end of the call is changing as the Rename events are sent - do you know which bit of the codebase makes this change?

Thanks again...

By: Mark Michelson (mmichelson) 2009-03-04 11:38:42.000-0600

Yes, if you're using Originate, that implicitly uses a Local channel, so working around the problem in ways I have described is not exactly an option.

The problem is that app_queue sends the AgentComplete event because to the queue app, it appears that the bridge between the caller and the queue member has been broken because the incoming channel into the queue has been hung up. This hang up is due to the Local channel optimization that is going on once the Local channel becomes bridged with the queue member's channel. Part of the optimization that occurs is a channel masquerade (ast_do_masquerade in main/channel.c). It is during the channel masquerade that the Rename events occur.

By: jmls (jmls) 2009-10-27 06:57:43

so, what are we going to do with this ? Leave it open, close it as not fixable, or fix it ;)

By: Matt King, M.A. Oxon. (kebl0155) 2009-10-27 07:25:48

Go on fix it... You know you want to... ;-)


By: Atis Lezdins (atis) 2009-10-28 09:10:38

You could always follow UniqueID as workaround. That works perfectly for me with similar scenarios.

By: Matt King, M.A. Oxon. (kebl0155) 2009-10-29 12:40:29

Thanks for the helpful suggestion, but we're not having trouble following the Manager stream.  

It's the AgentComplete, queue_log and wrapuptime functions that are broken...

By: Matt King, M.A. Oxon. (kebl0155) 2009-12-21 08:42:18.000-0600

This is also broken for just regular dials to queue when Local Channels are used - we get two channels hanging up on answer:

*CLI>     -- SIP/1002-090f25d8 answered Local/2002@main2-8d35;2
   -- Local/2002@main2-8d35;1 answered SIP/1001-090efcf8
[Dec 21 14:23:20] NOTICE[3522]: app_queue.c:3491 try_calling: Delaying member connect for 2 seconds
[Dec 21 14:23:21] NOTICE[3523]: app_dial.c:1963 dial_exec_full: PEER context: outbound; PEER exten: ;  PEER priority: 1
   -- Executing [h@macro-agent-dial:1] NoOp("Local/2002@main2-8d35;1<ZOMBIE>", "DialHangup Local/2002@main2-8d35;1<ZOMBIE> CALLINGAGENT: AGENT:Local/2002@main2 INTERFACE: DIALSTATUS: ANSWEREDTIME:") in new stack
   -- Executing [h@macro-agent-dial:2] GotoIf("Local/2002@main2-8d35;1<ZOMBIE>", "6?h-ignore,1") in new stack
   -- Goto (macro-agent-dial,h-ignore,1)
 == Spawn extension (macro-agent-dial, dial, 5) exited non-zero on 'Local/2002@main2-8d35;2' in macro 'agent-dial'
 == Spawn extension (macro-agent-dial, dial, 5) exited non-zero on 'Local/2002@main2-8d35;2'
   -- Executing [h@macro-agent-dial:1] NoOp("Local/2002@main2-8d35;2", "DialHangup Local/2002@main2-8d35;2 CALLINGAGENT: AGENT:Local/2002@main2 INTERFACE: DIALSTATUS:ANSWER ANSWEREDTIME:1") in new stack
   -- Executing [h@macro-agent-dial:2] GotoIf("Local/2002@main2-8d35;2", "0?h-ignore,1") in new stack
   -- Executing [h@macro-agent-dial:3] GotoIf("Local/2002@main2-8d35;2", "1?h-ignore,1") in new stack
   -- Goto (macro-agent-dial,h-ignore,1)
   -- Stopped music on hold on SIP/1001-090efcf8

The second of these is problematic, as it is indistinguishable from the agent 'double-picking' the call.

This means we can't use the DEVICE_STATE function to reliably set the device state for the Local Channels, which in turn means they cannot be used as a replacement for Agent Channels (because agent's phones are rung even if they are in use) - so what are we supposed to do instead now that Agent Channels are gone?

By: Leif Madsen (lmadsen) 2011-07-26 14:11:10.269-0500

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions