[Home]

Summary:ASTERISK-07299: [branch] GROUP_COUNT returns invalid result on calls from queue to agentcallback or Local channel
Reporter:Leonardo Gomes Figueira (sabbathbh)Labels:
Date Opened:2006-07-06 13:12:31Date Closed:2007-07-09 21:20:52
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_queue
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) debug_group_segfault_SVN-file-groupcountv2-r59373_gdb.txt
( 1) debug_groupcount_queue_agent_console.txt
( 2) debug_groupcount_queue_agent_log.txt
( 3) debug_groupcount_queue_configs.txt
( 4) debug_groupcount_queue_local_console.txt
( 5) debug_groupcount_queue_local_log.txt
Description:Tested on 1.2.7.1, 1.2.9.1 and SVN-branch-1.2-r37143.

GROUP_COUNT on calls from queue to Agent/ or Local/ channel sometimes returns invalid result (zero when there is one call on the group).

I'm trying to use GROUP_COUNT on queue to do not call a member that is already on another call not from queue.

It's not 100% reproduceable but in my tests I could say it's happens on >80% queue calls.

In most tests it happens on the second member that the queue tries to call but sometimes it happens later.

The situation is:

1- SIP/3052 dial SIP/3053
On the dialplan before Dial I have: Set(GROUP=3052) and Set(OUTBOUND_GROUP=3053)
2- The call is established and both SIP channels have their GROUP set.
3- SIP/3050 call myqueue that has Agent/3052 and Agent/3053 as members.
4- queue calls Agent/3052, on dialplan GROUP_COUNT(3052) returns 1 and goto Busy(). CORRECT behavior.
5- queue calls Agent/3053, on dialplan GROUP_COUNT(3053) returns 0 (should return 1) so the call goes through Dial/SIP/3053. WRONG behavior.
6- SIP/3053 rings until 15 seconds timeout (SIP/3053 has call waiting).
7- queue try calling Agent/3052 again, on dialplan GROUP_COUNT(3052) returns 1 and goto Busy(). CORRECT behavior.
8- queue try calling Agent/3053 again, on dialplan GROUP_COUNT(3053) now returns 1 correctly and goto Busy(). CORRECT behavior.
Step 7 and 8 repeat multiple times, GROUP_COUNT works fine again.

Sometimes it does not happens on the second call, but on the third or fourth call. Sometimes it does not happens at all.

On SIP to SIP calls GROUP_COUNT always returns correct value. This problem happens only on calls from queue.

I tried to switch from Agent/ channel to Local/ channel but the same problem happens.

Before every wrong GROUP_COUNT I get an "channel_find_locked: Avoided deadlock for '0x849b6f0', 10 retries" like:

Jul  6 14:44:09 WARNING[29731]: channel.c:787 channel_find_locked: Avoided deadlock for '0x849b6f0', 10 retries!

I will post configuration files and a simple dialplan used to reproduce the problem. All this tests were created on an Asterisk with no other calls.


****** ADDITIONAL INFORMATION ******

The simplified dialplan used to demonstrate this:

[queue_to_agent]
exten => _3XXX,1,NoOp(BEFORE SETGROUP: Group: ${EXTEN} Count: ${GROUP_COUNT(${EXTEN})})
exten => _3XXX,n,Set(GROUP=${EXTEN})
exten => _3XXX,n,NoOp(AFTER SETGROUP:  Group: ${GROUP} Count: ${GROUP_COUNT(${EXTEN})})
exten => _3XXX,n,GotoIf($[ ${GROUP_COUNT(${EXTEN})} > 1 ] ? 1000)
exten => _3XXX,n,Dial(SIP/${EXTEN}|15|T)
exten => _3XXX,n,Busy()

exten => _3XXX,1000,NoOp(Agent ${EXTEN} is busy. Count: ${GROUP_COUNT(${EXTEN})})
exten => _3XXX,n,Busy()

[sip_to_anywhere]
exten => _3XXX,1,NoOp(BEFORE SETGROUP: Group: ${ACCOUNTCODE} Count: ${GROUP_COUNT(${ACCOUNTCODE})})
exten => _3XXX,n,NoOp(BEFORE SETOUTBOUNDGROUP: Group: ${EXTEN} Count: ${GROUP_COUNT(${EXTEN})})
exten => _3XXX,n,Set(GROUP=${ACCOUNTCODE})
exten => _3XXX,n,Set(OUTBOUND_GROUP=${EXTEN})
exten => _3XXX,n,NoOp(AFTER SETGROUP:  Group: ${GROUP} Count: ${GROUP_COUNT(${GROUP})})
exten => _3XXX,n,GotoIf($[ ${GROUP_COUNT(${EXTEN})} > 0 ] ? 1000)
exten => _3XXX,n,Dial(SIP/${EXTEN}|15|tT)
exten => _3XXX,n,Hangup()

exten => _3XXX,1000,NoOp(SIP ${EXTEN} is busy. Count: ${GROUP_COUNT(${EXTEN})})
exten => _3XXX,n,Busy()

exten => 3097,1,Queue(myqueue|tr|||600)
exten => 3097,2,Hangup()

Comments:By: wes (whoiswes) 2006-07-10 14:54:33

Might this be related to http://bugs.digium.com/view.php?id=7278

I've been watching closely any bugs that deal with call limits and/or device status in app_queue, as I have been trying to get an effective call limiter on the queue since Decmber.

Another bug that might be worth a look (code was never included in trunk, to my knowledge):  http://bugs.digium.com/view.php?id=6315

This may be totally irrelevant, but I figure it can't hurt...

By: Serge Vecher (serge-v) 2006-07-10 15:20:36

whoiswes: thanks for the pointers. Patch for 6315 was included in trunk, but not 1.2 branch, since it's a new feature. In order to test it, you would have to try current svn trunk, which should be fairly stable and steps away from being 1.4 beta.

on 7278, it would be great if sabbathbh could test the patch there and report results.

By: wes (whoiswes) 2006-07-10 15:28:05

Ah, gotcha.  Well, 6315 doesn't ALWAYS work for me anyways, and since it's closed, I can't add any feedback.  On my test box, it works perfectly, but on the floor I can't seem to get it to function properly.

I'd be interested in the results of 7278 being applied here as well.  I have a couple of machines that are still using rrmemory for a strategy - I will see about possibly applying the patch to one of those.

Thanks for letting me know about 6315, I'm still learning how all of this works.

By: Leonardo Gomes Figueira (sabbathbh) 2006-07-11 07:36:35

I can't see the relation to 7278 but will test here anyway since you've asked to.

Althrough I have leastrecent as the queue policy on test scenario from the uploaded files, the problem with GROUPCOUNT happens with ringall policy too.

If 6315 works it must solve one of my problems here. In one PBX I could stop using Agent and use SIP channels directly. But there is another situation (other PBX) that I must use Agents.

By: Leonardo Gomes Figueira (sabbathbh) 2006-07-11 08:19:02

Tested 7278 patch with SVN-branch-1.2-r37361: same problem with GROUPCOUNT.

By: Serge Vecher (serge-v) 2006-07-11 08:54:16

sabbathbh: thanks for testing the patch. If at all possible, please see to the possibility of testing latest trunk; it should be stable enough at this point and this way you get to test 6315.

this is also for whoiswes -- if there are problems with trunk related to commit from 6315, please don't hesitate to open a new report mentioning the relationship to the original bug -- when you can narrow the bug down, that is ;)

By: Leonardo Gomes Figueira (sabbathbh) 2006-07-11 13:11:21

I'm testing 6315 with 1.2.7.1 in a test enviroment and it's working fine so far.

Tested this issue on trunk and got the same problem:

SIP/3052 bridged to SIP/3053 both with GROUP=305X

Call to queue:

   -- Executing [3097@sip_to_anywhere:1] Queue("SIP/3050-09be6c78", "myqueue|tr|||600") in new stack

   -- outgoing agentcall, to agent '3052', on 'Local/3052@queue_to_agent-ceae,1'
 == Spawn extension (queue_to_agent, 3053, 1001) exited non-zero on 'Local/3053@queue_to_agent-17e2,2'
   -- Executing [3052@queue_to_agent:1] NoOp("Local/3052@queue_to_agent-ceae,2", "BEFORE SETGROUP: Group: 3052 Count: 1") in new stack
   -- Executing [3052@queue_to_agent:2] Set("Local/3052@queue_to_agent-ceae,2", "GROUP=3052") in new stack
   -- Executing [3052@queue_to_agent:3] NoOp("Local/3052@queue_to_agent-ceae,2", "AFTER SETGROUP:  Group: 3052 Count: 2") in new stack
   -- Executing [3052@queue_to_agent:4] GotoIf("Local/3052@queue_to_agent-ceae,2", "1 ? 1000") in new stack
   -- Goto (queue_to_agent,3052,1000)
   -- Executing [3052@queue_to_agent:1000] NoOp("Local/3052@queue_to_agent-ceae,2", "Agent 3052 is busy. Count: 2") in new stack
   -- Executing [3052@queue_to_agent:1001] Busy("Local/3052@queue_to_agent-ceae,2", "") in new stack
   -- Agent/3052 is busy
   -- Nobody picked up in 0 ms
 == Spawn extension (queue_to_agent, 3052, 1001) exited non-zero on 'Local/3052@queue_to_agent-ceae,2'
pfdesenv*CLI>

First run ok, both GROUPCOUNT returned 1.

...

Second run ok, both GROUPCOUNT returned 1.

   -- outgoing agentcall, to agent '3052', on 'Local/3052@queue_to_agent-647e,1'
 == Spawn extension (queue_to_agent, 3053, 1001) exited non-zero on 'Local/3053@queue_to_agent-5331,2'
   -- Executing [3052@queue_to_agent:1] NoOp("Local/3052@queue_to_agent-647e,2", "BEFORE SETGROUP: Group: 3052 Count: 0") in new stack
   -- Executing [3052@queue_to_agent:2] Set("Local/3052@queue_to_agent-647e,2", "GROUP=3052") in new stack
   -- Executing [3052@queue_to_agent:3] NoOp("Local/3052@queue_to_agent-647e,2", "AFTER SETGROUP:  Group: 3052 Count: 1") in new stack
   -- Executing [3052@queue_to_agent:4] GotoIf("Local/3052@queue_to_agent-647e,2", "0 ? 1000") in new stack
   -- Executing [3052@queue_to_agent:5] Dial("Local/3052@queue_to_agent-647e,2", "SIP/3052|15|T") in new stack
   -- Called 3052
   -- SIP/3052-09b51c28 is ringing
   -- Agent/3052 is ringing
pfdesenv*CLI>

On the third run the problem occured, GROUPCOUNT of 3052 returned 0.

By: Leonardo Gomes Figueira (sabbathbh) 2006-07-11 13:13:11

Forgot to post the version:

Asterisk SVN-trunk-r37362 built by root @ pfdesenv on a i686 running Linux on 2006-07-11 15:37:25 UTC

By: Serge Vecher (serge-v) 2006-07-11 14:10:25

alright, so neither ASTERISK-7090 nor ASTERISK-6155 have impact on any issues. If you are able to get on IRC, I'd join #asterisk-dev list and see if bweschke has time to work on this with you.

By: stadsj (stadsj) 2006-08-31 09:27:13

I think this behaviour is correct.
When queue calls out to an agent it uses the Local/ channel on which you set up a group in your dialplan. However, upon answering and call bridging, this local channel disappears, even the call is still ongoing. This affects the group count.
Hope this explains "strange" GROUP_COUNT behaviour.



By: Leonardo Gomes Figueira (sabbathbh) 2006-08-31 10:28:11

stadsj,

no, it does not explain it.

The calls are not answered through the Local channel when the problem happens. Read the problem description again and you will understand it.

How do you explain this (from debug_groupcount_queue_agent_console.txt) ?

   -- Executing NoOp("Local/3052@queue_to_agent-6386,2", "BEFORE SETGROUP: Group: 3052 Count: 0") in new stack

... No call established here but then:

   -- Executing NoOp("Local/3052@queue_to_agent-0051,2", "BEFORE SETGROUP: Group: 3052 Count: 1") in new stack

The first GROUP_COUNT should return 1, not 0. The second is ok.

By: stadsj (stadsj) 2006-09-01 16:35:07

In the piece of your dialplan I cannot see where you set group.
When you set the group, you need to use the function GROUP(), not a variable GROUP.
In your dialplan you use Set(GROUP=${EXTEN}), which doesn't actually sets the group. Strange that you see a change in GOUP_COUNT() even sometimes...

By: Leif Madsen (lmadsen) 2006-09-01 21:49:04

Tonight I was working with the Local channel in Queue(), and have been trying to debug another issue I'm having with the 'n' flag, I noticed a couple of lines in the Local channel debug:

[Sep  1 20:12:34] DEBUG[23055]: chan_sip.c:2879 update_call_counter: Updating call counter for outgoing call
[Sep  1 20:12:34] DEBUG[23055]: chan_sip.c:2906 update_call_counter: user_one is not a local device, no call limit

Not sure if this is related or not, but figured I'd mention it anyways.

By: BJ Weschke (bweschke) 2006-09-03 12:28:06

stadsj is right here. GROUP() is quite different than GROUP in the container assignment when using Set in the dialplan. See UPGRADE.TXT for more info.

By: stadsj (stadsj) 2006-09-04 04:38:52

Well, strictly speaking, using GROUP() against the undocumented direct set of the channel variable GROUP is a preferrable option. But the problem persists when there is a need to count used channels by the outbound calls from queue app. As I indicaded earlier, queue employs local channel, the place where actually GROUP set can be used before the Dial app establishes an outbound leg of the call. Upon bridging that local channel is destroyed which makes GROUP settings disappear too... (The use if the /n option, looks not nice, why would we need to have an extra local channel in each call?)

I can say, that dealing with the GROUP channel variable may look like a hack, but I cannot see any other solution of the problem. In fact, it is possible to make this variable to be inherited by the child channel (__GROUP) and that way GROUP_COUNT may work. With the function GROUP() such trick is not possible, I think. If someone can propose any remedy using more canonical way, that would be very interesting.

The problem with using Set(__GROUP=foo) may occur during the time when the outbound call is not answered yet. Since __GROUP is now inherited, both, the local channel and the outbound channel would be counted. After bridging, though, only one channel would remain. So, still this solution is not perfect. Any othe idea?



By: Leonardo Gomes Figueira (sabbathbh) 2006-09-05 13:25:01

Changed from Set(GROUP=${EXTEN}) to Set(GROUP()=${EXTEN}) and got the same problem.

stadsj,

I think the problem here is not the GROUP variable of the Local channel created by queue since the call that defined the GROUP before the queue was a SIP to SIP call.

I'll try to explain again:

1- SIP A call SIP B
2- On the dialplan both legs get a GROUP() (SIP A GROUP A, SIP B GROUP B)
3- Now I have 2 bridged SIP channels, SIP-A with GROUP A and SIP-B with GROUP B
4- SIP C calls queue(myqueue)
5- myqueue try calling Agent B and on the dialplan GROUP_COUNT(B) returns 1 (CORRECT, SIP-B has GROUP B)
6- myqueue try calling Agent A and on the dialplan GROUP_COUNT(A) returns 0 (WRONG, the channel SIP-A has GROUP A defined, why does it returns 0 ???)
7- Repeat pass 5: CORRECT
8- Repeat pass 6: NOW THE RESULT IS CORRECT (Now GROUP_COUNT(A) returns 1, why ???)

I got a lot of "Avoided deadlock" when the wrong GROUP_COUNT() happens:

Sep  5 15:16:38 WARNING[25567]: channel.c:787 channel_find_locked: Avoided deadlock for '0x885f720', 10 retries!

I removed the Set(GROUP()) from the dialplan that the Local channel of the Agent  goes to be certain that there is no relation to variables created by Local channels:

 '_3XXX' =>        1. NoOp(Group: ${EXTEN} Count: ${GROUP_COUNT(${EXTEN})}) [pbx_config]
                   2. GotoIf($[ ${GROUP_COUNT(${EXTEN})} > 0 ] ? 1000) [pbx_config]
                   3. Dial(SIP/${EXTEN}|15|T)                    [pbx_config]
                   4. Busy()                                     [pbx_config]
                   1000. NoOp(Agent ${EXTEN} is busy. Count: ${GROUP_COUNT(${EXTEN})}) [pbx_config]
                   1001. Busy()                                  [pbx_config]

By: stadsj (stadsj) 2006-09-07 09:55:58

sabbathbh,
changing GROUP to GROUP() was not supposed to solve your problem, it was just a note. Moreover, it seems that setting variables rather than using a function may be a more flexible approach, as it allows to distinguish between GROUP and OUTBOUND_GROUP variables. I noticed that in all your debug info, the problem with GROUP_COUNT appeared when queue attempted to call to 3052. Hence, the problem is with the inbound leg of the SIP to SIP call. This may narrow the search to setting of the GROUP variable rather then OUTBOUND_GROUP.

By: jmls (jmls) 2006-11-01 10:36:51.000-0600

sabbathbh, any more comments on this ?

By: Leonardo Gomes Figueira (sabbathbh) 2006-11-01 15:15:11.000-0600

jmls,

do you need more info ?

Just tested on SVN-branch-1.2-r46803 and got the same results.

Leonardo

By: jmls (jmls) 2006-11-01 16:05:46.000-0600

I was just making sure that everyone knew where they were in this - thanks for following through and testing the latest version

By: Alex Richardson (alexrch) 2006-12-13 12:26:56.000-0600

I can confirm this problem. I can reproduce this problem in cca. 10 minutes in my test environment. Please let me know, what debugging info should I post, to give you guys clearer image of what's going on wrong.

And also, where do you think should one start debugging this problem? Is the problem in the GROUP() function or in the queue system?

By: Alex Richardson (alexrch) 2006-12-14 04:42:34.000-0600

I've added a few debug info to the group_count_function_read() and void group_function_write() functions, and found out one interesting thing:

The agent's extension:

exten => _1111XXXXX1.,1,Verbose(Group count before: ${GROUP_COUNT(${EXTEN:6:3}@agents)} for agent ${EXTEN:6:3} unique ${UNIQUEID})
exten => _1111XXXXX1.,2,Set(GROUP()=${EXTEN:6:3}@agents)
exten => _1111XXXXX1.,3,GotoIf($[ ${GROUP_COUNT(${EXTEN:6:3}@agents)} > 1 ]?900)
exten => _1111XXXXX1.,4,Verbose(Group count after: ${GROUP_COUNT(${EXTEN:6:3}@agents)} for agent ${EXTEN:6:3} unique ${UNIQUEID})
exten => _1111XXXXX1.,5,Random(50:15)
exten => _1111XXXXX1.,6,Dial(ZAP/g1/${EXTEN:10},8,tTj)
exten => _1111XXXXX1.,7,HangUp()
exten => _1111XXXXX1.,16,Dial(ZAP/G1/${EXTEN:10},8,tTj)
exten => _1111XXXXX1.,17,HangUp()
exten => _1111XXXXX1.,900,Verbose(Agent ${EXTEN:6:3} is busy...)
exten => _1111XXXXX1.,901,Busy()

Returns the following debug info (when GROUP_COUNT doesn't work):

=== group_count_function_read(): returns: 1, group: 008, category: GROUP_agents
Group count before: 1 for agent 008 unique 1166088877.1402
=== group_function_write() adds to the group: 008@agents
=== group_count_function_read(): returns: 1, group: 008, category: GROUP_agents
Executing GotoIf("Local/1111010081008@zap-asterisk-telefoni-b5c5,2", "0?900") in new stack
=== group_count_function_read(): returns: 2, group: 008, category: GROUP_agents
Group count after: 2 for agent 008 unique 1166088877.1402


As you can see group_count_function_read() function sometimes returns the same number of channels in group as they were before group_function_write() function was called. And my guess is, this is causing the problems. Is it possible that these two functions run in seperate threads, and therefore don't know what each other is doing?

By: Alex Richardson (alexrch) 2006-12-15 06:45:50.000-0600

Is anyone still on this?

By: Marcus Spurkel (mspurkel) 2006-12-30 12:38:54.000-0600

I'm seeing the same behavior with 1.2.14  It worked fine during my testing but with a call center of about 12 operators, it didn't take long for GROUP_COUNT to erroneously return a 0 when in fact that extension was on a call.

To see if I could catch something "in the act", I ran 'group show channels' from the CLI to see if the channel was properly assigned to the group (it was) after the start of the call.  When I noticed a second call coming in to the same extension, I re-ran 'group show channels' and everything still looked the same (the same channel showed as being properly assigned to the group).  The CLI log, however, showed that the latest attempt to dial that extension went through because GROUP_COUNT(${EXTEN}) returned 0 when it should have been 1.

Has _any_ progress been made on this issue?  I can't roll out the new system (without the Agent channel) until this is resolved.

By: wes (whoiswes) 2007-01-02 08:46:30.000-0600

I think this bug is related to the big devicestate bug - if you're not familiar with it, it's worth a read:

http://bugs.digium.com/view.php?id=7433

By: Marcus Spurkel (mspurkel) 2007-01-03 11:24:17.000-0600

Hmmm... interesting.  Yes, it does sound like it's got something to do with it.  Still, I'm using the GROUP function to increment the number of "calls" to a SIP phone rather than relying on its device state, so I'm not sure it's 100% related.  Also, the patch is for 1.4 and I'm still on 1.2.

By: Serge Vecher (serge-v) 2007-02-28 14:24:43.000-0600

alright, since 7433 has been merged, what's the status with 1.2 svn?

By: Leonardo Gomes Figueira (sabbathbh) 2007-03-01 07:57:27.000-0600

serge-v,

there is still a problem with GROUP() on both 1.2.15 and SVN-branch-1.2-r57118. It still returns wrong counter sometimes.

  Leonardo

By: Joshua C. Colp (jcolp) 2007-03-08 12:23:01.000-0600

I've isolated the cause and am working on a fix.

By: blanket (blanket) 2007-03-17 05:03:07

I have the same problem if using Manager API Action Originate with Channel: LOCAL/ext@context and Exten: DISA.

ext@context is increasing GROUP_COUNT correct.
The channel of an dialed exten through DISA can not access to GROUP values, GROUP_LIST() is empty and GROUP_COUNT() is zero.

Hopefully this is fixed with this patch.

version: Asterisk SVN-1.2.16



By: Serge Vecher (serge-v) 2007-03-19 08:38:12

blanket: did you forget to attach the patch? If so, please also mention your disclaimer status.

By: Joshua C. Colp (jcolp) 2007-03-19 11:00:03

Can everyone please try the groupcountv2 branch available at http://svn.digium.com/view/asterisk/team/file/groupcountv2 ? I've had one individual test it and it worked for him but want to make sure it fixes everyone else's issues too.

By: Joshua C. Colp (jcolp) 2007-03-27 19:27:38

I have gotten a report internally that something may still not be right in the proposed branch, so if anyone is using this and still having issues please speak up so I can see if the scenarios are similar.

By: Jared Smith (jsmith) 2007-03-28 20:11:50

I can report that the branch at http://svn.digium.com/view/asterisk/team/file/groupcountv2 has completely solved the issues I was seeing on a fairly heavily-loaded Asterisk server.  Since moving over to the branch, we haven't been able to find any reports of incorrect group counts.

By: Leonardo Gomes Figueira (sabbathbh) 2007-03-30 12:35:19

Tested on SVN-file-groupcountv2-r59373 and could not reproduce the wrong GROUP_COUNT on my test dialplan. Seems it's really fixed.

But it seems a bug was introduced that causes a segfault. BT attached:

debug_group_segfault_SVN-file-groupcountv2-r59373_gdb.txt

The same dialplan does not segfault on 1.2.17.

Thanks file,

 Leonardo

By: Joshua C. Colp (jcolp) 2007-04-02 12:52:44

Would it be possible to get access to the box where that core dump is so I can examine the data structures? Thanks.

By: Leonardo Gomes Figueira (sabbathbh) 2007-04-02 13:49:11

Sure.
How can I contact you in private ?

By: Joshua C. Colp (jcolp) 2007-04-02 13:50:22

file@digium.com will reach me!

By: Joshua C. Colp (jcolp) 2007-04-02 14:20:21

Segfault should be fixed in latest SVN. Update and give it a whirl.

By: Leonardo Gomes Figueira (sabbathbh) 2007-04-02 14:36:15

Yeap. No more segfaults on SVN-file-groupcountv2-r59726

Thanks,

  Leonardo

By: Joshua C. Colp (jcolp) 2007-04-25 14:27:51

Fixed in 1.2 as of revision 61804, 1.4 as of revision 61805, and trunk as of revision 61806. Peace all!