[Home]

Summary:ASTERISK-09971: 1.4.9 changes to app_queue breaks 'joinempty=yes'
Reporter:jfitzgibbon (jfitzgibbon)Labels:
Date Opened:2007-07-27 07:04:59Date Closed:2007-09-10 08:57:44
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_queue
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 10320.patch
Description:1.4.9 introduced changes to app_queue that change the logic associated with exiting the queue due to a timeout.

The 1.4.8 logic would only ever be executed if the 'n' option was passed to Queue(), as designed.

The 1.4.9 logic can be executed in the absence of the 'n' option if the queue is empty.  The net effect is that a queue configured as 'joinempty=yes' with no members in it cannot be enqueued to in 1.4.9 - calls will immediately be kicked back to the dialplan.  Depending on how the dialplan was built, this could result in calls being unceremoniously hung up if there are no further steps in the extension that called Queue()

****** ADDITIONAL INFORMATION ******

The 1.4.8 code to decide if a queue call should be abandoned with EXITTIMEOUT as the reason is this:

                               /* exit after 'timeout' cycle if 'n' option enabled */
                               if (go_on) {
                                       if (option_verbose > 2)
                                               ast_verbose(VERBOSE_PREFIX_3 "Exiting on time-out cycle\n");
                                       ast_queue_log(args.queuename, chan->uniqueid, "NONE", "EXITWITHTIMEOUT", "%d", qe.pos);
                                       record_abandoned(&qe);
                                       reason = QUEUE_TIMEOUT;
                                       res = 0;
                                       break;
                               }

go_on is a variable that starts as zero and is only set to one if the 'n' switch is passed to Queue().  If you don't pass 'n', this code can never be executed.

The 1.4.9 code uses the new 'membercount' field in the queue struct to attempt to do the same thing more efficiently:

                       /* exit after 'timeout' cycle if 'n' option enabled */
                       if (go_on >= qe.parent->membercount) {
                               if (option_verbose > 2)
                                       ast_verbose(VERBOSE_PREFIX_3 "Exiting on time-out cycle\n");
                               ast_queue_log( args.queuename, chan->uniqueid, "NONE", "EXITWITHTIMEOUT", "%d", qe.pos);
                               record_abandoned(&qe);
                               reason = QUEUE_TIMEOUT;
                               res = 0;
                               break;
                       }

But due to the >= comparison, the code can be executed if qe.parent->membercount is zero.  A queue that has no members in it will have qe.parent->membercount set to zero, so this code will always be executed if Queue() is called without the 'n' option being passed.
Comments:By: jfitzgibbon (jfitzgibbon) 2007-07-27 07:10:06

Though I can't seem to lock onto the circumstances, I also oberved this breaking queues configured as "leavewhenempty=no".

I can't seem to replicate it using a single queue and a single member (on pause or on a call or letting their phone ring through), but last night when I had 4 agents on a queue (3 on calls), I saw this on my console just as one agent didn't pick up:

   -- Nobody picked up in 20000 ms
   -- Exiting on time-out cycle

All of my queues are configured as "joinempty=yes" and "leavewhenempty=no".  I never want to kick a caller out of my queue based on agent availability.

I am assuming that the same changes as broke joinempty are responsible for this, and hopefully reviewing those changes will bring the cause of this other situation to light as well.

By: Mark Michelson (mmichelson) 2007-07-27 07:19:08

The behavior of the 'n' option was changed just before the 1.4.9 release because of issue ASTERISK-9813. You can see the comments there which led to the change. That being said, this is a problem and I will have a fix for it as soon as possible.

By: jfitzgibbon (jfitzgibbon) 2007-07-27 07:23:16

I understand the change made to 'n' and why (the behaviour in 1.4.8 is actually what I want for a few other unrelated queues that do have 'leavewhenempty' set to 'yes'.

My concern is the application of the 'n' logic when the switch was not passed to Queue().

By: Mark Michelson (mmichelson) 2007-07-27 07:31:32

Patch uploaded. If the n option is not used, then go_on will be set to -1 so that it can never cause the timeout behavior.

By: jfitzgibbon (jfitzgibbon) 2007-07-27 07:32:49

Just posted this to -users, thought it might be useful to have it in this record:

[...]

to clarify though the change made in 1.4.8 related to the 'n' option doesn't seem to cause this.  1.4.8 (with or without the 'n' option) does what I'd expect it to.  It's the logical overloading of go_on in 1.4.9 that broke things.  go_on used to be a boolean - now it's a dual-valued variable.  Sometimes it's compared just for it's truthfulness, and sometimes it's used in a numeric comparison against the queues's membercount.

If go_on had been left alone and a new variable used to store the threshold to be compared against membercount, I don't think 1.4.9 would have broken things, because go_on would (in my configuration) never have been set to anything but zero, preventing the numeric comparison against membercount from ever being performed.

By: jfitzgibbon (jfitzgibbon) 2007-07-30 08:00:08

I had a major problem with our queues on Friday - almost every member was reported as state "Unknown" in the output of "show queues".  This was causing agents to get calls when they were already on one (even though I have ringinuse=no), and for calls to get stuck in the queue without being deqeueued to idle agents.

I tried to restart *, but the problem reappeared.  I was forced to downgrade to 1.4.7.1, where I have not seen the problem since.

In short, I'm not in a position to test this patch on my production system, nor do I expect that I will be any time soon.  It has been a very bad couple of weeks for Asterisk patching for us, and I'm not going to be able to convince anyone that going back to 1.4.9 is worth the perceived risk ATM.

By: Mark Michelson (mmichelson) 2007-07-30 09:49:44

All right. I can understand why you'd be wary to try out a patch on your production system given the recent problems. I will attempt to get someone to test this patch on a test setup. If it works properly, I will close out this issue.

I also recommend that if you have not done so yet, open a new bug report detailing your problem with the member status being set to UNKNOWN as that appears to be unrelated to this issue.

By: Leif Madsen (lmadsen) 2007-07-31 09:23:31

I tried testing this, and I can't get callers to stay in the Queue() with or without the patch either. Here is my debug output:

   -- Executing [new@sub-queues:12] Queue("SIP/ss1tampa-09970bf0", "xfertest_CQ_40||||300") in new stack
[Jul 31 10:38:38] DEBUG[7181]: app_queue.c:1185 join_queue: Retrieved queue "xfertest_CQ_40" with 0 members
   -- Started music on hold, class 'default', on SIP/ss1tampa-09970bf0
[Jul 31 10:38:38] DEBUG[7181]: app_queue.c:3451 queue_exec: About to try calling member, go_on is 0 and membercount is 0
[Jul 31 10:38:38] DEBUG[7181]: app_queue.c:1903 wait_for_answer: Everyone is busy at this time
[Jul 31 10:38:38] DEBUG[7181]: app_queue.c:3453 queue_exec: Tried calling member, go_on is -1 and membercount is 0

By: Mark Michelson (mmichelson) 2007-07-31 11:11:41

blitzrage, I'm a bit confused. You say you couldn't keep callers in the queue both with AND without the patch? Sounds like some other change to app_queue is also affecting the behavior. I'll look at some diffs between 1.4.7.1 and 1.4.9 and see if I can find the problem.

By: Leif Madsen (lmadsen) 2007-07-31 11:22:50

Well I'm running 1.4.9, so I'm just saying that I was able to reproduce the issue the original poster stated, and that applying the 10320.patch file didn't resolve the issue :)

By: Mark Michelson (mmichelson) 2007-07-31 11:23:48

I'm an idiot. Disregard my previous note.

By: Mark Michelson (mmichelson) 2007-07-31 12:47:55

Hmm, I just gave the patch a try and it worked for me.

blitzrage, did you have joinempty=yes and leavewhenempty=no set in your test?

By: Leif Madsen (lmadsen) 2007-07-31 13:18:21

Crap... my bad. I didn't have 'leavewhenempty' correctly set (it was yes, and not no)

Patch confirmed to work. Will do a bit more testing with 'n' to make sure that didn't get broke.

By: Digium Subversion (svnbot) 2007-08-01 08:51:33

Repository: asterisk
Revision: 77854

------------------------------------------------------------------------
r77854 | mmichelson | 2007-08-01 08:51:32 -0500 (Wed, 01 Aug 2007) | 8 lines

Fixes an issue I introduced to queues wherein a queue with joinempty=yes would kick people out of the queue because of erroneously
thinking the 'n' option was in use.

(closes issue ASTERISK-9971, reported by jfitzgibbon, patched by me, tested by blitzrage and me)

Thank you blitzrage for all the testing you've done lately with queues! It's much appreciated!


------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-08-01 08:52:22

Repository: asterisk
Revision: 77855

------------------------------------------------------------------------
r77855 | mmichelson | 2007-08-01 08:52:21 -0500 (Wed, 01 Aug 2007) | 16 lines

Merged revisions 77854 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r77854 | mmichelson | 2007-08-01 09:08:57 -0500 (Wed, 01 Aug 2007) | 8 lines

Fixes an issue I introduced to queues wherein a queue with joinempty=yes would kick people out of the queue because of erroneously
thinking the 'n' option was in use.

(closes issue ASTERISK-9971, reported by jfitzgibbon, patched by me, tested by blitzrage and me)

Thank you blitzrage for all the testing you've done lately with queues! It's much appreciated!


........

------------------------------------------------------------------------

By: jfitzgibbon (jfitzgibbon) 2007-08-08 06:04:35

I was been testing 1.4.9 with the patch for the last few days.  No problems until yesterday when we moved to having our agents use PauseQueueMember() when they had to leave their desk instead of logging off of the queue with RemoveQueueMember().

From August 3rd through the 6th, I had zero unanswered calls with an abandon code of EXITWITHTIMEOUT.  Yesterday, I lost 247 callers due to EXITWITHTIMEOUT.

I observed the "Exiting on time-out cycle" message on my console when all of my agents were in state "In-use" or "paused".

I am going to try to re-create the situation on a lab box.  I will need a few days, as the bug does not manifest when I test this with a small scale setup (i.e. one agent).  I already tested several single agent scenarios before putting the patch into production.  Somehow membercount is getting set to negative one, causing the new majik to break just like the old majik.

Is there any reason why the code has to use these magic values?  In try_calling(), the options are parsed and a boolean "noption" is set to 1.  Then if that boolean is true, go_on is given the magic value -1, which is supposed to ensure that the comparison against membercount will always fail back in queue_exec().  But my observations suggest that there are situations in which the comparison will be true even in the absence of the 'n' option being passed to Queue() in the dialplan.

If you've got a boolean value that is only true when n is passed and only false when n is not passed, why use it to impart magic?  Why not just use the boolean value.  It would even make the comment match the code:

 /* exit after 'timeout' cycle if 'n' option enabled */
 if (go_on >= qe.parent->membercount) {

Assume for a moment that 'noption' was available in the scope of queue_exec().  Then this code could read:

 /* exit after 'timeout' cycle if 'n' option enabled */
 if( noption && go_on >= qe.parent->membercount) {

Which would guarantee that this code branch can't accidentally be entered into.  Right now a more accurate comment is: "exit after 'timeout' cycle if go_on has what we believe to be an out of range value"

It seems that the changes to app_queue in 1.4.8 and 1.4.9 were made with the goal of doing things more efficiently (i.e. not having to walk the member list to answer questions like "how many members are in the queue"), but without enough edge case analysis.

I've found two edge cases now in which these changes cause the 'n' option logic to be executed even if the 'n' option wasn't patched.  One has been fixed and was rolled into 1.4.10, but it seems that there's at least one other set of circumstances that can cause the bug to manifest.  Neither edge case would exist if noption was used as a boolean consistently in both queue_exec() and try_calling().

The code that parses the 'n' option and sets 'noption' to true is in the try_calling() function.  The code that compares membercount to go_on is in queue_exec().  go_on is passed by reference from queue_exec() to try_calling(), whereas noption only exists in try_calling().  I can see why it may have seemed like a good idea to overload go_on in the manner that it was, but given the issues that have come up over it, why not just declare noption in queue_exec() and pass it by reference to try_calling() as well?

try_calling() is static to app_queue, and the only function in app_queue that calls it is queue_exec().  I can't think of any good reason not to change the signature of try_calling() so that the 1.4.8 app_queue changes are easier to read.  As it stands, I doubt anyone reading the code of app_queue for the first time without the benefit of this bug report would understand what's going on, as there are no comments to indicate what values go_on can have an what the various ranges of those values mean.

I'll update this ticket when I have more lab results on the edge case I observed yesterday.

By: Mark Michelson (mmichelson) 2007-08-08 09:03:52

That is an excellent idea, I will be committing just such a change in a little while. This will end any such accidental "Exiting on timeout" messages for people who do not use the 'n' option.

I won't close this issue though since it sounds like there is a flaw in the membercount logic somewhere that should be resolved first.

By: Digium Subversion (svnbot) 2007-08-08 09:09:00

Repository: asterisk
Revision: 78575

------------------------------------------------------------------------
r78575 | mmichelson | 2007-08-08 09:08:59 -0500 (Wed, 08 Aug 2007) | 4 lines

Changing a bit of logic so that someone will NEVER exit the queue on timeout unless they have enabled the 'n' option.
This commit relates to issue ASTERISK-9971. Thanks to jfitzgibbon for detailing the idea behind this code change.


------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-08-08 09:16:34

Repository: asterisk
Revision: 78590

------------------------------------------------------------------------
r78590 | mmichelson | 2007-08-08 09:16:33 -0500 (Wed, 08 Aug 2007) | 12 lines

Merged revisions 78575 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r78575 | mmichelson | 2007-08-08 09:26:36 -0500 (Wed, 08 Aug 2007) | 4 lines

Changing a bit of logic so that someone will NEVER exit the queue on timeout unless they have enabled the 'n' option.
This commit relates to issue ASTERISK-9971. Thanks to jfitzgibbon for detailing the idea behind this code change.


........

------------------------------------------------------------------------

By: Mark Michelson (mmichelson) 2007-08-08 09:18:47

You'll notice in my latest commit that now noption is a boolean that is set if the 'n' option is enabled, and the variable which keeps track of how many calls to queue members have been attempted is called "tries." This should clear up confusion in the code and get rid of any erroneous queue exits for people not using the 'n' option.

Thank you very much, jfitzgibbon, for suggesting this.

By: jfitzgibbon (jfitzgibbon) 2007-08-09 06:10:16

I'm now running 1.4.10 with the updated version of app_queue.c from yesterday.  Initial tests look good, and I'll report on any issues found under real world traffic.  Given that this change restores the 1.4.7.1 logic regarding the 'n' option, I don't expect we'll see any problems.  Every time we've seen issues with 1.4.8 or 1.4.9, we've immediately jumped back to 1.4.7.1 and run without issue until the next time we've tried to upgrade.

I'm still going to try to reproduce the "membercount = -1" situation I observed, it's just taking a little time to build an environment.  My hypothesis is that it doesn't happen when a queue just has one member in it (since I've tested that scenario extensively).  Unfortunately, even to just add one more agent makes the testing harness much larger, as I need to act like up to 5 people at once (two agents, two callers tying up those agents and one more caller attempting to enqueue).

I'm going to test the 9 variants of state ("not in use", "in use" and paused") between two members, attempting to enqueue a new caller in each case.

I've looked over all the places where q->membercount is manipulated and nothing obvious jumps out at me.  I'm hoping this isn't a race condition where membercount is being decremented in two threads at once (e.g. a member being removed at the same time the queue is being reloaded), because that will be much harder to track down.

Thanks

By: Leif Madsen (lmadsen) 2007-08-09 07:44:15

jfitzgibbon: if you need another set of hands to dial into the queue, let me know as I'd be happy to help. Just find me on IRC in #asterisk-dev as lmadsen.

By: jfitzgibbon (jfitzgibbon) 2007-08-10 08:28:05

I got pulled onto other things trying to look at the membercount problem, but the good news is that I saw zero timeouts all day yesterday, so the patch looks good thusfar.

By: Digium Subversion (svnbot) 2007-08-10 09:55:11

Repository: asterisk
Revision: 78994

------------------------------------------------------------------------
r78994 | murf | 2007-08-10 09:55:07 -0500 (Fri, 10 Aug 2007) | 117 lines

Merged revisions 78570,78590,78635,78637,78648-78649,78679,78683,78685-78686,78718,78747 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r78570 | file | 2007-08-08 07:52:13 -0600 (Wed, 08 Aug 2007) | 12 lines

Merged revisions 78569 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r78569 | file | 2007-08-08 10:51:01 -0300 (Wed, 08 Aug 2007) | 4 lines

(closes issue ASTERISK-9984)
Reported by: adamgundy
Update sip.conf to include another scenario where directrtpsetup will fail.

........

................
r78590 | mmichelson | 2007-08-08 08:34:10 -0600 (Wed, 08 Aug 2007) | 12 lines

Merged revisions 78575 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r78575 | mmichelson | 2007-08-08 09:26:36 -0500 (Wed, 08 Aug 2007) | 4 lines

Changing a bit of logic so that someone will NEVER exit the queue on timeout unless they have enabled the 'n' option.
This commit relates to issue ASTERISK-9971. Thanks to jfitzgibbon for detailing the idea behind this code change.


........

................
r78635 | mmichelson | 2007-08-08 12:34:16 -0600 (Wed, 08 Aug 2007) | 11 lines

Blocked revisions 78620 via svnmerge

........
r78620 | mmichelson | 2007-08-08 13:16:49 -0500 (Wed, 08 Aug 2007) | 4 lines

Fixed some compiler warnings so that compiling with dev-mode and IMAP storage would not have any errors.
This section of code may get changed again shortly since my change uncovers a rather silly bit of logic.


........

................
r78637 | file | 2007-08-08 13:03:46 -0600 (Wed, 08 Aug 2007) | 2 lines

Correct spelling. s/threaads/threads/

................
r78648 | qwell | 2007-08-08 13:30:33 -0600 (Wed, 08 Aug 2007) | 10 lines

Merged revisions 78646 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r78646 | qwell | 2007-08-08 14:29:42 -0500 (Wed, 08 Aug 2007) | 2 lines

Fix mogs email address.

........

................
r78649 | file | 2007-08-08 13:30:52 -0600 (Wed, 08 Aug 2007) | 2 lines

Merge audiohooks branch into trunk. This is a new API for developers to listen and manipulate the audio going through a channel.

................
r78679 | file | 2007-08-08 14:49:07 -0600 (Wed, 08 Aug 2007) | 2 lines

HAVEL_SS7 should be HAVE_SS7. Reported by kwallace.

................
r78683 | file | 2007-08-08 15:44:58 -0600 (Wed, 08 Aug 2007) | 2 lines

Add support for using epoll instead of poll. This should increase scalability and is done in such a way that we should be able to add support for other poll() replacements.

................
r78685 | file | 2007-08-08 15:58:12 -0600 (Wed, 08 Aug 2007) | 2 lines

Minor fix for building under dev mode when byteswapping macro header files are not available.

................
r78686 | file | 2007-08-08 16:05:45 -0600 (Wed, 08 Aug 2007) | 2 lines

Regenerate configure script. This actually just updated the revision number... since my last merge changed it to an older number, while it was in fact generated from a much newer revision.

................
r78718 | russell | 2007-08-09 10:13:26 -0600 (Thu, 09 Aug 2007) | 15 lines

Merged revisions 78717 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r78717 | russell | 2007-08-09 11:12:57 -0500 (Thu, 09 Aug 2007) | 7 lines

Fix a problem with the combination of the 'F' option to pass DTMF through a
conference and options that use DTMF to activate various features.  The problem
was that the BEGIN frame would be passed through, but the END frame would get
intercepted to activate a feature.  Then, the other conference members would hear
DTMF for forever, which they didn't seem to like very much.
(closes issue ASTERISK-10038, reported by stevefeinstein, fixed by me)

........

................
r78747 | russell | 2007-08-09 11:07:36 -0600 (Thu, 09 Aug 2007) | 4 lines

Fix a problem that I had introduced into MWI handling.  I had ignored
the mailbox context.  Now, all related MWI event dealings pay attention
to the context as well.

................

------------------------------------------------------------------------

By: Michiel van Baak (mvanbaak) 2007-09-08 07:15:00

jfitzgibbon: does that mean this issue qualifies for 'resolved' ?

By: jfitzgibbon (jfitzgibbon) 2007-09-10 07:43:22

Yes, sorry, I should have reported back sooner.  When I was running 1.4.10.1 or 1.4.11, I did not see this problem reoccur.