[Home]

Summary:ASTERISK-11564: Using state_interface and Local channels allows several simultenous calls to be sent to agent
Reporter:Atis Lezdins (atis)Labels:
Date Opened:2008-03-03 12:26:17.000-0600Date Closed:2008-03-18 15:59:57
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Applications/app_queue
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 12127.patch
( 1) 12127v2.patch
( 2) 12127v3.patch
Description:I've seen this situation on 8-CPU 3GHz Xeon, that agent gets 4 queue calls simultaneously.

For now i've been able to avoid this by checking/setting group at beginning of Local channel, setting call-limit=1 for SIP peers (only for those who shouldn't receive more than 1 call simultaneously) and checking group upon unsuccessful Dial from Local channel. Those 3 fixes together are giving some good results for one most-problematic queue, however that's not a permanent and stable solution. If queue is creating several channels at the same time, on slower but multi-cored systems this could lead to race condition between GROUP_COUNT and Set(GROUP()) in two parallel channels, and call-limit=1 also can't always be set.

I'm not sure if this can be implemented in good way, but I think the most appropriate solution for this would be that Queue is setting device state of Local channel, and other queue(s) (if member is in multiple queues) receives event from first queue's devicestate change.

My system is Asterisk 1.4.14 with backported state_interface from r103809

****** ADDITIONAL INFORMATION ******

[Mar  3 06:51:57] VERBOSE[28683] logger.c:     -- Executing [28354@default_queue:1] NoOp("Local/28354@default_queue-db67,2", "-(B)------------------------------------- context def
[Mar  3 06:51:57] DEBUG[28682] app_macro.c: Executed application: Set

[Mar  3 06:51:57] VERBOSE[28687] logger.c:     -- Executing [28354@default_queue:1] NoOp("Local/28354@default_queue-087a,2", "-(B)------------------------------------- context def
[Mar  3 06:51:57] VERBOSE[28687] logger.c:     -- Executing [28354@default_queue:2] GotoIf("Local/28354@default_queue-087a,2", "1?3:21") in new stack

Comments:By: Mark Michelson (mmichelson) 2008-03-03 15:41:22.000-0600

It wouldn't be very difficult to program the member's status in app_queue to change to "ringing" as soon as the call is placed. The problem is that the state_interface mechanism was programmed in an open-ended way such that people could use it for whatever means they want. For instance, administrators can  have a SIP device specified as the member's interface and a completely different SIP device specified as the state_interface if they desire. What I'm getting at is that it's not wise for app_queue to try to guess what you're attempting to accomplish when specifying a state_interface for a queue member.

Having said that, I believe that use of the DEVICE_STATE dialplan function to set a custom device state would be the best way to handle this problem. I realize that you are using a backport of the state_interface mechanism in 1.4 (for which state_interface is not supported), which means that you don't have the DEVICE_STATE function. The use of DEVICE_STATE was kept in mind for people who may have situations like this one (where the local channel may take a measurable amount of time before dialling the end device) or in situations where the local channel specified may call one of many different end devices.

Since this situation is handleable in the dialplan, I will go ahead and close this.

By: Atis Lezdins (atis) 2008-03-03 16:06:47.000-0600

I understand what you mean by using func_devstate, i've tried that but not together with state_interface. As it doesn't allow setting state of Local channel (or any real device), GROUP_COUNT is still superior to that, as it allows to set group and use it until channel dies.

For example, now i have dialplan like this:

if (${GROUP_COUNT(${EXTEN}@queue_call)=0}) {
 Set(GROUP(queue_call)=${EXTEN});
 // call to device, etc
} else {
 Busy();
}


However the problem is that those channels are invoked at almost the same time, so it may happen that GROUP_COUNT of both channels get executed simultaneously and returns 0, and then both channels end up calling.




By: Atis Lezdins (atis) 2008-03-04 14:08:39.000-0600

I suspect that this is also reason for significant load increase. Regularly i hav loadavg 2 or 3 in peak hours, after migration to state_interface (and lot of other changes) loadavg is between 2 and 4, however i've seen that whenever agents are freeing up and taking those calls i have 3 times more channels than usually (most of them is executing Busy() as in previous dialplan), and loadavg is going up to 6.

By: Atis Lezdins (atis) 2008-03-04 14:37:35.000-0600

<putnopvut> atis_home: there's a real easy way to fix it in app_queue.
<putnopvut> After the ast_call in rin_entry, add update_status(qe->parent, tmp->member, AST_RINGING);
<putnopvut> s/rin_entry/ring_entry/
<jbot> putnopvut meant: After the ast_call in ring_entry, add update_status(qe->parent, tmp->member, AST_RINGING);
atis_home> so, that would be in else block, because if seems to be executed if call cannot be placed?
<putnopvut> I'd just put it after the else block.

By: Atis Lezdins (atis) 2008-03-10 08:52:38

Update: i found the problem with performance and lot of simultenous calls, and that's not what i've described before. Silly, but it was ringinuse parameter that defaults to 1.

However there's still a window where app_queue might dial to local channel when it's already being used. I followed status change, and seems that problem lays within update_dial_status. I have no idea how that's supposed to work, but it doesn't work correctly in combination of local channels and state_interface. For example if app_queue dials to local channel, update_dial_status sets queue member's state to UNKNOWN, regardless of dial result. If dialplan finally does Dial() to the same agent who's state interface is set in queue_members everything is ok, as handle_statechange gets executed upon ringing, and further status updates are correct (but there's a window until that when other thread can dial same agent). If dialplan doesn't dial actual SIP extension, queue will still update it's status trough update_dial_status, and queue member will remain in UNKNOWN state (or RINGING if this fix applied).

My setup is currently working quite good with call to update_status() commented out within update_dial_status, but it's first day since i put that in production. Generally it would be good if there would be some option, how to configure this, but i'll need some more understanding on update_dial_status and what it does.

By: David Brillert (aragon) 2008-03-10 17:16:43

Hi atis

I have the same problem on a site where agent receives multiple calls from queue while on call.

Has your fix helped reduce CPU load also?

By: Doug (doug) 2008-03-11 05:28:03

We have the same problem and from our discovery this happens when Devstate is showing as (Unknown). This causes the queue to send calls allthough the agent is (In use) and therefore will sent a 2nd, 3rd call to the agent. As far as i can see the (Unknown) state as said by Atis is to do with using the state_interface backport. Since 1.4.18 with patch backported we see at some points the correct states such as (ringing), (paused) but often we will see occurence as explained above.
On the http://bugs.digium.com/view.php?id=11603 ticket the end result shows an update to Asterisk
r97203 | mmichelson | 2008-01-08 15:14:44 -0600 (Tue, 08 Jan 2008) | 8 lines

does anyone know if this resolved the problem without the state_interface patch?



By: Atis Lezdins (atis) 2008-03-11 07:01:16

Removing call from update_dial_status to update_status works perfectly together with state_interface. We have run numerous tests, and there's no (Unknown) state anymore. As queue dials agents in (Unknown) state, this helps increasing performance, because no channels are created.

However there's a window between creation of Local channel and dialing to actual SIP device. I see that in trunk there's devstate support for local channels, however backporting that doesn't help (as queue doesn't check state of local channel anymore). So, with having second interface for queue_member, it's status should be calculated from both - interface and state_interface. This somehow is contradictory to purpose of state_interface, so maybe some option could be added for this purpose?

By: Joel Vandal (jvandal) 2008-03-11 10:31:16

I have remove(comment) the update_status() call on app_queue.c and "strangely" all  agents status look better.  I'm not sure what the exact use of this function but channel status is already handled by the 'handle_statechange' function.


DougUDI and aragon, I will contact you this afternoon on MSN and will provide you new RPMS packages to test on your server.

By: Atis Lezdins (atis) 2008-03-11 10:33:57

jvandal: in our tests status is always correct when "queues show" is invoked, do you have situation when it's not?

By: Joel Vandal (jvandal) 2008-03-11 10:38:12

atis: will let you know later tonight or tomorrow, will have multiple server that will it with 'update_status()' disabled.

By: Mark Michelson (mmichelson) 2008-03-11 11:50:08

Good find, folks! I had a look at the update_dial_status and update_status functions and had a discussion about them with Russell.

So here's the deal...those functions are in app_queue and predate the use of device state as the indicator for member status. There are three places where update_dial_status is called, and only one of those places actually makes any sense. It is called in ring_entry in the case that ast_request fails. The other two uses are immediately after a call to ast_request if the status returned from ast_request is different than the status we had for the member. Here's the problem with that comparison: ast_request returns a cause code, and the member's status is a device state value, so the two values are incompatible and it makes no sense to compare them. Definitely buggy.

Then, within update_dial_status, there is some more bugginess. Only three of the many cause codes are actually handled, and if any other cause is sent back, the member's status is set to "unknown."

But the main thing that bothers me is that even with all the weird things going on with update_dial_status and update_status, none of it should be necessary now because of the device state code. I'm going to investigate a bit further and see if there actually IS any reason to have those functions in the source.

Once again, great find!

By: Mark Michelson (mmichelson) 2008-03-11 13:17:36

I did a bit of looking and testing, and it seems like the update_dial_status function should be used in the case that an ast_request fails. Think about the following scenario: You have a SIP device as a member in the queue, then you unload chan_sip.so. If you do a "queue show" at this point, then the SIP device will still show up as "Not in Use" when it should be "Invalid." The first time the channel is requested, the failure will cause the status to be updated correctly to "Invalid." If chan_sip is loaded later, then the device state thread will take care of returning the device back to "Not in Use."

So, here are the changes I'm going to make.

1. Remove all but the one call to update_dial_status in the case where ast_request fails.

2. Since this indicates that some sort of failure occurred, we should never set the member's status to "unknown." I will remove the part of update_dial_status that sets the member's status to "unknown."

By: Atis Lezdins (atis) 2008-03-11 13:28:24

I don't like that member state is changed by other means than devicestate, because when devicestate does it - it's really corresponding. Maybe it's possible to adjust channel drivers upon unload to do devstate update?

After looking into ast_request i see that it can fail for example with codec incompatibility. Should that change state of queue member?

By: Mark Michelson (mmichelson) 2008-03-11 14:12:25

atis:

Codec incompatibility should not change the state of the member. I should have been more clear in my previous message. The case where we now set the status to "unknown" will instead NOT change the status. This is what I meant when I said I would remove the code to change the status to "unknown."

By: Mark Michelson (mmichelson) 2008-03-11 14:53:46

I've uploaded 12127.patch for testing after discussing with atis on IRC. Here are the changes:

1. Completely remove update_dial_status and update_status functions.
2. Replace the update_dial_status call when ast_request fails with ast_device_state instead. This will actually allow us to request the current device state from the channel driver instead of just guessing like we do now.
3. Remove other calls to update_dial_status from app_queue.

Give this a test and see if this works for you!

By: Mark Michelson (mmichelson) 2008-03-11 15:00:53

A note for those using trunk, the line

tmp->member->status = ast_device_state(tmp->member->interface);

will be changed to

tmp->member->status = ast_device_state(tmp->member->state_interface);

due to the state_interface feature.

By: Mark Michelson (mmichelson) 2008-03-11 15:12:25

12127v2.patch has now been uploaded. atis suggested that the member's status change should happen while the queue is locked. While most of the time this would be the case in the original patch I uploaded, it would not be the case 100% of the time, especially if timeoutrestart were enabled for the queue in question.

Thanks, atis, for all the work you've put into this one!

By: David Brillert (aragon) 2008-03-11 15:58:19

putnopvut you rule :)
Thanks for the patch to test

We are testing using pbx-test to generate load etc...
But cannot complete test because our Asterisk servers keeps crashing (but not because of your 12127v2.patch)
Jvandal and I have the same problem with Asterisk segfaulting as well as original poster for bug http://bugs.digium.com/view.php?id=12098

We have attached backtraces and jvandal is generating debug info with valgrind but so far this bug has grown very cold.

We want to verify your patch but cannot since we cannot generate load without segfaulting every 15 minutes.

Can you help us by taking a look at http://bugs.digium.com/view.php?id=12098

We are desperate to get 12098 fixed
We are prepared to help debug in any possible way :)

By: Atis Lezdins (atis) 2008-03-11 16:16:01

Currently this wouldn't generate QueueMemberStatus event if chan_sip is unloaded.

I think it's worth keeping update_status (maybe throwing out the part with queue/member iteration), and maybe even using it from handle_statechange. That would allow to generate AMI event of member status change correctly at all ocasions, plus it would go well with one of my future patches (realtime member status).

aragon: regarding pbx-test you could try decreasing call volume going trough (increase delay between them, increase hangup probability). i've seen that on extremly high load Asterisk will crash nearly always because of various locking problems. Make sure load is real, and corresponds to your production usage :)

btw, my production is already working for 1 and half days with completely commented out update_status line, and no status problems so far.

By: David Brillert (aragon) 2008-03-11 16:32:23

Hi atis

I really appreciate your efforts to fix this bug!!!!!
But the other bug is reproduceable with an identical backtrace every time and across multiple implementations and not just by me.

I think a bug which causes a segfault should be looked at regardless of call volume.

My customer servers are generally call centers with very high call volumes.
I'm begging for someone to assist with the crashes in bug 12098 so I can move on to the next problem.

Putnopvut has been a lifesaver many times when no one else would look at a problem so I am hoping for his assistance with 12098 also. The bug has not been assigned or even confirmed by a bug marshall.

By: David Brillert (aragon) 2008-03-11 16:33:47

putnopvut...
I would not normally ask like this but we are desperate.

By: Doug (doug) 2008-03-13 03:50:30

Thanks to Atis for all your work! Your suggestion works very well. We will try putnopvut's patch and report back. Thanks to you all this issue has been bothering us for some time. The queues seem now to have a 20% performance improvment.

By: Mark Michelson (mmichelson) 2008-03-13 17:43:25

I talked with atis on IRC some, and he alerted me that while the latest patch was working, it was not exactly complete and he suggested a new method for updating the status of members. 12127v3.patch has the following differences from v2.

1. On failure to request a channel, instead of updating the member's status, make sure to update the status of that member in all queues to which that member belongs.

2. Call a single common function for whenever member status changes. This unifies   some repeated code and will hopefully reduce errors that may occur in the future due to code changes.

If you could please test to make sure there aren't any errors in the latest patch, that would be great. Thanks!

By: Atis Lezdins (atis) 2008-03-14 10:28:11

We have tested latest patch manually (for our usage scenarios), plus we loaded it for 10 hours on testing machine.

Can anybody else test it?

By: Atis Lezdins (atis) 2008-03-14 11:36:39

for those, using state_interface - you need to replace:

within update_status():

     tmp_interface = ast_strdupa(cur->state_interface);

within ring_entry():

     update_status(tmp->member->state_interface, ast_device_state(tmp->member->state_interface));

By: Joel Vandal (jvandal) 2008-03-14 14:05:37

For now, it seems that everything works perfectly on my 3 test servers (CentOS 4 i386, CentOS 4 i386 and x86_64).

I have repackage new RPMS of latest 1.4 branches with this patch and sent it to my "testers" (thanks aragon, DougUDI). Will have feedback tommorow.

By: Mark Michelson (mmichelson) 2008-03-18 15:33:30

Based on the latest comments I'm seeing here, it seems like this is working well for those who have tested it. I'm going to commit this.

By: Digium Subversion (svnbot) 2008-03-18 15:48:05

Repository: asterisk
Revision: 109713

U   branches/1.4/apps/app_queue.c

------------------------------------------------------------------------
r109713 | mmichelson | 2008-03-18 15:47:57 -0500 (Tue, 18 Mar 2008) | 12 lines

This patch makes it so that all queue member status changes are handled through device state
code. This removes several problems people were seeing where their queue members would get into
an "unknown" state. Huge props go to atis on this one since he was the one who found the code
section that was causing the problem and proposed the solution. I just wrote what he suggested :)

(closes issue ASTERISK-11564)
Reported by: atis
Patches:
     12127v3.patch uploaded by putnopvut (license 60)
Tested by: atis, jvandal


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=109713

By: Digium Subversion (svnbot) 2008-03-18 15:54:49

Repository: asterisk
Revision: 109714

_U  trunk/
U   trunk/apps/app_queue.c

------------------------------------------------------------------------
r109714 | mmichelson | 2008-03-18 15:54:48 -0500 (Tue, 18 Mar 2008) | 20 lines

Merged revisions 109713 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r109713 | mmichelson | 2008-03-18 15:52:15 -0500 (Tue, 18 Mar 2008) | 12 lines

This patch makes it so that all queue member status changes are handled through device state
code. This removes several problems people were seeing where their queue members would get into
an "unknown" state. Huge props go to atis on this one since he was the one who found the code
section that was causing the problem and proposed the solution. I just wrote what he suggested :)

(closes issue ASTERISK-11564)
Reported by: atis
Patches:
     12127v3.patch uploaded by putnopvut (license 60)
Tested by: atis, jvandal


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=109714

By: Digium Subversion (svnbot) 2008-03-18 15:59:57

Repository: asterisk
Revision: 109716

_U  branches/1.6.0/
U   branches/1.6.0/apps/app_queue.c

------------------------------------------------------------------------
r109716 | mmichelson | 2008-03-18 15:59:52 -0500 (Tue, 18 Mar 2008) | 28 lines

Merged revisions 109714 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r109714 | mmichelson | 2008-03-18 15:59:02 -0500 (Tue, 18 Mar 2008) | 20 lines

Merged revisions 109713 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r109713 | mmichelson | 2008-03-18 15:52:15 -0500 (Tue, 18 Mar 2008) | 12 lines

This patch makes it so that all queue member status changes are handled through device state
code. This removes several problems people were seeing where their queue members would get into
an "unknown" state. Huge props go to atis on this one since he was the one who found the code
section that was causing the problem and proposed the solution. I just wrote what he suggested :)

(closes issue ASTERISK-11564)
Reported by: atis
Patches:
     12127v3.patch uploaded by putnopvut (license 60)
Tested by: atis, jvandal


........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=109716