ASTERISK-12125: [patch] Bogus <member> is still 'Not in Use' warnings for AgentLogin'ed agents.

[Home]

Summary: ASTERISK-12125: [patch] Bogus <member> is still 'Not in Use' warnings for AgentLogin'ed agents.

Reporter: David Woolley (davidw) Labels:

Date Opened: 2008-06-02 09:57:05 Date Closed: 2008-07-29 08:51:28

Priority: Minor Regression? No

Status: Closed/Complete Components: Applications/app_queue

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 20080710__bug12771.diff.txt
( 1) 20080711__bug12771.diff.txt
( 2) 20080717__bug12771.diff.txt
( 3) 20080725__bug12771__trunk.diff.txt
( 4) el4-el5-state-compare.txt
( 5) el4-queu-state.txt
( 6) el5-queue-state.txt
( 7) trunk.adaptation

Description: Asterisk seems to correctly know when an AgentLogin type agent is busy on a call, however every time our agent gets a call, we get a message like:

WARNING[16777] app_queue.c: The device state of this queue member, Anonymous, is still 'Not in Use' when it probably should not be! Please check UPGRADE.txt for correct configuration settings.

We've tried to get good presence information from the SIP phones that are logging in as the agents, but haven't found a recipe that does achieves this for outgoing calls. However, this should be irrelevant as it is the Agent itself whose state matters, not that of the SIP phone. Moreover, for AgentLogin, the busy state on the phone wouldn't change for each call, but would cover the whole logged in period.

****** ADDITIONAL INFORMATION ******

ASTERISK-11684 comes close to duplicating this, but it was closed because they were using a back port patch; we are using unpatched code. I think the real channel behind the agent is more complex as well, in that case.

This message is coming up as a red herring in other reports, particularly ASTERISK-12070 and ASTERISK-11890. It also came up as a probable red herring, here, for an issue I'm not going to start looking at.

We did get one case where we didn't get the warning on the very first call, but I couldn't reproduce that after a restart.

This is our queue definition:

joinempty = strict
eventwhencalled => vars
autofill = yes ; Recommended for new uses
autopause = yes ; Presumably only works for call back
eventmemberstatus = yes ; Can review later if too noisy
monitor-format = gsm ; A bit CPU intensive
monitor-type = MixMonitor
strategy = rrmemory
timeout=15
retry=5
wrapuptime=0
maxlen = 0
announce-frequency = 0
announce-holdtime = no

member => Agent/1001,,Anonymous

and this is the configuration part of the agent's SIP setup:

type=friend
host=dynamic
canreinvite=no ; Typically set to NO if behind NAT
disallow=all
allow=gsm ; GSM consumes far less bandwidth than ulaw
allow=ulaw
allow=alaw
qualify=yes ; Says we don't need for Xlite, but not so sure.
limitonpeers=yes ; Try and get good presence
call-limit=1 ; not ideal as ignores outgoing calls
md5secret=............
callerid=...........
context=...........
accountcode=..............

Comments: By: Mark Michelson (mmichelson) 2008-06-02 11:00:17

I took a bit of time to try to reproduce this and I couldn't do it. I set up a SIP friend to be exactly the same as yours and logged an Agent in on that SIP phone. Whenever I'd call into the queue and that Agent would answer, I never would see the message about reading UPGRADE.txt and the channel's device state would change to "in use."

On a separate note, you mentioned that you had difficulty with presence on outbound calls for SIP devices. I found in my test setup that outbound calls were not going towards the call-limit unless I had type=peer set. If I had type=friend, then I could make outbound calls on that SIP phone but still appear "Not in Use."
By: David Woolley (davidw) 2008-06-02 12:01:39

I think it is worth noting ASTERISK-1263709 and ASTERISK-1242492 which show that other people have this happening as well.

I can try separating the peer and user functions (although I need to confirm that this is acutally possible when there is a subscription).

However, are you saying that, even when SIP was not in a position to give accurate presence, you still weren't getting the warning? We have 1800+ of these messages (across several versions of 1.4) in the last month, on a test system that has just one, rather part time, agent.

My assumption was that it was trying to do presence on the underlying channel, not the agent. It looks like it might be doing it on the agent, in which case there is maybe a race condition.

It looks like it is a race condition - note that a different thread does the test from the one that updates the status; it is the actual channel whose status is updated synchronously, but it is the member structure whose state is checked, and the update to that is queued:

[Jun 2 17:23:42] DEBUG[25994] devicestate.c: Notification of state change to be queued on device/channel Agent/1001

[Jun 2 17:23:42] DEBUG[23452] devicestate.c: Changing state for Agent/1001 - state 3 (Busy)

[Jun 2 17:23:42] WARNING[25994] app_queue.c: The device state of this queue member, Anonymous, is still 'Not in Use' when it probably should not be! Please check UPGRADE.txt for correct configuration settings.
[Jun 2 17:23:42] DEBUG[25994] devicestate.c: Notification of state change to be queued on device/channel Local/XXXXXX@YYYYYYY-2a56,2
[Jun 2 17:23:42] DEBUG[25994] devicestate.c: Notification of state change to be queued on device/channel Local/XXXXXX@YYYYYYY
[Jun 2 17:23:42] DEBUG[23477] app_queue.c: Device 'Agent/1001' changed to state
'3' (Busy)
By: David Woolley (davidw) 2008-06-02 12:45:13

On the, new, race condition diagnosis, this is not a "tweak". It's either "minor", if the queueing mechanism is designed to cope with member status lagging the actual channel, status, or "major".
By: Mark Michelson (mmichelson) 2008-06-02 15:59:45

Excellent analysis on the race condition. I removed all options from my queue which could cause delay between when the agent answers and when that message appears (for instance, I had a two second member delay), and sure enough, I was able to make it happen.

As far as the impact this has, my quick analysis (meaning going through the scenario in my head, not necessarily combing all code paths) is that it may be possible for a second call to reach the agent before the device state change is noticed by app_queue (although I highly doubt it would happen given the amount of locking necessary before a call finally reaches an agent). If this were to happen, and the proper options were in place, that new caller may end up getting kicked from the queue. That's about the worst I can think of happening here as it is, so this is probably a "minor" issue.

As a temporary workaround, you can add in a 1 second member delay to your queue. I'm not going to make any guarantees one way or another, but I would be willing to bet that fixing such a race condition may be quite an undertaking. I'll have a closer look and see how much work it will take.
By: David Brillert (aragon) 2008-06-05 10:11:32

Putnopvut

I have a site that periodically reports that agents are receiving multiple calls.
Asterisk 1.4.21rc1

music = default
strategy = leastrecent
servicelevel = 60
context = default-aa-servicemainmenubutton1englishsubmenu
timeout = 15
retry = 5
maxlen = 0
weight = 1
reportholdtime = no
memberdelay = 0
timeoutrestart = no
autofill = yes
autopause = no
ringinuse = no
setinterfacevar = yes
wrapuptime = 0
joinempty = yes
leavewhenempty = no

Agents are logged to multiple queues and oddly the calls are presented to the agent from mutiple queues simultaneously. Each call presentation is approximately 2 seconds apart but a 4 line Aastra phone has received 4 simultaneous calls.
This condition appears very rarely but I did not think it should be possible for an agent to receive more that one ringing call from a queue(s)...
This has occured when the agent is "not in use" because the agent has just gone from paused to unpaused status while calls were waiting in queue(s).

Is it possible that this behavior is a result of this race condition you speak of?

What setting would be required to add the 1 second delay to the queue?
I assume memberdelay = 1

By: Mark Michelson (mmichelson) 2008-06-05 10:55:02

aragon:

I suspect that the reason that the agent can receive multiple calls from multiple queues is due to a race condition between the calling threads and the device state thread. I cannot, however, state with any certainty if the race is happening at the same place that davidw reported. As a result, I also cannot say for sure if adding the memberdelay (as you correctly assumed) will fix the issue.

I suspect that what's happening is that several different callers from a queue are nearly simultaneously discovering that the agent is available, and so those callers attempt to call the agent at the same time. The result of that is that before the agent's state can change to "ringing" or "in use" all the callers have called the agent at about the same time.

I'm going to devote some time today to inspect the odd device state behaviour that appears to be creeping up lately here and in issue ASTERISK-12036.
By: Tilghman Lesher (tilghman) 2008-06-20 18:42:17

I believe the patch from ASTERISK-12127 should fix this one, as well, so this issue will be closed.
By: David Woolley (davidw) 2008-06-27 08:47:21

It's still producing the message with 20080626__bug12773.diff.txt applied, and I can't see how that patch really changes it (but wanted to give it a chance).

1.4.21 with the patch applied:

[Jun 27 14:24:38] WARNING[28344] app_queue.c: The device state of this queue member, Anonymous, is still 'Not in Use' when it probably should not be! Please check UPGRADE.txt for correct configuration settings.
By: stp (stp) 2008-07-04 01:21:28

My system also have this warning message but I use SIP channel with QueueAddMemeber. In my case, the warning happens because delay of devicestate on SIP device.

There is addition code in devicestate.c , function __ast_device_state_changed_literal(), that will handle state change on both device and call. For example, when SIP answer, there will be notification on state change for device SIP/XXXX and for call SIP/XXXX-YYYYY. By adding the call to notification list, sip_devicestate() function will search for device state of the call SIP/XXXX-YYYY. Which is not possible to find, therefore, sip_devicestate() will call ast_gethostbyname() and get stuck there for few seconds (about 10 secs). Delay in sip_devicestate() cause notification thread not be able to process others state change for a while. So the everything that wait for state change will not work as expect. After change code in __ast_device_state_changed_literal, my problem is solved and warning for "UPGRADE.txt" gone.

I also believe that this also a cause for bug [0012680] too.

By: David Brillert (aragon) 2008-07-04 07:37:40

stp

I have problems with agents receiving multiple calls when available. Usually when agent returns to ready state from pause. On occasion it is possible for an agent to receive as many calls simultaneously as defined in call limit.
Reducing call limit is not practical or a good solution. Setting memberdelay=1 as putnopvut suggested has not removed the problem.

Another symptom of the problem is delayed call presentation to an available agent.
When this occurs core show channels will show SIP in use
show queues will show agent not in use
Waiting for the channel to free will fix the call presentation
soft hangup the busy channel will fix the call presentation
core show locks does not show any locks

I'm tracking progress of this issue on bugs 12771 12773 and 12916
I don't use channel agent either... so I am not adding notes to bug 12773 since it is under category chan_agent and patch is for chan_agent
Currently I have issues on 1.4.21 using SIP QueueAddMember

Reading your notes is appears you were able to fix your problem; what did you do to fix?
By: David Woolley (davidw) 2008-07-04 07:52:25

I think setting memberdelay=1 only removes the specific symptom of the warning message, not the underlying race condition.
By: stp (stp) 2008-07-04 10:07:23

aragon

just remove unique id from call SIP/XXXX-YYYY so only device name is added into notification list. here is the change in devicestate.c function static int __ast_device_state_changed_literal(char *buf)

device = buf;

/* add the following back */
tmp = strrchr(device, '-');
if (tmp)
*tmp = '\0';
.....

and remove these

if (!norecurse && (tmp = strrchr(device, '-'))) {
*tmp = '\0';
__ast_device_state_changed_literal(device, 1);
}

Another way is to change how sip_devicestate() search for the device. Instead of always calling ast_gethostbyname(), we should check if there is a real host name in SIP device name. It may possible to fix by setting correct DNS so that ast_gethostbyname() can return immediately instead of waiting for DNS server to response.

As in my system, this is not racing or any configuration problem. Just device state thread is waiting to finish state change on SIP call which I don't know why we need to handle this. So I just do quick fix as above change in code.
By: Joel Vandal (jvandal) 2008-07-08 10:44:56

aragon, this patch look to cause some issues with agent transfer that cannot receive new call until agent complete transfer, probably because we use the state_interface patch. I've fix a small issue on the 'state_interface' patch where the interface isn't saved on astdb (PersistentMembers).
By: Doug (doug) 2008-07-08 10:56:26

We have a similar problem with state always showing as "not in use" and we get many of the UPGRADE.txt messages. I have added a debug of a working system and a system that does not work both running the same Asterisk version. Both servers are very similar the onlt difference is 1 uses Centos 4.6 and another Centos 5. Centos 5 Server does not update state in queue.

el5-queue-state (state updates not working)
el4-queue-state (state updates working)
el4-el5-state-compare (comparison of state events)

Asterisk 1.4.21 (polycom phones)
By: Joel Vandal (jvandal) 2008-07-09 10:28:02

stp, FYI, your patch revert bug ASTERISK-9385 (rev 96575)

U branches/1.4/main/devicestate.c

------------------------------------------------------------------------
r96575 | tilghman | 2008-01-04 16:59:58 -0600 (Fri, 04 Jan 2008) | 7 lines

Fix the problem of notification of a device state change to a device with a '-'
in the name. Could probably do with a better fix in trunk, but this bug has
been open way too long without a better solution.
Reported by: stevedavies
Patch by: tilghman
(Closes issue 0009668)
By: Joel Vandal (jvandal) 2008-07-10 10:35:38

We made several test with DougUDI and aragon and all delay / devicestate problems are now fixed. Using patch (see comments) on 12771 and 12916.

All delay problems are definitively caused by a DNS timeout. If we revert patch r96575 (see ticket 12771), then all state problems are fixed.

The only issue we have is when a peer/friend have a - on the username, but on ScopServ GUI, we will add a warning message to avoid use of - on username.
By: Tilghman Lesher (tilghman) 2008-07-10 11:28:51

jvandal: could you explain why you think reverting the patch in ASTERISK-9385 is the key to fixing this problem?
By: David Brillert (aragon) 2008-07-10 12:01:08

corydon76

Take a look here
http://bugs.digium.com/view.php?id=12680#90027

Reverting r96575 fixes all problems with state changes.

Also I dont believe this 12771 ticket is a duplicate of 12773 since we don't use chan_agent. Many of us use dynamic agent and have state issues. This is a separate issue with root cause related to bug 9668 and/or 12680 both related to
U branches/1.4/main/devicestate.c

------------------------------------------------------------------------
r96575 | tilghman | 2008-01-04 16:59:58 -0600 (Fri, 04 Jan 2008) | 7 lines

Fix the problem of notification of a device state change to a device with a '-'
in the name. Could probably do with a better fix in trunk, but this bug has
been open way too long without a better solution.
Reported by: stevedavies
Patch by: tilghman
(Closes issue 0009668)

By: Jimbo (digmaster) 2008-07-11 04:57:08

Just wanted to let you know that we also have this problem with asterisk 1.4.20 and under trixbox/freepbx (asterisk 1.4.19 and 1.4.20).
By: Joel Vandal (jvandal) 2008-07-11 07:25:39

Corydon76: After made lots of debugging, each time a DeviceState is queued, we notice that it try to resolve (DNS) all hosts and timeout after ~20 seconds...

Channel is SIP/1233-0b123abc ...

With r96575 : lookup for 123-0b123abc .. timeout ... 123 (no query.. local peer/user) ...

without r96575: lookup for 123 (no query... peer/user)

Now when I check with tcpdump on port 53, we now have ZERO query... with 'r96575' we have about 5-10 queries per seconds.

The problem we have is that if we revert the patch, all username that contain a - will not be able to get proper status. (Ex. SIP/1234-OUT-0b123abc will lookup only for 1234 instead of 1234-OUT).

I think that 9668 need to be re-open and continue 'thread' on this ticket.
By: Tilghman Lesher (tilghman) 2008-07-11 09:34:56

jvandal: I have uploaded a patch that changes the way the patch in 9668 behaves. Does this improve the situation?
By: Joel Vandal (jvandal) 2008-07-11 09:39:39

corydon76: thanks for this patch, look like exactly what I'm trying to do (but very low knowledge of C and Asterisk API).

+ if (!norecurse && (tmp = strrchr(device, '-')) && (tmpchan = ast_get_channel_by_name_locked(device))) {

Will try to test this patch tonight and let you know ASAP.
By: Mark Michelson (mmichelson) 2008-07-11 09:41:48

I think the patch that has been uploaded won't help because it still unconditionally queues a device state change for the device name with the '-' and suffix. This device state change is the one that is invalid the majority of the time.
By: Mark Michelson (mmichelson) 2008-07-11 10:26:17

If Corydon76's patch does not work properly, then give my patch a shot. It's called 12771.patch.
By: Tilghman Lesher (tilghman) 2008-07-11 11:34:19

Okay, I didn't want to have to do that, but it seems I have no choice but to check the channel list everytime there's a '-' passed in the devicename. New patch uploaded.
By: Mark Michelson (mmichelson) 2008-07-11 13:42:29

Oh cripes, I uploaded the wrong patch. Use Corydon76's latest patch since it's nearly exactly the same as what I had meant to upload.
By: Joel Vandal (jvandal) 2008-07-11 14:26:11

Will test this patch (20080711__bug12771.diff.txt) on some servers this weekend.

DougUDI, new packages are available on download server, please test.
By: Joel Vandal (jvandal) 2008-07-12 08:44:36

festr, look like you are using chan_agent (Agent/116), perhaps look patch from 0012773 (now commited to SVN branches/1.4).
By: Martin Vit (festr) 2008-07-14 03:22:47

I've applied patch from bug:12773 (chan_agent.c) but the symptoms are the same. Core dump and the last message was

"app_queue.c:3014 in try_calling: The device state of this queue member, Agent/111, is still 'Not in Use' when it pr
obably should not be! Please check UPGRADE.txt for correct configuration settings."

(notice: app_queue.c is patched with czech localization)

jvandal: have you tried 20080711__bug12771.diff.txt? I'm going to test it.
By: Doug (doug) 2008-07-15 02:32:41

Jvandal - The latest patch (20080711__bug12771.diff.txt) seems to cause more problems than previous. States are updated but seem not to release the stste in time keeping the agent in a "ringing" or "in use" state. If i reload app_queue the state refreshes. The UPGRADE.txt messages also apear with this version.
By: Martin Vit (festr) 2008-07-15 09:28:27

I'm able to always reproduce on clean asterisk (SVN-branch-1.4-r130889) with minimal configuration

"app_queue.c:3106 try_calling: The device state of this queue member, Local/510@mytest, is still 'Not in Use' when it probably should not be! Please check UPGRADE.txt for correct configuration settings"

queues.conf:

[general]
persistentmembers = yes
autofill = yes
[callcentrum]
strategy = fewestcalls
member => Local/510@mytest,0

extension.conf

[extensions-static-global]
exten => 91,1,Queue(callcentrum|rTt|||60)

[mytest]

exten => 510,1,Dial(SIP/test,,tT)

scenario: SIP/A -> Queue(callcentrum|rTt|||60) -> Local/510@mytest -> Dial(SIP/test,,tT)

I'll keep this virtual box for testing.
By: Joel Vandal (jvandal) 2008-07-15 10:30:04

To complete latest DougUDI post, using 20080711_bug12771, we still have weird DNS query (gethostbyname) that cause some lag.

Where revert of 96575 fix all issues (except - on a peer name).
By: Tilghman Lesher (tilghman) 2008-07-17 13:18:13

Okay, let's try this tack, which reverts revision 96575, and instead makes the API more strict in terms of what the device name can be. This forces the unique ID to be stripped everywhere before calling the function to return a device state.
By: Joel Vandal (jvandal) 2008-07-17 13:40:12

Thanks Corydon76, I have apply this patch on 1.4.21.1 and will test with help of DougUDI. Let you know very soon!
By: David Woolley (davidw) 2008-07-18 06:33:17

The 20080717 patch doesn't actually take against 1.4.21.1! It does take against the SVN HEAD version for 1.4.

patching file channels/chan_agent.c
Hunk #1 FAILED at 289.
Hunk #2 FAILED at 527.
Hunk #3 FAILED at 750.
Hunk #4 FAILED at 874.
4 out of 4 hunks FAILED -- saving rejects to file channels/chan_agent.c.rej
patching file main/channel.c
Hunk #1 succeeded at 1184 (offset -17 lines).
Hunk #3 succeeded at 3663 (offset -9 lines).
patching file main/devicestate.c
By: Joel Vandal (jvandal) 2008-07-18 06:49:04

I have manually apply the patch since it fail using 'patch' command.

Have only edit main/devicestate.c and main/channel.c since we dont use chan_agent.
By: Steve Murphy (murf) 2008-07-25 09:48:35

Corydon76-- I uploaded a patch for trunk that hoped to implement your changes, but it's nowhere near the performance I had hoped for. I must have mucked it up somewhere. Can you check it?
By: Digium Subversion (svnbot) 2008-07-25 12:11:33

Repository: asterisk
Revision: 133649

U branches/1.4/channels/chan_agent.c
U branches/1.4/main/channel.c
U branches/1.4/main/devicestate.c

------------------------------------------------------------------------
r133649 | tilghman | 2008-07-25 12:11:32 -0500 (Fri, 25 Jul 2008) | 8 lines

Fix some errant device states by making the devicestate API more strict in
terms of the device argument (only without the unique identifier appended).
(closes issue ASTERISK-12125)
Reported by: davidw
Patches:
20080717__bug12771.diff.txt uploaded by Corydon76 (license 14)
Tested by: davidw, jvandal, murf

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=133649