[Home]

Summary:ASTERISK-12127: [patch] Agent doesn't propagate device state changes from underlying device.
Reporter:David Woolley (davidw)Labels:
Date Opened:2008-06-02 11:30:09Date Closed:2010-01-05 16:58:47.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_agent
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20080710__bug12773.diff.txt
( 1) NotPropagating2.txt
( 2) output.txt
Description:Given a SIP phone logged in with AgentLogin, and monitored with qualify, if you remove the power on the phone (or, presumably, unplug it from the network) Asterisk detects that the phone has become unavailable, but the agent continues to receive calls, resulting in callers going into a black hole.

Ideally, I believe that the agent should detect the state change on the underlying device and change its own state to logged off, and therefore generate a state change to CHANUNVAIL.  This would result in correct hinting for the agent's state.

****** ADDITIONAL INFORMATION ******

When a SIP qualify failure occurs, chan_sip reports a device state change to unavailable (SIP/4222 is logged in as Agent/1001).

[Jun  2 17:02:39] VERBOSE[23470] logger.c: Really destroying SIP dialog '3eeb064f1b3a2eb419825c043bfc1016@192.168.130.116' Method: OPTIONS
[Jun  2 17:02:39] DEBUG[23470] devicestate.c: Notification of state change to be queued on device/channel SIP/4222
[Jun  2 17:02:39] DEBUG[23452] chan_sip.c: Checking device state for peer 4222

This is split into a state change call back and a hint update.  In principle anything can register for call backs, but in practice only app_queue.c actually does:

[Jun  2 17:02:39] DEBUG[23477] app_queue.c: Device 'SIP/4222' changed to state '5' (Unavailable) but we don't care because they're not a member of any queue.

However, as app_queue.c only knows about the agent, not the underlying channel, it discards the change notification and continues to offer calls.

One might be able to have app_queue.c recognize that the channel was supporting the agent, but then it would have to try and combine states and, unless it did what I think the agent code should do, hints would not get updated.
Comments:By: Tilghman Lesher (tilghman) 2008-06-17 19:12:43

Let's try something like this.  Testing is needed, because I don't use Agents very much.

By: David Woolley (davidw) 2008-06-20 05:35:04

The test rig I normally use wasn't available, so I got one of the marketing people who is involved in the internal procurement to apply and test the patch.  He says:

"The agent is not logged off and also continues to receive calls.

Below [attached] shows when I unplugged the phone, waited for it to be unreachable and then queued a call, which was sent to the agent."

Looking at the patch, I wonder if you should be stripping -xxxxxx from loginchan, before comparing.

By: Tilghman Lesher (tilghman) 2008-06-20 09:14:27

Okay, I see where I might have gone wrong.  Let's try this patch, instead.

By: James Taylor (jamestaylor) 2008-06-22 19:01:45

I added this to version 1.4.21 and it caused a segmentation fault during start up. Is a core dump neccessary?



app_echo.so => (Simple Echo Application)
 == Registered application 'Milliwatt'
app_milliwatt.so => (Digital Milliwatt (mu-law) Test Application)
 == Registered application 'ADSIProg'
app_adsiprog.so => (Asterisk ADSI Programming Application)
 == Registered translator 'g726tolin' from format g726 to slin, cost 1
[Jun 23 01:07:22] NOTICE[3505]: chan_sip.c:12669 handle_response_peerpoke: Peer 'xxxx' is now Reachable. (9ms / 2000ms)
Segmentation fault (core dumped)

By: Tilghman Lesher (tilghman) 2008-06-23 15:14:49

Yes, I will need a backtrace from your core.

By: James Taylor (jamestaylor) 2008-06-24 00:32:11

I have attached bt and bt full of the core dump (output.txt). Please let me know if you need anything else.

By: Tilghman Lesher (tilghman) 2008-06-24 17:36:34

I really would appreciate if you would not sanitize the output.  It makes it very difficult to figure out why it's crashing.

By: Tilghman Lesher (tilghman) 2008-06-24 17:41:17

New patch uploaded.

By: David Woolley (davidw) 2008-06-25 02:13:52

The alternative to sanitizing is for us to raise all issues based on real world cases as private; my feeling is that this sort of bug reporting system works best when reports are public.

The reasons for sanitizing are:

- commercial sensitivities - I could explain these but only privately;
- UK Data Protections legislation means that we have to be very careful with the names of living persons - that generally applies within the EU;
- potentially security; for this and other reasons, I wouldn't want routable IP addresses.

I didn't prepare the back trace, but when I do sanitize, I make sure that I don't change the length of strings and that distinct strings are  always transformed in the same way.

I'll ask if our people could pre-obfuscate personal names in the test configuration.

By: David Woolley (davidw) 2008-06-25 09:49:55

It's stable and recognizing that the SIP device is associated with the agent, but the agent is not going unavailable.  I'm still trying to work out what is wrong.

[Jun 25 15:32:39] NOTICE[10058] chan_sip.c: Peer '4222' is now UNREACHABLE!  Last qualify: 89
[Jun 25 15:32:39] DEBUG[10058] chan_sip.c: Destroying SIP dialog 483a739c0083a38d065929f633d76bdb@192.168.130.116
[Jun 25 15:32:39] VERBOSE[10058] logger.c: Really destroying SIP dialog '483a739c0083a38d065929f633d76bdb@192.168.130.116' Method: OPTIONS
[Jun 25 15:32:39] DEBUG[10058] devicestate.c: Notification of state change to be queued on device/channel SIP/4222
[Jun 25 15:32:39] DEBUG[10040] chan_sip.c: Checking device state for peer 4222
[Jun 25 15:32:39] DEBUG[10040] devicestate.c: Changing state for SIP/4222 - state 5 (Unavailable)
[Jun 25 15:32:39] DEBUG[10040] devicestate.c: Notification of state change to be queued on device/channel Agent/1001
[Jun 25 15:32:39] DEBUG[10040] chan_sip.c: Checking device state for peer 4222
[Jun 25 15:32:39] DEBUG[10040] devicestate.c: Changing state for Agent/1001 - state 1 (Not in use)
[Jun 25 15:32:39] DEBUG[10065] app_queue.c: Device 'SIP/4222' changed to state '5' (Unavailable) but we don't care because they're not a member of any queue.
[Jun 25 15:32:39] DEBUG[10065] app_queue.c: Device 'Agent/1001' changed to state '1' (Not in use)

By: David Woolley (davidw) 2008-06-25 10:04:38

I think the problem is this:

!ast_strlen_zero(p->loginchan) &&

it stops the inherited status being used for AgentLogin type agents.

By: David Woolley (davidw) 2008-06-25 10:37:21

Also, I think that one should be comparing against loginchan and basename, rather than p->chan->name and basename.  Note the loginchan check is useful even when p->chan is NULL, as one doesn't want to call an inaccessible agent.

Unfortunately, our setup is based on AgentLogin, not callback agents.

By: Tilghman Lesher (tilghman) 2008-06-25 10:41:09

I thought loginchan was the obvious field to use, too, until I actually checked what was stored in that field.  It's an extension and context, not a name of a channel.

By: David Woolley (davidw) 2008-06-25 10:51:37

Using basename is OK for us, but I'm not sure that chan->name is useful, even for agent callback, as chan->name will always be a channel name rather than just technology/address.

I'd guess you'd have to use hints for agent callback, but then the agent extension might have a list in its Dial command.

By: Tilghman Lesher (tilghman) 2008-06-25 11:02:57

To understand why we need to do that, you need to understand something about how device states work.  Sometimes the device name sent is the full channel name, with the suffix and sometimes the device name sent is without the suffix.  Hence, we compare with both variants.

By: David Woolley (davidw) 2008-06-25 11:27:53

It looks like AgentCallbackLogin is deprecated, and the suggested alternative uses a local channels to fulfil the role of the Agent, which means the problem becomes one for local channels.

In the short term, though, the comment on the declaration of loginchan is wrong!

By: David Woolley (davidw) 2008-06-26 05:54:40

I got this feedback from a person able to test this:

"The disappearing agent trick seems to work much better now.

If I pull the data cable on the agents phone, wait for unreachable and then call into the queue it just rings out. I left it for over a minute and then plugged the data cable back in. The call was answered as soon as the endpoint was detected as reachable.

The only wrinkle I could see was that the agent's status was still "logged in" even though they were not reachable."

Where I have some reservations are:

- I'm not sure what the "has owner" test was trying to achieve, so I'm not sure whether the "inherited status > -1" test is completely safe (e.g. if it is trying to protect against accessing a partially initialised structure, inherited status might be zero);

- I wonder if giving inherited status absolute priority is the right thing.

However, in terms of soak testing this, currently the marketing person is on leave and the developer is working on other aspects of the application so isn't extensively exercising the system, so empirical evidence of any collateral damage may be slow in coming.

By: David Woolley (davidw) 2008-06-26 10:06:34

Inherited status having precedence means that, after the connection recovers, the queue member status is stuck at "not in use", even though the agent is actually busy.  I think this would cause problems if there were multiple incoming queues.  In the following example, the queues are manipulated in ways that make it difficult or impossible to create eligible calls on multiple queues, so I can only demonstrate the status.  Our use of paused is unusual; it might be preventing our getting problems when a second call arrives, even with one real incoming queue - there is only one call in these examples.

The queue names have been changed.  The third queue is the one for which a call is actually being handled.

First a control.  Busy on one call before the phone went down.  Note queues show busy.

AAAAAAAAAAAA has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W:1, C:0, A:0, SL:0.0% within 0s
  Members: I>
     Anonymous (paused) (Busy) has taken no calls yet
  No Callers>
localhost*CLI>
BBBBBBBBBBBB has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W:2, C:0, A:0, SL:0.0% within 0s
  Members: I>
     Anonymous (Busy) has taken no calls yet
  No Callers>
localhost*CLI>
CCCCCCCCCCCC has 0 calls (max unlimited) in 'rrmemory' strategy (11s holdtime), W:0, C:6, A:0, SL:16.7% within 0s
  Members: I>
     Anonymous (paused) (Busy) has taken 6 calls (last was 6392 secs ago)
  No Callers>

agent show reports:

logged in on SIP/4222-09ad1ad8 talking to ..... (musiconhold is 'default')

After disconnecting and reconnecting the phone, the queues are showing the agent as not in use, even though there is a call active on the third queue.

localhost*CLI> queue show
AAAAAAAAAAAA has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W:1, C:0, A:0, SL:0.0% within 0s
  Members: I>
     Anonymous (paused) (Not in use) has taken no calls yet
  No Callers>
localhost*CLI>
BBBBBBBBBBBB has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W:2, C:0, A:0, SL:0.0% within 0s
  Members: I>
     Anonymous (Not in use) has taken no calls yet
  No Callers

CCCCCCCCCCC has 0 calls (max unlimited) in 'rrmemory' strategy (8s holdtime), W:0, C:7, A:0, SL:14.3% within 0s
  Members:
     Anonymous (paused) (Not in use) has taken 7 calls (last was 83 secs ago)
  No Callers

logged in on SIP/4222-09ad1ad8 talking to ...... (musiconhold is 'default')

All the devicestate changes for the agent, after the phone comes back, give a not in use state (until the phone goes unreachable again, but that's a different issue).

[Jun 26 14:47:00] DEBUG[16601] devicestate.c: Changing state for SIP/4222 - stat
e 5 (Unavailable)
[Jun 26 14:47:00] DEBUG[16601] devicestate.c: Notification of state change to be
queued on device/channel Agent/1001
[Jun 26 14:47:00] DEBUG[16601] devicestate.c: Changing state for Agent/1001 - st
ate 5 (Unavailable)
[Jun 26 14:47:10] DEBUG[16621] devicestate.c: Notification of state change to be
queued on device/channel SIP/4222
[Jun 26 14:47:10] DEBUG[16601] devicestate.c: Changing state for SIP/4222 - state 1 (Not in use)
[Jun 26 14:47:10] DEBUG[16601] devicestate.c: Notification of state change to be queued on device/channel Agent/1001
[Jun 26 14:47:10] DEBUG[16601] devicestate.c: Changing state for Agent/1001 - state 1 (Not in use)

Call answered here

[Jun 26 14:47:29] DEBUG[19792] devicestate.c: Notification of state change to be queued on device/channel Agent/1001
[Jun 26 14:47:29] DEBUG[16601] devicestate.c: Changing state for Agent/1001 - state 1 (Not in use)

By: Tilghman Lesher (tilghman) 2008-06-26 10:29:50

Well, we can move it down one in the logic, I suppose.  The real problem is that the devicestate changes are using the deprecated syntax, to ask what the current state is upon a change, rather than telling the infrastructure what the new state should be.

By: David Woolley (davidw) 2008-06-27 08:40:07

That's better.

By: Tilghman Lesher (tilghman) 2008-06-27 09:05:20

Are you happy with this patch as it currently stands?  Can it be committed and this issue closed?

By: David Woolley (davidw) 2008-06-27 09:35:46

Yes.  I can't see any obvious adverse side effects.

By: David Woolley (davidw) 2008-07-04 05:56:15

It looks like we are having a problem with the status not resetting when the agent logs out, at least when they do it by hanging up.

1001 (Xxxx Xxxxxx) not logged in (musiconhold is 'default')

  Members: I>
     Anonymous (Not in use) has taken 5 calls (last was 62223 secs ago)

   -= Registered Asterisk Dial Plan Hints =-
             Agent1001@internal : Agent/1001 State:Idle
         Watchers 1
                  6102@default : SIP/yyyyyyyyyyyyy State:Idle
         Watchers 1

6102 was the last agent in this case. Making 6102 unavailable doesn't make the agent unavailable, so it is presumably using information cached from when the agent was logged in.

The hint isn't, currently at least, critical to the application; it was just something I did to monitor what was going on.

By: David Woolley (davidw) 2008-07-04 06:36:29

Potentially anywhere that p->chan gets set to NULL, inherited status should get reset.  (Some of these will be redundant, as they relate to callback logins, but I think that the bit near line 1577 is already only applied for callback.)

In this case, I think the specific problem is the one near line 2242 in __login_exec.

Resetting inherited state immediately after that reset of p->chan seems to remove the symptom, here.

By: Tilghman Lesher (tilghman) 2008-07-10 10:01:44

Patch updated.

By: Digium Subversion (svnbot) 2008-07-11 11:43:04

Repository: asterisk
Revision: 130102

U   branches/1.4/channels/chan_agent.c

------------------------------------------------------------------------
r130102 | tilghman | 2008-07-11 11:43:02 -0500 (Fri, 11 Jul 2008) | 9 lines

Pass the devicestate from an underlying channel up through the Agent channel.
This should make the Agent always report the correct device state, even when
the underlying channel is used for other purposes.
(closes issue ASTERISK-12127)
Reported by: davidw
Patches:
      20080710__bug12773.diff.txt uploaded by Corydon76 (license 14)
Tested by: davidw

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=130102