ASTERISK-12043: Extension state change random delay

[Home]

Summary: ASTERISK-12043: Extension state change random delay

Reporter: Andrey Solovyev (corruptor) Labels:

Date Opened: 2008-05-19 05:04:09 Date Closed: 2008-12-13 18:34:54.000-0600

Priority: Minor Regression? No

Status: Closed/Complete Components: Channels/chan_sip/Subscriptions

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) notify_delay3.txt

Description: asterisk-1.4.20-rc3
Grandstream GXP-2000 is registered as 1100 and subscribed to receive changes about extensions states.
SIP/1101 is Linksys 942
SIP/1102 is SjPhone.
I've also set notifyringing=yes.

1102 calls 1101. 1101 sends RINGING but asterisk sends NOTIFY to Grandstream with 3 sec delay. You can see (log is attached) that RINGING was at 19:11:27 and NOTIFY is sent at 19:11:30.
Later when 1101 answers you can see that device state is changed 9 seconds after call have been established (19:11:50] DEBUG[24559] devicestate.c: Changing state for SIP/1102-09551818 - state 4 (Invalid))
When call is over asterisk changes state for one of the extensions almost instantly but there is 20 seconds delay for 1102 (19:12:20).
Please help to debug this problem.

Comments: By: Olle Johansson (oej) 2008-07-01 08:09:28

Is this problem still valid in latest 1.4 release?

You need to tell us more about your configuration. Platform, if you use realtime with a database and other configuration issues.
By: Joel Vandal (jvandal) 2008-07-09 11:44:17

We having same symptom using 1.4.21.1 where it can take up to 20 seconds before update the device state notification.

oej: When you talk about 'latest 1.4', you mean branches/1.4 svn ?
By: Olle Johansson (oej) 2008-07-09 11:48:58

Well, as a developer that's what I mean. My apologies for not explaining that.
By: Joel Vandal (jvandal) 2008-07-09 12:34:49

Hi,

Just tested with r129434 (branches/1.4, no custom patches) and still have delay problems :(

If I correctly read log, the notification has been queued at 19:33:44 and updated at 19:34:05 ?

[Jul 9 19:33:44] DEBUG[18320] devicestate.c: Notification of state change to be queued on device/channel SIP/1820
[Jul 9 19:33:44] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:33:47] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:33:49] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:33:51] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:33:53] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:33:56] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:33:58] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:34:01] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:34:03] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:34:05] DEBUG[18342] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:34:05] DEBUG[18315] chan_sip.c: Checking device state for peer 1820
[Jul 9 19:34:05] DEBUG[18315] devicestate.c: Changing state for SIP/1820 - state 1 (Not in use)
[Jul 9 19:34:05] DEBUG[18315] chan_sip.c: Checking device state for peer 1820

I currently have only 1 agent logged (AddQueueMember) and the status is never updated on 'show queues'.

Strange that this problem isn't present on all our 'servers'.
By: Doug (doug) 2008-07-10 04:54:38

as per Jvandal - the problem has now eculated. We now get late state updates where the state is put into "in use" some time after the call is picked up. The problem is that the state then remains ths way and no calls are passed to an agent untill I reload the app_queue.

for now i will look into adding a script to do a app_queue reload each 1 min but this is not ideal as it can still cause a 1 min delay on an agents calls.
By: David Woolley (davidw) 2008-07-10 05:49:14

I would say the delay was the same basic race condition issue as ASTERISK-12125. I'm not sure why you are getting such a huge scheduling delay, but could either of the following apply in your case:

- the machine is 100% CPU;
- virtual memory is thrashing, such that the code that does the asynchronous update is page faulting and being delayed as a result

There seems to be a fundamental problem in doing this update asynchronously, which as, I think, hinted for ASTERISK-12125 may not be easy to fix.

I don't understand why you are losing the transition to not-in-use.
By: Doug (doug) 2008-07-10 06:37:08

Think this can have anything to do with the problem?

Jul 10 09:51:40 electrolux kernel: hdc: status timeout: status=0xd0 { Busy }
Jul 10 09:51:40 electrolux kernel: ide: failed opcode was: unknown
Jul 10 09:51:40 electrolux kernel: hdc: no DRQ after issuing MULTWRITE_EXT
Jul 10 09:51:40 electrolux kernel: ide1: reset: success
By: David Woolley (davidw) 2008-07-10 07:04:04

That would certainly seriously delay anything that was dependent on the disk (e.g. to service a page fault).

IDE resets are hardware or IDE device driver problems. Failing to get a confirmed completion on a write may mean whatever was being written is in a somewhat indeterminate state.
By: David Brillert (aragon) 2008-07-10 08:18:06

This problem is causing problems all over the bug tracker in relation to agents not receiving calls.

I'm certainly no expert but based on comments derived from other tickets; to me this looks like some sort of DNS delay issue.
I think root cause may be buried in an earlier revision which causes DNS delay or failure to improperly update channel or hint status.

Repository: asterisk
Revision: 96575

U branches/1.4/main/devicestate.c

------------------------------------------------------------------------
r96575 | tilghman | 2008-01-04 16:59:58 -0600 (Fri, 04 Jan 2008) | 7 lines

Fix the problem of notification of a device state change to a device with a '-'
in the name. Could probably do with a better fix in trunk, but this bug has
been open way too long without a better solution.
Reported by: stevedavies
Patch by: tilghman
(Closes issue 0009668)

I believe relates to 12771 12680 12672 12627 and probably 12773

I like stp comments in:

http://bugs.digium.com/view.php?id=12771#89723
"There is addition code in devicestate.c , function __ast_device_state_changed_literal(), that will handle state change on both device and call. For example, when SIP answer, there will be notification on state change for device SIP/XXXX and for call SIP/XXXX-YYYYY. By adding the call to notification list, sip_devicestate() function will search for device state of the call SIP/XXXX-YYYY. Which is not possible to find, therefore, sip_devicestate() will call ast_gethostbyname() and get stuck there for few seconds (about 10 secs). Delay in sip_devicestate() cause notification thread not be able to process others state change for a while. So the everything that wait for state change will not work as expect. After change code in __ast_device_state_changed_literal, my problem is solved and warning for "UPGRADE.txt" gone.

I also believe that this also a cause for bug [0012680] too."

http://bugs.digium.com/view.php?id=12771#89750
"Another way is to change how sip_devicestate() search for the device. Instead of always calling ast_gethostbyname(), we should check if there is a real host name in SIP device name. It may possible to fix by setting correct DNS so that ast_gethostbyname() can return immediately instead of waiting for DNS server to response."
By: Joel Vandal (jvandal) 2008-07-10 08:19:33

After some search (thanks aragon), we found this ticket that look like same issue we have on device state :

0012627: The device state change thread gets locked (or waits) for 20 seconds

I think root cause for this issue must be buried in revision 96575

U branches/1.4/main/devicestate.c

------------------------------------------------------------------------
r96575 | tilghman | 2008-01-04 16:59:58 -0600 (Fri, 04 Jan 2008) | 7 lines

Fix the problem of notification of a device state change to a device with a '-'
in the name. Could probably do with a better fix in trunk, but this bug has
been open way too long without a better solution.
Reported by: stevedavies
Patch by: tilghman
(Closes issue 0009668)

IMHO, theses tickets need relationships :

0012771: Bogus <member> is still 'Not in Use' warnings for AgentLogin'ed agents.
0012627: The device state change thread gets locked (or waits) for 20 seconds
0012916: Multiple incoming call to free agent
By: Andrey Solovyev (corruptor) 2008-07-10 08:56:37

Sorry for being silent about this problem.
I was trying to reproduce this problem on many other machines but it worked ok. As for now I am unable to get access to the server which had problem.
I will report later if I experience this behaviour once again.
By: Joel Vandal (jvandal) 2008-07-10 10:33:01

We made several test with DougUDI and aragon and all delay problems are now fixed. Using patch (see comments) on 12771 and 12916.

All delay problems are definitively caused by a DNS timeout. If we revert patch r96575 (see ticket 12771), then all state problems are fixed.

The only issue we have is when a peer/friend have a - on the username, but on ScopServ GUI, we will add a warning message to avoid use of - on username.
By: David Brillert (aragon) 2008-07-10 11:52:21

IMHO bug 9668 should be re-opened because
The revision broke state changes creating multiple bug reports 12771 12680 12627 12672 by creating a problem with DNS failures which result in hung channels and hints.

If it is easier then a patch could be written for 12680 and relationships added to the other open tickets while leaving 9668 closed.

Reverting the revision 96575 fixes the apparently fictitious "race condition"

Have a look at these DNS errors responsible for delaying the extension states.
The problem is when the SIP channel is tagged with a - and a random tag as in SIP/1820-08845b70 and the DNS lookup fails.

[Jul 8 17:24:58] DEBUG[22329] devicestate.c: Notification of state change to be queued on device/channel SIP/9005
[Jul 8 17:24:58] DEBUG[22266] channel.c: Internal timing is disabled (option_internal_timing=0 chan->timingfd=66)
[Jul 8 17:24:58] DEBUG[29819] devicestate.c: No provider found, checking channel drivers for SIP - 9005
[Jul 8 17:24:58] DEBUG[29819] chan_sip.c: Checking device state for peer 9005
[Jul 8 17:24:58] DEBUG[29819] devicestate.c: Changing state for SIP/9005 - state 6 (Ringing)
[Jul 8 17:24:58] DEBUG[29819] devicestate.c: No provider found, checking channel drivers for SIP - 9005
[Jul 8 17:24:58] DEBUG[29819] chan_sip.c: Checking device state for peer 9005
[Jul 8 17:24:58] DEBUG[29819] devicestate.c: No provider found, checking channel drivers for SIP - 9005
[Jul 8 17:24:58] DEBUG[29819] chan_sip.c: Checking device state for peer 9005
[Jul 8 17:24:58] VERBOSE[29819] logger.c: Extension Changed 9005[default-local] new state Ringing for Notify User 9007
[Jul 8 17:24:58] DEBUG[29853] app_queue.c: Device 'SIP/9005' changed to state '6' (Ringing)
[Jul 8 17:24:58] DEBUG[22329] chan_sip.c: Our T38 capability (0), joint T38 capability (0)
[Jul 8 17:24:58] DEBUG[22329] chan_sip.c: ** Our capability: 0x8 (alaw) Video flag: False
[Jul 8 17:24:58] DEBUG[22329] chan_sip.c: ** Our prefcodec: 0x8 (alaw)
[Jul 8 17:24:58] DEBUG[22329] chan_sip.c: -- Done with adding codecs to SDP
[Jul 8 17:24:58] DEBUG[22329] channel.c: Internal timing is disabled (option_internal_timing=0 chan->timingfd=70)
[Jul 8 17:24:58] DEBUG[22329] chan_sip.c: Done building SDP. Settling with this capability: 0x8 (alaw)

[Jul 8 17:35:57] DEBUG[5980] devicestate.c: Notification of state change to be queued on device/channel SIP/1820
[Jul 8 17:35:57] DEBUG[5980] chan_sip.c: Our T38 capability (0), joint T38 capability (0)
[Jul 8 17:35:57] DEBUG[5980] chan_sip.c: ** Our capability: 0xa (gsm|alaw) Video flag: False
[Jul 8 17:35:57] DEBUG[5980] chan_sip.c: ** Our prefcodec: 0x8 (alaw)
[Jul 8 17:35:57] DEBUG[5980] chan_sip.c: -- Done with adding codecs to SDP
[Jul 8 17:35:57] DEBUG[5980] chan_sip.c: Done building SDP. Settling with this capability: 0xa (gsm|alaw)
[Jul 8 17:35:57] VERBOSE[5980] logger.c: -- Called 1820
[Jul 8 17:35:57] DEBUG[8924] chan_sip.c: (Provisional) Stopping retransmission (but retaining packet) on '098cca77047e2253118b86f561fbb742@192.168.21.250' Request 102: Found
[Jul 8 17:35:57] DEBUG[8945] manager.c: Manager received command 'Command'
[Jul 8 17:35:57] DEBUG[8945] app_queue.c: Queue default-service_centre has no realtime members defined. No need for update
[Jul 8 17:35:57] DEBUG[8945] manager.c: Manager received command 'Command'
[Jul 8 17:35:57] DEBUG[8945] app_queue.c: Queue default-spares_department has no realtime members defined. No need for update
[Jul 8 17:35:57] DEBUG[8945] manager.c: Manager received command 'Command'
[Jul 8 17:35:57] DEBUG[8924] chan_sip.c: (Provisional) Stopping retransmission (but retaining packet) on '098cca77047e2253118b86f561fbb742@192.168.21.250' Request 102: Found
[Jul 8 17:35:57] DEBUG[8924] devicestate.c: Notification of state change to be queued on device/channel SIP/1820-08845b70
[Jul 8 17:35:57] DEBUG[8924] devicestate.c: Notification of state change to be queued on device/channel SIP/1820
[Jul 8 17:35:57] VERBOSE[5980] logger.c: -- SIP/1820-08845b70 is ringing
[Jul 8 17:35:57] DEBUG[5980] rtp.c: Channel 'Local/1820@default-agent-60a9,2' has no RTP, not doing anything
[Jul 8 17:35:57] VERBOSE[5971] logger.c: -- Local/1820@default-agent-60a9,1 is ringing
[Jul 8 17:35:57] DEBUG[8945] manager.c: Manager received command 'Command'
[Jul 8 17:35:57] DEBUG[8945] app_queue.c: Queue default-internal_sales has no realtime members defined. No need for update
[Jul 8 17:35:57] DEBUG[5979] audiohook.c: Read factory 0x88441c8 was pretty quick last time, waiting for them.

tcpdump -i any port 53 -n -l

15:42:29.089068 IP 192.168.13.3.39661 > 128.63.2.53.domain: 19578 A? 1815-b67b5b58. (31) 15:42:30.077827 IP 192.168.13.3.39661 > 192.228.79.201.domain: 12403 [1au] A? 9000-08f8a300. (42) 15:42:32.079329 IP 192.168.13.3.39661 > 192.33.4.12.domain: 21764 [1au] A? 9000-08f8a300. (42) 15:42:32.090282 IP 192.168.13.3.39661 > 192.36.148.17.domain: 12227 A? 1815-b67b5b58. (31)

5:42:58.067823 IP 127.0.0.1.51503 > 127.0.0.1.domain: 43072+ A? 9000-08f8a300. (31) 15:42:58.076865 IP 127.0.0.1.domain > 127.0.0.1.51501: 22664 ServFail 0/0/0 (31) 15:42:58.076884 IP 127.0.0.1.domain > 127.0.0.1.51501: 22664 ServFail 0/0/0 (31)
By: David Woolley (davidw) 2008-07-10 12:34:44

The race condition exists because the state change is queued. It could only be removed if all queue member states, referencing the same device, were locked until the update completes, in which case you might as well update the state directly and synchronously! DNS failures could increase the size of the failure window, but ASTERISK-12125 only needs a very small time window. The resulting symptom is harmless, but it indicates an exposure that could catch actual calls.

DNS failures may well account for this case, however, we have also had reports that the disk controller is broken. They are not relevant to ASTERISK-12125. Basically the race condition is not ficticious.
By: Andrey Solovyev (corruptor) 2008-07-16 05:18:15

Thx for the hint about DNS.
My delays were defenately caused by DNS timeout.
When asterisk server has access to DNS there are no delays.
By: David Brillert (aragon) 2008-07-29 07:33:30

We have been using patch from 12771 and no longer experience delays in state change.
IMHO this ticket is related to bug 12771
By: snuffy (snuffy) 2008-07-29 08:44:47

Can you comment on this corruptor?
Has the patch from 12771 fixed this issue for you.
Let us know so that we can update the status of this bug.
By: ptorres (ptorres) 2008-07-30 15:53:12

I can confirm this was the problem for issue 0012627 ( delay due to dns querys )

We will try to test this patch soon.

By: Tilghman Lesher (tilghman) 2008-09-11 17:56:52

corruptor, PTorres: have either of you been able to test this fix yet? 1.4.22 should have this fix, if you'd like to download and try that.
By: Doug (doug) 2008-09-15 05:30:31

Our problem that was solved by patch from 0012771 has returned. We have seen this on 1.4.21 I have no results from 1.4.22 as we have bad queue locks when using that version. Added lock info to Ticket 12970.
By: ptorres (ptorres) 2008-09-15 10:15:02

sorry, We are still running on 1.4.18 , we haven´t had any real issues since then, I just checked against the trunk revisions and the patch looks pretty much equivalent to what we did.
By: Andrey Solovyev (corruptor) 2008-11-05 07:10:42.000-0600

I've tested 1.4.22. There are no delays if DNS server is unreachable.
Thank you.
By: Tilghman Lesher (tilghman) 2008-12-13 18:34:53.000-0600

This appears to be already fixed in SVN.