Summary: | ASTERISK-08783: DUNDi ignoring incoming replies when system is busy | ||
Reporter: | gdhgdh (gdhgdh) | Labels: | |
Date Opened: | 2007-02-12 05:59:59.000-0600 | Date Closed: | 2011-06-07 14:03:03 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | PBX/pbx_dundi |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ||
Description: | I have two machines, star (10.0.0.239) and hash (10.0.0.235), both with one E1 telco uplink. star is almost idle, whilst hash is quite busy (~25 Zap channels + MixMonitor + 30 SIP clients). I do DUNDi lookups between them to find an available Zap channel. Lookups originating from star complete almost instantly ( < 10ms). Lookups originating from hash may take several seconds or fail completely. tcpdump shows star is responding to the request immediately, but hash is not processing the incoming reply. star then retries once per second for up to 4 seconds before giving up. I've annotated the tcpdump output (as run on hash!) for completeness: ENCRYPT + DPDISCOVER request to star: 10:22:58.768384 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 88 ACK to hash 10:22:58.768773 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8 ENCRYPT to hash: 10:22:58.769921 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 90 ENCRYPT to hash: 10:22:59.769966 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 90 ENCRYPT to hash: 10:23:00.770025 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 90 (then finally after 3 seconds) ACK to star: 10:23:01.651028 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8 10:23:01.651446 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8 10:23:01.651518 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8 10:23:01.651593 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8 10:23:01.651716 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8 10:23:01.651783 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8 10:23:01.651808 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8 10:23:01.651841 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8 10:23:01.651868 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8 10:23:01.652041 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8 10:23:01.652074 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8 The rest are an assortment of ACKs, NULLs, INVALIDs and ENCREJ presumably as hash is processing the backlog (possibly out of order ?) The problem is, /why/ is hash waiting so long? Are other asterisk threads too busy with other work? How can I determine this? I'm using 1.2.14 on both machines. Load average on hash is typically only 0.2. On hash, I notice other small indications like typing 'show chan' and pressing TAB... the commandline may wait for several seconds before completing 'show channel'. Neither machine is doing anything exotic like MeetMe, queueing/agents or IVR. hash is new - an HP DL145 G2 (Opteron 246 with HyperTransport) and is dedicated to Asterisk. There is a single Sangoma A104 installed. | ||
Comments: | By: gdhgdh (gdhgdh) 2007-02-12 09:59:32.000-0600 Further info to this which suggests that hash is simply 'too busy' is I will get batches (maybe a dozen or two) of: 2007-02-12 11:45:45 NOTICE[4948] chan_sip.c: Peer '1217' is now UNREACHABLE! Last qualify: 21 2007-02-12 11:45:45 NOTICE[4948] chan_sip.c: Peer '1216' is now UNREACHABLE! Last qualify: 20 2007-02-12 11:45:45 NOTICE[4948] chan_sip.c: Peer '1323' is now UNREACHABLE! Last qualify: 11 followed always ten seconds later by the matching: 2007-02-12 11:45:55 NOTICE[4948] chan_sip.c: Peer '1217' is now REACHABLE! (21ms / 2000ms) 2007-02-12 11:45:55 NOTICE[4948] chan_sip.c: Peer '1216' is now REACHABLE! (19ms / 2000ms) 2007-02-12 11:45:55 NOTICE[4948] chan_sip.c: Peer '1323' is now REACHABLE! (17ms / 2000ms) 13xx clients are on a busy switched LAN, whilst 12xx clients are on a private switched LAN - it is not possible for network congestion to be the cause for 12xx clients! Likewise, on star I will see periodic: 2007-02-12 12:21:20 NOTICE[3098] pbx_dundi.c: Peer '00:17:08:92:37:6e' has become UNREACHABLE! 2007-02-12 12:22:15 NOTICE[3098] pbx_dundi.c: Peer '00:17:08:92:37:6e' has become REACHABLE! again most likely because hash is too busy to respond to DUNDi ACKs :( Is there anything that can be suggested other than 'have you tried 1.4?' (because I have, and was scared away by many segfaults, reported onto this tracker) By: Joshua C. Colp (jcolp) 2007-02-15 19:50:24.000-0600 The first thing that potentially comes to mind is DNS, are lots of DNS Lookups happening that might be blocking? Besides that console output would be helpful. By: gdhgdh (gdhgdh) 2007-02-16 06:00:35.000-0600 No, DNS lookups are returning instantly: e.g. hash*CLI> dundi lookup 1310@local bypass 1. 0 IAX2/local:colophon@10.0.0.239/1310 (EXISTS|NOUNSLCTD|NOCOMUNSLTD) from 00:12:3f:ec:d2:de, expires in 5 s DUNDi lookup completed in 5769 ms ETx-Frame Retry[No] -- OSeqno: 000 ISeqno: 000 Type: DPDISCOVER (Command) Flags: 00 STrans: 23046 DTrans: 00000 [10.0.0.239:4520] VERSION : 1 DIRECT EID : 00:17:08:92:37:6e CALLED NUMBER : 1310 CALLED CONTEXT : local TTL : 4 CBYPASS : Bypass Caches Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 000 Type: ENCRYPT (Command) Flags: 00 STrans: 23046 DTrans: 00000 [10.0.0.239:4520] ENTITY IDENT : 00:17:08:92:37:6e KEYCRC32 : 1058293206 ENCDATA : [IV dbaece6a90e84552551c4c11b9b0dd72] 4 encrypted blocks Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ACK (Response) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT (Command) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] ENCDATA : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks Erx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: DPRESPONSE (Response) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] (Final) ANSWER : [EXISTS|NOUNSLCTD|NOCOMUNSLTD] 0 <IAX/local:colophon@10.0.0.239/1310�> from [00:12:3f:ec:d2:de] HINT : [UNAFFECTED] EXPIRATION : 5 Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 001 Type: ACK (Response) Flags: 00 STrans: 23046 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT (Command) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] ENCDATA : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ (Response) Flags: 00 STrans: 15255 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT (Command) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] ENCDATA : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks hash*CLI> Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ (Response) Flags: 00 STrans: 28011 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT (Command) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] ENCDATA : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks hash*CLI> Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ (Response) Flags: 00 STrans: 24907 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT (Command) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] ENCDATA : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks hash*CLI> Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ (Response) Flags: 00 STrans: 28027 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 001 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 23046 [10.0.0.239:4520] Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 15255 [10.0.0.239:4520] Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL (Command) Flags: 00 STrans: 15255 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 28011 [10.0.0.239:4520] Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL (Command) Flags: 00 STrans: 28011 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 24907 [10.0.0.239:4520] Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL (Command) Flags: 00 STrans: 24907 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 28027 [10.0.0.239:4520] Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL (Command) Flags: 00 STrans: 28027 DTrans: 02663 [10.0.0.239:4520] (Final) Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 15255 [10.0.0.239:4520] Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 28011 [10.0.0.239:4520] Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 24907 [10.0.0.239:4520] Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID (Response) Flags: 00 STrans: 02663 DTrans: 28027 [10.0.0.239:4520] As you can see, this lookup took nearly six seconds to complete, and tcpdump shows the following: hash:~# tcpdump -i eth0 -n port 53 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes i.e. nothing, as I would expect - no DNS lookups have been attempted during any point of the lookup, my DUNDi config is completely based on IP addresses on the LAN. By: Michel Belleau (malaiwah) 2007-07-11 07:42:13 I'm using IAX and DUNDi for communication between two Asterisk servers and SIP for communication to clients (~50 SIP clients). One server is busy while the other is mostly idle. I had issues similar to this since I upgraded from 1.2.9 to > 1.2.13 (1.2.17 if my memory serves me right). Upgrading the two servers to 1.2.20 fixed the problem. Servers are now running 1.2.21.1 as of today and still no sign of malfunction. By: Michel Belleau (malaiwah) 2007-07-13 08:59:58 I had some issues today, in my particular setup I'm using DUNDi-over-OpenVPN which switches calls using IAX2 over two servers, I get this message quite often in the debug log of one server (snippet): Jul 13 08:16:08 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:16:18 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:16:29 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:16:39 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:16:46 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:16:55 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:17:00 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:17:10 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:18:32 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:18:42 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:19:56 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:20:06 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:21:34 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:21:44 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:21:48 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:21:58 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:22:20 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:22:30 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:22:44 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:22:54 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:22:56 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:23:06 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:23:20 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) Jul 13 08:23:30 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid) And the other server: Jul 11 09:56:45 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 09:56:50 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 09:59:21 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:01:16 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:01:21 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:04:08 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:04:12 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:04:29 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:06:11 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:06:54 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:06:55 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:09:13 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) Jul 11 10:09:19 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid) By: Michiel van Baak (mvanbaak) 2007-09-08 07:25:07 Since 1.2 no longer gets patches beside security updates can you confirm this behavior on 1.4 ? By: gdhgdh (gdhgdh) 2007-09-10 02:40:29 As the OP, I can't confirm this because we no longer use Asterisk. By: Michiel van Baak (mvanbaak) 2007-09-10 03:45:56 @malaiwah: Can you give some information wether this issue comes up in 1.4 as well? By: Tilghman Lesher (tilghman) 2007-12-05 09:43:42.000-0600 Suspending, due to lack of response. |