[Home]

Summary:ASTERISK-08783: DUNDi ignoring incoming replies when system is busy
Reporter:gdhgdh (gdhgdh)Labels:
Date Opened:2007-02-12 05:59:59.000-0600Date Closed:2011-06-07 14:03:03
Priority:MajorRegression?No
Status:Closed/CompleteComponents:PBX/pbx_dundi
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:
Description:I have two machines, star (10.0.0.239) and hash (10.0.0.235), both with one E1 telco uplink. star is almost idle, whilst hash is quite busy (~25 Zap channels + MixMonitor + 30 SIP clients). I do DUNDi lookups between them to find an available Zap channel.

Lookups originating from star complete almost instantly ( < 10ms).
Lookups originating from hash may take several seconds or fail completely. tcpdump shows star is responding to the request immediately, but hash is not processing the incoming reply. star then retries once per second for up to 4 seconds before giving up.

I've annotated the tcpdump output (as run on hash!) for completeness:

ENCRYPT + DPDISCOVER request to star:
10:22:58.768384 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 88

ACK to hash
10:22:58.768773 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8

ENCRYPT to hash:
10:22:58.769921 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 90

ENCRYPT to hash:
10:22:59.769966 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 90

ENCRYPT to hash:
10:23:00.770025 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 90

(then finally after 3 seconds) ACK to star:
10:23:01.651028 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8
10:23:01.651446 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8
10:23:01.651518 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8
10:23:01.651593 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8
10:23:01.651716 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8
10:23:01.651783 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8
10:23:01.651808 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8
10:23:01.651841 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8
10:23:01.651868 IP 10.0.0.235.4520 > 10.0.0.239.4520: UDP, length 8
10:23:01.652041 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8
10:23:01.652074 IP 10.0.0.239.4520 > 10.0.0.235.4520: UDP, length 8

The rest are an assortment of ACKs, NULLs, INVALIDs and ENCREJ presumably as hash is processing the backlog (possibly out of order ?)

The problem is, /why/ is hash waiting so long? Are other asterisk threads too busy with other work? How can I determine this? I'm using 1.2.14 on both machines. Load average on hash is typically only 0.2.

On hash, I notice other small indications like typing 'show chan' and pressing TAB... the commandline may wait for several seconds before completing 'show channel'.

Neither machine is doing anything exotic like MeetMe, queueing/agents or IVR. hash is new - an HP DL145 G2 (Opteron 246 with HyperTransport) and is dedicated to Asterisk. There is a single Sangoma A104 installed.

Comments:By: gdhgdh (gdhgdh) 2007-02-12 09:59:32.000-0600

Further info to this which suggests that hash is simply 'too busy' is I will get batches (maybe a dozen or two) of:

2007-02-12 11:45:45 NOTICE[4948] chan_sip.c: Peer '1217' is now UNREACHABLE!  Last qualify: 21
2007-02-12 11:45:45 NOTICE[4948] chan_sip.c: Peer '1216' is now UNREACHABLE!  Last qualify: 20
2007-02-12 11:45:45 NOTICE[4948] chan_sip.c: Peer '1323' is now UNREACHABLE!  Last qualify: 11

followed always ten seconds later by the matching:

2007-02-12 11:45:55 NOTICE[4948] chan_sip.c: Peer '1217' is now REACHABLE! (21ms / 2000ms)
2007-02-12 11:45:55 NOTICE[4948] chan_sip.c: Peer '1216' is now REACHABLE! (19ms / 2000ms)
2007-02-12 11:45:55 NOTICE[4948] chan_sip.c: Peer '1323' is now REACHABLE! (17ms / 2000ms)

13xx clients are on a busy switched LAN, whilst 12xx clients are on a private switched LAN - it is not possible for network congestion to be the cause for 12xx clients!

Likewise, on star I will see periodic:

2007-02-12 12:21:20 NOTICE[3098] pbx_dundi.c: Peer '00:17:08:92:37:6e' has become UNREACHABLE!
2007-02-12 12:22:15 NOTICE[3098] pbx_dundi.c: Peer '00:17:08:92:37:6e' has become REACHABLE!

again most likely because hash is too busy to respond to DUNDi ACKs :(

Is there anything that can be suggested other than 'have you tried 1.4?' (because I have, and was scared away by many segfaults, reported onto this tracker)

By: Joshua C. Colp (jcolp) 2007-02-15 19:50:24.000-0600

The first thing that potentially comes to mind is DNS, are lots of DNS Lookups happening that might be blocking? Besides that console output would be helpful.

By: gdhgdh (gdhgdh) 2007-02-16 06:00:35.000-0600

No, DNS lookups are returning instantly:

e.g.

hash*CLI> dundi lookup 1310@local bypass
 1.     0 IAX2/local:colophon@10.0.0.239/1310 (EXISTS|NOUNSLCTD|NOCOMUNSLTD)
    from 00:12:3f:ec:d2:de, expires in 5 s
DUNDi lookup completed in 5769 ms
   ETx-Frame Retry[No] -- OSeqno: 000 ISeqno: 000 Type: DPDISCOVER   (Command)
         Flags: 00 STrans: 23046  DTrans: 00000 [10.0.0.239:4520]
       VERSION         : 1
       DIRECT EID      : 00:17:08:92:37:6e
       CALLED NUMBER   : 1310
       CALLED CONTEXT  : local
       TTL             : 4
       CBYPASS         : Bypass Caches

Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 000 Type: ENCRYPT      (Command)
    Flags: 00 STrans: 23046  DTrans: 00000 [10.0.0.239:4520]
  ENTITY IDENT    : 00:17:08:92:37:6e
  KEYCRC32        : 1058293206
  ENCDATA         : [IV dbaece6a90e84552551c4c11b9b0dd72] 4 encrypted blocks


Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ACK          (Response)
    Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520]
Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT      (Command)
    Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520]
  ENCDATA         : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks


   Erx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: DPRESPONSE   (Response)
         Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520] (Final)
       ANSWER          : [EXISTS|NOUNSLCTD|NOCOMUNSLTD] 0 <IAX/local:colophon@10.0.0.239/1310�> from [00:12:3f:ec:d2:de]
       HINT            : [UNAFFECTED]
       EXPIRATION      : 5

Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 001 Type: ACK          (Response)
    Flags: 00 STrans: 23046  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT      (Command)
    Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520]
  ENCDATA         : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks


Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ       (Response)
    Flags: 00 STrans: 15255  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT      (Command)
    Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520]
  ENCDATA         : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks

hash*CLI>
Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ       (Response)
    Flags: 00 STrans: 28011  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT      (Command)
    Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520]
  ENCDATA         : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks

hash*CLI>
Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ       (Response)
    Flags: 00 STrans: 24907  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCRYPT      (Command)
    Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520]
  ENCDATA         : [IV 5037a536f9aa6a2db721c06f299d0f36] 5 encrypted blocks

hash*CLI>
Tx-Frame Retry[No] -- OSeqno: 000 ISeqno: 001 Type: ENCREJ       (Response)
    Flags: 00 STrans: 28027  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 001 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 23046 [10.0.0.239:4520]
Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 15255 [10.0.0.239:4520]
Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL       (Command)
    Flags: 00 STrans: 15255  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 28011 [10.0.0.239:4520]
Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL       (Command)
    Flags: 00 STrans: 28011  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 24907 [10.0.0.239:4520]
Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL       (Command)
    Flags: 00 STrans: 24907  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 001 ISeqno: 000 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 28027 [10.0.0.239:4520]
Tx-Frame Retry[No] -- OSeqno: 001 ISeqno: 002 Type: CANCEL       (Command)
    Flags: 00 STrans: 28027  DTrans: 02663 [10.0.0.239:4520] (Final)
Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 15255 [10.0.0.239:4520]
Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 28011 [10.0.0.239:4520]
Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 24907 [10.0.0.239:4520]
Rx-Frame Retry[No] -- OSeqno: 002 ISeqno: 001 Type: INVALID      (Response)
    Flags: 00 STrans: 02663  DTrans: 28027 [10.0.0.239:4520]


As you can see, this lookup took nearly six seconds to complete, and tcpdump shows the following:

hash:~# tcpdump -i eth0 -n port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

i.e. nothing, as I would expect - no DNS lookups have been attempted during any point of the lookup, my DUNDi config is completely based on IP addresses on the LAN.

By: Michel Belleau (malaiwah) 2007-07-11 07:42:13

I'm using IAX and DUNDi for communication between two Asterisk servers and SIP for communication to clients (~50 SIP clients). One server is busy while the other is mostly idle.

I had issues similar to this since I upgraded from 1.2.9 to > 1.2.13 (1.2.17 if my memory serves me right).
Upgrading the two servers to 1.2.20 fixed the problem.
Servers are now running 1.2.21.1 as of today and still no sign of malfunction.

By: Michel Belleau (malaiwah) 2007-07-13 08:59:58

I had some issues today, in my particular setup I'm using DUNDi-over-OpenVPN which switches calls using IAX2 over two servers, I get this message quite often in the debug log of one server (snippet):

Jul 13 08:16:08 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:16:18 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:16:29 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:16:39 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:16:46 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:16:55 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:17:00 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:17:10 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:18:32 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:18:42 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:19:56 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:20:06 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:21:34 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:21:44 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:21:48 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:21:58 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:22:20 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:22:30 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:22:44 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:22:54 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:22:56 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:23:06 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:23:20 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)
Jul 13 08:23:30 DEBUG[31661] devicestate.c: Changing state for IAX2/192.168.10.62:4569 - state 4 (Invalid)

And the other server:

Jul 11 09:56:45 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 09:56:50 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 09:59:21 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:01:16 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:01:21 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:04:08 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:04:12 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:04:29 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:06:11 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:06:54 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:06:55 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:09:13 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)
Jul 11 10:09:19 DEBUG[23149] devicestate.c: Changing state for IAX2/dundi - state 4 (Invalid)

By: Michiel van Baak (mvanbaak) 2007-09-08 07:25:07

Since 1.2 no longer gets patches beside security updates can you confirm this behavior on 1.4 ?

By: gdhgdh (gdhgdh) 2007-09-10 02:40:29

As the OP, I can't confirm this because we no longer use Asterisk.

By: Michiel van Baak (mvanbaak) 2007-09-10 03:45:56

@malaiwah: Can you give some information wether this issue comes up in 1.4 as well?

By: Tilghman Lesher (tilghman) 2007-12-05 09:43:42.000-0600

Suspending, due to lack of response.