[Home]

Summary:ASTERISK-12940: IAX2 storm (type 4, subtype 20: AST_CONTROL_SRCUPDATE)
Reporter:Chris Tracy (adiemus)Labels:
Date Opened:2008-10-20 17:46:49Date Closed:2009-02-16 17:57:12.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_iax2
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 13749_DEBUG_ONLY.txt
( 1) iax_trace.pcap.gz
( 2) storm.cap
Description:This is a rather odd issue to describe.

In simplest terms, I have a server connected via IAX2 to another server.  The configuration had been working fine through at least 1.4.18.  However, between then an 1.4.21.2, this issue began occuring for all calls from server A to server B.  (But oddly, not for calls from B->A)  The issue persists in 1.4.22.

When a call is placed through server A to server B via IAX2, server A sends a torrent of retransmitted packets right after the initial call setup.  This storm eats up more than 20 Mbit and is reproducible for every call from server A to server B.  If I replicate the configuration of server B onto server C, the behavior remains the same.  Calls from server A to server C show the same "packet storm" bug.

I have complete pcap captures of both sides of the connection, if that's useful.  However, the short overview from server A's perspective is:

A -> B NEW
B -> A AUTHREQ
A -> B AUTHREP
B -> A ACCEPT
A -> B ACK
B -> A ANSWER
A -> B ACK
A -> B CONTROL (subclass = 20, AST_CONTROL_SRCUPDATE)
A -> B CONTROL (subclass = 20, AST_CONTROL_SRCUPDATE) [retrans]
<storm of identical, retransmitted packets>

This is a serious issue for us as the storm is enough to saturate the internet connections both of LAN A and LAN B.  (Packet rate is into the tens or hundreds of thousands per second)

I'm happy to provide more information, but I don't know what else would be useful.

The iax.conf entry for server B on server A looks like:
[serverB]
type=user
auth=md5
secret=redacted
transfer=no
context=from-serverB
jitterbuffer=no

[serverB]
type=peer
auth=md5
secret=redacted
username=serverA
transfer=no
host=dynamic
disallow=all
allow=gsm
jitterbuffer=no

The config for serverA on serverB:
register => serverB:redacted@serverA.fqdn

[serverA]
type=user
auth=md5
secret=redacted
transfer=no
context=from-serverA
jitterbuffer=yes

[serverA]
type=peer
auth=md5
secret=redacted
username=serverB
host=serverA.fqdn
jitterbuffer=yes
trunk=no
trunktimestamps=no
disallow=all
allow=gsm

In the interim, I'll see if I can do a binary search to find in what version this behavior started.
Comments:By: Chris Tracy (adiemus) 2008-10-20 18:05:39

Ok, a quick set of downgrades shows me that:

1.4.20.1 - Broken
1.4.17 - Works

with the same config on this server.  Now to try all the interim versions till I find the first one that's broken.

By: Chris Tracy (adiemus) 2008-10-20 18:40:47

Ok, the issue seems to have started in 1.4.19-rc2:

1.4.17 - Works
1.4.18 - Works
1.4.18.1 - Works
1.4.19-rc1 - Works
1.4.19-rc2 - Broken
1.4.19-rc3 - Untested
1.4.19-rc4 - Untested
1.4.19 - Broken
1.4.20.1 - Broken

I tried copying chan_iax2.c from 1.4.19-rc1 into 1.4.19-rc2.  This worked, but the problem remained, which makes me think the bug isn't in chan_iax2.c.  However, I'm largely out of my depth in the actual asterisk code.  In any case, some change between rc1 and rc2 introduced this behavior.

By: Leif Madsen (lmadsen) 2008-10-21 09:43:39

Thanks for the thorough bug report! Hopefully a developer can take a look at this shortly.

By: Tilghman Lesher (tilghman) 2008-10-21 10:00:03

You should only be getting a DPREP if you're geting a corresponding DPREQ.  I would very much like to see a packet capture, as well as the iax.conf configuration files (just the [general] section) from each involved host.

By: Chris Tracy (adiemus) 2008-10-21 10:11:10

As requested, the iax.conf general sections:

ServerA:
[general]
disallow=all
allow=gsm
allow=g729
autokill=yes
jitterbuffer=no
dropcount=5
maxjitterbuffer=500
jittershrinkrate=1
tos=ef

ServerB:
[general]
disallow=all
allow=gsm
tos=ef
qualify=yes

As to the capture, what exactly would you like?  Full payloads or just an ascii summary?  (If it's full details, I'll need to sanitize it somewhat)

By: Chris Tracy (adiemus) 2008-10-21 10:15:55

I should clarify that my diagnosis as DPREP is predicated on the retransmitted packets being having a type of 4 (CONTROL) and a subclass of 0x14 which is decimal 20, which seems to be DPREP.

Interestingly also, now that I look at the capture, the IAX payload seems to have a retransmit flag inside of it.  This flag is set to 0 for all the DPREP packets, even though they look like retransmits.  (in that they're nearly identical to the first one sent out, differing only in IAX timestamp value)

By: Joshua C. Colp (jcolp) 2008-10-21 10:19:28

Since it is an Asterisk control frame it is actually a source update control frame, not an IAX2 DPREP frame.

By: Chris Tracy (adiemus) 2008-10-21 10:23:29

Ahh, ok.  Then that would point to all the source update control stuff that got changed between rc1 and rc2.  Makes slightly more sense now.  :-)

By: Chris Tracy (adiemus) 2008-10-21 17:19:03

Is there still interest in a packet capture, and if so, at what level of detail?  (tcpdump console output?  Full payload?)

By: Leif Madsen (lmadsen) 2008-10-22 10:02:06

Just pinging Corydon76 to see what level of debugging info is being requested here.

By: Tilghman Lesher (tilghman) 2008-10-22 14:26:35

blitzrage:  I honestly don't know how to proceed, which is why I unassigned myself.

By: Leif Madsen (lmadsen) 2008-10-22 16:07:19

Corydon76: no problem! that's the way to do it.

adiemus: I'm going to say that a full pcap capture (wireshark) is probably going to be the best way to move this forward so that a developer looking at this has all the information necessary.

Thanks!

By: Chris Tracy (adiemus) 2008-10-23 16:08:03

Alright, a bit more info to add.  It took me a while to induce the behavior again to get a better capture.  Interestingly, a call from a softphone registered to serverA made to serverB was working fine.  However, a call from a softphone registered to serverC, routed through serverA to get to serverB showed the issue.  It also dumped a fair amount of warnings to the asterisk console:

[Oct 23 22:14:23] WARNING[22160]: chan_iax2.c:6927 socket_process: Received trunked frame before first full voice frame
<quite a number of these, then>
[Oct 23 22:14:47] NOTICE[22168]: chan_iax2.c:6701 socket_read: Out of idle IAX2 threads for I/O, pausing!
[Oct 23 22:14:48] NOTICE[22168]: chan_iax2.c:6701 socket_read: Out of idle IAX2 threads for I/O, pausing!
[Oct 23 22:14:51] WARNING[20520]: chan_iax2.c:2047 __attempt_transmit: Max retries exceeded to host 127.0.0.1 on IAX2/om-enum-12 (type = 4, subclass = 20, ts=5818, seqno=20)
[Oct 23 22:14:51] WARNING[20566]: chan_iax2.c:2047 __attempt_transmit: Max retries exceeded to host 192.168.10.10 on IAX2/doulos-13 (type = 4, subclass = 20, ts=5673, seqno=25)
[Oct 23 22:14:51] WARNING[20553]: chan_iax2.c:2047 __attempt_transmit: Max retries exceeded to host 192.168.10.10 on IAX2/doulos-13 (type = 4, subclass = 20, ts=5526, seqno=31)
[Oct 23 22:14:51] WARNING[22164]: chan_iax2.c:2047 __attempt_transmit: Max retries exceeded to host 127.0.0.1 on IAX2/om-enum-12 (type = 4, subclass = 20, ts=5785, seqno=9)
[Oct 23 22:14:51] WARNING[20567]: chan_iax2.c:2047 __attempt_transmit: Max retries exceeded to host 127.0.0.1 on IAX2/om-enum-12 (type = 4, subclass = 20, ts=5791, seqno=11)


anonymized pcap dump of this behavior for this call path will follow.  (For reference, IAX2/om-enum-12 is the first leg from serverC to serverA, while IAX2/doulos-13 is the link from serverA to serverB)



By: Chris Tracy (adiemus) 2008-10-23 16:38:14

Uploaded a short capture of the first few hundred packets of a call exhibiting this behavior.  IPs and checksums have been anonymized.  (Hopefully the RSA auth mechanism is strong enough for me not to worry about the IAX2 data :-)

By: Chris Tracy (adiemus) 2008-11-11 17:45:18.000-0600

Is there something else I can do or more information I can provide to help this issue along?  The DoS it creates doesn't lend itself to just ignoring it.

If it would help, I could arrange ssh access to the affected system.

By: Matt Riddell (zx81) 2008-11-12 00:44:55.000-0600

Is there any packet loss between the three machines?

By: Chris Tracy (adiemus) 2008-11-12 00:52:59.000-0600

"Any" loss, yes.

The issue was originally discovered with calls going from the misbehaving asterisk box to an asterisk box on a ship behind a VSAT connection.  Loss on this link typically varies from 1-4%.  However, the issue also exists on connections to terrestrially connected sites with much lower levels of loss.  (<1%)

As far as I can tell, I don't need (abnormal) amounts of packet loss to trigger the issue.  (Though all this testing is taking place over the Internet as I don't currently have the option of reproducing the issue entirely on a single LAN)

By: Matt Riddell (zx81) 2008-11-12 01:36:52.000-0600

If you have local access to one of the machines (i.e. not via SSH) you could test to see if it happens quicker under loss with something like:

tc qdisc add dev eth0 root handle 1: netem delay 100ms
tc qdisc add dev eth0 parent 1:1 handle 10: netem loss 80%

You really need to be sure you are doing this locally as SSH will be just about inoperable in these situations.

By: Matt Riddell (zx81) 2008-11-12 01:38:20.000-0600

Oh, and what are the svn revisions of the not working and working versions (I may be able to set up some tests in the lab down at the office).  I could probably find out but am kinda lazy :)

By: Chris Tracy (adiemus) 2008-11-12 01:53:49.000-0600

I could do the netem tests (scaled back a bit, because I don't have physical access to this system and 80% is rather a lot of loss, as you pointed out), but I'm not sure what good it would do.  There's no need to excite the phenomenon.  If you look at the pcap capture you'll see that it happens basically immediately once the call is set up.  The results are quite reproducible on this system, so far with every external asterisk box I've tried against.

As to the exact SVN revisions, I'm not sure I know how to translate the releases into SVN revs.  Doing it the simple minded way from the web interface on svn.asterisk.org, it looks like:

1.4.19-rc1 - Works  - rev 105671
1.4.19-rc2 - Broken - rev 107529

Not sure if that's what you're after, but hopefully it helps.

Looking through the SVN logs for chan_iax2.c, I have to say that this issue sounds a LOT like the one mentioned in the commit for rev 115565, but instead of a storm of VNAK's, it's a storm of CONTROL's:

http://svn.digium.com/view/asterisk?view=rev&revision=115565

By: Simon Kirby (sim) 2008-11-26 18:57:00.000-0600

I believe we are also seeing this problem.  I'm not sure if it's exactly the same trigger, but I seem to have found a reliable way to reproduce the IAX AST_CONTROL_SRCUPDATE storm, after digging through packet traces from a production environment.

I finally tracked this down to two relatively simple dialplan pieces.  I should also note that we first saw this on 1.4.21.2 and tried to upgrade to 1.4.22 and 1.4.22 + svn 1.4 chan_iax2.c in an attempt to fix this, but nothing appeared to change.

The dialplan in our case is an unintended loop between two boxes.  Box A calls Queue() with a 20 second timeout and then Dial()s box B which calls Queue() joinempty=no (is empty) and continues immediately back to Dial()ing Box A.  In theory, this should just stack up new channels every 20 seconds, but when I dial this extension, it plays hold music for 20 seconds, then again for 20 seconds, and then silence while everything storms.  Eventually, the IAX channels time out and hang up.  This happens exactly the same way each time I call it.

I will attempt to reduce the dialplan and get a packet trace tonight that isn't full of other calls.

By: Simon Kirby (sim) 2008-11-27 12:17:00.000-0600

Ok, this issue can be reproduced very easily with a very basic dialplan.  This particular dialplan used to work in older versions, because I used it to try to load-test IAX previously when we were debugging the last VNAK storming problem.

Steps to reproduce:

Set up two Asterisk 1.4.22 instances and peer them via IAX, with defaults.  It happens with transfer=no or transfer=yes.

Install this dialplan, changing the Dial() on each box to reach the other box:

[default]
exten => _1XXX,1,Answer
exten => _1XXX,n,Wait(.02)
exten => _1XXX,n,Dial(IAX2/t3/$[${EXTEN} - 1]@default)
exten => _1000,1,Answer
exten => _1000,n,SayDigits(123)
exten => _1000,n,Echo

(The Wait() or SayDigits() may not really be necessary, as it doesn't seem to even get to extension 1000.)

From the console with chan_oss enabled or similar, or from a sip phone, Dial an extension such as "1005".  The higher the number, the more times it will try to  dial back and forth.  I'm doing this on a box without a soundcard, and it complains about not being able to open /dev/dsp but dials anyway.

In my case, this test hangs after the first few channels are opened.  "iax2 show stats" shows results such as:

Outstanding frames: 2174 (31 ingress, 2143 egress)
Packets in transmit queue: 1249 dead, 0 final, 2142 total

and "iax2 show threads" shows:

15 of 10 threads accounted for with 94 dynamic threads

I have attached "iax_trace.pcap.gz", captured on the box initiating the call.

By: Simon Kirby (sim) 2008-11-27 17:46:36.000-0600

Confirmed that the problem starts with 1.4.19-rc2.  I can see SVN commit r106235 as released in -rc2 to fix ASTERISK-11581 adds the code to check for AST_CONTROL_SRCUPDATE in all of the control frame processing switch statements.  1.4.19-rc1, when talking to 1.4.19-rc2 and newer, emit this message instead of storming:

[Nov 27 15:19:49] WARNING[15693]: file.c:1167 waitstream_core: Unexpected control subclass '20'

By: Leif Madsen (lmadsen) 2009-01-27 15:10:18.000-0600

Assigned to dvossel to take a look and see if he can move this forward. Thanks!

By: David Vossel (dvossel) 2009-01-28 13:53:33.000-0600

Uploaded a patch for debugging, use 1.4.22.  I don't have the resources at the moment to be able to reproduce the issue.  The console output generated from the patch will help me isolate where the problem may be occurring.  Look for console notices containing "PACKET STORM TESTING:".



By: Leif Madsen (lmadsen) 2009-01-28 20:34:41.000-0600

Changing status to Feedback as there is a request for testing and feedback from the reporter or anyone with this issue.

By: David Brillert (aragon) 2009-02-05 20:27:12.000-0600

Possibly same issue as http://bugs.digium.com/view.php?id=14386 ?

By: mihai (mihai) 2009-02-12 18:58:25.000-0600

I've encountered this problem on our systems and I did some analysis. I believe I've figured out the mechanism by which the problem is triggered, even though I am not yet sure how to fix it.  Anyways, here goes:

My setup: Client1 <--> Asterisk1 <--> Asterisk2 <--> Client2; all talking IAX2

Client1 is attempting to call Client2 via the two Asterisk servers. In each server, app_dial attempts to bridge the two channels and calls res_features.c:ast_bridge_call(), which in turn calls channel.c:ast_channel_bridge() which in turn attempts to create a native bridge by calling channel->tech->bridge(). This translates to a call to chan_iax2.c:iax2_bridge(). The native iax2 bridge is supposed to pass media back and forth until an AST_FRAME_CONTROL frame arrives, at which point it shuts down the native bridge with an AST_BRIDGE_COMPLETE return value.

Commit 106235 (http://svn.digium.com/view/asterisk?view=revision&revision=106235) introduces a new control packet AST_CONTROL_SRCUPDATE, that I believe is intended to signal to an endpoint to reset its media streams. This frame is sent to both legs of a call in channel.c:ast_channel_bridge(), right before entering the bridge loop. Normally, this would be inconsequential, since most IAX2 clients I'm aware of do not know about this type of control frame, and will cheerfully acknowledge and ignore it.  

However, in our scenario, one leg of the call connects to another Asterisk server.  Upon receiving an AST_CONTROL_SRCUPDATE, chan_iax2 will promptly terminate the native bridge. Channel.c:ast_channel_bridge() will end its bridge loop and return with a value of 0. The control gets back to res_features.c:ast_bridge_call() and since there is no code path to handle AST_CONTROL_SRCUPDATE, the frame is simply consumed and the control goes back to the beginning of the loop.  Lather, rinse, repeat.

This infinite looping between the three functions will generate a constant stream of AST_CONTROL_SRCUPDATE frames between the two servers and between the servers and their clients.  We're talking about full IAX2 frames, so if for some reason some frames get lost are received out of order, VNAKs get involved.  This, in turn, will add retransmissions to an already thorny problem and everything increases exponentially until the brown stuff hits the proverbial fan.

I'm not sure how to best fix this.  Obviously, the simplest way is to nip it in the bud by simply dropping incoming AST_CONTROL_SRCUPDATE in iax2_bridge().  But this has some implications that I am not sure I fully understand. In any event, I hope this helps fix the issue.

By: Digium Subversion (svnbot) 2009-02-16 15:28:37.000-0600

Repository: asterisk
Revision: 176247

U   branches/1.4/channels/chan_iax2.c

------------------------------------------------------------------------
r176247 | dvossel | 2009-02-16 15:28:37 -0600 (Mon, 16 Feb 2009) | 8 lines

Fixes issue with AST_CONTROL_SRCUPDATE breaking out of native bridge

In iax2, when a AST_CONTROL_SRCUPDATE is received it breaks from the native bridge, but since there is no code path to handle srcupdate it just goes to be beginning of the loop.  This was causing packet storms of srcupdate frames between servers.  Now srcupdate frames do not break the native bridge for processing.    

(closes issue ASTERISK-12940)
Reported by: adiemus


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176247

By: Digium Subversion (svnbot) 2009-02-16 17:31:12.000-0600

Repository: asterisk
Revision: 176354

U   branches/1.4/channels/chan_iax2.c

------------------------------------------------------------------------
r176354 | dvossel | 2009-02-16 17:31:11 -0600 (Mon, 16 Feb 2009) | 8 lines

Fixes issue with AST_CONTROL_SRCUPDATE not being relayed correctly during bridging

This should have been committed with rev176247, but I missed it.  srcupdate frames no longer break out of the native bridge, but are not being sent to the other call leg either.  This fixs that.

issue ASTERISK-12940



------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176354

By: Digium Subversion (svnbot) 2009-02-16 17:33:56.000-0600

Repository: asterisk
Revision: 176355

_U  trunk/
U   trunk/channels/chan_iax2.c

------------------------------------------------------------------------
r176355 | dvossel | 2009-02-16 17:33:55 -0600 (Mon, 16 Feb 2009) | 13 lines

Merged revisions 176354 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
 r176354 | dvossel | 2009-02-16 17:30:52 -0600 (Mon, 16 Feb 2009) | 8 lines
 
 Fixes issue with AST_CONTROL_SRCUPDATE not being relayed correctly during bridging
 
 This should have been committed with rev176247, but I missed it.  srcupdate frames no longer break out of the native bridge, but are not being sent to the other call leg either.  This fixs that.
 
 issue ASTERISK-12940
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176355

By: Digium Subversion (svnbot) 2009-02-16 17:46:48.000-0600

Repository: asterisk
Revision: 176359

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_iax2.c

------------------------------------------------------------------------
r176359 | dvossel | 2009-02-16 17:46:48 -0600 (Mon, 16 Feb 2009) | 20 lines

Merged revisions 176355 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r176355 | dvossel | 2009-02-16 17:33:55 -0600 (Mon, 16 Feb 2009) | 13 lines
 
 Merged revisions 176354 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r176354 | dvossel | 2009-02-16 17:30:52 -0600 (Mon, 16 Feb 2009) | 8 lines
   
   Fixes issue with AST_CONTROL_SRCUPDATE not being relayed correctly during bridging
   
   This should have been committed with rev176247, but I missed it.  srcupdate frames no longer break out of the native bridge, but are not being sent to the other call leg either.  This fixs that.
   
   issue ASTERISK-12940
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176359

By: Digium Subversion (svnbot) 2009-02-16 17:57:12.000-0600

Repository: asterisk
Revision: 176362

_U  branches/1.6.1/
U   branches/1.6.1/channels/chan_iax2.c

------------------------------------------------------------------------
r176362 | dvossel | 2009-02-16 17:57:11 -0600 (Mon, 16 Feb 2009) | 20 lines

Merged revisions 176355 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r176355 | dvossel | 2009-02-16 17:33:55 -0600 (Mon, 16 Feb 2009) | 13 lines
 
 Merged revisions 176354 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r176354 | dvossel | 2009-02-16 17:30:52 -0600 (Mon, 16 Feb 2009) | 8 lines
   
   Fixes issue with AST_CONTROL_SRCUPDATE not being relayed correctly during bridging
   
   This should have been committed with rev176247, but I missed it.  srcupdate frames no longer break out of the native bridge, but are not being sent to the other call leg either.  This fixs that.
   
   issue ASTERISK-12940
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176362