[Home]

Summary:ASTERISK-12290: [patch] chan_iax2 will create multiple sessions when receiving retransmitted NEW
Reporter:Peter Grayson (jpgrayson)Labels:
Date Opened:2008-06-30 18:48:40Date Closed:2011-06-07 14:02:47
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_iax2
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) chan_iax_dup_new_fix5-trunk.diff
( 1) chan_iax2_dup_new_fix.patch
( 2) chan_iax2_dup_new_fix2.patch
( 3) chan_iax2_dup_new_fix3.patch
( 4) chan_iax2_dup_new_fix4.patch
( 5) chan_iax2_dup_new_fix5.patch
Description:When a client initiates a call with a NEW frame, frequently the client will send a retransmitted NEW frame prior to receiving the ACCEPT frame from asterisk. This may result in asterisk receiving duplicate NEW frames from the same client in short succession.

As of at least 1.4.20, asterisk will ACCEPT and create new call sessions for both of these NEW frames originating from the same client. This results in much protocol confusion between asterisk and the client and ultimately leads to asterisk and/or the client terminating the call.

The correct behavior would be for asterisk to recognize second (and third and fourth) NEW frames from the same source (socket and callno) and ignore them.


****** ADDITIONAL INFORMATION ******

There are at least two possible places this problem may be addressed.

(1) In socket_process(), at chan_iax2.c:7191, the check_dcallno calculation might be modified to include IAX_COMMAND_NEW as an exception to the dcallno-checkable kinds of IAX frames.

(2) In match(), at chan_iax2.c:1446, if the dcallno parameter is not set (zero) and the rest of the matching criteria (ip/port/peercallno) are met, then this channel could match.
Comments:By: Peter Grayson (jpgrayson) 2008-06-30 18:55:27

Attached patch chan_iax2_dup_new_fix.patch.

This implements fix (1). It also cleans up the messy logic using ternary operators in socket_process() and match().

I believe this is sufficient to fix the problem. Unfortunately this is socket_process() so second and third opinions are welcome.

By: Peter Grayson (jpgrayson) 2008-06-30 19:05:39

Of course this problem is related to the discussion found here:

http://lists.digium.com/pipermail/asterisk-dev/2008-May/033217.html

I think the double new scenario was unaccounted for.

By: Peter Grayson (jpgrayson) 2008-07-01 18:27:27

I've uploaded another proposed patch that obsoletes the first. This change is more comprehensive than the first. I'll try to explain and justify the changes:

1) Changed match() to no longer have the check_dcallno parameter. Instead, an incoming dcallno of 0 will allow the dcallno to match.

This is symmetric with how callno is matched against cur->peercallno. This symmetry makes it much easier to predict match()'s semantics. I also think it makes it easier for the rest of the find_call() related code to be made correct.

2) Remove check_dcallno parameter from __find_callno(), find_callno(), and find_callno_locked(). This parameter was a pass-through to match(), so this change makes sense if you buy the change to match().

Also eliminates the self-proclaimed "hack" of hooking check_dcallno to the frames_received member of the chan_iax2_pvt struct passed to ao2_find(). Eliminating this hack is a bonus.

3) Remove match_dcallno argument from find_callno() invocation at line 7036. Here it is important to note that dcallno will necessarily equal zero at this point in the code. In other words, these two lines would be equivalent:

 fr->callno = find_callno(ntohs(vh->callno) & ~0x8000, dcallno, &sin, new, fd);
 fr->callno = find_callno(ntohs(vh->callno) & ~0x8000,       0, &sin, new, fd);

So the new code will give the same results as previously.

4) Remove match_dcallno argument from find_callno_locked() invocation at line 7094. Here we see that dcallno and formerly match_dcallno were zero. The same matching semantic holds.

5) Around line 7180. This is where it gets hairy. Some factoids to help sort this out:

* For non-full-frames, dcallno is always 0.
* For full frames that are not (PING, LAGRQ, or NEW) if the dcallno is zero we know there is a problem without having to call find_callno().
* For (PING, LAGRQ, and NEW) frames, dcallno needs to either be 0 or match the call. This is actually a stronger test than what was done previously. Previously an incoming PING or LAGRQ would be matched even if its dcallno was non-zero and unequal to pvt->callno. I assert that this was bogus protocol.

So what the patched code does is take all of these factors into account in one conditional test. The only way a dcallno set to 0 will be passed to find_callno() is if is a frame that is explicitly allowed to have a dcallno of 0.

6) The next five changes to find_callno_locked() and find_callno() were already explicitly not checking dcallno and this semantic is preserved.

7) The last change in pvt_cmp_cb() simply removes the now-bogus frames_received=match_dcallno hack.

I have tested this patch against a rigged version of iaxclient that always sends out at least one retransmitted NEW frame. Asterisk behaves as expected in that case -- duplicate NEW frames are ignored because they match correctly against an already created call session.

By: Tilghman Lesher (tilghman) 2008-07-01 20:50:40

I need three things changed with this patch before I'm comfortable with committing it.

a) Remove gratuitous changes in spacing, so that the code change reflects just that.

b) Anywhere where you've changed possible semantics, I need to see a comment in the code, just in case the semantics are incorrect.  This will make finding those cases easier in the future.

c) In draft 4 of the IETF IAX2 specification, section 6.2.2, the document states that the destination call number is not required.  It does NOT say what value it should be, so a value other than zero is possible.  Asterisk should therefore not require this field to be zero.  That needs to be changed in your patch to submit an explicit zero in the dcallno parameter.

Additionally, I'm interested in confirming that your IAX2 client is following the specification for retransmission timeout by using twice the last POKE/PONG response time (Section 7.2.1).

By: Peter Grayson (jpgrayson) 2008-07-02 10:07:56

> I need three things changed with this patch before I'm comfortable with
> committing it.
>
> a) Remove gratuitous changes in spacing, so that the code change
> reflects just that.

Okay. I'll submit any whitespace-oriented stuff as supplemental patches.

> b) Anywhere where you've changed possible semantics, I need to see a
> comment in the code, just in case the semantics are incorrect. This will
> make finding those cases easier in the future.

I really don't know what you want here. Any non-whitespace change has
the possibility to change semantics. The code says what the semantic is;
beyond that comments are for the "why" not the "what". If there are
specific places you have in mind, I'll do my best to comment them.

> c) In draft 4 of the IETF IAX2 specification, section 6.2.2, the
> document states that the destination call number is not required. It
> does NOT say what value it should be, so a value other than zero is
> possible. Asterisk should therefore not require this field to be zero.
> That needs to be changed in your patch to submit an explicit zero in the
> dcallno parameter.

I see that interpretation now. I might suggest that the spec could be
clarified. e.g. "for a NEW frame ... the destination call identifier in
the header must be ignored by the receiver". Also, it might be
reasonable for the spec to demand that the sender set the destination
call identifier to 0, but maybe there are historical reasons why that is
not the case.

> Additionally, I'm interested in confirming that your IAX2 client is
> following the specification for retransmission timeout by using twice
> the last POKE/PONG response time (Section 7.2.1).

Nowhere in the spec does it specify what value to assume for the initial
round trip time. When that first NEW frame goes out, the client will not
have had an actual PING/PONG interaction with the server yet, so an
arbitrary number needs to be chosen. I am using iaxclient, it sets the
initial round trip time to 100ms.

In terms of protocol, none of this matters of course because UDP frames
have no guarantee to arrive on time or in order. Asterisk needs to be
able to deal with duplicate NEW frames.

Another problem with the patch:

Upon closer reading of the section 6.2.2, it appears that asterisk must
send an ACCEPT frame in response to this duplicate NEW frame. The patch
as it stands does not yield this behavior. Asterisk will just ignore the
duplicate NEW frame. The spec seems correct in this case. When asterisk
encounters a duplicate NEW frame, it cannot assume that the client
received its first ACCEPT frame.

I will rework the patch to take all of these things into account.

By: Tilghman Lesher (tilghman) 2008-07-02 12:48:14

> > b) Anywhere where you've changed possible semantics, I need to see a
> > comment in the code, just in case the semantics are incorrect. This will
> > make finding those cases easier in the future.

> I really don't know what you want here. Any non-whitespace change has
> the possibility to change semantics. The code says what the semantic is;
> beyond that comments are for the "why" not the "what". If there are
> specific places you have in mind, I'll do my best to comment them.

I'm specifically interested in seeing comments in any place where you've specifically said that that you are changing how Asterisk is responding.  So, protocol semantics, specifically.  Something along the lines of "Asterisk used to allow this, it is now more strict in that it requires that."

By: Peter Grayson (jpgrayson) 2008-07-02 16:56:38

Perhaps the third time is the charm. I believe this latest patch to be superior to the other two.

Things that are different:

* No more gratuitous whitespace changes.

* Changed where we decide to not match based on dcallno. We now simply leave dcallno set to 0 when we encounter one of our exceptional cases (NEW, PING, LAGRQ). I think this logic is simpler and more to-the-point than the alternative where we checked it later on in socket_process().

* The new logic also accounts for the case where dcallno may be non-zero for NEW, PING, and LAGRQ. We no longer match based on dcallno for NEW, PING, or LAGRQ under any circumstances. This matches the previous semantic for PING and LAGRQ and the semantic spelled out in the iax2 rfc section 6.2.2.

* Updated big comment to explain why we do not match dcallno for NEW frames.

* Updated big comment to explain how we avoid matching by dcallno for NEW, LAGRQ, and PING.

* Added test for IAX_COMMAND_NEW with the rest of the out of order packet exceptions. This causes asterisk to _not_ send an ACK when it encounters a duplicate/retransmitted NEW frame. (line 7290) I put a wimpy comment next to this test that identifies what condition it is catching. Not sure if this meets the commenting requirement.

> Upon closer reading of the section 6.2.2, it appears that asterisk must
> send an ACCEPT frame in response to this duplicate NEW frame. The patch
> as it stands does not yield this behavior. Asterisk will just ignore the
> duplicate NEW frame. The spec seems correct in this case. When asterisk
> encounters a duplicate NEW frame, it cannot assume that the client
> received its first ACCEPT frame.

I'm going to waffle on this one. For any second, third, etc. NEW frame that comes in, we will have already sent an ACCEPT frame. If the other end fails to receive that ACCEPT frame, it will receive a retransmission of that ACCEPT frame through the normal retry mechanism. There is no need to send out another ACCEPT frame (with its own retry timer) in response to this redundant NEW frame.

I have run this against the HEAD of the 1.4 branch successfully.

By: Digium Subversion (svnbot) 2008-07-10 16:49:28

Repository: asterisk
Revision: 129803

U   branches/1.4/channels/chan_iax2.c

------------------------------------------------------------------------
r129803 | tilghman | 2008-07-10 16:49:26 -0500 (Thu, 10 Jul 2008) | 8 lines

Correctly deal with duplicate NEW frames (due to retransmission).  Also, fixup
the destination call number matching to be more strict and reliable.
(closes issue ASTERISK-12290)
Reported by: jpgrayson
Patches:
      chan_iax2_dup_new_fix3.patch uploaded by jpgrayson (license 492)
Tested by: jpgrayson, Corydon76

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=129803

By: Tilghman Lesher (tilghman) 2008-07-11 09:06:58

jpgrayson:  Russell Bryant has raised an objection to this patch, as written.  Could you address his concerns?

http://lists.digium.com/pipermail/asterisk-dev/2008-July/033912.html

By: Peter Grayson (jpgrayson) 2008-07-11 12:28:48

Russell is a gentleman and a scholar and I believe he is absolutely correct. With patch 3, this would allow full frames with a dcallno set to zero to inappropriately match existing call sessions.

I have submitted a fourth patch for review. This is against the new head of the 1.4 branch (post rev 129803). Instead of reverting the previous patch, I think this approach might be better.

Since we know that full-frames that are not NEW, PING, or LAGRQ must have a non-zero dcallno, we can just test for that before trying to match the call. This leaves match() and find_callno() with their simpler semantics that satisfy all other uses. It also keeps the "frames_received" hack out of the code.

By: Russell Bryant (russell) 2008-07-11 13:40:56

Yes, I believe this final patch is correct.  Well done.  :)

By: Digium Subversion (svnbot) 2008-07-11 13:44:32

Repository: asterisk
Revision: 130169

U   branches/1.4/channels/chan_iax2.c

------------------------------------------------------------------------
r130169 | tilghman | 2008-07-11 13:44:26 -0500 (Fri, 11 Jul 2008) | 7 lines

Ensure that a destination callno of 0 will not match for frames that do not
start a dialog (new, lagrq, and ping).
(closes issue ASTERISK-12290)
Reported by: russellb
Patches:
      chan_iax2_dup_new_fix4.patch uploaded by jpgrayson (license 492)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=130169

By: Digium Subversion (svnbot) 2008-07-13 12:48:31

Repository: asterisk
Revision: 130514

U   branches/1.4/channels/chan_iax2.c

------------------------------------------------------------------------
r130514 | tilghman | 2008-07-13 12:48:29 -0500 (Sun, 13 Jul 2008) | 4 lines

Reverting 2 changesets, as it breaks incoming IAX2 calls
(Related to issue ASTERISK-12290)
Reported by: mvanbaak

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=130514

By: Tilghman Lesher (tilghman) 2008-07-13 12:53:39

jpgrayson:  you'll need to restart your patches, as this broke incoming calls.

By: Peter Grayson (jpgrayson) 2008-07-14 09:25:31

I need more information. These patches do not break incoming calls for me.

Is there a mailing list thread or another bug post that I'm missing? Do incoming calls always fail or just sometimes? Is there a packet trace available that shows the issue?

Thanks.

By: Peter Grayson (jpgrayson) 2008-07-14 09:48:43

I have found a hole in the patch that may account for the previously reported problem.

In patch ASTERISK-1, we now also avoid checking dcallno for non-AST_FRAME_IAX frametypes. This is now the same semantic as before and different from the previous patch #4 which was reportedly bogus. Besides this one change, the rest of the patch is the same as before.

 /* Get the destination call number */
 dcallno = ntohs(fh->dcallno) & ~IAX_FLAG_RETRANS;

 - if (f.frametype == AST_FRAME_IAX &&
 + if (f.frametype != AST_FRAME_IAX ||

I have not yet tested this patch, so buyer beware.

By: Michiel van Baak (mvanbaak) 2008-07-14 09:49:17

It was me who found it, and I reported it to Corydon on IRC.
He asked me to test a couple of revisions, we found a working one, found out it was this patch and some other related to this, and Corydon reverted.

I can do some more research on it later today, including getting packetdumps etc.

The calls failed 100% of the time.

By: Peter Grayson (jpgrayson) 2008-07-14 09:53:26

Michiel: Thank you for the update. I would appreciate it if you would try the latest (fifth) patch that I've posted. If that fails, packet dumps and any other information you can provide would be appreciated.

By: Michiel van Baak (mvanbaak) 2008-07-14 12:33:50

bleh, this patch does not apply to trunk ;)

Give me a couple of minutes to fix it :)

By: Michiel van Baak (mvanbaak) 2008-07-14 12:37:01

there, if others want to test it against trunk :)

By: Michiel van Baak (mvanbaak) 2008-07-14 12:43:45

@jpgrayson:
Thank you, this new patch works.

Corydon: you can commit this one if it was for me :)

By: Michiel van Baak (mvanbaak) 2008-07-14 13:01:34

yipes!
It seems to only work right after a restart of asterisk.

Just to be sure I tried again now, and no incoming calls sir

:(

By: Peter Grayson (jpgrayson) 2008-07-14 13:11:10

Okay, so we're back to where I need more details.

Does this patch only have problems against trunk? It was developed against the 1.4 branch -- I have not even looked at whether it makes sense for the trunk.

What are the symptoms of failure?

And any other information including packet dumps that you could provide would be appreciated.

By: Michiel van Baak (mvanbaak) 2008-07-14 13:24:48

Can we meet on #asterisk-dev irc channel on irc.freenode.net ?

By: Michiel van Baak (mvanbaak) 2008-07-14 13:38:21

this is too weird.
With this patch, I dont see any packets on the wire when I call my system.
Without this patch, I can see them.

Looks like I have to capture the register packets as well, might be something wrong in there.

some CLI output:
asterisk*CLI> iax2 show registry
Host                  dnsmgr  Username    Perceived             Refresh  State
my_itsp_ip:4569   N       username     my_public_ip:61187         60  Registered
1 IAX2 registrations.
asterisk*CLI> iax2 show peers
Name/Username    Host                 Mask             Port          Status    
speakup01/vanba  my_itsp_ip  (S)  255.255.255.255  4569          OK (15 ms)
...
4 iax2 peers [4 online, 0 offline, 0 unmonitored]

so registration, and the qualify stuff works correctly.
Outgoing calls via speakup01 work as well

I did not try the 1.4 version. You want me to test that one as well ?

IIRC speakup is using asterisk 1.4.<someversion> on their side.

By: Peter Grayson (jpgrayson) 2008-07-14 14:13:22

Yes, I would be interested in knowing if this patch works against 1.4. Honestly, I don't really have the bandwidth to care about the trunk at the moment. So if we can figure this out against 1.4, then maybe it'll be easier to get it ported to trunk.

By: Michiel van Baak (mvanbaak) 2008-07-14 14:43:20

Then we'll have to wait roughly 20 hours.
Sleep and work are getting in the way, and I dont have time left today to install 1.4 and make my config compatible with it, sorry.

I'll checkout and compile it already on the box, and see how far I get with the configs. (All my stuff is generated, and it's 1.6/trunk only. Mostly because of the argument seperator etc. Will need to hack some stuff in my config generation scripts)

Sorry it will take so much time, but I need some time away from the pc every day to get sleep etc.

By: Michiel van Baak (mvanbaak) 2008-07-16 16:21:17

Did not have time to look into this one yet, and dont think I will before my 2 week holiday starting this weekend.

I know it breaks trunk, that is very clear.

By: Tilghman Lesher (tilghman) 2008-07-31 16:54:42

Given the history of this, I am skittish about any patch that removes the hack.  If you can produce a version of this patch that fixes ONLY the NEW part of this, and it works, we can discuss removing the hack in trunk only.

By: Tilghman Lesher (tilghman) 2008-09-11 17:43:28

No response from reporter.