Summary:ASTERISK-07204: [patch] timing issue (race) with poke/pong for very close peers can cause peer to be declared unreachable
Reporter:Steve Davies . (stevedavies)Labels:
Date Opened:2006-06-19 15:40:06Date Closed:2006-06-21 11:52:34
Versions:Frequency of
Environment:Attachments:( 0) dodgepokepongrace.patch

This is a subtle one needing just the right conditions - so please follow along carefully...

Disclaimer is on file.

I was doing testing with two hosts - actually running on the same server using Linux vserver.

Asterisk trunk as of today (6/19/06)

Seeing they were on the same host, there was very low latency between the two.  1ms as measured by qualify, 0.04 msec by ping.  The test machine was an SMP box with a dual-core processor.

I was suprised to find Asterisk repeatedly declaring the peer UNREACHABLE and then REACHABLE again and so on and so on.  Looking in the logs, two POKEs were being sent at more or less the same time to the same peer - looks like the sched tasks had been lost track of.  Then an extra poke_noanswer task ran and called the peer down.

After investigation, my opinion is that the problem is caused in this code in iax2_poke_peer:

       if (peer->pokeexpire > -1)
               ast_sched_del(sched, peer->pokeexpire);
       /* Speed up retransmission times */
       iaxs[peer->callno]->pingtime = peer->maxms / 4 + 1;
       iaxs[peer->callno]->peerpoke = peer;
       send_command(iaxs[peer->callno], AST_FRAME_IAX, IAX_COMMAND_POKE, 0, NULL, 0, -1);

       /* SLD: It would be bad for PONG to arrive now already!
            cos then there'd be nothing to ast_sched_del it and the peer
            will inevitably end up being declared unreachable */
       /* If the host is already unreachable then use the unreachable interval instead */
       if (peer->lastms < 0) {
               peer->pokeexpire = ast_sched_add(sched, peer->pokefreqnotok, iax2_poke_noanswer, peer);
       } else
               peer->pokeexpire = ast_sched_add(sched, DEFAULT_MAXMS * 2, iax2_poke_noanswer, peer);

This code sends the POKE, then schedules the poke_noanswer.  Now if the PONG reply arrives BEFORE the poke_noanswer is scheduled - where my comment is - then note what happens:

 1) The POKE is sent
 2) The arriving PONG is processed in socket_process:
         peer is declared REACHABLE
         the next iax2_poke_peer is scheduled (and remembered in pokeexpire)
 3) and now, in code above, poke_noanswer is scheduled.

But this poke_noanswer is scheduled too late - the arrival of the PONG is supposed to cancel it, but that did already arrive.  So poke_noanswer will fire and call the host UNREACHABLE.  

In my patch I resequence the code to make sure that we have the poke_noanswer scheduled BEFORE we send the POKE.  This avoids the race.  Similary before we send the ACK.

The issue seems to be present in branch-1.2 too, though I don't think I ever saw the problem.  Probably the problem is shown up by the addition of the SCHED_MULTITHREADED stuff which increases the chance for out-of-order execution.

Comments:By: Andrew Kohlsmith (akohlsmith) 2006-06-20 08:58:43

I've seen this for a VERY long time (at least a year or so).  I'll give this a shot and report back.

It wasn't ever a big problem but it was annoying that it'd happen... I'd not be able to receive calls for about 30 seconds to a minute before it decided to retry and suddenly I was available again.

By: Joshua C. Colp (jcolp) 2006-06-21 11:52:33

Very nice catch, gave you a little bit of extra karma since this probably took some good time to track down. Now, I only fixed this in trunk because as you said the scheduler is what uncovered it, in 1.2 the way chan_iax2 is - this shouldn't ever happen. If you do feel though that it would be good to have it in 1.2, reopen  and we can discuss. Thanks a lot!