[Home]

Summary:ASTERISK-09946: Old LAGRQ frames showing up in new IAX2 calls
Reporter:mihai (mihai)Labels:
Date Opened:2007-07-24 10:52:35Date Closed:2007-08-02 12:47:58
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_iax2
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 06asterisk-1.4-dynamic-thread-creation-race.patch
( 1) 06asterisk-1.4-dynamic-thread-creation-race-v2.patch
( 2) clean_frames_check_and_lock.patch
( 3) clean_frames_locked.patch
( 4) clean_frames.patch
( 5) vnak_storm-call168.cap.gz
Description:If there are two successive IAX2 calls having the same source and destination call id, sometimes old LAGRQ frames belonging to the first call are transmitted as part of a VNAK retransmission during the second call. Since the old LAGRQ have wildly out of order sequence numbers, the other endpoint will request retransmission, which can cause a VNAK storm.
I am able to reproduce this by stress testing chan_iax2 with an automated script that generates about 3 calls per second, up to about 200 simultaneous calls.  This will pretty much guarantee that source ids will be recycled on the server side, which can trigger this issue.
Comments:By: mihai (mihai) 2007-07-24 10:56:44

I have uploaded a file demonstrating the problem.  This is a filtered trace showing only full frames belonging to all calls with source id 168 and destination id 1 (on the server side).  The first incorrect LAGRQ is at packet ASTERISK-83. You will see that this frame is a duplicate of frame ASTERISK-73, which belongs to a call that ended about a minute before.

By: Simon Kirby (sim) 2007-07-25 13:23:26

I am seeing the exact same problem and have many traces of this issue, but only from production.  I cannot seem to reproduce this with a test setup, though I may just not have waited long enough.

In the traces I have captured, I am seeing less variation than your trace and instead just a complete all-out storm of VNAK, LAGRQ, LAGRP.  Of note here is that the embedded timestamp stays the same on the LAGRQ, LAGRP packets, but increases with the VNAK packets.

How were you able to get this to happen in your environment?  Did you have to fake any packet loss, lag, or reordering?

By: mihai (mihai) 2007-07-25 18:09:40

I have a script that runs a modified version of iaxclient in a loop. I start 3 calls every second.  Each call lasts about 30-40 seconds.  Asterisk is set to emulate 30% packet loss on both inbound and outbound traffic.  This is done using iax2 test losspct 30, but I had to modify the code so that packet loss happen on outbound traffic as well (standard asterisk only simulates inbound packet loss).  You could also use an external traffic shaper to obtain similar results.

The problem happens when a client issues a VNAK.  In some cases Asterisk has late frames in its retransmission queue, which in turn will generate more VNAKs and so on, exponentially.

VNAK frames have their own timestamp.  Retransmitted frames have their original timestamps, that's why you see that timestamp discrepancy.

By: Simon Kirby (sim) 2007-07-25 19:30:07

I was able to reduce a loop between VNAK/HANGUP just now, but now it doesn't seem to be consistent.  I am using a single call which dials into this dialplan:

[default]
exten => _1XXX,1,Answer
exten => _1XXX,n,Wait(.1)
exten => _1XXX,n,Dial(IAX2/t3/$[${EXTEN} - 1]@default)
exten => _1000,1,Answer
exten => _1000,n,Echo

t3 is the remote. on t3, change that to t1.

This allows n+1 channels to be opened by simply dialing 1000 + n.

By: Simon Kirby (sim) 2007-07-25 19:30:27

This keeps sounding more like a design issue to me.

My first feeling when looking at how VNAK works was that is an odd approach, but I am used to TCP where the sender retransmits until it gets an ACK.

Isn't it broken that VNAK could ever be sent to a frame that is retransmitted as part of a previous VNAK request?  To put another way, shouldn't frames retransmitted as a result of a VNAK contain a flag that makes them never generate another VNAK?

By: mihai (mihai) 2007-07-25 21:39:58

Yes, that could probably be done better.  You can also argue that if a bunch of frames are out of order, there is no need to send a separate VNAK for each, when one would suffice.

However, it is probably too late to change VNAK behavior in IAX at this time, at least for the 1.4 branch. I am more concerned with the cause of these stale frames. It probably means that there is a race condition somewhere that can create a lot of headache later on.

By: jeff oconnell (jeff_oconnell) 2007-07-25 22:05:10

sim: we've also started to discuss limiting the number of VNAKs iaxclient will send for a given frame.  we're going to check what the protocol requires and then see what the pros and cons are putting limits in. if you have any thoughts, please pass them along.

By: mihai (mihai) 2007-07-26 20:53:43

I've looked into this problem a little bit more and I've found something interesting: there are a lot of frames in the retransmission queue (iaxq) that have the retries field set to -1.  If my understanding of the code is correct, retries is -1 when the frame needs to be cleared up (either because it was acked, or because the call was terminated, or any other reason).  Now AFAIK, there are only two places that that clear up frames: one in attempt_transmit and another in network_thread.  The code in network_thread is never reached, because the sentyet flag is normally true.  Which leaves the code in attempt_transmit

Now this is where it gets interesting.  When a frame is scheduled for retransmission, the scheduler calls attempt_transmit, which in turn decides whether to remove the frame and free it or just to increment its retries counter.  The effect of this is that if the frame is marked for deletion now, it could take as much as MAX_RETRY_TIME (10 sec) before the scheduler triggers a call to attempt_transmit and gives it a chance to remove the frame.

This in itself is not such a big deal, however, if something goes wrong with the scheduler, the frame will never get removed from the queue.  Now assume that the call is over and after a while, a new call gets assigned to that slot.  If a VNAK request is made, chances are the queued up frames will show up and start the VNAK storm.

I am not saying that this is what happens, since I have not figured out what goes wrong yet.  All I am saying is that this is a strong possibility, and matches what we have seen in traces and core dumps.



By: mihai (mihai) 2007-07-30 15:28:30

Uploaded a patch that attempts to solve the problem.  The patch changes the following:
- adds code to network_thread() that cleans up frames that have retries == -1
- Comments out the code that cleans up frames in __attempt_transmit
- Adds a couple of checks to vnak_retransmit that prevents it from sending out incorrect frames

In my tests this patch works fine as long as the multithreaded scheduler is disabled.  If enabled, I get random crashes.

By: mihai (mihai) 2007-07-30 16:00:31

New and improved patch. Now we lock iaxsl[callno] before removing the frame. Hopefully this will clean up our synchronization issues

By: mihai (mihai) 2007-07-31 10:26:45

Well, turns out that the patch is not good enough.  There is still a race condition between the scheduler thread and the network thread. If a scheduled event is triggered and immediately after, the network thread clears the frame, then __attempt_transmit might not be able to extract the correct callno which can lead to crashes

By: mihai (mihai) 2007-07-31 15:22:21

Uploaded yet another patch.  This time we check if the pointer provided to __attempt_transmit is valid.  In order to do that, we scan the retransmit queue, and try to find our frame in there. Once we get our frame, we know we have a valid callno, so we can go ahead, lock our mutex and do our thing.

This is not optimal, performance-wise.  Scanning the list is a O(n) operation and we do it for each retransmit attempt. However, the retransmit queue should not get too long, except for extreme conditions, so I believe it's an acceptable tradeoff (until someone thinks of something better).

By: mihai (mihai) 2007-08-01 13:34:57

We (SteveK and I) believe we have found the root cause of this problem:

It all starts from a race condition between the moment a dynamic thread is created and when it is used.  When schedule_action requests a thread, if there are no available pooled threads, a new dynamic thread is created. NOrmally, this thread will grab its mutex and then wait for a condition.  Schedule action would set the thread callback and data and then signal the thread to start running.

However, if the signal arrives before the thread has acquired its mutex, it will ignored. The thread will go into sleep and will timeout after 30 seconds without executing its task. If the callback is attempt_transmit, the frame is not retransmitted and, more importantly, is not freed. The scheduler has no other chance to reschedule this event, so the frame stays there forever.

Patch to follow soon...

By: mihai (mihai) 2007-08-01 14:38:48

Uploaded two patches that attempt to solve the problem.
The first one is just a first attempt, and has some issues.
The second one is better and more elegant. Here's what it does:
- add a flag to iax_thread that signals if the thread is ready to accept conditions
- after creating the thread, wait in a spinloop for the flag to pop up
- in iax2_process_thread, set the flag after we lock our mutex

This way we ensure that by the time schedule_action gets its thread, the thread will be ready

By: mihai (mihai) 2007-08-01 14:56:57

There is a second thread condition in iax2_process_thread that could, potentially, trigger the same problem:

When a dynamic thread times out, it will unlock its mutex and then remove itself from the dynamic thread list.  If, between these two operations, the thread is picked up by find_idle_thread and returned to schedule_action, the action never gets executed.

One possible way of solving this would be to check the return value of AST_LIST_REMOVE, and if it is NULL, execute the action anyways... dunno if it is the best solution though...

By: Digium Subversion (svnbot) 2007-08-01 16:58:55

Repository: asterisk
Revision: 77887

------------------------------------------------------------------------
r77887 | russell | 2007-08-01 16:58:53 -0500 (Wed, 01 Aug 2007) | 23 lines

Fix some race conditions which have been causing weird problems in chan_iax2.
The most notable problem is that people have been seeing storms of VNAK frames
being sent due to really old frames mysteriously being in the retransmission
queue and never getting removed.

It was possible that a dynamic thread got created, but did not acquire its lock
before the thread that created it signals it to perform an action.  When this
happens, the thread will sleep until it hits a timeout, and then get destroyed.
So, the action never gets performed and in some cases, means a frame doesn't
get transmitted and never gets freed since the scheduler never gets a chance
to reschedule transmission.

Another less severe race condition is in the handling of a timeout for a dynamic
thread.  It was possible for it to be acquired to perform at action at the same
time that it hit a timeout.  When this occurs, whatever action it was acquired
for would never get performed.

(patch contributed by Mihai and SteveK)
(closes issue ASTERISK-9946)
(closes issue ASTERISK-9912)
(closes issue ASTERISK-9902)
(possibly related to issue ASTERISK-10004)

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-08-01 17:06:59

Repository: asterisk
Revision: 77889

------------------------------------------------------------------------
r77889 | russell | 2007-08-01 17:06:58 -0500 (Wed, 01 Aug 2007) | 31 lines

Merged revisions 77887 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r77887 | russell | 2007-08-01 17:16:17 -0500 (Wed, 01 Aug 2007) | 23 lines

Fix some race conditions which have been causing weird problems in chan_iax2.
The most notable problem is that people have been seeing storms of VNAK frames
being sent due to really old frames mysteriously being in the retransmission
queue and never getting removed.

It was possible that a dynamic thread got created, but did not acquire its lock
before the thread that created it signals it to perform an action.  When this
happens, the thread will sleep until it hits a timeout, and then get destroyed.
So, the action never gets performed and in some cases, means a frame doesn't
get transmitted and never gets freed since the scheduler never gets a chance
to reschedule transmission.

Another less severe race condition is in the handling of a timeout for a dynamic
thread.  It was possible for it to be acquired to perform at action at the same
time that it hit a timeout.  When this occurs, whatever action it was acquired
for would never get performed.

(patch contributed by Mihai and SteveK)
(closes issue ASTERISK-9946)
(closes issue ASTERISK-9912)
(closes issue ASTERISK-9902)
(possibly related to issue ASTERISK-10004)

........

------------------------------------------------------------------------

By: Jason Parker (jparker) 2007-08-01 17:37:56

Reopened at request from sim.

By: Jason Parker (jparker) 2007-08-01 18:07:14

Re-closed.  There was a bit of confusion.

By: Digium Subversion (svnbot) 2007-08-02 12:47:18

Repository: asterisk
Revision: 77943

------------------------------------------------------------------------
r77943 | russell | 2007-08-02 12:47:17 -0500 (Thu, 02 Aug 2007) | 9 lines

Fix another race condition in the handling of dynamic threads.  If the dynamic
thread timed out waiting for something to do, but was acquired to perform an
action immediately afterwords, then wait on the condition again to give the
other thread a chance to finish setting up the data for what action this thread
should perform.  Otherwise, if it immediately continues, it will perform the
wrong action.
(reported on IRC by mihai, patch by me)
(related to issue ASTERISK-9946)

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-08-02 12:47:58

Repository: asterisk
Revision: 77944

------------------------------------------------------------------------
r77944 | russell | 2007-08-02 12:47:57 -0500 (Thu, 02 Aug 2007) | 17 lines

Merged revisions 77943 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r77943 | russell | 2007-08-02 13:04:43 -0500 (Thu, 02 Aug 2007) | 9 lines

Fix another race condition in the handling of dynamic threads.  If the dynamic
thread timed out waiting for something to do, but was acquired to perform an
action immediately afterwords, then wait on the condition again to give the
other thread a chance to finish setting up the data for what action this thread
should perform.  Otherwise, if it immediately continues, it will perform the
wrong action.
(reported on IRC by mihai, patch by me)
(related to issue ASTERISK-9946)

........

------------------------------------------------------------------------