Summary:ASTERISK-21409: [patch] - Race condition with IAX2 transfer, 2 releases happen on same call legs. locks up with many threads blocked by iax2_destroy_helper
Reporter:Alec Davis (alecdavis)Labels:
Date Opened:2013-04-11 00:19:50Date Closed:2013-06-10 02:37:03
Versions: 11.3.0 Frequency of
Environment:Debian SqueezeAttachments:( 0) coreshowlocks-31may.txt
( 1) full.may08.C-0000034cd.txt
( 2) gdb-31may.txt
( 3) iax2_transfer.diff.txt
( 4) iax2_transfer.diff2.txt
( 5) iax-coreshowlocks-may08-auckland.txt
( 6) iax-lock-asterisk-11.txt
( 7) iax-lock-asterisk-1-8-15.txt
( 8) var_log_messages.txt
Description:Intermittently all IAX calls stop over the trunk.

I think this is reproduced when a call is transferred back to the originating site, we have "transfer=on"

Also happened with asterisk SVN-branch-11-r382514.

Attached are the lockup, and the 11-branch lockup.

Comments:By: Alec Davis (alecdavis) 2013-04-11 00:25:23.598-0500

In the affected versions for 11 I picked 11.3.

Both iax-lock-asterisk files are the same server.
iax-lock-asterisk-1.8.15.txt was 7 March.
iax-lock-asterisk-11.txt was 21st March.

By: Alec Davis (alecdavis) 2013-04-11 01:13:31.100-0500

Attached var_log_messages.txt

By: Rusty Newton (rnewton) 2013-04-12 09:50:56.250-0500

Alec, be sure to get DEBUG,VERBOSE messages in your log for the next time this occurs. Can you also remove the debug from your "11/Apr/13 1:13 AM" comment and attach in separate files?

By: Rusty Newton (rnewton) 2013-04-12 09:53:51.740-0500

Of course if you can also nail down a way to reproduce this then that is always helpful!

By: Alec Davis (alecdavis) 2013-05-07 20:46:33.832-0500

another lockup today.
luckily was running debug today:)

Note: at the end of iax-coreshowlocks-may08-auckland the idle threads, normally there are 10.

edit 3rd June 2013 added below:  The releasing of channels is normally only done by one side, except when we lockup.
[Apr 23 10:47:56] VERBOSE[20815][C-00000285] chan_iax2.c:     -- Releasing IAX2/auckland-17178 and IAX2/auckland-4151
[Apr 26 14:17:29] VERBOSE[20814] chan_iax2.c:     -- Releasing IAX2/auckland-1574 and IAX2/auckland-16633
[Apr 26 14:20:08] VERBOSE[20810] chan_iax2.c:     -- Releasing IAX2/auckland-1495 and IAX2/auckland-17776
[May  8 09:41:06] VERBOSE[19832][C-00000339] chan_iax2.c:     -- Releasing IAX2/auckland-17778 and IAX2/auckland-15665
[May  8 10:03:14] VERBOSE[19829][C-0000033e] chan_iax2.c:     -- Releasing IAX2/auckland-17063 and IAX2/auckland-5019
[May  8 10:14:57] VERBOSE[19835] chan_iax2.c:     -- Releasing IAX2/auckland-16361 and IAX2/auckland-19586
[May  8 10:27:47] VERBOSE[19832] chan_iax2.c:     -- Releasing IAX2/auckland-12923 and IAX2/auckland-16803                <<< these legs >>>>
[May  8 10:27:47] VERBOSE[19834][C-0000034d] chan_iax2.c:     -- Releasing IAX2/auckland-16803 and IAX2/auckland-12923    <<< and these same legs being released by different threads >>>

By: Alec Davis (alecdavis) 2013-05-07 21:13:50.818-0500

C-000034cd.txt debug catching both calls.

C-000034c is the first call originate in Wellington, sent to Auckland receptionist over iax.

C-000034d is the Auckland  receptionist transferring the call back to a Wellington extension 8554.

By: Matt Jordan (mjordan) 2013-05-08 08:23:56.219-0500

That's rather strange - it appears to be blocking on the removal of the scheduled callback for the {{lagid}}:

AST_SCHED_DEL_SPINLOCK(ast_sched_thread_get_context(sched), pvt->pingid, &iaxsl[pvt->callno]);
pvt->pingid = DONT_RESCHEDULE;
AST_SCHED_DEL_SPINLOCK(ast_sched_thread_get_context(sched), pvt->lagid, &iaxsl[pvt->callno]);
pvt->lagid = DONT_RESCHEDULE;

And in fact, we can see that some thread has stalled out while holding this lock at this location:

=== Thread ID: 0xaf4c3b70 (iax2_process_thread  started at [12405] chan_iax2.c start_network_thread())
=== ---> Lock #0 (chan_iax2.c): MUTEX 1920 iax2_destroy_helper &iaxsl[pvt->callno] 0xb698cbe0 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8139751]
       /usr/sbin/asterisk(__ast_pthread_mutex_lock+0xac) [0x81329c1]
       /usr/lib/asterisk/modules/chan_iax2.so(+0x13860) [0xb683a860]
       /usr/sbin/asterisk() [0x81a922f]
       /usr/sbin/asterisk() [0x81ba1e3]
       /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb72f9955]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb77391de]
=== -------------------------------------------------------------------

We shouldn't be getting stuck on this - at most, we attempt to delete the scheduled ID 10 times:

int _sched_res = -1; \
while (id > -1 && (_sched_res = ast_sched_del(sched, id)) && ++_count < 10) { \
ast_mutex_unlock(lock); \
usleep(1); \
ast_mutex_lock(lock); \
} \

So, the question is: what is this thread waiting on?

A gdb backtrace may answer that.

By: Alec Davis (alecdavis) 2013-05-31 00:22:06.491-0500

Another lockup today. 2 threads trying to release the same call, notice the swapped around fr->callno and bridgecallno.
[May 31 14:44:01] VERBOSE[30820] chan_iax2.c:     -- Releasing IAX2/auckland-13262 and IAX2/auckland-20457
[May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c:     -- Releasing IAX2/auckland-20457 and IAX2/auckland-13262

[May 31 14:44:01] DEBUG[12985][C-00000536] channel.c: Returning from native bridge, channels: IAX2/auckland-13262, IAX2/auckland-20457
[May 31 14:44:01] VERBOSE[30820] chan_iax2.c:     -- Channel 'IAX2/auckland-13262' ready to transfer
[May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c:     -- Channel 'IAX2/auckland-20457' ready to transfer
[May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c:     -- Releasing IAX2/auckland-20457 and IAX2/auckland-13262
[May 31 14:44:01] VERBOSE[30820] chan_iax2.c:     -- Releasing IAX2/auckland-13262 and IAX2/auckland-20457
[May 31 14:44:01] DEBUG[30824][C-00000536] sched.c: Attempted to delete nonexistent schedule entry 209951!
[May 31 14:44:01] ERROR[30824][C-00000536] lock.c: chan_iax2.c line 1918 (iax2_destroy_helper): mutex '&iaxsl[pvt->callno]' freed more times than we've locked!
[May 31 14:44:01] ERROR[30824][C-00000536] lock.c: chan_iax2.c line 1918 (iax2_destroy_helper): Error releasing mutex: Operation not permitted
[May 31 14:44:01] DEBUG[30818][C-00000536] chan_iax2.c: Ooh, voice format changed to 'alaw'

re: chan_iax.c line ~11558, stop_stuff(iaxs[fr->callno]->bridgecallno);
Shouldn't it have a lock around it, as iax2_destroy_helper() expects the pvt to be locked.
fr-callno is already locked, but the bridgedcall isn't. But this will lead to a deadlock!

By: Alec Davis (alecdavis) 2013-05-31 00:27:34.532-0500

gdb-31may.txt attached

By: Alec Davis (alecdavis) 2013-05-31 00:31:15.043-0500

coreshowlocks-31may.txt attached.

also output from 'iax2 show threads.'

where have all the threads gone??
auck1*CLI> iax2 show threads
IAX2 Thread Information
Idle Threads:
Thread 8: state=0, update=1, actions=670328, func=''
Active Threads:
Thread P4: state=1, update=2989, actions=664243, func='socket_process'
Dynamic Threads:
2 of 10 threads accounted for with 0 dynamic threads{code}

By: Alec Davis (alecdavis) 2013-05-31 21:16:54.524-0500

Since our previous lockup on 8 May.
The Releasing of the IAX channels is normally only done once.

[May  8 12:11:25] VERBOSE[17711][C-00000017] chan_iax2.c:     -- Releasing IAX2/auckland-18215 and IAX2/auckland-1116
[May  8 12:36:15] VERBOSE[17716][C-0000001c] chan_iax2.c:     -- Releasing IAX2/auckland-20953 and IAX2/auckland-5806
[May  8 12:44:46] VERBOSE[17708] chan_iax2.c:     -- Releasing IAX2/auckland-2666 and IAX2/auckland-22666
[May  8 13:59:42] VERBOSE[17714][C-0000002f] chan_iax2.c:     -- Releasing IAX2/auckland-18982 and IAX2/auckland-4754
[May  8 14:02:00] VERBOSE[17709] chan_iax2.c:     -- Releasing IAX2/auckland-1687 and IAX2/auckland-17678
[May  8 14:08:14] VERBOSE[17712] chan_iax2.c:     -- Releasing IAX2/auckland-3232 and IAX2/auckland-21935
[May  8 14:25:05] VERBOSE[17715] chan_iax2.c:     -- Releasing IAX2/auckland-2375 and IAX2/auckland-17247
[May  8 14:42:11] VERBOSE[17711] chan_iax2.c:     -- Releasing IAX2/auckland-2421 and IAX2/auckland-17997
[May  8 14:46:32] VERBOSE[17716] chan_iax2.c:     -- Releasing IAX2/auckland-2231 and IAX2/auckland-16676
[May 16 10:47:28] VERBOSE[24320][C-000002cd] chan_iax2.c:     -- Releasing IAX2/auckland-18387 and IAX2/auckland-10007
[May 16 13:34:18] VERBOSE[24317][C-000002f6] chan_iax2.c:     -- Releasing IAX2/auckland-23355 and IAX2/auckland-10086
[May 16 13:47:45] VERBOSE[24320][C-000002fc] chan_iax2.c:     -- Releasing IAX2/auckland-19838 and IAX2/auckland-13305
[May 23 11:09:22] VERBOSE[30826] chan_iax2.c:     -- Releasing IAX2/auckland-5967 and IAX2/auckland-17597
[May 23 12:14:39] VERBOSE[30823][C-0000017b] chan_iax2.c:     -- Releasing IAX2/auckland-17653 and IAX2/auckland-13270
[May 23 12:15:08] VERBOSE[30823] chan_iax2.c:     -- Releasing IAX2/auckland-5609 and IAX2/auckland-17390
[May 23 12:41:15] VERBOSE[30817] chan_iax2.c:     -- Releasing IAX2/auckland-3055 and IAX2/auckland-18415
[May 23 12:57:37] VERBOSE[30821][C-00000186] chan_iax2.c:     -- Releasing IAX2/auckland-21292 and IAX2/auckland-1604
[May 23 13:25:55] VERBOSE[30822] chan_iax2.c:     -- Releasing IAX2/auckland-10039 and IAX2/auckland-19104
[May 23 13:28:32] VERBOSE[30824] chan_iax2.c:     -- Releasing IAX2/auckland-1960 and IAX2/auckland-19964
[May 31 13:35:11] VERBOSE[30817][C-0000051e] chan_iax2.c:     -- Releasing IAX2/auckland-16855 and IAX2/auckland-9197
[May 31 14:10:53] VERBOSE[30825] chan_iax2.c:     -- Releasing IAX2/auckland-8691 and IAX2/auckland-20680
*strong*[May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c:     -- Releasing IAX2/auckland-20457 and IAX2/auckland-13262  <<< these legs >>>
*strong*[May 31 14:44:01] VERBOSE[30820] chan_iax2.c:     -- Releasing IAX2/auckland-13262 and IAX2/auckland-20457              <<< both released by 2 threads >>>


By: Alec Davis (alecdavis) 2013-06-04 04:02:22.844-0500

How the sequence below happens
[May 31 14:44:01] VERBOSE[30820] chan_iax2.c:     -- Channel 'IAX2/auckbdt-13262' ready to transfer
[May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c:     -- Channel 'IAX2/auckbdt-20457' ready to transfer
[May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c:     -- Releasing IAX2/auckbdt-20457 and IAX2/auckbdt-13262
[May 31 14:44:01] VERBOSE[30820] chan_iax2.c:     -- Releasing IAX2/auckbdt-13262 and IAX2/auckbdt-20457

this race condition happens in chan_iax.c line ~11529

               if ((iaxs[fr->callno]->transferring == TRANSFER_BEGIN) || (iaxs[fr->callno]->transferring == TRANSFER_MBEGIN)) {
if (iaxs[fr->callno]->transferring == TRANSFER_MBEGIN)
iaxs[fr->callno]->transferring = TRANSFER_MREADY;
iaxs[fr->callno]->transferring = TRANSFER_READY;
ast_verb(3, "Channel '%s' ready to transfer\n", iaxs[fr->callno]->owner ? ast_channel_name(iaxs[fr->callno]->owner) : "<Unknown>");

/* If the other thread involved in the transfer kicks in now we'll run the following code twice */

                       if (iaxs[fr->callno]->bridgecallno) {
                       if ((iaxs[iaxs[fr->callno]->bridgecallno]->transferring == TRANSFER_READY) || (iaxs[iaxs[fr->callno]->bridgecallno]->transferring == TRANSFER_MREADY)) {
/* They're both ready, now release them. */

The example below is two IAX_COMMAND_TXREADYs are processed slightly after one another (enough time for the routine to exit), one on each leg of a call to be transferred.
The code current expects the 1st IAX_COMMAND_TXREADY to arrive, setting 'transferring' to TRANSFER_READY/MREADY and that bridged->transferring is not READY.
Then (on the bridged call leg) the 2nd IAX_COMMAND_TXREADY arrives, setting 'transferring' to TRANSFER_READY/MREADY, and bridged->transferring is already TRANSFER_READY/MREADY so then will release them.

But when the 1st IAX_COMMAND_TXREADY thread is interrupted before it's got to '->transferring = TRANSFER_MEDIA' by the 2nd IAX_COMMAND_TXREADY, things go wrong, the Releasing code gets run twice.

By: Alec Davis (alecdavis) 2013-06-04 04:48:06.915-0500

attached iax2_transfer.diff.txt

lock the bridged callno before setting '->transferring=TRANSFER_READY/MRERADY', with appropriate deadlock avoidance.

By: Alec Davis (alecdavis) 2013-06-04 06:55:49.652-0500

attached iax2_transfer_diff2.txt

refactored IAX_COMMAND_TXREADY decision logic.
.1 get a lock on both legs if bridgecallno exists.

.2 these tests are now needed as we the pvt is unlocked for Deadlock avoidance
.2a check if transferring == MBEGIN then MREADY
.2b check if transferring == BEGIN then READY
.2c else if transferrring!= above tests then exit

.3a remove redundant checks of thisside->MREADY||READY, we can use the bridgedcallno-transferring.
.3b when checking bridgedcallno->transferring positively identify MREADY or READY

many test call using ORIGINATE(), never wen't wrong
CLI output:

With locking, never would go wrong :)
   -- IAX2/auckland-20910 is proceeding passing it to IAX2/auckland-17388
   -- IAX2/auckland-20910 answered IAX2/auckland-17388
   -- Channel 'IAX2/auckland-20910' ready to transfer
   -- Channel 'IAX2/auckland-17388' ready to transfer
   -- Releasing IAX2/auckland-17388 and IAX2/auckland-20910

=== Examples without locking: ===

Without locking bridgecallno I could get the order wrong!
   -- IAX2/auckland-19071 is proceeding passing it to IAX2/auckland-17663
   -- IAX2/auckland-19071 answered IAX2/auckland-17663
   -- Channel 'IAX2/auckland-19071' ready to transfer
   -- Releasing IAX2/auckland-19071 and IAX2/auckland-17663
   -- Channel 'IAX2/auckland-17663' ready to transfer

Another wrong sequence without locking, server is under load - disk back up.
These 2 interleave each thread.
   -- Channel 'IAX2/auckbdt-19065' ready to transfer
   -- Channel 'IAX2/auckbdt-20047' ready to transfer
   -- Releasing IAX2/auckbdt-19065 and IAX2/auckbdt-20047
   -- Releasing IAX2/auckbdt-20047 and IAX2/auckbdt-19065
   -- Channel 'IAX2/auckbdt-19065' finished transfer (this is just before we exit the case statement)
   -- Channel 'IAX2/auckbdt-20047' finished transfer