ASTERISK-09380: chan_iax2.c deadlock in connection with iax2

[Home]

Summary: ASTERISK-09380: chan_iax2.c deadlock in connection with iax2_destroy

Reporter: Steve Davies . (stevedavies) Labels:

Date Opened: 2007-05-04 02:15:59 Date Closed: 2007-07-11 19:59:09

Priority: Blocker Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) iax2-destroy-deadlock.1.4.patch
( 1) iax2-destroy-deadlock.patch

Description: Been having 1.4 trunk deadlocks within chan_iax2.c.

Here's one deadlock I investigated in detail:

> -rw------- 1 asterisk asterisk 67301376 May 2 08:28 core.16841

When I noticed the problem, Asterisk had nearly 30 incoming calls from PRI that were all in "Ring" state trying to get to a co-located customer over IAX.

This was a deadlock (obviously)

Here are my debugging notes:

====================================================
Looking at some of the stuck threads:

Threads 2, 3, 4, 5, 6, 7, 8, 9, 10 (etc?) were at chan_iax2.c line 1325 trying to lock &iaxsl[1]

So: many threads are trying to get iaxsl[1]

So who is holding &iaxsl[1]?

Answer is pid 29209 which is thread 28.

It is busy transmitting a frame on callno 1.

static int iax2_transmit(struct iax_frame *fr)
{
/* Lock the queue and place this packet at the end */
/* By setting this to 0, the network thread will send it for us, and
queue retransmission if necessary */
fr->sentyet = 0;
AST_LIST_LOCK(&queue); <--- HERE
AST_LIST_INSERT_TAIL(&queue, fr, list);
AST_LIST_UNLOCK(&queue);
/* Wake up the network and scheduler thread */
pthread_kill(netthreadid, SIGURG);
signal_condition(&sched_lock, &sched_cond);
return 0;
}

It was busy trying to lock AST_LIST_LOCK(&queue).
And was holding iaxsl[1], which is reasonable seeing we're sending a frame on that call.

&queue->lock was held by 15220 (thread 55)

Thread 55 was busy trying to destroy call 52 out of socket_process:

ASTERISK-8 0x00002e4ad10398f5 in iax2_destroy (callno=52) at chan_iax2.c:1817
ASTERISK-9 0x00002e4ad1053acf in socket_process (thread=0x263d588d100) at chan_iax2.c:6869
ASTERISK-10 0x00002e4ad105c95b in iax2_process_thread (data=0x263d588d100) at chan_iax2.c:7962

socket_process around 6869:

for (x=iaxs[fr->callno]->rseqno; x != fr->iseqno; x++) {
/* Ack the packet with the given timestamp */
if (option_debug && iaxdebug)
ast_log(LOG_DEBUG, "Cancelling transmission of packet %d\n", x);
AST_LIST_LOCK(&queue);
AST_LIST_TRAVERSE(&queue, cur, list) {
/* If it's our call, and our timestamp, mark -1 retries */
if ((fr->callno == cur->callno) && (x == cur->oseqno)) {
cur->retries = -1;
/* Destroy call if this is the end */
if (cur->final) {
if (iaxdebug && option_debug)
ast_log(LOG_DEBUG, "Really destroying %d, having been acked on final message\n", fr->callno);
iax2_destroy(fr->callno); <--- HERE <----
}
}
}
AST_LIST_UNLOCK(&queue);
}

I believe that logic is that frames to be sent can be tagged as "final".
Then, when that frame is acked we can destroy the iax call.

So thread 55 is doing iax2_destroy because its received an ACK for a "final" frame that we sent.
Notice that &queue is held locked whilst iax2_destroy is called.

In iax2_destroy we're at line 1817:

1812 if (owner) {
1813 if (ast_mutex_trylock(&owner->lock)) {
1814 ast_log(LOG_NOTICE, "Avoiding IAX destroy deadlock\n");
1815 ast_mutex_unlock(&iaxsl[callno]);
1816 usleep(1);
1817 ast_mutex_lock(&iaxsl[callno]); <--- HERE <----
1818 goto retry;
1819 }
1820 }

In summary, thread 55 is holding:
&queue
And trying to get:
&owner (the incoming zap channel) (which is locked by 15548 (thread 29)
It did have iaxsl[52], but released it and is trying to get it back (but thread 29 grabbed it)

iaxsl[52] is locked by 15548, which is thread 29.
&owner (the incoming zap channel) is also locked by 15548 (thread 29)

Looks like during the usleep(1), thread 29 below grabbed iaxsl[52] and got this loop stuck.

Thread (29) was also in iax2_destroy, and also trying to destroy call 52 !?

ASTERISK-10 0x00002e4ad10399a9 in iax2_destroy (callno=52) at chan_iax2.c:1837
ASTERISK-11 0x00002e4ad103f35b in iax2_hangup (c=0x263d5a79b30) at chan_iax2.c:3099
ASTERISK-12 0x00000263d55e3bd6 in ast_hangup (chan=0x263d5a79b30) at channel.c:1658
ASTERISK-13 0x00002e4ad2b5afab in wait_for_answer (in=0x263d5a75370, outgoing=0x263d5a00330, to=0x4715892c, peerflags=0x471591b0, pa=0x47158980, num_in=0x47158900,
priority_jump=0, result=0x4715885c) at app_dial.c:618
ASTERISK-14 0x00002e4ad2b5f2d3 in dial_exec_full (chan=0x263d5a75370, data=0x47159740, peerflags=0x471591b0, continue_exec=0x0) at app_dial.c:1486
ASTERISK-15 0x00002e4ad2b60501 in dial_exec (chan=0x263d5a75370, data=0x47159740) at app_dial.c:1737

It was doing iax2_hangup:

3095 /* If we were already gone to begin with, destroy us now */
3096 if (alreadygone) {
3097 if (option_debug)
3098 ast_log(LOG_DEBUG, "Really destroying %s now...\n", c->name);
3099 iax2_destroy(callno); <----- HERE
3100 }

In iax2_destroy it was at line 1837:

1832 /* If there's an owner, prod it to give up */
1833 owner->_softhangup |= AST_SOFTHANGUP_DEV;
1834 ast_queue_hangup(owner);
1835 }
1836
1837 AST_LIST_LOCK(&queue); <--- HERE <---
1838 AST_LIST_TRAVERSE(&queue, cur, list) {
1839 /* Cancel any pending transmissions */
1840 if (cur->callno == pvt->callno)
1841 cur->retries = -1

So we got through the preceding lock loop at 1812ff. Consequently we must have
the &owner lock and the iaxsl[52] lock.

Now we are trying to get &queue, but that was locked by 15220 which is thread 55 again

=================================================

So the basic problem that caused this deadlock is calling iax2_destroy out of socket_process with the &queue mutex held.
Or, maybe the fundamental issue is having two threads both trying to destroy the same channel

Cos what happened here is something like this (don't think this is exactly right, but it has the gist):

(Bear in mind that this IAX2 peer is really on the same box so the iax2 latency is very very low)

1) Something did a send_command_final on the call 52.
2) Now the ack of that frame turns up - a separate thread is scheduled to process it
3) That thread (55) runs socket_process, locks &queue, finds the acked frame with final set and starts iax2_destroy
4) iax2_destroy sets IAX_ALREADYGONE on the channel
5) It happens that on thread 29 iax2_hangup gets called via ast_hangup
6) Somewhere &owner (zap channel) gets locked.
7) Lock iaxsl[52]
8) call iax2_destroy for call 52.
9) Call 52 hasn't got an owner any more because thread 29 predestroyed it. So no need to try to lock &owner
10) So now thread 29 gets to the lock of &queue and can't get it because thread 55 has got it
11) Thread 55 now can't lock &owner because thread 29 has it.
12) So it goes into deadlock avoid mode.
13) But that makes things worse because thread 55 releases iaxsl[52] and thread 29 grabs it.

So now thread 29 (which is running the zap channel) is holding lock for the incoming zap channel, and the lock on iaxsl[52], and is trying to get &queue

Thread 55 is holding &queue. It also _was_ holding iaxsl[52]. Its trying to get the lock on the incoming zap channel. In the process of trying to get that lock it lost the iaxsl[52] lock to thread 29.

So - thread 29 can't continue until it gets &queue. And it won't release the iaxsl[52] lock until it can continue.
And thread 55 can't continue until it gets iaxsl[52] back again. And it won't release &queue until it can continue.

So they are deadlocked.

Thread 28 comes along and locks iaxsl[1] and then tries to get &queue. So it is stuck.

Everything else ends up stuck waiting on iaxsl[1]

I have a second deadlock coredump that is on the same theme.

I have a possible fix for the problem that I'll post on this bug if it works for me.

****** ADDITIONAL INFORMATION ******

Unfortunately I can't supply the gdb output because I forgot to save it before I rebuilt my Asterisk with a possible fix. But I think my trace through should be sufficient.

I'll post my fix on this bug in a couple of days if it has proven effective for me. The patch moves the call to iax2_destroy in socket_process out to after the &queue mutex is unlocked.

Comments: By: Steve Davies . (stevedavies) 2007-05-14 14:21:33

Hi,

My svn trunk asterisk has managed 8 days running my production load since I installed this patch and the one I've posted in 9642. And still going strong. That beats the 1 to 2 segfaults or deadlocks per day I was getting before.

So here it is. It fixes the issue uncovered in my analysis above.

I've uploaded iax2-destroy-deadlock.patch for SVN trunk (against 62242),
and iax2-destroy-deadlock.1.4.patch for SVN branch-1.4 (against 62989).

Mihai - will you try it out in your environment and report back?

Please see also my bug 9642 where I've posted a patch that is my attempt to fix various segfaults caused by iaxs[] pointers going to null underneath various functions.

Regards,
Steve
By: Steve Davies . (stevedavies) 2007-05-14 14:23:40

And can someone with sufficient rights rename this bug with a [patch] please.

My disclaimer is on file.

Steve
By: mihai (mihai) 2007-05-15 09:53:00

I'm already running your patch on my dev box and it seems to be fine. I'm pushing it for adoption on our test/production systems. We have not experienced any deadlocks, just crashes, so I'm not sure if my feedback would be very valuable to you... I guess we can at least verify that it does not break anything else :)

Mihai
By: jeff oconnell (jeff_oconnell) 2007-05-21 12:07:19

steve: have you been running your patch on production servers? any problems? ( i assume no, but thought i should ask.... :) )
By: Steve Davies . (stevedavies) 2007-05-31 13:11:39

Hi Jeff,

Yes, I run these changes and Mihai's other iax stability fixes on my production box and they've made a huge difference.

I have a fix for an additional crash that I'll upload soon.

Steve
By: jeff oconnell (jeff_oconnell) 2007-06-03 16:44:20

steve, thanks for the update. we've applied your patch to our v1.4.2 sources and moved it into limited production, as well. so far so good. :)
By: Russell Bryant (russell) 2007-06-04 10:51:47

This patch has been applied to 1.4 and trunk in revisions 67020 and 67022. Thank you very much for your work!