Summary: ASTERISK-21128: Locking inversion when attempting to set caller ID while holding iaxsl lock causes deadlock
Reporter: Pavel Troller (patrol-cz)
Labels:
Date Opened: 2013-02-17 14:50:54.000-0600
Date Closed: 2013-02-28 11:14:13.000-0600
Versions:
Frequency of
Environment:
Attachments:
( 0) AST-21128-11.diff
( 1) ASTERISK-21128-1.8.diff
( 2) ASTERISK-21128-1.8-modified.diff
( 3) locks.txt
Description: Systems running IAX2 trunks (either with trunk=yes or without it) freeze occasionally with the typical symptoms of unreleased locks. The deadlock occurs mostly during call setup; it has not been observed during the speech phase or call release. I never saw it in 1.6 or earlier; it has appeared since 1.8. The newest system on which the problem was observed is indicated below. The only known way to restore functionality is to kill and restart Asterisk. A special debugging build of Asterisk was deployed to one of the sites, and the core show locks command was run during the deadlock. Its output is attached.
Comments:

By: Pavel Troller (patrol-cz) 2013-02-17 14:52:03.230-0600

Output of the "core show locks" command.

By: Matt Jordan (mjordan) 2013-02-18 21:34:01.168-0600

You can try this patch to see if it resolves the issue. This does deadlock avoidance before calling {{ast_set_callerid}}; that should let chan_iax get the channel lock safely instead of having it attempt to set the caller ID while another thread already has the lock.
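
For readers unfamiliar with the technique, the following is a minimal sketch of trylock-based deadlock avoidance using plain pthreads. It is not the attached patch and does not use Asterisk's own locking API. All names (call_lock, chan_lock, lock_channel_avoiding_deadlock, run_demo) are hypothetical stand-ins: call_lock plays the role of the per-call iaxsl lock and chan_lock the role of the channel lock. The assumed lock order is channel first, then call, so a thread that already holds the call lock must not block on the channel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the per-call lock and the channel lock. */
static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t chan_lock = PTHREAD_MUTEX_INITIALIZER;
static char caller_id[32]; /* written only while chan_lock is held */

/* Caller must hold call_lock. Returns with both locks held.
 * Rather than blocking on chan_lock (which could deadlock against a
 * thread that holds chan_lock and waits for call_lock), back off:
 * drop call_lock, yield, retake it, and try again. */
static int lock_channel_avoiding_deadlock(void)
{
    int retries = 0;

    while (pthread_mutex_trylock(&chan_lock) != 0) {
        pthread_mutex_unlock(&call_lock);
        sched_yield();
        pthread_mutex_lock(&call_lock);
        /* Real code must re-validate any state guarded by call_lock
         * here, since it was released for a moment. */
        retries++;
    }
    return retries;
}

/* Follows the assumed normal order: channel lock, then call lock. */
static void *channel_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&chan_lock);
        pthread_mutex_lock(&call_lock);
        pthread_mutex_unlock(&call_lock);
        pthread_mutex_unlock(&chan_lock);
    }
    return NULL;
}

/* Starts from the call lock and then needs the channel lock,
 * i.e. the inverted order that would deadlock with blocking locks. */
static void *network_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&call_lock);
        lock_channel_avoiding_deadlock();
        snprintf(caller_id, sizeof(caller_id), "call-%d", i);
        pthread_mutex_unlock(&chan_lock);
        pthread_mutex_unlock(&call_lock);
    }
    return NULL;
}

/* Runs both threads to completion; returns 1 if the last write landed. */
int run_demo(void)
{
    pthread_t a, b;
    int rc;

    rc = pthread_create(&a, NULL, channel_thread, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, network_thread, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return strcmp(caller_id, "call-999") == 0;
}
```

Calling run_demo() from a main function should return 1 without hanging. Replacing the trylock loop with a blocking pthread_mutex_lock(&chan_lock) would reintroduce the inversion, and the two threads could then freeze in the way the description reports.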

By: Pavel Troller (patrol-cz) 2013-02-18 22:41:12.752-0600

Thanks for your patch, Matt! It required two modifications, one to make it compile and a second to prevent systematic crashes, but now it seems OK. I've installed it on three nodes of our network. Because the deadlocks are relatively rare, we will now have to wait at least 14 days to see whether it really helps. Attaching the modified version of the patch.

By: Pavel Troller (patrol-cz) 2013-02-19 01:51:12.115-0600

Added the same patch ported to Asterisk 11 (SVN branch). Asterisk 11 has the same bug.

By: Matt Jordan (mjordan) 2013-02-19 06:12:32.712-0600

Yikes. Not sure what I was doing with that first patch, but yes, your modifications are right.

Let me know if the issue pops up again with the patch in place - we'll hold off on committing until you've confirmed it works.

By: Pavel Troller (patrol-cz) 2013-02-28 05:10:05.345-0600

So, it has been about 10 days since the patch was deployed, and since then there has been no deadlock on any of the three (later four) patched systems, even with an increased frequency of the calls that were prone to causing deadlocks before.
I think the probability that the patch does the right thing and does nothing harmful is high enough to consider it verified and commit it.