ASTERISK-21207: [patch] - Deadlock on fax extension calling ast_async

[Home]

Summary: ASTERISK-21207: [patch] - Deadlock on fax extension calling ast_async_goto() with locked channel

Reporter: Ashley Winters (awinters) Labels: patch

Date Opened: 2013-03-05 12:05:35.000-0600 Date Closed:

Priority: Major Regression? No

Status: Open/New Components: Channels/chan_dahdi Channels/chan_sip/General

Versions: 10.7.1 11.2.1 13.18.4 Frequency of
Occurrence Frequent

Related
Issues:

Environment: CentOS 6.3 x86_64 Attachments: ( 0) backtrace-threads.txt
( 1) core-show-locks.txt
( 2) fax-deadlock.patch
( 3) fax-deadlock-v2.patch
( 4) fax-deadlock-v2.patch-11.3.0
( 5) gdb-fax-deadlock.txt
( 6) issue_log

Description: On an asterisk system with heavy use of AGI and inbound CNG-detected faxing, occasionally all channel activity will freeze. Running 'core show channels' returns nothing, but the logs continue running with anything except channel activity. Running with 'sip set debug on' shows that chan_sip.c doesn't even claim to be reading packets anymore.

This deadlock was triggered several times daily across our array of asterisk servers, which process hundreds of faxes and tens of thousands of calls daily.

Comments: By: Ashley Winters (awinters) 2013-03-05 12:07:46.516-0600

Info from gdb showing the deadlock in action
By: Ashley Winters (awinters) 2013-03-05 12:10:22.166-0600

Fix for all obvious locations (chan_sip.c and chan_dahdi.c) where ast_async_goto(chan, ...) is called with chan locked.
By: Ashley Winters (awinters) 2013-03-06 20:25:57.312-0600

Turns out the T.38 code path also (apparently) has a dangling lock that managed to deadlock me. I'll gather some more data on that, but my inbound T.38 frequency is low so it takes a while to reproduce.
By: Ashley Winters (awinters) 2013-03-07 12:04:33.697-0600

On T.38 fax upgrade, handle_request_invite() calls process_sdp() which calls ast_async_goto(). The problem is that the channel is locked before handle_request_invite() is called, which makes the entire ast_async_goto branch in process_sdp() unwise -- especially according to the comments elsewhere about ast_exists_extension() being called with locks.

I'm thinking process_sdp() should just queue up an AST_FRAME_DTMF with f->subclass.integer='f' in order to trigger the (now fixed) goto code in sip_read().
By: Ashley Winters (awinters) 2013-03-11 13:17:09.246-0500

I don't have a good solution for the T.38 unbalanced lock. It only happens after T38_PEER_REINVITE, which is stupid to receive on an inbound call. Fortunately, setting 'faxdetect = cng' in sip.conf prevents the ast_async_goto() from being triggered by T38_PEER_REINVITE and that avoids that particular deadlock.
By: Matt Jordan (mjordan) 2013-03-12 09:55:24.235-0500

The {{ast_async_goto}} in {{process_sdp}} is problematic. Queueing a DTMF frame on the channel won't uniquely fix it, as the fax detection is only performed on DTMF frames pulled out of the RTP layer.

I'm a little hesitant to do that, however, since that would mean we would try to trip off to the fax extension if anyone queued up a DTMF with an 'f' on it on the channel - which seems like it could be abused/confused.

One alternative would be to hijack the AST_CONTROL_T38_PARAMETERS control frame. We could add a new value onto {{ast_control_t38}} - AST_T38_GOTO_FAX - and queue that up on the channel safely. When that is passed back into {{sip_indicate}}, we could then safely bounce off to the fax extension.

What do you think of that solution? It's still an API change in a release branch (ew), but it may allow us to defer the {{ast_async_goto}} into a slightly safer portion of the code.
By: Ashley Winters (awinters) 2013-03-12 13:01:42.073-0500

Hijacking the control data seems like a good idea. Since the only place T38_PEER_REINVITE state is used is on the {{ast_async_goto}} path, I think it would be correct to modify {{interpret_t38_parameters}} so that when AST_T38_REQUEST_NEGOTIATE && p->t38.state == T38_PEER_REINVITE we could perform the goto there without necessarily making a new flag. However, I'm not sure I'm allowed to temporarily unlock the channel in {{tech->indicate}} either, since I can't find any other code paths doing it.
By: Matt Jordan (mjordan) 2013-03-12 13:40:12.981-0500

Well, there's a first time for everything :-)

So long as you reference bump the channel prior to unlocking it, you should be relatively okay. Relative, at least, in the sense that the state of the channel may change during the indication - but that's expected, since you're about to perform an async goto on the channel (which will mangle it via the masquerade anyway). If you ref bump the channel, you guarantee that at the very least, it won't be reclaimed prior to entering the async_goto.
By: Ashley Winters (awinters) 2013-03-12 15:59:56.718-0500

Here's what I came up with for moving the {{ast_async_goto}} into {{interpret_t38_parameters}}. I haven't even smoke-tested it yet, but I'll get to that this week.
By: Chris Cappuccio (chriscappuccio) 2013-03-24 20:36:22.506-0500

I upgraded from 1.8.8.1 under OpenBSD 5.0 amd64 which had no major problems in my use. Tried 10.12.1 first, then 11.3.0rc1, both with OpenBSD 5.3 amd64 and rthreads. Just wanted to let Digium and the submitters know, with either new version, I now seem to be running into the same deadlock.

I have one server with endpoints registering and asterisk proxying SIP calls with T.38 (and RTP), and also using ReceiveFAX (SpanDSP) with T.38. With this patch, I haven't seen the deadlock after 4 days, where I was getting it almost every day! I will post an update after a week.
By: Ashley Winters (awinters) 2013-03-25 13:00:13.919-0500

Hijacking AST_CONTROL_T38_PARAMETERS isn't quite enough. The {{interpret_t38_parameters}} function queues a frame so {{ast_read}} can see it, but {{ast_read}} doesn't call {{tech->indicate}}. Since I'm venturing into uncharted territory anyways, I might as well try making {{ast_read}} call {{ast_indicate_data}} for AST_CONTROL_T38_PARAMETERS.
By: Ashley Winters (awinters) 2013-03-25 13:52:57.644-0500

I'm thinking chan_sip's faxdetect option should just be deprecated into oblivion in favor of res_fax's framehook, enabled via FAXOPT(faxdetect)=yes.

Of course, THAT one is broken too, because it calls ast_async_goto() with the channel locked!
By: Ashley Winters (awinters) 2013-03-25 16:11:56.093-0500

This version of the patch fixes the res_fax framehook to perform the {{ast_async_goto}} correctly, as well as fixing the CNG goto in chan_sip and chan_dahdi. Instead of fixing the T.38 goto in chan_sip, I'm recommending that people use FAXOPT(faxdetect)=yes instead.
By: Chris Cappuccio (chriscappuccio) 2013-03-25 17:01:56.410-0500

Why not fix interpret_t38_parameters by moving the locks outside of the branch with the async goto just like the others? How do you know interpret_t38_parameters is only used by faxdetect=t38 and should be removed?
By: Ashley Winters (awinters) 2013-03-25 17:29:10.252-0500

The problem is that {{interpret_t38_parameters}} isn't getting called the way I thought. I expected {{ast_queue_control_data}} to trigger an indicate on chan, which would call {{interpret_t38_parameters}}, but it doesn't. If I modify {{change_t38_state}} to conditionally call {{ast_indicate_data}} instead (based on what flag? or simply !{{ast_channel_bridged}}?), I'd need to show that for every fax interop scenario (send, receive, bridge, gateway) that it still works. That's a tall order.
By: Ashley Winters (awinters) 2013-03-25 17:34:08.049-0500

Also, I'm not saying there's a problem with {{interpret_t38_parameters}}, I'm just saying I don't think I can move the {{ast_async_goto}} from {{process_sdp}} to there.
By: Chris Cappuccio (chriscappuccio) 2013-03-25 17:49:18.580-0500

God has already answered your prayers. ast_sync_goto is moved from process_sdp to interpret_t38_parameters in asterisk-11.3.0rc1 (which is what i'm currently using with your first patch)
By: Chris Cappuccio (chriscappuccio) 2013-03-25 17:50:35.552-0500

Writing these kinds of patches, you should be working against the head or at least the latest RC anyways...
By: Chris Cappuccio (chriscappuccio) 2013-03-25 17:52:43.080-0500

actually, 11.3.0rc1 already calls ast_async_goto with the channel unlocked in interpret_t38_parameters, so maybe nothing needs to be done there.
By: Chris Cappuccio (chriscappuccio) 2013-03-25 17:55:34.145-0500

In 11.3.0rc1, there appears to be another usage of ast_async_goto within sip_read() that should be fixed.
By: Chris Cappuccio (chriscappuccio) 2013-03-25 18:05:23.499-0500

Does Digium keep an internal list of functions that need to obtain certain big locks like this? It'd be nice if bugs like this didn't get committed in the first place. What about a "try" function where if the lock can't be obtained, the call errors out instead of hanging? Perhaps with an ERROR message so the bug can be caught and one call can fail, instead of the entire system locking up solid?
By: Ashley Winters (awinters) 2013-03-27 16:12:17.171-0500

Fixed backwards logic on the "Already in fax exten" check from the previous draft of this patch.
By: Chris Cappuccio (chriscappuccio) 2013-03-28 14:59:01.674-0500

sip_read in 11.3.0 release looks fine already. your patch fits very nicely.
By: Chris Cappuccio (chriscappuccio) 2013-03-28 15:06:15.742-0500

Same patch but for 11.3.0 release
By: Ashley Winters (awinters) 2013-03-28 17:37:49.103-0500

The chan_sip fix has been working without crash for me since the original patch. I've just started testing the res_fax fix, since I'd like to switch to the framehook detection which seems much cleaner to me.

I've put a few-hundred faxes through the FAXOPT(faxdetect)=yes path and I haven't had any problem or deadlock using the latest patch. I'm going to deploy it more widely to get additional testing in over the next week.
By: Chris Cappuccio (chriscappuccio) 2013-03-28 17:52:09.574-0500

No lockups for me either, before they were a daily occurrence. We are passing T.38 calls through as well as using ReceiveFAX.
By: Andrew Nowrot (andrutto) 2013-04-15 06:03:24.647-0500

I am not sure this is related, but my asterisk got deadlock today. I was able to catch "core show locks" and backtrace-threads, I was also logging debug messages (with verbose level 8) and I can see that the last thing it did was related to fax T38.
It all had happened after applying the lates fax-deadlock-v2-patch-11.3.30.
By: Ashley Winters (awinters) 2013-04-15 16:51:56.139-0500

Andrew's deadlock is due to calling {{ast_exists_extension}} with the channel lock held in {{process_sdp}}. I didn't fix that one. I just warned a suggestion to switch to FAXOPT. Unfortunately, the code deadlocked before the warning was logged.
By: Chris Cappuccio (chriscappuccio) 2013-04-16 17:13:29.137-0500

process_sdp seems to do an ast_channel_unlock before calling ast_exists_extension in 11.2 and 11.3. does sip_pvt_unlock need to be added as well?
By: Ashley Winters (awinters) 2013-04-16 18:28:26.582-0500

The {{ast_channel_unlock}} in {{process_sdp}} was just cut&pasted into place, and is pointless. It only unlocks the recursive lock from a couple lines up, not the big lock held by the caller of {{process_sdp}} -- ultimately {{handle_request_do}}.
By: Chris Cappuccio (chriscappuccio) 2013-07-18 19:05:34.589-0500

Digium: These bugs are obvious, simple to fix, and you've already got a patch. Why is this lingering?
By: Masahide Yamamoto (m.yamamoto) 2014-07-04 22:16:31.509-0500

We also have been encountering this deadlock issue so far in 1.8 branch.
Unfortunately this issue does not seem to have been fixed in the latest 1.8 branch either.

bq. The ast_channel_unlock in process_sdp was just cut&pasted into place, and is pointless. It only unlocks the recursive lock from a couple lines up, not the big lock held by the caller of process_sdp – ultimately handle_request_do.

According to the above comment from Ashley, the lock in process_sdp seems to need to be disabled so the following ast_exists_extension and ast_async_goto will work without deadlocking like:

--------------------------------------------------------------------
Modified on 06/Jul/14:

\[EDIT\] - removed inline code - mjordan

-FYI: Like the above code snippet, we can use ast_channel_trylock / ast_mutex_trylock for non-blocking lock attempt and checking.-

--------------------------------------------------------------------
Added on 06/Jul/14:

I read the following piece of code about PTHREAD_MUTEX_RECURSIVE_NP, which is the mutex's attribute that Asterisk uses.

In Linux and glibc,

\[EDIT\] - removed inline code

So I realized I made a big misunderstanding about the semantics of Asterisk's trylock function which uses PTHREAD_MUTEX_RECURSIVE_NP, so I rectified my previously mentioned code accordingly.
* I thought ast_channel_trylock always returns EBUSY when a channel given is locked by any threads including the calling thread itself.
* Note that my approach is not portable since I used architecture-specific way (I think it will work at least in Linux/glibc).

Also I put the following code in __ao2_lock, __ao2_unlock and __ao2_trylock as follows to see if everything works as expected:

\[EDIT\] - removed inline code

I'm still testing them, so far so good.
By: Masahide Yamamoto (m.yamamoto) 2014-07-05 00:50:43.097-0500

Probably the same approach can be taken in functions like ast_async_goto which calls ast_channel_alloc, ast_do_masquerade, etc, in which threads cannot hold any channel locks.
By: Matt Jordan (mjordan) 2014-07-06 10:16:19.723-0500

A few comments here.

# Please do not post patches in comments. Patches must be attached to an issue marking them as code after signing a license contributor agreement. I've unfortunately had to go back and remove the patches in comments here.
# Generally, the solution to locking issues is not to make them more complex. Typically, you can avoid having to recursively lock, or repeatedly attempt to gain a lock, by fixing the pattern of usage. A good example of this is the locking that is done in {{chan_local}} (now {{core_unreal}}) where the lifetime of an object is bumped such that the object is not destroyed while we attempt to lock it in the correct order. That pattern does not involve recursively locking the object, nor does it involve trylocks.

In the case of {{pbx.c}}, the correct solution would be to make the contexts objects in the core reference counted. This would prevent having to recursively lock the objects as well as try locking them; the lock could be obtained a single time and the reference to the context object bumped. By increasing the reference count on the context object, we would keep it from being destroyed if the dialplan is reloaded (which is really the point of the locks on the contexts object in the first place).
By: Masahide Yamamoto (m.yamamoto) 2014-07-06 19:09:02.402-0500

Thank you for your comment.

I'm sorry this is the first time for me to post a comment in Asterisk's issue tracking system. I'll be more careful.
I didn't intend you to review the patch nor appeal for adoption of the patch to the maintainers at all.
Yes, I agree with you for ``the correct solution would be to make the contexts objects in the core reference counted".
I just wanted you to know this issue is still even in the latest Asterisk 1.8 and I just wanted to provide information for users to avoid such deadlock bugs in a reasonably easy way.

How about the idea of counting the number of recursive mutex locks which I mentioned, although I didn't check if it works in other environment than linux/glibc so that people can check before whole Asterisk gets entirely locked? I wonder why such a function has not been added in Asterisk so far.

I think, anyway, we need to double-check all the paths of the function calls such as ast_channel_alloc, ast_do_masquerade, etc, in which threads cannot hold any locks to prevent unwanted deadlock issues from being happened, or at least we need to have some mechanism which can avoid whole system deadlock issues; ie: we need to limit the extent of the impact of deadlock issues.

I thought my patch provided was helpful to do such a task.

-
I'm afraid this seems to be a beginners question.
May I ask one more question?

You mentioned about reference counting on asterisk objects.
But in some condition, we still need to control race conditions especially when threads need to get/set the same resource with using a synchronization object such as mutex?

I think isolation of such resources is not involved with reference counting since it seems to only be involved with life cycle management.

Is this correct?

Thank you.
By: David Brillert (aragon) 2015-01-09 15:04:17.009-0600

@chriscappuccio and @awinters I'm thinking enough people areaffected by this bug so you might want to upload a patch to the reviewboard. I've seen nothing in any changelog that looks to fix this problem.