Summary:ASTERISK-21234: Deadlock when using two Local channels & fax gateway (local_queryoption)
Reporter:Faidon Liambotis (paravoid)
Labels:
Date Opened:2013-03-11 12:22:08
Date Closed:
Versions:11.2.1 13.18.4
Frequency of
Environment:
Attachments:( 0) 2
( 1) 3
Description:There's a corner case when using two Local channels in series with the T.38 fax gateway enabled. There appears to be a race that ends in a lock-order (AB/BA) deadlock on two channel locks. This quickly brings the rest of the system down: the SIP monitoring thread gets stuck, as does every other operation that enumerates channels and tries to lock them.
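
For readers less familiar with the AB/BA pattern, here is a minimal sketch of it using plain pthreads, outside Asterisk. Everything in it ({{struct side}}, {{hold_then_probe}}, {{count_blocked_sides}}) is hypothetical and for illustration only. Each thread takes its own lock, waits until the other thread has done the same, and then probes the peer's lock with a non-blocking {{pthread_mutex_trylock}}. Both probes fail with {{EBUSY}}; with a blocking lock call, both threads would hang at that point instead.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <errno.h>
#include <pthread.h>

/* Barrier used to force the "both threads hold their own lock" state. */
static pthread_barrier_t rendezvous;

/* Hypothetical stand-in for one side of the call: its own lock and the peer's. */
struct side {
    pthread_mutex_t *mine;
    pthread_mutex_t *peer;
    int rc; /* result of probing the peer's lock */
};

static void *hold_then_probe(void *arg)
{
    struct side *s = arg;

    pthread_mutex_lock(s->mine);
    pthread_barrier_wait(&rendezvous);       /* both sides now hold their own lock */
    s->rc = pthread_mutex_trylock(s->peer);  /* a blocking lock would hang here */
    pthread_barrier_wait(&rendezvous);       /* keep our lock until the peer has probed */
    if (s->rc == 0) {
        pthread_mutex_unlock(s->peer);
    }
    pthread_mutex_unlock(s->mine);
    return NULL;
}

/* Returns how many of the two threads found the peer's lock busy. */
static int count_blocked_sides(void)
{
    pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;
    struct side s2 = { &a, &b, 0 };
    struct side s3 = { &b, &a, 0 };
    pthread_t t2, t3;

    pthread_barrier_init(&rendezvous, NULL, 2);
    pthread_create(&t2, NULL, hold_then_probe, &s2);
    pthread_create(&t3, NULL, hold_then_probe, &s3);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    pthread_barrier_destroy(&rendezvous);

    return (s2.rc == EBUSY) + (s3.rc == EBUSY);
}
```

Calling {{count_blocked_sides()}} from a small {{main}} should report that both sides were blocked, which is the state the two channel threads in this report appear to reach.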

The issue is fully reproducible within 5-10 minutes using a load generator and this deliberately simplified dialplan:
exten => _X.,1,Set(FAXOPT(gateway)=yes)
exten => _X.,2,Dial(Local/${EXTEN}@local2)

exten => _X.,1,Set(FAXOPT(gateway)=yes)
exten => _X.,2,Dial(Local/${EXTEN}@local1)

exten => _X.,1,Set(FAXOPT(gateway)=yes)
exten => _X.,2,Dial(SIP/sip2/${EXTEN})

Attached is the backtrace for the two deadlocked threads when running with the above dialplan.

Both threads lock their respective channels in {{ast_indicate_data}}, then race in {{local_queryoption}} and deadlock each other. The whole process of locking/unlocking in {{local_queryoption}} looks fishy and is most likely the culprit of this deadlock.
Comments:
By: Faidon Liambotis (paravoid) 2013-03-11 12:23:57.704-0500

Backtrace of the two deadlocked threads.

By: Matt Jordan (mjordan) 2013-03-12 11:36:45.180-0500

While the backtrace helps to indicate where the problem is, a {{core show locks}} output would help. Do you mind compiling with {{DEBUG_THREADS}} and attaching one to this issue?

As an aside, why would you configure a system to use two Local channels in a chain while performing a fax gateway?

By: Faidon Liambotis (paravoid) 2013-03-13 05:43:09.390-0500

The production system was much more complicated than that, with multiple AGIs in the path that did all that; this is a much simplified version that was created for the purposes of reproducing in the lab and bug reporting. But yes, there was no good reason and this isn't the case in production anymore. It is a deadlock though, and I thought it warranted a bug report, albeit with a warning about being a corner case, as I said in my first sentence :)

I can compile with DEBUG_THREADS, although I've already worked out the locks and the exact sequence of events from the backtrace above, so I'm not sure how much more it'll help you. More specifically:

Thread 2 locks {{0x2eac188}} (let's call that lock A) in {{ast_indicate_data}}, which then proceeds through the framehook, tries to get the T.38 state and ends up in {{local_queryoption}}, which calls:
       if (bridged) {
               res = ast_channel_queryoption(bridged, option, data, datalen, 0);
               bridged = ast_channel_unref(bridged);
       }
with {{bridged}} being {{0x309e238}} and {{ast_channel_queryoption}} immediately trying to get a lock for that channel (lock B).

Thread 3 does the same, starting with {{ast_indicate_data}} for channel {{0x309e238}} which locks it (lock B again), goes through the same framehook, ends up in {{local_queryoption}} and specifically:
       if (!(tmp = IS_OUTBOUND(ast, p) ? p->owner : p->chan)) {
               return -1;
       }
       ast_channel_unlock(ast); /* Held when called, unlock before locking another channel */

It gets {{tmp}}, which is {{0x2eac188}}, and then tries to lock it with {{ast_channel_lock(tmp);}}, i.e. it tries to acquire lock A.
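
One possible reason the {{ast_channel_unlock(ast)}} above does not actually release lock B is sketched below. This is an assumption, not something confirmed in this report: it supposes the channel lock is recursive and was already taken once further up the stack (in {{ast_indicate_data}}) before being taken again by the caller of {{local_queryoption}}. In that case a single unlock only drops one level. The sketch uses a plain pthread recursive mutex; {{chan_lock}}, {{probe}} and {{recursive_unlock_demo}} are hypothetical names.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a channel lock, initialised as recursive below. */
static pthread_mutex_t chan_lock;

/* Runs in a second thread: can anyone else take the lock right now? */
static void *probe(void *out)
{
    int rc = pthread_mutex_trylock(&chan_lock);

    if (rc == 0) {
        pthread_mutex_unlock(&chan_lock);
    }
    *(int *)out = rc;
    return NULL;
}

static int probe_from_other_thread(void)
{
    pthread_t t;
    int rc = -1;

    pthread_create(&t, NULL, probe, &rc);
    pthread_join(t, NULL);
    return rc;
}

/* Returns 1 if the lock is still busy after one unlock and free after the second. */
static int recursive_unlock_demo(void)
{
    pthread_mutexattr_t attr;
    int after_one, after_two;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&chan_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&chan_lock);    /* outer holder, e.g. the indicate path */
    pthread_mutex_lock(&chan_lock);    /* nested acquisition on the same thread */
    pthread_mutex_unlock(&chan_lock);  /* the single "unlock before locking another channel" */
    after_one = probe_from_other_thread();

    pthread_mutex_unlock(&chan_lock);  /* only now is the lock really released */
    after_two = probe_from_other_thread();

    pthread_mutex_destroy(&chan_lock);
    return after_one == EBUSY && after_two == 0;
}
```

If that assumption holds, the thread still owns lock B while it blocks on lock A, which matches what the backtrace shows.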

So thread 2 holds A and waits for B, while thread 3 holds B and waits for A. The threads deadlock, both channels stay locked for good, and every other operation that needs them blocks, effectively killing the system.
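
As a generic illustration of one way out of this kind of hold-and-wait, here is a minimal pthread sketch of try-lock with back-off. It is not a proposed patch for {{local_queryoption}}, and all names in it ({{struct chan}}, {{lock_peer_with_backoff}}, {{run_backoff_demo}}) are hypothetical. The idea is that a thread never blocks on the peer's lock while holding its own: if the try-lock fails, it releases its own lock, yields, re-acquires it and retries.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for a channel: a lock plus a counter it protects. */
struct chan {
    pthread_mutex_t lock;
    long queries;
};

/* Called with self->lock held; returns with both locks held.
 * Never blocks on peer->lock while holding self->lock. */
static void lock_peer_with_backoff(struct chan *self, struct chan *peer)
{
    while (pthread_mutex_trylock(&peer->lock) != 0) {
        pthread_mutex_unlock(&self->lock);  /* let the other side make progress */
        sched_yield();
        pthread_mutex_lock(&self->lock);
    }
}

struct job {
    struct chan *self;
    struct chan *peer;
    int rounds;
};

static void *query_peer(void *arg)
{
    struct job *j = arg;
    int i;

    for (i = 0; i < j->rounds; i++) {
        pthread_mutex_lock(&j->self->lock);
        lock_peer_with_backoff(j->self, j->peer);
        j->peer->queries++;                 /* protected by peer->lock */
        pthread_mutex_unlock(&j->peer->lock);
        pthread_mutex_unlock(&j->self->lock);
    }
    return NULL;
}

/* Two threads query each other's channel; returns the total query count. */
static long run_backoff_demo(int rounds)
{
    struct chan a = { PTHREAD_MUTEX_INITIALIZER, 0 };
    struct chan b = { PTHREAD_MUTEX_INITIALIZER, 0 };
    struct job j2 = { &a, &b, rounds };
    struct job j3 = { &b, &a, rounds };
    pthread_t t2, t3;

    pthread_create(&t2, NULL, query_peer, &j2);
    pthread_create(&t3, NULL, query_peer, &j3);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    return a.queries + b.queries;
}
```

The trade-off is that the thread's own lock is dropped for a moment during the back-off, so any state it protects has to be re-checked after the peer lock has been obtained.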