
Summary: ASTERISK-13201: Deadlock chan_dahdi.c and channel.c
Reporter: Ryan Trauntvein (rtrauntvein)
Labels:
Date Opened: 2008-12-10 17:13:12.000-0600
Date Closed: 2008-12-22 15:56:32.000-0600
Priority: Critical
Regression?: No
Status: Closed/Complete
Components: Channels/chan_dahdi
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 14057.patch
( 1) 14057v2.patch
( 2) 14057v3.patch
( 3) astcrashgdb1.txt
( 4) astcrashgdb2.txt
( 5) coreshowchannelsduringcrash.txt
( 6) coreshowlocks.txt
( 7) coreshowlocks2.txt
Description: I have an Asterisk system that has been deadlocking at least once a week, and we can't seem to reproduce it reliably. It is a production system with about 25 users and is in constant use.

Running zaptel 1.4.12.1 with a TE220 card.
(Planning on installing the DAHDI release over the weekend as next troubleshooting step)

****** ADDITIONAL INFORMATION ******

I am attaching gdb backtrace output from two separate crashes, as well as the "core show locks" output from one crash.

I am also attaching output from "core show channels". Every time a lock has been reported the output looks similar, and it seems that SIP channels are getting stuck in the Ringing state.

Debugging is a bit difficult due to the reliance on the system during business hours, and it only seems to lock up when the system is being used midday.

Please let me know what other debugging / information I can provide.
Comments:

By: Mark Michelson (mmichelson) 2008-12-10 17:31:40.000-0600

Thank you very much for the absolutely excellent information! You've made this much easier to track down.

The problem is that there is a thread which is "stuck" trying to stop a channel from being in autoservice. This thread is holding onto some locks at the time that it is stuck, and so no other threads can grab them. It's not yet clear why the thread is stuck, but this helps to narrow the search down by quite a lot.

By: Mark Michelson (mmichelson) 2008-12-11 07:40:16.000-0600

So the problem is that the channel which is stuck trying to stop autoservice is holding a lock that the autoservice thread is trying to obtain.

I will work on a patch for this issue.

By: Mark Michelson (mmichelson) 2008-12-11 09:16:05.000-0600

I have uploaded 14057.patch for testing.

I have also uploaded the patch for review at http://reviewboard.digium.com/r/83/ for other developers to review the fix.

By: Mark Michelson (mmichelson) 2008-12-11 09:39:34.000-0600

Uploaded a new version of the patch. I noticed a logical flaw in the first version.

By: Ryan Trauntvein (rtrauntvein) 2008-12-11 13:45:00.000-0600

Based on the activity on the reviewboard, would you still like me to apply this patch to test? Since I cannot reliably reproduce the deadlock, would it be better to wait?

By: Mark Michelson (mmichelson) 2008-12-11 13:51:39.000-0600

My opinion is that v2 of the patch is safe to apply and will likely cause you to not see the issue any more.  The problem is that there is still an extremely narrow window of time under which the deadlock may still occur, so the problem is not guaranteed to be gone just by using v2 of the patch.

You can interpret the above however you wish :)
I'm going to continue to work on this issue and try to find a better way of resolving this deadlock.

By: Mark Michelson (mmichelson) 2008-12-11 13:52:44.000-0600

I'm going to remove the "ready for testing" status of this issue since what's attached will not end up being what gets committed to Asterisk.

By: Mark Michelson (mmichelson) 2008-12-11 13:58:09.000-0600

Actually, on second thought, I'm not going to recommend using the patch. I just can't recommend using something that I know will not end up being merged.

By: Ryan Trauntvein (rtrauntvein) 2008-12-11 16:50:13.000-0600

Just had another deadlock and I grabbed another "core show locks" to verify it was the same thing as before.  I have uploaded this output.

We may apply your patch temporarily until a final solution is found.  Any reduction in the number of deadlocks will help at this point :)

By: Mark Michelson (mmichelson) 2008-12-11 17:54:37.000-0600

Just to confirm, the new core show locks output shows the exact same deadlock occurring again.

By: Mark Michelson (mmichelson) 2008-12-15 12:31:09.000-0600

I've uploaded a v3 of the patch. This new approach is to fix the specific deadlock encountered here instead of trying to make a generic solution for all such related deadlocks. The problem is that it appears that such individual changes are what will be necessary in order to fix this problem, and spotting them is going to be very difficult. Nevertheless, I will try to find other similar cases and fix those as well.

Anyway, try the patch and let me know if any problems are encountered.

By: Ryan Trauntvein (rtrauntvein) 2008-12-15 19:37:57.000-0600

I have installed the v3 patch and will keep monitoring for deadlocks.  *crosses fingers*

By: Leif Madsen (lmadsen) 2008-12-22 11:46:49.000-0600

rtrauntvein:  so far so good?

By: Ryan Trauntvein (rtrauntvein) 2008-12-22 12:04:23.000-0600

Yes so far so good!

We have not had this deadlock occur since installing the patch.  Normally I would have seen 3 to 5 deadlocks over the same period of time.

By: Leif Madsen (lmadsen) 2008-12-22 12:08:31.000-0600

Excellent! I'm going to mark this as ready for review then. Thanks for testing!

By: Mark Michelson (mmichelson) 2008-12-22 14:21:20.000-0600

All right. I'll place the patch on reviewboard. I suspect I'll get a quick "ship it!" on it, but we'll see.

By: Mark Michelson (mmichelson) 2008-12-22 14:30:09.000-0600

Here's the reviewboard URL for the patch:

http://reviewboard.digium.com/r/107/

By: Mark Michelson (mmichelson) 2008-12-22 14:51:33.000-0600

As I suspected, the patch got approved rather quickly, and I'm ready to merge now. Thanks for testing it!

By: Digium Subversion (svnbot) 2008-12-22 14:56:25.000-0600

Repository: asterisk
Revision: 166380

U   branches/1.4/channels/chan_dahdi.c

------------------------------------------------------------------------
r166380 | mmichelson | 2008-12-22 14:56:24 -0600 (Mon, 22 Dec 2008) | 36 lines

Fix a deadlock relating to channel locks and autoservice

It has been discovered that if a channel is locked prior
to a call to ast_autoservice_stop, then it is likely that
a deadlock will occur. The reason is that the call to
ast_autoservice_stop has a check built into it to be sure
that the thread running autoservice is not currently trying
to manipulate the channel we are about to pull out of
autoservice.

The autoservice thread, however, cannot advance beyond where
it currently is, though, because it is trying to acquire
the lock of the channel for which autoservice is attempting
to be stopped.

The gist of all this is that a channel MUST NOT be locked
when attempting to stop autoservice on the channel.

In this particular case, the channel was locked by a call
to ast_read. A call to ast_exists_extension led to autoservice
being started and stopped due to the existence of dialplan
switches.

It may be that there are future commits which handle the same
symptoms but in a different location, but based on my looks through
the code, it is very rare to see a construct such as this one.

(closes issue ASTERISK-13201)
Reported by: rtrauntvein
Patches:
     14057v3.patch uploaded by putnopvut (license 60)
Tested by: rtrauntvein

Review: http://reviewboard.digium.com/r/107/


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=166380
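
The rule stated in the commit message (a channel must not be locked when stopping autoservice on it) can be illustrated with a small pthread model. This is only a sketch and not the committed patch: `struct chan`, `service_loop`, `service_start`, `service_stop`, `stop_while_locked` and `demo` are hypothetical names, and the real autoservice code in Asterisk uses a different waiting mechanism. The model assumes only what the commit message says: the servicing thread needs the channel lock, and the stop call waits for that thread.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a channel; NOT the real struct ast_channel. */
struct chan {
    pthread_mutex_t lock;      /* plays the role of the channel lock */
    atomic_int keep_servicing; /* cleared to ask the servicing thread to exit */
    int frames_read;           /* protected by lock */
    pthread_t servicer;        /* plays the role of the autoservice thread */
};

/* Servicing thread: takes the channel lock on every pass. */
static void *service_loop(void *arg)
{
    struct chan *c = arg;

    while (atomic_load(&c->keep_servicing)) {
        pthread_mutex_lock(&c->lock);   /* blocks while someone else holds it */
        c->frames_read++;
        pthread_mutex_unlock(&c->lock);
        sched_yield();
    }
    return NULL;
}

static int service_start(struct chan *c)
{
    atomic_store(&c->keep_servicing, 1);
    return pthread_create(&c->servicer, NULL, service_loop, c);
}

/* Waits for the servicing thread to finish with the channel.
 * The caller must NOT hold c->lock: if the servicing thread is blocked in
 * pthread_mutex_lock() above, this join would never return. */
static int service_stop(struct chan *c)
{
    atomic_store(&c->keep_servicing, 0);
    return pthread_join(c->servicer, NULL);
}

/* Safe pattern for a caller that already holds the channel lock:
 * drop the lock around the stop call, then take it again. */
static int stop_while_locked(struct chan *c)
{
    int res;

    pthread_mutex_unlock(&c->lock);
    res = service_stop(c);
    pthread_mutex_lock(&c->lock);
    return res;
}

/* Returns 0 if servicing was started and stopped without deadlocking. */
static int demo(void)
{
    struct chan c;
    int res;

    pthread_mutex_init(&c.lock, NULL);
    c.frames_read = 0;
    if (service_start(&c) != 0) {
        pthread_mutex_destroy(&c.lock);
        return -1;
    }

    pthread_mutex_lock(&c.lock);    /* caller holds the channel lock */
    res = stop_while_locked(&c);    /* lock is released around the stop */
    pthread_mutex_unlock(&c.lock);

    pthread_mutex_destroy(&c.lock);
    return res;
}
```

Calling `service_stop` directly from the locked section in `demo` would reproduce the shape of the reported deadlock whenever the servicing thread happens to be waiting on the lock, which matches the intermittent, load-dependent behaviour described in this issue.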

By: Digium Subversion (svnbot) 2008-12-22 15:07:59.000-0600

Repository: asterisk
Revision: 166382

_U  trunk/
U   trunk/channels/chan_dahdi.c

------------------------------------------------------------------------
r166382 | mmichelson | 2008-12-22 15:07:59 -0600 (Mon, 22 Dec 2008) | 44 lines

Merged revisions 166380 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=166382

By: Digium Subversion (svnbot) 2008-12-22 15:55:54.000-0600

Repository: asterisk
Revision: 166439

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_dahdi.c

------------------------------------------------------------------------
r166439 | mmichelson | 2008-12-22 15:55:54 -0600 (Mon, 22 Dec 2008) | 52 lines

Merged revisions 166382 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=166439

By: Digium Subversion (svnbot) 2008-12-22 15:56:32.000-0600

Repository: asterisk
Revision: 166440

_U  branches/1.6.1/
U   branches/1.6.1/channels/chan_dahdi.c

------------------------------------------------------------------------
r166440 | mmichelson | 2008-12-22 15:56:31 -0600 (Mon, 22 Dec 2008) | 52 lines

Merged revisions 166382 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=166440