Summary:ASTERISK-01650: Asterisk DEADLOCKED on latest cvs
Reporter:tan (tan)Labels:
Date Opened:2004-05-19 11:13:32Date Closed:2011-06-07 14:05:01
Versions:Frequency of
Environment:Attachments:( 0) debug3.txt
Description:We have a situation where the system is deadlocking on the latest cvs (yesterday night). We did a cvs update last night. Before this, we were running the cvs head version, dated 12th March. I enclose the thread traces. The environment is composed of SIP phones (SNOM 200), and an E100 card for incoming calls. Urgent assistance needed on this. Thanks.


Enclosed bug log.
Comments:By: Mark Spencer (markster) 2004-05-19 12:09:16

Tracing through the threads as best i can tell, what's happening is that the real deadlock is occuring with the transfer that someone executed on their phone.  In order to stop the music on hold on the bridged channel, a lock must be grabbed.  In total, a transfer requires the simultaneous locking of as many as *four* different locks.  It will take some effort to try to resolve this.

By: tan (tan) 2004-05-19 12:21:51

Thanks. Is it possible that this deadlock scenario has crept in after March 12th or do you think it's a situation which hasn't been hit before? Also, this customer doesn't use MOH. None of the dial strings have the "m" argument. They are using queuing.

By: tan (tan) 2004-05-19 16:02:30

Correction: The last update before the one of last night was on April 22nd when an update was made to fix a deadlock / race condition. Don't know if you remember that one Mark.

By: Mark Spencer (markster) 2004-05-19 17:43:30

This is a "long term" deadlock that has been around ever since it's been in there.  It hasn't been fixed and is generally just very difficult to reproduce and requires quite some magical timing.  The rest of the locks are all from trying to do "show channels" and "stop now".

By: tan (tan) 2004-05-20 07:26:44

Ok, but is extremely coincidental that we cvs update and the bug appears, twice in 15 minutes. Today, customer is running cvs dated 22nd April, and no problems so far today (as was the case since April 22nd). I would be interested to know when you think it was introduced into the load. Before April 22nd?

By: Mark Spencer (markster) 2004-05-22 15:12:47

I'm having some trouble with this backtrace and with the other one.  The line numbers don't jive with CVS, and because the locking has changed in recent CVS, it's important that we get the right one.  Please try to get me a backtrace against the most recent CVS so i can try to understand exactly who is waiting on whom.

By: tan (tan) 2004-05-22 16:12:13

Since the bug isn't easily reproducible, I enclose details (ls -alt *.c) of the files that were in the problematic CVSes for the 2 bug traces that I have logged:

File listing of CVS files in /usr/src/asterisk for this bug (1673):
127504 May 18 23:44 pbx.c
40236 May 18 23:44 rtp.c
111651 May 18 23:44 say.c
 2451 May 18 23:44 srv.c
 4121 May 18 23:44 term.c
13851 May 18 23:44 translate.c
 8089 May 18 23:44 acl.c
 9940 May 18 23:44 app.c
45943 May 18 23:44 asterisk.c

File listing of CVS files in /usr/src/asterisk from Bug ASTERISK-1672:
9940 May 18 17:44 app.c
45943 May 18 17:44 asterisk.c
75615 May 18 17:44 channel.c
10082 May 18 17:44 enum.c
16047 May 18 17:44 logger.c
27504 May 18 17:44 pbx.c
40236 May 18 17:44 rtp.c
11651 May 18 17:44 say.c
13851 May 18 17:44 translate.c
4103 May 18 17:44 utils.c

Is this helpful?

By: Mark Spencer (markster) 2004-05-22 17:37:01

This is definitely not latest CVS head.  You can use "cvs status" on a file to get its version and the latest, etc.

If in doubt, check out a clean copy.

I made a change a few days ago that in the CVS logs I said "this is a start to fixing the problem".  As it turns out, upon more analysis, I'm beginning to think my "start" might actually be the right fix after all.  In any case, the exact breakdown indicated in the backtrace you've shown me does not appear to be possible if I'm reading it correctly.  The channel having music stopped on is, I believe, not the bridged channel as I had theorized, but the channel being transferred, meaning that the attempt_transfer holds the private lock while wanting the channel lock, and sip_write holds the channel lock while wanting the private lock.  With CVS as of a few days ago, this condition cannot occur, because when we find and lock the refer_call, we grab both locks in such a way as to prevent a deadlock should we not be able to get them both.

By: tan (tan) 2004-05-22 18:05:22

In that case, we'll cvs update asterisk with the latest load, and see whether we hit this problem again. Please keep this bug report open for another week and we'll see what happens.

By: Mark Spencer (markster) 2004-05-25 09:44:27

Any new info here?

By: Mark Spencer (markster) 2004-05-27 22:08:08

bonbon?  Any more feedback here?

By: tan (tan) 2004-05-28 16:26:52

No. We've got the very latest cvs on one of our call gateways and haven't seen any deadlock problems at all. We haven't updated the problematic site where the initial problem was seen. Maybe we should close this now and re-open if problem occurs again.

By: Malcolm Davenport (mdavenport) 2004-05-28 16:40:45

Re-open if necessary.

By: Digium Subversion (svnbot) 2008-01-15 14:55:36.000-0600

Repository: asterisk
Revision: 3025

U   trunk/channels/chan_sip.c
U   trunk/config.c

r3025 | markster | 2008-01-15 14:55:36 -0600 (Tue, 15 Jan 2008) | 2 lines

Stage 1 of deadlock fix (bug ASTERISK-1650 -- but not yet solved, just started) and fix configs ending with no newline (bug ASTERISK-1647)