Summary: ASTERISK-10504: SIP deadlocks unexpectedly at random intervals, trigger unknown
Reporter: xmarksthespot (xmarksthespot)
Labels:
Date Opened: 2007-10-11 13:24:13
Date Closed: 2008-01-09 10:29:02.000-0600
Priority: Blocker
Regression?: No
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) additionnalinfo.txt
( 1) bt.txt
( 2) btfull.txt
( 3) csl28122007.txt
( 4) locklog3
( 5) threadapplyallbt.txt
( 6) threadapplyallbtfull.txt
( 7) threadapplyallbtfull28122007.txt
Description:
This is a relatively simple bug report in the sense that I have no idea what triggers the deadlock or when it happens.

From where I stand, it appears to happen at random intervals. However, it does not deadlock during the night, so it seems to happen only when there are active SIP channels.

Other than that I cannot say.

Luckily, I was able to run "core show locks" quite a number of times during deadlocks, as they happen quite often during the day (sometimes more than once a day).

Something interesting happened during the latest deadlock, which occurred today. When I ran "core show locks", it crashed the machine. I am attaching the bt and bt full output so you can figure out what went wrong this time around.

****** STEPS TO REPRODUCE ******

Absolutely unknown.

The setup is like so:

SIP Phones <--SIP--> MyPBX <--SIP--> Asterisk1 <--PRI--> PSTN

Let me reiterate that I have no idea what conditions make it deadlock.

****** ADDITIONAL INFORMATION ******

I have no idea from which version it started happening as this is a new machine. The best I can tell is that it happened with svn 83976 too.

There's not much else to say.
Comments:
By: Mark Michelson (mmichelson) 2007-10-11 14:15:38

When you open the core file with gdb, could you issue the following two commands and then upload the output? Thanks.

f 5
p *lock_info

By: xmarksthespot (xmarksthespot) 2007-10-11 14:20:14

I added it, it's in the file additionnalinfo.txt

By: Mark Michelson (mmichelson) 2007-10-11 17:35:18

I've examined the core show locks output some more and I was wondering if the apparent deadlock goes away after a few seconds or if it never does.

The reason I ask is that the section of SIP code that sends MWI appears to be holding the lock, and the manager command that attempts to show SIP peers is waiting on the same lock. Since I know that you use IMAP for your voicemail boxes, it wouldn't surprise me if the MWI check took longer than expected and stalled the system. I wonder if setting your checkmwi value larger (to 60 seconds, for instance) clears this up.
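
The stall described above can be illustrated with a small sketch. This is not Asterisk code: `peer_lock`, `send_mwi` and `show_peers` are hypothetical stand-ins for the contended lock, the MWI sender and the manager command, and the sleep stands in for a slow IMAP mailbox check. It only shows how a slow operation performed while holding a lock makes every other user of that lock look hung.

```python
import threading
import time

# Hypothetical stand-in for the lock that both code paths need.
peer_lock = threading.Lock()
holding = threading.Event()


def send_mwi(mailbox_check_seconds):
    """Hold the lock for the whole (slow) mailbox check."""
    with peer_lock:
        holding.set()
        time.sleep(mailbox_check_seconds)  # stands in for a slow IMAP round trip


def show_peers(wait_seconds):
    """Try to take the same lock; return True if it was acquired in time."""
    acquired = peer_lock.acquire(timeout=wait_seconds)
    if acquired:
        peer_lock.release()
    return acquired


def demo(mailbox_check_seconds, wait_seconds):
    """Run one MWI check and one 'show peers' attempt against the same lock."""
    holding.clear()
    worker = threading.Thread(target=send_mwi, args=(mailbox_check_seconds,))
    worker.start()
    holding.wait()  # make sure the MWI thread owns the lock first
    result = show_peers(wait_seconds)
    worker.join()
    return result


if __name__ == "__main__":
    print("slow check, short wait:", demo(0.5, 0.1))
    print("fast check, longer wait:", demo(0.05, 1.0))
```

In this sketch, raising the interval between checks would only make the stall less frequent. It would not shorten the time the lock is held during each check.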

By: xmarksthespot (xmarksthespot) 2007-11-16 15:59:56.000-0600

As you suggested, I increased the MWI check interval from 10 seconds to 60 seconds. I will report back in a week on whether it still deadlocks with the longer interval.

By: xmarksthespot (xmarksthespot) 2007-11-23 12:49:51.000-0600

One week has passed, and I am reporting in.

Apparently the machine has deadlocked 2 times during the week, which is a lot better than the 4 times a day from before.

There's probably a correlation between the checkmwi period and how often the deadlock occurs; however, slowing it down to 60 seconds has not eliminated the issue. Something else might be happening here. A patch has been submitted for 11275, which might allow me to get a full core show locks next time it happens.

Stay tuned.

By: Tilghman Lesher (tilghman) 2007-12-26 11:37:43.000-0600

xmarksthespot:  It's been a month since your last post in which you encouraged us to "stay tuned".  What is the current status?

By: Russell Bryant (russell) 2007-12-26 12:08:59.000-0600

A bunch of stuff has been fixed in this area in the past month or so.  I'm going to mark this as suspended.  If you still have a problem, please let us know, and provide updated "core show locks" output.

Thanks

By: xmarksthespot (xmarksthespot) 2007-12-28 14:55:57.000-0600

The issue seemed to reappear on December 28th, 2007.

The revision is now SVN 1.4, 90040.

SIP deadlocked, but thanks to the work done I was able to retrieve a core show locks, and I then ran kill -11 on the process to get a core file.

The issue seems more complicated now than before.

I ran a "thread apply all bt full" on the core file, which I am posting here.

You will notice that the threads appear to be optimized. That impression would be mistaken: upon checking menuselect.makeopts, I found this:

"MENUSELECT_CFLAGS=DEBUG_CHANNEL_LOCKS DEBUG_THREADS DEBUG_THREADLOCALS DETECT_DEADLOCKS DONT_OPTIMIZE LOADABLE_MODULES"

So the build is in fact not optimized, yet it appears to be optimized.

I am submitting an updated core show locks and the thread apply all bt full. I still have the core file, so I can check anything else you need, as usual.

The deadlock did not clear up after a few minutes; it had been ongoing for "a while", possibly more than half an hour.

Thank you as usual.

By: Mark Michelson (mmichelson) 2008-01-09 10:29:01.000-0600

The core show locks output you have provided is essentially the same as the previous one, and points to the same problem I pointed out before. I discussed this with you on IRC recently, and one of the possibilities we discussed was broken communication over the TCP socket between Asterisk and the IMAP server.

Based on the report from issue ASTERISK-11138, it appears that setting timeouts for the TCP transactions can be helpful in relieving hangs when using IMAP. To close that issue, I added the ability to set timeouts in voicemail.conf, but only in trunk. For this particular issue, I would be willing to bet that setting timeouts would fix this as well.

Unfortunately, like I said, I only made the change to trunk, so you have a few ways of handling this:

1. Upgrade to trunk.
2. Backport the fix for 11665 into your 1.4 installation. Fortunately this would not be a difficult backport. See trunk revision 96934 to see what changes were made.
3. Directly modify the timeout values in tcp_unix.c in the IMAP c-client source.

Since I am pretty much convinced that this lack of timeouts is what is causing the issue, I am going to close this. Once again, feel free to reopen if instituting the timeouts does not fix the problem.
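
As a rough illustration of why a read timeout helps here, the sketch below connects to a local server that accepts the connection and then never answers, which is one way a broken TCP conversation with an IMAP server might look. It is plain Python, not the c-client code in tcp_unix.c and not the voicemail.conf option; `start_silent_server` and `read_with_timeout` are hypothetical names. With a timeout set, the read gives up and returns control to the caller instead of blocking indefinitely.

```python
import socket
import threading


def start_silent_server():
    """Listen on a local port, accept one connection, and never send a reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    stop = threading.Event()

    def serve():
        conn, _ = listener.accept()
        stop.wait()  # keep the connection open without answering
        conn.close()
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return port, stop


def read_with_timeout(port, read_timeout):
    """Return the reply, or None if the server stays silent past read_timeout."""
    with socket.create_connection(("127.0.0.1", port), timeout=read_timeout) as client:
        client.settimeout(read_timeout)  # applies to recv() as well as connect()
        try:
            return client.recv(1024)
        except socket.timeout:
            return None


def demo():
    """Read from the silent server with a short timeout, then shut it down."""
    port, stop = start_silent_server()
    try:
        return read_with_timeout(port, 0.2)
    finally:
        stop.set()


if __name__ == "__main__":
    print("reply from silent server:", demo())
```

Without the `settimeout` call (and without the `timeout` argument to `create_connection`), the `recv` would block for as long as the server stays silent, and any lock held by the caller would stay held for that whole time.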