Summary: | ASTERISK-10507: 1.4.13 lockups | ||
Reporter: | hevisko (hevisko) | Labels: | |
Date Opened: | 2007-10-12 00:54:16 | Date Closed: | 2011-06-07 14:00:24 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) 20071012084054.bz2 ( 1) 20071012084126.bz2 ( 2) 20071012084256.bz2 ( 3) 20071105.full.bz2 ( 4) 20071105.messages.bz2 ( 5) core.19782.script.gz ( 6) core.20596.script.gz ( 7) core.26207.script.gz ( 8) core.27593.script.bz2 ( 9) core.28100.script.bz2 (10) core.30231.script.bz2 (11) core.31310.script.bz2 (12) showlock.20071106.tgz | |
Description: | Okay, Caught one of these "lockups" on 1.4.13, noticed the following from full log (And no, there are zillions of those :() [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' [Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8' ****** ADDITIONAL INFORMATION ****** Please update the Asterisk version in the bug tracker ;) | ||
Comments: | By: hevisko (hevisko) 2007-10-12 01:00:08 Russell, The errors may be in the jb itself, but it may also be that there's some unusual delay in getting packets _into_ the jb which cause it to display those messages. For example, if there's some single-threaded sections of code, or some kind of locking delays, etc. In the 19782 gdb output, you can see: Thread 72 (process 19796): #0 0xb7f82410 in __kernel_vsyscall () #1 0xb7e47cc6 in nanosleep () from /lib/libc.so.6 #2 0xb7e7544c in usleep () from /lib/libc.so.6 #3 0xb6dacbba in iax2_queue_frame (callno=40, f=0x83eebd0) at chan_iax2.c:1444 #4 0xb6dad60b in __get_from_jb (p=<value optimized out>) at chan_iax2.c:1804 ASTERISK-1 0xb6dcdd10 in iax2_process_thread (data=0x81f3d60) at chan_iax2.c:8259 ASTERISK-2 0x0812eb9d in ?? () Which seems to be from this code: 595 : matteo 629 static int iax2_queue_frame(int callno, struct ast_frame *f) 1596 : { 1597 : for (;;) { 1598 : if (iaxs[callno] && iaxs[callno]->owner) { 1599 : russell 82728 <http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=82727&r2=82728> if (ast_channel_trylock(iaxs[callno]->owner)) { 1600 : matteo 629 /* Avoid deadlock by pausing and trying again */ 1601 : markster 1310 <http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=1309&r2=1310> ast_mutex_unlock(&iaxsl[callno]); 1602 : matteo 629 usleep(1); 1603 : markster 1310 <http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=1309&r2=1310> ast_mutex_lock(&iaxsl[callno]); 1604 : matteo 629 } else { 1605 : markster 2644 <http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=2643&r2=2644> ast_queue_frame(iaxs[callno]->owner, f); 1606 : russell 82728 <http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=82727&r2=82728> ast_channel_unlock(iaxs[callno]->owner); 1607 : matteo 629 break; 1608 : } 1609 : } The other output shows the same deadlock avoidance pattern in thread 73, Thread 56 (and, they seem to be different calls..). -SteveK By: hevisko (hevisko) 2007-10-12 01:00:37 Ah, good catch! That would explain the 100% CPU load, then, if it gets stuck in this deadlock avoidance loop for forever. Tilghman improved the "core show locks" functionality for 1.4.13 to be able to show threads that are stuck in a trylock loop like this. So, hopefully if someone can grab that output with 1.4.13, we can see a thread stuck in a trylock loop like this, as well as what other thread is holding on to the lock. -- - Hide quoted text - Russell Bryant Senior Software Engineer By: hevisko (hevisko) 2007-10-16 01:04:32 Files 20071012084*.bz2 is the output of asterisk -rx "core show locks" > `date` just before the crashes of core.30231 and core.31310 By: hevisko (hevisko) 2007-10-17 00:50:28 Could somebody please change it from CDR/General to CORE/genereal or something more appropriate? Is anybody looking at this at present? Anything else I should do/try/etc.?? By: hevisko (hevisko) 2007-10-24 02:10:05 Any progress on this? By: hevisko (hevisko) 2007-10-25 12:07:26 Problem hit thrice again today. By: Russell Bryant (russell) 2007-10-31 10:59:23 I have been looking over this issue. I talked to file, and he is going to backport his re-write of ChanSpy and MixMonitor to 1.4. Once he has that done, I'd like you to give it a try to see if it solves your problem. I am very optimistic that it will. I'll let you know when the code is available. By: Joshua C. Colp (jcolp) 2007-10-31 13:44:22 Please give the branch located at http://svn.digium.com/svn/asterisk/team/file/audiohooks-1.4 a try and report back. Thanks! By: hevisko (hevisko) 2007-11-02 06:26:10 First try gives compilatin errors :( app_meetme.c: in function 'conf_run': 1575: error: 'struct ast_channel' has no member named spies ditto for lines 1893 and 1900 By: Joshua C. Colp (jcolp) 2007-11-02 08:41:35 Fixed, forgot to update it when I fixed another bug. By: hevisko (hevisko) 2007-11-05 00:06:35.000-0600 GRrr.... ***************************************** This one is actually worse :( ***************************************** AS I make the call, it goes into 173% CPU utilization (top's stat on a dual core AMD64 in 32bit mode on Gentoo) I've included the full and messages info logs. I've forgot to add the -g flag for the first run, which had the debug_deadlock enabled.That caused LOADS of messages (see messges file) I've then disabled that flag, did a rerun and caught core.28100 with the gdb stuff. By: Joshua C. Colp (jcolp) 2007-11-05 07:05:14.000-0600 This may have uncovered a greater issue... did you get a core show locks as well? As for the core dump I don't really see anything useful or an indication of what caused the crash. By: hevisko (hevisko) 2007-11-05 07:51:47.000-0600 "core show locks".... I knew I missed something :( Will try tonight after call centre stopped dialing. By: Tilghman Lesher (tilghman) 2007-11-07 22:12:49.000-0600 hevisko: have you had a chance to obtain a 'core show locks' yet? By: hevisko (hevisko) 2007-11-08 01:54:43.000-0600 core show locks while going into the crash (as explained before) Sorry, wiped the core file by accident. Wanted to redo the test, but won't have a decent opportunity soon. What is/was interesting to watch this one on "top", is that first the CPU peaked at ~73%, then ~100%, then 173% and then >198% (dual core Opteron i686 mode) By: Joshua C. Colp (jcolp) 2007-11-08 09:36:45.000-0600 Does this happen right off the bat every time you test? The issue is that the lock that protects the SIP channel ends up being the same lock that protects the audiohook. If possible I would like access to the machine so I can add some debugging to isolate it... quite strange. By: hevisko (hevisko) 2007-11-08 09:44:40.000-0600 The Picture: SPA901 -> PBX(1.4.12) --iax2 over WiFi(1+%packetloss) -->vrouter(audiohooks-1.4) -->SIPprovider Yes, it happens on the first call. The 1.4.13 behaviour is that it happens on "load" ~30-45 concurrent calls No, I can't provide direct access because of current internet setups, but you could provide an OpenVPN/sshd site to use for proxying/connecting back something could be arranged. By: Tilghman Lesher (tilghman) 2007-12-10 15:14:21.000-0600 Is this still an issue in 1.4.15? We have fixed a ton of deadlocks since 1.4.13. By: hevisko (hevisko) 2007-12-10 22:54:10.000-0600 No, I haven't checked 1.4.15 yet, didn't looked liked it was relevant going through the changes file;) Any case, We've moved one leg away from WiFi connection to a diginet/leased line, and since that move that link/set of *s haven't given this lock up problems... yet... what I seriously suspect (and might've been fixed in your code) is that the packet losses is/was the cause of the problem... By: Tilghman Lesher (tilghman) 2007-12-28 12:58:06.000-0600 Well, I'm going to close it, then. Reopen if this is still an issue in the current release, as I think we've probably already addressed it. |