[Home]

Summary:ASTERISK-10507: 1.4.13 lockups
Reporter:hevisko (hevisko)Labels:
Date Opened:2007-10-12 00:54:16Date Closed:2011-06-07 14:00:24
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20071012084054.bz2
( 1) 20071012084126.bz2
( 2) 20071012084256.bz2
( 3) 20071105.full.bz2
( 4) 20071105.messages.bz2
( 5) core.19782.script.gz
( 6) core.20596.script.gz
( 7) core.26207.script.gz
( 8) core.27593.script.bz2
( 9) core.28100.script.bz2
(10) core.30231.script.bz2
(11) core.31310.script.bz2
(12) showlock.20071106.tgz
Description:Okay,

Caught one of these "lockups" on 1.4.13,

noticed the following from full log (And no, there are zillions of those :()

[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'
[Oct 11 12:27:04] DEBUG[23386] channel.c: Triggering queue flush for
spy 'MixMonitor' on 'IAX2/192.168.124.2:4569-8'

****** ADDITIONAL INFORMATION ******

Please update the Asterisk version in the bug tracker ;)
Comments:By: hevisko (hevisko) 2007-10-12 01:00:08

Russell,

   The errors may be in the jb itself, but it may also be that there's
some unusual delay in getting packets _into_ the jb which cause it to
display those messages.  For example, if there's some single-threaded
sections of code, or some kind of locking delays, etc.

   In the 19782 gdb output, you can see:
Thread 72 (process 19796):
#0  0xb7f82410 in __kernel_vsyscall ()
#1  0xb7e47cc6 in nanosleep () from /lib/libc.so.6
#2  0xb7e7544c in usleep () from /lib/libc.so.6
#3  0xb6dacbba in iax2_queue_frame (callno=40, f=0x83eebd0) at
chan_iax2.c:1444
#4  0xb6dad60b in __get_from_jb (p=<value optimized out>) at
chan_iax2.c:1804
ASTERISK-1  0xb6dcdd10 in iax2_process_thread (data=0x81f3d60) at chan_iax2.c:8259
ASTERISK-2  0x0812eb9d in ?? ()

Which seems to be from this code:
595 : matteo 629 static int iax2_queue_frame(int callno, struct
ast_frame *f)
1596 :     {
1597 :     for (;;) {
1598 :     if (iaxs[callno] && iaxs[callno]->owner) {
1599 : russell 82728
<http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=82727&r2=82728>
if (ast_channel_trylock(iaxs[callno]->owner)) {
1600 : matteo 629 /* Avoid deadlock by pausing and trying again */
1601 : markster 1310
<http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=1309&r2=1310>
ast_mutex_unlock(&iaxsl[callno]);
1602 : matteo 629 usleep(1);
1603 : markster 1310
<http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=1309&r2=1310>
ast_mutex_lock(&iaxsl[callno]);
1604 : matteo 629 } else {
1605 : markster 2644
<http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=2643&r2=2644>
ast_queue_frame(iaxs[callno]->owner, f);
1606 : russell 82728
<http://svn.digium.com/view/asterisk/trunk/channels/chan_iax2.c?view=diff&r1=82727&r2=82728>
ast_channel_unlock(iaxs[callno]->owner);
1607 : matteo 629 break;
1608 :     }
1609 :     }


The other output shows the same deadlock avoidance pattern in thread 73,
Thread 56 (and, they seem to be different calls..).

-SteveK

By: hevisko (hevisko) 2007-10-12 01:00:37

Ah, good catch!  That would explain the 100% CPU load, then, if it gets stuck in
this deadlock avoidance loop for forever.

Tilghman improved the "core show locks" functionality for 1.4.13 to be able to
show threads that are stuck in a trylock loop like this.  So, hopefully if
someone can grab that output with 1.4.13, we can see a thread stuck in a trylock
loop like this, as well as what other thread is holding on to the lock.

--
- Hide quoted text -
Russell Bryant
Senior Software Engineer

By: hevisko (hevisko) 2007-10-16 01:04:32

Files 20071012084*.bz2 is the output of
asterisk -rx "core show locks" > `date`
just before the crashes of core.30231 and core.31310

By: hevisko (hevisko) 2007-10-17 00:50:28

Could somebody please change it from CDR/General to CORE/genereal or something more appropriate?

Is anybody looking at this at present?
Anything else I should do/try/etc.??

By: hevisko (hevisko) 2007-10-24 02:10:05

Any progress on this?

By: hevisko (hevisko) 2007-10-25 12:07:26

Problem hit thrice again today.

By: Russell Bryant (russell) 2007-10-31 10:59:23

I have been looking over this issue.  I talked to file, and he is going to backport his re-write of ChanSpy and MixMonitor to 1.4.  Once he has that done, I'd like you to give it a try to see if it solves your problem.  I am very optimistic that it will.  I'll let you know when the code is available.

By: Joshua C. Colp (jcolp) 2007-10-31 13:44:22

Please give the branch located at http://svn.digium.com/svn/asterisk/team/file/audiohooks-1.4 a try and report back. Thanks!

By: hevisko (hevisko) 2007-11-02 06:26:10

First try gives compilatin errors :(

app_meetme.c: in function 'conf_run':
1575: error: 'struct ast_channel' has no member named spies
ditto for lines 1893 and 1900

By: Joshua C. Colp (jcolp) 2007-11-02 08:41:35

Fixed, forgot to update it when I fixed another bug.

By: hevisko (hevisko) 2007-11-05 00:06:35.000-0600

GRrr....
*****************************************
This one is actually worse :(
*****************************************

AS I make the call, it goes into 173% CPU utilization (top's stat on a dual core AMD64 in 32bit mode on Gentoo)

I've included the full and messages info logs.
I've forgot to add the -g flag for the first run, which had the debug_deadlock enabled.That caused LOADS of messages (see messges file)

I've then disabled that flag, did a rerun and caught core.28100 with the gdb stuff.

By: Joshua C. Colp (jcolp) 2007-11-05 07:05:14.000-0600

This may have uncovered a greater issue... did you get a core show locks as well? As for the core dump I don't really see anything useful or an indication of what caused the crash.

By: hevisko (hevisko) 2007-11-05 07:51:47.000-0600

"core show locks".... I knew I missed something :(

Will try tonight after call centre stopped dialing.

By: Tilghman Lesher (tilghman) 2007-11-07 22:12:49.000-0600

hevisko: have you had a chance to obtain a 'core show locks' yet?

By: hevisko (hevisko) 2007-11-08 01:54:43.000-0600

core show locks while going into the crash (as explained before)
Sorry, wiped the core file by accident. Wanted to redo the test, but won't have a decent opportunity soon.

What is/was interesting to watch this one on "top", is that first the CPU peaked at ~73%, then ~100%, then 173% and then >198% (dual core Opteron i686 mode)

By: Joshua C. Colp (jcolp) 2007-11-08 09:36:45.000-0600

Does this happen right off the bat every time you test? The issue is that the lock that protects the SIP channel ends up being the same lock that protects the audiohook. If possible I would like access to the machine so I can add some debugging to isolate it... quite strange.

By: hevisko (hevisko) 2007-11-08 09:44:40.000-0600

The Picture:
SPA901 -> PBX(1.4.12) --iax2 over WiFi(1+%packetloss) -->vrouter(audiohooks-1.4) -->SIPprovider

Yes, it happens on the first call.
The 1.4.13 behaviour is that it happens on "load" ~30-45 concurrent calls

No, I can't provide direct access because of current internet setups,
but you could provide an OpenVPN/sshd site to use for proxying/connecting back something could be arranged.

By: Tilghman Lesher (tilghman) 2007-12-10 15:14:21.000-0600

Is this still an issue in 1.4.15?  We have fixed a ton of deadlocks since 1.4.13.



By: hevisko (hevisko) 2007-12-10 22:54:10.000-0600

No, I haven't checked 1.4.15 yet, didn't looked liked it was relevant going through the changes file;)

Any case, We've moved one leg away from WiFi connection to a diginet/leased line, and since that move that link/set of *s haven't given this lock up problems... yet... what I seriously suspect (and might've been fixed in your code) is that the packet losses is/was the cause of the problem...

By: Tilghman Lesher (tilghman) 2007-12-28 12:58:06.000-0600

Well, I'm going to close it, then.  Reopen if this is still an issue in the current release, as I think we've probably already addressed it.