[Home]

Summary:ASTERISK-13307: CLI hang and unresponsive when issuing "show channels" or "core show channels"
Reporter:Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech)Labels:
Date Opened:2009-01-06 03:09:21.000-0600Date Closed:2009-05-12 15:13:01
Priority:MinorRegression?No
Status:Closed/CompleteComponents:General
Versions:Frequency of
Occurrence
Related
Issues:
is the original version of this clone:ASTERISK-20022 CLONE - CLI hang and unresponsive when issuing "show channels" or "core show channels"
Environment:Attachments:( 0) 14178.patch
Description:Asterisk CLI becomes un-responsive after issuing "show channels" or "core show channels". The issue is exhibited on Asterisk 1.4.22 release. The issue had also been exhibited with using versions 1.4.21.X and 1.4.20.

****** ADDITIONAL INFORMATION ******

I couldn't find any indication why would this happen, even increasing the logging level in logger.conf doesn't yield any information regarding this.
The issue is exhibited when the system is using a large number of channels for a long period of time.

The interesting part here that this machine is part of a two machine environment, running the same Kernel and the same code (apart from minor changes in the dialplan). One machine is not exhbiting any issues, while the other does. Originally, we've used two DELL 2950 servers, and were under the impression that the server hardware is causing this, however, the problematic server had been replaced to an IBM server, however, the issue remains.

The configuration of the server is as following:

Hardware
IBM xServer x3650
2GB RAM
2 x 144GB SAS (System Drives)
4 x 300GB SAS (DATA Drives)
2 x TE420P card with HEC

Software
CentOS 5.2 with all updates
Asterisk 1.4.22
FreePBX for management only version 2.4.X
A set of small AGI scripts for the platform usage

In terms of modules compiled, here is the list:
DAHDI 2.1.0.3
LIBPRI 1.4.7
ASTERISK 1.4.22
ASTERISK-ADDONS 1.4.7

Linux Kernel is:
2.6.18-92.1.22.el5 #1 SMP Tue Dec 16 11:57:43 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Another issue that I've noticed (may be related or not), when loading DAHDI and the coresponding Card module, the network seems to halt for almost 20 seconds, then resumes for some reason. I couldn't find any indication why that would happen, however, when the HEC boards are removed - that doesn't happen, however, the problem still remains.

Before the server hardware was replaced, we've replaced(on the DELL servers) the TE420 cards, the HEC modules, processors, memory, motherboards - you name it, it was replaced. Nothing seems to budge this one.

My general feel is that we're having some problems with the TE420P and the HEC module, however, I can't really pin point it, as these were replaced. However, I can't really classify this behaviour to a specific part of Asterisk, in order to classify the bug personality.

Following below is some machine information:

lspci -v Digium related output:

08:08.0 Communication controller: Digium, Inc. Unknown device 0420 (rev 02)
       Subsystem: Unknown device 0004:0000
       Flags: bus master, medium devsel, latency 64, IRQ 122
       Memory at c7000000 (32-bit, non-prefetchable) [size=128]

11:08.0 Communication controller: Digium, Inc. Unknown device 0420 (rev 02)
       Subsystem: Unknown device 0004:0000
       Flags: bus master, medium devsel, latency 64, IRQ 114
       Memory at c4000000 (32-bit, non-prefetchable) [size=128]

/proc/interrupts
          CPU0       CPU1       CPU2       CPU3
 0:   19994098          0          0          0    IO-APIC-edge  timer
 1:          0          0          0          0    IO-APIC-edge  i8042
 8:          1          0          0          0    IO-APIC-edge  rtc
 9:          0          0          0          0   IO-APIC-level  acpi
14:          0          0          0          0    IO-APIC-edge  libata
15:        149          0     179997          0    IO-APIC-edge  ide1
90:         55          0          0          0   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
98:          0          0          0          0   IO-APIC-level  uhci_hcd:usb3, uhci_hcd:usb5
114:      27612   17278848    2650001       9998   IO-APIC-level  wct4xxp
122:      22174    2660071   16950066     328732   IO-APIC-level  wct4xxp
130:     253531          0          0          0         PCI-MSI  eth0
169:      32980          0     826323          0   IO-APIC-level  aacraid
NMI:       1476       2868       2806       1359
LOC:   19986898   19986863   19985283   19985236
ERR:          0
MIS:          0

Cheers,
Nir
Comments:By: Mark Michelson (mmichelson) 2009-01-06 10:49:15.000-0600

If the CLI is becoming unresponsive after issuing a "core show channels" command, I have to believe that this is a deadlock occurring. There are a few pieces of information that you can provide for us to help get this resolved, unfortunately the only way to get this information is to reproduce the problem.

First of all, please run 'make menuselect' from the Asterisk source directory and turn on the options DONT_OPTIMIZE and DEBUG_THREADS. Then recompile Asterisk.

When the problem occurs, connect to the CLI remotely by opening a new terminal window and issuing the command 'asterisk -r'

Even though the CLI is not responsive in the window where the "core show channels" command was executed, it should be operational in the new window. Get the output from the command "core show locks" and upload that output here.

As a safety measure, it would also be a good idea to attach to the frozen Asterisk process with gdb and get the output of the "thread apply all bt" command and upload that here as well. It is important that this output is from the same failure as the output of "core show locks."

This should hopefully be enough information to pinpoint where the deadlock is occurring, assuming there is one.

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-01-06 14:51:00.000-0600

Ok, I'm re-compiling now and will do as you request - as the call center is currently inactive, it may take me up to 24 hours to provide the information.

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-01-07 07:07:29.000-0600

output of "core show locks" as you described:

[root@recording-raanana-2 ~]# asterisk -rx "core show locks"

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 1102244160 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 1451 ast_hangup &chan->lock 0xbc76128 (1)
=== ---> Lock #1 (chan_local.c): MUTEX 515 local_hangup &p->lock 0xbd69b30 (1)
=== ---> Tried and failed to get Lock #2 (channel.c): MUTEX 962 ast_queue_hangup &chan->lock 0xbdfbf18 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 1114949952 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2613 ast_write &chan->lock 0xbdfbf18 (1)
=== ---> Waiting for Lock #1 (chan_local.c): MUTEX 309 local_write &p->lock 0xbd69b30 (1)
=== --- ---> Locked Here: chan_local.c line 515 (local_hangup)
=== -------------------------------------------------------------------
===
=== Thread ID: 1076431168 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 1451 ast_hangup &chan->lock 0xbc3e0f8 (1)
=== ---> Lock #1 (chan_local.c): MUTEX 515 local_hangup &p->lock 0xbbe01e0 (1)
=== ---> Tried and failed to get Lock #2 (channel.c): MUTEX 962 ast_queue_hangup &chan->lock 0xbe46148 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 1084942656 (pbx_thread           started at [ 2645] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2613 ast_write &chan->lock 0xbe46148 (1)
=== ---> Waiting for Lock #1 (chan_local.c): MUTEX 309 local_write &p->lock 0xbbe01e0 (1)
=== --- ---> Locked Here: chan_local.c line 515 (local_hangup)
=== -------------------------------------------------------------------
===
=== Thread ID: 1113966912 (netconsole           started at [ 1026] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1085 channel_find_locked &c->lock 0xbe46148 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 1085540672 (netconsole           started at [ 1026] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1085 channel_find_locked &c->lock 0xbe46148 (0)
=== -------------------------------------------------------------------
===
=======================================================================


I hope this helps a little. One thing that I noticed that strengthns the possiblity of a a dead-lock is that while I waited for a minute or so, the CLI suddenly became responsive again.

Nir

By: Mark Michelson (mmichelson) 2009-01-08 16:59:51.000-0600

Sorry, I didn't get an e-mail notification that you had added a note, which is why I didn't respond at all yesterday.

A quick glance at the output shows that there is a conflict of lock order occurring between local_pvt structures and ast_channel structures. Specifically, in your output, threads 1084942656 and 1076431168 are the competing threads. What you are saying about the CLI becoming responsive again makes sense, too, considering that thread 1076431168 will eventually just give up trying to get the lock and everything will "unwind" after that. I'll see if this can be avoided.

By: Mark Michelson (mmichelson) 2009-01-08 17:03:04.000-0600

Heh, my eyes were drawn to the bottom set of threads, but it appears that in your output 1102244160 and 1114949952 also have the exact same conflict occurring.

By: Leif Madsen (lmadsen) 2009-01-08 17:03:26.000-0600

Just to make a note, I'm seeing the exact same thing on a clients box.

By: Mark Michelson (mmichelson) 2009-01-08 17:25:15.000-0600

I have uploaded 14178.patch in an attempt to avoid the deadlock seen here. Please see if you still experience the problem with the patch applied. If you do still see a deadlock occurring, it is most likely a different one than the specific one I have fixed here and also needs to be addressed. In other words, if you still have a deadlock occur, I'll need to see new "core show locks" output if it occurs.

Thanks!

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-01-09 06:54:56.000-0600

Ok, I'll compile and check it out, I'll have an answer only on sunday I believe, as the call center isn't working today.

Thanks!

By: Leif Madsen (lmadsen) 2009-01-09 07:11:46.000-0600

I've applied the patch to my clients system as of last night. The first indication is that it fixed it, but I won't know for sure until we get some load on the system and a lot of channels to know if it really was fixed, so might be a few days.

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-01-09 07:14:01.000-0600

Ok,

 I've deployed to our systems over here, looks ok so far. We reached about 90 concurrent calls is less than 1 minute, and the CLI didn't lock up.

Thanks

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-01-10 14:44:46.000-0600

Status Report on the deployed patch (10/01/2008 - 22:44 GMT+2)

It's now been 17 hours into operations, and the deadlock hadn't been observed again. It would appear that the patch solved it. Let's monitor for another 48 hours and see if all is well.

Nir

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-01-12 21:46:42.000-0600

Status Report on the deployed patch (13/01/2008 - 5:45 GMT+2)

Ok, Patch appears to have solved the dead-lock issue.
I believe this bug can now be closed and patch merged into upstream.

Thanks,
Nir

By: Mark Michelson (mmichelson) 2009-01-13 13:04:32.000-0600

Thank you for your updates. I will merge this patch in.

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-01-13 13:06:46.000-0600

You are welcome.

By: Mark Michelson (mmichelson) 2009-01-13 13:10:15.000-0600

Hmm, I feel a bit silly. It turns out this exact same deadlock was reported in issue ASTERISK-12877 and I made the exact same patch for it and merged it in already. I apparently forgot that I had done that. Anyway, I'm still closing this issue, and now you will know why there is no message from the svnbot on this issue when it is closed.

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-04-07 03:19:27

Hi Guys,

I've got a few systems running with version 1.4.24, and this CLI lockup is still happening. Had this been merged into the 1.4 tree already?



By: Mark Michelson (mmichelson) 2009-04-07 11:46:53

Are you sure it's the same deadlock? The one you originally reported definitely has been fixed in the 1.4 branch. The patch I uploaded here was actually merged into 1.4 back in October last year. Could you get the "core show locks" output again so that we may verify that the same problem is still happening? Thanks!

By: Nir Simionovich (GreenfieldTech - Israel) (greenfieldtech) 2009-04-11 15:06:13

Well,

As I installed it from the packages repository, I'll need to recompile and try again to view the dead-locks on it. What I've noticed is that the dead-lock seems to free up in stages. Meaning that the output begins, then holds for a few seconds, then continues again, holds and continues intermittently.

Nir

By: Leif Madsen (lmadsen) 2009-04-15 10:58:49

It is always advisable to report issues against a compiled, latest version of Asterisk rather than packaged versions, as we have no way to know what they have changed in the packaged version.

By: Tilghman Lesher (tilghman) 2009-05-12 15:13:01

Given the lack of feedback, I'm assuming the remaining lockup was due to a packaging change.

Also, given that this issue already has a fix confirmed for it, please do not reopen this issue, but open a new issue if you are still having problems.