Summary:ASTERISK-12480: PRI span is locking up frequently
Reporter:Kevin Shanahan (kmshanah)Labels:
Date Opened:2008-07-29 16:34:19Date Closed:2011-06-07 14:02:53
Versions:Frequency of
Environment:Attachments:( 0) example-20080729.1.log
( 1) example-20080729.2.log
( 2) example-20080730.1.log
( 3) example-20080730.2.log
( 4) example-20080730.3.log
( 5) example-20080730.4.log
( 6) extensions.conf
( 7) gdb.txt
( 8) zapata.conf
( 9) zaptel.conf
Description:Since moving our Asterisk system into production on Monday, we are frequently (every hour or so) having all the PRI channels lock up.

System details:
- Debian "Lenny" base system
- Linux
- Zaptel 1.4.11
- libpri 1.4.3
- Asterisk packages from Debian "Sid"


To qualify "lock up", I mean:

* An outside caller trying to call in will receive a busy signal.
* Outgoing calls can fail in two different ways;

a) When the number is dialled (from a SIP extension), the call may try to proceed, but the caller only hears silence on their handset. The last thing in the log shows is e.g.:

logger.c:     -- Executing [s@macro-outbound:6] Dial("SIP/7307-013a0140", "Zap/g1/12345678") in new stack

b) When the user tries to dial, the phone displays "Service unavailable". In this case, the log shows something different:

[Jul 29 09:24:11] VERBOSE[5294] logger.c:     -- Executing [s@macro-outbound:6] Dial("SIP/7175-01f3e080", "Zap/g1/82435670") in new stack
[Jul 29 09:24:11] WARNING[5294] app_dial.c: Unable to create channel of type 'Zap' (cause 34 - Circuit/channel congestion)
[Jul 29 09:24:11] VERBOSE[5294] logger.c:   == Everyone is busy/congested at this time (1:0/1/0)
[Jul 29 09:24:11] DEBUG[5294] app_macro.c: Executed application: Dial

In both cases note that PRI debugging was enabled on the span, but nothing is output once the system gets into this state.

Also note that the CLI is exhibiting some problems once we get into this state. When using tab completion for zap commands the CLI will stop responding to keystrokes, although I can use ctrl-C to quit and restart the CLI again.

I will attach some logs of activity leading up to the problem, including PRI debugging. The problem seems to reproduce quite easily during business hours, so I'm able to gather more info as required.

Please /query me on Freenode IRC (nick: kmshanah) if you are able to work with me on this issue.
Comments:By: Kevin Shanahan (kmshanah) 2008-07-29 16:55:38

example-20080729.1.log - trivial example which shows:

- Restarted Asterisk
- Logged in to CLI, verbosity 3 (IIRC)
- Enable PRI debugging on span 1
- Made an incoming call
- After that, the whole PRI span just "died" (no more output from pri debug)

I tried calling in again shortly after, but I got the busy signal and there was no activity on the asterisk CLI or the log file.

By: Kevin Shanahan (kmshanah) 2008-07-29 17:05:19

example-20080729.2.log - example of outbound call failures.

I just chose an arbitrary starting point where we still had PRI debug info coming through. The call failures start where we see:

"Unable to create channel of type 'Zap' (cause 34 - Circuit/channel congestion)"

By: Kevin Shanahan (kmshanah) 2008-07-30 01:24:04

Okay, I now have a strong indication that there is a way this can be readily reproduced. I'm about to attach log files which show an incoming call from the same number caused the PRI problem four times in a row. In all cases the last time any activity is seen from the pri is an incoming call to a fax modem our maintenance team use to collect jobs.

I've arranged to run further tests using this scenario later tonight (after everyone goes home so we don't cut off any more client calls!)

By: Russell Bryant (russell) 2008-07-30 14:33:08

Please rebuild Asterisk with DEBUG_THREADS enabled.  Then, when it's in this locked up state, grab the output of "Core show locks".

# asterisk -rx "core show locks" > locks.txt

The debian package probably has modifications, so you'll have to recompile using the official source from downloads.digium.com.

By: Kevin Shanahan (kmshanah) 2008-07-30 18:58:50

Yes, unfortunately it was the Debian modifications causing the problem. I just got to the bottom of this last night, and it's due to an AB<->BA deadlock introduced in the bristuff patches Debian applies.

I'll attach my debug info just for completeness, but this is not a bug in core Asterisk so feel free to close.

(attachment: gdb.txt)