
Summary: ASTERISK-18899: Erroneous ISDN 44 Rejection Hangup() bug
Reporter: CJ Oster (lordvadr)
Labels:
Date Opened: 2011-11-21 12:20:31.000-0600
Date Closed: 2011-11-29 11:24:12.000-0600
Priority: Critical
Regression? Yes
Status: Closed/Complete
Components: Channels/chan_dahdi
Versions: 1.8.7.0
Frequency of Occurrence: Frequent
Related Issues:
is caused by ASTERISK-18687 CLONE - [regression] Asterisk 1.8.7.0-rc1: configure error (libpri related)
is related to ASTERISK-18687 CLONE - [regression] Asterisk 1.8.7.0-rc1: configure error (libpri related)
Environment: CentOS 5.7 64-bit, asterisk 1.8 installed from repo, asterisk-dahdi-1.8.7.1-1_centos5 and asterisk18-core-1.8.7.0-2_centos5
Attachments:
Description:

--------
Synopsis
--------

An asterisk 1.6.2.11 box was upgraded to 1.8.7.1 by uninstalling the source build and installing via yum from the digium repo.  The following morning, approximately 12 hours later, customers called complaining that they were not receiving inbound calls.  Test calls showed that most (90% or more) inbound calls were receiving an "all circuits busy" message.

The problem recurred at intervals of several minutes to several hours during high usage, and of several hours during little or no usage (evening/overnight, for example).

Debug logs showed everything working normally until, for whatever reason, asterisk rejected a call, typically with a cause 41.  In our experience, a heavily used asterisk box will fairly frequently reject calls for various normal reasons.  After this occurred, we saw our carrier repeatedly trying to establish a call on channel X (typically 2 or 3), and each attempt was promptly rejected with an ISDN cause 44.  Our side appeared to still have the channel in use.  We opened a ticket with the carrier asking why they were sending calls to a busy channel.  They told us the channel was not busy.

During testing, I noticed that the channel's call pointer was not getting NULL'ed out after the reject/hangup.  I inserted a line of code near the end of the function sig_pri_hangup() in sig_pri.c (patch at the bottom) to set the call pointer to NULL, similar to most other sections of code where this is done after calling pri_hangup().  This has led to 72 hours of proper behavior so far.
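To illustrate the failure mode described above, here is a minimal, self-contained sketch.  It is NOT the code from sig_pri.c or libpri: every name in it (toy_call, toy_chan, toy_lib_hangup, toy_chan_hangup, toy_chan_is_idle) is hypothetical, and it does not model who owns or frees the call object, which is exactly the open question about a possible memory leak.  It only shows how a channel keeps looking busy when its call pointer survives a hangup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real libpri/Asterisk types. */
struct toy_call {
    int cause;               /* last hangup cause recorded on the call */
};

struct toy_chan {
    struct toy_call *call;   /* non-NULL means "a call owns this channel" */
};

/* Stand-in for a library hangup call: it records the cause on the call
 * object but cannot touch the caller's pointer to that object. */
static void toy_lib_hangup(struct toy_call *call, int cause)
{
    assert(call != NULL);
    call->cause = cause;
}

/* Hang up a channel.  With clear_pointer == false the stale pointer
 * survives the hangup; with clear_pointer == true it is reset to NULL,
 * which is the effect of the one-line workaround described above. */
static void toy_chan_hangup(struct toy_chan *chan, int cause, bool clear_pointer)
{
    if (chan->call == NULL) {
        return;              /* nothing to hang up */
    }
    toy_lib_hangup(chan->call, cause);
    if (clear_pointer) {
        chan->call = NULL;
    }
}

/* In this toy model a channel accepts a new call only when idle. */
static bool toy_chan_is_idle(const struct toy_chan *chan)
{
    return chan->call == NULL;
}
```

Calling toy_chan_hangup() with clear_pointer set to false leaves toy_chan_is_idle() returning false for a channel that has no live call, which corresponds to the "channel still in use on our side" symptom.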

My assumption is that my fix is going to lead to a memory leak of some kind, simply because most calls that get hung up hit this chunk of sig_pri.c and behave normally.  I don't know where that pointer gets NULL'ed out normally, nor why that doesn't happen when a call gets rejected.

We believe this is potentially related to DAHLIN-254.

Again, I do not know where to look to fix this appropriately, or if this is the correct fix.  I just wanted someone to benefit from my sleepless nights, hours of frustration, broken keyboards and heavy drinking over this issue.


-----------------
Relevant software
-----------------

libpri-1.4.11.5-1_centos5
asterisk-dahdi-1.8.7.1-1_centos5
asterisk18-core-1.8.7.0-2_centos5
dahdi-linux-2.5.0.1-1_centos5
(and associated dependencies)


-----------------
Relevant Hardware
-----------------

Communication controller: Digium, Inc. Wildcard TE420 quad-span T1/E1/J1 card 3.3V (PCI-Express) (5th gen) (rev 02)
*Ethernet controller: Digium, Inc. Wildcard TE121 single-span T1/E1/J1 card (PCI-Express) (rev 11)

*This board is not in use, and we also believe it to be faulty.  We do not think either of these points are relevant.


--------------
How to produce
--------------

Take a DID that lives on an ISDN PRI channel and Hangup(34) it.  I don't believe the 34 is relevant because this issue self-presents during other normally rejected calls (appears to be a 41 from the logs).  I should also mention that the Hangup() is a few steps into the dialplan, after digit-fixup and a wait(1) for CNAM to come through (although I did test it as the first and only step before I knew what the issue was; IIRC the behavior was identical).

Following this hangup, 'dahdi show channel 1' will show "PRI Flags: call" for the now-hungup call.

A call from the carrier (who in our case was using ascending channel selection) would get immediately rejected with a 44.  The next call would also hit this channel.  Basically, no more incoming calls would work on this or higher channels because no call would ever establish on that particular channel.  Eventually someone on a lower channel would hang up, allowing one call to establish correctly, which explains the roughly 90% failure rate.
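The interaction between ascending channel selection and a stuck channel can be sketched with a small simulation.  This is a toy model under stated assumptions, not carrier or Asterisk code: the names (toy_span, toy_carrier_pick, toy_offer_call) are hypothetical, the 23-channel span size is an assumption (a T1 PRI), and channel indices here are 0-based, unlike the 1-based numbering used by 'dahdi show channel'.  Cause 44 is the Q.850 "requested circuit/channel not available" value.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_CHANNELS 23                 /* assumption: T1 PRI, 23 B-channels */
#define TOY_CAUSE_CHANNEL_UNAVAILABLE 44    /* Q.850 cause 44 */

/* Hypothetical model: each side keeps its own idea of which channels
 * are in use.  The bug makes the two views disagree. */
struct toy_span {
    bool carrier_busy[TOY_NUM_CHANNELS];    /* the carrier's view */
    bool local_busy[TOY_NUM_CHANNELS];      /* our view (stale call pointer) */
};

/* Ascending selection: the carrier picks the lowest channel it believes
 * is idle.  Returns the channel index, or -1 if none is idle. */
static int toy_carrier_pick(const struct toy_span *s)
{
    for (int i = 0; i < TOY_NUM_CHANNELS; i++) {
        if (!s->carrier_busy[i]) {
            return i;
        }
    }
    return -1;
}

/* Offer one inbound call.  Returns 0 if the call is established, the
 * cause code if our side rejects it, or -1 if the carrier had no idle
 * channel to offer. */
static int toy_offer_call(struct toy_span *s)
{
    int ch = toy_carrier_pick(s);
    if (ch < 0) {
        return -1;
    }
    if (s->local_busy[ch]) {
        /* We think the channel is in use.  The carrier's view does not
         * change, so its next call lands on the same channel. */
        return TOY_CAUSE_CHANNEL_UNAVAILABLE;
    }
    s->carrier_busy[ch] = true;
    s->local_busy[ch] = true;
    return 0;
}
```

With channel 0 busy on both sides and channel 1 marked busy only in local_busy, every call to toy_offer_call() returns 44 until channel 0 is freed on both sides; exactly one call then establishes on channel 0, and the rejections resume.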

'dahdi restart' will clear the PRI Flags on the offending channel allowing it to work again (but also tearing down all other calls).


-------------
Patch Applied
-------------

The one-line patch to asterisk-1.8.7.0 is available at http://test.ctc.biz/sig_pri.c.patch.txt
Comments:

By: Matt Jordan (mjordan) 2011-11-21 13:45:12.876-0600

There are some known issues with libpri that are addressed in both the 1.8 branch and 1.8.8-rc4.  Would it be possible to test using the 1.8.8 release candidate, or the 1.8 branch?

By: Richard Mudgett (rmudgett) 2011-11-21 13:46:09.378-0600

This issue is likely the result of ASTERISK-18687.

By: CJ Oster (lordvadr) 2011-11-22 09:02:38.241-0600

I can't test for a while, at least.  This is a production system and I had an uprising on my hands.  I need stability for a couple of months.  We used to get stung by this bug roughly twice a day.  It's been 4 days now, with 3500 calls yesterday.  No memory leak or other problems.  I'll need to get a dev system together, but it's not an absolute must for us to be on 1.8--it just made sense to have the features available.

If Digium wants to donate the hardware and a PRI, I'll be happy to be a test bed--even give the devs access to the box.

After a week of complete instability, I'd say this is a pretty serious bug.  In any case, I don't really care about the whats and whys, I just wanted to share with you folks what I found.

By: Leif Madsen (lmadsen) 2011-11-29 11:24:12.650-0600

Based on feedback from developers and the reporter's inability to continue testing, I'm closing this issue as potentially fixed in 1.8.8.0.