[Home]

Summary:ASTERISK-18154: Regression: Channel Restarts, Congestion, Limited Call Capacity on PRI
Reporter:Justin Phelps (ecnf)Labels:
Date Opened:2011-07-19 11:47:45Date Closed:2011-08-24 08:22:32
Priority:MajorRegression?
Status:Closed/CompleteComponents:
Versions:1.8.5.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Debian 6.0.2 Dahdi Complete 2.5.0+2.5.0 Libpri 1.4.12 FreePBX 2.9Attachments:( 0) full.log
( 1) pri_show_channels_v1.8.patch
Description:In response to PRI-124, I upgraded my asterisk version from 1.8.2.4 to 1.8.5.0.

About 1/2 way through the next day, incoming and outgoing calls started receiving a high number of congestion messages. I would say about 90% of attempted calls were affected. Running dahdi restart would give me another 10 to 12 hours of stable calls, and then the congestion would pickup again.

I have since downgraded to asterisk 1.8.4.4, and the pri seems to have stabilized.

I will be attaching the full log from my asterisk box to this issue, as well as my dahdi configuration information.
Comments:By: Justin Phelps (ecnf) 2011-07-19 11:51:18.928-0500

span=1,0,0,ESF,B8ZS
bchan=1-23
dchan=24
loadzone=us
defaultzone=us
/etc/dahdi/system.conf (END)


By: Justin Phelps (ecnf) 2011-07-19 11:52:08.124-0500

; Autogenerated by /usr/sbin/dahdi_genconf on Fri Feb 25 13:21:55 2011
; If you edit this file and execute /usr/sbin/dahdi_genconf again,
; your manual changes will be LOST.
; Dahdi Channels Configurations (chan_dahdi.conf)
;
; This is not intended to be a complete chan_dahdi.conf. Rather, it is intended
; to be #include-d by /etc/chan_dahdi.conf that will include the global settings
;

; Span 1: WCT1/0 "Wildcard TE121 Card 0" (MASTER) B8ZS/ESF REDClockSource
group=0,11
context=from-pstn
switchtype = national
signalling = pri_cpe
channel => 1-23
context = default
group = 63

dahdi-channels.conf (END)


By: Justin Phelps (ecnf) 2011-07-19 11:53:36.563-0500

The full log file is huge, so I'm uploading to my dropbox account, and linking to it here.

http://dl.dropbox.com/u/13160/asterisk/full.0
http://dl.dropbox.com/u/13160/asterisk/full

These files will take time to upload.

By: Justin Phelps (ecnf) 2011-07-19 11:58:25.165-0500

If you grep the full log for "restart" you will see a large number of channel restarts before the 15th. The 15th is the day that I downgraded from 1.8.5.0. I upgraded from 1.8.2.4 to 1.8.5.0 on the 12th, per the previous issue.

By: Justin Phelps (ecnf) 2011-07-19 11:59:04.745-0500

Please let me know what additional information you need to help diagnose this issue.

By: Richard Mudgett (rmudgett) 2011-07-20 13:37:37.076-0500

I cannot see anything obvious in that HUGE (1Gbyte) log file to help diagnose what is going on.  There seem to be quite a few call collisions.  It does appear that the log file spans several Asterisk and libpri version changes which does not help.

chan_dahdi does not handle call collisions (glare) very well because of how it is designed.  It originally supported POTS ports and handles the B channels as such.  One thing you can do to reduce the occurrence is to change the channel allocation strategy.  The network seems to be allocating from lowest B channel to highest.  For your outgoing calls you can then allocate from highest B channel to lowest by using the upper case G in the dial application: Dial(DAHDI/G0/exten).

By: Justin Phelps (ecnf) 2011-07-20 13:51:05.881-0500

What information would you need to help isolate the problem? I've been on 1.8.4.4 for a few days now, and everything seems stable, which reinforces the idea that there's a regression somewhere in the way Asterisk talks to dahdi.

By: Richard Mudgett (rmudgett) 2011-07-20 16:10:55.181-0500

[^pri_show_channels_v1.8.patch] is a backport of the CLI command "pri show channels".  It is useful to quickly see if there are any stuck B channels when there should not be any active calls.  The output might be helpful.

It will apply ok to v1.8.5 and later code.

I also committed it to the v1.8 branch -r329012.

By: Justin Phelps (ecnf) 2011-08-17 10:48:39.851-0500

Due to the previous ticket about 1-800 numbers, I was forced to update to 1.8.5 again. 1.8.4.4 started to refuse to connect to those numbers.

I'll update this with my config files again. What information do you need to help me isolate and solve this problem?

By: Justin Phelps (ecnf) 2011-08-17 10:49:06.003-0500

# cat /etc/dahdi/system.conf
# Autogenerated by /usr/sbin/dahdi_genconf on Fri Feb 25 12:18:41 2011
# If you edit this file and execute /usr/sbin/dahdi_genconf again,
# your manual changes will be LOST.
# Dahdi Configuration File
#
# This file is parsed by the Dahdi Configurator, dahdi_cfg
#
# Span 1: WCT1/0 "Wildcard TE121 Card 0" (MASTER)
span=1,1,0,esf,b8zs
# termtype: te
bchan=1-23
dchan=24
echocanceller=mg2,1-23

# Global data

loadzone = us
defaultzone = us


By: Justin Phelps (ecnf) 2011-08-17 10:52:42.527-0500

/etc/asterisk/chan_dahdi.conf

[general]

[channels]

faxdetect=incoming
; [span_1]
signalling=pri_cpe
switchtype=dms100
pridialplan=national
prilocaldialplan=national
group=0
context=from-pstn
channel=1-23

By: Richard Mudgett (rmudgett) 2011-08-17 11:37:56.613-0500

As I stated earlier, minimize glare and the output of "pri show channels" may help.

By: Justin Phelps (ecnf) 2011-08-17 11:51:26.542-0500

When trying to patch 1.8.5.0, I get the following output:


/usr/src/asterisk-1.8.5.0# patch -i 40726_pri_show_channels_v1.8.patch
patching file sig_pri.h
Hunk #1 FAILED at 521.
1 out of 1 hunk FAILED -- saving rejects to file sig_pri.h.rej
patching file chan_dahdi.c
Hunk #1 FAILED at 14128.
Hunk #2 FAILED at 14271.
Hunk #3 FAILED at 14282.
Hunk #4 FAILED at 14403.
Hunk #5 FAILED at 14995.
5 out of 5 hunks FAILED -- saving rejects to file chan_dahdi.c.rej
patching file sig_pri.c
Hunk #1 FAILED at 124.
Hunk #2 FAILED at 7213.
Hunk #3 FAILED at 7634.
3 out of 3 hunks FAILED -- saving rejects to file sig_pri.c.rej


By: Richard Mudgett (rmudgett) 2011-08-17 12:07:15.948-0500

You should have gotten from an unmodified v1.8.5.0:

~/projects/sa/asterisk/tags/v1.8.5.0$ patch -p0 -i pri_show_channels_v1.8.patch
patching file channels/sig_pri.h
patching file channels/chan_dahdi.c
Hunk #1 succeeded at 14127 (offset -1 lines).
Hunk #2 succeeded at 14271 (offset -1 lines).
Hunk #3 succeeded at 14315 (offset -1 lines).
Hunk #4 succeeded at 14436 (offset -1 lines).
Hunk #5 succeeded at 15029 (offset -1 lines).
patching file channels/sig_pri.c


By: Justin Phelps (ecnf) 2011-08-17 12:22:55.134-0500

Thanks, that worked. I was missing the -p0.

I'm compiling now, and I will install this. I'm guessing when you say "remove glare" you mean install this and watch the pri after hours so we can get a good idea about what is going on. Is this correct?

I also was looking at the 1.8.6-rc1 changlog, and noticed this entry. Do you think this might be the issue I'm experiencing?

2011-05-10 23:41 +0000 [r318499]  Richard Mudgett <rmudgett@digium.com>

* channels/sig_pri.c, channels/sig_ss7.c: Unable to pickup
 DAHDI/PRI call because call state is reported as DIALING. The
 channel state is not updated to RINGING when an ALERTING message
 is received. Regression caused when sig_pri.c (also sig_ss7.c)
 extracted from chan_dahdi.c. * Added missing channel state update
 to RINGING when the AST_CONTROL_RINGING frame is queued for ISDN
 and SS7. (closes issue #19257) Reported by: alecdavis Patches:
 issue19257_v1.8_v2.patch uploaded by rmudgett (license 664)
 Tested by: alecdavis, rmudgett

By: Richard Mudgett (rmudgett) 2011-08-17 13:12:34.459-0500

The [^pri_show_channels_v1.8.patch] patch just adds the CLI "pri show channels" command.  Using the command after hours should show any stuck channels.

How to minimize glare (call collisions) I explained earlier (Comment on 20/Jul/11 1:37 PM above).  In a glare situation, one or both calls will fail so minimizing them is a good thing.  Minimizing glare should also reduce the possibility of a channel getting stuck.

I don't think the changelog fix you found would cause your problem since your issue seems to be stuck channels.  On the other hand, it couldn't hurt and is only adding a group of 5 lines.

By: Trey Blancher (tblancher) 2011-08-17 14:18:45.086-0500

Reporter has filed Digium Technical Support case #00243578 to address this issue.  This comment is simply documenting that we're watching this issue.


By: Justin Phelps (ecnf) 2011-08-17 14:22:27.925-0500

I have reconfigured my trunk to use group 0 decending. I can't seem to find a way to verify outgoing calls are starting at the top of the pri, b/c dahdi show channels and core show channels don't give that information. I'm guessing the pri show channels option I've patched into 1.8.5 will help me verify the setting?

I will be installing 1.8.5 in about 3 hours, after most people go home. I hope to have a good report for you then.

By: Richard Mudgett (rmudgett) 2011-08-17 14:46:03.672-0500

Yes, the pri show channels command can be used to verify that incoming and outgoing calls are allocating channels from opposite ends of the B channel range.  The most important column for determining if a channel is stuck is the "Chan Idle" column.  If a channel is not idle, Asterisk will not use it for new calls.

By: Justin Phelps (ecnf) 2011-08-17 17:13:45.759-0500

I have successfully installed the patched version of 1.8.5.0.

pri show channels confirms I have incoming calls starting at channel 1, and outgoing starting at channel 23.

Do you have a recommended method of stress testing my pri, or should I let it work throughout the night and see how it does during the day?

By: Richard Mudgett (rmudgett) 2011-08-17 18:37:35.780-0500

The most stressful thing for channel allocations is the call collision.  Two calls are fighting for the same B channel.  When I reworked that area for v1.8.5 (-r312575), my test setup attempted to blast 10 incoming and 10 outgoing calls at the same time with both sides picking channels the same way.

The incoming calls starting at 1 and outgoing call starting at 23 setup is designed to minimize call collisions.  That is best for production use.  I expect your system will work better now.


By: Justin Phelps (ecnf) 2011-08-18 08:30:59.259-0500

I've monitored the system through the night, and it's coming under load this morning. It usually lasts for several hours before I see the excessive congestion messages. If it makes it a couple of days, I'd say the problem is solved.

Is there a command that I can run that tells me how many call collisions there have been on the system since starting?

By: Justin Phelps (ecnf) 2011-08-18 08:40:36.423-0500

Are entries like this normal?

  1    1 Yes  No   Connect    Yes  DAHDI/i1/-b1

There doesn't seem to be any numbers associated with this incoming call. Are they blocking caller id?

By: Trey Blancher (tblancher) 2011-08-18 09:10:58.739-0500

Attached URL to support case.

Justin,

We'd need to see the PRI debug, it may be that the network isn't providing CID for that call, unless it's outbound.  Do you know what the status of that call was?  Was it answered properly, or did it have any problems?


By: Justin Phelps (ecnf) 2011-08-18 09:20:07.225-0500

I do know it was an Inbound Call. My cli scrolls by so fast, it's hard to keep track of a single call, and with pri debug on, it is unreadable. If I enable debug, won't that show up in my full log?

By: Trey Blancher (tblancher) 2011-08-18 10:16:38.526-0500

If you enable PRI debug, it should show up in the full log.  Note that you may need to execute a "logger reload" after enabling PRI debugging to have the messages show in /var/log/asterisk/full.

By: Richard Mudgett (rmudgett) 2011-08-18 11:16:54.863-0500

Q: Is there a command available to tell how many call collisions there have been?
A: Call collisions are not tracked.
There is a debug message that can be enabled:
ast_debug(1, "Span %d: SETUP requested unavailable channel %d/%d.  Attempting to renegotiate.\n"
It is output when an incoming SETUP requests a channel that is already in use.

There is a span device state available to indicate when the span is fully utilized:
DAHDI/I<span>/congestion

Q: DAHDI/i1/-b1 channel name.  Are they blocking caller id?
A: The channel name for an incoming call uses the caller id provided in the SETUP message.  If it is not there then that is the kind of channel name generated.  For outgoing calls, the called number is used instead.  Outgoing calls that are dialing using the overlap method can also generate this kind of channel name.  The called number is not available or not complete when the channel is created.

By: Justin Phelps (ecnf) 2011-08-18 15:53:35.263-0500

Looks like the congestion issue is solved. Do you both work out of the Huntsville office?

By: Richard Mudgett (rmudgett) 2011-08-18 16:10:40.602-0500

Yes.  I also think you should monitor for a few days to be sure.

By: Justin Phelps (ecnf) 2011-08-18 16:45:26.260-0500

I will continue monitoring as you suggest. Monday and Tuesday are our busiest call days (around 4,000 to 5,000 calls each day), so that will be a good stress test.

By: Justin Phelps (ecnf) 2011-08-19 12:18:00.151-0500

I just had a solid lockup on the server. The CLI wasn't responding to any commands.
pri show channels showed channel 1 in Progress with an empty trailing field. Like this:

{noformat}
PRI       B    Chan Call       PRI  Channel
Span Chan Chan Idle Level      Call Name
  1    1 Yes  No   Progress   Yes  
  1    2 Yes  No   Connect    Yes  DAHDI/i1/8506749597-b
{noformat}

Attaching the last 10,000 lines of the full log.

By: Justin Phelps (ecnf) 2011-08-19 12:18:33.630-0500

10,000 lines from the full log after the hard lockup.

By: Justin Phelps (ecnf) 2011-08-19 12:43:44.017-0500

I had to kill Asterisk and restart it to get calls working again.

By: Richard Mudgett (rmudgett) 2011-08-19 13:21:27.732-0500

Debugging deadlocks: Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install).  This will then give you the console command "core show locks." When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt
# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt
gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!

You have come across a deadlock situation that is not related to the original issue.  From your pri show channels output, DAHDI channel 1 at the "Progress" level (I think you meant "Proceeding") cannot create the Asterisk channel (DAHDI/i1/...).

By: Justin Phelps (ecnf) 2011-08-19 16:58:15.114-0500

It must have been Proceeding. I was in a rush to get the phones back on, and misread the output.

Do you want me to open a new ticket with this new issue, or continue posting here?

By: Richard Mudgett (rmudgett) 2011-08-19 17:20:42.575-0500

It really should be a new ticket since it is a different issue with different symptoms.

Deadlocks are best found by the core show locks output.  Backtraces from an unoptimized build could be used, but it is *much* harder to find the deadlock.

By: Justin Phelps (ecnf) 2011-08-20 03:40:03.748-0500

I have created a new issue here: [ASTERISK-18302|https://issues.asterisk.org/jira/browse/ASTERISK-18302].

By: Justin Phelps (ecnf) 2011-08-23 16:44:51.362-0500

I have updated the new issue with the "core show locks" output, and an attempted backtrace.

By: Justin Phelps (ecnf) 2011-08-24 08:22:08.761-0500

Since the congestion issue has been resolved, I suppose we can close this ticket.