Summary: ASTERISK-18742: PRI Span: 1 !! Unknown IE 128 (cs0)
Reporter: Stephen H. Gerstacker (sgerstacker)
Labels:
Date Opened: 2011-10-21 08:15:46
Date Closed: 2012-04-06 18:05:48
Versions: 1.8.3 1.8.4
Frequency of
is related to ASTERISK-00269 dahdi timing issues with recent kernels and intel_idle driver
Environment: Ubuntu 10.04 64-bit, Asterisk from official repository, DAHDI 2.4.1, Digium Wildcard TE110P T1/E1 Card
Attachments:
( 0) chan_dahdi.conf
( 1) dahdi_test.txt
( 2) patlooptest.new.log
( 3) pattest.log
( 4) pri.txt
( 5) system.conf
Description: When attempting to make outbound calls, I sometimes (about 1 in 10 times) get the following:

PRI Span: 1 !! Unknown IE 128 (cs0)
   -- Span 1: Channel 0/1 got hangup, cause 25

The hangup cause changes.  There never seems to be a pattern to it.  If you attempt the same call immediately afterwards, it will go through successfully.

The card in question was successfully used in an Asterisk 1.2 box for 4+ years with no problems.  The move to 1.8 started this.  

All IRQ conflicts have been fixed.  The phone company tested the line and found nothing wrong.  They also traced a call that failed in this manner, and they say they cannot see it.  They only see the successful attempt after the failure.
Comments:
By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 08:17:59.207-0500


By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 08:18:13.930-0500


By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 08:19:23.041-0500

I'll also add that I've been seeing this:

 == Primary D-Channel on span 1 down
[Oct 21 09:22:24] WARNING[31841]: sig_pri.c:1054 pri_find_dchan: Span 1: No D-channels available!  Using Primary channel as D-channel anyway!
 == Primary D-Channel on span 1 up

It happens quickly and everything is okay afterwards.

By: Richard Mudgett (rmudgett) 2011-10-21 12:25:32.916-0500

It would be interesting to see a "pri set debug 2 span 1" capture of the failing call.
The reason I am asking for an intense debug capture is because a hex dump of the packets is output.

By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 13:33:15.090-0500

Log from pri set debug 2 span 1

It was running for a bit before I got a failure, but the last two calls were a successful one to my cell, followed by a failure.

By: Richard Mudgett (rmudgett) 2011-10-21 14:04:39.517-0500

What is the libpri version?

By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 14:08:39.502-0500

1.4.12 now.  I had this problem with the stock Ubuntu version, which I believe was 1.4.10. I've also built packages from source for 1.4.10, 1.4.11 & 1.4.12, and all of them exhibit this problem.

By: Richard Mudgett (rmudgett) 2011-10-21 14:36:49.898-0500

The network is resetting the link with a SABME when the failed call is going out.  Resetting the link is why the call fails.

Why does the network think it needs to reset the link?
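
To check how often the link is being reset, the debug capture can be searched for SABME frames. The following is only a rough sketch: it assumes the capture is plain text and that a reset shows up as a line containing the string "SABME". The function name and the sample lines are made up for illustration and are not real libpri output.

```python
def find_link_resets(log_lines):
    """Return (line_number, text) for every line that mentions SABME.

    Assumption: a link reset appears in the capture as a line
    containing the string "SABME" (matched case-insensitively).
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if "SABME" in line.upper():
            hits.append((number, line.rstrip()))
    return hits


# Hypothetical sample lines, not copied from a real capture.
sample = [
    "< [ 00 01 7f ]",
    "< Unnumbered frame: SABME",
    "> Informational frame",
]

for number, text in find_link_resets(sample):
    print(number, text)
```

Comparing the timestamps around each hit with the times of the failed calls would show whether every failure lines up with a reset.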

By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 14:45:56.660-0500

To that I cannot say.  I'm just a lowly software developer that happens to be the only tech guy for the company.  I can't "speak the language", which has made this process a lot harder.  I've had a previous thread on the mailing list trying to address this issue, to no avail. (http://lists.digium.com/pipermail/asterisk-users/2011-September/266172.html)

That sent me to the phone company who says nothing is wrong, which finally sent me here.

By: Shaun Ruffell (sruffell) 2011-10-21 15:20:09.574-0500

What is the output of:

dahdi_maint -s 1


head /proc/dahdi/1

Also, is there anything in dmesg from the wcte12xp driver that correlates with the times that you experience failures?

By: Shaun Ruffell (sruffell) 2011-10-21 15:24:33.843-0500

I also noticed that you are using a TE110.  Did you move this card into a new server by any chance or is the hardware all the same and you only updated the software?

By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 15:41:56.214-0500

It's an old card moved to a new server.  We went from a very old server with this card and Asterisk 1.2 to a new server with Asterisk 1.8.

The old server was Ubuntu 8.04 with a hand-compiled Asterisk 1.2, which would also make it libpri 1.2, IIRC.

The new server was initially all Ubuntu 10.04 packages and Asterisk packages from the official repo.  I've since found some newer libpri debs I compiled from source, just to eliminate libpri from the equation.

dahdi_maint isn't in the dahdi packages, so I'll need to find that... more to come.

By: Shaun Ruffell (sruffell) 2011-10-21 15:46:48.461-0500

And are you seeing IRQ misses increase in /proc/dahdi/1? One theory is that your new server might be incompatible with the TE110P card and glitches are generating invalid HDLC messages on the span, which prompts the remote side to reset you.

Also dahdi_maint -s 1 won't give you any information from the TE110P card, so no need to worry about that.

If you have a loopback plug, you could try running patgen / pattest on the server at the same time you try to do other things...and see if that comes up clean.
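
One way to tell whether IRQ misses are increasing is to read /proc/dahdi/1 before and after a failed call and compare the counter. This is a minimal sketch, assuming the file contains a line of the form "IRQ misses: N". The helper names and the sample text are hypothetical.

```python
import re

# Assumption: the span file contains a line such as "IRQ misses: 2".
IRQ_MISSES = re.compile(r"IRQ misses:\s*(\d+)")


def parse_irq_misses(proc_text):
    """Return the IRQ miss count found in the text, or None if absent."""
    match = IRQ_MISSES.search(proc_text)
    return int(match.group(1)) if match else None


def misses_increased(before_text, after_text):
    """Return True/False if both counts were found, otherwise None."""
    before = parse_irq_misses(before_text)
    after = parse_irq_misses(after_text)
    if before is None or after is None:
        return None
    return after > before


# Hypothetical snapshots taken before and after a failed call.
before = "Span 1: example span\n\tIRQ misses: 2\n"
after = "Span 1: example span\n\tIRQ misses: 5\n"
print(misses_increased(before, after))
```

On a live system the two snapshots would come from reading /proc/dahdi/1 at the two points in time, for example with open("/proc/dahdi/1").read().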

By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 15:54:21.671-0500

I've got 2 IRQ misses on a 22 day uptime:

Digium Wildcard TE110P T1/E1 Card 0      OK      2      0      0      ESF B8ZS          0 db (CSU)/0-133 feet (DSX-1)
Wildcard AEX410 Board 1                  OK      1      0      0      CAS Unk           0 db (CSU)/0-133 feet (DSX-1)

I made a loopback plug a while back... I'll see if I can dig it up.

By: Stephen H. Gerstacker (sgerstacker) 2011-10-21 16:02:42.834-0500

The system is a Dell PowerEdge 840 with the latest BIOS: A08

By: Shaun Ruffell (sruffell) 2011-10-22 15:20:46.969-0500

Hmm...only 2 IRQ misses. That certainly wouldn't fit my hypothesis but I'm not very familiar with the TE110P.  I'll wait to hear about any results with using the loopback plug.

By: Stephen H. Gerstacker (sgerstacker) 2011-10-24 17:34:05.473-0500

I could not get patgen and pattest to work. Since I am running both on the same system and on the same channel, whichever one starts first takes over the device.  sudo patgen /dev/dahdi/1 locks the device file, so I can't run sudo pattest /dev/dahdi/1 and vice versa.

That being said, I ran dahdi_test and got numbers that seem lower than acceptable from the man page.  Could this be the problem?

By: Shaun Ruffell (sruffell) 2011-11-01 10:56:10.086-0500

Stephen, patlooptest should work with the loopback plug.

The low numbers from dahdi_test could be related....if it's a problem with the card servicing its interrupt in time, which is what we're trying to establish.

By: Stephen H. Gerstacker (sgerstacker) 2011-11-01 17:11:11.111-0500

Well, this output doesn't bode well I think...

The result of sudo ./patlooptest /dev/dahdi/1 -vvvvv -t 30

This did not increase the IRQ misses, which are still at 2.

By: Shaun Ruffell (sruffell) 2011-11-01 17:53:07.630-0500

Do you get the same thing when you run like:

sudo chrt -f 99 ./patlooptest /dev/dahdi/1 -vvvv -t 30

This will make sure that the problem is not scheduling.
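
For anyone scripting repeated runs, the chrt invocation can be built programmatically. This is a small sketch and the helper name is made up. It only assembles the argument list for "chrt -f", which runs a command under the SCHED_FIFO real-time policy (valid priorities on Linux are 1 to 99). It does not execute anything itself.

```python
def realtime_command(command, priority=99):
    """Prefix a command with 'chrt -f <priority>' (SCHED_FIFO)."""
    if not 1 <= priority <= 99:
        raise ValueError("SCHED_FIFO priority must be between 1 and 99")
    return ["chrt", "-f", str(priority)] + list(command)


# The same test run as in the comment above, as an argument list.
argv = realtime_command(["./patlooptest", "/dev/dahdi/1", "-vvvv", "-t", "30"])
print(" ".join(argv))
```

The resulting list could be passed to subprocess.run with "sudo" prepended, since setting a real-time priority normally needs root.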

However, you're right...it doesn't bode well.  I'm not very familiar with the wcte11xp driver so I'm not sure why the IRQ misses count isn't increasing.  But if the board isn't handling its interrupt in a timely manner, I can see how misframed HDLC messages could be generated, which would cause the reset.

By: Stephen H. Gerstacker (sgerstacker) 2011-11-02 17:12:28.242-0500

Looks like more of the same...

Still only showing 2 IRQ misses.

By: Shaun Ruffell (sruffell) 2011-11-02 19:15:21.740-0500

My best guess is for you to try another server if there is no BIOS update available for this one. The patlooptest results are not showing complete drops, which would appear as missing blocks of 8 or more bytes. Instead, single bytes are replaced with 0xff, which could be because patlooptest is not being scheduled in a timely manner...or because of a general problem on the PCI bus performing the DMA operations.  Most problems with patlooptest being scheduled quickly enough are exposed by running it under chrt.
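
The distinction between isolated bad bytes and dropped blocks can be illustrated with a short sketch. It assumes the expected and received byte streams are available side by side and stay aligned, meaning bad bytes are replaced rather than removed. This is not how patlooptest itself is implemented. The function names are hypothetical, and the threshold of 8 bytes is taken from the comment above.

```python
def mismatch_runs(expected, received):
    """Return (start_index, length) for each run of mismatching bytes."""
    runs = []
    start = None
    for i, (e, r) in enumerate(zip(expected, received)):
        if e != r:
            if start is None:
                start = i  # a new run of bad bytes begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        # the run continues to the end of the compared data
        runs.append((start, min(len(expected), len(received)) - start))
    return runs


def classify(run_length, block_threshold=8):
    """Label a run as a dropped block or an isolated glitch."""
    return "block drop" if run_length >= block_threshold else "isolated glitch"


# Hypothetical data: one byte replaced with 0xff and one 10-byte gap.
expected = bytes(range(32))
received = bytearray(expected)
received[5] = 0xFF
received[16:26] = b"\x00" * 10

for start, length in mismatch_runs(expected, received):
    print(start, length, classify(length))
```

Results like those described in this ticket would show up only as runs of length 1.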

Also you might want to call tech support and see if you can get credit for your TE11XP if you upgrade to a TE12XP.

By: Stephen H. Gerstacker (sgerstacker) 2011-11-03 08:13:36.387-0500

The server is already on the latest BIOS.  I'll have to test a card in a different server to rule out a card issue, but that will take a couple of days.

It seems as if this is coming down to a problem with the server hardware itself.  Is there any way to avoid this on a future server purchase?

By: Shaun Ruffell (sruffell) 2011-11-04 09:52:18.505-0500

Not that I'm really aware of. My understanding for the reason the PCI interface on the TE110P was changed in the TE122 was due to server incompatibilities but those decisions all predate my involvement with DAHDI.

I'm also not aware of any server incompatibilities with the TE122 (as long as the BIOS is completely updated). I wish I had better news for you...

By: Stephen H. Gerstacker (sgerstacker) 2011-12-08 08:36:08.607-0600

Just as a follow up, we purchased a TE122PF card.  I put it in the server last night and so far, so good.  It got through a night without any errors and we haven't had a congestion yet.

By: Shaun Ruffell (sruffell) 2012-04-06 18:05:48.479-0500

I linked this to DAHLIN-269 since it might be fundamentally related, but still there is probably not much else to do here.  Feel free to reopen if I'm mistaken.