ASTERISK-06827: libpri locks up if an external resource-hungry process is run on same system

[Home]

Summary: ASTERISK-06827: libpri locks up if an external resource-hungry process is run on same system

Reporter: Alessandro Polverini (alessandro polverini) Labels:

Date Opened: 2006-04-22 02:43:52 Date Closed: 2007-07-05 09:24:41

Priority: Major Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) zapata.conf
( 1) zaptel.conf

Description: Hello,
I'm using asterisk 1.2.7.1 on Debian Sarge with kernel 2.6.15-1-k7-smp and zaptel 1.2.5.
I have a quad-PRI Digium card with 2 only PRI interfaces linked to a CISCO AS5350.
Everything works fine for a while but after a random time (1 day or a week) the PRI "locks up" and it's impossibile to make and receive calls with this error:
Apr 22 09:20:32 NOTICE[1366]: app_dial.c:1029 dial_exec_full: Unable to create channel of type 'ZAP' (cause 34 - Circuit/channel congestion)
== Everyone is busy/congested at this time (1:0/1/0)
At the same time a lot of other things stops to work like playback of files and conferencing. It seems like a timing issue, like the timer stops working.

****** ADDITIONAL INFORMATION ******

Restarting asterisk is not enough to fix the problem, not even running ztcfg.

I have to :
- stop asterisk
- unload wct4xxp module from the kernel
- reload it
- running ztcfg
- run asterisk

To make it work again.

I had the same problem with other kernel versions since zaptel 1.2.x.
I didn't had the problem with asterisk and zaptel 1.0.x

Comments: By: Alessandro Polverini (alessandro polverini) 2006-04-22 04:15:21

I don't know if that can be a useful information to understand the problem but when the PRI stops working I also receive no more these kinds of messages:
-- B-channel 0/1 successfully restarted on span 1
-- B-channel 0/1 successfully restarted on span 2
[...]

While I receive them regularly when everything works fine.
By: Alessandro Polverini (alessandro polverini) 2006-04-26 02:04:39

The problems now occurs on a regular base, every night.
During the night the pbx handles no calls at all, so the problem could be related in some ways to being idle.
By: Abhishek Tiwari (abhi) 2006-04-26 02:57:34

try using resetinterval=never in your zapata.conf. Please update with the results.
By: Alessandro Polverini (alessandro polverini) 2006-04-27 02:04:52

Yesterday I tried setting resetinterval=never and today morning the result on dialing was that one:
-- Executing Dial("SIP/soa-10-7f5a", "ZAP/r1/3471235444") in new stack
Apr 27 08:06:50 WARNING[13703]: channel.c:787 channel_find_locked: Avoided initial deadlock for '0x81622f0', 10 retries!

without any sign of life (call was mute).
I had to kill -9 asterisk and do the usual (unload wct4xxp, load it and ztcfg) to make it work again.
These are the details of my quad PRI, if useful:
0000:03:04.0 Communication controller: Xilinx Corporation: Unknown device 0314 (rev 01)
Flags: bus master, medium devsel, latency 64, IRQ 201
Memory at feafe800 (32-bit, non-prefetchable) [size=128]

It has no IRQ sharing. Kernel 2.6.15-1-k7-smp with 2 dual core Opterons.
By: opsys (opsys) 2006-04-29 01:38:09

Are you using Queues and/or Agents???
By: Alessandro Polverini (alessandro polverini) 2006-04-29 03:58:47

No queues nor agents used
By: zoa (zoa) 2006-05-01 07:36:33

I have seen this several times too on some of our servers. (But not every night).
Happens too few to actually debug it.
By: opsys (opsys) 2006-05-01 08:09:27

Can you post /etc/zaptel.conf and zapata.conf files?

What switchtype are you useing? Are you using Hangup Causes or early media?

I have not had this problem on any of my machines, but I am US based and that may be the differance.
By: Alessandro Polverini (alessandro polverini) 2006-05-01 08:31:04

I've to PBXs, this one with a TE405P (old model) and another one with a TE210P, I've the problem only on the former.
I'm not using early dials.
Attached switch is a Cisco 2651XM.

I attach both requested file configs

By: Donny Kavanagh (donnyk) 2006-05-01 14:15:27

Are both t1's from the same provider?

Edit: never mind i should read more carefully.

Try running the spans as internally clocked (eg set the timeing on both to 0) or set the timeing on the first span to 1 and the rest to 0.

By: Alessandro Polverini (alessandro polverini) 2006-05-03 04:52:52

Ok I understood the problem.

Every night a script is run that backups the database and in doing so (I suppose) it locks its tables for a couple of minutes (the cdr table is quite big).
Please note that no kind of activity on the PBX is going on during the backup (it's in the night and there are no users using it).

So it seems that this jobs has a negative impact on the handling od the PRI.
I can see these kind of errors during the backup (or maybe just after, I don't know):
May 2 05:05:33 NOTICE[16086]: chan_zap.c:8324 pri_dchannel: PRI got event: HDLC Bad FCS (8) on Primary D-channel o
f span 1
May 2 05:05:33 NOTICE[16086]: chan_zap.c:8324 pri_dchannel: PRI got event: HDLC Bad FCS (8) on Primary D-channel o
f span 1
== Primary D-Channel on span 1 down
May 2 05:05:39 WARNING[16086]: chan_zap.c:2298 pri_find_dchan: No D-channels available! Using Primary channel 16
as D-channel anyway!
== Primary D-Channel on span 2 down
May 2 05:05:45 WARNING[16087]: chan_zap.c:2298 pri_find_dchan: No D-channels available! Using Primary channel 47
as D-channel anyway!

So for now I disabled the backup and the PRI is working fine.

Nevertheless I think this is a bug and should be fixed, do you agree?

Please ask me everything I could provide for better understand the problem.

By: Matthew Fredrickson (mattf) 2006-05-15 09:08:02

This is NOT a bug in libpri. It sounds like a problem with the cdr/cdr_engine that you are using.
By: Serge Vecher (serge-v) 2006-05-15 09:46:50

mattf: can you please switch the projects (I don't have that ability). Thanks.
By: Alessandro Polverini (alessandro polverini) 2006-05-15 15:29:35

Shouldn't libpri able to recover from this kind of happening?
Remember that I had to:
- stop asterisk
- unload kernel module wct4xxp
- load it
- ztcfg
and finally restart asterisk to make the PRI working again.
Simply restarting asterisk or running ztcfg again was completely unuseful.
How could a cdr engine cause such a nasty problem to the PRI interface?
By: Serge Vecher (serge-v) 2006-05-22 15:47:04

Alessandro: which cdr engine are you using?
By: Alessandro Polverini (alessandro polverini) 2006-05-22 16:14:39

Both cdr_odbc (to mysql) and cdr_pgsql.

I use both to have redundancy on the cdrs.
By: davet (davet) 2006-05-24 04:39:28

This sounds like it may be similar to a problem I noticed recently (reported on BUG 7004).

When /var fills up with CDR data (or maybe temporarily during your overnight MySQL dump) unpredictable things start to happen. MySQL appears to block while the disk is full.

Since stopping your MySQL dump "fixed it" I wondered whether your /var is nearly full. I also suspect that the MySQL addon may be making matters worse.
By: Alessandro Polverini (alessandro polverini) 2006-05-24 04:44:30

All my partitions are nearly empty and with a LOT of free space.

I'm not using the mysql-addon for mysql, I'm using the ODBC driver.
By: Kevin P. Fleming (kpfleming) 2006-05-24 10:57:51

I suspect this issue has nothing to do with Asterisk directly, but instead is due to the fact that your database backup process is consuming 100% CPU time during the backup and that is causing the PRI layer 1 protocol to lose frames.

As a test, you can try running Asterisk with realtime priority (the -p switch on the command line) and the database backup at the lowest possible 'nice' level, to attempt to ensure that Asterisk gets adequate CPU time to handle the HDLC messages.

Since there are no calls occurring during this time (you say), this is most definitely not a CDR engine issue at all.
By: Alessandro Polverini (alessandro polverini) 2006-05-24 17:11:36

Asterisk is already executing at real time priority.
The PRI card has it's own IRQ, not shared.

Anyway, even if the PRI loses some frames, shouldn't be able to recover?
By: Serge Vecher (serge-v) 2006-05-31 08:58:31

Allesandro: any difference in operation when 1.2.8 is used?
By: Serge Vecher (serge-v) 2006-06-08 15:41:52

No response from Alessandro. If you can reproduce this with 1.2.9.1, please feel free to reopen the bug. Thanks.
By: Alessandro Polverini (alessandro polverini) 2006-06-12 12:26:00

I've disabled the nightly DB backup but the PRI stops to work after some days, at (seemingly) random intervals.

I can verify the same (bad) behaviour on two different machines.

I'll try the latest version as soon as it will be packaged for Debian (the work is in progress).
By: Serge Vecher (serge-v) 2006-06-12 12:29:20

didn't I ask you to reopen the bug _after_ you confirm this to be a problem in 1.2.9.1?
By: Alessandro Polverini (alessandro polverini) 2006-06-12 12:34:38

Please excuse me, I intended to add a comment leaving the bug closed. My bad.
Please close it again, I seem I don't have the rights to do it.
By: Alessandro Polverini (alessandro polverini) 2006-08-26 01:30:09

I verified the problem with asterisk 1.2.10 and zaptel 1.2.8.
By: jmls (jmls) 2006-11-01 12:48:13.000-0600

is this still an issue with 1.2.13 or trunk ?
By: Alessandro Polverini (alessandro polverini) 2006-11-07 16:55:30.000-0600

Verified the problem with asterisk 1.2.13 and zaptel 1.2.10.

By: Joshua C. Colp (jcolp) 2007-03-08 10:15:38.000-0600

Okay, I'm going to try to get the information we need to try to track this down. Can you please open the main Makefile (assuming 1.2), locate the DEBUG_THREADS line and put in the following one:

DEBUG_THREADS = -DDEBUG_THREADS -DDETECT_DEADLOCKS

Also ensure it is built via make dont-optimize

It would also be very helpful to know whether it would be possible to get access to the machine where this is happening so we can look at it when it occurs.

Thanks!
By: Alessandro Polverini (alessandro polverini) 2007-03-08 10:55:50.000-0600

I'm sorry I'm not using PRI channels any more, I had too many problems :(

Anyway to replicate the problem should be easy enough: the only thing needed is a PRI channel and loggin to mysql on a table with at least one million records (the goal is to make the dump last several minutes).
Then dump the database while asterisk is running.
By: Joshua C. Colp (jcolp) 2007-03-11 20:22:59

Okay - is there anybody else who is having this issue and can do what I asked?
By: Serge Vecher (serge-v) 2007-03-26 12:56:07

cmaj: are you able to provide the output file is looking for?
By: Joshua C. Colp (jcolp) 2007-03-27 15:09:02

Since ASTERISK-9006 seems to be this same issue I'm keeping this open as well... if anyone is having it happen as well, please reply back.
By: cmaj (cmaj) 2007-03-27 15:37:50

serge-v: I'm not having this exact problem where it completely locks up, so I'm not sure what help my logs would be. I have similar problems -- that's why I'm watching this bug.
By: Joshua C. Colp (jcolp) 2007-06-27 15:31:12

Everyone if you are still having this please try 1.2 as of revision 72256 or 1.4 as of revision 72257.
By: Joshua C. Colp (jcolp) 2007-07-05 09:24:37

There has been no reply for a week, and no reply prior to my notes for quite a bit... my CDR commits should fix this, but if not please feel free to reopen it.