ASTERISK-09006: heavy traffic in postgresql cdr database causes PRI errors

Summary: ASTERISK-09006: heavy traffic in postgresql cdr database causes PRI errors

Reporter: Stéphane HENRY (stef)
Labels:
Date Opened: 2007-03-14 09:44:53
Date Closed: 2007-06-27 15:28:25
Priority: Minor
Regression? No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) debug_trace.txt
(1) full_backtrace.txt
(2) messages

Description: I am still experiencing a glare problem. The DEBUG trace is attached.

Thanks

****** ADDITIONAL INFORMATION ******

Software we have :
asterisk version : 1.2.16
libpri version : 1.2.4
zaptel version : 1.2.14

Linux software :
Debian Linux stable version (sarge)
kernel version 2.6.8-1-686-smp
gcc-3.3
libc6-dev 2.3.2.ds1-22

Hardware we have :
TE410P driver used : wct4xxp
Intel(R) Pentium(R) 4 CPU 3.20GHz
MemTotal: 2076376 kB

Comments: By: Serge Vecher (serge-v) 2007-03-14 10:30:21

why did you open a new issue? I didn't give you any indication to do that in 9190. Since you experience this with 1.2.16 or the 1.2.15 with glare patch, the next step is to provide a backtrace. Please post it in 9190.
By: Serge Vecher (serge-v) 2007-03-19 09:44:45

reopening with a new title, since further debugging revealed a different problem exhibiting the symptoms of a closed (fixed) issue.
By: Serge Vecher (serge-v) 2007-03-19 09:48:30

stef, alright, the error messages are different, but I believe this is the same problem as 7011. Can you please produce some output as asked by file in note 0060592 and then email him with details on accessing the machine to further track the problem down (jcolp at digium.com)? Thanks
By: cmaj (cmaj) 2007-03-19 19:04:36

This has been a problem for a long time IMO. That's why I wrote the CDR batch submission patch pre v1.2. Please try setting "batch=yes" in cdr.conf and reporting if this helps make things better. I don't think it will fix it all the way -- only a complete redesign of the hardware of the T1 interface cards will do that.
By: Stéphane HENRY (stef) 2007-03-20 22:02:18

I tested batch=yes in my cdr.conf file, but I lost 80% of the rows; no more rows were inserted for 1 hour... When I hit this problem I was not able to do a 'reload'; * said that a 'reload is already in progress'. Here is a backtrace taken just before I killed asterisk. I'm sorry but I don't have the thread traces; I hope my backtrace will help.

By: Stéphane HENRY (stef) 2007-03-20 22:21:02

I sent files and a description of the problem to Joshua.
By: cmaj (cmaj) 2007-03-21 15:58:30

Is that backtrace correct ?

I do not understand how you lost rows. How do you know this ? Did a call not produce a row ?
By: Stéphane HENRY (stef) 2007-03-21 19:04:21

For one hour I didn't get any CDRs in my database, although I received a lot of phone calls. The settings in cdr.conf were:
batch=yes
size=200
time=300

How did I find out? By chance: calls also go to other * servers with a working CDR log...

I tried 'reload' before killing asterisk, but I got the warning 'reload is already in progress'. I tried 2 reloads about 15 minutes apart and got the same warning both times.
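
For reference, the size and time thresholds in the settings quoted above can be pictured with a small Python sketch. This is not Asterisk's CDR code; the class name `CdrBatch`, the `flush_fn` callback and the injectable `clock` are hypothetical, and the only assumption taken from cdr.conf is that a batch is posted once `size` records or `time` seconds have accumulated.

```python
import time


class CdrBatch:
    """Buffer records; flush when `size` records or `max_age` seconds accumulate."""

    def __init__(self, flush_fn, size=200, max_age=300.0, clock=time.monotonic):
        self.flush_fn = flush_fn   # hypothetical callback that writes one batch
        self.size = size           # plays the role of size= in cdr.conf
        self.max_age = max_age     # plays the role of time= in cdr.conf
        self.clock = clock
        self.pending = []
        self.first_at = None       # arrival time of the oldest buffered record

    def add(self, record):
        if not self.pending:
            self.first_at = self.clock()
        self.pending.append(record)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if either threshold is reached; return True if a flush happened."""
        if not self.pending:
            return False
        too_many = len(self.pending) >= self.size
        too_old = self.clock() - self.first_at >= self.max_age
        if too_many or too_old:
            batch, self.pending = self.pending, []
            self.first_at = None
            self.flush_fn(batch)
            return True
        return False
```

In this sketch nothing reaches the database until one of the two thresholds trips, so if `flush_fn` never returns, no later record is written either.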
By: Clod Patry (junky) 2007-03-23 00:29:03

stef: your backtrace is not readable, please attach a new one.
Thanks.
By: Joshua C. Colp (jcolp) 2007-03-27 15:08:11

Can you clarify what the AGIs are doing? Specifically update_rtm_decrim.agi?
By: cmaj (cmaj) 2007-03-27 15:33:23

I just looked at that attached file again. Zap/100 ? Do you have 2 4-port T1 cards in there ? From my experience, that is a recipe for disaster. Try running "cat /proc/interrupts" from a shell -- I bet you are having conflicts. Then take out one of the T1 cards. Run "cat /proc/interrupts" again to make sure that the card left in there is not sharing interrupts with another device. If that solves your problem, then buy another server and in the future adhere to the "1 Digium T1 card / 1 fast P4 box" rule.
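
The interrupt check described above can also be scripted. The following is a rough Python sketch that scans the text of /proc/interrupts for IRQ lines listing more than one device. The function name `shared_irqs` is hypothetical, and the parsing assumes the usual layout (a header row of CPU names, then one counter per CPU, then the controller type and a comma-separated device list), which may vary between kernel versions.

```python
def shared_irqs(text):
    """Return {irq_number: [device, ...]} for IRQs shared by two or more devices."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue                      # skip NMI, LOC, ERR and similar rows
        fields = rest.split(None, ncpu)   # ncpu counters, then the description
        if len(fields) <= ncpu:
            continue
        devices = [p.strip() for p in fields[ncpu].split(",")]
        if len(devices) < 2:
            continue                      # only one device on this IRQ
        # The first chunk still carries the controller type; keep its last word.
        devices[0] = devices[0].split()[-1]
        shared[int(label)] = devices
    return shared
```

On the machine itself it would be called as `shared_irqs(open("/proc/interrupts").read())`; an empty result suggests no IRQ line is shared, and an entry naming wct4xxp alongside another driver is the kind of conflict to look for.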

Also, I still don't believe the CDR batch submit problems are there, but I digress...
By: Stéphane HENRY (stef) 2007-03-29 10:53:39

update_rtm_decrim.agi is used to update data (write queries) in a postgres database.

No, I've already checked: I have no interrupt conflicts with my 2 TE410P cards in the same box.

I will upload a new backtrace if I get the same problem.
By: Joshua C. Colp (jcolp) 2007-03-29 11:06:47

Is it possible to gain access to this machine? I definitely want to take a look first hand.
By: Stéphane HENRY (stef) 2007-03-29 15:00:51

Here is my new backtrace. In my debug trace, I can see errors like :
Mar 29 15:21:41 DEBUG[28167] chan_zap.c: Ring requested on channel 0/22 already in use or previously requested on span 5. Attempting to renegotiating channel.

asterisk doesn't crash, but it doesn't take any more calls until I restart it. This happens only when the database is overloaded. This is a backtrace of the running asterisk process, version 1.2.17. The problem is also present in 1.4.*
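
To see how often the message quoted above appears in a debug log, and on which spans, something like the following Python sketch could be used. The regular expression is built only from the single log line shown in this report; the names `GLARE_RE` and `count_glare_by_span` are hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the one chan_zap.c DEBUG line quoted in this report.
GLARE_RE = re.compile(
    r"Ring requested on channel (\d+)/(\d+) already in use "
    r"or previously requested on span (\d+)"
)


def count_glare_by_span(lines):
    """Count matching log lines, keyed by span number."""
    counts = Counter()
    for line in lines:
        match = GLARE_RE.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts
```

Passing an open log file to `count_glare_by_span` works, since a file object yields its lines one at a time.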

By: Serge Vecher (serge-v) 2007-03-29 15:02:56

stef: if it is possible to get access to this system, please email the details to file -- jcolp at digium.com
By: Stéphane HENRY (stef) 2007-03-29 15:51:52

I've sent email to Joshua.
By: fugitivo (fugitivo) 2007-06-07 13:31:22

I'll add my comment on this bug. Using batch support, I SOMETIMES have the same problem as the reporter. No calls are submitted to the CDR backend and I have to stop and start asterisk to make it work.

batch=yes
size=200
time=200
scheduleronly=no
safeshutdown=yes

Maybe we should open another bug for batch support?
By: Joshua C. Colp (jcolp) 2007-06-27 15:28:24

Fixed in 1.2 as of revision 72256, 1.4 as of revision 72257, and trunk as of revision 72258. Fixed to the extent that a channel will no longer stick around waiting on the database to post the CDRs. As for the batch support, feel free to open an issue.
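
The general idea of not letting the call path wait on the database can be illustrated in Python. This is not the change made in the revisions above, only a sketch of the pattern: the hangup path hands the record to a queue and returns, and a background thread does the slow write. `CdrPoster` and `post_fn` are hypothetical names.

```python
import queue
import threading


class CdrPoster:
    """Hand records to a worker thread so the caller never waits on the database."""

    def __init__(self, post_fn):
        self._post_fn = post_fn            # hypothetical function that writes one record
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, record):
        """Called from the call path at hangup; returns immediately."""
        self._q.put(record)

    def _run(self):
        while True:
            record = self._q.get()
            try:
                if record is None:         # sentinel pushed by close()
                    return
                self._post_fn(record)
            except Exception:
                pass                       # a real backend would log and retry here
            finally:
                self._q.task_done()

    def close(self):
        """Post everything still queued, then stop the worker."""
        self._q.put(None)
        self._worker.join()
```

With this shape a slow or overloaded database delays the CDR rows but not the channels; the trade-off is that records waiting in the queue are lost if the process is killed before they are posted.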