ASTERISK-08038: SIP module deadlocks when PostgreSQL cdr db is overloaded

Summary: ASTERISK-08038: SIP module deadlocks when PostgreSQL cdr db is overloaded

Reporter: fugitivo (fugitivo) Labels:

Date Opened: 2006-11-01 08:27:57.000-0600 Date Closed: 2007-10-18 01:45:33

Priority: Blocker Regression? No

Status: Closed/Complete Components: Core/Configuration

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) cli_debug.txt.gz

Description: I'm experiencing random freezes for the SIP module. A sip reload doesn't do anything and the only way to "unlock" it is restarting asterisk.
Current calls keep working, but phones are losing their registrations and can't register with asterisk again until it is restarted.
Attached you will find a FULL sip debug trace.

****** ADDITIONAL INFORMATION ******

Server is a Dual XEON 3.2 with 1gb RAM.

Comments: By: fugitivo (fugitivo) 2006-11-01 10:32:55.000-0600

I downgraded to 1.2.8 and it seems to work OK with more load than before:

System uptime: 2 hours, 26 minutes, 54 seconds
By: Francesco Romano (francesco_r) 2006-11-02 10:29:51.000-0600

This is the same problem I reported one week ago; see ASTERISK-7983.
By: Anthony LaMantia (alamantia) 2006-11-03 11:13:36.000-0600

Closing this one and keeping 8204 open to avoid duplicate issues.
By: fugitivo (fugitivo) 2006-12-05 15:25:38.000-0600

I think this issue should be open. It doesn't seem to be the same as bug 0008204. No agi or deadagi was used, and migrating from sjphone to eyebeam fixed the problem.
By: Serge Vecher (serge-v) 2006-12-05 15:29:34.000-0600

fugitivo: is there a way to reliably reproduce this problem, say with using SJPHONE and a specific asterisk version? Just trying to corner this bug in ...
By: fugitivo (fugitivo) 2006-12-05 16:09:45.000-0600

With version 1.2.12.1 and 30+ extensions using Sjphone, the sip module freezes.
With 1.2.8 I see a different weird behavior. I call AgentMonitorOutgoing(c) before dialing any trunk, and "show channels" in the asterisk CLI lists a lot of channels still running that function.
The bug seems to be the same, but the result isn't. With 1.2.8 you see a lot of AgentMonitorOutgoing(c) entries in show channels and asterisk must be restarted; with 1.2.12.1 the sip module just freezes and asterisk must also be restarted.
The bug seems to need a lot of sip activity. It happened at a call center with a lot of incoming and outgoing calls (using zap channels, but the extensions were SIP sjphones).
I have another company with sjphones but much lower call traffic, and this bug is not happening there.
By: Serge Vecher (serge-v) 2006-12-05 16:13:45.000-0600

ok, what about 1.2.13 and or latest 1.2 svn?
By: fugitivo (fugitivo) 2006-12-05 16:22:30.000-0600

1.2.13 is the reported version and the first one I tried. I downgraded to 1.2.12.1 and 1.2.8 and the same thing happened.
By: Olle Johansson (oej) 2006-12-06 10:28:55.000-0600

Do you have call queues and AMI connections to handle them?
By: fugitivo (fugitivo) 2006-12-06 10:37:01.000-0600

I have call queues. What do you mean by AMI connections?
By: Serge Vecher (serge-v) 2006-12-06 11:42:07.000-0600

do you use any AMI actions to manage those queues, say "Redirect" or "Originate"?
By: fugitivo (fugitivo) 2006-12-06 12:13:25.000-0600

Yes, I'm using manager commands, Originate calls and Redirect to agents.
By: Olle Johansson (oej) 2006-12-06 13:56:11.000-0600

Ok, then we have plenty of other bug reports on the issue. This is a duplicate, and it's not certain that it's a SIP bug.
By: fugitivo (fugitivo) 2006-12-29 13:43:03.000-0600

Changed again to version 1.2.14. The sip module still freezes sometimes, though not as often as before. Redirecting calls to agents with manager commands is slow: the call reaches the agent quickly, but the audio does not.
By: Serge Vecher (serge-v) 2007-01-02 09:52:00.000-0600

Can you please produce a new log as per following instructions from 1.2.14?

1) Prepare test environment (reduce the amount of unrelated traffic on the server);
2) Make sure your logger.conf has the following line:
console => notice,warning,error,debug
3) restart Asterisk with the following command:
'asterisk -Tvvvvvdddddngc | tee /tmp/verbosedebug.txt'
4) Enable SIP transaction logging with the following CLI commands:
set debug 4
set verbose 4
sip debug
5) Trim startup information and attach verbosedebug.txt to the issue.
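Step 2 above can be checked before restarting. The following is a minimal sketch, not an Asterisk tool: the helper names are hypothetical, and it assumes logger.conf uses the usual "name => level,level" lines with ';' starting a comment.

```python
# Levels requested in step 2 of the instructions above.
REQUIRED_LEVELS = {"notice", "warning", "error", "debug"}


def console_levels(logger_conf_text):
    """Collect the log levels assigned to 'console' in logger.conf text."""
    levels = set()
    for raw in logger_conf_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop ';' comments
        if "=>" not in line:
            continue  # section headers, blank lines, other settings
        name, value = line.split("=>", 1)
        if name.strip() == "console":
            levels.update(p.strip() for p in value.split(",") if p.strip())
    return levels


def missing_console_levels(logger_conf_text):
    """Return the required levels that the console line does not enable."""
    return REQUIRED_LEVELS - console_levels(logger_conf_text)


sample = "[logfiles]\nconsole => notice,warning,error ; no debug yet\n"
missing = missing_console_levels(sample)
```

An empty result means the console line already matches what the instructions ask for.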
By: Olle Johansson (oej) 2007-02-15 11:48:10.000-0600

fugitivo: We need feedback from you.
By: fugitivo (fugitivo) 2007-02-15 16:15:33.000-0600

Ok, sorry, I was following another post. I have a clue about this problem: if the database (PostgreSQL 8.x) is overloaded, sip freezes. The other day I was reindexing a table and sip froze. If I stop the database, sip freezes too.
By: Olle Johansson (oej) 2007-02-16 05:31:38.000-0600

Seems to be a problem with the realtime driver, not the SIP channel.
By: fugitivo (fugitivo) 2007-02-16 05:49:13.000-0600

I'm not using realtime. cdr_pgsql.so seems to be the problem, and the only result is a sip module freeze; everything else keeps working.
By: Serge Vecher (serge-v) 2007-02-19 09:48:51.000-0600

fugitivo: thanks for clarification, we still need to see the debug output as per note 0057080. After step 4, start reindexing the database (or whatever you do to reproduce the problem).
By: Olle Johansson (oej) 2007-02-19 11:54:25.000-0600

Please test with and without cdr batch support in cdr.conf.

This is one of the reasons why we implemented that feature.
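Why batching might help here can be illustrated with a small sketch. This is not Asterisk's implementation: the class and function names are hypothetical, and a sleep stands in for an overloaded PostgreSQL server. It only shows the general idea that a thread which writes the record itself waits on the database, while a thread that enqueues the record does not.

```python
import queue
import threading
import time


def slow_db_write(record, delay=0.2):
    """Stand-in for an overloaded database: each write blocks for `delay` s."""
    time.sleep(delay)
    return record


def post_cdr_inline(record, delay=0.2):
    """Unbatched: the thread that ends the call also waits for the database."""
    start = time.monotonic()
    slow_db_write(record, delay)
    return time.monotonic() - start


class BatchedCdrWriter:
    """Batched: callers only enqueue; a background thread does the writes."""

    def __init__(self, delay=0.2):
        self._pending = queue.Queue()
        self._delay = delay
        self.written = []
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def post(self, record):
        start = time.monotonic()
        self._pending.put(record)  # returns without a database round trip
        return time.monotonic() - start

    def _drain(self):
        while True:
            record = self._pending.get()
            if record is None:  # sentinel: stop the worker
                break
            self.written.append(slow_db_write(record, self._delay))

    def close(self):
        """Write everything still queued, then stop the worker."""
        self._pending.put(None)
        self._worker.join()


inline_wait = post_cdr_inline({"call": 1})
writer = BatchedCdrWriter()
queued_wait = writer.post({"call": 2})
writer.close()
```

In the sketch the inline caller is stalled for the full write delay, while the queued caller returns almost immediately. The trade-off is that queued records live only in memory until the worker writes them.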
By: fugitivo (fugitivo) 2007-02-19 12:04:41.000-0600

Ok, batch support is scary, I can't lose any data. But I'll try it after hours...
By: Serge Vecher (serge-v) 2007-03-06 09:56:42.000-0600

any luck, fugitivo?
By: fugitivo (fugitivo) 2007-03-18 12:59:23

I'm testing batch support, I'll give you more news next week.
Please modify the summary, this bug is not realtime related. I'm not using realtime at all.
By: Serge Vecher (serge-v) 2007-03-30 09:48:38

linking bugs on the premise that problems in cdr_pgsql module cause deadlocks in another module.

fugitivo: file has requested to gain access to affected systems. If you are able to provide it, please find file on #asterisk-bugs IRC channel or send information by email. Thanks.
By: fugitivo (fugitivo) 2007-04-25 12:41:32

People,

It seems that using batch solved the problem. Uptime is 5 days and 16 hours with really heavy traffic and manager commands. My cdr.conf looks like:

batch=yes
size=200
time=300
safeshutdown=yes

Please don't close this issue yet. I'll give more feedback next week.
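As a rough reading of the size and time values above, assuming size is the maximum number of buffered records and time is the maximum number of seconds before a batch is posted, the flush rule and the number of records held in memory can be sketched as follows. The function names and the call rates are hypothetical and are not taken from this report.

```python
def should_flush(pending_records, seconds_since_last_flush,
                 size=200, time_limit=300):
    """Flush when either the record count or the age limit is reached."""
    return pending_records >= size or seconds_since_last_flush >= time_limit


def max_buffered_records(calls_per_minute, size=200, time_limit=300):
    """Rough upper bound on records sitting in memory just before a flush."""
    by_time = calls_per_minute * time_limit / 60
    return min(size, by_time)


# Hypothetical call rates, for illustration only.
quiet_site = max_buffered_records(calls_per_minute=10)
busy_site = max_buffered_records(calls_per_minute=100)
```

Under these assumptions a quiet site is bounded by the time limit and a busy one by the size limit. That bound is the amount of data at risk if the process dies between flushes, which is the concern raised earlier about batch mode.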
By: Craig Z (cazimmy) 2007-05-26 14:14:28

fugitivo - I seem to be having the same or a very similar issue, but being somewhat new to this (only about 6 months in), I don't understand this entire thread. We are using Trixbox and also have random freezes. When the system freezes it does exactly as you describe, except that we can't access the box at all, not even through the console. It takes a cold reboot.

Can you provide some insight? I'm happy to take it off line from here.

Thanks.
Craig
By: Russell Bryant (russell) 2007-06-07 13:18:40

Since there have been no more comments from fugitivo since April, I'm going to assume this is no longer a problem. If there is still a problem, feel free to reopen this issue so that we can start getting some information on the deadlock using gdb. Thanks!
By: fugitivo (fugitivo) 2007-06-07 13:26:05

Russell, I'm still having the problem. I can't find any way to reproduce it; it just happens randomly.
Please tell me how I should proceed to get you more information.
Thank you.
By: Russell Bryant (russell) 2007-06-07 13:41:35

First, upgrade to the latest 1.2 code from svn.

Then, build Asterisk with -DDEBUG_THREADS support by uncommenting it in the main Makefile. Then, install asterisk with "make dont-optimize".

When the deadlock occurs, grab a backtrace using the ast_grab_core script available in the contrib/scripts/ directory.

Also, please keep the core file on your machine just in case a developer needs to log in and look at it directly. Deadlocks are the hardest thing to debug without direct access to the machine to analyze the core dump.
By: fugitivo (fugitivo) 2007-06-07 13:54:38

russell, ok. Give me some time because this is a production system. I'm also thinking of changing the hardware to rule out a hardware-related problem.
By: Joshua C. Colp (jcolp) 2007-06-27 15:29:33

I just put in a CDR change in 1.2 as of revision 72256 that may solve this, can you please try it?
By: Joshua C. Colp (jcolp) 2007-07-05 08:56:24

It's been a week now with no positive/negative responses, if this is still an issue please feel free to reopen.
By: fugitivo (fugitivo) 2007-07-20 17:29:04

file, is this fix included in 1.2.21.1? If yes, it didn't solve my problem. I'm still having these SIP locks and asterisk needs a restart to fix them.
By: Joshua C. Colp (jcolp) 2007-07-23 08:53:48

Yes it was, so we need to look at this again. Attach all relevant information... configuration, console output, show channel, sip show channel, everything.
By: fugitivo (fugitivo) 2007-07-23 09:02:54

file, I think heavy traffic in postgresql is not the problem here. Last time this problem happened, I checked all locks and there wasn't any lock at that time. Honestly I don't know where to look to reproduce this problem.
I'll try removing a digium card te100p that is giving me some problems.
By: Eliel Sardanons (eliel) 2007-07-26 13:54:17

We are facing the same problem. Our first thought was that chan_sip was freezing because of DNS resolution, but we replaced every sip registration host with its IP address and it continues to happen under heavy traffic. We have many clients running asterisk 1.2.22 and it only happens with the 'callcenter' clients (heavy queue traffic and call-limit=1). We made a change in update_call_counter() to reproduce the limitonpeers parameter in 1.4:

---- update_call_counter() ----
- if (!outgoing && (u = find_user(name, 1))) {
+ if (0 && !outgoing && (u = find_user(name, 1))) {
-------------------------------

This "patch" is working fine, but at some point chan_sip locks and we can't register or generate new calls (in chan_sip). All the currently ongoing calls continue normally until someone hangs up.
chan_zap and other modules work without problems. If we run "reload chan_sip.so" it locks again, and if we repeat the command it says that a reload is already in progress.
'fugitivo', please tell me if you are running queue members with call-limit.
By: Eliel Sardanons (eliel) 2007-07-26 15:46:14

I notice in your cli_debug.txt.gz that you are using call-limit too, so maybe that is where we need to look. Tomorrow I will be at a client that has a chan_sip.so lock every day; I will wait for that situation and attach to the asterisk process while the condition occurs.
By: Tilghman Lesher (tilghman) 2007-08-29 14:07:41

1.2 development was shut down on August 1st. Can you confirm if this is still a problem in 1.4 (preferably in 1.4.11)?