Summary: ASTERISK-08925: Channel/Thread deadlock with heavy in-bound traffic on PRI - GLARE!
Reporter: martin cabrera (galeras)
Labels:
Date Opened: 2007-03-02 10:39:10.000-0600
Date Closed: 2011-06-07 14:00:25
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Channels/chan_zap
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) backtrace_core_1.4.txt
             (1) backtrace_full.txt
             (2) backtrace_running_process_1.4.txt
             (3) backtrace.txt
             (4) full.crash
Description: We have a call-center type installation with 20 agents and one queue. Our system has a TE412P card. We use 3 E1s for inbound calls (about 800 calls per hour at peak) and 1 E1 for outbound calls (fewer than 20 per hour).

It seems we have the same symptoms described in bug 0008957. We get this kind of warning during the Asterisk crash:

"warning  Ring requested on channel 0/24 already in use or previously requested on span 4.  Attempting to renegotiating channel."

We have applied the patch for bug 0008957; however, we are still experiencing about 7 crashes per day.


****** ADDITIONAL INFORMATION ******

We are using agentlogin for our agents.
Comments:
By: Serge Vecher (serge-v) 2007-03-02 11:09:17.000-0600

1. Please read the bug guidelines; you need to produce a backtrace from a non-optimized build for crash-related bugs.
2. Instead of applying the patches, can you please check out 1.2 from svn?
3. Are there other modifications you have done to the code?

By: martin cabrera (galeras) 2007-03-02 16:26:12.000-0600

Serge, I will send you a backtrace generated by an Asterisk built from an unmodified asterisk-1.2.15 tarball. I can't use svn because of proxy restrictions (I'm sorry). Is that OK?

Thanks.



By: Serge Vecher (serge-v) 2007-03-05 08:43:39.000-0600

No, use 1.2.16 from the tarballs; it has just been released.

By: martin cabrera (galeras) 2007-03-06 14:25:53.000-0600

I have found a workaround. First, let me describe the issue:

- Our box has 4 E1s (a TDM412 Digium card).
- We use 3 E1s for intensive inbound traffic and 1 E1 for a few outbound calls.
- We have one queue and 23 agents. All agents are available via agentlogin.
- All agents are using Eyebeam.

The Asterisk crash happens in this situation:

- There is heavy inbound traffic; almost all of the 90 inbound channels are busy.
- The agent is active on Eyebeam's line 1 (listening to MOH until a call arrives).
- The agent receives a call and needs to transfer that call to an outbound number.
- The agent activates a second line on the softphone, dials the number, and transfers the call using phone keys, i.e. Control-T, then clicking line 1 and line 2 and pressing Enter.
- After this, some of the current calls end normally, but most of them are hung up, agents are disconnected and are unable to log in again; the only way to recover functionality is to kill and restart Asterisk.
- The log file shows a lot of messages like this:
 "Ring requested on channel 0/11 already in use on span 1. Hanging up owner"

WORKAROUND: When the agent needs to transfer a call to an outbound route, proceed this way:

- Park the current call.
- Log off as an agent (agent logoff).
- On line 1, dial the outbound number.
- On line 2, retrieve the parked call.
- Do the transfer the same way: Control-T, click line 1 and line 2, and press Enter.
- Log in again to continue receiving calls.

Doing the transfer this way, Asterisk doesn't crash.

I feel this is not a PRI-glare case, because we are not mixing inbound and outbound calls in the same span; I think it is some issue related to the channel status when a call is transferred from an active agent line.

Serge, with this workaround in place, I prefer to stop further tests on the box, at least for this week, because this call center has had many crashes in the last week and my client is not happy.

However, if someone is interested, I can try to generate a backtrace next week. I'm sure I can reproduce the issue.

Thanks a lot for all your cooperation.



By: Serge Vecher (serge-v) 2007-03-06 14:47:28.000-0600

Did you use 1.2.16 for the tests?

By: martin cabrera (galeras) 2007-03-07 09:40:24.000-0600

Tests were done with Asterisk 1.2.15 and Zaptel 1.2.14. I couldn't use Asterisk 1.2.16, because I got fatal errors trying to run Zaptel 1.2.15 (I have CentOS). I didn't try Asterisk 1.2.16 with Zaptel 1.2.14 (I didn't have much time, I'm sorry).



By: Stéphane HENRY (stef) 2007-03-13 12:04:23

serge-v, I would like to know if there is a difference between 1.2.15 + the patch from 0008957 and version 1.2.16 for this glare problem (chan_zap.c)?

By: Serge Vecher (serge-v) 2007-03-13 12:08:37

For the glare problem, I would think there is no difference.

By: Stéphane HENRY (stef) 2007-03-13 18:29:04

serge-v, I would like to know if there is a difference between 1.2.15 + the patch from 0008957 and version 1.2.16 for this glare problem (chan_zap.c)?

By: Serge Vecher (serge-v) 2007-03-14 08:09:11

stef, as I said earlier, I think there is no difference.

By: Stéphane HENRY (stef) 2007-03-14 09:41:35

OK, thanks. I am still experiencing some glare problems, so I will open another issue.

By: Serge Vecher (serge-v) 2007-03-14 10:32:26

Don't open duplicate issues for the same problem. Please compile Asterisk with 'make dont-optimize' and provide a backtrace from the core file dumped after the crash. We'll need to see the output of "bt" and "bt full".
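
For reference, a minimal gdb session for getting those backtraces from a core file might look like the following (the binary and core paths are assumptions; adjust them to your installation and core pattern):

# after rebuilding with 'make dont-optimize' and reproducing the crash
gdb /usr/sbin/asterisk /tmp/core.asterisk
(gdb) bt
(gdb) bt full
(gdb) quit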

By: Stéphane HENRY (stef) 2007-03-14 11:22:20

Our big problem is that we don't have a core dump. The Asterisk process is still running, but it doesn't accept any more calls. We have to kill it manually.

By: Stéphane HENRY (stef) 2007-03-14 11:25:37

My problem doesn't have anything to do with agents; we don't use this feature on this machine. That's why I opened 0009275.

By: Stéphane HENRY (stef) 2007-03-14 11:31:35

The only debug trace I can give you is the file attached to 0009275.

By: Serge Vecher (serge-v) 2007-03-14 11:37:47

This situation is known as a deadlock. Debugging it is similar to debugging a crash: instead of doing a bt on the core file, you do a bt on the running Asterisk process. Please see the additional details at:
  1) http://www.voip-info.org/tiki-index.php?page=Asterisk%20debugging
  2) Read the "HowTo Debug a DeadLock in Asterisk" section
  3) Post the relevant output here
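
As a concrete sketch (the binary path is an assumption, and pidof may return several PIDs if more than one asterisk process is running), attaching gdb to the live, deadlocked process goes roughly like this:

# attach to the running Asterisk process instead of a core file
gdb /usr/sbin/asterisk -p `pidof asterisk`
(gdb) bt
(gdb) bt full
(gdb) thread apply all bt
(gdb) detach
(gdb) quit

"thread apply all bt" prints the stack of every thread, which is usually what reveals the two threads waiting on each other's locks.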

By: Stéphane HENRY (stef) 2007-03-16 09:03:51

I'm not sure this crash is related to the same glare problem, but with my new build options (-DDEBUG_THREADS and -DDO_CRASH) I think Asterisk may crash before logging a debug message such as "warning Ring requested on channel 0/24 already in use or previously requested on span 4. Attempting to renegotiating channel."
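
In case it helps anyone reproducing this, those options are compile-time defines; a sketch of enabling them (the DEBUG_THREADS Makefile variable is an assumption about the 1.2 build system, so check your tree before relying on it) is:

# rebuild with thread debugging and crash-on-error enabled
make clean
make DEBUG_THREADS="-DDEBUG_THREADS -DDO_CRASH"
make install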

By: Stéphane HENRY (stef) 2007-03-17 15:06:31

I know where my problem comes from: I use a PostgreSQL CDR database with Asterisk's cdr_pgsql.
When we have huge traffic, or when our database is very busy (during a nightly backup, for example), I get this error:
Mar 17 15:29:48 DEBUG[901] chan_zap.c: Ring requested on channel 0/4 already in use or previously requested on span 5.  Attempting to renegotiating channel.
You can easily reproduce this problem if you lock the cdr table in your PostgreSQL database:
services=> begin work;
BEGIN
services=> lock table cdr IN EXCLUSIVE MODE ;
LOCK TABLE

Now wait until you get the problem. Then unlock the cdr table:

services=> COMMIT work;

I have the same problem in version 1.4: Asterisk doesn't accept any more calls, but I don't get the error message. I am attaching the backtrace of the running process and the backtrace of the core dump file for version 1.4.

My suggestion: it would be better to have a query timeout; after the timeout, write the data to the cdr-csv or cdr-custom Asterisk log files.
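
As a stopgap on the database side (not a fix in Asterisk itself), the time any single cdr_pgsql query may run, including time spent waiting on the lock, can be capped with PostgreSQL's statement_timeout; the database name comes from the session above, and the 5000 ms value is only an example:

services=> ALTER DATABASE services SET statement_timeout = 5000;
ALTER DATABASE

New connections to that database then have queries cancelled after 5 seconds, so a blocked INSERT loses the CDR record but no longer ties up the channel thread forever. The setting only applies once cdr_pgsql reconnects, and whether this version of cdr_pgsql recovers cleanly from a cancelled query is untested.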

By: Stéphane HENRY (stef) 2007-03-19 08:54:19

blitzrage suggests testing with cdr_odbc. I will test it and report the results here.

By: Serge Vecher (serge-v) 2007-03-19 09:39:29

stef: since you have shown that the bug you are experiencing is not related to the original glare issue (8957), which is what galeras reported, let's continue debugging it in the issue you opened later (9275), which I will now reopen. Thanks.