[Home]

Summary:ASTERISK-09502: Deadlock problem with agents, queues and libpri (stop accepting incoming calls in PRI lines)
Reporter:Ted Brown (ted brown)Labels:
Date Opened:2007-05-23 18:18:47Date Closed:2011-06-07 14:01:08
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Addons/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 170907_bt_full.txt
( 1) 170907_info_threads.txt
( 2) 170907_info_variables.txt
( 3) 170907_thread_apply_all.txt
( 4) bt_full_100907.txt
( 5) gbd_bt_fulll.txt
( 6) gdb_info_stack.txt
( 7) gdb_thread_apply_all_bt.txt
( 8) info_stack_100907.txt
( 9) thread_apply_all_bt_100907.txt
Description:I have a Asterisk-based call center deployment with around 40 SIP users, attending incoming calls from two PRI lines (2xE1) using agents and queues.

The problem is that Asterisk stops accepting new incoming calls to the PRI lines without reason, although there should be free channels to make room for new incoming calls, but Asterisk thinks these channels are being used. SIP calls can be placed without problems between internal users.

PRI lines shouldn't be the origin of the problem, as an old legacy PBX works perfectly with the same lines, so the problem seems to be related with agents or queues.

After the crash, performing an "zap show channels" shows that all channels are busy, and calls seems that have been queued for a long time in different queues (and they are not really there - users usually don't wait 90 minutes to be attended while listening to the music on hold).

There is no other services running on the server, CDR is being stored to  disk and we are not using any kind of AGI's or reporting tools. Currently the only solution is to reboot the machine, as rebooting Asterisk is not enough. Using any command on the CLI results in no output at all.

The crash is not easily reproduceable, as it doesn't follow a clear pattern. Asterisk just seem to get blocked when it manages around 30-40 calls in the queues. During last week, we had 2-3 crashed each day.

Based on users lists mails, it seems that other users have had a similar problem within the same scenario, at least with 1.2.x. More precisely, we have observed the same problem in bug ID 0006147, but it has been closed without a clear answer.

Hardware and software specs:

Platform: Suse Linux Enterprise Server 10
Machine: IBM xSeries 226, 1 GB RAM, Intel CPU
PRI card: Digium TE212 with echo cancellation module
Asterisk version: 1.2.18

Follows a list of the most relevant messages before and after the crash:

DEBUG[28519] chan_sip.c: Stopping retransmission on 'NzNmZWM0ZDc0OTYyNWI5YWM2ZTBhZjY3NDM4N2RjNmQ.' of Response 12: Match Found  (lots of messages like that)

DEBUG[28511] chan_zap.c: Ring requested on channel 0/13 already in use or previously requested on span 1.  Attempting to renegotiating channel.

DEBUG[28511] chan_zap.c: Found empty available channel 0/9

DEBUG[29939] app_dial.c: Exiting with DIALSTATUS=CONGESTION.

I would very appreciate any help on this. I can provide backtrace if needed.

Best regards,
Comments:By: Ted Brown (ted brown) 2007-05-24 13:23:07

Hi again,

after a deep analisys of log files, we have learned that Asterisk crashes with a segmentation fault, maybe related with channel bridging, thus dropping all active conversations. After crashing, safe_asterisk reloads Asterisk again, but at the same time there are new incoming calls.

IMHO, this bug may not be related with PRI lines but with blind transfers, bridging or similar. Here follows the event log just before the crash (which occurred at 18:09:56).

May 23 18:09:49 DEBUG[12153] chan_sip.c: (Provisional) Stopping retransmission (but retaining packet) on '65d004213b2096196df5
af51443c9d55@192.168.2.9' Request 102: Found
May 23 18:09:49 VERBOSE[28480] logger.c:     -- SIP/507-0858f708 is ringing
May 23 18:09:49 VERBOSE[28451] logger.c:     -- Agent/507 is ringing
May 23 18:09:49 DEBUG[28073] app_queue.c: Everyone is busy at this time
May 23 18:09:50 DEBUG[28073] app_queue.c: Everyone is busy at this time
May 23 18:09:51 DEBUG[28073] app_queue.c: Everyone is busy at this time
May 23 18:09:52 DEBUG[27919] chan_agent.c: Bridge on 'SIP/504-081fa4a8' being cleared (2)
May 23 18:09:52 DEBUG[27919] chan_sip.c: update_call_counter(504) - decrement call limit counter
May 23 18:09:52 DEBUG[27919] channel.c: Didn't get a frame from channel: Agent/504
May 23 18:09:52 DEBUG[27919] channel.c: Bridge stops bridging channels Zap/27-1 and Agent/504
May 23 18:09:52 DEBUG[27919] chan_agent.c: Hangup called for state Up
May 23 18:09:52 VERBOSE[27919] logger.c:   == Spawn extension (queuename, s, 20) exited non-zero on 'Zap/27-1'
May 23 18:09:52 DEBUG[27919] chan_zap.c: Set option AUDIO MODE, value: ON(1) on Zap/27-1
May 23 18:09:52 DEBUG[27919] chan_zap.c: Hangup: channel: 27 index = 0, normal = 38, callwait = -1, thirdcall = -1
May 23 18:09:52 DEBUG[27919] chan_zap.c: Not yet hungup...  Calling hangup once with icause, and clearing call
May 23 18:09:52 DEBUG[27919] chan_zap.c: disabled echo cancellation on channel 27
May 23 18:09:52 DEBUG[27919] chan_zap.c: Set option TDD MODE, value: OFF(0) on Zap/27-1
May 23 18:09:52 DEBUG[27919] chan_zap.c: Updated conferencing on 27, with 0 conference users
May 23 18:09:52 DEBUG[27919] chan_zap.c: Set option AUDIO MODE, value: OFF(0) on Zap/27-1
May 23 18:09:52 DEBUG[27919] chan_zap.c: disabled echo cancellation on channel 27
May 23 18:09:52 VERBOSE[27919] logger.c:     -- Hungup 'Zap/27-1'
May 23 18:09:56 VERBOSE[28498] logger.c: Asterisk Event Logger Started /var/log/asterisk/event_log
May 23 18:09:56 VERBOSE[28498] logger.c:   == Parsing '/etc/asterisk/dnsmgr.conf': May 23 18:09:56 VERBOSE[28498] logger.c:
== Parsing '/etc/asterisk/dnsmgr.conf': Found
May 23 18:09:56 VERBOSE[28498] logger.c: Asterisk Dynamic Loader loading preload modules:
May 23 18:09:56 VERBOSE[28498] logger.c:   == Parsing '/etc/asterisk/modules.conf': May 23 18:09:56 VERBOSE[28498] logger.c:
== Parsing '/etc/asterisk/modules.conf': Found
May 23 18:09:56 VERBOSE[28498] logger.c:   == Manager registered action Ping
May 23 18:09:56 VERBOSE[28498] logger.c:   == Manager registered action Events

(etc... rest of the initialization messages...)

By: Ted Brown (ted brown) 2007-05-24 13:23:46

I also attach the result of several GDB commands.

By: Ted Brown (ted brown) 2007-06-09 16:37:50

Any help on this? I have tried to reproduce the crash, but I didn't find the error pattern.

By: Russell Bryant (russell) 2007-08-23 13:30:06

This bug report indicates that the asterisk version was 1.2.  This version is no longer supported except for security fixes.  Feel free to reopen this if you still have a problem with 1.4.

By: Ted Brown (ted brown) 2007-09-11 05:39:06

Hi again,

I regret to inform that the problem has appeared again, using the following versions:

- Asterisk 1.4.11
- Libpri 1.4.1
- Zaptel 1.4.5.1

The scenario remains the same as before:

- 2 PRI E1 links to the telco (using TE212P)
- 40 SIP users in the LAN (using Eyebeam)
- Using queues and agents to dispatch calls to SIP users

CPU and RAM levels are OK, and there are no other process or services running on the machine (no MySQL server, no Apache, etc...). Without special reason, Asterisk crashes with a segmentation fault error, this is the content of the /var/log/messages file regarding the crashes. Sometimes it takes 2-3 hours to crash, but yestearde we got 3 crashes in less than 10 minutes:

Sep 10 16:32:19 pbx kernel: asterisk[20940] general protection rip:433aff rsp:40bf2128 error:0
Sep 10 18:46:14 pbx kernel: asterisk[17391] general protection rip:433aff rsp:40a12128 error:0
Sep 10 19:01:09 pbx kernel: asterisk[21651]: segfault at 00002ae254525f50 rip 00002ae254525f50 rsp 0000000040d1e128 error 15
Sep 10 19:03:07 pbx kernel: asterisk[21699]: segfault at 00000000000000b8 rip 0000000000433aff rsp 00000000406c6958 error 4
Sep 10 19:06:53 pbx kernel: asterisk[22497]: segfault at 00000000000000b8 rip 0000000000433aff rsp 00000000406c6958 error 4

Do not hesitate to contact us in case you need further information. I attach several GDB results.

By: Ted Brown (ted brown) 2007-09-14 08:39:59

Hi everyone,

after a 3-days troubleshooting process, I think we've located the problem, and we can force it to appear as many times as we want.

The scenario is like follows:

- A call ("Call_1") arrives to a queue named 'Queue_A'. An agent 'Agent_A', registered in that queue, takes that call 'Call_1'.

- After that, the same agent 'Agent_A' initiates another call ("Call_2", using a free line in his/her softphone, and placing "Call_1" on hold" ) to another queue 'Queue_B'.

- An agent 'B' ('Agent_B') registered in that queue ('Queue_B') takes the call, so 'Agent_A' transfers the first call 'Call_1' to agent 'B'.

- Both agents use softphones (Eyebeam).

After this simple transfer, Asterisk will crash in one of these situations:

a) If you make a 'show channels' or 'core show channels' command on the CLI
b) If another new call is issued, no matter where it comes from.

In the first case, Asterisk doesn't always crash for the first time, but it will eventually crash if you make several 'show channels" continously (15-20 in our tests).

In the second case, Asterisk crashes as soon as it receives an INVITE request (we've noticed in the coredump that this request is sent from agent 'B' to agent 'A' a time after the transfer, which we think makes no sense, as the new call has nothing to do with any of them).
 
In the troubleshooting process, we first tried with incoming calls from a PRI line. Asterisk crashed. So we tried it from a SIP line, as we thought
it could be a PRI line misconfiguration, but it also crashed. So the issue is not related at all with PRI configuration (I've tried to change the summary of this issue with no luck, shoul I create a new one??).

Looking at the core generated by the crash, we've found this:

#0  0x000000000044a03b in ast_bridged_channel (chan=0x2aaab5227b24) at channel.c:3900
3900  if (bridged && bridged->tech && bridged->tech->bridged_channel)

In fact, the line should be "if (bridged && bridged->tech->bridged_channel)", but as bridged->tech was null, the comprobation caused the crash. The change we made (to include "&& bridged->tech") didn't solve the issue.

Now, bridged->tech value is:

(gdb) print bridged->tech
$1 = (const struct ast_channel_tech *) 0x2aaadeadbeef

We can't go much further than this, as this variable is referenced from a lot of functions. The big issue is to know why bridged->tech is corrupted. Could this be a bad-freeing memory issue? Or maybe a semaphore/monitor problem accessing to this variable?

We have also recompiled Asterisk with DONT-OPTIMIZE directive and other debug flags, and we disabled every not necessary IRQ in the bios. CPU and RAM aren't an issue, as we are monitoring them and in no point CPU raised higher than 40% and neither RAM raised higher than 10% (both peak values).

As we can easily reproduce the crash, do not hesitate to contact us for more tests.

Thank you!

By: Russell Bryant (russell) 2007-09-14 13:12:52

I just wanted to point out one thing.  This value is very interesting.

(gdb) print bridged->tech
$1 = (const struct ast_channel_tech *) 0x2aaadeadbeef

Note the "deadbeef" there at the end.  That is no coincidence.  When you compile with MALLOC_DEBUG enabled, Asterisk allocates a "fence" around all blocks of memory it allocates.  It sets the value of the high and low fence to 0xdeadbeef.  The code should _never_ be touching this memory.

By: Ted Brown (ted brown) 2007-09-17 10:16:14

Russel,

thanks for pointing out this. Just in case it helps, I've added several files with a more complete backtraces with the result of the following commands at GDB:

 - bt full
 - info stack
 - thread apply all
 - several values of relevant variables (bridged, bridged->tech, etc...)

The process to force the segfault are as follows:

- New call with caller-id "302" to queue "400"
- Call passed to agent 402
- Agent 402 takes a new line in his softphone and starts a new call the number of another queue (queue 100 in our example)
- Agent 103 takes the previous call
- Agent 402 transfer the call from "302" to agent 103

After that, Asterisk will crash on a "show channels" or when processing an INVITE. The examples provided here are related to a "show channel" command in the CLI. This time, Asterisk crashes at line 3091 of channel.c, instead of 3900:

[line 3900]  if (bridged && bridged->tech && bridged->tech->bridged_channel)
[line 3901]     bridged = bridged->tech->bridged_channel(chan, bridged);
[line 3902]            return bridged;

(gdb) print bridged
$1 = (struct ast_channel *) 0x2aaab53c13c4

(gdb) print bridged->tech
$2 = (const struct ast_channel_tech *) 0x532f04

(gdb) print bridged->tech->bridged_channel
$3 = (struct ast_channel *(* const)(struct ast_channel *, struct ast_channel *)) 0x614c202e79726576

(gdb) print chan
$4 = (struct ast_channel *) 0x2aaab53becc4

If I can provide further information or debugging which can be of help, do not hesitate to contact me.

By: Ted Brown (ted brown) 2007-09-18 09:48:20

Hi again,

I've been working on this issue, and have discovered that the result of the "show channels" command at the CLI is different when using a softphone to do an attended transfer to a queue.

Output of "show channels" using softphones (Eyebeam):

SIP/112-01686304     112@internal:1       Up      (None)
Agent/112            402@internal:1       Up      Bridged Call(SIP/302-b50fbfc4)
SIP/302-b50fbfc4     s@cogaadmin          Up      Queue(QueueADMIN|tT|||300)

The second line above seems a bit strange. This is the output of "show channels" using a different phone (Linksys):

Agent/112            112@internal:1       Up      Bridged Call(SIP/324-b4abefd0)
SIP/112-00864ba0     112@internal:1       Up      (None)
SIP/324-b4abefd0     s@cogareserve:4      Up      Queue(QueueRESERVE|tT|||3000)

Agents 302 and 324 are registered to the same queue (300). Hope it helps.

By: Ted Brown (ted brown) 2007-09-23 16:15:26

We've narrowed our search, and we can say that the crash is not related to PRI lines as stated in the subject.

We've opened a new bug ( id 10809) so this one can be closed.

By: Michiel van Baak (mvanbaak) 2007-09-23 16:47:32

closed on reportes request