ASTERISK-07261: [patch] An Agent transferring a call via SIP transfer causes a segfault or deadlock in queue/agent system

Summary: ASTERISK-07261: [patch] An Agent transferring a call via SIP transfer causes a segfault or deadlock in queue/agent system

Reporter: Leonardo Gomes Figueira (sabbathbh) Labels:

Date Opened: 2006-06-30 06:59:56 Date Closed: 2006-09-13 12:12:06

Priority: Critical Regression? No

Status: Closed/Complete Components: Channels/chan_agent

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) chan_agent_noapplock_on_cb.diff
( 1) debug_agent_transfer_deadlock_1.2-r37143_gdb.txt
( 2) debug_agent_transfer_deadlock_1.2-r37143_log.txt
( 3) debug_agent_transfer_segfault_1.2-r37143_gdb.txt
( 4) debug_agent_transfer_segfault_1.2-r37143_log.txt
( 5) debug_agent_transfer_segfault_1.2-r42014_gdb.txt

Description: Tested on 1.2.9.1 and on 1.2-r36254M:

Asterisk SVN-branch-1.2-r36254M built by root @ pfdesenv on a i686 running Linux on 2006-06-29 17:29:20 UTC

I got these lockup/segfault problems with queues after upgrading from 1.2.7.1 to 1.2.9.1. I had to downgrade back to 1.2.7.1 on production servers. The tests below were made on a development box.

I uploaded 2 sets of logs and backtraces for the 2 different results I get after transferring a call answered by an agent from a queue using SIP transfer:

1- Transferring the call to a Polycom IP300: the call gets "locked" on Polycom/Asterisk and Asterisk segfaults after issuing a "show channels":

debugagenttransfer_segfault_log.txt
debugagenttransfer_segfault_gdb.txt

2- Asterisk in deadlock on queue call

debugagenttransfer_queuedeadlock_log.txt
debugagenttransfer_queuedeadlock_gdb.txt

Both situations can be reproduced on every attempt.

Summary of the situations registered on log files:

1- Segfault situation:

A- SIP/3050 (Soyo) calls queue
B- Agent/3054 - SIP/3054 (Grandstream) answers the call
C- SIP/3054 (Grandstream) makes a SIP attended transfer to SIP/3053 (Polycom)
D- SIP/3050 (Soyo) gets a hangup after the transfer
E- Call stays up on SIP/3053 (Polycom)
F- Issuing a "show channels" on the CLI segfaults Asterisk:

pfdesenv*CLI> show channels
Channel Location State Application(Data)
SIP/3053-89ad (None) Up Bridged Call(Agent/3054)
Agent/3054 s@macro-atende:900 Up Dial(SIP/3053|30|tT)
pfdesenv*CLI>
Disconnected from Asterisk server
Executing last minute cleanups

2- Deadlock situation:

A- SIP/3050 (Soyo) calls queue
B- Agent/3053 - SIP/3053 (Polycom) answers the call
C- SIP/3053 (Polycom) makes a SIP attended transfer to SIP/3054 (Grandstream)
D- Both SIP/3050 (Soyo) and SIP/3054 (Grandstream) get a hangup after the transfer
E- SIP/3050 (Soyo) calls queue again
F- Agent/3054 - SIP/3054 (Grandstream) answers the call
G- SIP/3050 (Soyo) hangs up the call
H- SIP/3054 (Grandstream) gets a hangup
Everything is "fine" up to now (except for the unsuccessful transfer that results in a hangup)
I- SIP/3050 (Soyo) calls queue for the third time
J- Asterisk deadlocks on Queue:

-- Executing Queue("SIP/3050-0fb0", "desenvolvimento|tr|||600") in new stack

Every queue/agent related command stops working on the CLI. Asterisk must be restarted.

I hope the backtraces are useful. If I missed some info, please ask and I will be glad to help track down this problem.

Thanks,

Leonardo

****** ADDITIONAL INFORMATION ******

1- These situations are not load related. In these tests I start Asterisk on the development box, make the 2 or 3 calls to the queue as described above, and the problem occurs. There are no other calls or external services running during these tests.

2- This box has 2 E1 ports but no PRI span configured.

So I think the problem is not related to any hardware/zaptel/pri stuff.

3- There are no manager connections during these tests.

4- I'm not using MixMonitor. I'm using Monitor, but I tested disabling all monitor and queue recording and repeated the test with the same results.

5- IAX2 transfers (Agent call on IAX2) do not trigger these problems.

Comments: By: Serge Vecher (serge-v) 2006-06-30 09:46:08

btw, your revision # indicates code modifications. Can you please detail what was changed from the original source?
By: Leonardo Gomes Figueira (sabbathbh) 2006-06-30 12:37:25

A modification in astgenkey script triggered that M flag. There are no modifications in Asterisk code:

[root@pfdesenv asterisk-1.2-060629]# svnversion -c .
555:36254M
[root@pfdesenv asterisk-1.2-060629]# diff -qr /usr/src/asterisk-1.2-060629/ . |grep differ
Files /usr/src/asterisk-1.2-060629/contrib/scripts/astgenkey and ./contrib/scripts/astgenkey differ
[root@pfdesenv asterisk-1.2-060629]# cp /usr/src/asterisk-1.2-060629/contrib/scripts/astgenkey ./contrib/scripts/astgenkey
cp: overwrite `./contrib/scripts/astgenkey'? y
[root@pfdesenv asterisk-1.2-060629]# svnversion -c .
555:36254
[root@pfdesenv asterisk-1.2-060629]#
By: Serge Vecher (serge-v) 2006-06-30 13:17:45

Apparently this is related to 6626. Please watch for development there, as jstorm has been working on locking problems in chan_agent. If any of the patches arise, please test them and report. Thanks.
By: Leonardo Gomes Figueira (sabbathbh) 2006-06-30 14:42:29

Ok. I'm monitoring 7458 and will test any patches.

Thanks,

Leonardo
By: Frank Waller (explidous) 2006-06-30 19:57:33

We seem to have the same problem; however, all of our backtraces came out unusable (yes, we used DONT_OPTIMIZE).

We are running Asterisk SVN-trunk-r33066

And we are seeing hundreds of those before it crashes:

Jun 28 15:20:33 ERROR[9953]: ../include/asterisk/lock.h:306 __ast_pthread_mutex_lock:
Jun 28 15:20:33 ERROR[9953]: ../include/asterisk/lock.h:309 __ast_pthread_mutex_lock:
By: Leonardo Gomes Figueira (sabbathbh) 2006-07-06 15:30:39

Because of a very silly error in my build process, the backtraces that I had uploaded before were from an Asterisk built without DEBUG_THREADS = -DDEBUG_THREADS -DDETECT_DEADLOCKS.

So I updated to SVN-branch-1.2-r37143, recompiled and generated the logs and backtraces again:

debug_agent_transfer_segfault_1.2-r37143_log.txt
debug_agent_transfer_segfault_1.2-r37143_gdb.txt
debug_agent_transfer_deadlock_1.2-r37143_log.txt
debug_agent_transfer_deadlock_1.2-r37143_gdb.txt

I think they will be more useful now.

Sorry for the mistake.

Leonardo
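
Editor's illustration: the DETECT_DEADLOCKS build flag mentioned above is what makes the lock wrapper report a mutex that cannot be obtained for a long time. The general technique can be sketched as a trylock loop with a time limit. This is a minimal sketch and not Asterisk's lock.h code; the function name lock_with_deadlock_report, the 1 ms polling interval, and the choice to give up with ETIMEDOUT instead of continuing to wait are all assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not part of Asterisk): poll a mutex with
 * pthread_mutex_trylock() and report a suspected deadlock once
 * timeout_ms milliseconds have passed without obtaining it.
 * Returns 0 when the lock was taken, ETIMEDOUT when the wait was
 * abandoned, or the error code from pthread_mutex_trylock().
 */
static int lock_with_deadlock_report(pthread_mutex_t *m, const char *name,
                                     int timeout_ms)
{
    const struct timespec pause = { 0, 1000000 }; /* 1 ms between attempts */
    int waited_ms = 0;

    assert(m != NULL && name != NULL);

    for (;;) {
        int res = pthread_mutex_trylock(m);

        if (res != EBUSY)
            return res; /* 0 on success, otherwise a real error */

        if (waited_ms >= timeout_ms) {
            fprintf(stderr,
                    "possible deadlock: waited %d ms for mutex '%s'\n",
                    waited_ms, name);
            return ETIMEDOUT;
        }

        nanosleep(&pause, NULL);
        waited_ms++;
    }
}
```

A caller that receives ETIMEDOUT knows which named mutex it was stuck on, which is the kind of information that makes a backtrace of a hung process much easier to read.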
By: Serge Vecher (serge-v) 2006-09-01 14:03:19

what are the results with latest 1.2 branch / trunk?
By: Leonardo Gomes Figueira (sabbathbh) 2006-09-01 15:05:28

pfdesenv*CLI> show version
Asterisk SVN-branch-1.2-r41716 built by root @ pfdesenv on a i686 running Linux on 2006-09-01 19:51:59 UTC

Same results: segfault or deadlock.
By: BJ Weschke (bweschke) 2006-09-03 13:39:12

Guys - I'd be interested to see results of the chan_agent_noapplock_on_cb.diff patch just attached to this bug. p->app_lock was a mutex really designed for use with agents not in callback mode. That being the case, I've tried to code it so that when callback mode is used, the app_lock mutex will not be locked/unlocked at all. Please let me know how you make out - and if you continue to deadlock now, please reproduce the deadlock logging information as you've already done.
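
Editor's illustration: the idea described in the comment above (skip app_lock entirely when the agent is in callback mode) can be sketched as follows. This is not the contents of chan_agent_noapplock_on_cb.diff. The struct agent_sketch, its callback_mode and app_lock_held fields, and the helper names are hypothetical stand-ins for chan_agent's real per-agent structure, chosen only to show the conditional locking pattern.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for chan_agent's per-agent state. */
struct agent_sketch {
    pthread_mutex_t app_lock; /* meant for agents that stay logged in on a live channel */
    bool callback_mode;       /* true when the agent is reached by callback */
    int app_lock_held;        /* bookkeeping for this sketch only */
};

static void agent_sketch_init(struct agent_sketch *p, bool callback_mode)
{
    assert(p != NULL);
    pthread_mutex_init(&p->app_lock, NULL);
    p->callback_mode = callback_mode;
    p->app_lock_held = 0;
}

/* Take app_lock only when the agent is NOT in callback mode.
 * Returns 1 if the mutex was locked, 0 if it was skipped. */
static int agent_sketch_lock_app(struct agent_sketch *p)
{
    assert(p != NULL);
    if (p->callback_mode)
        return 0;
    pthread_mutex_lock(&p->app_lock);
    p->app_lock_held = 1;
    return 1;
}

/* Mirror of the lock helper: unlock only when the lock path was used. */
static void agent_sketch_unlock_app(struct agent_sketch *p)
{
    assert(p != NULL);
    if (p->callback_mode)
        return;
    p->app_lock_held = 0;
    pthread_mutex_unlock(&p->app_lock);
}
```

Because both helpers test the same flag, a callback-mode agent never touches the mutex, so no other thread can end up waiting on it for that agent.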
By: Leonardo Gomes Figueira (sabbathbh) 2006-09-05 14:39:35

bweschke,

applied chan_agent_noapplock_on_cb.diff to SVN-branch-1.2-r42014M and the deadlock is gone. Great work!

The segfault still happens if I issue a "show channels" on the console after the SIP transfer of the Agent (before the phone hangs up). The transfer still does not work (the call is hung up).

Uploaded debug_agent_transfer_segfault_1.2-r42014_gdb.txt with the backtrace from the core dump.

Thanks,

Leonardo
By: Serge Vecher (serge-v) 2006-09-05 14:45:16

sabbathbh: now, one thing at a time, ok? If Asterisk segfaults when issuing "show channels" regardless of whether it is patched or not, then that's another issue (hint: open a new bug report and attach a backtrace there -- I hope it was built with make dont-optimize)
By: Amilcar S Silvestre (amilcar) 2006-09-05 15:19:31

bweschke,

I think you got it! Great work, man! No more deadlocks here too (the deadlocks that made asterisk > 1.2.7.1 with agentcallback and transferring completely unusable).

On my dev machine, not one deadlock or segfault (both sip and '#' transfers). I put your patch on a production system (2000 calls a day). Let's see what happens. Everything fine until now! :-)
By: Leonardo Gomes Figueira (sabbathbh) 2006-09-05 15:31:59

serge-v,

and what about the problem that the transfer does not work? (the caller is hung up, the transferee is not) Does it belong in this bug report or not?

I think the cause of the segfault is the remaining channel of the transferee, so if the transfer problem is fixed the segfault will go away with it. I don't think another bug should be opened just for the segfault.

After the failed transfer we have these channels:

Event: Status
Privilege: Call
Channel: SIP/3054-084ce038
CallerID: 3054
CallerIDName: <unknown>
Account: 3052
State: Up
Link: Agent/3052
Uniqueid: 1157487558.13

Event: Status
Privilege: Call
Channel: Agent/3052
CallerID: 3050
CallerIDName: Ramal Teste
Account: 3052
State: Up
Context: sip_to_anywhere
Extension: 3054
Priority: 7
Seconds: 41
Link: SIP/3054-084ce038
Uniqueid: 1157487558.12

Event: Status
Privilege: Call
Channel: SIP/3052-08431178
CallerID: 3052
CallerIDName: <unknown>
Account:
State: Up
Link: ???
Uniqueid: 1157487551.8

SIP/3052-08431178 : transferer (Agent)
SIP/3054-084ce038 : transferee

After the transferee hangs up, all channels are hung up and "show channels" works fine.
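
Editor's illustration: the Status events listed above are plain "Key: value" lines, so a leftover channel such as the one reporting "Link: ???" can be picked out mechanically. The sketch below assumes one event is available as a single newline-separated string. The helper names event_header and event_is_unlinked are hypothetical, and treating "???" as the marker of a channel with no resolvable peer is an assumption drawn only from the output shown in this comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of the "key: value" line found in
 * a newline-separated event block into out (always NUL-terminated,
 * truncated to outlen - 1 bytes). Returns 1 if the key exists, else 0.
 */
static int event_header(const char *event, const char *key,
                        char *out, size_t outlen)
{
    const char *line = event;
    size_t klen;

    assert(event != NULL && key != NULL && out != NULL && outlen > 0);
    klen = strlen(key);

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        /* The key must be followed directly by ':' so that "CallerID"
         * does not match a "CallerIDName" line. */
        if (len > klen && strncmp(line, key, klen) == 0 && line[klen] == ':') {
            const char *val = line + klen + 1;
            size_t vlen = len - klen - 1;

            while (vlen > 0 && *val == ' ') {
                val++;
                vlen--;
            }
            while (vlen > 0 && (val[vlen - 1] == '\r' || val[vlen - 1] == ' '))
                vlen--;
            if (vlen >= outlen)
                vlen = outlen - 1;
            memcpy(out, val, vlen);
            out[vlen] = '\0';
            return 1;
        }
        line = eol ? eol + 1 : line + len;
    }
    return 0;
}

/* Assumption for this sketch: a Link value of "???" marks a channel
 * whose peer could not be resolved. */
static int event_is_unlinked(const char *event)
{
    char link[64];

    return event_header(event, "Link", link, sizeof(link)) &&
           strcmp(link, "???") == 0;
}
```

Applied to the three events above, only the one for SIP/3052-08431178 would be flagged, which matches the channel identified as the transferer.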
By: Serge Vecher (serge-v) 2006-09-05 15:52:21

sabbathbh: I've looked at your bt and the crash is in another part of asterisk, not touched by the patch. So, yes, you have a couple of issues going on. Again, you have full authorization to open another bug report. Bweschke is interested in helping you with that crash, so after you open a new bug-report, please find him on #asterisk-bugs IRC channel. Thanks.
By: Leonardo Gomes Figueira (sabbathbh) 2006-09-06 07:31:17

Opened bug 7890 for the transfer problem.
By: Terry Giufre-Sweetser (tcgs) 2006-09-08 01:59:38

I am seeing the same bug here... (debian "testing" distro)

asterisk*CLI> show version
Asterisk 1.2.10-BRIstuffed-0.3.0-PRE-1q built by mark @ dell.purcell.id.au on a i686 running Linux on 2006-07-27 07:24:14 UTC

Changing to asterisk-classic (Debian package name) demonstrated the same problem.

[1] "reloads" freeze, and any call not using the app_queue module still works.
[2] It always happens after a queue call is transferred.
[3] Restarting asterisk gets you 30 minutes to 2 hours of operation before another agent sip transfer deadlocks the queues.

After changing from Agent-based queues to dynamic queues, the problem no longer shows up, so app_queue is definitely getting deadlocked by chan_agent during attended transfers.

Side issue not related to this bug: unfortunately, the dynamic queues are proving to be rather painful. We have 5 queues, with various operators in more than 1 queue, and they now get multiple simultaneous calls from different queues.

TCGS

By: Serge Vecher (serge-v) 2006-09-12 10:30:44

tcgs: the patch that sabbathbh has reported to fix the issue is now part of 1.2.12.1. Please update to this release and report the results. This request goes out to all the people monitoring this bug as well.

By: Matt King, M.A. Oxon. (kebl0155) 2006-09-13 05:22:40

We've been using the above patch AND the manager event backport patch from ASTERISK-6452 for about a week.

We've had NO deadlocks in that time.

Also the manager event patch has made Asterisk much more responsive, even under heavy load.

Thanks so much for these patches - just what we needed!

Matt.
By: Serge Vecher (serge-v) 2006-09-13 12:11:49

Alright, I think there are enough positive reports to close this issue down (finally). Major props to bweschke for fixing this! Please open a new bug report if somehow this issue is not fixed.

Fixed by chan_agent_noapplock_on_cb.diff, which was committed to the 1.2 branch in r42133 and appears in the 1.2.12.1 release.