Summary: | ASTERISK-07261: [patch] An Agent transferring a call via SIP transfer causes a segfault or deadlock in queue/agent system | ||
Reporter: | Leonardo Gomes Figueira (sabbathbh) | Labels: | |
Date Opened: | 2006-06-30 06:59:56 | Date Closed: | 2006-09-13 12:12:06 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Channels/chan_agent |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) chan_agent_noapplock_on_cb.diff ( 1) debug_agent_transfer_deadlock_1.2-r37143_gdb.txt ( 2) debug_agent_transfer_deadlock_1.2-r37143_log.txt ( 3) debug_agent_transfer_segfault_1.2-r37143_gdb.txt ( 4) debug_agent_transfer_segfault_1.2-r37143_log.txt ( 5) debug_agent_transfer_segfault_1.2-r42014_gdb.txt | |
Description: | Tested on 1.2.9.1 and on 1.2-r36254M:

Asterisk SVN-branch-1.2-r36254M built by root @ pfdesenv on a i686 running Linux on 2006-06-29 17:29:20 UTC

I started getting these lockup/segfault problems with queues after upgrading from 1.2.7.1 to 1.2.9.1, and had to downgrade to 1.2.7.1 on production servers. The tests below were made on a devel box.

I uploaded two sets of logs and backtraces for the two different results I get after transferring a call answered by an agent from a queue using SIP transfer:

1- Transferring the call to a Polycom IP300: the call gets "locked" on Polycom/Asterisk, and Asterisk segfaults after issuing a "show channels":
debugagenttransfer_segfault_log.txt
debugagenttransfer_segfault_gdb.txt

2- Asterisk deadlocks on a queue call:
debugagenttransfer_queuedeadlock_log.txt
debugagenttransfer_queuedeadlock_gdb.txt

Both situations can be reproduced on every attempt.

Summary of the situations recorded in the log files:

1- Segfault situation:
A- SIP/3050 (Soyo) calls queue
B- Agent/3054 - SIP/3054 (Grandstream) answers the call
C- SIP/3054 (Grandstream) makes a SIP attended transfer to SIP/3053 (Polycom)
D- SIP/3050 (Soyo) gets a hangup after the transfer
E- Call stays up on SIP/3053 (Polycom)
F- Issuing a "show channels" on the CLI segfaults Asterisk:

pfdesenv*CLI> show channels
Channel Location State Application(Data)
SIP/3053-89ad (None) Up Bridged Call(Agent/3054)
Agent/3054 s@macro-atende:900 Up Dial(SIP/3053|30|tT)
pfdesenv*CLI>
Disconnected from Asterisk server
Executing last minute cleanups

2- Deadlock situation:
A- SIP/3050 (Soyo) calls queue
B- Agent/3053 - SIP/3053 (Polycom) answers the call
C- SIP/3053 (Polycom) makes a SIP attended transfer to SIP/3054 (Grandstream)
D- Both SIP/3050 (Soyo) and SIP/3054 (Grandstream) get a hangup after the transfer
E- SIP/3050 (Soyo) calls queue again
F- Agent/3054 - SIP/3054 (Grandstream) answers the call
G- SIP/3050 (Soyo) hangs up the call
H- SIP/3054 (Grandstream) gets a hangup
Everything is "fine" up to this point (except for the unsuccessful transfer that results in a hangup)
I- SIP/3050 (Soyo) calls queue for the third time
J- Asterisk deadlocks on Queue:

-- Executing Queue("SIP/3050-0fb0", "desenvolvimento|tr|||600") in new stack

Every queue/agent related command stops working on the CLI. Asterisk must be restarted.

I hope the backtraces are useful. If I missed some info, please ask and I will be glad to help track down this problem.

Thanks,
Leonardo

****** ADDITIONAL INFORMATION ******

1- These situations are not load related. In these tests I start Asterisk on the devel box, make the 2 or 3 calls to the queue as described above, and the problem occurs. There are no other calls or external services running during these tests.
2- This box has 2 E1 ports but no PRI span configured, so I think the problem is not related to any hardware/zaptel/pri stuff.
3- There are no manager connections during these tests.
4- I'm not using MixMonitor. I'm using Monitor, but I disabled all monitor and queue recording and repeated the test with the same results.
5- IAX2 transfers (Agent call on IAX2) do not trigger these problems. | ||
Comments: | By: Serge Vecher (serge-v) 2006-06-30 09:46:08

btw, your revision # indicates code modifications. Can you please detail what was changed from the original source?

By: Leonardo Gomes Figueira (sabbathbh) 2006-06-30 12:37:25

A modification in the astgenkey script triggered that M flag. There are no modifications in Asterisk code:

[root@pfdesenv asterisk-1.2-060629]# svnversion -c .
555:36254M
[root@pfdesenv asterisk-1.2-060629]# diff -qr /usr/src/asterisk-1.2-060629/ . |grep differ
Files /usr/src/asterisk-1.2-060629/contrib/scripts/astgenkey and ./contrib/scripts/astgenkey differ
[root@pfdesenv asterisk-1.2-060629]# cp /usr/src/asterisk-1.2-060629/contrib/scripts/astgenkey ./contrib/scripts/astgenkey
cp: sobrescrever `./contrib/scripts/astgenkey'? y
[root@pfdesenv asterisk-1.2-060629]# svnversion -c .
555:36254
[root@pfdesenv asterisk-1.2-060629]#

By: Serge Vecher (serge-v) 2006-06-30 13:17:45

Apparently this is related to 6626. Please watch for development there, as jstorm has been working on locking problems in chan_agent. If any patches arise, please test them and report. Thanks.

By: Leonardo Gomes Figueira (sabbathbh) 2006-06-30 14:42:29

Ok. I'm monitoring 7458 and will test any patches.

Thanks,
Leonardo

By: Frank Waller (explidous) 2006-06-30 19:57:33

We seem to have the same problem; however, all of our backtraces came out unusable (yes, we used DONT_OPTIMIZE).
We are running Asterisk SVN-trunk-r33066, and we are seeing hundreds of these before it crashes:

Jun 28 15:20:33 ERROR[9953]: ../include/asterisk/lock.h:306 __ast_pthread_mutex_lock:
Jun 28 15:20:33 ERROR[9953]: ../include/asterisk/lock.h:309 __ast_pthread_mutex_lock:

By: Leonardo Gomes Figueira (sabbathbh) 2006-07-06 15:30:39

Because of a very silly error in my build process, the backtraces that I had uploaded before were from an Asterisk built without

DEBUG_THREADS = -DDEBUG_THREADS -DDETECT_DEADLOCKS

So I updated to SVN-branch-1.2-r37143, recompiled, and generated the logs and backtraces again:

debug_agent_transfer_segfault_1.2-r37143_log.txt
debug_agent_transfer_segfault_1.2-r37143_gdb.txt
debug_agent_transfer_deadlock_1.2-r37143_log.txt
debug_agent_transfer_deadlock_1.2-r37143_gdb.txt

I think they will be more useful now. Sorry for the mistake.

Leonardo

By: Serge Vecher (serge-v) 2006-09-01 14:03:19

what are the results with latest 1.2 branch / trunk?

By: Leonardo Gomes Figueira (sabbathbh) 2006-09-01 15:05:28

pfdesenv*CLI> show version
Asterisk SVN-branch-1.2-r41716 built by root @ pfdesenv on a i686 running Linux on 2006-09-01 19:51:59 UTC

Same results: segfault or deadlock.

By: BJ Weschke (bweschke) 2006-09-03 13:39:12

Guys - I'd be interested to see results of the chan_agent_noapplock_on_cb.diff patch just attached to this bug. p->app_lock was a mutex really designed for use with agents not in callback mode. That being the case, I've tried to code it so that when callback mode is used, the app_lock mutex will not be locked/unlocked at all. Please let me know how you make out - and if you continue to deadlock now, please reproduce the deadlock logging information as you've already done.
By: Leonardo Gomes Figueira (sabbathbh) 2006-09-05 14:39:35

bweschke, applied chan_agent_noapplock_on_cb.diff to SVN-branch-1.2-r42014M and the deadlock is gone. Great work!

The segfault still happens if I issue a "show channels" on the console after the SIP transfer of the Agent (before the phone hangs up). The transfer still does not work (the call is hung up).

Uploaded debug_agent_transfer_segfault_1.2-r42014_gdb.txt with a backtrace of the core dump.

Thanks,
Leonardo

By: Serge Vecher (serge-v) 2006-09-05 14:45:16

sabbathbh: now, one thing at a time, ok? If asterisk segfaults when issuing "show channels" regardless of whether it is patched or not, then that's another issue (hint: open a new bug report and attach a backtrace there -- I hope it was built with make dont-optimize)

By: Amilcar S Silvestre (amilcar) 2006-09-05 15:19:31

bweschke, I think you got it! Great work, man! No more deadlocks here either (the deadlocks that made asterisk > 1.2.7.1 with agentcallback and transferring completely unusable). On my dev machine, not one deadlock or segfault (both sip and '#' transfers). I put your patch on a production system (2000 calls a day). Let's see what happens. Everything fine until now! :-)

By: Leonardo Gomes Figueira (sabbathbh) 2006-09-05 15:31:59

serge-v, and what about the problem that the transfer does not work (caller is hung up, transferee is not)? Does that belong to this bug report or not? I think the cause of the segfault is the remaining transferee channel, so if the transfer problem is fixed the segfault will go away with it. I don't think another bug should be opened just for the segfault.
After the failed transfer we have these channels:

Event: Status
Privilege: Call
Channel: SIP/3054-084ce038
CallerID: 3054
CallerIDName: <unknown>
Account: 3052
State: Up
Link: Agent/3052
Uniqueid: 1157487558.13

Event: Status
Privilege: Call
Channel: Agent/3052
CallerID: 3050
CallerIDName: Ramal Teste
Account: 3052
State: Up
Context: sip_to_anywhere
Extension: 3054
Priority: 7
Seconds: 41
Link: SIP/3054-084ce038
Uniqueid: 1157487558.12

Event: Status
Privilege: Call
Channel: SIP/3052-08431178
CallerID: 3052
CallerIDName: <unknown>
Account:
State: Up
Link: ???
Uniqueid: 1157487551.8

SIP/3052-08431178 : transferer (Agent)
SIP/3054-084ce038 : transferee

After the transferee hangs up, all channels are hung up and "show channels" works fine.

By: Serge Vecher (serge-v) 2006-09-05 15:52:21

sabbathbh: I've looked at your bt and the crash is in another part of asterisk, not touched by the patch. So, yes, you have a couple of issues going on. Again, you have full authorization to open another bug report. Bweschke is interested in helping you with that crash, so after you open a new bug report, please find him on the #asterisk-bugs IRC channel. Thanks.

By: Leonardo Gomes Figueira (sabbathbh) 2006-09-06 07:31:17

Opened bug 7890 for the transfer problem.

By: Terry Giufre-Sweetser (tcgs) 2006-09-08 01:59:38

I am seeing the same bug here... (debian "testing" distro)

asterisk*CLI> show version
Asterisk 1.2.10-BRIstuffed-0.3.0-PRE-1q built by mark @ dell.purcell.id.au on a i686 running Linux on 2006-07-27 07:24:14 UTC

Changing to asterisk-classic (Debian package name) demonstrated the same problem.

[1] "reloads" freeze, and any call not using the app_queue module still works.
[2] It always happens after a queue call is transferred.
[3] Restarting asterisk gets you 30 minutes to 2 hours of operation before another agent sip transfer deadlocks the queues.
After changing from Agent-based queues to dynamic queues, the problem is not showing up, so app_queue is definitely getting deadlocked by chan_agent during attended transfers.

Side issue not related to this bug: unfortunately, the dynamic queues are proving to be rather painful. We have 5 queues, with various operators in more than 1 queue, and they now get multiple simultaneous calls from different queues.

TCGS

By: Serge Vecher (serge-v) 2006-09-12 10:30:44

tcgs: the patch that sabbathbh has reported to fix the issue is now part of 1.2.12.1. Please update to this release and report the results. This request goes out to all the people monitoring this bug as well.

By: Matt King, M.A. Oxon. (kebl0155) 2006-09-13 05:22:40

We've been using the above patch AND the manager event backport patch from ASTERISK-6452 for about a week. We've had NO deadlocks in that time. Also, the manager event patch has made Asterisk much more responsive, even under heavy load.

Thanks so much for these patches - just what we needed!

Matt.

By: Serge Vecher (serge-v) 2006-09-13 12:11:49

Alright, I think there are enough positive reports to close this issue down (finally). Major props to bweschke for fixing this! Please open a new bug report if somehow this issue is not fixed.

Fixed by chan_agent_noapplock_on_cb.diff, which was committed to the 1.2 branch in r42133 and appears in the 1.2.12.1 release. |
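
Editor's illustration: bweschke's comment above describes the idea behind chan_agent_noapplock_on_cb.diff: p->app_lock was designed for agents that are not in callback mode, so the patch avoids locking or unlocking it when callback mode is used. The sketch below illustrates only that idea. It is written in Python rather than the C of chan_agent and is not the actual patch. The names AgentPvt, callback_mode, app_lock_if_needed and handle_call are hypothetical, and the demo in which the lock is already held elsewhere is an assumption made for illustration, not the confirmed cause of the deadlock in this report.

```python
import threading
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class AgentPvt:
    """Hypothetical, simplified stand-in for chan_agent's per-agent state."""
    name: str
    callback_mode: bool  # hypothetical flag: True for callback agents
    app_lock: threading.Lock = field(default_factory=threading.Lock)


@contextmanager
def app_lock_if_needed(agent):
    """Hold agent.app_lock only for agents that are not in callback mode."""
    if agent.callback_mode:
        # Callback agents skip the mutex entirely: no lock, no unlock.
        yield False
        return
    with agent.app_lock:
        yield True


def handle_call(agent):
    """Return True if app_lock was held while the call was handled."""
    with app_lock_if_needed(agent) as locked:
        return locked


if __name__ == "__main__":
    callback_agent = AgentPvt("Agent/3054", callback_mode=True)
    # Assumption for the demo: some other code path still holds app_lock.
    callback_agent.app_lock.acquire()
    # The callback agent never touches app_lock, so this call cannot block.
    print("callback agent took app_lock:", handle_call(callback_agent))

    login_agent = AgentPvt("Agent/3053", callback_mode=False)
    print("non-callback agent took app_lock:", handle_call(login_agent))
```

The design point is that the lock and the unlock are skipped together in one place, so a callback agent can never end up waiting on, or releasing, a mutex that was only meant for non-callback agents.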