Summary: | ASTERISK-11692: [branch] Deadlock after Originate from AMI to Agent | ||
Reporter: | Igor Goncharovsky (igorg) | Labels: | |
Date Opened: | 2008-03-21 03:15:26 | Date Closed: | 2009-01-12 15:02:36.000-0600 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Channels/chan_agent |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) core_show_locks_2.txt ( 1) core_show_locks.txt ( 2) jpeeler-lock.txt ( 3) normal_status.txt ( 4) wrong_status.txt ( 5) wrong-output.txt | |
Description: | I have discovered deadlock while using chan_agent and originate cal via AMI. To reproduce need one agent and one registered phone. 1) Login via AMI and make originate call 2) Answer a call both sides, talk and hangup phone. 3) After hangup MoH for agent doesn't starts and CLI show ERRORs: -- Started music on hold, class 'default', on SIP/104-08362618 [Mar 21 13:35:21] ERROR[8145]: /usr/src/voip/asterisk-trunk/asterisk-trunk.patched-cng/include/aster:461 __ast_pthread_mutex_unlock: chan_agent.c line 843 (agent_hangup): attempted unlock mutex '&p->app_lock' without owning it! [Mar 21 13:35:21] ERROR[8145]: /usr/src/voip/asterisk-trunk/asterisk-trunk.patched-cng/include/aster:463 __ast_pthread_mutex_unlock: chan_agent.c line 979 (agent_new): '&p->app_lock' was locked here. [Mar 21 13:35:21] ERROR[8145]: /usr/src/voip/asterisk-trunk/asterisk-trunk.patched-cng/include/aster:486 __ast_pthread_mutex_unlock: chan_agent.c line 843 (agent_hangup): Error releasing mutex: Operation not permitted 3) 'agent show online' still show called side in list server-voip*CLI> agent show online 1001 (Vasya Pupkin) logged in on SIP/104-08362618 is idle (musiconhold is 'default') 1 agents online> 3) After second try to Originage, originate fail and using CLI command 'agent show online' fail and make CLI frozen ****** ADDITIONAL INFORMATION ****** Command used to originate: Action: Originate Channel: Agent/1001 Exten: 102 Priority: 1 Context: default Async: true Branch available at: http://svn.digium.com/svn/asterisk/team/jpeeler/bug12269 | ||
Comments: | By: Tilghman Lesher (tilghman) 2008-05-15 12:57:32 Can you replicate this issue with unpatched trunk? I don't want to try to track this down, when it's clear that you're using an altered version. By: Igor Goncharovsky (igorg) 2008-05-16 00:46:19 This bug reported to me while implementing call-center (1.4 without modifications) and I reproduce it on trunk (unpatched). If it needed I'll check latest trunk for this issue too. By: Igor Goncharovsky (igorg) 2008-05-16 02:11:22 Reproduced on latest trunk server-voip*CLI> core show version Asterisk SVN-trunk-r116731 built by root @ server-voip on a i686 running Linux on 2008-05-16 05:54:04 UTC By: Tilghman Lesher (tilghman) 2008-06-17 18:22:55 Please compile Asterisk with DEBUG_THREADS, and obtain a 'core show locks' when this occurs. By: Igor Goncharovsky (igorg) 2008-06-17 21:49:51 Info uploaded, tested with SVN-trunk-r123609. By: Tilghman Lesher (tilghman) 2008-06-17 22:19:57 Is that the full output? It looks like you're missing a large amount. By: Igor Goncharovsky (igorg) 2008-06-17 23:22:34 Yes it is all output after extension 102 disconnected. While agent and exten on the call there is one more lock. Output attached. By: Guillaume Knispel (gknispel_proformatique) 2008-07-11 09:32:29 We have hit this problem too and traced it some time ago. The problem is that the Originate command launch an incredible number of threads just to start the new call (maybe 3 or 4, i don't remember clearly). In one of this thread a callback to chan_agent is done. IIRC an other callback to chan_agent is done latter in another thread. Now the thing is Asterisk uses reentrant locks. The other thing is that synchronisation in chan_agent is performed in a very "imaginative" way taking and releasing locks in a loop in a separated thread in chan_agent that provides something like moh, and that is temporarily stopped when a call is propagated (so that the thread of this call can send its audio to the bridged agent instead of the moh audio). The result of all of the above is that, when using the AMI Originate command, a lock is taken in one thread, that is later tried to be released by an other thread. I'm not sure if this could work with non reentrant lock, but what i'm sure about is that this can't with reentrant ones :p This situation can not happen with a Dial because calls performed by a Dial do not follow the same path as call performed by an Originate, there are less threads involved and the lock that is taken and later released is correctly managed in the same thread. Now about fixing this issue; In our case we did not know why at the begining there are so many threads involved in the execution of an Originate command, nor why synchronisation in chan_agent is written like that. There might me good reasons. Also we needed to solve the issue quickly with a guarantee that we would not introduce any regression, as a client was involved. So we did not took the risk to modify the chan_agent implementation nor the Originate implementation, but instead rewrote or own simplified version of Originate that do not start so many threads, and that is used just for this purpose of letting agents dial out with their agent identity. I don't know all the Asterisk code pathes well enough to think about how this could be fixed correctly upstream. Maybe the number of thread involed in Originate can be reduced, but this might cause other problems with other channels. Maybe somebody who know chan_agent well enough can rewrite it correctly so it does not abuse locks so much. FYI you can take a look at our alternate originate implementation. The problem is that it was for Asterisk 1.2. Also it does not support all the options that Originate does. It also abuses Asterisk 1.2 internals :/ http://www.proformatique.com/asterisk/ami_aoriginate.c By: Guillaume Knispel (gknispel_proformatique) 2008-07-15 12:13:17 Here is an update of ami_aoriginate for Asterisk 1.4, still untested: http://www.proformatique.com/asterisk/ami_aoriginate_1_4.c IgorG if you can give it a try and it works, that means the issue comes probably from what I explained previously. By: Mark Michelson (mmichelson) 2008-07-17 18:28:49 I have uploaded a patch which caused locking errors to go away. The patch I have created is made against Asterisk 1.4 By: Mark Michelson (mmichelson) 2008-07-18 09:13:15 After reconsidering this, my patch is bad. The problem is that the manager's "actionlock" will be held through the entire call if "Async: false" is specified. This will result in all manager actions being blocked until the call has completed. I will attempt a different patch in the meantime. By: Tilghman Lesher (tilghman) 2008-07-31 12:54:42 putnopvut: any progress on the aforementioned patch? By: Mark Michelson (mmichelson) 2008-07-31 13:12:35 Corydon76: I worked on a patch for this one day, but ended up not completing it and not really liking it much either. I believe I can carve some time for this either today or tomorrow. Thanks for the reminder. By: Jeff Peeler (jpeeler) 2008-08-20 12:16:41 As a work around until this is fixed, you can reverse the direction of the originate. Originating from the non-agent channel causes the request and the hangup to occur in the same thread as the code expects. By: Jeff Peeler (jpeeler) 2008-10-07 16:31:20 I created a branch for this bug and I think I have the locking problems solved. I'd like to get some feedback though! http://svn.digium.com/svn/asterisk/team/jpeeler/bug12269 By: Igor Goncharovsky (igorg) 2008-10-13 04:24:07 jpeeler, i have tested this branch and found that locking problem now solved. There are infinite output of these log messages in console while agend loged in, but seems there is no problem here: [Oct 13 16:31:17] NOTICE[14451]: chan_agent.c:2290 __login_exec: jpeeler: app_lock released by thread 0xb52ebb90 [Oct 13 16:31:17] NOTICE[14451]: chan_agent.c:2275 __login_exec: jpeeler: waiting for app_lock [Oct 13 16:31:17] NOTICE[14451]: chan_agent.c:2280 __login_exec: jpeeler: app_lock obtained by thread 0xb52ebb90 By: Denis Galvao (denisgalvao) 2008-10-13 20:06:42 Same problem(deadlock) with Asterisk 1.4.21.2. The workaround is to originate a Local channel calling an application that executes the Dial to the Agent. The problem is the duplicated CDR that you must work with when you need the monitor filename of the recorded agent call. By: Denis Galvao (denisgalvao) 2008-10-13 22:04:20 jpeeler, I tested you branch but the problem still happens. Asterisk SVN-jpeeler-bug12269-r148288-/branches/1.4 built by root @ localhost on a i686 running Linux on 2008-10-14 02:56:39 UTC After the first Originate, deadlock. The *cli stop working. No more moh to the Agent. IgorG, could you confirm this solves your problem? By: Igor Goncharovsky (igorg) 2008-10-13 22:29:35 Jpeeler's branch fix this issue for me. I have done more then 10 originate without locking CLI and hearing MoH on Agent side. I think you must as Corydon76 asks recompile Asterisk with DEBUG_THREADS, obtain a 'core show locks' and CLI output. By: Denis Galvao (denisgalvao) 2008-10-14 08:24:55 Compiled with DEBUG_THREADS, but core show locks didn't work because *cli get frozen. I got a "core show locks" just after Originate and right before doing a *cli>show agents. Uploaded file. By: Denis Galvao (denisgalvao) 2008-10-14 08:27:24 Please, delete: jpeeler_branch-lock.rtf The right file in txt format is: jpeeler-lock.txt By: Jeff Peeler (jpeeler) 2008-10-14 19:02:40 Yes, I see what the problem is. Will update once I figure out a solution. By: Denis Galvao (denisgalvao) 2008-10-14 19:18:31 jpeeler, tell me if you need something else. By: Jeff Peeler (jpeeler) 2008-10-15 15:33:11 Branch updated, svn up and let me know how it goes. By: Denis Galvao (denisgalvao) 2008-10-15 16:14:39 That's it! Some superficial tests reports me no problems. Let me do some stress tests by tomorrow, ok? Thanks. By: Jeff Peeler (jpeeler) 2008-10-15 16:20:55 Yes, I very much appreciate the additional testing. By: Denis Galvao (denisgalvao) 2008-10-17 15:18:48 A lot of testing report us that it is working as expected. We can close it. Im not familiar with but, will http manager get the same behavior? By: Denis Galvao (denisgalvao) 2008-10-20 08:32:10 Something strange here. After using this branch the deadlock has gone, but we are facing some wrong behavior when showing the "show queues" output. There is an Agent that is logged in, but not talking to anybody. Asterisk give to me a wrong information about his state, the output show me (In Use) but should be (Not in Use). File wrong-output.txt uploaded with *cli output. By: Jeff Peeler (jpeeler) 2008-10-20 18:31:37 I have not been able to get the agent status to report wrong information. I have, however, been able to get the call count to not increment when using originate. So I'll start looking into that. If you know what you did to cause the agent status to not be reported accurately, describe what you did. I haven't looked at the code yet, but these two problems are probably directly related. By: Jeff Peeler (jpeeler) 2008-10-21 14:20:35 Actually the call counts shouldn't be incremented as the count is for received calls only. By: Denis Galvao (denisgalvao) 2008-10-21 14:22:51 Unfortunetly I can not reproduce the problem. We are investigating. Based on the source code, could this problem occur? By: Denis Galvao (denisgalvao) 2008-10-24 07:53:19 jpeeler, should I use your branch in a production platform? There is more than just this bug fixed there? By: Denis Galvao (denisgalvao) 2008-10-24 08:45:48 And, what you guys need to commit this out? By: Jeff Peeler (jpeeler) 2008-10-24 13:57:03 denisgalvao: I can't recommend putting this code into production use yet. I'm waiting for more testing before committing. I'd like to make sure that other parts of the system that interoperate with chan_agent aren't adversely effected (such as queues). The branch only fixes this bug though and nothing else. By: Marcos Jose Setim (msetim) 2008-10-24 14:45:12 jpeeler, Is there some way to help you to test the problem of call count increment when using originate? Where can we check if it is receving and wrong increment/decrement? By: Denis Galvao (denisgalvao) 2008-10-30 11:03:42 jpeeler, we are doing tests here, but we can't reproduce the problem, it just happens. Happened two times today. The agent hangup(pressing *) he get the MOH but his status still show "in use". What we should provide you to get this problem solved? By: Marcos Jose Setim (msetim) 2008-10-31 09:54:40 Hi jpeeler, I'm doing tests trying to find out how to reproduce this issue. We can't find out a way to do it always (it's yet is randomly), however I found a interesting thing that can help with issue: First the configuration eventqueuestatus was changed to yes in queue.conf. Then start listening manager ("telnet localhost 5038") to get events of an agent answer. After it we do a lot of calls by this way: 1. Call to queue 2. Agent answered 3. Call was hangup (by pressing *) 4. Execute "show queues" We get the output of manager of "Wrong Member Queues Status(show queues)" and (Correct Member Queues Status). Randomly after hangup a call by pressing * the status is not correctly: Event: QueueMemberStatus Privilege: agent,all Timestamp: 1225390283.359228 Queue: 800 Location: agent/605 MemberName: agent/605 Membership: static Penalty: 0 CallsTaken: 27 LastCall: 1225390283 Status: 2 Paused: 0 Status 2: It's not right. We are expeting Status 1. Looking for code that print this event (QueueMemberStatus ) in app_queue.c the "Status: X" came from devicestate... Can exist some code don't be changing the status of devicestate correctly? I'm attaching two output to you analyze. By: Jeff Peeler (jpeeler) 2008-10-31 13:49:08 msetim: I'll start wrapping my head around the device state code. I would like to confirm that this doesn't happen outside of this branch to ensure this is something that is caused by the changes in this branch. Also, the call counts are as far as I can tell correct. By: Marcos Jose Setim (msetim) 2008-10-31 14:39:02 jpeller: I'm almost sure that it doesn't occur outside of this branch, but to have dead certainty I will make same test in last 1.4.x version. As soon as possible I will post the result. By: Jeff Peeler (jpeeler) 2008-11-04 15:16:43.000-0600 I talked to denisgalvao on IRC today and told him this: What I meant is to test against the latest 1.4 in svn, not the latest release. Although it's extremely unlikely, I'd like to make sure the agent status bug is not a separate issue. By: Marcos Jose Setim (msetim) 2008-11-06 11:25:22.000-0600 Hi jpeeler, I have news. We make test against last 1.4 branch (SVN-branch-1.4-r154685) and issue is happening there too. I guess We need open a new ticket for it. What do you think? By: Jeff Peeler (jpeeler) 2008-11-06 11:54:29.000-0600 Yes, go ahead. I'll start working to get this merged since there don't seem to be any problems with it. By: Denis Galvao (denisgalvao) 2008-11-24 14:34:28.000-0600 The deadlock problem has been solved using the jpeeleer branch, but we are facing problems regarding devicestate. We backported the code to 1.4.20.1, that seems to be ok with devicestate, and the problem has also been solved. Not sure how to proceed. Close this one and create/monitor another one? By: Jeff Peeler (jpeeler) 2008-11-24 15:36:30.000-0600 I'm waiting for the code to be reviewed as the changes are substantial, despite the small amount of code changed. The device state issues though should have been reported in another bug. Somebody else may have reported it by now. By: Leif Madsen (lmadsen) 2009-01-08 13:54:29.000-0600 msetim: the files you have uploaded to this bug, and 13860 need to be re-uploaded. They are neither code nor documentation submissions; they are debugging files. Thus they do not need to be marked as code, since we cannot access them since you have no license on file. We cannot unmark them as code, thus they are lost in the vault for no one to see. By: Mark Michelson (mmichelson) 2009-01-08 14:16:21.000-0600 blitzrage, I have a feeling this is actually some sort of Mantis error. There's another issue I'm working on where the reporter's uploads are not viewable for licensing reasons. I made the same assumption that the checkbox for code submission was checked, but this issue is leading me to believe that is not the case. The reason is that you can tell from previous comments that the older submissions *used* to be viewable, but now are not. By: Leif Madsen (lmadsen) 2009-01-09 12:42:49.000-0600 putnopvut: yes you are right. It seems it was a mantis bug, but this was the first time I had seen this. It appears to be fixed now. Sorry for the noise! By: Digium Subversion (svnbot) 2009-01-12 14:26:23.000-0600 Repository: asterisk Revision: 168507 U branches/1.4/channels/chan_agent.c ------------------------------------------------------------------------ r168507 | jpeeler | 2009-01-12 14:26:22 -0600 (Mon, 12 Jan 2009) | 9 lines (closes issue ASTERISK-11692) Reported by: IgorG Tested by: denisgalvao This gits rid of the notion of an owning_app allowing the request and hangup to be initiated by different threads. Originating from an active agent channel requires this. The implementation primarily changes __login_exec to wait on a condition variable rather than a lock. Review: http://reviewboard.digium.com/r/35/ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=168507 By: Digium Subversion (svnbot) 2009-01-12 14:53:05.000-0600 Repository: asterisk Revision: 168508 _U trunk/ U trunk/channels/chan_agent.c ------------------------------------------------------------------------ r168508 | jpeeler | 2009-01-12 14:53:05 -0600 (Mon, 12 Jan 2009) | 15 lines Merged revisions 168507 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r168507 | jpeeler | 2009-01-12 14:26:22 -0600 (Mon, 12 Jan 2009) | 9 lines (closes issue ASTERISK-11692) Reported by: IgorG Tested by: denisgalvao This gits rid of the notion of an owning_app allowing the request and hangup to be initiated by different threads. Originating from an active agent channel requires this. The implementation primarily changes __login_exec to wait on a condition variable rather than a lock. Review: http://reviewboard.digium.com/r/35/ ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=168508 By: Digium Subversion (svnbot) 2009-01-12 15:01:56.000-0600 Repository: asterisk Revision: 168509 _U branches/1.6.0/ U branches/1.6.0/channels/chan_agent.c ------------------------------------------------------------------------ r168509 | jpeeler | 2009-01-12 15:01:56 -0600 (Mon, 12 Jan 2009) | 22 lines Merged revisions 168508 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ................ r168508 | jpeeler | 2009-01-12 14:53:04 -0600 (Mon, 12 Jan 2009) | 15 lines Merged revisions 168507 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r168507 | jpeeler | 2009-01-12 14:26:22 -0600 (Mon, 12 Jan 2009) | 9 lines (closes issue ASTERISK-11692) Reported by: IgorG Tested by: denisgalvao This gits rid of the notion of an owning_app allowing the request and hangup to be initiated by different threads. Originating from an active agent channel requires this. The implementation primarily changes __login_exec to wait on a condition variable rather than a lock. Review: http://reviewboard.digium.com/r/35/ ........ ................ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=168509 By: Digium Subversion (svnbot) 2009-01-12 15:02:33.000-0600 Repository: asterisk Revision: 168510 _U branches/1.6.1/ U branches/1.6.1/channels/chan_agent.c ------------------------------------------------------------------------ r168510 | jpeeler | 2009-01-12 15:02:32 -0600 (Mon, 12 Jan 2009) | 22 lines Merged revisions 168508 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ................ r168508 | jpeeler | 2009-01-12 14:53:04 -0600 (Mon, 12 Jan 2009) | 15 lines Merged revisions 168507 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r168507 | jpeeler | 2009-01-12 14:26:22 -0600 (Mon, 12 Jan 2009) | 9 lines (closes issue ASTERISK-11692) Reported by: IgorG Tested by: denisgalvao This gits rid of the notion of an owning_app allowing the request and hangup to be initiated by different threads. Originating from an active agent channel requires this. The implementation primarily changes __login_exec to wait on a condition variable rather than a lock. Review: http://reviewboard.digium.com/r/35/ ........ ................ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=168510 |