Summary: | ASTERISK-18101: Asterisk 1.8 Deadlock in app_queue | ||||||
Reporter: | Paul Rolfe (prolfe) | Labels: | |||||
Date Opened: | 2011-07-08 01:26:33 | Date Closed: | 2011-09-15 10:46:41 | ||||
Priority: | Critical | Regression? | |||||
Status: | Closed/Complete | Components: | Applications/app_queue | ||||
Versions: | 1.8.4 1.8.5.0 1.8.6.0 | Frequency of Occurrence | Constant | ||||
Related Issues: |
| ||||||
Environment: | Debian Squeeze - Linux pabx 2.6.35.11 #1 SMP Tue Feb 8 14:35:57 WST 2011 x86_64 GNU/Linux | Attachments: | ( 0) 1139_backtrace_threads.txt ( 1) 1139_core_show_locks.txt ( 2) 12740_backtrace_threads.txt ( 3) 12740_core_show_locks.txt ( 4) 13994_backtrace_threads.txt ( 5) 13994_core_show_locks.txt ( 6) 14276_backtrace_threads.txt ( 7) 14276_core_show_locks.txt ( 8) 25292_backtrace_threads.txt ( 9) 25292_core_show_locks.txt (10) 28050_backtrace_threads.txt (11) 28050_core_show_locks.txt (12) 32179_backtrace_threads.txt (13) 32179_core_show_locks.txt (14) 4619_backtrace_threads.txt (15) 4619_core_show_locks.txt (16) 4670_backtrace_threads.txt (17) 4670_core_show_locks.txt (18) 7646_backtrace_threads.txt (19) 7646_core_show_locks.txt (20) app_queue_0830.patch (21) bugASTERISK-18101.patch (22) bugASTERISK-18101-1.8.patch | ||||
Description: | Asterisk 1.8 SVN is deadlocking at random intervals. Have switched from timerfd timing to DAHDI, issue still occurs. The machine is a production PABX. We run approx 30 queues with about 60 agents logged in, and a large number of AMI manager connections from 2 webservers. Calls are frequently transferred between queues and agents, as well as having several channel spy sessions initiated from the webservers. Previously to this we were running 1.4.40 and had not had any deadlocking issues. We have been through versions 1.8.3, 1.8.4, 1.8.4.2, 1.8.4.3, 1.8.4.4 and all have exhibited similar deadlocking issues. This seems to occur on average about twice a day for normal load. On weekends and evenings which have significantly less load the problem does not appear to occur. | ||||||
Comments: | By: Paul Rolfe (prolfe) 2011-07-12 22:29:26.954-0500 Had a further lockup today. core show locks and backtrace attached. By: Paul Rolfe (prolfe) 2011-07-13 02:15:33.407-0500 And another just now. Debug info attached. By: Paul Rolfe (prolfe) 2011-07-13 22:21:46.490-0500 The system got very busy today, and consequently we had 4 deadlocks in a 30 minute perdiod. I have attached the debug files for all. We're going to have to revert to 1.4 until this issue is resolved. By: Paul Rolfe (prolfe) 2011-07-28 02:01:58.009-0500 Further lockup again today. Recent SVN updates seem to have reduced the frequency. By: Paul Rolfe (prolfe) 2011-08-03 22:32:22.845-0500 Frequency of this issue seems to have reduced overall, however it still occurs during periods of high load. Is this actively being looked at? By: Gregory Hinton Nietsky (irroot) 2011-08-05 03:25:09.100-0500 Have mostly got it gone since last week small patch and discussion on IRC but still having infrequent channels been orphaned By: Paul Rolfe (prolfe) 2011-08-05 03:48:06.584-0500 Great, Thanks. Had another one yesterday, but unfortunately I wasn't around to capture debug info, and Asterisk was just restarted. By: Paul Rolfe (prolfe) 2011-08-17 21:57:27.724-0500 The bulk of this issue seems to have been resolved. We are currently running revsion 330704 from SVN, and have a current uptime in production of 1 week, 6 days, 19 hours, 49 minutes, 13 seconds. This is more than 3 times our previous best uptime in production. By: Paul Rolfe (prolfe) 2011-08-25 00:35:22.333-0500 Update: We've had two lockups in the past 2 weeks. These have been the biggest uptimes between dealocks since rolling out 1.8. I've added the 'core show locks' and backtraces for these two lockups. Today's lockup was in SVN version 332432 Hope its all helping. By: Paul Rolfe (prolfe) 2011-08-25 00:37:19.559-0500 4670 - Lockup on 25 Aug. 4619 - Lockup on 19 Aug. By: Richard Mudgett (rmudgett) 2011-08-25 10:22:01.291-0500 The last two core show locks files indicate a locking order problem between the queues container lock, channels container lock, and a channel lock. By: Gregory Hinton Nietsky (irroot) 2011-08-30 05:01:37.657-0500 Based on the latest deadlocks here is a patch. 1)Lock queues in update_queue before iterating 2)queues is held when calling a channel in some cases and probably should not By: Gregory Hinton Nietsky (irroot) 2011-08-30 08:48:48.704-0500 Fixed stupid mistake in previous patch By: Gregory Hinton Nietsky (irroot) 2011-09-05 02:23:10.595-0500 Ok there is a new patch on the RB that cleans up the locking please try it out and feedback. By: Paul Rolfe (prolfe) 2011-09-05 19:49:51.977-0500 I have compiled and installed this patch into our production system. Determining success may be difficult as the mean time between deadlocks has increased over the last month, and I don't have a method to reproduce the problem at will. Thanks. By: Gregory Hinton Nietsky (irroot) 2011-09-06 00:51:34.099-0500 Thx paul indeed 1.8 is far more stable and the key is the do no harm principle as long as this patch does no harm [cause additional more frequent DL's] hope fully the mean time will increase further and any DL's will not be related to this issue. please post any further information here so we can attend to it. By: Kristijan Vrban (vrban) 2011-09-09 04:11:01.148-0500 perhaps i hade the same issue today. all calls into the queues just stopped. but asterisk itself was still running. all other call which where not going into a queue, where fine. the last output when a call was going into a queue: {noformat} -- Executing [*770002-K1386@main-functions:3] ExecIf("SIP/K138617D9D-00001215", "1?Queue(K1386-0002,,,,10)") in new stack [Sep 9 10:44:06] DEBUG[9823]: app_queue.c:5873 queue_exec: NO QUEUE_PRIO variable found. Using default. [Sep 9 10:44:06] DEBUG[9823]: app_queue.c:5923 queue_exec: queue: K1386-0002, options: , url: , announce: , expires: 1315557856, priority: 0 {noformat} and then nothing. i try the patch now. By: Gregory Hinton Nietsky (irroot) 2011-09-10 14:15:47.187-0500 commit 331774 resolves one of these problems r331774 | mnicholson | 2011-08-12 21:01:27 +0200 (Fri, 12 Aug 2011) | 11 lines Unlock the channel before calling update_queue. Holding the channel lock when calling update_queue which attempts to lock the queue lock can cause a deadlock. This deadlock involves the following chain: 1. hold chan lock -> wait queue lock 2. hold queue lock -> wait agent list lock 3. hold agent list lock -> wait chan list lock 4. hold chan list lock -> wait chan lock By: Gregory Hinton Nietsky (irroot) 2011-09-10 14:37:14.309-0500 It appears that once the queues lock is held no channels must be locked there is no reason in app_queue to ever currently lock the queues container outside of the ao2 API. working on a comprehensive patch to implement this the patch i have previously posted should circumvent the reported problems in this bug report however its not complete or correct. By: Gregory Hinton Nietsky (irroot) 2011-09-10 15:09:27.991-0500 new review patch posted this should be a blocker for 1.8.8 By: Gregory Hinton Nietsky (irroot) 2011-09-14 09:26:14.246-0500 New and hopefully last patch. By: Jason Legault (jlegault) 2011-09-14 12:50:18.807-0500 Tried new patch with a fresh 1.8.7.0-rc1 and got the following when compiling. app_queue.c: In function ‘reload_queues’: app_queue.c:6784: error: ‘OBJ_NOLOCK’ undeclared (first use in this function) app_queue.c:6784: error: (Each undeclared identifier is reported only once app_queue.c:6784: error: for each function it appears in.) make[1]: *** [app_queue.o] Error 1 make: *** [apps] Error 2 By: Gregory Hinton Nietsky (irroot) 2011-09-14 13:19:16.123-0500 Oops my bad this is a ver 10 compat issue im working on it By: Gregory Hinton Nietsky (irroot) 2011-09-14 13:37:58.350-0500 Ok here is a 1.8 version of the patch the 10/Trunk version takes advantage of OBJ_NOLOCK not available in 1.8 |