Summary: | ASTERISK-17867: [patch] Random entire asterisk deadlock when use builtin_atxfer when use res_timing_timerfd module | ||||||||
Reporter: | Igor Nikolaev (microlana) | Labels: | |||||||
Date Opened: | 2011-05-16 12:55:47 | Date Closed: | 2011-09-22 06:59:04 | ||||||
Priority: | Major | Regression? | No | ||||||
Status: | Closed/Complete | Components: | Resources/res_timing_timerfd | ||||||
Versions: | 1.8.4 | Frequency of Occurrence | |||||||
Related Issues: |
| ||||||||
Environment: | Attachments: | ( 0) gdb-asterisk-deadlock1.txt ( 1) gdb-asterisk-deadlock2.txt ( 2) gdb-asterisk-deadlock3.txt ( 3) res_timing_timerfd.c.diff ( 4) res_timing_timerfd.full.diff ( 5) res_timing_timerfd.optimized.diff ( 6) res_timinig_timerfd_read_deadlock.patch | |||||||
Description: | Random entire asterisk deadlock when use builtin_atxfer when use res_timing_timerfd module. In this case impossible to create new channels and/or destruct existing channels. ****** ADDITIONAL INFORMATION ****** Channels containers is locked into (example 2 and 3) __ao2_callback() at astobj2.c:769, while it waiting to lock some channel, which already locked by __ast_read() at channel.c:3661, while this channel is into infinity read loop on disarmed timerfd timer. | ||||||||
Comments: | By: Igor Nikolaev (microlana) 2011-05-16 12:59:15 To avoid this deadlock don't call a read() function into disarmed timer. Patch for this solution is attached into diff file By: Igor Nikolaev (microlana) 2011-05-16 13:02:35 <inline patch removed by lmadsen> By: Leif Madsen (lmadsen) 2011-05-16 14:53:00 You can't provide patches inline like that. I've removed it. Please wait for your license to be approved. Thanks! By: Lott Caskey (lottc) 2011-05-16 22:41:53 I have a situation matching this description. I am eager to see if your patch works. Willing to test as soon as license is submitted. By: Igor Nikolaev (microlana) 2011-05-17 00:12:41 While pending license you may disable or reorder res_timing_timerfd and use res_timing_dahdi timing source instead (not tested by me). By: Gregory Hinton Nietsky (irroot) 2011-05-17 03:36:50 ASTERISK-17407 is related By: Mark Murawski (kobaz) 2011-05-24 01:22:02 Patch is working well for me so far. I'm getting the disarmed timer warning quite often with one of my unit tests. I'll break it down into a small bit of dialplan so we can figure out what causes the problem in the first place. By: Mark Murawski (kobaz) 2011-05-24 01:31:56 By working well, I mean asterisk isn't crashing. But something is broken in handling a channel hangup and channels stick around longer than they should. And the stuck channels doesn't happen with dahdi timing. By: zvision (zvision) 2011-05-26 08:23:59 I still see a small window in the proposed patch, where the timer can be disarmed between a call to the timerfd_gettime and the read functions from another thread, as there is no serialization during timerfd_timer_ack. By: Michel Verbraak (astmiv) 2011-06-07 02:09:20.894-0500 Update patch which has right debug text and uses locking to prevent changes when reading. By: zvision (zvision) 2011-06-07 09:59:35.505-0500 I've attached another patch proposals. Both are similar to the existing ones, but use ast_poll before read to avoid a read block. The optimized version does not use container lookups and timer locks in order to make timer_ack call as fast as possible. Diffs are made from the 1.6.2 SVN By: zvision (zvision) 2011-06-08 08:35:57.967-0500 From my observations, using timerfd_gettime call and checking for no remaining time can give false warnings on a heavily loaded system. Sometimes timerfd_gettime returns 0 in the timerspec for already expired timers, if no read call has been made yet. If we use ao2_find to find the timer, the best way would be to check the saved_timer values for a disarmed timer. By: Mark Murawski (kobaz) 2011-06-27 10:56:29.086-0500 Out of the patches, which patch is proposed to be the 'official' patch? By: Gregory Hinton Nietsky (irroot) 2011-06-28 06:38:10.105-0500 been a few hours running the patch from RB with log messages commented out no ill effects so far ... By: David Vossel (dvossel) 2011-06-29 14:05:16.836-0500 https://reviewboard.asterisk.org/r/1255/ Fixed this. r325673 in 1.8 By: zvision (zvision) 2011-06-30 04:27:36.635-0500 I think the committed patch has two issues. The first one is not unreferencing a timer object after ao2_find, causing its reference count to increase each time timer_ack is called. The second one is a missing lock for the timer. It can happend that the timer will get disarmed after the check for disarmed condition and before the read call. Can anyone confirm I am right? Should I report it as a new issue? By: Michel Verbraak (astmiv) 2011-06-30 10:02:13.935-0500 If they use my patch "res_timinig_timerfd_read_deadlock.patch" Then both comments of zvision have been resolved. But this patch was not committed. By: Mark Murawski (kobaz) 2011-06-30 17:34:41.535-0500 I'll do testing with res_timinig_timerfd_read_deadlock.patch. Other testers. Please test with this patch and report your results. By: KevinH (kevinh001) 2011-06-30 19:54:57.645-0500 I just upgraded to the 1.8.5-rc1 for testing and the dial plan code below causes loop on "Wait", have to kill the Asterisk process to terminate the loop. 1.8.4.3 works fine. {quote} [Jun 30 13:08:00] VERBOSE[10836] pbx.c: == Starting DAHDI/1-1 at from-pstn,,1 failed so falling back to exten 's' [Jun 30 13:08:00] VERBOSE[10836] pbx.c: -- Executing [s@from-pstn:1] Answer("DAHDI/1-1", "") in new stack [Jun 30 13:08:00] VERBOSE[10836] pbx.c: -- Executing [s@from-pstn:2] Wait("DAHDI/1-1", "0.5") in new stack [Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd. [Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd. [Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd. [Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd. [Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd. just keeps going.... {quote} By: David Vossel (dvossel) 2011-07-06 10:31:40.961-0500 It has been discovered that the proposed fix for timerfd caused a serious regression and has been reverted. By: Leif Madsen (lmadsen) 2011-09-22 06:59:04.157-0500 res_timing_timerfd should now be fixed in Asterisk 1.8.7.0. Please test the current release candidate, and if you continue to have problems with res_timing_timerfd, please open a new issue. Thanks! |