Summary: ASTERISK-25321: [patch] Deadlock in ChanSpy with call over Local channel
Reporter: Filip Frank (frenk77)
Labels:
Date Opened: 2015-08-13 06:21:13
Date Closed: 2016-03-14 11:19:55
Priority: Critical
Regression?:
Status: Closed/Complete
Components: Applications/app_chanspy
Versions: 11.16.0 11.18.0
Frequency of Occurrence:
Related Issues:
Environment: custom CentOS 6 based distro
Attachments:
( 0) asterisk.log_19082015_1258
( 1) ASTERISK-25321_retry_chan_lock_until_stable.patch
( 2) backtrace-threads-ffr19082015_1258.txt
( 3) corelocks_ffr19082015_1258.txt
( 4) spydeadlock.patch
Description: We have a problem with random deadlocks when using ChanSpy on SIP channels, dialing via AMI Originate to a Local channel, which Dials another Local channel, which then Dials a SIP peer.
Example:
1. SIP/iptel205 doing ChanSpy(SIP/iptel210)
2. AMI Originate Local/210@dialer
3. Dial(Local/210@internal)
4. Dial(SIP/iptel210)
5. Answer SIP/iptel210
6. Start dialing the caller from the Originate, e.g. 00420591122223@outgoning

Here is part of the backtrace from the core dump; I obtained it with gcore while Asterisk was deadlocked.

[Edit by Rusty - removed excessive debug from description. Please attach debug and annotated files to the issue with More > Attach Files]
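For reference, the call flow in the steps above could be approximated with a dialplan fragment like the following. This is only a sketch: the context names {{dialer}} and {{internal}} are taken from the steps, but the actual dialplan was not attached to the issue, and the spy extension number (555) is invented for illustration.

```
; hypothetical extensions.conf fragment reconstructing the reported call flow
[dialer]
exten => 210,1,Dial(Local/210@internal)

[internal]
exten => 210,1,Dial(SIP/iptel210)

; SIP/iptel205 dials this (hypothetical) extension to spy on iptel210
exten => 555,1,ChanSpy(SIP/iptel210)
```

The call is then kicked off by an AMI Originate to Local/210@dialer while the spy on SIP/iptel210 is active.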
Comments:

By: Asterisk Team (asteriskteam) 2015-08-13 06:21:14.759-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Filip Frank (frenk77) 2015-08-13 06:48:45.031-0500

Please reopen; I closed it by a click typo.

By: Asterisk Team (asteriskteam) 2015-08-13 06:48:45.851-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Rusty Newton (rnewton) 2015-08-13 09:22:03.204-0500

Removed excessive debug from description. Please attach debug and annotated files to the issue with More > Attach Files. Please read: https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

Please also provide (attach) the output described in the [Getting a Backtrace|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock] section relevant to deadlocks.

By: Rusty Newton (rnewton) 2015-08-17 17:40:02.211-0500

Many of the values in the trace are optimized out ("<value optimized out>"), so it looks like you didn't follow the instructions. Putting this in waiting-for-feedback to await new traces. Traces with optimized-out values are often not helpful for the developers. Thanks!

By: Rusty Newton (rnewton) 2015-08-17 17:40:32.792-0500

In addition to the above comment, please provide the "core show locks" output as described in the wiki article.
By: Filip Frank (frenk77) 2015-08-18 02:18:22.855-0500

I know, but I can't get it right now. Our customer is very angry about these deadlocks, so for now he has stopped using spying to avoid them. I'm trying to reproduce this on our testing PBX, but the occurrence is very low; it may depend on other factors such as the OS and network load (more time spent in the critical section?). When I am able to reproduce it, I will attach a better backtrace, but for now this is the best I have.

By: Rusty Newton (rnewton) 2015-08-18 14:31:31.541-0500

We can't investigate further with this debug, so we'll go ahead and close this out for now. In the near future, when you have additional non-optimized traces and 'core show locks' output and can capture correlating debug logs, you will want to open a new issue and we can move on from there. If you can additionally narrow down how to reproduce the issue in a lab environment in a reliable (or almost reliable) fashion, that would help tremendously. Thanks!

By: Filip Frank (frenk77) 2015-08-18 15:00:50.515-0500

It seems I found a way to reproduce it in my lab today, so now I know how. I am going to try it with an unoptimized build with lock debugging enabled. Please give me some time; I need thousands of calls to reproduce this deadlock.

By: Asterisk Team (asteriskteam) 2015-08-18 15:00:50.962-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Filip Frank (frenk77) 2015-08-19 06:15:29.934-0500

Backtrace of threads added.

By: Filip Frank (frenk77) 2015-08-19 06:15:56.715-0500

Last 10000 entries of the log added.

By: Filip Frank (frenk77) 2015-08-19 06:16:41.136-0500

"core show locks" output after gcore added.

By: Filip Frank (frenk77) 2015-08-19 06:18:23.064-0500

I attached a new thread backtrace, log, and locks output from my lab Asterisk 11.16.0.

By: Filip Frank (frenk77) 2015-08-20 07:21:48.943-0500

I analyzed the backtrace and the lock log.
I found the problem in app_chanspy.c, around line 572:

{code}
ast_channel_lock(spyee_autochan->chan);
name = ast_strdupa(ast_channel_name(spyee_autochan->chan));
ast_channel_unlock(spyee_autochan->chan);
{code}

It locks the channel, gets the name, and unlocks. But I see that the thread holding the lock is past this code, so I think chan in the spyee_autochan structure was changed between the lock and the unlock. I wrote a patch, spydeadlock.patch. With this patch the deadlock problems seem to be solved: I tried about 280 thousand calls without an occurrence. We are now testing it on the customer's PBX. But I don't know whether the patch is correct; maybe a better way is to find which code changes spyee_autochan->chan and add a check for the lock.

By: Rusty Newton (rnewton) 2015-08-21 08:56:02.695-0500

Thanks for looking into this further and providing the results of your investigation. You will want to get that patch up on Gerrit for review: https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

Other Asterisk community members will review the patch and help you decide whether the way you fixed it is the best way. Thanks!

By: Filip Frank (frenk77) 2015-09-21 09:02:17.389-0500

My patch wasn't accepted because it doesn't solve the root cause; there are more places in the core code where a deadlock can occur because the channel is changed without holding the lock. So I started further investigation. I think it's in main/autochan.c, in ast_autochan_new_channel, at these lines:

{code}
autochan->chan = ast_channel_unref(old_chan);
autochan->chan = ast_channel_ref(new_chan);
{code}

called from ast_do_masquerade in channel.c. I think this replaces chan in the autochan structure, and there is no locking or lock check for it. Can anyone tell me whether I am on the right track?

By: Joshua C. Colp (jcolp) 2015-09-21 09:21:55.535-0500

Both channels are locked during that operation.
By: Jesper (ib2) 2016-01-19 05:32:14.416-0600

Still critical on Asterisk 11.20.0 and 100% reproducible:

ChanSpy SIP/206 on SIP/402
CLI> originate Local/402@internal extension 80808080@external
while SIP/402 is ringing, hang up ChanSpy SIP/206
answer SIP/402
Local/402 dials 80808080
external answers 80808080
hang up SIP/402

The same issue occurs with an AMI originate. A fix would be appreciated.

By: Dirk Wendland (kesselklopfer79) 2016-02-24 04:33:01.078-0600

Hello all, same issue here on Asterisk 11.21.2, with the same call flow as in Jesper's post.

{NOFORMAT}
(gdb) thread 209
[Switching to thread 209 (Thread 0x7f1aec3ef700 (LWP 9019))]
#0  0x0000003a66e0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) f 2
#2  0x0000003a66e094d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) info reg
rax            0xfffffffffffffe00   -512
rbx            0x7f1b2402bbf8       139754545003512
rcx            0xffffffffffffffff   -1
rdx            0x5b23e0             5972960
rsi            0x80                 128
rdi            0x7f1b2402bba0       139754545003424
rbp            0x5bc18e             0x5bc18e
rsp            0x7f1aec3ee1c8       0x7f1aec3ee1c8
r8             0x7f1b2402bba0       139754545003424
r9             0x233b               9019
r10            0x3f6                1014
r11            0x246                582
r12            0x7f1aec3ee5f0       139753609422320
r13            0x0                  0
r14            0x7f1b28002020       139754611941408
r15            0x1                  1
rip            0x3a66e094d7         0x3a66e094d7 <pthread_mutex_lock+103>
eflags         0x246                [ PF ZF IF ]
cs             0x33                 51
ss             0x2b                 43
ds             0x0                  0
es             0x0                  0
fs             0x0                  0
gs             0x0                  0
(gdb) p *(pthread_mutex_t*) 0x7f1b2402bba0
$13 = {__data = {__lock = 2, __count = 1, __owner = 9058, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = "\002\000\000\000\001\000\000\000b#\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
{NOFORMAT}

The thread that owns the mutex (9058) is already dead, which causes the deadlock, so it looks like a lock without a matching unlock.

By: Dirk Wendland (kesselklopfer79) 2016-02-29 03:58:03.142-0600

The main problem is in ast_autochan_destroy. I created log output by adding logging in this method.
I added a parameter as a workaround for the deadlock issue:

{code}
autochan_destroy(struct ast_autochan *autochan, struct ast_channel *ast_autochan_chan)
{code}

By locking on struct ast_channel *ast_autochan_chan, the deadlock seems to be gone, but it's only a workaround; I will try to get this properly working in the next few days. I think the channel structure is replaced during the answer of the local channel:

{NOFORMAT}
References struct ast_channel *ast_autochan_chan 0x7fbec0006ad8
References ->chan 0x7fbec0006ad8
ast_channel_lock(autochan->chan);
References struct ast_channel *ast_autochan_chan 0x7fbec0006ad8
References ->chan 0x7fbebc00c138
{NOFORMAT}

The old version in autochan.c:

{NOFORMAT}
void ast_autochan_destroy(struct ast_autochan *autochan)
{
	struct ast_autochan *autochan_iter;

	ast_channel_lock(autochan->chan);
	AST_LIST_TRAVERSE_SAFE_BEGIN(ast_channel_autochans(autochan->chan), autochan_iter, list) {
		if (autochan_iter == autochan) {
			AST_LIST_REMOVE_CURRENT(list);
			ast_debug(1, "Removed autochan %p from the list, about to free it\n", autochan);
			break;
		}
	}
	AST_LIST_TRAVERSE_SAFE_END;
	ast_channel_unlock(autochan->chan);

	autochan->chan = ast_channel_unref(autochan->chan);

	ast_free(autochan);
}
{NOFORMAT}

I hope this helps a little with understanding and finding a resolution for the problem, and maybe serves as a workaround for Jesper.

By: Walter Doekes (wdoekes) 2016-03-03 14:54:12.006-0600

So, after the very helpful comments from hexanol on IRC, I came up with this not-so-pretty fix in ASTERISK-25321_retry_chan_lock_until_stable.patch.

{quote}
20:42 < hexanol> the ast_channel_lock will wait until the thread executing ast_do_masquerade unlocks the channel
20:43 < hexanol> the problem is, the pointer to the channel (spyee_autochan->chan) has already been evaluated before ast_channel_lock returns
20:43 < hexanol> so ast_channel_lock is called with a pointer X
20:43 < hexanol> and ast_channel_unlock is called with a pointer Y (X != Y)
20:57 < wdoekes> you may be right. going slow here.
spyee_autochan->chan is X, we fetch that, at that point it gets swapped, and *then* we lock(X), which should've been Y. When unlocking, it has in the meantime become Y, so we unlock the wrong thing.
{quote}

Compile-tested only thus far.

By: Walter Doekes (wdoekes) 2016-03-04 08:37:38.637-0600

{quote}
14:20 < hexanol> wdoekes: so I just did a few tests with a patched Asterisk 11, and before the patch, I was able to make it freeze systematically (by adding a sleep(1) in ast_do_masquerade, just before the call to ast_autochan_new_channel), and once the patch is applied, I can't reproduce it
{quote}

So, the patch works as advertised. That leaves us with the possibility of the old channel being unreffed to 0 after the unref in ast_autochan_new_channel. ast_autochan_new_channel is called (only) in ast_do_masquerade, where the argument to ast_do_masquerade is the channel that would disappear.

{code}
int ast_do_masquerade(struct ast_channel *original)
...
	ast_channel_lock(original);
...
	/* Bump the refs to ensure that they won't dissapear on us. */
	ast_channel_ref(original);
	ast_channel_ref(clonechan);
...
	ast_channel_unlock(original);
...
	ast_channel_lock_both(original, clonechan);
...
	ast_autochan_new_channel(clonechan, original);
...
	ast_channel_unlock(original);
	ast_channel_unlock(clonechan);
...
	ast_channel_lock(original);
...
	ast_channel_unlock(original);
...
	ast_channel_unref(original);
	ast_channel_unref(clonechan);
{code}

During {{ast_autochan_new_channel(clonechan, original)}}, the clonechan, which was the previous autochan->chan, will be unreffed once. Then, at the end of ast_do_masquerade, it will be unreffed again ({{ast_channel_unref(clonechan)}}). The comments seem to suggest that the unref of clonechan could very well be the last reference to it, causing it to be destroyed. That means there is a tiny chance of ast_channel_lock() (through ast_autochan_channel_lock) being called on a channel that was just destroyed.
But the chance seems slim, and I'd be willing to delegate that problem to a different issue/ticket.