Summary: ASTERISK-25321: [patch] Deadlock in ChanSpy with call over Local channel
Reporter: Filip Frank (frenk77)
Labels:
Date Opened: 2015-08-13 06:21:13
Date Closed: 2016-03-14 11:19:55
Priority: Critical
Regression?:
Status: Closed/Complete
Components: Applications/app_chanspy
Versions: 11.16.0 11.18.0
Frequency of Occurrence:
Related Issues:
Environment: custom CentOS 6 based distro
Attachments:
( 0) asterisk.log_19082015_1258
( 1) ASTERISK-25321_retry_chan_lock_until_stable.patch
( 2) backtrace-threads-ffr19082015_1258.txt
( 3) corelocks_ffr19082015_1258.txt
( 4) spydeadlock.patch
Description: We have a problem with random deadlocks when using ChanSpy on SIP channels, dialing via AMI Originate to a Local channel, which Dials another Local channel, which then Dials a SIP peer.
Example:
1. SIP/iptel205 doing ChanSpy(SIP/iptel210)
2. AMI Originate Local/210@dialer
3. Dial(Local/210@internal)
4. Dial(SIP/iptel210)
5. Answer SIP/iptel210
6. Start dialing the caller from the Originate, e.g. 00420591122223@outgoning

Here is part of the backtrace from the core dump; I obtained it with gcore while Asterisk was deadlocked.

[Edit by Rusty - removed excessive debug from description. Please attach debug and annotated files to the issue with More > Attach Files]
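For reference, the call flow in the steps above could be approximated with a dialplan fragment like the following. This is only a sketch: the context names {{dialer}} and {{internal}} are taken from the steps, but the actual dialplan was not attached to the issue, and the spy extension number (555) is invented for illustration.

```
; hypothetical extensions.conf fragment reconstructing the reported call flow
[dialer]
exten => 210,1,Dial(Local/210@internal)

[internal]
exten => 210,1,Dial(SIP/iptel210)

; SIP/iptel205 dials this (hypothetical) extension to spy on iptel210
exten => 555,1,ChanSpy(SIP/iptel210)
```

The call is then kicked off by an AMI Originate to Local/210@dialer while the spy on SIP/iptel210 is active.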
Comments:

By: Asterisk Team (asteriskteam) 2015-08-13 06:21:14.759-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Filip Frank (frenk77) 2015-08-13 06:48:45.031-0500

Please reopen; I closed it by a click typo.

By: Asterisk Team (asteriskteam) 2015-08-13 06:48:45.851-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Rusty Newton (rnewton) 2015-08-13 09:22:03.204-0500

Removed excessive debug from description. Please attach debug and annotated files to the issue with More > Attach Files. Please read: https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

Please also provide (attach) the output described in the [Getting a Backtrace|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock] section relevant to deadlocks.

By: Rusty Newton (rnewton) 2015-08-17 17:40:02.211-0500

Many of the values in the trace are optimized out ("<value optimized out>"), so it looks like you didn't follow the instructions. Putting this in waiting-for-feedback to await new traces. Traces with optimized-out values are often not helpful for the developers. Thanks!

By: Rusty Newton (rnewton) 2015-08-17 17:40:32.792-0500

In addition to the above comment, please provide the "core show locks" output as described in the wiki article.
By: Filip Frank (frenk77) 2015-08-18 02:18:22.855-0500

I know, but I can't get it right now. Our customer is very angry about these deadlocks, so for now he has stopped using spying to avoid them. I'm trying to reproduce this on our testing PBX, but the occurrence is very low; it may depend on other factors such as the OS and network load (more time spent in the critical section?). When I am able to reproduce it, I will attach a better backtrace, but for now this is the best I have.

By: Rusty Newton (rnewton) 2015-08-18 14:31:31.541-0500

We can't investigate further with this debug, so we'll go ahead and close this out for now. In the near future, when you have additional non-optimized traces and 'core show locks' output and can capture correlating debug logs, you will want to open a new issue and we can move on from there. If you can additionally narrow down how to reproduce the issue in a lab environment in a reliable (or almost reliable) fashion, that would help tremendously. Thanks!

By: Filip Frank (frenk77) 2015-08-18 15:00:50.515-0500

It seems I found a way to reproduce it in my lab today, so now I know how. I am going to try it with an unoptimized build with lock debugging enabled. Please give me some time; I need thousands of calls to reproduce this deadlock.

By: Asterisk Team (asteriskteam) 2015-08-18 15:00:50.962-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Filip Frank (frenk77) 2015-08-19 06:15:29.934-0500

Backtrace of threads added.

By: Filip Frank (frenk77) 2015-08-19 06:15:56.715-0500

Last 10000 entries of the log added.

By: Filip Frank (frenk77) 2015-08-19 06:16:41.136-0500

"core show locks" output after gcore added.

By: Filip Frank (frenk77) 2015-08-19 06:18:23.064-0500

I attached a new thread backtrace, log, and locks output from my lab Asterisk 11.16.0.

By: Filip Frank (frenk77) 2015-08-20 07:21:48.943-0500

I analyzed the backtrace and the lock log.
I found the problem in app_chanspy.c, around line 572:

{code}
ast_channel_lock(spyee_autochan->chan);
name = ast_strdupa(ast_channel_name(spyee_autochan->chan));
ast_channel_unlock(spyee_autochan->chan);
{code}

It locks the channel, gets the name, and unlocks. But I see that the thread holding the lock is past this code, so I think chan in the spyee_autochan structure was changed between the lock and the unlock. I wrote a patch, spydeadlock.patch. With this patch the deadlock problems seem to be solved: I tried about 280 thousand calls without an occurrence. We are now testing it on the customer's PBX. But I don't know whether the patch is correct; maybe a better way is to find which code changes spyee_autochan->chan and add a check for the lock.

By: Rusty Newton (rnewton) 2015-08-21 08:56:02.695-0500

Thanks for looking into this further and providing the results of your investigation. You will want to get that patch up on Gerrit for review: https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

Other Asterisk community members will review the patch and help you decide whether the way you fixed it is the best way. Thanks!

By: Filip Frank (frenk77) 2015-09-21 09:02:17.389-0500

My patch wasn't accepted because it doesn't solve the root cause; there are more places in the core code where a deadlock can occur because the channel is changed without holding the lock. So I started further investigation. I think it's in main/autochan.c, in ast_autochan_new_channel, at these lines:

{code}
autochan->chan = ast_channel_unref(old_chan);
autochan->chan = ast_channel_ref(new_chan);
{code}

called from ast_do_masquerade in channel.c. I think this replaces chan in the autochan structure, and there is no locking or lock check for it. Can anyone tell me whether I am on the right track?

By: Joshua C. Colp (jcolp) 2015-09-21 09:21:55.535-0500

Both channels are locked during that operation.
By: Jesper (ib2) 2016-01-19 05:32:14.416-0600

Still critical on Asterisk 11.20.0 and 100% reproducible:

ChanSpy SIP/206 on SIP/402
CLI> originate Local/402@internal extension 80808080@external
while SIP/402 is ringing, hang up ChanSpy SIP/206
answer SIP/402
Local/402 dials 80808080
external answers 80808080
hang up SIP/402

The same issue occurs with an AMI originate. A fix would be appreciated.

By: Dirk Wendland (kesselklopfer79) 2016-02-24 04:33:01.078-0600

Hello all, same issue here on Asterisk 11.21.2, with the same call flow as in Jesper's post.

{NOFORMAT}
(gdb) thread 209
[Switching to thread 209 (Thread 0x7f1aec3ef700 (LWP 9019))]
#0  0x0000003a66e0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) f 2
#2  0x0000003a66e094d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) info reg
rax            0xfffffffffffffe00   -512
rbx            0x7f1b2402bbf8       139754545003512
rcx            0xffffffffffffffff   -1
rdx            0x5b23e0             5972960
rsi            0x80                 128
rdi            0x7f1b2402bba0       139754545003424
rbp            0x5bc18e             0x5bc18e
rsp            0x7f1aec3ee1c8       0x7f1aec3ee1c8
r8             0x7f1b2402bba0       139754545003424
r9             0x233b               9019
r10            0x3f6                1014
r11            0x246                582
r12            0x7f1aec3ee5f0       139753609422320
r13            0x0                  0
r14            0x7f1b28002020       139754611941408
r15            0x1                  1
rip            0x3a66e094d7         0x3a66e094d7 <pthread_mutex_lock+103>
eflags         0x246                [ PF ZF IF ]
cs             0x33                 51
ss             0x2b                 43
ds             0x0                  0
es             0x0                  0
fs             0x0                  0
gs             0x0                  0
(gdb) p *(pthread_mutex_t*) 0x7f1b2402bba0
$13 = {__data = {__lock = 2, __count = 1, __owner = 9058, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = "\002\000\000\000\001\000\000\000b#\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
{NOFORMAT}

The thread that owns the mutex (9058) is already dead, which causes the deadlock, so it looks like a lock without a matching unlock.

By: Dirk Wendland (kesselklopfer79) 2016-02-29 03:58:03.142-0600

The main problem is in ast_autochan_destroy. I created log output by adding logging in this method.
I added a parameter as a workaround for the deadlock issue:

{code}
autochan_destroy(struct ast_autochan *autochan, struct ast_channel *ast_autochan_chan)
{code}

By locking on struct ast_channel *ast_autochan_chan, the deadlock seems to be gone, but it's only a workaround; I will try to get this properly working in the next few days. I think the channel structure is replaced during the answer of the local channel:

{NOFORMAT}
References struct ast_channel *ast_autochan_chan 0x7fbec0006ad8
References ->chan 0x7fbec0006ad8
ast_channel_lock(autochan->chan);
References struct ast_channel *ast_autochan_chan 0x7fbec0006ad8
References ->chan 0x7fbebc00c138
{NOFORMAT}

The old version in autochan.c:

{NOFORMAT}
void ast_autochan_destroy(struct ast_autochan *autochan)
{
	struct ast_autochan *autochan_iter;

	ast_channel_lock(autochan->chan);
	AST_LIST_TRAVERSE_SAFE_BEGIN(ast_channel_autochans(autochan->chan), autochan_iter, list) {
		if (autochan_iter == autochan) {
			AST_LIST_REMOVE_CURRENT(list);
			ast_debug(1, "Removed autochan %p from the list, about to free it\n", autochan);
			break;
		}
	}
	AST_LIST_TRAVERSE_SAFE_END;
	ast_channel_unlock(autochan->chan);

	autochan->chan = ast_channel_unref(autochan->chan);

	ast_free(autochan);
}
{NOFORMAT}

I hope this helps a little with understanding and finding a resolution for the problem, and maybe serves as a workaround for Jesper.

By: Walter Doekes (wdoekes) 2016-03-03 14:54:12.006-0600

So, after the very helpful comments from hexanol on IRC, I came up with this not-so-pretty fix in ASTERISK-25321_retry_chan_lock_until_stable.patch.

{quote}
20:42 < hexanol> the ast_channel_lock will wait until the thread executing ast_do_masquerade unlocks the channel
20:43 < hexanol> the problem is, the pointer to the channel (spyee_autochan->chan) has already been evaluated before ast_channel_lock returns
20:43 < hexanol> so ast_channel_lock is called with a pointer X
20:43 < hexanol> and ast_channel_unlock is called with a pointer Y (X != Y)
20:57 < wdoekes> you may be right. going slow here.
spyee_autochan->chan is X, we fetch that, at that point it gets swapped, and *then* we lock(X), which should've been Y. When unlocking, it has in the meantime become Y, so we unlock the wrong thing.
{quote}

Compile-tested only thus far.

By: Walter Doekes (wdoekes) 2016-03-04 08:37:38.637-0600

{quote}
14:20 < hexanol> wdoekes: so I just did a few tests with a patched Asterisk 11, and before the patch, I was able to make it freeze systematically (by adding a sleep(1) in ast_do_masquerade, just before the call to ast_autochan_new_channel), and once the patch is applied, I can't reproduce it
{quote}

So, the patch works as advertised. That leaves us with the possibility of the old channel being unreffed to 0 after the unref in ast_autochan_new_channel. ast_autochan_new_channel is called (only) in ast_do_masquerade, where the argument to ast_do_masquerade is the channel that would disappear.

{code}
int ast_do_masquerade(struct ast_channel *original)
...
	ast_channel_lock(original);
...
	/* Bump the refs to ensure that they won't dissapear on us. */
	ast_channel_ref(original);
	ast_channel_ref(clonechan);
...
	ast_channel_unlock(original);
...
	ast_channel_lock_both(original, clonechan);
...
	ast_autochan_new_channel(clonechan, original);
...
	ast_channel_unlock(original);
	ast_channel_unlock(clonechan);
...
	ast_channel_lock(original);
...
	ast_channel_unlock(original);
...
	ast_channel_unref(original);
	ast_channel_unref(clonechan);
{code}

During {{ast_autochan_new_channel(clonechan, original)}}, the clonechan, which was the previous autochan->chan, will be unreffed once. Then, at the end of ast_do_masquerade, it will be unreffed again ({{ast_channel_unref(clonechan)}}). The comments seem to suggest that the unref of clonechan could very well be the last reference to it, causing it to be destroyed. That means there is a tiny chance of ast_channel_lock() (through ast_autochan_channel_lock) being called on a channel that was just destroyed.
But the chance seems slim, and I'd be willing to delegate that problem to a different issue/ticket.