[Home]

Summary:ASTERISK-12294: chanspy - crashes Asterisk - still
Reporter:Marcin Kowalczyk (kowalma)Labels:
Date Opened:2008-07-01 09:28:58Date Closed:2008-09-10 12:49:02
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Applications/app_chanspy
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) bt-full-12969.txt
Description:Problem is same as in issue 11782. Chanspy crashes asterisk. I'll recompile asterisk to fit settings decribed in docs/valgrind.txt
Comments:By: Marcin Kowalczyk (kowalma) 2008-07-01 09:48:58

My dialplan is following:

exten => _62XXXX,1,Answer()
exten => _62XXXX,n,ChanSpy(|qv(0)g(${EXTEN:2}))

As soon asterisk crashes again I'll start valgrind & backtrace.

By: Marcin Kowalczyk (kowalma) 2008-07-01 11:59:56

sorry -asterisk version is 1.4.21.1 (released today).

By: Joshua C. Colp (jcolp) 2008-07-02 21:30:10

Once you do have the needed information please attach it so putnopvut can take a gander.

By: Marcin Kowalczyk (kowalma) 2008-07-08 11:20:06

For some reasons I had to upgrade to 1.6.0beta9
I've asked my coleauges to restest ChanSpy on this version.

By: Mark Michelson (mmichelson) 2008-07-16 16:22:28

I suppose that getting info for the crash in 1.4.21.1 is not possible, then. There isn't much difference between the chanspy code in 1.4.21.1 and in 1.6.0-beta9. If you experience a crash in 1.6, please upload a backtrace and valgrind log here.

If anyone else is experiencing a chanspy crash in 1.4.21.1, please speak up. This and issue ASTERISK-12347 are crashes reported in chanspy that I have been unable to reproduce. Both seem to point to some sort of subtle problem that needs a tool like valgrind to detect.

Thanks.

By: Tilghman Lesher (tilghman) 2008-07-31 12:46:59

kowalma: anything further to report since upgrading to 1.6.0beta9 relating to this issue?

By: callguy (callguy) 2008-08-14 22:44:47

putnopvut/corydon76: we just experienced a crash today that looks related to this. bt/bt full attached. let me know if there's any other information we can provide.

By: Mark Michelson (mmichelson) 2008-08-18 11:05:43

callguy, thanks for the backtrace. I've not ever seen a crash occur in that spot before.

There are two peculiarities with the backtrace provided. One is the actual location of the crash. Unless there is some subtle memory corruption that has never been caught before, an ast_log call with valid arguments should not cause a crash. In order to figure this one out, it would probably be quickest to use valgrind to determine what the error is.

The more peculiar thing is the spot in lock.h which was reached. This place is only reached if a thread attempts to destroy a lock which is still locked by another thread. This is an error condition that should not arise, and it may be the issue in chanspy which has been causing other crashes experienced by the original reporter.

By: Mark Michelson (mmichelson) 2008-08-19 13:10:26

I've had a look at the backtrace, and it appears that there is a race condition in  app_chanspy that can happen if the spyee and spyer hang up simultaneously.

There is a datastore used in app_chanspy which is attached to the spyee channel when the spyer starts spying. The datastore is removed and freed if either the spyee hangs up or if the spyer changes who he is spying on or hangs up himself. The data which is stored in this datastore is a local stacked structure in the spyer's thread. Part of this structure is a lock. This lock is initialized when the spyer starts the chanspy application and is destroyed when the spyer hangs up.

Now, this is an extremely tight race, but if my analysis is correct, it seems as though the following set of circumstances could occur if both spyee and spyer hang up at once.

1. ast_channel_free is called on the spyee's channel.
2. The chanspy datastore is removed from the spyee's channel in ast_channel_free.
3. In the spyer's thread, the spyer attempts to remove and destroy the datastore, but the datastore has already been removed from the spyee's channel in step 2, so the spyer continues in the code.
4. The spyee's thread continues and calls the datastore's destroy callback, which involves locking the lock I mentioned in the second paragraph.
5. Now the spyer attempts to destroy the lock. The problem is that in step 4, the spyee has locked this lock, meaning that the spyer is attempting to destroy a lock which is currently locked by another thread.

In analyzing this, I also noticed that perhaps this same situation could arise if the spyer hangs up at the same moment that the spyee is transferred by the far end. My analysis for this turned up a locking order issue which could actually cause a deadlock.

So, what needs to be done is that some of app_chanspy's locking needs to be re-examined in order to prevent both of the above scenarios from occurring, no matter how rare the races may be. The crash associated with the 5 step scenario I provided above can most likely be solved by placing a lock/unlock sequence prior to the destruction of the lock. The potential transfer deadlock, though, is another matter which will require a more thoughtful solution.

By: Digium Subversion (svnbot) 2008-08-19 13:42:14

Repository: asterisk
Revision: 138886

U   branches/1.4/apps/app_chanspy.c

------------------------------------------------------------------------
r138886 | mmichelson | 2008-08-19 13:42:13 -0500 (Tue, 19 Aug 2008) | 23 lines

Add a lock and unlock prior to the destruction of the chanspy_ds
lock to ensure that no other threads still have it locked. While
this should not happen under normal circumstances, it appears that
if the spyer and spyee hang up at nearly the same time, the following
may occur.

1. ast_channel_free is called on the spyee's channel.
2. The chanspy datastore is removed from the spyee's channel in
  ast_channel_free.
3. In the spyer's thread, the spyer attempts to remove and destroy the datastore
  from the spyee channel, but the datastore has already been removed in step 2,
  so the spyer continues in the code.
4. The spyee's thread continues and calls the datastore's destroy callback,
  chanspy_ds_destroy. This involves locking the chanspy_ds.
5. Now the spyer attempts to destroy the chanspy_ds lock. The problem is that in step 4,
  the spyee has locked this lock, meaning that the spyer is attempting to destroy a lock
  which is currently locked by another thread.

The backtrace provided in issue ASTERISK-12294 supports the idea that this is possible
(and has even occurred). This commit does not close the issue, but should help
in preventing one type of crash associated with the use of app_chanspy.


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=138886

By: Mark Michelson (mmichelson) 2008-08-21 14:27:51

kowalma: When Asterisk was crashing before, was it happening while someone was using ChanSpy but not actually spying on anyone? Russell and I solved a stack overflow in the code yesterday that would show itself when attempting to spy using the 'g' option to ChanSpy. If there were no channels in the spygroup specified, then the spyer would continuously loop. With each loop, a small bit of stack space would be allocated using an ast_strdupa call until eventually there was an overflow.

I noticed that you were using the 'g' option for ChanSpy and it made me think that this might be the same issue.

By: Leif Madsen (lmadsen) 2008-09-02 17:11:09

/ping!

Hoping the original reporter has some new information on this issue?

By: Jason Parker (jparker) 2008-09-10 12:48:57

In talking with putnopvut, he seems confident that the commit he made has fixed this issue.  Combined with the lack of response from the reporter, I am going to close this issue.

Please reopen if you are still experiencing this.