[Home]

Summary:ASTERISK-05786: chanspy() in dialplan causing crash
Reporter:Mark Edwards (edwar64896)Labels:
Date Opened:2005-12-06 02:31:01.000-0600Date Closed:2005-12-16 18:57:11.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Applications/app_chanspy
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20051212__bug5941.diff.txt
( 1) bug-abnormal-log.txt
( 2) bug-normal-log.txt
( 3) channel.c.patch.7472.txt
( 4) chanspy-bug-backtrace.txt
( 5) chanspy-bug-backtrace-141205.txt
( 6) chanspy-bug-backtrace-curelm.txt
( 7) chanspy-bug-backtrace-curelm2+racefix.txt
( 8) chanspy-bug-gdbinfo-141205.txt
Description:I have implemented chanspy in my dialplan to allow supervisors to monitor outbound calling agents. Since upgrading from 1.2b1 to 1.2-stable-branch I have noticed a massive improvement in the audio quality emanating from chanspy as a result of upgrade, however we have now experienced two fatal server crashes as a result of using chanspy.

Chanspy appeared in the last entry in the logfile each time so I am making the assumption that something in apps/app_chanspy.c or perhaps channel.c is the culprit here.



****** ADDITIONAL INFORMATION ******

Consider as placeholder for now and I will endeavor to reproduce on test server, capture the BT and post here, although I havn't been able to do to yet and I have had to disable chanspy on production server to avoid crash scenario since upgrade.

Will endeavour to reproduce on test server and on production as circumstances permit.

Query association with 0005813?

mark.p.edwards@gmail.com
Comments:By: Clod Patry (junky) 2005-12-06 20:39:49.000-0600

we need a backtrace for that crash.
please see README.backtrace.

By: Mark Edwards (edwar64896) 2005-12-06 23:07:54.000-0600

Just as soon as I can negotiate with my client when it will be most convenient to crash his production system. ;-)

By: Clod Patry (junky) 2005-12-07 19:47:26.000-0600

u dont need to crash it...
it's to solve your issue.
When your asterisk will crash, u need to extract the info, on how and why your asterisk has crashed.

By: Russell Bryant (russell) 2005-12-08 13:30:48.000-0600

If you can figure out the necessary steps to reproduce the problem, we can get a backtrace when we reproduce the problem locally.

By: Mark Edwards (edwar64896) 2005-12-08 14:03:03.000-0600

As stated, it has been somewhat difficult to reproduce in the test environment but I'll endeavour to assist here.

Production environment consists of the following:

1. agent.conf contains 3-400 agents numbered sequentially. At any one time there is probably between 6-10 logged in at any one time.
2. Each agent has their own queue.
3. Calls are originated by the "dialer application" from a local channel (the queue) to an outbound number over an IAX2 channel to our ITSP.
4. Chanspy is configured to spy on the agent channels: Chanspy(Agent|q)

The crash has unfortunately been intermittent in production and I've been unable to reproduce in test to date. It did however happen twice within the period of an hour before I decided to disable Chanspy in the dialplan. Rock solid ever since.

If I recall correctly, the crash occurred at least once, when I was pressing the "*" key whilst chanspying  - thus initiating a scan and a change of target channel. The second crash was initiated by another party and I am unable to ascertain exactly what they were doing at the time. I can confirm they were chanspying though.

I have unsuccesfully attemped to reproduce locally, though chanspying to a SIP/ channel, therefore I suspect it might be more to do with the Agent channel being spied upon or perhaps the number of concurrent agents online. Outbound call durations are either quite short (10-15 seconds) or long - 10-15 mins. I suspect that when the supervisors are monitoring, there's a lot of 'scanning' going on if chanspy is active.

Happy to assist with providing information, and will negotiate with customer to capture BT if possible.



By: Mark Edwards (edwar64896) 2005-12-08 15:22:11.000-0600

Further Update.

Having managed to negotitate with customer a planned testing this morning, I was fortunately able to reproduce the issue twice. What really p'd me off was the fact that despite setting -g on the commandline and also setting ulimit to unlimited, I couldn't find the core dump anywhere at all! Not Happy. Oh well.

Steps to reproduce crash:

1. get 2 or 3 agents on calls at the same time as per previous note
2. Attempt to have two channels chanspying <b>at the same time</b>. I was on an IAX channel and the other supervisor was on one of the SIP extensions in the office. Crash was pretty easy to reproduce using this technique.

Now if I can just find where that little core file is hiding.....

By: Mark Edwards (edwar64896) 2005-12-08 17:30:16.000-0600

Found where ulimit -c was getting set <i>back</i> to 0, so hopefully next time we get a crash we'll get a core.

More update from client - they tried chanspying on their own and managed to crash the box. So, looks like this is not just limited to more than one chanspy, it can also occur with just one. Looks like though it's much easier to replicate when more than one person is chanspying at the same time.

If you need any more to go on, I'll do what I can to help.

cheers,

Mark.

By: Tilghman Lesher (tilghman) 2005-12-09 13:48:49.000-0600

I have an idea of what's probably wrong.  Here's a patch that applies to 1.2 which better avoids a race condition.  Please test and note whether it fixes the crash you're seeing.



By: Mark Edwards (edwar64896) 2005-12-10 04:58:35.000-0600

Finally managed to replicate the issue in test and also managed to get a core, so I have uploaded a backtrace as per request.

By: Mark Edwards (edwar64896) 2005-12-10 05:17:32.000-0600

Have tested 20051209__bug5941.diff patch in test environment.

Unfortunately, segfault remains.

I find it easiest to reproduce crash using the following technique:

1. set up chanspy thus: exten => 178,1,chanspy(SIP|q)
2. Dial 178 from two extensions
3. Dial a number from a third SIP extension
4. Hack away on the "*" key a few times on one of the extensions until Asterisk dies.

By: Tilghman Lesher (tilghman) 2005-12-10 07:30:11.000-0600

From gdb, enter the following commands and give us the output:

(gdb) p chan->spies
(gdb) p *(chan->spies)
(gdb) p *(chan->spies->list)

By: Mark Edwards (edwar64896) 2005-12-10 15:59:10.000-0600

(gdb) p chan->spies
$1 = (struct ast_channel_spy_list *) 0x815e960
(gdb) p *(chan->spies)
$2 = {read_translator = {last_format = 4, path = 0x8215910},
 write_translator = {last_format = 4, path = 0x815cbc8}, list = {
   first = 0xb77588b0, last = 0x0}}
(gdb) p *(chan->spies->list)
Structure has no component named operator*.
(gdb) p (chan->spies->list)
$3 = {first = 0xb77588b0, last = 0x0}

regards,

Mark.

By: Mark Edwards (edwar64896) 2005-12-10 16:18:01.000-0600

http://www.switchnet.com.au/cores/core.4713.copy.bz2

if you think it might help.
The core was taken pre the 20051209__bug5941.diff patch.

By: Tilghman Lesher (tilghman) 2005-12-10 18:20:03.000-0600

Apply the patch in ASTERISK-5810 and see if it fixes your issue.  I wasn't sure it was related, which is why it's a separate issue.

By: Tilghman Lesher (tilghman) 2005-12-10 18:21:33.000-0600

Oh, and a core file is completely useless if you don't have the executable (and all associated .so files) from which the corefile was generated.

By: Mark Edwards (edwar64896) 2005-12-10 19:14:20.000-0600

Just tested this, and we get a crash in:

#0  0x08062731 in ast_channel_spy_remove (chan=0x8214c58, spy=0xb76f88b0)
   at channel.c:1044

which is ...

       AST_LIST_REMOVE(&chan->spies->list, spy, list);

see updated BT...

(gdb) p chan->spies->list
$29 = {first = 0xb77398b0, last = 0xb76f88b0}
(gdb) p *spy
$30 = {list = {next = 0xb76f88b0}, lock = {__m_reserved = 0, __m_count = 0,
   __m_owner = 0x0, __m_kind = 1, __m_lock = {__status = 0, __spinlock = 0}},
 trigger = {__c_lock = {__status = 0, __spinlock = 0}, __c_waiting = 0x0,
   __padding = '\0' <repeats 27 times>, __align = 0}, read_queue = {
   head = 0x0, samples = 0, format = 64}, write_queue = {head = 0x0,
   samples = 0, format = 64}, flags = 57, status = CHANSPY_RUNNING,
 type = 0xb78f6fbe "ChanSpy", read_vol_adjustment = 0,
 write_vol_adjustment = 0}
(gdb) p spy->list
$31 = {next = 0xb76f88b0}

By: Tilghman Lesher (tilghman) 2005-12-10 20:12:21.000-0600

This patch applies on top of the previous patch.

By: Mark Edwards (edwar64896) 2005-12-11 05:43:12.000-0600

I have updated svn on my repository and am now seeing linkedlists.h at 7429.

I have applied the above patch (20051210) to this rev. and I am still seeing segfault under described conditions.

Assuming your prior patch on this bug was deleted due to the closure of 0005965.

Fails on AST_LIST_INSERT_TAIL with...

(gdb) p chan->spies->list
$4 = {first = 0xb77128b0, last = 0x0}
(gdb) p spy->list
$5 = {next = 0x0}
(gdb)

Agree that the bug is probably more to do with linkedlists.h than app_chanspy.c

Do you still recommend applying the race condition fix or is this a moot point considering difficulties with linklists.h?

By: Tilghman Lesher (tilghman) 2005-12-11 07:43:47.000-0600

Please upload the backtrace, as well.  Until we resolve the problem with the linked lists, we won't know whether there's an associated race condition, as well.



By: Mark Edwards (edwar64896) 2005-12-11 14:30:06.000-0600

chanspy-bug-backtrace-curelm2+racefix.txt has:

1. linkedlists.h at rev 7429 + patch 20051210__bug5941.diff.txt
2. race condition fix for app_chanspy (deleted from this bug)

By: Tilghman Lesher (tilghman) 2005-12-12 11:51:30.000-0600

Find me on IRC, as Corydon76-home or Corydon-w, so we can find a quicker resolution to this problem.  I'm going to need remote access to your test box.

By: Mark Edwards (edwar64896) 2005-12-12 15:07:55.000-0600

Happy to arrange offline.
I am in Melbourne, Australia, GMT+10

Lets work out a mutually convenient time and I will be happy to give you access to the box.

By: Mark Edwards (edwar64896) 2005-12-13 01:20:22.000-0600

Updated svn 1.2 trunk, applied patch.

Asterisk now rock solid under stated chanspy test conditions, however only the first of the spy channels channels is actually receiving any audio.

By: Kevin P. Fleming (kpfleming) 2005-12-13 10:13:58.000-0600

OK, I'm closing this bug report then as this problem is corrected. If you have a different issue, please open a new bug report for that one. Thanks!

By: Tilghman Lesher (tilghman) 2005-12-13 15:32:30.000-0600

This problem isn't corrected.  The last patch is simply an insertion of the macro code, so I can see exactly where it is crashing inside the macro.

By: Tilghman Lesher (tilghman) 2005-12-13 15:37:15.000-0600

edwar64896:  Please update to current SVN branch 1.2 and see if it still crashes.  If not, then this can indeed be closed, but it was premature to close this until we're sure that that is resolved (without any patch uploaded to this bug).

By: Kevin P. Fleming (kpfleming) 2005-12-13 16:57:20.000-0600

Sorry, I saw the patch from Corydon and thought it had already been committed. My mistake, shouldn't have closed the bug.

By: Mark Edwards (edwar64896) 2005-12-13 19:36:13.000-0600

OK Thanks for reopening this.

I have removed include/asterisk/linkedlists.h and channel.c and done svn update to trunk/1.2

We are back to a crash situation under same conditions noted in the earlier part of this report.

We are now crashing in ast_channel_spy_remove during the AST_LIST_REMOVE macro at the point where....

1051                    typeof(spy) curspy = (&chan->spies->list)->first;
1052                    while (curspy->list.next != (spy))

Unfortunately, (&chan->spies->list)->first is 0x0 in line 1051 and we segfault on line 1052.

See attached BT, channel.c patch I am using for test and additional gdb variable info.



By: Mark Edwards (edwar64896) 2005-12-15 15:39:50.000-0600

Guys, can we set resolution status on this one to something other than fixed, so we don't give people unfamiliar with this bug the wrong impression..

ta.

By: Tilghman Lesher (tilghman) 2005-12-16 18:13:48.000-0600

I've made one more change to SVN that I think will solve the issue.  Please update to the latest 1.2 branch in SVN, and 'make clean dont-optimize' and try again.

By: Mark Edwards (edwar64896) 2005-12-16 18:43:12.000-0600

I am now unable to reproduce failure under stated test conditions.

Looks like that pesky race condition coupled with that linkedlist issue has been put to bed.

Many thanks for your help with this.