[Home]

Summary:ASTERISK-16651: [patch] atxfer broken on 1.6.2.11
Reporter:Ronald Verschaeren (ronald_verschaeren)Labels:
Date Opened:2010-09-06 03:14:43Date Closed:2010-11-08 12:33:21.000-0600
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Features
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) atxferlog
( 1) atxferlog_patch4
( 2) atxferlog_patch5
( 3) atxferlog_patch6
( 4) atxferlog_patch7
( 5) atxferlog_with_patch
( 6) builtin_atxfer_debug1.diff
( 7) builtin_atxfer_debug3.diff
( 8) builtin_atxfer_debug4.diff
( 9) builtin_atxfer_debug5.diff
(10) builtin_atxfer_debug6.diff
(11) builtin_atxfer_debug7.diff
(12) core_show_locks.txt
(13) extensions.conf
(14) features.conf
(15) sip.conf
Description:I'm using SIP phones.

- A calls B
- B transfers A to C by pressing ?2, then C's extension (atxfer => #2 in features.conf)
- C picks up, talks to B.
- B hangs up.
- both A and C hear silence

afaict, A is not bridged to C. This issue was not present in 1.6.1
Comments:By: Paul Belanger (pabelanger) 2010-09-06 04:05:23

We require a complete debug log to help triage the issue.

This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue:

http://svn.digium.com/svn/asterisk/trunk/doc/HOWTO_collect_debug_information.txt

By: Ronald Verschaeren (ronald_verschaeren) 2010-09-06 04:50:10

There is a bit of clutter in the logfile, so for clarification:

192.168.0.4 is the asterisk machine
192.168.0.235 (sip:35)
192.168.0.227 (sip:27)
192.168.0.195 (sip:90) are the peers to watch. I used sip:27 for the DTMF transfer.

By: Matthew Nicholson (mnicholson) 2010-09-22 14:24:57

I can't reproduce this here.  And it looks like the channels are actually bridged at some point according to this line in your log.

[Sep  6 11:32:13] DEBUG[20871] channel.c: Bridge stops bridging channels SIP/27-00000020 and Local/90@default-c5f6;1

Please upload your features.conf, your extensions.conf, and your sip.conf configuration files.

By: Ronald Verschaeren (ronald_verschaeren) 2010-09-23 07:55:18

Thanks for looking into this, I've uploaded the files

By: Matthew Nicholson (mnicholson) 2010-09-23 16:51:45

I can reproduce this with your configuration.  I am investigating it further now.



By: Matthew Nicholson (mnicholson) 2010-09-23 17:01:34

Correction: I still cannot reproduce the issue.  What I was seeing was a problem one of my test phones was having with the alaw codec.

By: Matthew Nicholson (mnicholson) 2010-09-24 16:50:59

Please compile with DEBUG_THREADS enabled and upload the output from "core show locks" during the failed transfer state where A and C cannot hear each other.  I would like to verify if this is a locking issue or not.

By: Matthew Nicholson (mnicholson) 2010-09-28 14:53:58

Have you been able to collect the information I requested?

By: Ronald Verschaeren (ronald_verschaeren) 2010-09-29 02:56:24

I've uploaded the output. Sorry for the delay.

Couple of points that may or may not interest you:
- The issue appears to be phone-independent, it happens on all phones I can test here.
- Sometimes, the transfer works! I would estimate in about 2-3% of cases.
- This happens most often when asterisk has just been restarted.
- There is no output from 'core show locks' when the transfer works.
- When the transfer has failed and both parties hear silence, if one of them hangs up, the other channel remains open.



By: Matthew Nicholson (mnicholson) 2010-10-08 13:59:41

Please grab a debug log with the attached builtin_atxfer_debug1.diff patch.  I am trying to determine exactly what part of the code is failing.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-11 10:50:32

I've uploaded a debug log with your patch compiled in.

When it says 'finishing up attended transfer by bridging...', it's all over already.

When the transferrer hangs up, both parties hear silence, your message about finishing up the transfer is NOT displayed. That message only appears after one of the silence-hearing parties hangs up.

By: Kevin Scott Adams (nivek) 2010-10-12 12:48:03

FWIW, We are getting the same type of results when transferring between Polycom 650's using the transfer button (no DTMF transfers).  Does not happen on every transfer, but the results are always the same in that the sound is inaudible or silent.
Most transfers are from an agent in a queue transferring the caller to another extension.
Running Asterisk-1.6.2.11 under EL5 with latest Kernel with chan_sip and chan_local.
Polycom phones are running sip 3.1.3.

By: Matthew Nicholson (mnicholson) 2010-10-13 09:57:11

I have examined the trace you provided and it looks like the problem is in ast_stream_and_wait(). Try the builtin_atxfer_debug3.diff patch which skips the ast_stream_and_wait() call.  Please upload another trace using that patch.

By: Matthew Nicholson (mnicholson) 2010-10-14 13:51:31

Have you gotten a chance to test this again yet?

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-15 04:31:13

I have just compiled in the patch, and now atxfer works every time!
The bug must indeed be in the ast_stream_and_wait() function.

I will upload a complete log asap.

By: Matthew Nicholson (mnicholson) 2010-10-15 16:07:15

I have uploaded the builtin_atxfer_debug4.diff file.  Please upload a log with that patch applied.  There may be a problem in the way we are detecting when a streaming sound file ends.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-18 02:01:27

I'm recompiling it now.

I've already checked if the beep file is identical to the one in the asterisk tarball.
e6c15171a3b1988399c0b5cce9607038  beep.gsm
e6c15171a3b1988399c0b5cce9607038  /var/lib/asterisk/sounds/en/beep.gsm

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-18 10:42:10

I've uploaded the log from asterisk with patch 4 compiled in.

It appears you were right, it keeps streaming the beep file forever.

By: Matthew Nicholson (mnicholson) 2010-10-18 12:17:03

Ok, try v5 of the patch and let see if we can figure out why it doesn't detect the end of the file.

By: Matthew Nicholson (mnicholson) 2010-10-18 12:18:45

If it would be possible for me to log into this machine and do some live debugging, that would be very helpful.  If that is an option, contact me on IRC as mnicholson in #asterisk-bugs.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-18 16:19:39

I'm very sorry, I know it would help a lot, but giving you ssh access would be difficult, since the system is semi-production.

I just hope this won't turn out to be some kind of strange system-specific glitch.

Tomorrow, I'll upload another log with patch 5 applied.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-19 10:11:07

Log from asterisk with v5

By: Matthew Nicholson (mnicholson) 2010-10-19 15:10:33

New debugging patch to test with.  Thanks for sticking with this thus far.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-20 10:42:58

Uploaded.

The first time I tried was one of those times the transfer worked, so I uploaded that logfile too for completeness

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-20 10:47:58

Hmm, I must have done something wrong while handling the 'working' log file, since it appears to be an old file dated sept 6.

Which means I must have deleted the real 'working' log file. Sorry :)

By: Matthew Nicholson (mnicholson) 2010-10-20 15:51:01

This might be a problem with the timing backend you are using.  Test with the debug7 patch and upload a log.  Also which timing backend are you using?  And does using a different timing backend make a difference?  You may be able to see what timing backend you are using by running the 'timing test' cli command.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-20 15:58:05

timing test

Attempting to test a timer with 50 ticks per second.
Using the 'timerfd' timing module for this test.
It has been 1001 milliseconds, and we got 50 timer ticks

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-20 16:04:09

Since you mention timing: Could this be a kernel issue? I'm running 2.6.27, quite old already. Maybe that explains why I'm the only one who appears to have this problem. I can imagine few other people use such a recent version of asterisk with such an old kernel?



By: Matthew Nicholson (mnicholson) 2010-10-20 16:37:51

It could be a kernel problem. You could try using dahdi or pthreads for timing by disabling the res_timing_timerfd module using modules.conf.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-21 02:01:51

I have disabled the timerfd module, and now the transfer works every time! (using dahdi now)

Should we conclude that this issue is specific to a recent asterisk (with timerfd) and old kernel?
Maybe that means the issue isn't worth fixing?

In any case, I'll upload another log for patch 7 asap.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-21 11:22:50

done

By: Matthew Nicholson (mnicholson) 2010-10-22 09:59:12

This looks like some sort of timerfd bug.  It may be related specifically to something odd with your system.  I was not able to reproduce this on an ubuntu system running kernel 2.6.27.

By: Ronald Verschaeren (ronald_verschaeren) 2010-10-23 13:31:34

OK, so what's the verdict? 'wontfix'?

I'll be reinstalling the linux system on the machine exhibiting this bug in the next couple of weeks. Should the problem return on a clean install of Ubuntu 10.10 server, I'll let you know!

By: Matthew Nicholson (mnicholson) 2010-10-25 07:34:13

I'm not quite ready to mark this as won't fix just yet. It could still be a legitimate timerfd bug. I need to investigate that further, but it probably will end up as won't fix.

By: Matthew Nicholson (mnicholson) 2010-11-08 12:33:20.000-0600

I am marking this won't fix.  This appears to be a timerfd bug that I cannot reproduce, and since switching to another timing source fixes this, I don't think it is worth spending any more time on.