ASTERISK-05110: L option in Dial application does not disconnect

[Home]

Summary: ASTERISK-05110: L option in Dial application does not disconnect

Reporter: heng (heng) Labels:

Date Opened: 2005-09-19 05:10:10 Date Closed: 2008-01-15 15:51:12.000-0600

Priority: Minor Regression? No

Status: Closed/Complete Components: Applications/app_dial

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 5252.patch
( 1) channel_checkpoints.txt
( 2) channel.c.patch
( 3) channel.c.patch.v2
( 4) messages
( 5) show_channel.txt

Description: L option is not working for call more than 2400 seconds.

Below command DOES NOT disconnect call (call can continue forever)
Dial(SIP/foo|20|L(2400000))

Below command works fine (call will disconnect at 2100 seconds)
Dial(SIP/foo|20|L(2100000))

****** ADDITIONAL INFORMATION ******

Have tried from Stable V1.0.2 to Stable V1.0.9. All having the same problem.
OS=CentOS 4.1
Server= Various. From Dell server to normal PC.

Comments: By: Michael Jerris (mikej) 2005-09-19 05:23:55

Can you test this on cvshead or 1.2-beta1 please?
By: heng (heng) 2005-09-19 09:58:43

Tested with today (19-09-2005) cvshead:

Both tests were conducted with can-reinvite = yes.
(SJPhone-->Asterisk-->SJPhone)

Test1:
Dial(SIP/foo|20|L(60000))
Call DID NOT disconnect at 60sec but the following display continually on console:
DEBUG[25406]: Channel.c:2981 ast_generic_bridge: Nobody there. Continuing....
DEBUG[25406]: Channel.c:2981 ast_generic_bridge: Nobody there. Continuing....
DEBUG[25406]: Channel.c:2981 ast_generic_bridge: Nobody there. Continuing....

Test2:
Dial(SIP/foo|20|L(2400000))
Call DID NOT disconnect at 2400 sec. Nothing un-usual was display at console.

Hope that help.

By: Clod Patry (junky) 2005-09-19 21:06:33

I switch canreinvite=no and it works fine, so it has something related to canreinvite=yes.

With canreinvite=yes, im getting the same behavior as you:
[Sep 19 22:02:14] DEBUG[31542]: channel.c:2981 ast_generic_bridge: Nobody there, continuing...
but its only for like 1 sec. and then, the call continue.

tested with HEAD 2005-09-18 07:29:48 UTC.

By: heng (heng) 2005-09-20 01:42:42

You are right!
if set caninvite=yes ---> problem
If set canreinvite=no --->works fine

Able to reproduce the same symptom with Stable 1.0.9.
Having the same problem with S option.

This is my 1st time reporting a bug...what should I do next?
By: Clod Patry (junky) 2005-09-20 07:18:46

wait until we find the solution. :)

Just use canreinvite=no for the moment to hangup at your specific time.
By: twisted (twisted) 2005-09-21 00:42:36

The reason this is not working with canreinvite=yes (at least in my opinion) is because with canreinvite=yes, the call is reinvited off of asterisk and direct end to end, so we have no way to tear down the call.

The theory is correct, however, since messaging stays on asterisk regardless of the rtp stream, we should still be able to send BYEs to both ends and have them tear down the rtp path. It appears we rely on the bridge being in tact to destroy the call. Perhaps we need to modify this behaviour and make it look for the signalling side if it can't find media.
By: heng (heng) 2005-09-21 04:45:57

twisted: Couldn't agree more on your theory. That's why H323 is more 'reliable' than SIP in term of session monitoring but SIP is way more flexible and canreivite=yes is part of it.

However, there are definately some things wrong in coding in this case as Asterisk DID NOT issue any BYE to clients if call > 2400 seconds and the tests were conducted in a LAN environment.

I'm with you on the following:
"Perhaps we need to modify this behaviour and make it look for the signalling side if it can't find media."
By: Kevin P. Fleming (kpfleming) 2005-09-25 21:52:48

Can we please get a complete trace of the failing call (with re-invite enabled)? I need to see the sequence of events that occurred (native bridging, generic bridging, etc.) If you don't the proper way to generate a trace, ask a bug marshal on IRC to assist you.
By: Michael Jerris (mikej) 2005-09-25 21:58:46

If you can do that testing against cvs HEAD that would be most helpful, thanks.
By: heng (heng) 2005-09-27 04:27:18

Done. Uploaded file "messages" which contain all the display on Console including 'sip debug' output.

Test Scenario: (CVS HEAD downloaded 27 Sept 2005)
User 333 called User 332 with Dial(SIP/333|20|L(60000))...(with canreinvite=yes)
Call did not drop after 60 seconds.

By: szuk (szuk) 2005-10-05 18:53:35

I think I am having the same problem. I haven't looked at the code yet so I'm not sure how the call is actually ended, but if twisted is correct wouldn't it be possible to just issue a soft hangup on each channel? That should work for sip in the re-invite case as it sends the appropriate BYE messages.
By: BJ Weschke (bweschke) 2005-10-12 02:18:48

Timer functionality is implemented within the channel bridge. When the channels are natively bridged the timer isn't checked because the channels aren't bridged. The earlier notes are spot on in that this problem is specific to any kind of channel where the media can run outside of the server. I don't know if an IAX transfer would pass the timer values to the receiving leg??? Additionally, if you were to use any of the options that come with "L" it would require that Asterisk be in the media path.
By: heng (heng) 2005-10-12 06:52:04

bweschke: Pls take note if we were using L(x:y:z), then Asterisk must be in the media path. However, if just using L(x), then it will transforms to use S(x/1000). Thus, the actual problem is with S option.

Pls take note that in our tests with Stable 1.0.9, everything work well for short calls ( less than 2100 seconds) but not for calls > 2400 seconds (did not test for call between 2100s and 2400s). I won't be able to help in coding but glad to do any further tests...
By: BJ Weschke (bweschke) 2005-10-12 11:44:57

This is becoming more interesting now. :-)
Using S instead of L, the timeout is now getting stored in the channel structure called "whentohangup". While this isn't part of the config structure on the bridging function itself, it looks like it's getting checked in ast_check_hangup which probably isn't getting called when the channel isn't using ast_write and ast_read because the media is somewhere else.
chan_sip looks like it already has support for receiving and setting a time limit value on inbound calls via OSP that sets the same whentohangup value on the new channel when Asterisk is receiving the call, but it doesn't look like it's taking an existing whentohangup value from the channel structure and passing it in a OSP token on a new call.

However, looking at chan_iax2.c, I would have to expect that the behavior is going to be the same here with IAX2 as it is already with SIP because IAX2 doesn't seem to be passing on whentohangup either when a transfer occurs. Maybe this is expected behavior by those using the function because they know that the middle-man on IAX is getting completely out of the way (call control and media) and the channel goes away when a call is transfer and that's why it's not part of the bug as well?

Moreover, when L is used only with a timeout value with no options (from the original bug report) it might make more sense to use the whentohangup structure on the channel to stay consistent with the functionality between L and S when there is no media involved.
By: BJ Weschke (bweschke) 2005-10-12 12:33:15

actually. I just realized if all the options are turned off in app_dial, it does use S approach. so, disregard the last comment from the earlier note. doh!
By: BJ Weschke (bweschke) 2005-10-12 12:46:09

Try this patch against chan_sip.c in CVS-HEAD. It should identify and at least warn you in all scenarios when you use S or L (without any options) that the timeout isn't going to be observed.

If the warning is consistent, I think it should be possible to expand on this section to add an ast_sched entry to come due at the end of the timeout period which could then step in and actually send BYE msgs to terminate the call.

Disclaimer on file.
By: heng (heng) 2005-10-13 10:33:38

bweschke:
Applied the patch.

With canreinvite=yes, the warning msg consistently appears when the call get connected (bridged). Tested with both S(x) and L(x) options.

With canreinvite=no, unable to see the warning msg.

I believe the above results consistent with your expectation.
By: Mark Spencer (markster) 2005-10-13 14:33:13

Fixed in CVS head.
By: heng (heng) 2005-10-14 09:07:00

Markster:
Using the latest CVS Head (which contained the patches in chan_vpb.c &chan_zap.c).
The previous reported problem still remain.

Tested with canreinvite=yes.
SJPhone-->Asterisk-->SJPhone)

Test1:
Dial(SIP/foo|20|L(40000))
Call DID NOT disconnect at 40sec but the following display continually on console for 1 second. The call session remain active after that:

DEBUG[32514]: Channel.c:2977 ast_generic_bridge: Nobody there. Continuing....
DEBUG[32514]: Channel.c:2977 ast_generic_bridge: Nobody there. Continuing....

Appreciate if someone could help.

Thanks
Heng
By: BJ Weschke (bweschke) 2005-10-14 09:11:27

I will try and lab this up a little later today in the code zone here at Astricon so we can take a closer look at it.
By: Mark Spencer (markster) 2005-10-14 13:07:54

Oops, try latest CVS head again please, sorry about that!
By: Mark Spencer (markster) 2005-10-14 14:24:28

Fixed again in CVS head :)
By: heng (heng) 2005-10-15 09:46:48

Markster:
Thanks for the fast response...but I still need to reopen the case..sorry.

1)Tests with Dial(SIP/foo|20|L(40000))
Result: Call disconnected at 40 sec. Working fine.

2) Tests with Dial(SIP/foo|20|L(2400000))
Result: Asterisk ALLOW the call to continue after 2400 sec. In short, it work well for short duration call but does not disconnect call more than 2400 sec. The symptom consistent with my previous tests with Stable version 1.0.2 to 1.0.9.
By: BJ Weschke (bweschke) 2005-10-15 16:11:15

I wonder if there's a numeric container getting overloaded with the larger value. There's nothing out of sorts in a quick scan of logic in app_dial, but I'll try and dig deeper tonight or tomorrow.
By: BJ Weschke (bweschke) 2005-10-18 01:08:43

definitely still have a problem here from the show_channel attachment. :-/ this was a lab setup with a fresh build from CVS-HEAD this evening.
I'm going to try and set some more debugging points within channel.c and the ast_waitfor_nandfds function to see what might be going wrong with the logic.
By: Mark Spencer (markster) 2005-10-18 01:26:52

I'm especially interested in the value of "timeout" in ast_rtp_bridge...
By: James Lyons (james) 2005-10-18 01:45:07

I could not duplicate this bug on CVS head as of 10-17-05 using the same dial string as shown above. I used two asterisk servers with a SIP connection with canreinvite=yes and an iaxy. Asterisk hung up the call after 40 mins.
By: BJ Weschke (bweschke) 2005-10-18 01:53:24

does that channel_checkpoints attachment help at all? that came out right after/when the rtp got bridged up externally. I'm anxiously awaiting what it comes up with when it returns from that poll in 2399000 more milliseconds. :-)
By: BJ Weschke (bweschke) 2005-10-18 01:54:57

james: all three nodes involved here need to be SIP in order for the media to bridge externally. If you've got an iaxy connecting to one side of two asterisk servers and call the asterisk server on the other side, the media is going to stay bridged in the middle to do translation between SIP and IAX.
By: James Lyons (james) 2005-10-18 02:17:39

bweschke: didn't think about that, will be trying again tomorrow, a little late to wait for another 40 mins...
By: BJ Weschke (bweschke) 2005-10-18 02:32:30

the poll in ast_waitfor_nands - from what I can see thus far, never returns back after its timeout value expires. we're almost 50 mins into the call now and it's still up, and there is nothing in the logs that indicates the poll returned from its blocking state. this is truly strange. a SIP registration does occur successfully on that same device while the call is going on, but I can't imagine that's throwing anything off.
By: BJ Weschke (bweschke) 2005-10-18 03:08:50

googling around, there's a relatively recent discussion thread in the linux kernel as late as 2.6.13 and 2.6.14 that sys_poll() with large timeout parameters is broken. here's the description from the original reporter:
-------------------
The @timeout parameter to sys_poll() is in milliseconds but we compare it
to (MAX_SCHEDULE_TIMEOUT / HZ), which is (jiffies/jiffies-per-sec) or
seconds. That seems blatantly broken. This led to improper overflow
checking for @timeout. As Andrew Morton pointed out, the best fix is to to
check for potential overflow first, then either select an indefinite value
or convert @timeout.

To achieve this and clean-up the code, change the prototype of the sys_poll
to make it clear that the parameter is in milliseconds and introduce a
variable, timeout_jiffies to hold the corresonding jiffies value.

Signed-off-by: Nishanth Aravamudan <n...@us.ibm.com>
Signed-off-by: Andrew Morton <a...@osdl.org>
----------
tweaking the code in channel.c to breakdown the timeout value into smaller values, it does appear that I'm dropping the call now after 40 mins, but the undesirable effect of doing this is that we're sending re-invites every time we cycle prior to the "real timeout" because the bridge is getting broken down and setup again for no real good reason other than the fact that poll appears to be broken on 2.6 linux kernels when using a large timeout value.
By: BJ Weschke (bweschke) 2005-10-18 03:54:30

I think the patch for channel.c just attached finally puts this issue to bed. It'll break down the timeout value on the poll if it's larger than 10 minutes, but loop iteratively within the same function so as to prevent the re-invites that aren't needed when it's only a "cycle timeout" that caused the poll to return in the first place.
By: BJ Weschke (bweschke) 2005-10-18 07:06:24

Hmmm... My patch failed my own 40 min unit test. I've got to revisit the logic now that I'm a little less bleary eyed.
By: heng (heng) 2005-10-18 08:07:49

I really appreciate everybody hard work on this bug especially bweschke who have been working intensively...Just petty I can't do coding :C but I will contribute on the testing later.
By: BJ Weschke (bweschke) 2005-10-18 09:07:16

"New and Improved" patch uploaded. It slices, it dices, it even passes its own unit tests! :-)

The real kudos on this bug goes out to the team in the code zone last week at Astricon who setup a framework for the channel timeout that works across all channel technologies. Nice job guys.

I think this one is finally ready to go away for good. Test away.
By: heng (heng) 2005-10-18 10:23:18

GOOD JOB. Yipee!!!
Initial tests with Dial(SIP/foo|20|S(40)), Dial(SIP/foo|20|L(30000)) and Dial(SIP/foo|20|S(2400)) successful with the calls being disconnected.

Will conducts more tests tomorrow to re-confirm.
Thanks guys...:)
By: heng (heng) 2005-10-19 02:29:19

Tested with Dial(SIP/foo|20|L(7200000)) and Dial(SIP/foo|20|L(2400000|2000000|60000)) successfully.

Notice the following debug msg after the calls terminated:
DEBUG[4958] Channel.c: Avoiding initial deadlock for SIP/fooxxx

If the above msg is not alarming, then I would said the bug have been rectified.

Thanks guys :-)
By: BJ Weschke (bweschke) 2005-10-19 06:29:53

Latest patch has passed testing in two different locations that previously were not working. Waiting to get merged now or have additional people test it.
By: BJ Weschke (bweschke) 2005-10-31 08:01:50.000-0600

Housekeeping
By: BJ Weschke (bweschke) 2005-11-01 10:57:00.000-0600

Was changed to pending stable in error. My mistake. Sorry!
By: BJ Weschke (bweschke) 2005-11-01 11:04:11.000-0600

fixed with markster's commit of channel.c 1.251 on 10/20
By: BJ Weschke (bweschke) 2005-11-01 11:04:37.000-0600

closing.
By: Digium Subversion (svnbot) 2008-01-15 15:50:43.000-0600

Repository: asterisk
Revision: 6759

U trunk/channel.c
U trunk/channels/chan_iax2.c
U trunk/channels/chan_vpb.c
U trunk/channels/chan_zap.c
U trunk/include/asterisk/channel.h
U trunk/include/asterisk/rtp.h
U trunk/rtp.c

------------------------------------------------------------------------
r6759 | markster | 2008-01-15 15:50:43 -0600 (Tue, 15 Jan 2008) | 2 lines

Fix call timeouts with rtp bridge etc (bug ASTERISK-5110)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=6759
By: Digium Subversion (svnbot) 2008-01-15 15:51:12.000-0600

Repository: asterisk
Revision: 6791

U trunk/channel.c
U trunk/channels/chan_iax2.c
U trunk/channels/chan_vpb.c
U trunk/channels/chan_zap.c
U trunk/rtp.c

------------------------------------------------------------------------
r6791 | markster | 2008-01-15 15:51:11 -0600 (Tue, 15 Jan 2008) | 2 lines

Make sure when the timeout occurs that we actually break the bridge (bug ASTERISK-5110)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=6791