Summary: ASTERISK-07417: [patch] Zaptel misreports channel in battery drop state as available, then refuses to open it
Reporter: Rolf Braun (rbraun)
Labels:
Date Opened: 2006-07-28 18:28:52
Date Closed: 2007-07-05 12:58:06
Versions:
Frequency of
Environment:
Attachments: ( 0) ztpatch1-1.2.7
( 1) ztpatch1-trunk
Description: In Asterisk's chan_zap, the function available() is used to check that a channel is free before trying to dial on it. It is also used to find such a channel when a Zap group dial is used, such as Dial(Zap/G1/5551212). chan_zap relies on the ZT_GET_PARAMS zaptel ioctl for enough channel information to confirm that the channel is really available, even when chan_zap itself does not know of a call on that channel.

If the channel is in the battery drop state (ZT_TXSTATE_KEWL) or in the guard time after it (ZT_TXSTATE_AFTERKEWL), the parameters returned by ZT_GET_PARAMS give no indication that the channel is unavailable. available() and zt_request() will thus select that channel, try to open it, and fail (zaptel returns EBUSY because of the channel state). The result is that a call made to a group dial fails even though there are free channels in the Zap group. Because this causes a call to fail to go through, I am filing this as a major bug. We have worked around this in the past by doing a manual rollover (e.g. priority 1 dials Zap/1, priority 2 dials Zap/2), but that should not be necessary.
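The mismatch described above can be modeled in a few lines of C. This is a minimal sketch, not zaptel code: the struct, the MODEL_* enum, and both functions are hypothetical stand-ins. It assumes only what the report states, namely that the availability report does not reflect the transmit state while the open path does.

```c
#include <errno.h>

/* Hypothetical model of the transmit states named in the report. */
enum model_txstate {
    MODEL_TXSTATE_IDLE,      /* on-hook, nothing in progress */
    MODEL_TXSTATE_KEWL,      /* stands in for ZT_TXSTATE_KEWL (battery drop) */
    MODEL_TXSTATE_AFTERKEWL  /* stands in for ZT_TXSTATE_AFTERKEWL (guard time) */
};

struct model_chan {
    enum model_txstate txstate;
    int rxisoffhook;         /* 1 if the line is seen as off-hook */
};

/* Stand-in for the ZT_GET_PARAMS-based check: looks at hook state only. */
int model_reports_available(const struct model_chan *c)
{
    return !c->rxisoffhook;
}

/* Stand-in for opening the channel: refused during KEWL and AFTERKEWL. */
int model_open(const struct model_chan *c)
{
    if (c->txstate == MODEL_TXSTATE_KEWL ||
        c->txstate == MODEL_TXSTATE_AFTERKEWL)
        return -EBUSY;
    return 0;
}
```

For a channel whose txstate is MODEL_TXSTATE_KEWL and whose rxisoffhook is 0, model_reports_available() returns 1 while model_open() returns -EBUSY, which is the contradiction a group dial runs into.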


This problem became more apparent in recent testing where I set ZT_KEWLTIME to 1500 and ZT_AFTERKEWLTIME to 2500. A legacy PBX fed from analog lines connected to an RBS channel bank (CAC Adit 600) requires a longer battery drop than is documented to reliably detect a hangup. In addition, we are testing some code that uses the afterkewl state to trigger another battery drop if the line is picked up in that interval, as the legacy PBX has another bug which causes a race condition on the disconnect supervision. That code will be filed separately as a feature request, but it is not needed to reproduce this bug. In theory this bug could happen without any alteration to KEWLTIME and AFTERKEWLTIME, but it is much easier to reproduce if those times are made longer. I isolated this bug in zaptel 1.2.6 and 1.2.7, and there have been no changes to the relevant parts of chan_zap and zaptel in the trunk since then.

I am filing this against zaptel because, while the chan_zap rollover behavior is less than ideal, it is fundamentally a bug in zaptel that it reports the channel as available and then refuses to open it. The enclosed patch makes the reported state off-hook in the kewlstart states; I am not sure if this is the right approach, but it does fix the problem.
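As a rough illustration of "report off-hook in the kewlstart states", the idea can be sketched as a single function. This is not the contents of ztpatch1: the enum and the function name are hypothetical, and the sketch only shows the decision being made, not where in zaptel it would live.

```c
/* Hypothetical stand-ins for the transmit states named in the report. */
enum model_txstate {
    MODEL_TXSTATE_IDLE,
    MODEL_TXSTATE_KEWL,      /* stands in for ZT_TXSTATE_KEWL */
    MODEL_TXSTATE_AFTERKEWL  /* stands in for ZT_TXSTATE_AFTERKEWL */
};

/*
 * Hook state to report to the caller: off-hook (1) if the line really is
 * off-hook, or if a battery drop or its guard time is still in progress.
 */
int model_reported_offhook(enum model_txstate txstate, int rxisoffhook)
{
    if (txstate == MODEL_TXSTATE_KEWL || txstate == MODEL_TXSTATE_AFTERKEWL)
        return 1;
    return rxisoffhook != 0;
}
```

With this kind of reporting, a caller that treats "off-hook" as "not available" would skip the channel instead of selecting it and then failing to open it.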
Comments:
By: Rolf Braun (rbraun) 2006-08-18 17:29:02

No activity on this for a few weeks... any more information I need to provide on this?

By: jmls (jmls) 2006-11-01 05:33:33.000-0600

Can anyone comment on this issue? Thanks.

By: Serge Vecher (serge-v) 2006-11-06 10:34:15.000-0600

Ideally, we are looking for functional testing results from users with an equipment setup similar to rbraun's...

By: Rolf Braun (rbraun) 2006-11-06 15:38:19.000-0600

This particular bug (not 7754/7755) does not require that particular kind of equipment to reproduce. It is a race condition in zaptel and chan_zap, and it can be deduced from the program logic. The problem is that zaptel will claim a channel is free when it is not, chan_zap will select that channel, and zaptel will then refuse to open it after the channel has been chosen, so the call will not fail over to the next channel in the group. If you need steps to reproduce:

1. Set KEWLTIME to 1500 and AFTERKEWLTIME to 2500 as noted above. While these are not typical values, it is just an adjustment of an existing timeout and it should not cause functionality to break. Recompile zaptel with these values.
2. Create a dialplan with a group dial statement such as Dial(Zap/g1/5551212) where the group consists of lines on a channel bank attached via a T1 with robbed bit signaling.
3. Pick up the first line in the group and hang it up. Just after hanging it up (within the 4 seconds allowed by the values mentioned), trigger a group dial to the group that you created in the dialplan. You can do this from a line in another group, from a SIP or IAX connection, etc.
4. Expected behavior: the call should roll over to the second line. Actual behavior: the call fails to complete even though there are lines available in the group.
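The 4-second window in step 3 is just the sum of the two values from step 1. A small sketch of that arithmetic, assuming both values are in milliseconds (which the "4 seconds" figure implies); the macros here are local copies for illustration, and the helper functions are hypothetical.

```c
/* Values from step 1, assumed to be milliseconds. The real definitions
 * live in the zaptel source; these are local copies for illustration. */
#define ZT_KEWLTIME      1500
#define ZT_AFTERKEWLTIME 2500

/* Total time after a hangup during which the channel cannot be opened. */
int model_blocked_window_ms(void)
{
    return ZT_KEWLTIME + ZT_AFTERKEWLTIME;
}

/* 1 if a group dial started this long after the hangup would hit the bug. */
int model_in_blocked_window(int ms_since_hangup)
{
    return ms_since_hangup >= 0 &&
           ms_since_hangup < model_blocked_window_ms();
}
```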

I will comment on 7754/7755 separately; I am not nearly as certain what the correct way to fix those is, and I filed them to start a discussion. Bug 7755 depends on this bug being fixed, but this bug (7612) is a problem in its own right.

By: jmls (jmls) 2007-02-11 03:41:17.000-0600

ping. housekeeping.

By: Rolf Braun (rbraun) 2007-02-13 14:29:41.000-0600

Pong. I'd like to see this bug fixed. I provided in the bug report a description of the code flow, which as far as I can tell has not changed since then in SVN, in either the branches or trunk. I have reproduced the problem in production, which is how I found it in the first place, and this patch does fix the problem. The available() function in chan_zap MUST return the correct availability of the channel for group dial statements to roll over correctly, and I provided a case where it clearly does not. I provided a minimal, three-line patch so that zaptel can correctly report the channel as being in use to chan_zap. I really don't understand what else you need here.

The only issue might be that this is the wrong place to fix the bug. It would also be possible to fix this bug by error checking the result of zt_new when it is called within zt_request in chan_zap, and resuming the hunt group behavior if no channel is returned rather than passing it back to the application right away. However, the patch I attached is probably the simpler fix and less likely to cause other problems.
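The alternative described above, resuming the hunt when the open fails, can be sketched as follows. This is a simplified model and not chan_zap code: the struct and both functions are hypothetical, with open_result standing in for whatever zt_new would return for that channel.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-channel model for a hunt group. */
struct model_chan {
    int reports_available;  /* what the availability check claims */
    int open_result;        /* 0 on success, negative errno on failure */
};

/* Behavior described in the report: the first channel that looks
 * available is chosen, and a failed open ends the hunt. */
int hunt_stop_on_failure(const struct model_chan *group, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!group[i].reports_available)
            continue;
        return group[i].open_result == 0 ? (int)i : -1;
    }
    return -1;
}

/* Alternative fix: a failed open is treated as "try the next channel". */
int hunt_resume_on_failure(const struct model_chan *group, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!group[i].reports_available)
            continue;
        if (group[i].open_result == 0)
            return (int)i;
        /* Open failed (for example with -EBUSY): keep hunting. */
    }
    return -1;
}
```

With a two-channel group where the first channel reports available but fails to open with -EBUSY and the second opens cleanly, hunt_stop_on_failure() returns -1 and hunt_resume_on_failure() returns index 1.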

By: Jason Parker (jparker) 2007-07-05 12:58:01

Fixed in svn branches 1.2, 1.4, and trunk, in revisions 2696, 2697, and 2698.