ASTERISK-13496: [patch] Unhold fails if first SDP on OK, particularly Cisco CCM 6

[Home]

Summary: ASTERISK-13496: [patch] Unhold fails if first SDP on OK, particularly Cisco CCM 6

Reporter: David Woolley (davidw) Labels:

Date Opened: 2009-02-02 06:39:21.000-0600 Date Closed: 2011-07-26 14:20:43

Priority: Minor Regression? No

Status: Closed/Complete Components: Channels/chan_sip/Interoperability

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) CiscoCCMFirstHoldTrace.txt
( 1) CiscoHoldsSIPTrace.txt
( 2) NoReInviteHistory.txt
( 3) NoReInviteSIPDebug.txt
( 4) Provisional_forward_port_of_r174644_to_1.6.1.0_tag.diff

Description: Cisco CCM 6 appears never to send SDP on the INVITE, so Asterisk effectively makes the offer, even when the Cisco is re-inviting because a CCM hosted phone has gone on hold. Asterisk sets up its SDP from flags[1] in the channel private data structure, which means that, if the channel was on hold, it will offer a=inactive (observed, although a=recvonly also seems possible). Even if the CCM intended to unhold, the only possible response to an a=inactive offer, is a=inactive. Asterisk should only actually use saved state if it is making the response. For an offer, as it looks like it can never locally hold, it should offer a=sendrecv.

****** ADDITIONAL INFORMATION ******

In testing, on 1.6.0 (which I believe differs only in configuration documnetation) it looked as though it was also offering a=inactive during the hold, but I can't explain that from the source code, and for various reasons I'm submitting this from away from the office.

It is arguable as to whether it is legal to re-invite without providing SDP on the INVITE, but Cisco are a major player.

For this symptom to show, the Cisco system must be configured not to use MTPs, and, probably, must have compatible RTP DTMF modes. It can happen with SCCP phones. MTPs are similar in concept to Asterisk internal bridges. With an MTP, CCM will generate music on hold locally, rather than signalling the hold downstream, but this results in a loss of redundancy in the speech path and all speech paths going through a single machine.

Comments: By: David Woolley (davidw) 2009-02-11 09:36:36.000-0600

This may be related to ASTERISK-1429448, although, at the moment I can't be sure that it would fix it, or even that it would not make things worse. I think adding a related link would be useful, if only to make "frawd" aware of this one.
By: Joshua C. Colp (jcolp) 2009-02-11 09:42:40.000-0600

I was also pondering whether it would actually take care of the issue... but do not have the capability to test of course.
By: David Woolley (davidw) 2009-02-13 11:11:46.000-0600

I have a problem with ASTERISK-1429448 in that it looks like the trunk svnbot update didn't actually take properly, or doesn't match its description (i.e. it is the wrong patch for the issue). That might not be surprising, as the patch seems to have been applied on top of code (or at least leading comments) that don't exist in trunk or 1.6 versions. It is even possible that what I really need is this prior change!
By: Joshua C. Colp (jcolp) 2009-02-13 11:17:15.000-0600

The code is no longer present in trunk and above, so it was blocked. Are you saying that the lack of this to begin with may be the issue?
By: David Woolley (davidw) 2009-02-13 11:27:47.000-0600

I was trying to understand the patch in the context of the 1.6.0.1 which we are actually using.

I think I need to understand how it got removed, and how it works in the 1.4 context, before I can say that the removal is the problem (I've only just come back to this today, after various higher priority jobs).

At the moment I'm just going on what the comment says, although Cisco always sends its SDP on the ACK, so I don't think it is actually signalling unhold by late SDP.
By: David Woolley (davidw) 2009-02-16 05:35:05.000-0600

As far as I can tell, the patch on which ASTERISK-1429448 depends never made it into trunk, rather than being subsequently taken out. That patch is edit 65041, and looks like it didn't go through the issues system:

Modified Fri May 18 12:58:39 2007 UTC (21 months ago) by oej
File length: 641517 byte(s)

- Adding support for putting calls OFF hold with a re-invite with blank SDP. This was a bug found while doing tests at SIPit in Antwerp.
- In order to not duplicate code, I restructured some of the code for putting calls on/off hold.

Thanks DEA for reminding me. This fix has been asleep in the videocaps branch until now.
By: David Woolley (davidw) 2009-02-16 12:45:09.000-0600

Looking at the code, I am reasonably sure that the combination of revision 65041 and 174644 would cure our symptom, but at the expense of redundant unhold events and unhold SIP history lines.

If there were a re-invite when already held, and which resulted in a held state, it would unhold between the INVITE and the ACK, which is probably not a good thing. (This could be said to be due to the fact that the code doesn't properly distinguish between an SDP request and response.)

As far I can see, these need to be applied on the 1.6.0 and trunk branches in order to avoid a regression failure with respect to the case for which they were originally designed.

Now I am back in the office, I probably need to double check to see if I really got held assumed in the SDP on the OK, even when the channel wasn't held, as that may mean there is another problem somewhere.
By: David Woolley (davidw) 2009-02-17 13:33:19.000-0600

I think what was confusing me about Asterisk apparently sending inactive on the OK to the hold, without SDP on the invite was that CCM has first sent re-invite to address 0.0.0.0 on the invite, then, immediately sent invite with no SDP.

Subsequently both hold and unhold seem to generate invite with no SDP, even though the phone state changes.

I wonder if it doesn't like Asterik's response to first hold and is then somewhat confused, with the lack of code for a resume on no SDP getting Asterisk locked into hold?

I'll try and sanitise the traces and attach them tomorrow. But, basically:

> Invite SDP
< Trying
< OK SDP
> ACK

> Invite SDP
< Trying
< OK SDP
> Ack

Press Hold

< Invite SDP a=inactive c= IN 0.0.0.0
> Trying
> OK SDP a=inactive
< Ack

Unhold

< Invite
> Trying
> OK SDP a=inactive
< Ack a=inactive c= non null

Hold

< Invite
> Trying
> OK SDP a=inactive
< Ack a=inactive c= non null

(This differs slightly from the original scenario in that I have eliminated our application from the equation, and that has a side effect that it is possible to re-invite the RTP after the initial setup.)
By: David Woolley (davidw) 2009-02-18 08:36:32.000-0600

SIP traces uploaded. Note that these were obtained on 1.6.0 tarball code, rather than the 1.6.0.1 used by marketing and on which the problem first presented, but I believe there are no material differences.
By: David Woolley (davidw) 2009-02-19 08:46:59.000-0600

I've also uploaded a Cisco CCM sdl type trace corresponding to the first hold sequence. Note that the clock on the Cisco is 27m 54s slow.

I don't really know how to read this trace, but it does look as though it had second thoughts and decided to generate MOH locally, but it is not obvious why it did that.
By: David Woolley (davidw) 2009-05-07 10:30:44

I've been trying to port these patches to 1.6.1.0. Although, unfortunately I don't have permission to post resulting patches, it seems to me that, to avoid noise, revision 65401 really ought to qualify the reporting (event, history and notifyhold) by whether one is trying to go from held to held. unheld to unheld seems OK, as the calling code does the check, but, if I understand things correctly, held to held cannot be filtered at that stage because there may be a change between different types of held. Maybe it is worth doing belt and braces and testing for no change, rather than held to held.
By: David Woolley (davidw) 2009-05-08 11:12:00

On further consideration, there is too much fiddling to apply the changes in sequence, so I've eliminated 65401 and instead applied just 174644, inlining the relevant bit of change_hold_state.

Specifically, I've used the current version of the debug output code, and copied the history, event, notify and flag clearing code from the code that deals with SDP with a NON-null IP address.

Now to see if it works.

By: David Woolley (davidw) 2009-05-08 12:13:59

There still seems to be some sort of compatibility problem with the Cisco. The CCM sends IP=0.0.0.0, a=inactive, to which Asterisk responds a=inactive. Then, very soon afterwards, it tries again with no SDP, as though it didn't like the original response. Asterisk, with the patch applied, responds a=sendrecv, to which Cisco ACKS a=sendonly.

Asterisk then sends a=recvonly to party B. When I first tried this, I used an X-Lite as party B and it dropped the call in response to that. Using CCM for party B didn't result in a dropped call, although the other party reported only intermittent music on hold, which may well be a configuration/network issue.

I'm now going to have a look at some of the traces I took when alternating CCM between held and unheld.....
By: David Woolley (davidw) 2009-05-08 12:51:00

This is difficult to interpret, but I think the rapid fire re-invite in and out may be resulting in some missing re-invites to party B, or them at least getting out of phase. Unfortunately I didn't trace CCM to CCM from the beginning, only for the subsequent cycles, but I don't see any recvonly's.

Next week is funny, but I'll have to see if I can work out what is happening.
By: Joshua C. Colp (jcolp) 2009-05-21 09:35:31

Have you tried narrowing this down a bit by disabling reinvites? Maybe the initial reinvite for direct RTP is causing the Cisco some grief to begin with. At the very least simplifying the scenario even further will help reduce the signaling interaction.
By: David Woolley (davidw) 2009-05-21 10:57:55

I haven't tried that. It would only be for diagnostic purposes as reinvites are considered critical for our application.

The bad news is that some non-Asterisk work has had its priority raised, so I may not be able to do much for some time. The good news is that my employers are now happy to sign the Digium licence, so I should be able to give you the actual patch that I tried.
By: David Woolley (davidw) 2009-05-21 11:56:19

Attempted forward port of r174644 added.
By: Joshua C. Colp (jcolp) 2009-05-22 08:23:37

davidw: Yes, while critical for your application if we can reduce the amount of signaling interaction it may allow us to better pinpoint exactly what is causing the issue. If it works with reinvites disabled then we know that is the culprit in the first place.
By: David Woolley (davidw) 2009-05-27 07:55:48

I'll see if I can slip in a test in the near future, but it won't be till, at least, next week.
By: David Woolley (davidw) 2009-07-02 10:46:51

I've run a call to a Cisco and alternated hold and unhold on the Cisco phone, using an unpatched 1.6.1.0 tarball version and with canreinvite=no for the Cisco.

I've uploaded a SIP history and sip debug trace for the Cisco side. It doesn't look to me as though the reinvites from Asterisk were upsetting the Cisco.

This does not include the patch that I attached.
By: Olle Johansson (oej) 2009-09-09 03:27:07

This has been working before. I have tested exactly this with Cisco and we got it working, there might even be a comment in the code indicating that.
By: David Woolley (davidw) 2009-09-09 07:14:50

I've just deleted two comments which were bogus, because I got confused between this one and one I just re-opened.
By: David Woolley (davidw) 2009-09-09 07:18:30

Were the previous tests run with MTPs enabled? The problem happens when they are disabled, i.e. the RTP path is optimised in CCM.
By: David Woolley (davidw) 2009-12-03 11:44:31.000-0600

Although this patch seems to work in the context of the existing Asterisk, I am more and more becoming convinced that the whole send/recv only negotiation logic is wrong, and that this problem wouldn't arise, and the original patch wouldn't have needed to be done to 1.4, if it were better.

I think the negotiation should be done in the same way as codec negotiation, with local, peer and joint values, and coding the permissions in each direction as separate bits. Currently one may not need the local values, but consideration of ASTERISK-15145 changes this. It is also possible that ASTERISK-15258 could be done more cleanly with such a structure. (For the benefit of anyone who might try this in parallel, I have chosen to name the working flags relative to their sender, in my prototype code, so a local recvonly combines with a remote sendonly.)

One other thing I have noticed is that the coding of PAGE2_CALL_ONHOLD does not have orthogonal values for the two bits, i.e. if you combine active with sendonly, you get inactive. I haven't found a path through the code that actually breaks, but it ends up using ast_set_flag as though it were a masked assignment, when it is actually a simple bitwise or, and can never clear bits. I'm prototyping on the basis of one bit for sendonly and one bit for recvonly, with only the recvonly one (corresponding to upstream sendonly) actually resulting in hold state processing.

This removes a don't know case, which is treated the same as sendrecv, but something would have to be done with that in order to fix ASTERISK-15204.

---
Corrected transposition reference to ASTERISK-15145.

By: David Woolley (davidw) 2009-12-03 11:58:49.000-0600

Actually the net result is a don't send or don't receive indication, so I think I'm going to have another renaming!
By: David Woolley (davidw) 2009-12-04 11:34:16.000-0600

My gut feeling is that one should be downgrading sendrecv to recvonly (and sendonly to inactive) if one gets a 0.0.0.0 address for the RTP.

However, it looks to me that an inbound a=sendonly or inactive can currently set a state which forces update_call_counter to be called with DEC_CALL_LIMIT during cleanup, but a 0.0.0.0 address doesn't. I'm pretty sure this test is redundant, as SIP_INC_COUNT is an alternative trigger for this behaviour and the count only actually seems to get changed if it was set. However, I'm reluctant to make 0.0.0.0 set don't-send, in case I've missed something.

(Incidentally, although the coding of the PAGE2 flags is a bit perverse (the two conditions aren't really orthogonal, and can't cope with recvonly, it shouldn't break the current code any further because the no hold state is coded as both bits clear.)
By: David Woolley (davidw) 2009-12-04 12:31:46.000-0600

There is, actually, a signficant difference from the codecs list, in that the codecs response is allowed to contain capabilities that weren't in the offer.

On the other hand, Asterisk actually seems to err in the opposite direction, by always offering the mutually compatible codecs from the last round, even if that means they drop off one by one (code read on 1.6.1.0), and it is actually that behaviour which means one cannot directly copy the codec negotiation logic, as it would never head away from a=inactive! The trouble is that this seems to require yet another set of hold state variables in the pvt structure, or add_sdp needs to know whether it is offering or answering. (I think I will go with the latter.)
By: David Woolley (davidw) 2009-12-04 13:08:41.000-0600

Ouch. We should actually be carrying around state for each stream, or computing the lowest common denominator (i.e. if one stream is held, all are held!)

NB This is not a consequence of anything I'm trying to do. It has always been true.

By: Leif Madsen (lmadsen) 2011-07-26 14:20:38.107-0500

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions

If this is still an issue, please open a new issue so it can be re-triaged appropriately. Thanks!