[Home]

Summary:ASTERISK-09749: NOTIFY race condition when state changes happen very fast
Reporter:raarts (raarts)Labels:
Date Opened:2007-06-25 08:47:04Date Closed:2008-06-02 08:40:59
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/Subscriptions
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) bug11231.txt
( 1) cli_bug11231.txt
( 2) cli_tbelder.txt
( 3) cli_tbelder2.txt
( 4) full.notifybug
( 5) notify.1_2_24.patch
( 6) notify.1_4_12.patch
Description:This particular problem was encountered with SNOM360 phones, but it is not specific to it.

What happens:
- SNOM has subscribed a LED to another phone
- SNOM user dials this other phone.
- dialplan adds an answer-after=0 SIP header.
- asterisk send INVITE to the other phone
- other phone send back RINGING, this is sent back to the SNOM as well
- asterisk send NOTIFY 'early' to the SNOM
- meanwhile, the other phone answers very quickly
- asterisk sends OK to INVITE to SNOM to indicate the other phone
 has answered the line, and immediately after that, a NOTIFY 'confirmed'.
 At this point in time the OK to the previous NOTIFY 'early' has not
 been sent yet by the SNOM.

Now the system is confused, because the SNOM finally sends the OK
to the NOTIFY 'early', but asterisk is only expecting an OK to the
last sent NOTIFY, and starts retrying. See attached trace.

I think asterisk should not sent a new NOTIFY. I think ideally NOTIFY's
should be queued on a per-channel basis. Or should asterisk remember
earlier


Comments:By: raarts (raarts) 2007-06-29 09:27:38

After more examination I think that the real problem is
that asterisk for some reason cannot link back an OK response
to the original SUBSCRIBE, and thus does not
kill the retransmission.
Sofar this seems to me to be a bug in __sip_ack().
I'll keep searching.



By: Jason Parker (jparker) 2007-08-29 17:49:54

Are you able to reproduce this on 1.4?

By: raarts (raarts) 2007-08-29 18:29:09

Must confess I haven't tried 1.4 yet, I decided to give it a go
only after Digium starts selling commercial support for it. When they
trust it, I'll try it.

But I bet that if oej reads the description of this bug he'll
either say: "Already know this" or "fixed in 1.4" ;-)

I stopped searching by the way. Couldn't spare the time to go
to the bottom of it, changed the auto-answer=0 to autoanswer=1
customer was ok with this. But it definitely is a problem.
I also encountered it where a SNOM has BOTH a subscription
to itself AND to the phone it is calling. As soon as the call is ringing,
the SNOM receives two NOTIFY's in rapid succession and asterisk
gets confused when it receives the OK to the first notify.

By: Joshua C. Colp (jcolp) 2007-08-30 15:28:17

Okay, feel free to reopen if you find this is applicable to 1.4 as well.

By: jmls (jmls) 2007-10-25 12:32:26

reopened at request of chat on irc. Applies to 1.4

By: jmls (jmls) 2007-10-25 12:32:55

applies to 1.4.12 as well

By: Steve Davies (one47) 2007-10-25 12:36:10

I managed to reproduce this on a 1.4.x test box that used BLF and Page() heavily. The box no-longer exists, so I created the fix on a 1.2.24 box, but the code is the same in 1.2.x and 1.4.x, so the same fix should apply to both.

Basically, if a "200 OK" takes long enough to arrive that a new notify has already been sent, the "ignore" flag is set on the packet with the old CSeq. Sadly, it was ignoring it a little bit too much, and did not process the packet at-all!

Patches will be uploaded for 1.2.24 and 1.4.12.

By: Steve Davies (one47) 2007-10-25 12:40:22

The attached patch is tested here on 1.2.24, but even the comment in the code

      /* ignore means "don't do anything with it" but still have to
          respond appropriately  */

Suggests that the fix is correct.

By: Olle Johansson (oej) 2007-11-28 05:36:45.000-0600

What's the real issue here? The "system is confused"... Does it affect any calls at all?

By: Olle Johansson (oej) 2007-11-28 05:40:04.000-0600

What's happening here is a problem with the lack of a proper transaction engine in Asterisk chan_sip. We can't handle multiple transactions properly in the same dialog.

The easiest patch is propably to cancel the previous notify transaction when we issue a new one in the same dialog. Just give up and let it fly away. If we end up getting a response to it, it doesn't matter any more. It's not like the notify is extremely important as long when we have a new state change on the way to the subscriber.

By: raarts (raarts) 2007-11-28 08:47:17.000-0600

I have been testing this patch in various live situations for a month now, and I am happy to report: it totally fixes the problem for me. Thanks one47!

By: Olle Johansson (oej) 2007-11-28 08:52:11.000-0600

Can someone please tell me what the issue here is? Does the lamp not blink, does the call die?

The patch may cause other issues, so I need to sort out what we really need to fix.

By: raarts (raarts) 2007-11-28 09:10:56.000-0600

Oej, the issue is the following: whenever the race condition (as described in the bug  description) happens, lights on SNOM phones do not go out when they should go out.
We have a customer that has a SNOM sidecar with all LEDs configured as BLF, and whenever they used Page(), all the LEDS on the sidecar were lit, but most stayed on when the Page() ended. According to the customer it looked like a christmas tree ;)
Other customers complained about LEDs staying on forever after using page().
All these problems vanished with this patch. No side effects that I have found.

By: Steve Davies (one47) 2007-11-28 09:21:01.000-0600

oej: In fact, the existing chan_sip code is clever enough to know that a new transaction (notify) may have started before the old one is complete, and keeps a nice list of all of the outstanding ones so they can be retransmitted, with a note of which one is "current" ie. which one is the most recent.

What is happening without the patch is that the OK for an "old" NOTIFY is arriving, and being COMPLETELY ignored. This means that the NOTIFY is never removed from the outstanding-work list, and starts to be retransmitted.

Each retransmit receives an OK, which is also ignored because the transaction is not at the "top of the stack".

The code change simply allows the OK to be processed, the old NOTIFY timer is then stopped and removed from the retries list, and everyone is happy. The code even looks as if it was written to work in the manner specified by the patch, but that there was a typo during the initial execution :)

Like some other people, I now have this in production at several sites, and have yet to find a problem (of course I realise that this means nothing :) )

By: Steve Davies (one47) 2007-11-28 09:28:15.000-0600

PS. This occurs on several models of phone that support BLF (Aastra, Polycom, SNOM), and the symptoms are similar - BLF lamps showing "random" states - across all manufacturers.

By: Olle Johansson (oej) 2007-11-28 09:28:30.000-0600

Ok, I suspected there was a problem with the lights, but it wasn't clear in the report.

So in fact, cancelling the existing transaction, so that the previous state is not transmitted after the most recent one would solve the issue too and be a better solution.

I'm happy that the patch works, but like many patches, it's solving one issue and may cause others in other situations. I need to analyze that in order to be sure on what to do, but the most efficient way as I see it is to make sure that the list of outstanding NOTIFY transactions is simply cancelled whenever we have a new update. That wouldn't confuse any phone. Asterisk might reveive a 200 OK on a Notify that is not in the list any more, but that wouldn't hurt anything either, just cause a log message. That's way better than having a series of christmas lights on the sidecar.

The phones should also be a bit more intelligent and observe the Cseq and not bother with turning any lamps on when an obvious retransmit arrives, but that's a different issue... :-)

By: Steve Davies (one47) 2007-11-28 10:17:57.000-0600

You are probably right about unforseen consequences, thought I had a thorough look at the code, and I had more of a problem with the broken "ignore a request"  behaviour than with calling the handle-response code, which certainly seems to understand the concept of an old transaction. Particularly when the patch is so trivial!

Delete old notify transactions??? I suppose... Seems a bit nasty to me though when there appears to be an opportuninity to behave "correctly" :-P There may be other cases that you need to handle specially for other response types too.

Just my 2p.

And yes, CSeq and notify version number!!! These phones really could do better sometimes :)

By: Olle Johansson (oej) 2007-11-28 10:21:26.000-0600

Yes, the whole "ignore" stuff is really, really, really bad. And stinks.

Ok, if you went through the code that much, I'll just make a quick review. I have a few years of experience of being in battle with this source, but there's still things I don't understand fully...

By: Steve Davies (one47) 2007-11-28 10:35:22.000-0600

I will magnanimously leave the final choice to you :) I did my best to trace the implications, but as you say, chan_sip.c is hard!

By: Steve Davies (one47) 2008-01-10 04:27:37.000-0600

As an update, I have now got this patch applied to 24 systems, operating with a variety of snom, Aastra, Linksys, Sipura and Polycom phones/ATAs. (Our installs are typically between 20 and 200 phones per site)

I have not seen a problem yet, and it does fix the reported issue.

By: raarts (raarts) 2008-01-10 05:20:55.000-0600

Same here, many systems rolled out with this patch, in various configurations and with various brands of phones. No issues. I think this patch solves a pretty serious problem for any installation that uses Page() and BLF together.

By: Frederic LE FOLL (flefoll) 2008-01-10 05:50:25.000-0600

This issue seems very similar to issue 11231 http://bugs.digium.com/view.php?id=11231 or former issues 9567 http://bugs.digium.com/view.php?id=9567, 10915 http://bugs.digium.com/view.php?id=10915, 10946 http://bugs.digium.com/view.php?id=10946, all due (or partly due) to chan_sip performing inappropriate retransmits (consequences may vary, depending on devices ...) because it explicitely ignores SIP responses where CSeq is not the last highest CSeq that it sent. Thus, the problem arises each time chan_sip sends a "too fast" series of requests (within a SIP dialog).

The patch that one47 proposes keeps the ast_set_flag SIP_PKT_IGNORE and SIP_PKT_IGNORE_RESP but continues handling the response. I proposed a very similar patch for issue 10946 that merely dropped the "ignore" code section : it is the same as one47 patch, except that SIP_PKT_IGNORE and SIP_PKT_IGNORE_RESP were not set.

In reply to Olle questioning about this "ignore" section, imho, it may originate in a copy/paste of code that is pertinent when handling a REQUEST, certainly not when handling a RESPONSE : you can find the equivalent "if (p->icseq && (p->icseq > seqno))" (versus "if (p->ocseq && (p->ocseq != seqno))") just below the section that handle responses, when entering requests handling.

By: Joerg Staedele (brainy) 2008-01-21 05:50:38.000-0600

This issue is still NOT fixed!! When using the above patch everything works fine. Please add the fix to the sourcetree. Thanks

Since it's just replacing an "} else if" to an "} if" it shouldn't cause much problems.



By: Olle Johansson (oej) 2008-01-27 04:51:23.000-0600

please try the patch in ASTERISK-11205 on latets 1.4 from svn. Thanks.

By: Olle Johansson (oej) 2008-02-03 10:49:21.000-0600

Not solved in svn 1.4, but a patch for testing uploaded. Please test.

By: Olle Johansson (oej) 2008-02-03 10:50:30.000-0600

The patch applies to 1.4 btw.

By: Steve Davies (one47) 2008-02-04 04:10:57.000-0600

Oh I like that latest patch... I will still apply the original patch too, but I like the potential performance boost of not spamming a phone with Notify packets during a busy update flurry.

There is a (vanishingly small) risk that a lost OK packet would cause some kind of backlog. I assume that the usual expiry/retry timers would handle that in the usual way?

By: Olle Johansson (oej) 2008-02-04 04:19:48.000-0600

Lost OK on NOtify will not affect the call.

By: raarts (raarts) 2008-02-04 06:00:31.000-0600

Just a note, I am the original reporter, but we cannot test the
patch in 0011736 because we are running 1.2, so you'll have to
rely on one47 for further feedback.

By: Theo Belder (tbelder) 2008-02-06 01:36:28.000-0600

I have a simular problem here with Asterisk 1.4.17:

A Snom device is subscribed to extension 408 which handles the state of SIP/408 and SIP/102 (exten = 408,hint,SIP/408&SIP/102)

When calling extension 408 both phones are ringing. And the following state are sent:
Asterisk: NOTIFY with ringing cseq 106
Snom: OK
-- SIP/408 picked up the phone and SIP/102 stops ringing
Asterisk: NOTIFY with inuse+ringing, cseq: 107
Asterisk: NOTIFY with inuse, cseq: 108
Snom: OK, cseq 107
Snom: OK, cseq 108
Asterisk: retransmits cseq 107 for more than 5 times
Snom: OK cseq 107 (as many asterisk sends the NOTIFY message)

Now the snom device indicates that SIP/408 or SIP/102 still is ringing.

-- SIP/408 hangs up the phone

Asterisk: NOTIFY with idle, cseq: 109
Snom: OK, cseq 109


You can see this behavior in the attached file.



By: Theo Belder (tbelder) 2008-02-06 01:42:02.000-0600

When applying the patch in 0011736, asterisk doesn't retransmit the NOTIFY messages, but also never sends the InUse message, so the Snom phone indicates one of the phones (SIP/408 or SIP/102) is ringing, while SIP/408 is on the phone ...



By: Olle Johansson (oej) 2008-02-06 02:15:52.000-0600

tbelder: That's awkward, because the ringing state is normally very shortlived and should in theory always replaced by a more final message - answer or hangup from caller. Oh, I see. You're using the combined state of two phones. That may cause unexpected issues that I did not consider. Can you upload a debug file with that call trace? Thanks.

By: Theo Belder (tbelder) 2008-02-06 03:08:07.000-0600

file uploaded with:
core set verbose 4
core set debug 4
sip set debug peer 406 (is subscribed to 408)

By: Olle Johansson (oej) 2008-02-12 02:24:17.000-0600

tbelder: Are you sure this latest debug is from the latest 1.4 asterisk version (with the patch) ?

I don't think so since we're sending a new NOTIFY without waiting for reply on the first one in your debug file.

By: Theo Belder (tbelder) 2008-02-12 07:17:51.000-0600

olle: this trace is not with the patch...
Which patch do i have to apply? The notify.1_4_12.patch or the patch from 0011736?

By: Olle Johansson (oej) 2008-02-12 08:48:01.000-0600

The patch called "bug11231.txt" above. Applies to 1.4.

By: Theo Belder (tbelder) 2008-02-13 01:48:00.000-0600

olle: i've applied your patch. Some result here when one of the phones answers the call. The snom device indicates that SIP/408 or SIP/102 still is ringing.

See the uploaded file:
core set verbose 4,
core set debug 4,
sip set debug peer 406

By: Joerg Staedele (brainy) 2008-02-23 05:04:20.000-0600

We still have problems with the lastest chan_sip and the LED's on the Snom phones.

We have the following dialrule:

exten => 1,1,Dial(SIP/foo,5)
exten => 1,2,Dial(SIP/foo&SIP/bar)

When a call comes in, the LED for "SIP/foo" is blinking, which is correct. After 5 seconds only the "SIP/bar" is blinking and the asterisk tells something about that the devicestate for "SIP/foo" was changed to "Invalid" and then to "Ringing" ... My guess is, thats because the status for "SIP/foo" changes too fast from "Ringing" to "Idle" to "Ringing"

We're still using the "simple 3rd party patch" (removing the "else if" to an "if") and it's working almost well for us.

But i would appreciate it if it's officially being fixed.

Thanks

By: Theo Belder (tbelder) 2008-03-17 09:54:23

oej, haven't heard from you since a while. Are you perhaps waiting on more information from me?

By: Steve Davies (one47) 2008-03-17 12:01:35

Given that I (and seemingly others) am widely using the "else if" to "if" fix from this bug so widely now, I suspect that it is a valid fix.

The code changes that are in ASTERISK-10752/ASTERISK-11205 are more of a feature enhancement, which will improve the code and reduce or remove the need for this change in a future release.

By: Olle Johansson (oej) 2008-03-18 02:22:04

Sorry for disappearing for a while, I've been working on customer projects and have been travelling for a few weeks.

Will test a few issues. The problem seems to come up when the hint involves more than one device mostly. I don't see it when the hint only has one device.

By: Theo Belder (tbelder) 2008-04-18 01:53:53

oej, do you have forgotten this case or are you really busy ;) ?

By: Martin Vit (festr) 2008-04-18 03:44:41

try the latest 1.4 svn. i've found some bug which oej repaired and commit.

By: Theo Belder (tbelder) 2008-05-28 09:13:04

The issue I described in note ASTERISK-7955740 with a subscription on more than 1 device is fixed in r114148 of chan_sip.c

Revision log of chan_sip.c r114148:
Handle subscribe queues in all situations... Thanks to festr_ on irc for telling me about this bug.



By: Joshua C. Colp (jcolp) 2008-06-02 08:39:06

After researching, looking at code, and testing I have come to the conclusion that this issue has been solved by the fix from 11736.