[Home]

Summary:ASTERISK-13323: [patch] Realtime peers are never qualified after 'sip reload'
Reporter:Peter Fern (pdf)Labels:
Date Opened:2009-01-08 01:04:52.000-0600Date Closed:2009-02-16 20:07:19.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/DatabaseSupport
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20090109__bug14196__trunk.diff.txt
( 1) 20090116__bug14196.diff.txt
( 2) 20090120__bug14196_1.4.diff.txt
( 3) realtime-poke_1.4.patch
Description:Realtime SIP peers all report status as 'UNKNOWN' when re-initialised after 'sip reload'.  Peers are never sent OPTIONS qualify packet after reload, unless they re-register.



   -- Registered SIP 'nnn' at xxx.xxx.xxx.xxx port 5060
pabx*CLI> sip show peers
Name/username              Host            Dyn Nat ACL Port     Status     Realtime  
nnn/nnn                    xxx.xxx.xxx.xxx       D          5060     OK (35 ms) Cached RT
1 sip peers [Monitored: 1 online, 0 offline Unmonitored: 0 online, 0 offline]


pabx*CLI> sip reload
Reloading SIP
 == Parsing '/etc/asterisk/sip.conf': Found
 == Parsing '/etc/asterisk/users.conf': Found
 == Parsing '/etc/asterisk/sip_notify.conf': Found
pabx*CLI> sip show peers
Name/username              Host            Dyn Nat ACL Port     Status     Realtime  
nnn/nnn                    xxx.xxx.xxx.xxx       D          5060     UNKNOWN    Cached RT
1 sip peers [Monitored: 0 online, 1 offline Unmonitored: 0 online, 0 offline]




****** ADDITIONAL INFORMATION ******

sip.conf:
[general]
context=default
allowoverlap=no
bindport=5060
bindaddr=0.0.0.0
srvlookup=yes
allowsubscribe=yes
notifyringing=yes
notifyhold=yes
rtcachefriends=yes
rtupdate=yes
rtautoclear=no
Comments:By: Peter Fern (pdf) 2009-01-08 01:05:55.000-0600

Related: http://bugs.digium.com/view.php?id=13316

Lack of OPTIONS was determined using TCP dump on dedicated test system.



By: Leif Madsen (lmadsen) 2009-01-09 11:07:21.000-0600

Ummmm... actually how the heck do you get the peer to even show up after the reload? For me it only shows up when I've gotten the REGISTER from the peer. After a reload the memory is cleared (as I would expect), and only upon a reload would the peer be loaded into memory.

I'm using your configuration here too.

By: Leif Madsen (lmadsen) 2009-01-09 11:08:23.000-0600

*CLI> sip show peers
Name/username              Host            Dyn Nat ACL Port     Status     Realtime  
105/105                    192.168.128.129  D          21868    OK (3 ms)  Cached RT
1 sip peers [Monitored: 1 online, 0 offline Unmonitored: 0 online, 0 offline]

*CLI> sip reload
*CLI>  Reloading SIP
 == Parsing '/etc/asterisk/sip.conf': Found
 == Parsing '/etc/asterisk/users.conf': Found

*CLI>   == Parsing '/etc/asterisk/sip_notify.conf': Found
sip show peers
Name/username              Host            Dyn Nat ACL Port     Status     Realtime  
0 sip peers [Monitored: 0 online, 0 offline Unmonitored: 0 online, 0 offline]
*CLI>



By: Leif Madsen (lmadsen) 2009-01-09 11:11:06.000-0600

After talking to a developer on IRC, he says the code explicitly removes the cache from memory upon reload, so perhaps you need to provide more configuration files and information on how to reproduce this.

By: Peter Fern (pdf) 2009-01-09 11:31:07.000-0600

Peers appear to be reinitialized on schedule - in my test env they're reloaded after a few seconds post `sip reload`, and are also reinitialized on demand when a call is placed to/from the peer.

I don't believe there are any further pertinent configs beyond what I've already posted.  I can post SQL schema if required, however I have verified that all fields are present as documented.

http://bugs.digium.com/view.php?id=12921#91464 suggests that code was proposed to allow hosts to persist across reloads?

I'll dig into chan_sip and try to identify the related code if I can, however I'm not overly familiar with the realtime code - is Corydon76 available for comment?

By: Leif Madsen (lmadsen) 2009-01-09 12:35:14.000-0600

I have finally reproduced the issue. Your note gave me the clue.

It appears the peer is getting loaded into memory because the peer is being called. It then has no status until the next time it registers. I have to check with Corydon76 to determine if that is expected behaviour or not.

By: Leif Madsen (lmadsen) 2009-01-09 12:51:49.000-0600

Assigning this to Corydon76 after a conversation with him on IRC. It seems this could be fixed by saving the last known state of the peer in the RT database. However, this type of fix can not go into the 1.4 branch, so it would be available in trunk, and further to 1.6.2 in the future once branched from trunk.

I will allow Corydon76 to comment further should he feel it necessary. Thanks!

By: Tilghman Lesher (tilghman) 2009-01-09 16:01:30.000-0600

Here is a patch for 1.4, if you like.  It would require you to add the field "lastms" of type char(11) to your realtime table to make it work, however.

By: Peter Fern (pdf) 2009-01-13 23:48:40.000-0600

Thanks for prompt response, and apologies for the delay.  I've tried this out, and lastms never appears to be updated.

The following patch works for me without schema updates, perhaps you could comment on correctness?

By: Peter Fern (pdf) 2009-01-14 01:10:59.000-0600

OK then, scratch that patch - very bad news... I see why the need to qualify the poke with > 0, but it never seems to get called then... the peer is not qualified at register time



By: Peter Fern (pdf) 2009-01-14 03:05:07.000-0600

Nope, it's beyond me without spending much more time in the chan_sip code - I can get the qualify to work when the peer is alive, but once the peer goes away (after `sip reload`), the poke_peer goes into a spiral and floods the device off the network with OPTIONS.  Obviously the sched is not destroyed when the peer is removed from memory - I tried achieving this with a different approach subsequent to my previous (incorrect) patch, but was unsuccessful.

This is with 'rtcachefriends=yes', without which, I actually get _0_ peers loaded into memory from realtime, ever.  Not only do the peers not show up with rtcachefriends=no, I get the OPTIONS spiral of death mentioned above, on vanilla 1.4.22.1, for every peer that registers - which hammers the phone, throttles the network, and flogs Asterisk.

So, the two major problems:

1. Realtime peers are not loaded into memory when loading or reloading chan_sip (and never at all without rtcachefriends).
2. sip_poke_peer is only activated for realtime peers when registering using rtcachefriends=yes, not at restart or reload.  With rtcachefriends=no, it is the same, except that the peer is _never_ loaded into memory, so sip_poke_peer is called recursively until it floods the phone off network.

These are serious regressions, and unfortunately not only does the first need addressing, but Corydon76's patch doesn't resolve the second here either.  So, any ideas on where we go from here?  This is hurting a number of our clients, and whilst I'd prefer not to have to revert back to 1.4.21.x, I'm going to have to do so immediately.  I'm happy to keep testing with you guys to resolve this though, once I'm done reverting things.



By: Peter Fern (pdf) 2009-01-14 09:49:35.000-0600

OK, looks like this bug goes back a while.  In the mean time, I see the problem with Corydon76's patch - missing a key on the ast_update_realtime.  With the update keyed by "name", results do get into the DB.  However, this code is only hit at register time, so devices will now work properly until they re-register.  If I put an update_peer in the handle_response_peerpoke, like so:

--- channels/chan_sip.c.dist    2009-01-15 00:46:12.197536110 +1100
+++ channels/chan_sip.c 2009-01-15 02:21:06.906785966 +1100
@@ -12739,6 +12739,8 @@
               ASTOBJ_UNREF(peer_ptr, sip_destroy_peer);
       }

+       update_peer(peer, ast_sched_when(sched, peer->expire));
+
       ast_set_flag(&p->flags[0], SIP_NEEDDESTROY);

       /* Try again eventually */

Things start to work nicely, in fact it seems to fix a number of situations, ie - peers get populated properly when they place an inbound call, as well as when an outbound call is placed, and when the peer is explicitly loaded after a 'sip reload'.

Obviously it's not the most efficient option as I'm updating the whole peer every qualify, so I'm open to suggestions... attaching the updated patch with the fix for Corydon76's code, and the diff in this comment.


PS - I can't seem to reproduce the OPTIONS flood now, but there may be some combination of sip.conf settings that will produce it.

By: Olle Johansson (oej) 2009-01-15 07:43:13.000-0600

Why use realtime when you need the peers in memory all the time for qualify?

The realtime system wasn't really designed for this. You can still load all peers from database though.

By: Peter Fern (pdf) 2009-01-15 10:56:57.000-0600

Can you elaborate?

By: Olle Johansson (oej) 2009-01-15 11:05:21.000-0600

Realtime dynamic peers was designed not to be in memory, but be released as soon as possible after registration or a call. Functions like qualify and notifications require that peers are active in memory, thus should be static, like the ones you configure in sip.conf.

Adding qualify or mailbox to realtime dynamic peers create a strange hybrid that the SIP realtime architecture was never designed for and we now have a big discrepancy after what people expect - the same functionality as static peers and what it was designed for - dynamic peers.

I've asked for input on a redesign several times on the asterisk-dev mailing list with no answers. I believe this is a growing issue, and with all the patches to realtime dynamic peers (caching and stuff)  the situation gets worse and worse and we are reaching a collapse... To be a bit dramatic.

I think we have to make handling of static realtime peers easier. Load from manager and cli, implement real caching so that we remove peers that haven't been used for some time and don't have any requirements for staying in memory and such stuff.

Long answer to a short question. Sorry.



By: Olle Johansson (oej) 2009-01-15 11:06:58.000-0600

Btw, that's my personal opinion. Other developers might or not agree with me.

By: Tilghman Lesher (tilghman) 2009-01-15 13:27:36.000-0600

oej: going forward, qualify is automatically turned off for uncached peers.  We did this to make the system a little more consistent.  However, realtime peers for which rtautoclear is enabled behave rather sanely now.  They answer qualifies, and when they expire, they disappear out of memory.  I think this provides for an excellent balance between the desire to configure users from a database and keeping structures in memory when they need to be kept there.

pdf: that was excellent feedback.  I think the answer lies in only storing the qualify value when it changes state.  For example, if the state changes between the states UNKNOWN, OK, UNREACHABLE, and LAGGED, then we store the value; otherwise, we skip the realtime update step.

By: Peter Fern (pdf) 2009-01-16 01:54:07.000-0600

@oej: I hear where you're coming from - there is definitely a gap between user expectations and the implementation, and agree with your closing comments.

@Corydon76: What you suggest seems sane enough, however I think this may be incorrect:

@@ -12817,6 +12823,7 @@
       ast_log(LOG_NOTICE, "Peer '%s' is now %s. (%dms / %dms)\n",
           peer->name, s, pingtime, peer->maxms);
       ast_device_state_changed("SIP/%s", peer->name);
+       ast_update_realtime("sippeers", "name", peer->name, "lastms", "-1", NULL);
       manager_event(EVENT_FLAG_SYSTEM, "PeerStatus",
           "Peer: SIP/%s\r\nPeerStatus: %s\r\nTime: %d\r\n",
           peer->name, s, pingtime);


I think this needs to be:
@@ -12725,11 +12725,13 @@
       peer->call = NULL;
       if (statechanged) {
               const char *s = is_reachable ? "Reachable" : "Lagged";
+               char str_lastms[20];
+               snprintf(str_lastms, sizeof(str_lastms), "%d", pingtime);

               ast_log(LOG_NOTICE, "Peer '%s' is now %s. (%dms / %dms)\n",
                       peer->name, s, pingtime, peer->maxms);
               ast_device_state_changed("SIP/%s", peer->name);
-               ast_update_realtime("sippeers", "name", peer->name, "lastms", "-1", NULL);
+               ast_update_realtime("sippeers", "name", peer->name, "lastms", str_lastms, NULL);
               manager_event(EVENT_FLAG_SYSTEM, "PeerStatus",
                       "Peer: SIP/%s\r\nPeerStatus: %s\r\nTime: %d\r\n",
                       peer->name, s, pingtime);


for this approach to work?

By: Olle Johansson (oej) 2009-01-16 02:02:48.000-0600

If we're going down this road, saving it in the database, it has to be a configurable option, as it adds considerable load to a system. In a clustered environment, we only want the registration host to run qualify and save data. Other hosts, that will accept incoming calls, should not send any OPTIONs or store any data. Please check that this is the case.

By: Tilghman Lesher (tilghman) 2009-01-16 09:09:33.000-0600

oej: I don't think it adds much load at all to a system, since we're only storing it at the delta of states, not at every options response.  In terms of your cluster environment, with the current code, every host would send a qualify anyway, if caching is turned on.  There's far too much that is not explained about your cluster environment for me to know what the expected behavior would be.

By: Peter Fern (pdf) 2009-01-19 23:44:25.000-0600

I've been running this code (including change from comment 0098001, merged and uploaded as 20090120__bug14196_1.4.diff.txt) on a lightly loaded test system for the past few days and it's been working fairly well. I'll be deploying it to a more heavily utilized system shortly.

By: Tilghman Lesher (tilghman) 2009-02-09 14:34:37.000-0600

pdf:  have you had any trouble running this on your more heavily utilized system?

By: Peter Fern (pdf) 2009-02-16 17:47:18.000-0600

Sorry I haven't replied Corydon, we noticed a problem with peers being destroyed with long registration expires, and I wanted to make sure they weren't in any way related.  I don't believe they are, but I haven't had enough time to adequately diagnose to be entirely certain, and have not been able to reproduce in the lab, though I'm leaning towards it possibly being a Polycom issue.

As to the functionality this is meant to provide, it has been working well, and I've since deployed it to a couple more systems without any reported issues.



By: Digium Subversion (svnbot) 2009-02-16 18:49:25.000-0600

Repository: asterisk
Revision: 176426

U   branches/1.4/channels/chan_sip.c

------------------------------------------------------------------------
r176426 | tilghman | 2009-02-16 18:49:24 -0600 (Mon, 16 Feb 2009) | 10 lines

After a 'sip reload', qualifies for realtime peers weren't immediately
restarted, instead waiting until the next registration.  We're now
caching the qualify across a reload/restart and starting the qualify
immediately upon loading the peer.
(closes issue ASTERISK-13323)
Reported by: pdf
Patches:
      20090120__bug14196_1.4.diff.txt uploaded by pdf (license 663)
Tested by: pdf

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176426

By: Digium Subversion (svnbot) 2009-02-16 19:58:59.000-0600

Repository: asterisk
Revision: 176459

_U  trunk/
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r176459 | tilghman | 2009-02-16 19:58:58 -0600 (Mon, 16 Feb 2009) | 17 lines

Merged revisions 176426 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
 r176426 | tilghman | 2009-02-16 18:49:22 -0600 (Mon, 16 Feb 2009) | 10 lines
 
 After a 'sip reload', qualifies for realtime peers weren't immediately
 restarted, instead waiting until the next registration.  We're now
 caching the qualify across a reload/restart and starting the qualify
 immediately upon loading the peer.
 (closes issue ASTERISK-13323)
  Reported by: pdf
  Patches:
        20090120__bug14196_1.4.diff.txt uploaded by pdf (license 663)
  Tested by: pdf
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176459

By: Digium Subversion (svnbot) 2009-02-16 20:04:59.000-0600

Repository: asterisk
Revision: 176460

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_sip.c

------------------------------------------------------------------------
r176460 | tilghman | 2009-02-16 20:04:58 -0600 (Mon, 16 Feb 2009) | 24 lines

Merged revisions 176459 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r176459 | tilghman | 2009-02-16 19:58:39 -0600 (Mon, 16 Feb 2009) | 17 lines
 
 Merged revisions 176426 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r176426 | tilghman | 2009-02-16 18:49:22 -0600 (Mon, 16 Feb 2009) | 10 lines
   
   After a 'sip reload', qualifies for realtime peers weren't immediately
   restarted, instead waiting until the next registration.  We're now
   caching the qualify across a reload/restart and starting the qualify
   immediately upon loading the peer.
   (closes issue ASTERISK-13323)
    Reported by: pdf
    Patches:
          20090120__bug14196_1.4.diff.txt uploaded by pdf (license 663)
    Tested by: pdf
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176460

By: Digium Subversion (svnbot) 2009-02-16 20:07:19.000-0600

Repository: asterisk
Revision: 176461

_U  branches/1.6.1/
U   branches/1.6.1/channels/chan_sip.c

------------------------------------------------------------------------
r176461 | tilghman | 2009-02-16 20:07:18 -0600 (Mon, 16 Feb 2009) | 24 lines

Merged revisions 176459 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r176459 | tilghman | 2009-02-16 19:58:39 -0600 (Mon, 16 Feb 2009) | 17 lines
 
 Merged revisions 176426 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r176426 | tilghman | 2009-02-16 18:49:22 -0600 (Mon, 16 Feb 2009) | 10 lines
   
   After a 'sip reload', qualifies for realtime peers weren't immediately
   restarted, instead waiting until the next registration.  We're now
   caching the qualify across a reload/restart and starting the qualify
   immediately upon loading the peer.
   (closes issue ASTERISK-13323)
    Reported by: pdf
    Patches:
          20090120__bug14196_1.4.diff.txt uploaded by pdf (license 663)
    Tested by: pdf
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=176461