|Summary:||ASTERISK-13979: Multiple "unreachable" and "reachable" messages for each SIP registrations|
|Reporter:||Michael Gaudette (bluefox)||Labels:|
|Date Opened:||2009-04-17 13:18:46||Date Closed:||2011-06-07 14:08:21|
|Description:||Shown in the CLI:|
[Apr 17 14:16:02] NOTICE: chan_sip.c:13019 handle_response_peerpoke: Peer '1234-2' is now Reachable. (32ms / 2000ms)
[Apr 17 14:16:02] NOTICE: chan_sip.c:13019 handle_response_peerpoke: Peer '1234-2' is now Reachable. (32ms / 2000ms)
[Apr 17 14:16:02] NOTICE: chan_sip.c:13019 handle_response_peerpoke: Peer '01234-2' is now Reachable. (29ms / 2000ms)
[Apr 17 14:16:02] NOTICE: chan_sip.c:13019 handle_response_peerpoke: Peer '1234-2' is now Reachable. (31ms / 2000ms)
[Apr 17 14:16:02] NOTICE: chan_sip.c:13019 handle_response_peerpoke: Peer '1234-2' is now Reachable. (41ms / 2000ms)
|Comments:||By: Joshua C. Colp (jcolp) 2009-05-21 08:27:54|
Can you please provide more information with this issue? Are you using realtime with caching and qualify enabled? Can you please provide a complete console log with debug set to go to console in logger.conf and "core set debug 4" executed in the CLI?
By: Michael Gaudette (bluefox) 2009-05-21 08:31:01
I can't, I reverted a while ago because of another bug (deadlocks). I can say I was (and still am) using realtime and qualify, and caching (rtcachefriends=yes).
By: Tilghman Lesher (tilghman) 2009-06-29 15:20:33
Assigning to lmadsen for reproduction
By: sohosys (sohosys) 2009-08-10 15:24:03
we are also experiencing this. version 126.96.36.199. sometimes as many as 15 unreachable log entries per second per event and 70 unreachable events per second per event, all for the same peer. heavy system load.
it looks like either the qualify timer values are not what is set in the code (60sec OK, 10sec UNREACHABLE) or the SIP qualify packets are being retransmitted and every retransmission and/or late reply is creating a new event.
sample console output
[Aug 10 13:40:53] NOTICE: chan_sip.c:16612 sip_poke_noanswer: Peer '5555267136' is now UNREACHABLE! Last qualify: 125
[Aug 10 13:40:56] NOTICE: chan_sip.c:16612 sip_poke_noanswer: Peer '5555267136' is now UNREACHABLE! Last qualify: 1474
[Aug 10 13:40:57] NOTICE: chan_sip.c:16612 sip_poke_noanswer: Peer '5555267136' is now UNREACHABLE! Last qualify: 119
[Aug 10 13:40:58] NOTICE: chan_sip.c:16612 sip_poke_noanswer: Peer '5555267136' is now UNREACHABLE! Last qualify: 126
[Aug 10 13:41:03] NOTICE: chan_sip.c:12985 handle_response_peerpoke: Peer '5555267136' is now Reachable. (120ms / 2000ms)
[Aug 10 13:41:06] NOTICE: chan_sip.c:12985 handle_response_peerpoke: Peer '5555267136' is now Reachable. (93ms / 2000ms)
[Aug 10 13:41:08] NOTICE: chan_sip.c:12985 handle_response_peerpoke: Peer '5555267136' is now Reachable. (101ms / 2000ms)
[Aug 10 13:41:08] NOTICE: chan_sip.c:12985 handle_response_peerpoke: Peer '5555267136' is now Reachable. (127ms / 2000ms)
By: sohosys (sohosys) 2009-08-10 15:30:08
i should also add that there is liekley a real network path issue causing the peer to go unreachable, and that we had a system stop proccessing calls after several minutes of this log activity. presumably the result of many 'lastms' mysql database updates..., several hundred events per second.
By: Michael Gaudette (bluefox) 2009-08-10 16:21:25
I would bump up the severity to major actually. This is also happening with 1.4.26, but more to the point it even killed a customer's internet connection (I do hosted PBX) by seemingly flooding his Internet links with something (not sure what, likely qualify messages). I had to restart Asterisk before link usage became normal.
Usually it's just a double "UNREACHABLE" message, but that time there were about 15 similar messages every few seconds, going from UNREACHABLE to reachable (x ms) in quick succession.
I think this seemingly benign problem is hiding something that could hurt somebody using "qualify" occassionally.
By: sohosys (sohosys) 2009-08-10 16:59:50
I would agree that "trivial" is not correct based on the fact that we too experienced a race or deadlock situation as a result that required a restart to correct. We too had a spike in network traffic that appears to be related.
bluefox, are you realtime settings the same as what i posted above? how many sip peers on your box? we have about 500 on this box. With a near identical configuration (less some minor 1.4 specific tweaks) we run about 2000 on our 1.2 boxes without this issue.
By: sohosys (sohosys) 2009-08-10 17:14:34
what is the datatype of your lastms field in your mysql table? does it allow nulls? i think i am seeing in the cdoe that there are times when the lastms needs to be set to null, but at least in our db null is not allowed for lastms.
By: Michael Gaudette (bluefox) 2009-08-10 17:25:31
I do not allow NULL for lastms, as per the Wiki (serves me right for trusting it)
The rest of my setting are the same as yours, and my peer count is probably close between 800. (alot of them permanently UNREACHABLE as they are just unused peers). Probably around 400-500 active peers.
Can we can confirmation that lastms should allow NULL?
By: sohosys (sohosys) 2009-08-10 17:38:52
for the reason stated in this bug https://issues.asterisk.org/view.php?id=15615 i avoid allowing null in a column unless it is required. I also got my field definition for lastms from the wiki, and know of no other (reliable) source of this information. we will need to wait for comment from a qualifed programmer, which i am not. I can see how this might casue a loop where we qualify a known unreachable peer, but cant explain why the multpile reachables are logged, except that maybe the peer tries to respond to all options queries that were sent during the loop.
By: sohosys (sohosys) 2009-08-11 00:21:33
allowing null in the lastms field did not resolve this. a reload is required to reproduce the problem. shortly after the reload, despite many peers that have registered again, all peers go unreachable and the database lastms value is updated to '-1' for all peers
By: sohosys (sohosys) 2009-08-11 09:38:23
bluefox, trying to find similarities between our system to isolate this. one thing I have noticed is that 1.4 responds to SIP NOTIFY sent by our ATAs with "489 bad event", and then tears down the dialog. 1.2 gracefully responded to NOTIFY from the ATA with "603 declined". we have all of our ATAs configured to sent NAT keepalives to keep firewalls and NAT devices open, and the message they are configured to periodically send is NOTIFY. Do you also have ATAs set to use NOTIFY as a NAT keepalive?
By: Michael Gaudette (bluefox) 2009-08-12 08:37:26
Not sure. I am using almost exclusively Polycom phones, and some softphones (eyeBeam/Xlite). The Polycom phones are configured with keepalive enabled (whatever Polycom does in that respect), the softphone are set with mostly default values.
I can't say about it happening after a reload (you mean SIP reload I assume), but I will keep an eye out.
By: sohosys (sohosys) 2009-08-12 09:01:42
both the polycom and the xlite are capabable of sending notify messages for NAT keepalive. would you be able to run sip debug for one of each device and see if you are getting a SIP 489, bad event for each notify packet, and if it is followed by some message about ending the sip dialog with the phone? we are going to try turning off notify on the endpoints for a while.
By: Michael Gaudette (bluefox) 2009-08-18 19:10:11
Sorry for the delay, was on vacation.
I don't see a SIP 489, but I do see plenty of
[Aug 18 20:02:36] NOTICE: chan_sip.c:8866 check_auth: Correct auth, but
based on stale nonce received from '<sip:firstname.lastname@example.org>'
(188.8.131.52 being my Asterisk's IP address and xxxxxxxxxxx-x being the SIP peer name)
When i turn on debug. Many many of those, for alot of peers not onyl the one I am debugging.
I have run two sip reloads today, and my bandwidth (normally around 800kbit/s at this time of day) is now at around 8 MBits/s. It's all in bandwidth, outbound things look normal. According to Wireshark's protocol hierarchy tool, this unexplained extra 8Mbit/s is all SIP.
When I can restart the whole Asterisk (when the server is idle), will let you know if all this stops as I suspect it will.
By: Michael Gaudette (bluefox) 2009-08-18 20:06:43
Ok, here is what happened now:
I did this:
"sip prune realtime PEERNAME" then "sip show peer PEERNAME load" (as described in Wiki) to reload one specific peer (one that was giving my multiple UNREACHABLE messages". This also happens regularly, through a web app, because I need to change values in SIP realtime.
Specifically, after the first command (prune) my specified peer started giving me MANY MANY UNREACHABLE messages. After the second, everything went to heck.
I am pretty sure this is the cause of the bug. Suddently my server starts throwing up 11Mbit/s, link was totally congested, I had to restart the server (restart now in CLI). Now everything works fine.
Please bump this up to "Major"
By: Leif Madsen (lmadsen) 2009-09-03 08:20:12
I'm assigning this back to Tilghman as it looks like a lot of information that may be useful has been posted. It also looks as if it will be very difficult for me to reproduce since it appears to be based on having several hundred peers using qualify.
By: Tilghman Lesher (tilghman) 2009-09-03 10:48:07
Could one of you possibly provide a packet capture while this problem is occurring? Given the difficulty of reproducing this in a lab environment, I'm going to need some information to track this down.
By: Michael Gaudette (bluefox) 2009-09-03 10:52:16
I have a few packet captures on my HD, but I feel uncomfortable putting it here where it's publicly available. Any way to send it and not have it public?
By: Tilghman Lesher (tilghman) 2009-09-03 12:08:48
I can make the issue private. That will cut off sohosys for the duration, but it will also make the issue invisible for all but managers and administrators.
By: Michael Gaudette (bluefox) 2009-09-03 12:22:41
I have a 12.7MB file to upload for dump. (and I am not a Wireshark expert). What should I do? Upload only the first 2MB (cutting it off somehow with wireshark) or...?
Since the issue only happens with alot of peers and it floods the network, I can't but have a big dump file. This is about 10 seconds of packet dump.
By: Tilghman Lesher (tilghman) 2009-09-03 12:46:50
No, probably the best thing to do is upload it on your own server and post a link to it in this issue. Given that the issue is private, most users will be unable to see the link, and we'll delete it prior to making the issue public again.
By: Michael Gaudette (bluefox) 2009-09-03 12:55:18
By: Leif Madsen (lmadsen) 2009-09-22 09:05:35
Putting back to Acknowledged as the information has been provided.
By: Leif Madsen (lmadsen) 2010-04-30 12:02:14
Is this still happening on the latest versions? I'm not seeing any more comments on this issue...
By: Michael Gaudette (bluefox) 2010-04-30 13:01:14
I will check when I have a chance to upgrade. I do disagree with the severity being minor: this actually had the side-effect of flooding my cable link (not the server's, but the one where a single phone was located) with qualify request to the point where all bandwidth was used.
The problem isnt only the message, but the fact that flooding message hog the bandwidth, seemlingly making it worst in a vicious cycle.
By: Michael Gaudette (bluefox) 2010-04-30 13:02:35
Sorry, just to clarify: it does happen on 1.4.29, where I am now on that particular system. I haven't seen outright flooding since qualify=no for most devices because of this, but the few that arent do give me multiple message at once occasionally
By: Leif Madsen (lmadsen) 2010-04-30 14:09:25
Minor only means the number of people affected, not how impacting the problem is to your environment.
By: Paul Belanger (pabelanger) 2010-06-01 13:14:27
ping! What is the status of this issue? Does this still exist on the latest version of 1.4?
By: Michael Gaudette (bluefox) 2010-06-01 13:15:54
Still seen on 1.4.31.
By: Leif Madsen (lmadsen) 2010-06-08 10:05:17
We're still looking for some debug information to be posted here in order to move this issue forward. We seem to see a lot of information describing the situation, but no information on what is being seen on the Asterisk side of things when this happens both prior and during this issue.
If someone can provide debugging information which includes a SIP trace and Asterisk console output then we can move this issue forward. Otherwise, we're not going to get anywhere.
By: Paul Belanger (pabelanger) 2010-06-16 09:25:15
Suspending this issue until we get the required debug logs (see below). On you have collected them contact a bug marshal or developer on #asterisk-bugs to re-open the issue.
We require a complete debug log to help triage the issue.
This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: