|Summary:||ASTERISK-14701: SIP qualify goes out of control and kills links|
|Reporter:||Michael Gaudette (bluefox)||Labels:|
|Date Opened:||2009-08-24 09:50:59||Date Closed:||2011-06-07 14:00:57|
- About 1200 SIP peers in realtime (only 5 in sip.conf)
- Most of those 1200 are configured with qualify=yes
- Server traffic, normally around 2.5 Mbps, suddenly spikes to 6-10 Mbps outgoing.
****** STEPS TO REPRODUCE ******
- Have many SIP peers, using qualify, in realtime DB.
- Type "sip reload" in CLI a few times (not necessarily quickly, can be 10 times in 2 minutes)
Watch your network card work harder than it should.
****** ADDITIONAL INFORMATION ******
- some peers give out multiple UNREACHABLE and REACHABLE messages, probably linked to the fact that the SIP peer link (remote) is swamped by SIP qualify messages
- Even if I put qualify=no in the db and do a sip reload, I still get some reachable-unreachable messages coming in!!!
|Comments:||By: Michael Gaudette (bluefox) 2009-08-24 10:19:54|
...And eventually Asterisk dies and restarts.
By: dant (dant) 2009-08-24 20:51:52
I can confirm I've seen this before with a 1.4.26 production instance. It only appeared to be x-lite clients that were involved with this issue, the majority of phones being Polycom devices... All clients are realtime with realtime caching enabled... The problem has not reoccurred since setting qualify=no globally...
By: Michael Gaudette (bluefox) 2009-08-24 23:43:50
My experience is that xlite is the victim of the problem more often than not, but sometimes Polycoms are too.
But the main point is that whatever xlite (or any peer) does, Asterisk shouldn't start flooding the upstream link with qualify requests (or whatever is being flooded).
By: David Brillert (aragon) 2009-08-25 07:56:27
I see this with the Polycoms too. They often become unreachable/reachable lately (some revision of 1.4.26). But I had not discovered any correlation with too many SIP qualify messages on the network.
I thought at first these were network issues.
Increasing the SIP qualify timer per extension seems to help.
I presume the bug marshals will want to see some CLI output with debug enabled in logger.conf and the *sip set debug* command issued from the CLI while this is happening.
By: Michael Gaudette (bluefox) 2009-08-25 08:28:09
No, I can tell you from MRTG that it is network related, but what seems to happen is that Asterisk floods the end-user network (let's say he's got a 9000Mbits link) to the point where his phones are indeed unreachable, because of Asterisk.
I could increase the timer, but the point of using qualify is to know when to stop trying to send calls to a peer. 200ms is what I use.
By: David Brillert (aragon) 2009-08-25 08:43:13
I don't use realtime.
My dead Polycoms are reachable when they are plugged in and then they become unreachable. For those extensions I increase qualify to 20000ms (my default is 2000ms).
If it's a network problem I don't know why, since it has happened on three different sites:
Voice VLAN is shared with PC's
Voice VLAN is dedicated by separate switches with no PC's
Voice VLAN is dedicated by separate switches with no PC's
It is very hard to reproduce and very rare for me.
If you can reproduce this I still think the marshals will want to see some sip set debug output and debug output while this is happening.
By: David Brillert (aragon) 2009-08-25 10:53:18
Tell me if you see anything like this message in /var/log/asterisk/messages just before your phones become unreachable?
WARNING: channel.c:952 __ast_queue_frame: Exceptionally long voice queue length queuing to Local
I see that warning always before my Polycom phones become unreachable by qualify.
If you do I think your problem is related to a bug I posted earlier...
There are plenty of SIP debugs and CLI output on that report.
Of course this could be completely unrelated.
By: Michael Gaudette (bluefox) 2009-08-25 11:13:18
Never seen that message. The problem is not that the peer becomes unreachable (that's normal given my Hosted PBX environment), but that Asterisk goes crazy in the circumstances described above.
By: David Vossel (dvossel) 2009-08-25 12:11:57
Bluefox, you mentioned that Asterisk "dies" and has to restart. Is this an actual crash? If so, please provide a backtrace if you are able to reproduce it again.
In 1.4 it doesn't appear that there are as many qualify options as there are in 1.6. I would suggest possibly adjusting the qualifyfreq option but that is not available in 1.4... 1200 peers is quite a lot. If something goes wrong with the network and a qualify fails, then Asterisk will re-attempt that qualify a few times in succession (five times I think)... If 1200 peers fail to qualify, that could definitely spike traffic up.
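To put rough numbers on that estimate, here is a minimal back-of-the-envelope sketch. The five attempts per peer come from the comment above; the 500-byte packet size and the one-second window are assumptions chosen for illustration, not measured values, and `qualify_burst` is a hypothetical helper name.

```python
def qualify_burst(peers, attempts, packet_bytes=500, window_s=1.0):
    """Estimate the size of a qualify burst.

    packet_bytes and window_s are illustrative assumptions, not values
    taken from Asterisk or from a packet capture.
    Returns (number of packets, outgoing rate in Mbps).
    """
    packets = peers * attempts
    bits = packets * packet_bytes * 8
    mbps = bits / window_s / 1_000_000
    return packets, mbps


if __name__ == "__main__":
    packets, mbps = qualify_burst(peers=1200, attempts=5)
    print(f"{packets} packets, about {mbps:.1f} Mbps if all sent within one second")
```

Under these assumptions the burst is several thousand packets; the real rate depends on the actual packet size and on how tightly the retries are clustered in time.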
By: Michael Gaudette (bluefox) 2009-08-25 12:24:12
I can't say I know the Asterisk code, but I think that is where the problem lies: I think that Asterisk does much more than 5 qualifies in succession when one peer goes unreachable, which in turn floods the device's link, which in turn makes it go reachable-unreachable... and round and round we go until upstream traffic goes up and up.
The crash is not easy to reproduce though. It usually happens, but sometimes it takes hours from the moment the problem starts, sometimes minutes, and sometimes it does not happen at all. Fixing the original issue would be a good start, crash or no crash.
By: David Vossel (dvossel) 2009-08-25 14:04:36
I did a quick test using the default 'qualify=yes' for a peer. I let the peer qualify, then I disconnected the device. I monitored qualify packets Asterisk sent and only 5 qualifies were sent in succession. I believe Asterisk reattempts this every 30 seconds. You may want to verify my findings with your system by doing a similar test and capturing the results with wireshark.
By: Leif Madsen (lmadsen) 2009-09-17 14:46:46
This isn't really a bug, but is rather a side effect of you having 1200 peers that have the qualify function enabled.
In 1.6, there are the qualifyfreq and other options, which let you spread out the number of peers you're qualifying at any one time. On 1.4, if you have qualify enabled for all of those peers, all qualifies go out at the same time, and thus you end up with this network flood that can't be handled.
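The difference between sending every qualify at once and spreading them over an interval can be sketched as follows. This is only an illustration of the idea, not Asterisk's actual scheduler; the function names and the 60-second interval are assumptions.

```python
from collections import Counter


def all_at_once(n_peers):
    """Every peer is qualified at offset 0 ms (the behaviour described for 1.4)."""
    return [0] * n_peers


def staggered(n_peers, freq_ms):
    """Spread peers evenly across one interval of freq_ms milliseconds."""
    return [i * freq_ms // n_peers for i in range(n_peers)]


def peak_per_second(offsets_ms):
    """Largest number of qualifies that fall into any single second."""
    per_second = Counter(offset // 1000 for offset in offsets_ms)
    return max(per_second.values())


if __name__ == "__main__":
    peers = 1200
    print("all at once:", peak_per_second(all_at_once(peers)), "qualifies in the busiest second")
    print("staggered:  ", peak_per_second(staggered(peers, 60_000)), "qualifies in the busiest second")
```

With 1200 peers and an assumed 60-second interval, the evenly staggered schedule keeps the busiest second at a small fraction of the all-at-once burst.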