[Home]

Summary:ASTERISK-08163: chan_skinny doesn't send keepalives
Reporter:pj (pj)Labels:
Date Opened:2006-11-21 05:56:43.000-0600Date Closed:2007-06-04 18:29:28
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_skinny
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) skinny-ka-60-120-tcpdump.txt
( 1) skinny-keep-alive.txt
Description:when phone is normaly turned off (I'm using wifi 7920), phone sends unregister request to chan_skinny and phone is successfully unregistered, but when phone is hard disconnected, chan_skinny doesn't detect this death phone and still considers it as OK registered. Is even possible to normal call (ring) to this phone (this is main reason for marking this issue as major).
some info about this I wrote in bugreport 8190, but because this solve another issue I make this separate bugreport now.


****** ADDITIONAL INFORMATION ******

keepalive=30 in skinny.conf

[Nov 21 12:29:37] Clearing Display
[Nov 21 12:30:08] Clearing Display
[Nov 21 12:30:39] Clearing Display
after several minutes, that phone is disconnected, here is still OK:
ipbx*CLI> skinny show devices
Name                 DeviceId         IP              Type            R NL
-------------------- ---------------- --------------- --------------- - --
PJ                   SEP000D288E664B  193.179.xx.xx  7920            Y  1


I can even ring this death phone:
[Nov 21 12:32:44] Found device: PJ
[Nov 21 12:32:44]     -- skinny_request(324@PJ)
[Nov 21 12:32:44] skinny_new: tmp->nativeformats=256 fmt=256
[Nov 21 12:32:44]     -- skinny_call(Skinny/324@PJ-3)
[Nov 21 12:32:44] Setting ringer mode to '2'.
[Nov 21 12:32:44] Displaying Prompt Status 'Ring-In'
[Nov 21 12:32:44]     -- Called 324@PJ
[Nov 21 12:32:44]     -- Skinny/324@PJ-3 is ringing
[Nov 21 12:32:49] skinny_hangup(Skinny/324@PJ-3) on 324@PJ
[Nov 21 12:32:49] Setting ringer mode to '1'.
[Nov 21 12:32:49] Clearing Display

still as registered:
ipbx*CLI> skinny show devices
Name                 DeviceId         IP              Type            R NL
-------------------- ---------------- --------------- --------------- - --
PJ                   SEP000D288E664B  193.179.xx.xx  7920            Y  1


when phone is powered back again:
firstly is _unregistered_ and after that immediatelly registered:

[Nov 21 12:33:48] WARNING[18562]: chan_skinny.c:4179 get_input: read() returned error: Connection reset by peer
[Nov 21 12:33:48] Skinny Client was lost, unregistering
[Nov 21 12:33:48] NOTICE[18562]: chan_skinny.c:4276 skinny_session: Skinny Session returned: Connection reset by peer
[Nov 21 12:33:54]     -- Starting Skinny session from 193.179.xx.xx
[Nov 21 12:33:54] Received Alarm Message: 25: Name=SEP000D288E664B Load=4.0(03.00) Last=Initialized
[Nov 21 12:33:54] Device SEP000D288E664B is attempting to register
[Nov 21 12:33:54]     -- Device 'SEP000D288E664B' successfully registered
[Nov 21 12:33:54] Requesting capabilities
[Nov 21 12:33:54] Version Request
Comments:By: jmls (jmls) 2007-02-11 03:28:52.000-0600

any news on this ?

By: pj (pj) 2007-02-11 03:47:40.000-0600

the issue still persist,
asterisk doesn't detect death skinny phone and generates normal ringback, when calling to this death phone
tested on Asterisk SVN-trunk-r53885

By: Anthony LaMantia-2 (anthonyl) 2007-03-09 02:28:35.000-0600

hi, pj a few quick questions.. right when you unplug the phone, is there any activity on the console?

also just to be clear 324@PJ-3 is the extension of the device that was disconnected?

By: pj (pj) 2007-03-09 02:35:17.000-0600

hi, after disconnection, absolutely nothing happened on debug console...
324 is linenumber on skinny device 'PJ'
I have following in skinny config...

[PJ]
device=SEP000D288E664B
nat=1
callwaiting=1
transfer=1
threewaycalling=1
line => 324

extensions:
Dial(Skinny/324@PJ)

By: Anthony LaMantia-2 (anthonyl) 2007-03-12 17:04:23

can you test this branch http://svn.digium.com/svn/asterisk/team/anthonyl/skinny-redux , and see if the problem is still happening.. i am working on some heartbeat code at the moment, but it would be nice to know if the branch above contains a patch for the issue.

By: pj (pj) 2007-03-13 17:41:22

checked out right now, no change, no keepalives, no detection of death phone...
Asterisk SVN-anthonyl-skinny-redux-r51181

By: Anthony LaMantia-2 (anthonyl) 2007-03-25 14:55:58

hi pj, sorry for the delay i've been pretty busy..
i added the general framework for scheduling..etc to chan_skinny for sending keep alive messages but i've ran into a slight problem.. i don't have a sample keep alive message available to be to make sure that what im sending out is compliant with what needs to be sent out, do you have a sample of one of the messages?

By: Damien Wedhorn (wedhorn) 2007-03-27 04:29:11

Anthony, the keepalivemessage and keepaliveackmessage are both empty. The wireshark source is a good reference for packet structures. Not sure if you've already done this, but I think the device sends the keepalivemessage based on the timing set in the RegisterAck message.

By: Anthony LaMantia-2 (anthonyl) 2007-03-27 04:36:42

hi, well i did check the wireshark code in packet-sccp.c the messages start from message 0x01 and go to 0x14 (if i can recall from a few days ago)  and according to the information i have (whats in chan_skinny) a keep alive message is 0x00 ... my current (it's most prob. not compliant) just has a message with a blank integer that i just set to 0 when i send the packet out.....


i was thinking of just keep the patch to that until i come by more information,
if it's non-compliant.. at asterisk asterisk would detect(at least it should) that the tcp connection died and destory the session..

im just unsure if the phone on the other end would reply to the message.. which would cause a endless stream of schedulded keep alive messages to be sent out, or the session being destoryed due to a lack of reply..

By: Damien Wedhorn (wedhorn) 2007-03-27 04:45:09

try packet-skinny.c. I've got a trace, and the data-length of the two keepalive messages are both 4 bytes, that is the size of the message description 0x0000 and 0x0100.



By: Anthony LaMantia-2 (anthonyl) 2007-03-27 04:51:32

oh wow, thanks wedhorn.. i see the desc. in packet-skinny.c, i totally overlooked it and focused on packet-sccp.c

By: pj (pj) 2007-03-27 11:26:37

for keepalives and other things improving chan_skinny you can be inspired by looking into source from chan_sccp project (currently death/unmaintained).

By: Jason Parker (jparker) 2007-03-27 11:32:25

We will not accept any patches that are "inspired" by code from chan_sccp in any way.

By: pj (pj) 2007-03-27 11:42:15

don't understand, why you can't accept ideas/reuse work from other free/opensource project and why to spend lot of time again by reverse engineering callmanager messages.

By: Anthony LaMantia-2 (anthonyl) 2007-03-30 23:22:08

well, after looking at some dumps that were provided to me, it looks like the keep alive message(x00000) is sent from the phone to the endpoint.
and the endpoint responds with a 0x0100 keepaliveack message

both of which have no payload, other then the 16byte len-reserved-messageid     header

By: Anthony LaMantia-2 (anthonyl) 2007-04-03 06:13:46

ok so, the new method for this i've been working at has more or less been when there is a new remote connection established with a client chan_skinny keeps track of how long it has been since the connection has been established and a keepalive message recived and if it reaches a certin point of non-communaction skinny_destory()..etc   are called and the client is dropped is this acceptable to you guys?, also.. the timeout value is configurable  , if this method seems sane i hope i can send it to PJ for some testing.

By: Damien Wedhorn (wedhorn) 2007-04-03 06:22:34

Sounds sane to me. Do you set the primary and secondary keepalive timings? Have you worked out what the secondary timing does? Some speculation, the secondary timing may be the timeout for the phone to reset itself so I suppose this timing and the timeout in asterisk should be the same.

Can you post a patch so we can have a look and maybe make some suggestions?

By: pj (pj) 2007-05-21 03:52:32

anthonyl, if you have worked on skinny keepalives, can you upload your patch for public testing?

By: dea (dea) 2007-05-31 18:27:11

The current method of dectection is to see if a message sent to the phone
completes and the right length.

It appears that write() always succeeds, even if the destination is not
there, so missing phones are not detected.

A better method would be to add a variable to the device structure and
record the time of the last RECEIVED keep_alive request.  If the difference
between now() and that value exceeds 90 seconds, consider the device gone.

I can test this shortly.

By: dea (dea) 2007-05-31 20:12:27

Not so bad.  It looks like some cleanup or restructuring occured
in get_input() and the code to detect dropped sessions was shuffled
into a code path that could not be executed.

More clean-up may be needed, but this one has phones being de-registered
after the keep_alive timeout (+10% to allow for network congestion).

Let me know if it works for you.

By: pj (pj) 2007-06-01 16:06:17

my phone periodically deregistering and immediatelly registering, after aplied your keepalive patch to SVN-trunk-r66882M...
network is realiable, rtt 12-30ms, but maybe it can be side efect of phone powermanagement feature, when phone is idle, it doesn't respond fast, when phone has call, response times are fast.
cisco 7920, fw 3.0.2

Reply from 10.0.0.50: bytes=32 time=1815ms TTL=64
Reply from 10.0.0.50: bytes=32 time=6ms TTL=64
Reply from 10.0.0.50: bytes=32 time=981ms TTL=64
Reply from 10.0.0.50: bytes=32 time=5ms TTL=64
Reply from 10.0.0.50: bytes=32 time=974ms TTL=64
Reply from 10.0.0.50: bytes=32 time=7ms TTL=64



Skinny Client was lost, unregistering
[Jun  1 22:53:29] NOTICE[31759]: chan_skinny.c:4663 skinny_session: Skinny Session returned: Success
Skinny Client was lost, unregistering
[Jun  1 22:54:52] NOTICE[31759]: chan_skinny.c:4663 skinny_session: Skinny Session returned: Success
Skinny Client was lost, unregistering
[Jun  1 22:55:28] NOTICE[31759]: chan_skinny.c:4663 skinny_session: Skinny Session returned: Success
Skinny Client was lost, unregistering
[Jun  1 22:56:04] NOTICE[31759]: chan_skinny.c:4663 skinny_session: Skinny Session returned: Success



By: dea (dea) 2007-06-01 16:57:48

What do you have a a configured keep_alive in skinny.conf?

I have a concern about the 'configurable' nature of keep-alives.
All references I have seen indicate that the phones use a 90 second
keep-alive and it is not tunable.

If that is the case, then setting the keep-alive in skinny.conf
to more than 90 seconds should handle this.

By: Damien Wedhorn (wedhorn) 2007-06-01 17:04:21

The keepalives are tunable. I've set mine to 30 seconds and asterisk receives a keepalive every 30 seconds. I'm having issues trying to determine when the device decides to reboot. I thought it might be related to secondary_keepalive, but at the moment I can't see to adjust it, or predict it. It seems to occur between about 20 and 40 seconds after a scheduled keepalive is not responded too.

By: dea (dea) 2007-06-01 17:31:38

The code currently sets the secondary keep-alive to the same duration
as the primary.  If the phone only sends a secondary after the primary
it would be fine, otherwise we should use the some multiple of the
keep_alive setting for the secondary.

Also my patch assumes 10% is enough lag on a busy network.  On a 30
second keep_alive, thats only 3 seconds.  I can set the server side to
2X the configured keep_alive to make sure the phone has a chance to
send a packet.

By: pj (pj) 2007-06-01 17:45:08

I have set kepalive to 90 or 120s, but not helped too much;(
[Jun  2 00:39:15] NOTICE[9624]: chan_skinny.c:4663 skinny_session: Skinny Session returned: Success
[Jun  2 00:41:31] NOTICE[9624]: chan_skinny.c:4663 skinny_session: Skinny Session returned: Success
[Jun  2 00:43:47] NOTICE[9624]: chan_skinny.c:4663 skinny_session: Skinny Session returned: Success

I have set keepalive to short time (20s) and I got much better results, seems, that phone doesn't restarts anymore, keepalives are send periodically, exactly in 20s periods (tcpdump). I can try to continue testing tomorrow.



By: Damien Wedhorn (wedhorn) 2007-06-01 18:46:50

After some googling, it looks like we had the wrong idea about the secondarykeepalives. Apparently, this is the timing for the keepalives to the secondary ccm server, so isn't of much use to us.

With my playing around, it seems that the phone resets about 30 seconds after it should have received the last keepaliveack. Also, from some tcpdumps I've got, it looks like the default CCM keepalive is 30 seconds. For consistency between device resetting and asterisk killing a device, I would suggest a time of about keepalive+30. Note: reset timing has been done with VIP30's, 7900's may be different.

By: dea (dea) 2007-06-01 20:28:08

I'd agree to the proposal.  My CCM cluster tends to be a bit slower than
the phones when identifying a dropped device.

So padding the keep_alive by some reasonable number seems sane.  To test
that theory we can replace (keep_alive * 1000) with
((keep_alive * 1000) + 30000)  

If that works out, we can make it configurable, or at least a #define
so as to not have magic numbers sprinkled in the code.

By: pj (pj) 2007-06-02 02:25:29

as I wrote, I have quite different findings,
keepalive in asterisk must be set to some smaller than 30s,
eg. 25s is working fine, no phone restarts, but keepalive 35s causes periodically restarting phone.
7920, fw 3.0(2)

By: Damien Wedhorn (wedhorn) 2007-06-02 03:06:39

pj, I notice in some of your comments in other bugs that you've set keepalives between 25 and 30 seconds. The do_housekeeping is called by a keepalive being sent by the phone, part of which is to clear the display.

Can you try running without this patch and try some various keepalive settings (something like 30 sec, 60 sec and 120 sec) and see if you get the "Clearing Display" message at those time intervals.

By: pj (pj) 2007-06-02 03:54:38

without this patch, status "Ring-In" is displayed, but after remote side hangs up, on phone status bar is again "Your current options", I never see missed calls info.
"display clearing" never displayed on asterisk console with skinny debug turned on, regardles of this patch aplied or no (display clearing messages periodically appears in previous asterisk revisions).
seems, that also has some efect fw version for my 7920, with 3.0(0) is displayed From <number> in status bar, with fw 3.0(2) is displayed Ring-In.



By: Damien Wedhorn (wedhorn) 2007-06-02 04:18:37

pj, sorry, I wasn't meaning whether the screen cleaning worked or not, but rather, whether your devices were sending keepalives. When a keepalive is sent, chan_skinny clears the display. I'm not interested (for this purpose) whether the display clearing works or not, but whether your devices are send keepalives.

If you could go back to trunk, try keepalive times of 30, 60 and 120 and on the CLI (after doing a skinny set debug), seeing if the "Clearning display" messages occur every 30, 60 and 120 seconds respectively.

There seems to be something funny with your devices. A longer keepalive with this patch should reduce the chances of the phone being reset. I'm taking a guess that if you set the keepalive to anything longer than 30 seconds, your devices decide to not send any keepalives at all.

By: pj (pj) 2007-06-02 06:11:33

wedhorn:
nothing surprising, without this patch and set keepalives to 30, 60, 120, tcpdump shows, that keepalives are send in 30, 60, 120 intervals, but nothing is displayed on asterisk console with skinny debug (ie. no "clearing display" messages).

if you say: "A longer keepalive with this patch should reduce the chances of the phone being reset."
I'm saying that only keepalives _shorter_ than 30s can help to not periodically resetting phones when this patch is applied...

By: Damien Wedhorn (wedhorn) 2007-06-02 07:25:47

pj, there is something surprising, if you are not seeing "Clearing display" with skinny debug, either chan_skinny is not recognising the keepalive, or the ast_calloc call in req_alloc is failling.

If you've got tcpdumps, do you see keepAliveAck messages following the keepAlives?

Edit: and I appreciate that you need to reduce the keepalives so your phone doesn't reset, I am trying to figure out why you need to, and this seem specific to your type of device. The current patch provides a buffer of 10%, so if you have a 120 second keepalive, chan_skinny will wait an additional 12 seconds. If you have a 20 second keepalive, chan_skinny will on wait an additional 2 seconds. A packet delayed by 12 seconds is very rare. A delay of 2 seconds is a lot more common.

For your phone to reset continuously with this patch and a keepalive of 120 seconds implies that either asterisk doesn't receive/is ignoring your keepalive message, or your device is not sending it.



By: pj (pj) 2007-06-02 09:27:38

here is my tcpdump (phone is at 192.168.164.154), two sets of keepalives, exactly after two minutes as defined in skinny.conf (120)

tcpdump -n host 192.168.164.154
16:20:51.176463 IP 192.168.164.154.11397 > 192.168.38.20.2000: P 2080709662:2080709674(12) ack 2146797451 win 16384
16:20:51.176761 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 1:13(12) ack 12 win 6656
16:20:51.176897 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 13:61(48) ack 12 win 6656
16:20:51.177008 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 61:85(24) ack 12 win 6656
16:20:51.177130 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 85:109(24) ack 12 win 6656
16:20:51.225709 IP 192.168.164.154.11397 > 192.168.38.20.2000: . ack 109 win 16284


16:22:51.159062 IP 192.168.164.154.11397 > 192.168.38.20.2000: P 12:24(12) ack 109 win 16384
16:22:51.159346 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 109:121(12) ack 24 win 6656
16:22:51.159485 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 121:169(48) ack 24 win 6656
16:22:51.159595 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 169:193(24) ack 24 win 6656
16:22:51.159715 IP 192.168.38.20.2000 > 192.168.164.154.11397: P 193:217(24) ack 24 win 6656
16:22:51.208924 IP 192.168.164.154.11397 > 192.168.38.20.2000: . ack 217 win 16284

By: Damien Wedhorn (wedhorn) 2007-06-02 18:31:09

Those messages look like keepAlive, followed by keepAliveAck, setDisplay, and two setLamp messages. This is what is expected and if this occurs with the patch installed, your phones should not reset. Do you have a dump when you include this patch and keepAlives set to 120 (or 60)?

By: dea (dea) 2007-06-03 10:07:28

Instead of making the lag factor be based on the keep_alive,
perhaps it should be a fixed value.  I'd suggest 5 or 10 seconds.

By: pj (pj) 2007-06-03 13:04:18

uploaded tcpdump, tested with patch applied and KA set to 60 and 120s,
maybe suspicious is, that first captured packet going from asterisk to phone,
192.168.38.20.2000 > 192.168.164.154.10056
instead from phone to asterisk, like was in previous test...

By: pj (pj) 2007-06-03 13:20:23

after a while, phone stops resetting and keepalives are send correctly (first packet from phone to asterisk) weird:-\

tcpdump -n host 192.168.164.154

20:15:46.947125 IP 192.168.164.154.10077 > 192.168.38.20.2000: P 2293829822:2293829834(12) ack 628292820 win 16384
20:15:46.947456 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 1:13(12) ack 12 win 6656
20:15:46.947612 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 13:61(48) ack 12 win 6656
20:15:46.947723 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 61:85(24) ack 12 win 6656
20:15:46.947820 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 85:109(24) ack 12 win 6656
20:15:46.970705 IP 192.168.164.154.10077 > 192.168.38.20.2000: . ack 109 win 16284


20:17:46.930723 IP 192.168.164.154.10077 > 192.168.38.20.2000: P 12:24(12) ack 109 win 16384
20:17:46.931004 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 109:121(12) ack 24 win 6656
20:17:46.931145 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 121:169(48) ack 24 win 6656
20:17:46.931254 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 169:193(24) ack 24 win 6656
20:17:46.931349 IP 192.168.38.20.2000 > 192.168.164.154.10077: P 193:217(24) ack 24 win 6656
20:17:46.953256 IP 192.168.164.154.10077 > 192.168.38.20.2000: . ack 217 win 16284

By: pj (pj) 2007-06-03 14:03:59

tested again (settings on asteriska same, phone manualy off/on),
phone is now again still periodically restarting, so corrent phone behaviour in my previous post was exception.

By: dea (dea) 2007-06-03 14:17:55

PJ, can you try the suggestion I made in post number 64352?

I am not seeing my 7920 dropping/re-registering here, and it
is even less responsive to ping tests than yours.

By: pj (pj) 2007-06-03 14:41:43

DEA, tested right now:
res = poll(fds, 1, (keep_alive * 1000) + 30000);
again, phone still resettings, if keepalive=120 :-\

By: dea (dea) 2007-06-03 15:09:00

OK, there is something quite funky with your setup.  Can you turn on skinny
debug with verbose set to 4 and debug set to 0?

By: pj (pj) 2007-06-03 15:27:37

this periodically appears on asterisk console, after every phone restart...

Skinny Client was lost, unregistering
[Jun  3 22:25:42] NOTICE[1510]: chan_skinny.c:4745 skinny_session: Skinny Session returned: Success
Found device: PJ
   -- Starting Skinny session from 192.168.164.154
Received Alarm Message: 25: Name=SEP000D288E664B Load=4.0(03.02) Last=All-CMs-Bad
Device SEP000D288E664B is attempting to register
   -- Device 'SEP000D288E664B' successfully registered
Requesting capabilities
Found device: PJ
Version Request
Received CapabilitiesRes
Adding codec capability '4 (4)'
Adding codec capability '8 (2)'
Adding codec capability '0 (11)'
Adding codec capability '256 (12)'
Adding codec capability '0 (15)'
Adding codec capability '0 (16)'
Device capability set to '8'
Buttontemplate requested
Adding button: 9, 1
Sending 30002 template to PJ
Received SoftKey Template Request
Received SoftKeySetReq
Received Time/Date Request

By: dea (dea) 2007-06-03 16:10:50

Are there any other logging messages above 'Skinny Client was lost,  unregistering'?

If not, that points to the channel not receiving the keep alive in the
time expected.

We might also be chasing a firmware bug.  I've left my 7920 at 3.0(0)
and it has been rock solid with KeepAlive set to 30 in skinny.conf

By: pj (pj) 2007-06-03 16:36:33

maybe you have right! I'm reflashing 3.0(0) and got better results, no excessive restarts event if keepalives are set to 120s, perhaps really fw issue, or keepalives frequency settings above 30s is something non standard from ci$co perspective and is not working correctly:-\
You can try also 3.0(2), to confirm my issue. tomorow, I will try another phone types.

By: dea (dea) 2007-06-03 17:09:48

Cisco continues to try and improve battery life on the 7920, and one
way they use is if the phone is idle, they put the phone to sleep
every few seconds.  I forget how often or for how lond, and it likely
changes between firmware releases as they try to get more battery life.

I suspect that these sleep periods are interfering with the keep_alive.
It might also impact Call Manager.  I cannot test the newer firmware since
it might bugger up my production environment, or at least my 7920 that I
loaned myself for chan_skinny testing.  I have to give it back to myself
for production use soon.....

If 3.0(0) works out for you, let us know so we can scratch another skinny
bug off the list.

By: pj (pj) 2007-06-04 14:30:32

tested with 7912, fw8.0(3), it working fine with KA set to 120, no excessive phone reboots, chan_skinny does correctly detect disconnected phone.
so I think, this patch is ready for commit:-) thanks

By: Jason Parker (jparker) 2007-06-04 18:29:26

Fixed in svn branches 1.4 and trunk in revisions 67156 and 67157.