[Home]

Summary:ASTERISK-16717: Incoming SIP UDP packets are ignored.
Reporter:dtryba (dtryba)Labels:
Date Opened:2010-09-23 07:53:37Date Closed:
Priority:MinorRegression?No
Status:Open/NewComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 3rdcrash.txt
( 1) bt.txt
( 2) locks
( 3) objects
Description:UDP SIP datagrams are not processed by Asterisk. tcpdump shows them arriving on the network interface but they don't show up in Asterisk. This lasts for about 10m (RTP and SIP/TCP or IAX functions just fine in this period), Asterisk then starts to "Really destroying SIP dialog" (100+ in 1s) and suddenly all is well again. Outbound UDP packets are being send and the other side replies (but obviously gets ignored).

Last UDP datagram received:
[2010-09-22 18:49:20] VERBOSE[17369] chan_sip.c:
<--- SIP read from UDP:213.247.xxx.xxx:1024 --->

next datagram arrives 9m and 11s later:
[2010-09-22 18:58:31] VERBOSE[17369] chan_sip.c:
<--- SIP read from UDP:84.246.xxx.xxx:1042 --->

Versions affected are 1.6.2.9 (backported from Debian/unstable to Debian/stable) and 1.6.2.13 (Debian 1.6.2.9-2 with patches applied to get to version 1.6.2.13). This suddenly started happening 1 weeks ago, before this machine had been running in the same configuration since 1.6.2.9 was released by the Debian team (uptime was about 115 days before rebooting this machine to fix a kernel issue) without any problems.

The only thing changed that week was the installation of fail2ban to block bruteforce attempts. But removing this (and rebooting the machine) didn't solve the problem.

The problem happends at random times, during the day with active calls or during the night whitout any usage.


****** ADDITIONAL INFORMATION ******

An observed workaround was editing sip.conf bindaddr from 0.0.0.0 to specific ip adr and "sip reload". At that point Asterisk started accepting UDP packets again, my guess is that the change triggers garbage collection. But at a later time Asterisk started ignoring UDP packets again.

I can't see anything weird going on. Latest problem was between 18:49:20 and 18:58:30
/proc/net/sockstat shows no excessive connections:
18:48:51 UDP: inuse 34 mem 31
18:49:21 UDP: inuse 34 mem 31
18:49:51 UDP: inuse 34 mem 62
18:54:27 UDP: inuse 36 mem 62
18:55:29 UDP: inuse 40 mem 62
18:55:59 UDP: inuse 42 mem 62
18:56:30 UDP: inuse 44 mem 62
18:57:32 UDP: inuse 46 mem 62
18:58:34 UDP: inuse 46 mem 31
18:59:05 UDP: inuse 30 mem 0

Active SIP dialogs in Asterisk grows from 31 at 18:48:51 to 147 at 18:58:03.
The growth is only from TCP based SIP dialogs and is full of old OPTIONS. At 18:58:30 Asterisk start some cleanup and AutoDestroys the stale dialogs.

---------- SIP HISTORY for '26963aae3fa753aa4c55bfbe42cc0856@109.235.33.46'
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   * SIP Call
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   001. Rx              OPTIONS / 102 OPTIONS / sip:voip.pocos.nl
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   002. TxResp          SIP/2.0 / 102 OPTIONS - 404 Not Found
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   003. SchedDestroy    32000 ms
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   004. AutoDestroy     26963aae3fa753aa4c55bfbe42cc0856@109.235.33.46
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:
---------- END SIP HISTORY for '26963aae3fa753aa4c55bfbe42cc0856@109.235.33.46'

This OPTIONS was received at 18:49:48 and the history at 18:49:51 was:
* SIP Call
* 1. Rx              OPTIONS / 102 OPTIONS / sip:voip.pocos.nl
* 2. TxResp          SIP/2.0 / 102 OPTIONS - 404 Not Found
* 3. SchedDestroy    32000 ms

Active UDP session suffer the same fate:

[2010-09-22 18:49:46] VERBOSE[596] chan_sip.c: Reliably Transmitting (NAT) to 89.146.xxx.xxx:62840:
BYE sip:xxxxxxxxx@89.146.xxx.xxx:62840 SIP/2.0
Via: SIP/2.0/UDP 109.235.32.36:5060;branch=z9hG4bK28f6bf83;rport
Max-Forwards: 70
From: <sip:0033xxxxxxxxx@voip.pocos.nl>;tag=as21291fe9
To: <sip:xxxxxxxxx@voip.pocos.nl>;tag=2288469170
Call-ID: 1640465179@192_168_6_100
CSeq: 102 BYE
User-Agent: Asterisk PBX 1.6.2.13
X-Asterisk-HangupCause: Normal Clearing
X-Asterisk-HangupCauseCode: 16
Content-Length: 0

History at 18:49:51 (doesn't change until after 18:58:03)
 * SIP Call
 * 1. Rx              INVITE / 2 INVITE / sip:0033xxxxxxxxx@voip.pocos.nl
 * 2. AuthChal        Auth challenge sent for  - nc 0
 * 3. TxRespRel       SIP/2.0 / 2 INVITE - 401 Unauthorized
 * 4. SchedDestroy    6400 ms
 * 5. Rx              ACK / 2 ACK / sip:0033xxxxxxxxx@voip.pocos.nl
 * 6. Rx              INVITE / 3 INVITE / sip:0033xxxxxxxxx@voip.pocos.nl
 * 7. CancelDestroy
 * 8. Invite          New call: 1640465179@192_168_6_100
 * 9. AuthOK          Auth challenge successful for xxxxxxxxx
 * 10. NewChan         Channel SIP/xxxxxxxxx-000029b1 - from 1640465179@192_168_6_100
 * 11. TxResp          SIP/2.0 / 3 INVITE - 100 Trying
 * 12. TxResp          SIP/2.0 / 3 INVITE - 100 Trying
 * 13. TxResp          SIP/2.0 / 3 INVITE - 180 Ringing
 * 14. TxRespRel       SIP/2.0 / 3 INVITE - 200 OK
 * 15. Rx              ACK / 3 ACK / sip:0033xxxxxxxxx@109.235.32.36
 * 16. Hangup          Cause Normal Clearing
 * 17. SchedDestroy    6400 ms
 * 18. TxReqRel        BYE / 102 BYE - BYE
 * 19. RTCPaudio       Quality:

Asterisk retrying to send a BYE:

[2010-09-22 18:58:31] VERBOSE[17369] chan_sip.c: Retransmitting #1 (NAT) to 89.146.xxx.xxx:62840:
BYE sip:xxxxxxxxx@89.146.xxx.xxx:62840 SIP/2.0
Via: SIP/2.0/UDP 109.235.32.36:5060;branch=z9hG4bK28f6bf83;rport
Max-Forwards: 70
From: <sip:0033xxxxxxxxx@voip.pocos.nl>;tag=as21291fe9
To: <sip:xxxxxxxxx@voip.pocos.nl>;tag=2288469170
Call-ID: 1640465179@192_168_6_100
CSeq: 102 BYE
User-Agent: Asterisk PBX 1.6.2.13
X-Asterisk-HangupCause: Normal Clearing
X-Asterisk-HangupCauseCode: 16
Content-Length: 0

And the final cleanup happening at 18:58:32]

---------- SIP HISTORY for '1640465179@192_168_6_100'
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   * SIP Call
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   001. Rx              INVITE / 2 INVITE / sip:0033xxxxxxxxx@voip.pocos.nl
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   002. AuthChal        Auth challenge sent for  - nc 0
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   003. TxRespRel       SIP/2.0 / 2 INVITE - 401 Unauthorized
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   004. SchedDestroy    6400 ms
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   005. Rx              ACK / 2 ACK / sip:0033xxxxxxxxx@voip.pocos.nl
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   006. Rx              INVITE / 3 INVITE / sip:0033xxxxxxxxx@voip.pocos.nl
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   007. CancelDestroy
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   008. Invite          New call: 1640465179@192_168_6_100
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   009. AuthOK          Auth challenge successful for xxxxxxxxx
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   010. NewChan         Channel SIP/xxxxxxxxx-000029b1 - from 1640465179@192_168
_6_100
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   011. TxResp          SIP/2.0 / 3 INVITE - 100 Trying
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   012. TxResp          SIP/2.0 / 3 INVITE - 100 Trying
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   013. TxResp          SIP/2.0 / 3 INVITE - 180 Ringing
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   014. TxRespRel       SIP/2.0 / 3 INVITE - 200 OK
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   015. Rx              ACK / 3 ACK / sip:0033xxxxxxxxx@109.235.32.36
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   016. Hangup          Cause Normal Clearing
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   017. SchedDestroy    6400 ms
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   018. TxReqRel        BYE / 102 BYE - BYE
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   019. RTCPaudio       Quality:
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   020. ReTx            200 BYE sip:xxxxxxxxx@89.146.xxx.xxx:62840 SIP/2.0
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   021. ReliableXmit    timeout
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   022. NeedDestroy     Setting needdestroy because autodestruct
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   023. ReTx            400 BYE sip:xxxxxxxxx@89.146.xxx.xxx:62840 SIP/2.0
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:   024. AutoDestroy     1640465179@192_168_6_100
[2010-09-22 18:58:32] DEBUG[17369] chan_sip.c:
---------- END SIP HISTORY for '1640465179@192_168_6_100'
Comments:By: Stefan Schmidt (schmidts) 2010-09-23 07:58:37

how many sip users and also subscribes do you have on this box?
maybe you can rebuild with the lock debug option enabled and post the output of core show locks.

By: Leif Madsen (lmadsen) 2010-09-23 10:50:58

Ya I'd like to see some 'core show locks' and a backtrace when this is happening. My gut instinct tells me this is a network or network device issue and not strictly an Asterisk issue, but we should certainly investigate what is going on inside of Asterisk when this happens to rule anything out.

By: Stefan Schmidt (schmidts) 2010-09-23 13:59:35

i am not sure if this is really a network thing. i have seen such behavior caused by a deadlock in the ast_sched_runq function which cause asterisk to not read any further incoming packets.

have you see in the log that many users get unreachable when this happens and get back reachable fast after its over?

By: dtryba (dtryba) 2010-09-23 15:23:51

This is a PSTN gateway with about 65 SIP peers and 20 IAX peers, mostly other Asterisk machines. Apart from the installing fail2ban there was a 10% increase in traffic last week. But actual usage doesn't seem to be a trigger, it happens 03:00 when nobody is using the gateway.

Peers aren't unreachable during the 10m outage, a couple of peers (ATAs or SIP phones) change state to unreachable right after the "recovery cleanup" and become reachable within a couple of minutes when they try to reregister. Asterisk machines or Cisco Call Managers aren't considered unreachable. The customer machines are complaining my machine is unreachable. My guess is that the difference between Asterisk and phones is due to the different retry/keepalive strategies keeping the UDP session alive on both ends.

I'll build a debug version with lock enabled tomorrow.



By: Stefan Schmidt (schmidts) 2010-09-23 17:07:38

that amount of sip peers shouldnt be a problem. how much traffic do you see on your server? shouldnt be more than 100kbit in idle time like 03:00 ;)

you wouldnt see much peers going unreachable cause there are just too less to see this.

By: dtryba (dtryba) 2010-09-24 05:38:22

Bandwith is about 30 kbit/s during the middle of the night. Max during the day (5m average) peaks at 2Mbit/s.

My interpretation of the machine not flagging its peers unreachable is that it is just not processing anything UDP related. In the 10m span there is no entry in the logs that Asterisk is even trying to send any UDP OPTIONS. The only UDP event Asterisk is trying to send is the BYE at 18:49:46. This packet didn't make it to the tcpdump so it never left Asterisk. This kind of counters lmadsen's gut feeling this might be a network related issue.

By: dtryba (dtryba) 2010-09-28 04:24:45

I ran into https://issues.asterisk.org/view.php?id=15951
Is a core shwo locks when compiled with ./configure --without-execinfo any use?

By: Leif Madsen (lmadsen) 2010-10-04 12:01:31

Can you comment on this issue?

By: Stefan Schmidt (schmidts) 2010-10-04 14:49:24

dtryba maybe it helps to see where this lock happens.
you could check the output of "sip show objects" too, to see how many sip dialogs are in use.

By: lev neploh (doomeasy) 2010-10-26 07:57:36

hello. same thing on my asterisk, same version, same workaround (thank you, dtryba!).

By: dtryba (dtryba) 2010-11-11 05:19:37.000-0600

I took some time, but finally I have a "core show locks" from the moment asterisk stopped responding to UDP requests. It's from an 1.6.2.13 machine.

The last time this happened the work around of changing bindaddr and reloading didn't solve the problem (needed a restart).

I made asterisk dump core, it's a 64bit machine but the core file is 3551223808 bytes, might be a coincidence but this looks pretty close to the max memory usage of 32bit binaries. Still have to examine the stack trace, but am a bit weary about running "gdb /usr/sbin/asterisk core" on the machine that is running an active asterisk. Will this cause problems for the running asterisk?

By: Stefan Schmidt (schmidts) 2010-11-11 06:17:38.000-0600

this deadlock occurs in the scheduler part of the do_monitor section and this cause asterisk to not receive any further packets.

what i can futher see is that there is a problem with the refcounter for sip objects like this:

name: simonthuis2
refcount: 1345

name: simonthuis1
refcount: 1286

this should never be on such a high value cause this objects wont be destroyed clearly.
IMO there is a refcounter bug with TCP/TLS calls in 1.6.2 which cause this deadlock.

By: dtryba (dtryba) 2010-11-11 06:42:54.000-0600

TLS is disabled, sip show settings:
Global Settings:
----------------
 UDP SIP Port:           5060
 UDP Bindaddress:        0.0.0.0
 TCP SIP Port:           5060
 TCP Bindaddress:        0.0.0.0
 TLS SIP Port:           Disabled

simonthuis1 and simonthuis2 are accounts on the same Snom-M9/9.1.75.
I disabled these accounts. And will monitor refcounts.

By: Stefan Schmidt (schmidts) 2010-11-11 06:52:14.000-0600

snoms are evil ;)

By: dtryba (dtryba) 2010-11-11 06:55:22.000-0600

The bt from the core (full threads will be uploaded):

(gdb) bt full
#0  0x00007f6883756bd6 in poll () from /lib/libc.so.6
No symbol table info available.
#1  0x0000000000442663 in monitor_sig_flags (unused=0x0) at asterisk.c:3067
       p = {fd = 164, events = 1, revents = 0}
       a = 167
#2  0x0000000000444c84 in ?? ()
No locals.
#3  0x00007f68836ae1a6 in __libc_start_main () from /lib/libc.so.6
No symbol table info available.
#4  0x000000000041e419 in ?? ()
No locals.
ASTERISK-1  0x00007fff80d99db8 in ?? ()
No symbol table info available.
ASTERISK-2  0x000000000000001c in ?? ()
No symbol table info available.
ASTERISK-3  0x0000000000000008 in ?? ()
No symbol table info available.
ASTERISK-4  0x00007fff80d9ae8c in ?? ()
No symbol table info available.
ASTERISK-5  0x00007fff80d9ae9f in ?? ()
No symbol table info available.
ASTERISK-6 0x00007fff80d9aea2 in ?? ()
No symbol table info available.
ASTERISK-7 0x00007fff80d9aea5 in ?? ()
No symbol table info available.
ASTERISK-8 0x00007fff80d9aeb9 in ?? ()
No symbol table info available.
ASTERISK-9 0x00007fff80d9aebc in ?? ()
No symbol table info available.
ASTERISK-10 0x00007fff80d9aebf in ?? ()
No symbol table info available.
ASTERISK-11 0x00007fff80d9aec2 in ?? ()
No symbol table info available.
ASTERISK-12 0x0000000000000000 in ?? ()
No symbol table info available.

By: dtryba (dtryba) 2010-11-11 08:14:07.000-0600

Sadly enough the problem isn't caused by the Snom. It is disabled but I just had my 3 crash today. This time the highest refcount is 16 (3rdcrash.txt).

I switched to our hot spare, it has the same problem (1.6.2.9) but in most cases switching will result in a couple of weeks without troubles.

By: dtryba (dtryba) 2010-11-11 10:59:26.000-0600

Arghhhh, fallback machine experienced the problem within  a couple of hours. This killing us. Let's see what happens when I switch all hosts back to UDP and hopefully can turn off TCP completely.

By: wufan (wufan) 2010-12-21 08:58:08.000-0600

hi, did you get a solution on this or some further informations? i think i have
the same problem, but no explanation... last week it did occur and 2 hours ago.. even a reboot didnt solved the problem. Thank you!

By: dtryba (dtryba) 2010-12-21 09:11:18.000-0600

Since last report I disabled TCP. Zero problems since then on these POTS gateways. So if you can do without TCP try turning it off completly.

Just disabling tcp in sip.conf and reloading is not enough, it will keep tcp sockets lingering eventually resulting in a crash (in my case). I had to reboot the machine to get rid of all tcp sip connections!

By: wufan (wufan) 2010-12-21 10:33:22.000-0600

thanks,
mhh its already off..
maybe there are any related bugs?
sip show settings:
Global Settings:
----------------
 UDP SIP Port:           5060
 UDP Bindaddress:        0.0.0.0
 TCP SIP Port:           Disabled
 TLS SIP Port:           Disabled

By: lev neploh (doomeasy) 2010-12-21 12:58:29.000-0600

may it is different situation, but in my case (same symptoms) - i have added -
ulimit -n 10000 to /etc/init.d/asterisk, then /etc/init.d/asterisk restart.
--
because of:
root@platforma:~# lsof | grep asterisk | wc -l
357

root@platforma:~# ps ax | grep asterisk
4716 pts/17   S+     0:00 rasterisk r
15343 pts/18   S+     0:00 grep asterisk
15968 ?        Ssl   60:43 asterisk


root@platforma:~# cat /proc/15968/limits |  grep "open files"
Max open files            1024                 1024                 files