[Home]

Summary:ASTERISK-13339: Asterisk Crashes with signal 11 (segmentation fault) at random intervals (but at least 2 times a day)
Reporter:Andrea Borghi (aborghi)Labels:
Date Opened:2009-01-11 16:34:34.000-0600Date Closed:2009-01-16 16:46:32.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/Registration
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 14211.diff
( 1) asterisk.txt
( 2) gdb.txt
( 3) sip.txt
Description:Asterisk Crashes with signal 11 (segmentation fault) at random intervals (but at least 2 times a day)

Asterisk began crashing after a configuration change to the sip channel to handle two additional Corda numbers (before we had only one number and the server keep running months).
Beside the obvious error in sip configuration i was going to solve, asterisk began crashing more or less regularly after adding the registrations for the new corda numbers, totalling 3 SIP registrations to the cordia servers.

while debugging, I noted that che cordia servers does not seem to reply in a timely manner to the sip requests, delaying registrations for long time before accepting it.

from various backtraces i done while preparing the files for the report, form various different crashes, i see that the signal 11 occours every time in handle_response_register

A thing i noted is that while i have registerattempts=10 when the registrations to cordia are timing out in the messages in the console i see ALWAYS the attempt count a 0. It seems that asterisk is not counting attempts to register to the cordia servers so it is not reaching the limit i imposed of 10 attempts.



****** ADDITIONAL INFORMATION ******

VoIP installation is as follows:

2 PATTON SIPtoISDN gateways  (1 for PRI, 1 for 4xBRI) converts local calls
1 PORTECH SIPtoGSM gateway convert local mobile calls to SIP
30 AASTRA 57i phones give service to the local office
3 Cordia SIP subscriptions for non-itlay numbers (USA, France, England)
1 Asterisk server gives the intelligence to the installation and some services such as voicemail

Local network is a L2 VLAN in the switched infrastructure, no bottlenecks in switch trunks.
Internet connection is 10Mbit/s fiber to Colt Networks, no traffic congestion, about 22 ms RTT to London MIX.

net 192.168.10.0/24 hosts the phones and gateways
net 10.0.0.0/24 hosts the network connected to internet via a firewall

Software is Debian 4.0 Etch last upgraded today.
Linux pbx 2.6.24-etchnhalf.1-686 #1 SMP Tue Dec 2 07:56:33 UTC 2008 i686 GNU/Linux
Package list and versions available upon request.

Asterisk 1.4.22 with asterisk-addons-1.4.7 (using only MySQL CDR backend) freshly compiled from distribution tarballs

attached files:

gdb.txt = gdb output as per docs/backtrace.txt
sip.txt = sip channel configuration with passwords removed
asterisk.txt = asterisk startup messages with debugging

Comments:By: Mark Michelson (mmichelson) 2009-01-12 16:17:53.000-0600

The reason that you see the attempt count at 0 is that the function where the crash is occurring sets the count back to 0 since we are receiving a successful reply to our REGISTER.

As for the actual crash, this is an interesting one. Based on the backtraces you provided, I can see that the crash is occurring because we are attempting to dereference a NULL pointer. The interesting part is that there is a check about 10 lines up in the source to make sure we don't do that. I'll have to investigate a bit deeper to see if I can find a suitable cause for this crash to happen.

By: Andrea Borghi (aborghi) 2009-01-12 16:28:27.000-0600

to follow up, i had the server with one registration only (commented the register=> keywords for the two accounts English and France, keeped US) and the asterisk is *NOT* crashing.

Perhaps an interesting side effect of just one or two registration (mis)handled or just plainly timing out, because we are effectively talking to the very same server and the code is confused in this case?

By: Mark Michelson (mmichelson) 2009-01-12 16:32:01.000-0600

I don't know what sort of load your server has on it, but do you think it might be possible to run Asterisk under valgrind? For instructions about this, see doc/valgrind.txt in the Asterisk source directory. This will cause a noticeable performance penalty, but if you can afford it, it's the quickest way to solve these sorts of issues.

By: Andrea Borghi (aborghi) 2009-01-12 16:46:25.000-0600

will try tomorrow evening (now 23:44 local time) because i can not disrupt normal office hours service. I have already installed valgrind via debian packages on the server.

By: Mark Michelson (mmichelson) 2009-01-12 16:59:43.000-0600

Actually, the valgrind output may not be necessary. I was looking at an incorrect line of the code when I analyzed the problem before and may be able to determine the problem on my own.

By: Mark Michelson (mmichelson) 2009-01-12 17:08:20.000-0600

I believe I have found the problem here. I was mistakenly looking at the code in handle_response_register for a 200 OK response, but the line number indicated in your backtrace actually pointed to the section responsible for handling a 408 response. A 408 is a request timeout, which most likely meant that our registration attempt had timed out locally as well. As a result, the object we were attempting to dereference was NULLed out when the registration timed out.

I have uploaded a patch, 14211.diff, which adds a check for the pointer as well as a warning message to appear when a 408 is received and the sip_registry object we wish to modify has been destroyed already. Please test and let me know if this works for you. Thank you very much!

By: Andrea Borghi (aborghi) 2009-01-12 17:18:54.000-0600

patch applied and running. i will keeping an eye on it tomorrow and let you know what happens. for the record, at the restart the 3 registrations to cordia did not timeout at the restart just now.

By: Leif Madsen (lmadsen) 2009-01-13 12:04:00.000-0600

Just pinging this issue to determine if the reporter believes this patch resolved his problem?

By: Andrea Borghi (aborghi) 2009-01-13 16:14:41.000-0600

as noted previously the patch was applied yesterday. the system is running without crashes (it has been up about 22 hours) but the network interconnection or the servers are better so we are *NOT* seeing timeouts on registrations which i believe generated the issue so i don't have a good/bad reply for you. my belief is that the patch resolved the issue, but less a day of testing is a bit small to be sure. i want to run asterisk another 2 days before closing this issue.

By: Mark Michelson (mmichelson) 2009-01-13 16:15:40.000-0600

That's perfectly reasonable. I'll wait until Friday. If I don't hear about any other problems occurring then I will commit the patch and close the issue.

By: Leif Madsen (lmadsen) 2009-01-14 14:10:06.000-0600

aborghi, yes that is reasonable. Sorry, I had read the date wrong and thought much more than a few hours had passed since your note was posted :)

By: Andrea Borghi (aborghi) 2009-01-15 15:03:15.000-0600

ok. i can confirm that this patch resolves the crash problems for me.
Asterisk has not crashed in 2 days, 20 hours of service.

during this period i had some events like these:
Jan 13 20:15:04] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 20:15:04] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 20:15:04] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 20:24:24] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 20:24:24] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 20:24:24] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 21:12:05] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 22:35:44] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 13 22:35:44] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 14 01:32:16] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 14 01:32:16] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 14 01:32:16] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 14 02:09:26] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 14 02:09:26] WARNING[9169] chan_sip.c: Got 200 OK on REGISTER that isn't a register
[Jan 14 02:46:40] WARNING[9169] chan_sip.c: Got a 408 response to our REGISTER on call 4e6d2fb319ff613b7c840f355cd89b04@127.0.1.1 after we had destroyed the registry object
[Jan 14 02:46:40] WARNING[9169] chan_sip.c: Got a 408 response to our REGISTER on call 568b62eb24666c3575291c4a40757375@127.0.1.1 after we had destroyed the registry object
[Jan 14 02:46:40] WARNING[9169] chan_sip.c: Got a 408 response to our REGISTER on call 0b74e9133af5cd9d032a3f516a40ddc1@127.0.1.1 after we had destroyed the registry object
[Jan 14 02:56:00] WARNING[9169] chan_sip.c: Got a 408 response to our REGISTER on call 0b74e9133af5cd9d032a3f516a40ddc1@127.0.1.1 after we had destroyed the registry object
[Jan 14 02:56:00] WARNING[9169] chan_sip.c: Got a 408 response to our REGISTER on call 4e6d2fb319ff613b7c840f355cd89b04@127.0.1.1 after we had destroyed the registry object
[Jan 14 02:56:00] WARNING[9169] chan_sip.c: Got a 408 response to our REGISTER on call 568b62eb24666c3575291c4a40757375@127.0.1.1 after we had destroyed the registry object
[Jan 14 02:56:41] NOTICE[9169] chan_sip.c:    -- Registration for '16462487925@soft1.ny.cordiaip.net' timed out, trying again (Attempt #0)
[Jan 14 02:56:41] NOTICE[9169] chan_sip.c:    -- Registration for '442033553817@soft1.ny.cordiaip.net' timed out, trying again (Attempt #0)
[Jan 14 02:56:41] NOTICE[9169] chan_sip.c:    -- Registration for '33170615949@soft1.ny.cordiaip.net' timed out, trying again (Attempt #0)

but asterisk is up and running.

Why we arrive in a part of code where we try to free an already freed block is a curiosity of mine but i don't have the time to learn the innards of asterisk code to learn that.

Let me clarify: my programmer side does not like this situation at the code level, it's a thing that suggests some other and more frightening effects of locking, thread interactions, bad state machines and such. BUT the simple fact the patch (as this is really a "patch") is confirmed to be working for me, and that is all i can do this time due to my work levels.

I wish to thank you all for the quick resolution.

I hope this will be integrated in releasing 1.4.23

By: Leif Madsen (lmadsen) 2009-01-16 14:12:49.000-0600

Sorry, this will not be making it into 1.4.23, but rather will be in 1.4.24-rc1 and beyond.

Thanks!

By: Digium Subversion (svnbot) 2009-01-16 16:42:11.000-0600

Repository: asterisk
Revision: 168975

U   branches/1.4/channels/chan_sip.c

------------------------------------------------------------------------
r168975 | mmichelson | 2009-01-16 16:42:11 -0600 (Fri, 16 Jan 2009) | 18 lines

Account for possible NULL pointer when we receive a 408 in response to a REGISTER

It may be that by the time we receive a reply to a REGISTER request, the attempt has
timed out and thus the registry structure pointed to by the corresponding sip_pvt has
gone away. This situation was handled properly for a 200 OK response, but the 408
case assumed that the sip_registry struct was non-NULL, thus potentially causing a crash

This commit fixes this assumption and prints out a message to the console if we should
receive a late 408 response to a REGISTER


(closes issue ASTERISK-13339)
Reported by: aborghi
Patches:
     14211.diff uploaded by putnopvut (license 60)
Tested by: aborghi


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=168975

By: Digium Subversion (svnbot) 2009-01-16 16:43:07.000-0600

Repository: asterisk
Revision: 168976

_U  trunk/
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r168976 | mmichelson | 2009-01-16 16:43:07 -0600 (Fri, 16 Jan 2009) | 26 lines

Merged revisions 168975 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r168975 | mmichelson | 2009-01-16 16:42:13 -0600 (Fri, 16 Jan 2009) | 18 lines

Account for possible NULL pointer when we receive a 408 in response to a REGISTER

It may be that by the time we receive a reply to a REGISTER request, the attempt has
timed out and thus the registry structure pointed to by the corresponding sip_pvt has
gone away. This situation was handled properly for a 200 OK response, but the 408
case assumed that the sip_registry struct was non-NULL, thus potentially causing a crash

This commit fixes this assumption and prints out a message to the console if we should
receive a late 408 response to a REGISTER


(closes issue ASTERISK-13339)
Reported by: aborghi
Patches:
     14211.diff uploaded by putnopvut (license 60)
Tested by: aborghi


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=168976

By: Digium Subversion (svnbot) 2009-01-16 16:46:10.000-0600

Repository: asterisk
Revision: 168978

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_sip.c

------------------------------------------------------------------------
r168978 | mmichelson | 2009-01-16 16:46:10 -0600 (Fri, 16 Jan 2009) | 34 lines

Merged revisions 168976 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r168976 | mmichelson | 2009-01-16 16:43:09 -0600 (Fri, 16 Jan 2009) | 26 lines

Merged revisions 168975 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r168975 | mmichelson | 2009-01-16 16:42:13 -0600 (Fri, 16 Jan 2009) | 18 lines

Account for possible NULL pointer when we receive a 408 in response to a REGISTER

It may be that by the time we receive a reply to a REGISTER request, the attempt has
timed out and thus the registry structure pointed to by the corresponding sip_pvt has
gone away. This situation was handled properly for a 200 OK response, but the 408
case assumed that the sip_registry struct was non-NULL, thus potentially causing a crash

This commit fixes this assumption and prints out a message to the console if we should
receive a late 408 response to a REGISTER


(closes issue ASTERISK-13339)
Reported by: aborghi
Patches:
     14211.diff uploaded by putnopvut (license 60)
Tested by: aborghi


........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=168978

By: Digium Subversion (svnbot) 2009-01-16 16:46:31.000-0600

Repository: asterisk
Revision: 168979

_U  branches/1.6.1/
U   branches/1.6.1/channels/chan_sip.c

------------------------------------------------------------------------
r168979 | mmichelson | 2009-01-16 16:46:31 -0600 (Fri, 16 Jan 2009) | 34 lines

Merged revisions 168976 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r168976 | mmichelson | 2009-01-16 16:43:09 -0600 (Fri, 16 Jan 2009) | 26 lines

Merged revisions 168975 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r168975 | mmichelson | 2009-01-16 16:42:13 -0600 (Fri, 16 Jan 2009) | 18 lines

Account for possible NULL pointer when we receive a 408 in response to a REGISTER

It may be that by the time we receive a reply to a REGISTER request, the attempt has
timed out and thus the registry structure pointed to by the corresponding sip_pvt has
gone away. This situation was handled properly for a 200 OK response, but the 408
case assumed that the sip_registry struct was non-NULL, thus potentially causing a crash

This commit fixes this assumption and prints out a message to the console if we should
receive a late 408 response to a REGISTER


(closes issue ASTERISK-13339)
Reported by: aborghi
Patches:
     14211.diff uploaded by putnopvut (license 60)
Tested by: aborghi


........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=168979