[Home]

Summary:ASTERISK-09243: realtime prune (and others) may segfault asterisk (timing issue)
Reporter:Vadim Berezniker (kryptolus)Labels:
Date Opened:2007-04-11 07:31:56Date Closed:2008-04-22 11:55:32
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/Registration
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) expire_register2.patch
Description:The function expire_register doesn't do any locking before mucking with the peer.
If it gets pre-empted, there's a chance that the peer might be destroyed before the control returns to expire_register.

If you execute a "prune realtime peer" at the right time, asterisk will segfault. It is not limited to just this however, as I have experienced several segfaults with this signature without any intervention. However, looking at the code I can only see a problem with the pruning code, I don't see any possible issues with any other place in sip.

My patch queues up peers to be destroyed and they are ultimately destroyed from the monitor thread which should guarantee that expire_register cannot be running  at the same time. The other alternative is to add a check to expire_register to check if peer is still inside the peer list. However, that has potential to impact performance because that check would block a lot of things on every expire.



Comments:By: Vadim Berezniker (kryptolus) 2007-04-11 08:36:12

Just noticed an issue in the first patch. I uploaded a newer version.

By: Curt Moore (jcmoore) 2007-05-29 22:31:20

kryptolus, thank you for your submission. I apologize that it's taken a while for someone to get back to you.

Does anyone have any thoughts on this issue?  I'm not an expert on this part of the code so any input is appreciated.

By: Olle Johansson (oej) 2007-05-30 01:20:51

I've been worried at this but haven't seen any bug reports. Will have to consider if this is something we want to change in 1.2 this late in the release cycle, but certainly in 1.4 and trunk.

By: Vadim Berezniker (kryptolus) 2007-06-22 08:28:29

We're seeing a segfault in 1.4 regularly (once every couple of days) without any interaction (no sip reload / sip prune, etc)

The backtrace looks exactly like it did when using "sip prune realtime peer."
I'm going to try using my patch in 1.4 to see if it eliminates the crash.

#0  0x0085bd34 in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#1  0x008f907c in expire_register (data=0xb7c64418) at /root/asterisk-1.4.5-nbs-5/include/asterisk/lock.h:532
#2  0x080ed39a in ast_sched_runq (con=0x82ef020) at sched.c:359
#3  0x0091c12b in do_monitor (data=0x0) at chan_sip.c:15344
#4  0x080f91d5 in dummy_start (data=0x92e9b4) at utils.c:545
ASTERISK-1  0x0085a371 in start_thread () from /lib/tls/libpthread.so.0
ASTERISK-2  0x00745ffe in clone () from /lib/tls/libc.so.6

By: rayjay (rayjay) 2007-08-17 01:50:31

I can confirm that we have had the same segfaults in 1.4 and hence I have had to disable the sip prune command from our scripts for now until this is resolved.  kryptolus, I am running the very latest 1.4.10.1 code.  Do you have a version of the patch for the latest 1.4 release I could try myself?

By: Russell Bryant (russell) 2007-08-17 08:59:54

I just made some commits to 1.4 and trunk that don't address the problem reported here, but should fix the crash that kryptolus is having.

By: Yuri (ys) 2007-11-09 08:03:23.000-0600

This is still exist in 1.4.13.

Sometimes (after sip reload) if i have outgoing sip registrations.
I see coredump:

#0  0x285df62e in ast_mutex_lock (pmutex=0x5c) at lock.h:716
No locals.
#1  0x285f7159 in expire_register (data=0x81ff000) at chan_sip.c:7741
       newcount = 0
       peer = (struct sip_peer *) 0x0
       __PRETTY_FUNCTION__ = "expire_register"
#2  0x080e15dc in ast_sched_runq (con=0x81bab40) at sched.c:359
       current = (struct sched *) 0x81dbe20
       tv = {tv_sec = 1194613236, tv_usec = 648355}
       numevents = 0
       res = 677590056
#3  0x28617b35 in do_monitor (data=0x0) at chan_sip.c:15389
       res = 0
       sip = (struct sip_pvt *) 0x0
       peer = (struct sip_peer *) 0x0
       t = 1194613236
       fastrestart = 0
       lastpeernum = -1
       curpeernum = 14
       reloading = 0
       __PRETTY_FUNCTION__ = "do_monitor"
#4  0x080eeefd in dummy_start (data=0x81bf160) at utils.c:805
       ret = (void *) 0x0
      a = {start_routine = 0x2861758c <do_monitor>, data = 0x0,
 name = 0x81bee80 "do_monitor", ' ' <repeats 11 times>, "started at [15443] chan_sip.c restart_monitor()"}
ASTERISK-1  0x282c33a5 in pthread_create () from /lib/libpthread.so.2
No symbol table info available.
ASTERISK-2  0x28380137 in _ctx_start () from /lib/libc.so.6
No symbol table info available.
(gdb)

By: jmls (jmls) 2008-02-06 04:07:19.000-0600

russell: Any luck on getting this sorted ?

By: Russell Bryant (russell) 2008-02-06 09:38:26.000-0600

The correct fix for the core issue here is to use reference counting on these objects.  Steve Murphy has been working on this, but only for Asterisk 1.6.  Backporting this to 1.4 would be a pretty huge job, but it will have to be done to fix this issue.

By: rayjay (rayjay) 2008-02-06 14:17:44.000-0600

We rely heavily on realtime for provisioning our peers and users, and rely on the prune method to refresh the * cache when a user resets their password.  This has started to become quite a problem for us now and we need to get the prune method working reliably fairly soon.  Will the attached patch here work for 1.4?  I realise we may need to change some variable/object names for 1.4 code, but will the logic in the patch actually fix the problem temporarily until Steve gets the proper fix backported to 1.4?  Perhaps Kyryptolus can comment?  Did we get any feedback on the initial patch?

By: Vadim Berezniker (kryptolus) 2008-02-06 14:45:39.000-0600

I wrote a patch for my employer to address the issue and we had not had a crash in this area since then. I'm in school at the moment, so I don't have time to separate out this patch and update it to the latest version...

By: Steve Murphy (murf) 2008-03-19 13:45:54

I propose that I update and apply this patch for 1.4; for trunk & 1.6, we go the more general route of doing the astobj2 implementation. I honetly think that using refcounts on peers will cover this problem; I know that even in the trunk version so far, I've punted on fully tracking peer refs in the callbacks, but this could be expanded to cover this sort of issue.

By: Steve Murphy (murf) 2008-03-19 13:47:29

PS. I suggest applying this patch to 1.4 instead of the full astobj2 conversion because it'll be far, far, far cheaper and unintrusive than the "full body makeover" we are applying to trunk.

By: Russell Bryant (russell) 2008-03-19 14:04:19

This patch doesn't make any sense, though.  The problem is that peers are already reference counted.  If they were ref counted properly, then there wouldn't be any race conditions or problems to deal with.  This patch is just hiding the fact that the ref counting is not correct, and is extremely unlikely to actually completely fix a problem.

By: Kevin P. Fleming (kpfleming) 2008-03-19 15:10:51

There really isn't any point in not 'fully tracking refs'... doing so is just an invitation to even harder to isolate problems because it *looks* like the code is doing the right thing.

In spite of the work being done for 1.6, the existing reference counting in 1.4 should be made as stable and reliable as it can be. If expire_register() is operating on a peer object without holding a reference to that object, then we are guaranteed to crash at some point.

By: Steve Murphy (murf) 2008-03-19 18:13:57

I see your point, but the PAIN, oh, the pain!

My main concerns are these:
(1) An overhaul of the code in 1.4 will be pretty intrusive. The chances of introducing more bugs than we solve is extremely high.
(2) Next, the reason for death in this case: expire_register. In the current code, asterisk has a concept of when to destroy objects (at least, to a degree); why expire_register callbacks are left scheduled is probably the true bug. Let me put it another way: the program is crashing because we destroyed a peer, and some callback is left around that refs that peer? Huh? I don't quite grok how or why we would 'partially' destroy an object, and think it'd be OK for the struct to live a while longer because something else is referencing it... especially a callback.... I have to analyze this issue, and see how/why this is happening... The real answer might not be the patch, but rather to re-arrange the code in expire_register() so it isn't in the path to peer destruction...

By: Digium Subversion (svnbot) 2008-04-21 10:49:36

Repository: asterisk
Revision: 114329

A   team/russell/issue_9520/

------------------------------------------------------------------------
r114329 | russell | 2008-04-21 10:49:36 -0500 (Mon, 21 Apr 2008) | 2 lines

Create a branch for fixing a bunch of race conditions for issue ASTERISK-9243

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=114329

By: Digium Subversion (svnbot) 2008-04-21 11:00:40

Repository: asterisk
Revision: 114331

U   team/russell/issue_9520/channels/chan_iax2.c
U   team/russell/issue_9520/channels/chan_sip.c
U   team/russell/issue_9520/include/asterisk/sched.h

------------------------------------------------------------------------
r114331 | russell | 2008-04-21 11:00:36 -0500 (Mon, 21 Apr 2008) | 3 lines

Add proper reference tracking for peers with regards to the expire_register callback
(should fix the crash reported in issue ASTERISK-9243)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=114331

By: Digium Subversion (svnbot) 2008-04-21 11:33:02

Repository: asterisk
Revision: 114333

U   team/russell/issue_9520/channels/chan_sip.c

------------------------------------------------------------------------
r114333 | russell | 2008-04-21 11:33:01 -0500 (Mon, 21 Apr 2008) | 3 lines

Add reference tracking with respect to scheduler callbacks when qualifying peers.
(inspired by issue ASTERISK-9243)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=114333

By: Russell Bryant (russell) 2008-04-21 11:33:22

The code in this branch is ready for testing.  I'm going to do some of my own testing and commit it unless I hear any negative feedback.

By: Russell Bryant (russell) 2008-04-21 12:16:10

It's actually not working yet, don't try it.  :)

By: Russell Bryant (russell) 2008-04-21 17:25:09

Ok, stuff has been fixed up, feel free to test now ..

By: Digium Subversion (svnbot) 2008-04-22 10:15:30

Repository: asterisk
Revision: 114522

U   branches/1.4/channels/chan_sip.c
U   branches/1.4/include/asterisk/sched.h

------------------------------------------------------------------------
r114522 | russell | 2008-04-22 10:15:29 -0500 (Tue, 22 Apr 2008) | 13 lines

Merge changes from team/russell/issue_9520

These changes make sure that the reference count for sip_peer objects properly
reflects the fact that the peer is sitting in the scheduler for a scheduled
callback for qualifying peers or for expiring registrations.  Without this, it
was possible for these callbacks to happen at the same time that the peer was
being destroyed.  This was especially likely to happen with realtime peers, and
for people making use of the realtime prune CLI command.

(closes issue ASTERISK-9243)
Reported by: kryptolus
Committed patch by me

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=114522

By: Digium Subversion (svnbot) 2008-04-22 10:15:51

Repository: asterisk
Revision: 114523

_U  trunk/

------------------------------------------------------------------------
r114523 | russell | 2008-04-22 10:15:50 -0500 (Tue, 22 Apr 2008) | 20 lines

Blocked revisions 114522 via svnmerge

........
r114522 | russell | 2008-04-22 10:20:37 -0500 (Tue, 22 Apr 2008) | 13 lines

Merge changes from team/russell/issue_9520

These changes make sure that the reference count for sip_peer objects properly
reflects the fact that the peer is sitting in the scheduler for a scheduled
callback for qualifying peers or for expiring registrations.  Without this, it
was possible for these callbacks to happen at the same time that the peer was
being destroyed.  This was especially likely to happen with realtime peers, and
for people making use of the realtime prune CLI command.

(closes issue ASTERISK-9243)
Reported by: kryptolus
Committed patch by me

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=114523

By: Digium Subversion (svnbot) 2008-04-22 10:16:12

Repository: asterisk
Revision: 114524

_U  branches/1.6.0/

------------------------------------------------------------------------
r114524 | russell | 2008-04-22 10:16:11 -0500 (Tue, 22 Apr 2008) | 27 lines

Blocked revisions 114523 via svnmerge

................
r114523 | russell | 2008-04-22 10:20:53 -0500 (Tue, 22 Apr 2008) | 20 lines

Blocked revisions 114522 via svnmerge

........
r114522 | russell | 2008-04-22 10:20:37 -0500 (Tue, 22 Apr 2008) | 13 lines

Merge changes from team/russell/issue_9520

These changes make sure that the reference count for sip_peer objects properly
reflects the fact that the peer is sitting in the scheduler for a scheduled
callback for qualifying peers or for expiring registrations.  Without this, it
was possible for these callbacks to happen at the same time that the peer was
being destroyed.  This was especially likely to happen with realtime peers, and
for people making use of the realtime prune CLI command.

(closes issue ASTERISK-9243)
Reported by: kryptolus
Committed patch by me

........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=114524

By: Digium Subversion (svnbot) 2008-04-22 11:55:32

Repository: asterisk
Revision: 114535

_U  team/seanbright/resolve-shadow-warnings/
U   team/seanbright/resolve-shadow-warnings/CHANGES
U   team/seanbright/resolve-shadow-warnings/apps/app_jack.c
U   team/seanbright/resolve-shadow-warnings/channels/chan_sip.c
U   team/seanbright/resolve-shadow-warnings/configs/sip_notify.conf.sample
U   team/seanbright/resolve-shadow-warnings/main/manager.c
U   team/seanbright/resolve-shadow-warnings/main/utils.c

------------------------------------------------------------------------
r114535 | seanbright | 2008-04-22 11:55:30 -0500 (Tue, 22 Apr 2008) | 73 lines

Merged revisions 114520,114523,114527,114529,114533 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r114520 | murf | 2008-04-22 10:38:46 -0400 (Tue, 22 Apr 2008) | 15 lines


Hopefully, this will resolve the issues that russellb had with this log_show_lock().
I gathered the code that filled the string, and put it in a different func which
I cryptically call "append_lock_information()".
Now, both log_show_lock(), and handle_show_locks() both call this code to do
the work. Tested, seems to work fine.
Also, log_show_lock was modified to use the ast_str stuff, along with checking
for successful ast_str creation, and freeing the ast_str obj when finished.
A break was inserted to terminate the search for the lock; we should never
see it twice.

An example usage in chan_sip.c was created as a comment, for instructional
purposes.


................
r114523 | russell | 2008-04-22 11:20:53 -0400 (Tue, 22 Apr 2008) | 20 lines

Blocked revisions 114522 via svnmerge

........
r114522 | russell | 2008-04-22 10:20:37 -0500 (Tue, 22 Apr 2008) | 13 lines

Merge changes from team/russell/issue_9520

These changes make sure that the reference count for sip_peer objects properly
reflects the fact that the peer is sitting in the scheduler for a scheduled
callback for qualifying peers or for expiring registrations.  Without this, it
was possible for these callbacks to happen at the same time that the peer was
being destroyed.  This was especially likely to happen with realtime peers, and
for people making use of the realtime prune CLI command.

(closes issue ASTERISK-9243)
Reported by: kryptolus
Committed patch by me

........

................
r114527 | russell | 2008-04-22 11:46:01 -0400 (Tue, 22 Apr 2008) | 8 lines

Correct action_ping() and action_events() with regards to Manager 1.1
documentation.  Also, fix a bug in xml_translate().

(closes issue ASTERISK-11124)
Reported by: ys
Patches:
     trunk_manager.c.diff uploaded by ys (license 281)

................
r114529 | file | 2008-04-22 11:54:06 -0400 (Tue, 22 Apr 2008) | 6 lines

Add support for authenticating on a NOTIFY request. This is useful for phones that require it when sending them a special packet to get them to do something (such as reload their configuration).
(closes issue ASTERISK-9602)
Reported by: IgorG
Patches:
     sipnotify-113980-v14.patch uploaded by IgorG (license 20)

................
r114533 | russell | 2008-04-22 12:47:00 -0400 (Tue, 22 Apr 2008) | 4 lines

Add a c() option for the Jack() application and JACK_HOOK() funciton for supplying
a custom client name.  Using the channel name is still the default.  This was done
at the request of Jared Smith.

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=114535