Summary: | ASTERISK-15495: [patch] segfault on chanspy due to race in main/channel.c | ||
Reporter: | Tim Ringenbach at Asteria Solutions Group (tim_ringenbach) | Labels: | |
Date Opened: | 2010-01-22 17:52:38.000-0600 | Date Closed: | 2010-03-11 09:33:53.000-0600 |
Priority: | Minor | Regression? | No |
Status: | Closed/Complete | Components: | Applications/app_chanspy |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) chanspy_backtrace_0016678.txt ( 1) chanspy_crash.diff ( 2) chanspy_crashv2.diff ( 3) datastore_destroy_race.diff ( 4) example_bt_datestore_destroy_race.txt | |
Description: | When channel.c destroys the datastore on the channel, it doesn't hold the channel lock while calling the destroy callback. It really ought to, because otherwise it's accessing the datastore list without locking. I've gotten a segfault trying to lock the mutex in the ds destroy function in app_chanspy because of this race. Holding the channel lock during the destroy should be safe because it is also held during the fixup callback, and app_chanspy has already been patched to avoid the possible deadlock from that locking order issue. | ||
Comments: | By: David Vossel (dvossel) 2010-02-12 10:29:27.000-0600 I've uploaded a patch that I believe will resolve the issue. It doesn't look like the problem you are having has to do with locking. The chanspy datastore is created on the stack, and the mutex is destroyed before exiting the main chanspy function. Regardless if we lock the datastore's channel during removal or not, if a channel still has the chanspy datastore after the chanspy app exits it will crash. I believe the real fix is to put a couple of checks in chanspy to guarantee this never happens. Please test my patch and let me know if it resolves the issue. I'm posting this patch on reviewboard as well. It will be at this link, https://reviewboard.asterisk.org/r/500/ By: Tim Ringenbach at Asteria Solutions Group (tim_ringenbach) 2010-02-12 13:19:02.000-0600 It looks like I already had added a chanspy_ds_free(&chanspy_ds); in the same place you have it, and it was still crashing with that. I'll try testing your patch and post something to the review board. I suspect that I will still be able to make it crash though. If you look at the code in channel.c: while ((datastore = AST_LIST_REMOVE_HEAD(&chan->datastores, entry))) /* Free the data store */ ast_channel_datastore_free(datastore); After it removes the datastore from the linked list, but before it calls ast_channel_datastore_free(), chanspy might decide it's done, and call chanspy_ds_free(). chanspy_ds_free() won't find anything to remove, but it is ok with that. I assumed since it didn't find anything to remove, it's safe to exit and let the stack memory go away. channel.c can't know to tell it otherwise because it's not on the list anymore. But the other thread is going about to call the destroy method, which is going to access invalid stack memory and crash. By: David Vossel (dvossel) 2010-02-12 14:18:49.000-0600 I uploaded a new patch with an additional improvement. We should really be checking the result of ast_channel_datastore_remove(). By: Tim Ringenbach at Asteria Solutions Group (tim_ringenbach) 2010-02-12 16:06:42.000-0600 I applied your new patch to a stock 1.4.29, plus I added a "sleep(30)" where I think the race is happening to try to bring it out. Now I am running the same stress test I was using when I originally got the crash, and I'll post the backtrace if I can get it to crash. I don't see what good checking the return value of ast_channel_datastore_remove() does. You know it didn't remove the channel datastore and really should have, but what are you going to do about it? You have to not free the memory (by returning) until the other thread finishes calling your callback, and you have no way of knowing when that will be. That's why I think you need to protect it from happening by making channel.c whole the channel lock while doing that. By: Tim Ringenbach at Asteria Solutions Group (tim_ringenbach) 2010-02-12 16:12:35.000-0600 I just attached the backtrace I produced while running your patch plus changing main/channel.c to look like this: /* Get rid of each of the data stores on the channel */ while ((datastore = AST_LIST_REMOVE_HEAD(&chan->datastores, entry))) { sleep(30); //Try to make the race happen /* Free the data store */ ast_channel_datastore_free(datastore); } Is there a good reason not to add the locking my patch adds? By: David Vossel (dvossel) 2010-02-12 16:21:04.000-0600 You're right, locking is the problem. I discarded my review request. I've investigated any possible issues that could arise from locking the channel during the datastore removal in 1.4, and it does not appear to be an issue. I'll have to do a little more research before I feel comfortable with this in 1.6.x and trunk though. Thanks for the feedback! By: Tim Ringenbach at Asteria Solutions Group (tim_ringenbach) 2010-02-12 16:22:16.000-0600 You're welcome. Thanks for reviewing my patch! By: Digium Subversion (svnbot) 2010-02-12 17:30:18.000-0600 Repository: asterisk Revision: 246545 U branches/1.4/main/channel.c ------------------------------------------------------------------------ r246545 | dvossel | 2010-02-12 17:30:18 -0600 (Fri, 12 Feb 2010) | 16 lines lock channel during datastore removal On channel destruction the channel's datastores are removed and destroyed. Since there are public API calls to find and remove datastores on a channel, a lock should be held whenever datastores are removed and destroyed. This resolves a crash caused by a race condition in app_chanspy.c. (closes issue ASTERISK-15495) Reported by: tim_ringenbach Patches: datastore_destroy_race.diff uploaded by tim ringenbach (license 540) Tested by: dvossel ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=246545 By: Digium Subversion (svnbot) 2010-02-12 17:32:34.000-0600 Repository: asterisk Revision: 246546 _U trunk/ U trunk/main/channel.c ------------------------------------------------------------------------ r246546 | dvossel | 2010-02-12 17:32:34 -0600 (Fri, 12 Feb 2010) | 21 lines Merged revisions 246545 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r246545 | dvossel | 2010-02-12 17:30:17 -0600 (Fri, 12 Feb 2010) | 16 lines lock channel during datastore removal On channel destruction the channel's datastores are removed and destroyed. Since there are public API calls to find and remove datastores on a channel, a lock should be held whenever datastores are removed and destroyed. This resolves a crash caused by a race condition in app_chanspy.c. (closes issue ASTERISK-15495) Reported by: tim_ringenbach Patches: datastore_destroy_race.diff uploaded by tim ringenbach (license 540) Tested by: dvossel ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=246546 By: Digium Subversion (svnbot) 2010-02-12 17:33:38.000-0600 Repository: asterisk Revision: 246547 _U branches/1.6.2/ U branches/1.6.2/main/channel.c ------------------------------------------------------------------------ r246547 | dvossel | 2010-02-12 17:33:38 -0600 (Fri, 12 Feb 2010) | 28 lines Merged revisions 246546 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ................ r246546 | dvossel | 2010-02-12 17:32:33 -0600 (Fri, 12 Feb 2010) | 21 lines Merged revisions 246545 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r246545 | dvossel | 2010-02-12 17:30:17 -0600 (Fri, 12 Feb 2010) | 16 lines lock channel during datastore removal On channel destruction the channel's datastores are removed and destroyed. Since there are public API calls to find and remove datastores on a channel, a lock should be held whenever datastores are removed and destroyed. This resolves a crash caused by a race condition in app_chanspy.c. (closes issue ASTERISK-15495) Reported by: tim_ringenbach Patches: datastore_destroy_race.diff uploaded by tim ringenbach (license 540) Tested by: dvossel ........ ................ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=246547 By: Digium Subversion (svnbot) 2010-02-12 17:34:10.000-0600 Repository: asterisk Revision: 246548 _U branches/1.6.1/ U branches/1.6.1/main/channel.c ------------------------------------------------------------------------ r246548 | dvossel | 2010-02-12 17:34:10 -0600 (Fri, 12 Feb 2010) | 28 lines Merged revisions 246546 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ................ r246546 | dvossel | 2010-02-12 17:32:33 -0600 (Fri, 12 Feb 2010) | 21 lines Merged revisions 246545 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r246545 | dvossel | 2010-02-12 17:30:17 -0600 (Fri, 12 Feb 2010) | 16 lines lock channel during datastore removal On channel destruction the channel's datastores are removed and destroyed. Since there are public API calls to find and remove datastores on a channel, a lock should be held whenever datastores are removed and destroyed. This resolves a crash caused by a race condition in app_chanspy.c. (closes issue ASTERISK-15495) Reported by: tim_ringenbach Patches: datastore_destroy_race.diff uploaded by tim ringenbach (license 540) Tested by: dvossel ........ ................ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=246548 By: Digium Subversion (svnbot) 2010-02-12 17:35:07.000-0600 Repository: asterisk Revision: 246549 _U branches/1.6.0/ U branches/1.6.0/main/channel.c ------------------------------------------------------------------------ r246549 | dvossel | 2010-02-12 17:35:06 -0600 (Fri, 12 Feb 2010) | 28 lines Merged revisions 246546 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ................ r246546 | dvossel | 2010-02-12 17:32:33 -0600 (Fri, 12 Feb 2010) | 21 lines Merged revisions 246545 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r246545 | dvossel | 2010-02-12 17:30:17 -0600 (Fri, 12 Feb 2010) | 16 lines lock channel during datastore removal On channel destruction the channel's datastores are removed and destroyed. Since there are public API calls to find and remove datastores on a channel, a lock should be held whenever datastores are removed and destroyed. This resolves a crash caused by a race condition in app_chanspy.c. (closes issue ASTERISK-15495) Reported by: tim_ringenbach Patches: datastore_destroy_race.diff uploaded by tim ringenbach (license 540) Tested by: dvossel ........ ................ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=246549 |