Summary: | ASTERISK-01079: safe_asterisk not always working as it should | ||
Reporter: | zoa (zoa) | Labels: | |
Date Opened: | 2004-02-23 11:45:53.000-0600 | Date Closed: | 2004-09-25 02:54:40 |
Priority: | Blocker | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ||
Description: | Sometimes safe_asterisk gives an error when restarting Asterisk after a crash. It sometimes seems to restart too quickly, and Asterisk then refuses to start. If I manually restart just one second later, it works fine, so maybe some retries or a short delay before restarting would fix that. ****** ADDITIONAL INFORMATION ****** cleopatra*CLI> /usr/sbin/safe_asterisk: line 6: 19949 Segmentation fault asterisk ${ASTARGS} 1>&/dev/${TTY} </dev/${TTY} <zoa> Asterisk ended with exit status 139 <zoa> Asterisk exited on signal 11. <zoa> Automatically restarting Asterisk. <zoa> Asterisk ended with exit status 1 <zoa> Asterisk died with code 1. Aborting. <zoa> Disconnected from Asterisk server | ||
Comments: |
By: Brian West (bkw918) 2004-02-27 00:16:22.000-0600
What distro?

By: zoa (zoa) 2004-02-27 04:15:39.000-0600
Debian (I've seen it on several servers already, all running Debian). Maybe it's due to zaptel? Or maybe it's due to the core dump taking a long time to write? (I think it also has to do with the load when it crashes.)

By: James Golovich (jamesgolovich) 2004-03-02 16:03:59.000-0600
I would bet that exit(1) is being called because ast_tryconnect is probably returning true. It might be good to make Asterisk exit with a value > 128 at that point so it will be restarted. Either that, or modify safe_asterisk so that it restarts if the return value is 1.

By: James Golovich (jamesgolovich) 2004-03-07 17:01:07.000-0600
Any way to reproduce this behavior? If so, add an ast_log line before each of the exit(1) calls in Asterisk to find out which one is doing it. Then I guess we need to decide whether to change safe_asterisk to restart if the return code is > 0, or to change any error return values in asterisk.c that are possibly temporary conditions to be > 127.

By: zoa (zoa) 2004-03-08 06:24:15.000-0600
I have no way of reproducing it... :( It just happens every now and then. I now do a restart on exit code 1.

By: James Golovich (jamesgolovich) 2004-03-08 17:51:25.000-0600
I think the best way to fix this is to change the exit codes of any potentially temporary failure condition to something > 127. Of course, anything that isn't a temporary failure should stay exit(1) so safe_asterisk doesn't keep restarting when it isn't going to be able to do anything.

By: jjanzer (jjanzer) 2004-03-09 16:18:36.000-0600
I would recommend using exit codes from "/usr/include/sysexits.h" if at all possible (I don't know whether the current exit codes fall into those predefined ones). As for reproducing the bug, it's simple: just cause Asterisk to abort() and watch safe_asterisk puke every now and then. You could even do something as simple and dumb as calling an invalid pointer. For me, it pukes more often than not. I agree with zoa in that I think safe_asterisk is *too* aggressive about restarting.

By: James Golovich (jamesgolovich) 2004-03-09 17:06:22.000-0600
So the question comes down to: do we want safe_asterisk to sleep for a second or so after Asterisk dies, or should we change the exit code of Asterisk when it thinks Asterisk is already running?

By: James Golovich (jamesgolovich) 2004-03-09 17:35:22.000-0600
I looked over sysexits.h, and unfortunately all of the error codes in there are < 129, so none of them would cause safe_asterisk to restart. I tried adding a sleep 1 to the code that runs when safe_asterisk sees a return value greater than 128, and it seems to take care of it.

By: jjanzer (jjanzer) 2004-03-09 18:06:12.000-0600
Since all of the calls are lost anyway, I don't see why a one-second delay is going to really kill anyone; sounds good to me. But do you think this will really fix the problem? (Is there a possibility that on a really *slow* machine 1 second isn't enough?) Maybe there should be a more programmatic solution, not that I can think of one (; edited on: 03-09-04 16:55

By: zoa (zoa) 2004-03-12 17:34:34.000-0600
Fixed in CVS. |
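
For readers following the thread, the approach discussed above (restart only when the exit status is greater than 128, and pause briefly before doing so) could look roughly like the sketch below. This is not the actual safe_asterisk script or the change that went into CVS; the function names `should_restart` and `run_supervised` and the configurable delay are made up for illustration, and the messages simply mirror the log in the description.

```shell
#!/bin/sh
# Hypothetical sketch of a safe_asterisk-style restart loop.
# should_restart and run_supervised are illustrative names, not part of Asterisk.

# should_restart STATUS
# Succeeds when STATUS is greater than 128, which is how the shell reports
# a process killed by a signal (128 + signal number, e.g. 139 for signal 11).
should_restart() {
    [ "$1" -gt 128 ]
}

# run_supervised DELAY CMD [ARGS...]
# Runs CMD in a loop. After a signal death it waits DELAY seconds and starts
# CMD again; on a clean exit or an ordinary error code it stops and returns
# that code.
run_supervised() {
    delay=$1
    shift
    while :; do
        status=0
        "$@" || status=$?
        echo "Asterisk ended with exit status $status"
        if [ "$status" -eq 0 ]; then
            echo "Asterisk shutdown normally."
            return 0
        elif should_restart "$status"; then
            echo "Asterisk exited on signal $((status - 128))."
            # Give the crashed process time to finish going away
            # (the thread suggests one second) before trying again.
            sleep "$delay"
            echo "Automatically restarting Asterisk."
        else
            echo "Asterisk died with code $status. Aborting."
            return "$status"
        fi
    done
}
```

With this sketch, something like `run_supervised 1 asterisk ${ASTARGS}` would correspond to the one-second delay James tried. As jjanzer points out, a fixed delay may still be too short on a slow or heavily loaded machine, so the delay is left as a parameter here.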