
Summary: ASTERISK-01079: safe_asterisk not always working as it should
Reporter: zoa (zoa)
Labels:
Date Opened: 2004-02-23 11:45:53.000-0600
Date Closed: 2004-09-25 02:54:40
Priority: Blocker
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
Description: Sometimes safe_asterisk gives an error when restarting asterisk after a crash.

It sometimes seems to restart too quickly, and asterisk then refuses to start.

If I manually restart just one second later, it works fine.

Maybe some retries or a short delay before restarting would fix that.



****** ADDITIONAL INFORMATION ******

cleopatra*CLI> /usr/sbin/safe_asterisk: line 6: 19949 Segmentation fault      asterisk ${ASTARGS} 1>&/dev/${TTY} </dev/${TTY}
<zoa> Asterisk ended with exit status 139
<zoa> Asterisk exited on signal 11.
<zoa> Automatically restarting Asterisk.
<zoa> Asterisk ended with exit status 1
<zoa> Asterisk died with code 1.  Aborting.
<zoa> Disconnected from Asterisk server
Comments:

By: Brian West (bkw918) 2004-02-27 00:16:22.000-0600

what distro?

By: zoa (zoa) 2004-02-27 04:15:39.000-0600

Debian (I've seen it on several servers already, all running Debian).

Maybe it's due to zaptel? Or maybe it's due to the core dump taking a long time to write?

(I think it also has to do with the load when it crashes.)

By: James Golovich (jamesgolovich) 2004-03-02 16:03:59.000-0600

I would bet that exit(1) is being called because ast_tryconnect is probably returning true.  It might be good to make asterisk exit with a value > 128 at that point so it will be restarted.  Either that, or modify safe_asterisk so that it restarts if the return value is 1.

By: James Golovich (jamesgolovich) 2004-03-07 17:01:07.000-0600

Any way to reproduce this behavior? If so, add an ast_log line before each of the exit(1) calls in asterisk to find out which one of them is doing it.  Then I guess we need to decide whether we want to change safe_asterisk to restart if the return code is > 0, or change any error return values in asterisk.c that could be temporary conditions to be > 127.

By: zoa (zoa) 2004-03-08 06:24:15.000-0600

I have no way of reproducing it... :( It just happens every now and then.

I now do a restart on exit code 1.

By: James Golovich (jamesgolovich) 2004-03-08 17:51:25.000-0600

I think the best way to fix this is to change the exit code of any potentially temporary failure condition to something > 127.  Of course, anything that isn't a temporary failure should stay exit(1), so safe_asterisk doesn't keep restarting when it isn't going to be able to do anything.
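
The restart decision discussed in this thread can be sketched as a small shell function. This is a minimal illustration, not the actual safe_asterisk code: the function name `classify_exit` and the messages it prints are hypothetical. It relies only on the usual shell convention that a process killed by signal N is reported with exit status 128 + N, which is why the segfault in the log above shows up as 139.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the real safe_asterisk script.
# Prints the action a wrapper might take for a given exit status.
classify_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "stop: clean shutdown"
    elif [ "$status" -gt 128 ]; then
        # Death by signal N is reported as 128 + N (139 = 128 + 11, SIGSEGV).
        echo "restart: signal $((status - 128))"
    else
        # Ordinary error codes such as exit(1) are treated as permanent.
        echo "abort: exit code $status"
    fi
}

classify_exit 139   # prints "restart: signal 11"
classify_exit 1     # prints "abort: exit code 1"
```

Under this scheme, an exit(1) caused by a temporary condition is indistinguishable from a permanent failure, which is the motivation for moving temporary failures to a code above 127.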

By: jjanzer (jjanzer) 2004-03-09 16:18:36.000-0600

I would recommend using exit codes from "/usr/include/sysexits.h" if at all possible (dunno if the current exit codes fall into those predefined ones).

As far as reproducing the bug, it's simple... just cause asterisk to abort() and watch safe_asterisk puke every now and then. You could even do something as simple and dumb as calling an invalid pointer.

For me, it pukes more times than not.

I agree with zoa in that I think safe_asterisk is *too* aggressive about restarting.

By: James Golovich (jamesgolovich) 2004-03-09 17:06:22.000-0600

So the question comes down to this: do we want safe_asterisk to sleep for a second or so after asterisk dies, or should we change the exit code asterisk uses when it thinks asterisk is already running?

By: James Golovich (jamesgolovich) 2004-03-09 17:35:22.000-0600

Looked over sysexits.h, and unfortunately all of the error codes in there are < 129, so none of them would cause safe_asterisk to restart.  I tried adding a sleep 1 to the code that runs when safe_asterisk sees a return value gt 128, and it seems to take care of it.
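
A restart loop with the delay described above might look roughly like the following. It is a sketch under stated assumptions rather than the change that went into CVS: `run_with_restart` and `RESTART_DELAY` are hypothetical names, and the messages are only modeled on the log in the description.

```shell
#!/bin/sh
# Sketch only: run_with_restart and RESTART_DELAY are hypothetical names.
RESTART_DELAY=${RESTART_DELAY:-1}

run_with_restart() {
    while :; do
        status=0
        "$@" || status=$?
        if [ "$status" -gt 128 ]; then
            echo "exited on signal $((status - 128)), restarting in ${RESTART_DELAY}s"
            # Give the dead process time to finish going away before retrying.
            sleep "$RESTART_DELAY"
            continue
        fi
        echo "ended with exit status $status, not restarting"
        return "$status"
    done
}
```

It would be invoked with the command to supervise as its arguments, for example `run_with_restart asterisk ${ASTARGS}`.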

By: jjanzer (jjanzer) 2004-03-09 18:06:12.000-0600

Since all of the calls are lost anyway, I don't see why a one-second delay is going to really kill anyone, sounds good to me.

But do you think this will really fix the problem? (Is there a possibility that on a really *slow* machine 1 second isn't enough?) Maybe there should be a more programmatic solution, not that I can think of one (;

edited on: 03-09-04 16:55
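
One way to address the slow-machine concern, combining it with the retries suggested in the original report, is to retry a bounded number of times with a growing delay when the start attempt fails with exit code 1. The sketch below is an illustration only: `retry_on_code_1` and `RETRY_DELAY` are hypothetical names, and treating exit code 1 as possibly temporary is an assumption based on the log in the description. Because a permanent failure also exits with 1, the attempt limit is what keeps the loop from spinning forever.

```shell
#!/bin/sh
# Sketch only: retry_on_code_1 and RETRY_DELAY are hypothetical names.
# Usage: retry_on_code_1 MAX_TRIES command [args...]
retry_on_code_1() {
    max_tries=$1
    shift
    delay=${RETRY_DELAY:-1}
    attempt=1
    while :; do
        status=0
        "$@" || status=$?
        # Stop on success, on any failure other than code 1,
        # or once the attempt budget is used up.
        if [ "$status" -ne 1 ] || [ "$attempt" -ge "$max_tries" ]; then
            return "$status"
        fi
        echo "exit code 1 on attempt $attempt, retrying in ${delay}s"
        sleep "$delay"
        delay=$((delay * 2))      # wait longer each time for slow machines
        attempt=$((attempt + 1))
    done
}
```

With the default delay and five attempts, the waits are 1, 2, 4, and 8 seconds, so a machine that needs more than one second to clean up still gets a chance before the wrapper gives up.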

By: zoa (zoa) 2004-03-12 17:34:34.000-0600

fixed in cvs.