ASTERISK-01079: safe_asterisk not always working as it should

Summary: ASTERISK-01079: safe_asterisk not always working as it should

Reporter: zoa (zoa)
Labels:
Date Opened: 2004-02-23 11:45:53.000-0600
Date Closed: 2004-09-25 02:54:40
Priority: Blocker
Regression? No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:

Description: Sometimes safe_asterisk gives an error when restarting asterisk after a crash.

It seems to restart too quickly sometimes, and asterisk then refuses to start.

If I manually restart just one second later, it works fine.

Maybe some retries or a short delay before restarting would fix that.

****** ADDITIONAL INFORMATION ******

cleopatra*CLI> /usr/sbin/safe_asterisk: line 6: 19949 Segmentation fault asterisk ${ASTARGS} 1>&/dev/${TTY} </dev/${TTY}
<zoa> Asterisk ended with exit status 139
<zoa> Asterisk exited on signal 11.
<zoa> Automatically restarting Asterisk.
<zoa> Asterisk ended with exit status 1
<zoa> Asterisk died with code 1. Aborting.
<zoa> Disconnected from Asterisk server
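
For reference, the exit status 139 in the log above follows the usual shell convention of 128 plus the signal number (139 - 128 = 11, SIGSEGV), while the second status, 1, is an ordinary exit code. A minimal sketch of that decoding follows; the function name describe_exit is hypothetical and is not taken from safe_asterisk.

```shell
#!/bin/sh
# describe_exit is a hypothetical helper, not part of safe_asterisk.
# Shells report "killed by signal N" as exit status 128 + N.
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "exited on signal $((status - 128))"
    else
        echo "died with code $status"
    fi
}

describe_exit 139   # prints "exited on signal 11"
describe_exit 1     # prints "died with code 1"
```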

Comments: By: Brian West (bkw918) 2004-02-27 00:16:22.000-0600

what distro?
By: zoa (zoa) 2004-02-27 04:15:39.000-0600

Debian (I've seen it on several servers already, all running Debian).

Maybe it's due to zaptel? Or maybe it's due to the coredump taking a long time to write?

(I think it also has to do with the load when it crashes.)
By: James Golovich (jamesgolovich) 2004-03-02 16:03:59.000-0600

I would bet that exit(1) is being called because ast_tryconnect is probably returning true. It might be good to make asterisk exit with a value > 128 at that point so it will be restarted. Either that, or modify safe_asterisk so that it restarts if the return value is 1.
By: James Golovich (jamesgolovich) 2004-03-07 17:01:07.000-0600

Any way to reproduce this behavior? If so, add an ast_log line before each of the exit(1) calls in asterisk to find out which one of them is doing it. Then I guess we need to decide whether we want to change safe_asterisk to restart if the return code is > 0, or whether we want to change any error return values in asterisk.c that indicate possibly temporary conditions to be > 127.
By: zoa (zoa) 2004-03-08 06:24:15.000-0600

I have no way of reproducing it... :( It just happens every now and then.

I now do a restart on exit code 1.
By: James Golovich (jamesgolovich) 2004-03-08 17:51:25.000-0600

I think the best way to fix this is to change the exit codes of any potentially temporary failure condition to something > 127. Of course, anything that isn't a temporary failure should stay exit(1), so safe_asterisk doesn't keep restarting when it isn't going to be able to do anything.
By: jjanzer (jjanzer) 2004-03-09 16:18:36.000-0600

I would recommend using exit codes from "/usr/include/sysexits.h" if at all possible (dunno if the current exit codes fall into those predefined ones).

As far as reproducing the bug, it's simple... just cause asterisk to abort() and watch safe_asterisk puke every now and then. You could even do something as simple and dumb as calling an invalid pointer.

For me, it pukes more times than not.

I agree with zoa in that I think safe_asterisk is *too* aggressive about restarting.
By: James Golovich (jamesgolovich) 2004-03-09 17:06:22.000-0600

So the question comes down to do we want safe_asterisk to sleep for a second or so after dying, or should we change the exitcode of asterisk when it already thinks asterisk is running?
By: James Golovich (jamesgolovich) 2004-03-09 17:35:22.000-0600

Looked over sysexits.h, and unfortunately all of the error codes in there are < 129, so none of them would cause safe_asterisk to restart. I tried adding a sleep 1 to the code that runs when safe_asterisk sees a return value greater than 128, and it seems to take care of it.
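
The change described above (pause briefly before restarting after a signal exit) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual safe_asterisk code: the names supervise, fake_daemon, and RESTART_DELAY are hypothetical, and fake_daemon merely stands in for asterisk.

```shell
#!/bin/sh
# Sketch of a safe_asterisk-style restart loop; all names are illustrative.
RESTART_DELAY=${RESTART_DELAY:-1}   # seconds to wait before restarting

supervise() {
    while :; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            echo "shut down normally"
            return 0
        elif [ "$status" -gt 128 ]; then
            echo "exited on signal $((status - 128)), restarting"
            # Give the crashed process time to finish going away
            # (for example, writing its core dump) before starting again.
            sleep "$RESTART_DELAY"
        else
            echo "died with code $status, aborting"
            return "$status"
        fi
    done
}

# Stand-in for asterisk: "crashes" once with status 139, then exits cleanly.
attempts=0
fake_daemon() {
    attempts=$((attempts + 1))
    [ "$attempts" -gt 1 ] || return 139
    return 0
}

supervise fake_daemon
```

Without the sleep, the second start could race with the dying process, which matches the "exit status 1" abort seen in the original report.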
By: jjanzer (jjanzer) 2004-03-09 18:06:12.000-0600

Since all of the calls are lost anyway, I don't see why a second delay is going to really kill anyone, sounds good to me.

But do you think this will really fix the problem? Is there a possibility that on a really *slow* machine 1 second isn't enough? Maybe there should be a more programmatic solution, not that I can think of one (;
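
One possible answer to the slow-machine concern is a bounded number of start attempts with a growing delay, rather than a single fixed pause. The sketch below is purely hypothetical and is not what was committed; start_with_retries, flaky_start, MAX_TRIES, and BASE_DELAY are names invented for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: retry a failed start a bounded number of times,
# waiting a little longer before each new attempt.
MAX_TRIES=${MAX_TRIES:-5}
BASE_DELAY=${BASE_DELAY:-1}   # seconds; the delay grows linearly per attempt

start_with_retries() {
    try=1
    while :; do
        "$@" && return 0
        echo "start attempt $try failed"
        [ "$try" -lt "$MAX_TRIES" ] || break
        sleep $((BASE_DELAY * try))
        try=$((try + 1))
    done
    echo "giving up after $MAX_TRIES attempts"
    return 1
}

# Stand-in for a start command that fails twice and then succeeds.
calls=0
flaky_start() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

start_with_retries flaky_start
```

The retry limit keeps a permanent failure (for example, a broken configuration) from looping forever, which was the concern raised earlier in the thread about restarting on every non-zero exit code.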

edited on: 03-09-04 16:55
By: zoa (zoa) 2004-03-12 17:34:34.000-0600

Fixed in CVS.