|Summary:||ASTERISK-08131: [patch] Log file rotation on SIGXFSZ doesn't check log file sizes.|
|Reporter:||Tony Mountifield (softins)||Labels:|
|Date Opened:||2006-11-14 05:19:28.000-0600||Date Closed:||2011-06-07 14:08:09|
|Environment:||Attachments:||( 0) logdif.txt|
|Description:||A customer's system was recording a VERY long conference in MeetMe. After 37 hours or so, the recording reached the ext3 file size limit of 2GB. This generated a SIGXFSZ which was caught by logger.c, triggering a log file rotation.|
As MeetMe continued attempting to write to the recording file, the signal was repeatedly generated, causing nearly 4000 log file rotations before Asterisk finally crashed.
In logger.c, when it sees filesize_reload_needed has been set, it should check the sizes of the open logfiles to confirm that one of them is the reason for the signal, and only perform the rotation if so.
****** ADDITIONAL INFORMATION ******
I've classified this as "minor" because it happens infrequently, but when it does happen it is quite dramatic.
I haven't produced a patch to fix this due to time pressure, but I wanted to flag up the issue. It looks like it applies to all versions of Asterisk.
I'll address the question of what to do when a recording file exceeds the filesystem size limit in a separate bug report when I've figured out a strategy for dealing with it.
|Comments:||By: Anthony LaMantia (alamantia) 2006-12-01 16:51:09.000-0600|
are you asking us just to prevent reload_logger() from being called within ast_log() when the signal is dispatched.
without trying to resolve the problem with the meetme recording file?
i'm not positive this is the best idea, but im not to positive how to resolve the meetme problem and this one at the same time.
i'm going to attempt to get a patch produced for this one, but if anyone has a few ideas on it please post something.
By: Tony Mountifield (softins) 2006-12-01 18:01:40.000-0600
I'm considering the meetme issue separately.
The point of this report was to just suggest that when asterisk receives a SIGXFSZ, it should check whether the logs *really* need rotating before actually doing so, since it's always possible that the signal could have been generated by another file, e.g. *any* kind of recording, not just in meetme.
I'm also not convinced that SIGXFSZ is worth handling anyway, since it could have several different causes - wouldn't it be better just to check for EFBIG being returned from write() or fwrite()?
By: Anthony LaMantia (alamantia) 2006-12-01 18:10:41.000-0600
EFBIG being returned from write() or fwrite()?
it seems like it that may be the smarter thing to do.
By: Anthony LaMantia (alamantia) 2006-12-07 15:31:13.000-0600
I have written a patch for this issue and uploaded it to this bug-ticket.
if you are willing to review it please provide some feedback.
By: Tilghman Lesher (tilghman) 2006-12-07 16:15:17.000-0600
This isn't actually an issue with ext3; it's an issue with the maximum file offset, which is a 32-bit integer in 1.2. In 1.4, we have, where available, set the file offsets to be 64-bit, which fixes this issue.
I'll additionally note that given you stated 37 hours, that means you're using uncompressed wav, RIFF format, which has multiple 32-bit offsets in the file format and therefore does not support larger files anyway. You should transition to use one of the headerless formats such as ulaw or gsm, to avoid this issue.
By: Tony Mountifield (softins) 2006-12-07 16:28:52.000-0600
No, the point of this specific issue isn't the maximum size of the file, or of recordings.
This bug report is purely that a size overflow on a non-log file makes Asterisk incorrectly think it has to rotate the logs.
Alamantia - thanks for the patch, I'll look at it asap.
By: Tilghman Lesher (tilghman) 2006-12-07 16:40:18.000-0600
alamantia: if you fail to handle the SIGXFSZ, the signal will terminate the process. Removing the handler is the wrong approach.
softins: I understand why it's happening, but increasing the maximum size of the files will prevent the signal from firing, which gives us the correct result (which is, the recording may grow larger unimpeded). Otherwise, you'll still get an endless stream of SIGXFSZ's and your Asterisk process will still either halt or run in a degraded mode.
By: Anthony LaMantia (alamantia) 2006-12-07 17:24:27.000-0600
maybe making the handler more or less a empty stub.. to prevent the process from crashing.. and rotating the logs only in the case of fwrite failing? would be a good idea
something needs to be done about the cause of the SIGXFSZ signals that are flooding in. and how the logger handles them (while not being the cause)
one quick idea on solving the SIGXFSZ problem are maybe adjusting the filesize limit on your system using ulimit -f from the cli. or from the bash shell..
By: Tony Mountifield (softins) 2006-12-07 17:31:43.000-0600
Corydon76: I still think it's wrong that although SIGXFSZ could be generated by ANY file, Asterisk assumes it is a log file, don't you?
Presumably, SIGXFSZ will only be fired once for each attempt to exceed the file size. So if that signal is set to SIG_IGN, and the check for EFBIG is made on every error return, then there should be no problem.
So I think the patch shouldn't let the errors stack up, but instead should do something like (including the check for EFBIG - don't assume it):
res = fprintf(qlog, ...);
if (res >= 0) vfprintf(qlog, ...);
if (res >= 0) fprintf(qlog, ...);
if (res >= 0) fflush(qlog);
if (res < 0 && errno == EFBIG)
filesize_reload_needed = 1;
Please let's not get distracted by what meetme or any other recording app should or should not do. The point of *this* particular patch is to enable asterisk to decide to rotate oversized logs without having to use SIGXFSZ to do so.
By: Tilghman Lesher (tilghman) 2006-12-07 17:37:32.000-0600
That still doesn't address the root cause, which is being flooded with signals. If you do not address that cause, your process is going to forever sit processing continuous signals, instead of processing calls. You have to address the problem of the signal flood, or else you aren't addressing the problem, only a symptom of it.
By: Tony Mountifield (softins) 2006-12-07 17:39:38.000-0600
I give up. Tighlman, I get so fed up arguing with you.
By: Tilghman Lesher (tilghman) 2006-12-07 18:03:40.000-0600
Sorry for being argumentative, but we solved this problem a year ago by changing the size of file offsets, which alleviates the entire issue.
What I'm aiming to do is to help you avoid trying a solution that has been tried before and is prone to additional problems. The best possible solution down this path is to ignore the signal altogether and insert additional checks around each write to an open file handle. By ignoring the signal, you avoid spending any time in a signal handler. While not perfect, it may be the only thing to do in 1.2.
By: Anthony LaMantia (alamantia) 2006-12-07 18:13:23.000-0600
i've uploaded a patch which ignores the signal all-together and checks for EFBIG on all writes inside of logger.c, i hope this bugs us closer to a solution for this.
By: Tony Mountifield (softins) 2006-12-08 02:12:40.000-0600
Actually, ignoring the signal and instead checking each file write was what I thought I *was* arguing for in my last couple of bugnotes, so I'm glad we agree in the end! It was late in the UK when I posted my previous note, and I was frustrated at apparently not being understood.
I think that is the correct thing to do also in 1.4 and trunk (notwithstanding the 64-bit offsets), since SIGXFSZ doesn't provide any information about what file triggered it, and is therefore not useful as a recovery mechanism, only an abort one.
I'll check out the latest patch - thanks Anthony!
By: Tony Mountifield (softins) 2006-12-08 02:27:44.000-0600
I've just scanned the patch by eye, and seen a couple of typos:
if (res <= 0 && res == EFBIG)
if (res < 0 && errno == EFBIG)
in two places, and
if (res == EFBIG)
if (errno == EFBIG)
in one place. I'll make those changes when I try out the patch.
By: Anthony LaMantia (alamantia) 2006-12-11 03:07:36.000-0600
i've just uploaded a patch with changes to resolve the problems you've pointed out.
By: Tony Mountifield (softins) 2006-12-11 09:01:53.000-0600
Hmmm. I've applied the patch with a few corrections too, and done some testing by making the "full" log nearly 2GB in size.
Unfortunately, it appears that when the file size reaches 2147483647 (0x7FFFFFFF), the file no longer grows, but NO error is returned from the stdio functions attempting to write.
I can increase the size of the file by appending to it from the shell, so it appears that the 2GB limit is to do with the libraries used by Asterisk, as mentioned by Corydon76.
Since no error is returned, this patch won't work. I think the only other approach that would work would be to follow each fflush() with a ftell() to discover how far the file has been written, and to set the rotate flag when the offset returned from ftell() is something like 0x7FFF0000 (to allow a bit of headroom).
By: Anthony LaMantia (alamantia) 2006-12-21 18:18:37.000-0600
i've uploaded another patch for testing. this one may resolve the issue.
after reading the header files in /usr/include i noticed the the "limit" that is being talked about here is due to the fact _IO_FILE uses (char *) pointers to keep track of the various variables it needs to... checking the length of the file so CHAR_MAX would make a ok boundry to test aginst. i also placed the reload_logger() command to be called after this check would should take away our need to keep track of filesize_reload_needed.
so softins can you test out this patch? if it works then we can go about cleaning up the code a bit and use this method.
By: Anthony LaMantia (alamantia) 2006-12-21 23:07:48.000-0600
I have a branch here that I am working on as well regarding this issue.
at the moment my patch works somewhat. it will reload the logs and not cause a segfault, due to the singal, but it will reload all of the logs.. as well.. as due to a threading issue.. sometimes create 1 or 3 log files with only 1 message each.
i plan up updating this branch with the fix for that problem in the near future.
By: Anthony LaMantia (alamantia) 2006-12-21 23:08:49.000-0600
the diff, is avaliable here.
By: Anthony LaMantia (alamantia) 2007-01-10 17:52:06.000-0600
if you get a chance can you provide me with some of your thoughts on this bug?
By: Anthony LaMantia-2 (anthonyl) 2007-01-21 12:54:01.000-0600
well for what it's worth this patch works for me ;)
By: jmls (jmls) 2007-05-28 02:31:35
so, where do we go from here ?
By: Jason Parker (jparker) 2007-06-26 11:09:51
Reporter has not responded since December and has apparently lost interest in testing the fix. I'm going to have to close this. Please contact a bug marshal if you would like this issue to be reopened.