|ASTERISK-06350: [patch] Failure to lock voicemail.conf leads to corruption
|( 0) 20060217__lockf_vm.diff.txt
|We have a problem with Asterisk not locking voicemail.conf for update.
It appears to not protect the file even against itself. It certainly
doesn't use flock() to protect it against others.
This is a problem for several reasons. First, of course, people can be
hand-editing the file to add or remove users. Secondly, automated
programs may be appending data to it for the same purpose.
We've noticed corruption in our file and are certain this is the culprit.
app_voicemail should certainly use flock or fcntl to lock this file so that others may safely modify it to add or remove mailboxes. It is not even clear from the current code that the file is even protected against multiple Asterisk threads modifying it simultaneously.
When corruption occurs, it takes the place of duplicating the extension number on the following line at the end of the file, without adding \n to that line. For instance, if our file ended like this:
123 => 123123,Test User
Then after corruption, it would look like:
123 => 123123,Test User
where that last "123" wouldn't have a \n after it.
There seems to be no safe way to peacefully coexist with Asterisk to add or remove mailboxes from this file.
****** ADDITIONAL INFORMATION ******
Reported to Debian at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=353227
Asterisk-users thread at http://lists.digium.com/pipermail/asterisk-users/2006-February/147249.html
It was suggested to discuss this here.
|By: Tilghman Lesher (tilghman) 2006-02-17 09:08:35.000-0600
Please try this patch out on your system.
By: jgoerzen2 (jgoerzen2) 2006-02-17 09:33:00.000-0600
Thanks. I have downloaded the patch and am rebuilding asterisk now. I'll keep an eye on it over the next business day or two and will let you know how it turned out by Monday or Tuesday if that's soon enough for you.
By: jgoerzen2 (jgoerzen2) 2006-02-17 13:31:53.000-0600
Thanks for this.
Unfortunately, it didn't fix the problem. Still seeing the same symptoms.
A couple of comments about your approach:
1) You really need to flock something other than voicemail.conf.new.
By not doing that, you could still get problems even within asterisk, because after you unlock voicemail.conf.new, another asterisk thread could get in there and write to it before you rename it back.
I was going to suggest that you flock voicemail.conf, but I don't think that is the right thing to do either. It would be fine for the first process, but after the first process is done and has renamed voicemail.conf.new over to voicemail.conf, the second process would have an open fd on a now-nameless file (the old voicemail.conf) and would be copying old data to voicemail.conf.new.
Come to think of it, that may also be a problem with the current implementation, not sure.
Anyway, my suggestion would be to use a separate lock file -- voicemail.conf.lock perhaps. It can be a 0-byte file that you just flock. If you never rename or unlink it, the locking semantics should be straightforward. Lock it, then open voicemail.conf and voicemail.conf.new and do whatever you like with them.
2) The timeout of 100us seems very low, especially if system load ever gets high. I'd say .5s or even 1s would be more sensible.
By: Tilghman Lesher (tilghman) 2006-02-17 13:50:15.000-0600
1) No, it couldn't. The lock is not released until after the old voicemail.conf is unlinked and the new file is renamed in its place. And since I moved the lock to before we open voicemail.conf, it eliminates another possible race condition.
2) 100us lets us wait plenty of time for the file to become available. If that's not enough, then the password simply won't get changed.
In other words, that's not a possible source of corruption (the password simply won't get changed and you'll see an error that the file couldn't be opened).
Last, if you're still getting voicemail.conf corruption with this patch in place, either your kernel is ignoring the flock (e.g. the filesystem is NFS), or Asterisk is not the source of your corruption.
By: jgoerzen2 (jgoerzen2) 2006-02-17 14:04:03.000-0600
Ah, you're right about moving it to before you open voicemail.conf. I missed that.
I understand that the 100us isn't a possible source of corruption, but it could be unfriendly to the user to not have the password changed. That's why I suggested a larger timeout. But not a big deal.
I will try to keep a very close eye on this over the next few days. I did not believe that we had any other process looking at this file, save for Asterisk, but now that I see you're locking .new before .conf, I can't see where there's any room for this behavior either. This is on regular old ext3, so I can't imagine the kernel would ignore flock.
Thanks for your help.
By: jgoerzen2 (jgoerzen2) 2006-02-17 14:40:05.000-0600
Actually, there could still be a problem here.
Let's say we have process 1 comes along. It opens up voicemail.conf.new, creating it, and acquires an exclusive lock on it.
Process 2 comes along. Opens up the existing voicemail.conf.new, and waits to get the lock.
Process 1 now does its thing, copying voicemail.conf to .new and changing it along the way. It then renames voicemail.conf.new to .conf, and releases the lock.
Now process 2 gets the lock. It proceeds to open up voicemail.conf.
But here's the rub -- the file descriptor that process 2 has open for writing now *is* voicemail.conf, since process 1 had renamed it. Process 2 will therefore be reading and writing from the same file, not a .new, and at the end its rename will fail since voicemail.conf.new doesn't exist.
This does not appears to be what we're seeing though, because voicemail.conf itself should be deleted in this scenario.
Incidentally, I don't think you need to do that unlink before the rename.
Does that make sense?
By: Tilghman Lesher (tilghman) 2006-02-17 15:27:27.000-0600
By: Tilghman Lesher (tilghman) 2006-03-09 12:20:58.000-0600
By: jgoerzen2 (jgoerzen2) 2006-03-09 12:51:02.000-0600
Sorry for not replying before -- we are switching over to our new system tomorrow and it's been rather hectic.
I haven't been able to try out your second patch and probably won't be able to. After the problems persisted after the first one, we switched over to storing mailbox information in a SQL database. That's been working very well for us, and it's no longer practical to test things related to mailboxes in the text file.