Summary:ASTERISK-03793: [patch] detach CDRs for posting in a separate thread
Reporter:cmaj (cmaj)Labels:
Date Opened:2005-03-28 07:55:21.000-0600Date Closed:2008-01-15 15:36:52.000-0600
Versions:Frequency of
Environment:Attachments:( 0) cdr_detaching.diff11.txt
( 1) cdr.conf.sample2
Description:Currently, when a channel is hungup, it doesn't actually get freed until the CDR record is posted.  This can take a while if, for example, your database connection is slow.  Since the CDR is kind of a separate entity, I thought it would make sense to decouple posting from hangup by spawning a new thread to deal with the CDR post and free operations.

Tests so far work fine, although I have not tried this patch on a production server.  I also thought of a different option, having some sort of listener thread for CDRs that the channels can notify when they have a new CDR, although I'm not sure if creating an extra thread for every single hangup is that big of a deal so that's why I did the patch that way first.  Either way, I think it's a bug that a channel has to wait on database I/O before it can free itself.


Disclaimer on file.
Comments:By: Kevin P. Fleming (kpfleming) 2005-03-28 11:47:14.000-0600

The additional thread overhead will be excessive for this.

A better solution involves using a queue processed by a dedicated thread, something which will be far simpler to do once I post the new event system (since CDR posting can just be an 'event' at that time). That will happen this week, so keep your eyes open :-)

By: cmaj (cmaj) 2005-03-28 13:21:55.000-0600

I really wasn't sure about the overhead, so thanks for that info, although I thought I remember reading somewhere that there was no longer a thread limit in the kernel...

I guess I'll wait for your event system before modifying this patch with the queue idea (or will it no longer be necessary ?)  Come to think of it, could you post to the dev list some more details on the new system ?

By: Kevin P. Fleming (kpfleming) 2005-03-28 13:34:57.000-0600

The document that I currently have on the thread system no longer matches the code (the code has advanced a great deal), and I haven't updated it yet. I'm trying to get ready for a trip out of town where I will be able to keep working on it, though.

In a nutshell, the loaded CDR backends will be able to 'subscribe' to CDR posting events, and if they wish they can have them delivered to a queue to be processed by their own thread (with file-descriptor based wakeups, so there is no need for spin loops).

You are right, there is not really a hard limit on the number of threads that the Linux kernel can handle, and at least for Linux thread creation/destruction is fairly lightweight. However, on other systems that is not the case, and creating/destroying a thread for small operations is very wasteful.

By: Mark Spencer (markster) 2005-03-28 13:47:51.000-0600

What about just having a single CDR thread listening and posting CDR's as they become available?

By: Kevin P. Fleming (kpfleming) 2005-03-28 13:55:05.000-0600

That would be a reasonable option for now, yes, although that thread could still get blocked by a slow backend (but at least that wouldn't keep channels around in memory, only CDR structures).

By: cmaj (cmaj) 2005-03-28 16:14:22.000-0600

Wouldn't it need two threads: one to listen for CDR post requests and queue them up, and another to do the actual work ?  Because if you did it in one thread, you would still have the same problem, which is waiting on the database insert.

I get nervous around those big ast_mutex_lock things...

By: Kevin P. Fleming (kpfleming) 2005-03-28 16:20:40.000-0600

No, the 'posting' operation would just add the CDR to the end of a lockable linked-list, so that could happen in the channel's normal thread.

The CDR processing thread would then just check that list for work to do.

By: essobi (essobi) 2005-04-06 08:32:36

Wouldn't this open the way for "batching" in CDRs in a set time interval or number?  A lot of big softswitches do it this way for performance reasons.  They'll batch in CDRs every 5 minutes, or every 10,000 records. which ever comes first.

Just my 2 cents.

By: Kevin P. Fleming (kpfleming) 2005-04-06 11:32:28

Sure, that would be possible with this technique.

By: Kevin P. Fleming (kpfleming) 2005-04-29 13:26:18

Were you ever able to produce a version using a single thread processing a linked-list of CDRs to post? If not, we'll have to table this issue for now, until a less thread-heavy solution is available.

By: cmaj (cmaj) 2005-05-05 18:20:05

diff2 only spawns a new thread when there's a full 'batch' of CDRs.  I'll make this batch mode operation and batch size configurable from a new cdr.conf file, add flushing of some sort when shutting down, and more fuzzy stuff in the next patch.  I just wanted to see if I was on the right track here.

By: cmaj (cmaj) 2005-05-07 09:10:19

Was diff2 horrible or something ?

By: Mark Spencer (markster) 2005-05-08 12:51:06

This is going to be challenging for cases where we do things to the CDR mid-call...

By: Mark Spencer (markster) 2005-05-08 12:51:22

Maybe we can discuss at the next dev call?

By: cmaj (cmaj) 2005-05-09 11:28:10

I think it should be okay with mid-call CDR stuff because it's really only changing behaviour when the CDR is posted and freed.  There's a hiccup with the failed CDR stuff because that is in it's own little world, but I can fix this in the next patch with changing the calls from cdr_post and cdr_free to cdr_detach.

I will try to clean this up a little more and make it user configurable before the dev call.  Thanks.

By: cmaj (cmaj) 2005-05-11 10:08:52

I skipped diff3 because it sucked, but diff4 does so much more.  It adds a cdr.conf file to enable/disable both the CDR processing and the batch mode, along with parameters to either post after X number of records or Y seconds.  Plus it adds two cli commands "cdr status" and "cdr bake", the latter being used to actually post the records in the batch.

This is my first patch to make use of the scheduler, and most of it was a cut and paste from dnsmgr.c, but everything seems to hold up, even under reloads.

I am still not sure what should happen at shut down.  Right now, you'll just lose the records, but if you have to post 1000 records while shutting down, that's kind of a pain, too.

But anyhow, here it is, and I will hop on the dev call tomorrow to discuss.

By: cmaj (cmaj) 2005-05-17 22:50:59

I missed out on half the dev call last week, and I will miss out on the whole thing this week, sorry.  But I've updated the patch in diff5 to work against current CVS.  diff5 also includes a first attempt at posting CDRs during shutdown, a little better handling of the scheduler to avoid deleting empty entries, and it changes one of the CLI commands from "cdr bake" to "cdr submit" (which seemed to make a little more sense.)

By: Kevin P. Fleming (kpfleming) 2005-05-19 01:21:56

Code review notes:

1) Don't have asterisk.c call ast_cdr_submit_batch() directly; instead add an ast_cdr_engine_term() function, parallel to ast_cdr_engine_init(), so it will be more properly encapsulated (and extendible in the future if needed).

2) I'm not conmfortable with each batch always having a 'struct ast_cdr_batch_item' with a NULL cdr pointer; I think it would be better for the batch to have NULL head/tail pointers until a CDR has been attached to it. With the current scheme, you will create a posting thread every time the timeout expires, even if the only entry in the batch has no CDR.

3) This code is in error:

+ struct ast_cdr_batch_item *newbatchitems;
+ /* Create a fresh batch for all CDRs collected after this point */
+ newbatchitems = malloc(sizeof(struct ast_cdr_batch_item));
+ memset(newbatchitems, 0, sizeof(newbatchitems));

It's better to use:
       newbatchitems = malloc(sizeof(*newbatchitems));
       memset(newbatchitems, 0, sizeof(*newbatchitems));

(This occurs at least two times in your patch).

4) This comment seems to be the opposite of what is being done:

+ /* swap just the ast_cdr_batch struct, and not the actual items */
+ ast_mutex_lock(&cdr_batch_lock);
+ oldbatchitems = batch->head;
+ reset_batch();
+ ast_mutex_unlock(&cdr_batch_lock);

5) The error case message here seems to be wrong:

+ if (ast_pthread_create(&batch_post_thread, &attr, do_batch_backend_process, oldbatchitems)) {
+ ast_log(LOG_WARNING, "CDR could not batch process, data was lost and memory was leaked\n");
+ return;
+ } else {
+ if (option_debug)
+ ast_log(LOG_DEBUG, "CDR batch processing begins now\n");
+ }

6) There is no need for 'extern' declarations, in header files or in the module itself (the compiler will do the right thing even without it):

+extern void ast_cdr_submit_batch(void)

7) There is no call to ast_config_destroy().

Otherwise this looks really nice, and will be a big improvement.

By: cmaj (cmaj) 2005-05-19 18:12:24

diff6 should correct all the issues from the code review of diff5, thanks Kevin.  Concerning a couple of the items mentioned:

2) The 'size' of the batch should have been used to return early in the case of no CDR data, but I implemented your suggestion regardless.  It didn't make sense to have an empty batch item floating around at initialization.
6) The 'extern' declaration in the header file was more in keeping with the precedent already there.  I wonder if it would make sense with different compilers ?

By: Kevin P. Fleming (kpfleming) 2005-05-19 23:33:37

Looks much better, just a few minor comments now:

1) In ast_cdr_submit_batch(), you check both !batch->head and (batch->size != 0). If there is any chance at all that ->head can be non-NULL while size is zero, that is a serious error and needs to be addressed. Realistically it should not happen, and making the extra check will only serve to raise questions for future readers of the code (like me!).

2) I don't think this is right:

+ oldtail = batch->tail;
+ oldtail->cdr = cdr;
+ oldtail->next = newtail;
+ batch->tail = newtail;

Maybe it should be (since batch->tail should already have a CDR):

   batch->tail->next = newtail;
   newtail->cdr = cdr;
   batch->tail = newtail;

Otherwise this looks pretty good to me.

As far as 'extern' goes, it's technically only needed for items that have storage associated with them (variables and such). For function prototypes, 'extern' makes no difference, and that's standard C, it's not compiler specific, as far as I am aware.

By: cmaj (cmaj) 2005-05-20 10:55:34

diff7 addresses the last couple issues, although this comment: "Maybe it should be (since batch->tail should already have a CDR)" was particularly helpful.  In fact, batch->tail didn't always have a CDR, so I corrected this, too.  Thanks.

Besides that, diff7 starts to tackle the problem of ast_cdr_reset (mid-call CDR operations, as Mark alluded to possibly being a problem earlier in this bug report.)  I don't think the way I'm doing it is 100% correct -- there must be some data I'm missing out on with just a memcopy of the ast_cdr struct -- but I think it's generally the right way to do this.

I decided to remove ast_cdr_post, since calls to it are not correct when in batch mode (they're not correct any more at all, actually), and I replaced calls to it elsewhere with ast_cdr_detach.  Along that line, it would be good to maybe rip out ast_cdr_fork from app_cdr and put that in cdr.c, too.

Also, I put the extern's back into the header file because maybe somebody's script is relying on those for some reason elsewhere, and it keeps with the convention already there.

In the next patch, I will try to work on adding some sort of delay when calling ast_cdr_engine_term, to give the submission thread at least a few seconds to get as many records as possible into the backends.  Right now, if you have a bunch of records, and no channels that need hanging up, you'll just get the first few records from the batch processed if you shut down *.

By: drmac (drmac) 2005-05-21 22:30:23

perhaps I didn't read the patch close enough, but what happens when I've buffered up 99 CDRs (where 100 will trigger an insert) and asterisk crashes? or the power dies? or the server blows up? Have I completly lost those 99? Is there some way to prevent this?

By: cmaj (cmaj) 2005-05-23 09:38:22

Right now, with this patch, if asterisk blows up, your data is lost.

I'm adding a "safeshutdown" option to the new cdr.conf file in the next diff to block on asterisk shutdown until the CDR data is submitted.  But this won't help in the case of someone pulling your power cord.

The best workaround I can figure, if this is a concern, would be to set the batch "size=1" and another new option, "scheduleronly=yes".  That will cause each record to be scheduled separately for submission and to not launch a new thread after each record.  Coupled with "safeshutdown=yes" and this will vastly limit your data loss.

Another option would be to maybe use the internal asterisk DB as a temporary storage facility to keep track of in-progress CDR data, but that's probably beyond the scope of this patch.

By: cmaj (cmaj) 2005-05-23 09:49:06

diff8 had my mod for building cdr_pgsql, sorry.  Please use diff9.

By: cmaj (cmaj) 2005-05-26 15:12:23

diff9 was not making use of the scheduler properly when called in "unscheduled" batch mode.  diff10 should correct that by making the do_cdr thread wake up every second and run the scheduling queue, even if there's nothing in the queue.  It's a tradeoff between getting the scheduler to run like clockwork every X seconds and still allowing unscheduled batch submissions when the number of records exceeds some maximum value.

If this is a problem with regards to increased overhead, then maybe setting up a pipe to notify the do_cdr thread when to wake up would be more effective.  This way, the call to poll could actually use a non-empty file descriptor, in addition to the timeout already in place.  This seems like it would be more confusing, tho.

By: cmaj (cmaj) 2005-05-31 09:46:24

diff11 starts using pthread_cond_* stuff to wakeup the CDR thread when there's work to do.  I haven't seen this part of the pthreads library in use elswehere in asterisk (other than chan_vpb), but I think it's portable functionality.  It sure beats using a pipe for inter-thread communication.

By: Kevin P. Fleming (kpfleming) 2005-06-02 21:39:48

Committed to CVS HEAD, thanks for the excellent work!

By: Kevin P. Fleming (kpfleming) 2005-06-02 22:10:15

Forgot to note that this was committed with some minor mods, mostly to suppress warning messages when CDR posting is completely disabled.

By: Digium Subversion (svnbot) 2008-01-15 15:36:52.000-0600

Repository: asterisk
Revision: 5823

U   trunk/asterisk.c
U   trunk/cdr/cdr_manager.c
U   trunk/cdr.c
U   trunk/channel.c
A   trunk/configs/cdr.conf.sample
U   trunk/include/asterisk/cdr.h
U   trunk/loader.c
U   trunk/pbx.c

r5823 | kpfleming | 2008-01-15 15:36:52 -0600 (Tue, 15 Jan 2008) | 2 lines

support configurable batch posting of CDRs (off by default) (bug ASTERISK-3793)