ASTERISK-09331: I believe this fixes a BAD BAD BAD Scenario

[Home]

Summary: ASTERISK-09331: I believe this fixes a BAD BAD BAD Scenario

Reporter: Steve Davies (one47) Labels:

Date Opened: 2007-04-27 04:46:59 Date Closed: 2007-07-02 08:38:34

Priority: Minor Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) badbadbad_1.2.18.patch
( 1) badbadbad_1.4.trunk.patch
( 2) badbadbad.txt

Description: I discussed a BAD BAD BAD case with Olle in June last year. It is low priority because it happens during a hangup, so all that really happens is that a final ACK is ignored. I think I worked out what is happening, so I am including my original verbose/debug log, and a possible patch.

****** ADDITIONAL INFORMATION ******

I believe the cause of the problem is that the thread handling "ast_hangup" calls the CDR close-and-output function with 2 locks held. If this involces a slow database write, the locks can be held for an unreasonably long time.

The patch (attached) simply defers the call to the CDR write code until the locks are released. They compile. I will be testing them later using 1.2.18.

Comments: By: Tilghman Lesher (tilghman) 2007-04-27 07:52:37

If you have a slow database, I'd suggest that you rewrite your CDR driver to queue messages to be written via a separate thread, instead of posting them directly. Changing the order of deallocation within the channel driver can also be BAD BAD BAD, as it changes assumptions about when events occur.
By: Steve Davies (one47) 2007-04-27 08:21:56

The attached trace seems to hold a lock for a significant part of 1 second (WAY too long in VoIP terms), and is using purely Asterisk-basic CDR code - 1 copy of cdr record sent to text file, and one sent to cdr_pgsql.

Also, the call to "ast_cdr_end" is unchanged, so the contents of the CDR record will be unchanged by the patch.

It is only the act of writing it to disk that is moved. The move is similarly very small, as it just moves the "CDR push and free" to the other side of the call to "chan->tech->hangup(chan)" so that the channel-proper is finished with and the locks are cleared before some potentially blocking disk activity and SQL preparation is required.

I appreciate what you say abbout changing the order of deallocation, but the CDR record pointer is deallocated (set to NULL) in exactly the same place after this patch. We just take a copy of the data so we can write it very slightly later.

The trace shows 2 attempts to get a channel lock (one from the SIP protocol thread, and one from the devstate thread) while the hangup process is holding that lock and updating the CDR.
By: Olle Johansson (oej) 2007-04-27 09:07:42

one47: Please change cdr.conf so we can test with the cdr caching - that propably should not lock for a long time and will cause the BAD BAD BAD messages to disappear if your theory is right. We need this feedback.
By: Steve Davies (one47) 2007-04-27 10:03:35

Hi,

I could not find any reference to CDR caching, even in SVN trunk, so instead I tested as follows using a 1.2.18 system:

1) CDR to .txt and PGSQL, database idle, and revently VACUUM'ed for performance.
Problem occurred occasionally.
2) CDR to .txt only
Could not reproduce problem
3) CDR to .txt and PGSQL, database made artificially busy.
Problem occurred on nearly every BUSY call.

Applied patch as supplied, then retest:

4) CDR to .txt and PGSQL, database idle, and revently VACUUM'ed for performance.
Could not reproduce problem
5) CDR to .txt only
Could not reproduce problem
6) CDR to .txt and PGSQL, database made artificially busy.
Could not reproduce problem

I am happy to try CDR caching or any other tests if you can explain how. I could perhaps add log entries that time the ast_hangup process to verify where the time goes?
By: Olle Johansson (oej) 2007-04-30 02:06:52

It's called "batch" mode, sorry.

cdr.conf.sample:
----
; Define the CDR batch mode, where instead of posting the CDR at the end of
; every call, the data will be stored in a buffer to help alleviate load on the
; asterisk server. Default is "no".
;
; WARNING WARNING WARNING
; Use of batch mode may result in data loss after unsafe asterisk termination
; ie. software crash, power failure, kill -9, etc.
; WARNING WARNING WARNING
;
;batch=no
By: Steve Davies (one47) 2007-04-30 03:54:26

I gave this a fairly quick test with good results. I tried creating big batches and causing quite high call-volume without seeing any obvious problems.

There is of course the downside in fixing it this way, in that your CDR records are written after a delay.
By: cmaj (cmaj) 2007-04-30 14:29:07

If you set the batch size = 1, then that should get those CDRs submitted fairly quickly, at the cost of more threads. But they're cheap threads.
By: Steve Davies (one47) 2007-04-30 16:27:36

Yes, batch=1 does help, and is approximately equivalent to the attached patch. The difference being that this patch does it without an extra thread being created for almost every CDR write, which is beneficial, even if it is only a small benefit, and it also means that the _default_ case is holding channel locks for less time...
By: Olle Johansson (oej) 2007-05-02 04:48:13

Corydon76: Since this can be handled by the cdr batching, do you agree with me that this bug is resolved or do you see some value in the proposed patch? I don't really have much insights into the CDR subsystem.
By: Tilghman Lesher (tilghman) 2007-05-02 07:46:28

No change required, as this can be fixed with batching.
By: Steve Davies (one47) 2007-07-02 04:18:05

Could a maintainer please change this issue to note that the patch was included in 1.2.20, and then re-close it.

Many thanks,
Steve
By: Joshua C. Colp (jcolp) 2007-07-02 08:38:33

Added to latest 1.2/1.4/trunk as of this writing... I knew there was a bug that talked about this but couldn't find it.