|Summary:||ASTERISK-21162: Deadlock in cdr.c: cdr_batch_lock vs cdr_pending_lock|
|Reporter:||Chase Venters (cventers)||Labels:|
|Date Opened:||2013-02-25 12:19:40.000-0600||Date Closed:||2013-03-26 09:06:19|
|Environment:||CentOS release 6.3 (Final) Linux ivr01.XXXXXXXX 2.6.32-279.22.1.el6.i686 #1 SMP Wed Feb 6 00:31:03 UTC 2013 i686 i686 i386 GNU/Linux Asterisk 126.96.36.199 from EPEL||Attachments:||( 0) ASTERISK-21162-1.8.diff|
( 1) deadlock_analysis.txt
( 2) gdb-asterisk-deadlock.zip
|Description:||I've run into a deadlock in Asterisk cdr.c. I have hundreds of threads blocked like this:
[Edit by Rusty Newton - removed in-line trace analysis and attached as deadlock_analysis.txt]
I think from looking at the 188.8.131.52-rc1 code that this issue still exists there.
|Comments:||By: Chase Venters (cventers) 2013-02-25 12:30:21.940-0600|
A full GDB "thread apply all bt" for my deadlocked Asterisk instance, for completeness.
Please note that to respect our users' privacy I have masked all phone numbers as **********.
By: Rusty Newton (rnewton) 2013-02-27 16:22:01.880-0600
I see a lot of values optimized out in your trace. Can you provide the trace again with DONT_OPTIMIZE and BETTER_BACKTRACES enabled?
By: Chase Venters (cventers) 2013-02-27 16:48:14.647-0600
As a developer I understand how it's often important to have a full backtrace.
That being said, in this situation it would be a difficult, time-consuming affair for me to reproduce this issue with a custom-compiled Asterisk. The deadlock came from our production environment, and we don't have the instrumentation to push the load we'd need to trip it again.
I was hoping that since I already found the problem in the code, and pointed out the relevant deadlocked threads, it would be a simple matter to confirm that the problem exists and produce a solution.
Any time one thread acquires locks in the order A->B while another acquires them in the order B->A, that is a bug that can result in deadlock, as we have here with cdr_batch_lock and cdr_pending_lock.
Do you *really* need more details in this backtrace? In this specific case, it seems irrelevant to me.
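The lock-order inversion described in this comment is usually avoided by making every code path take the two locks in the same order. The following is only a minimal sketch of that rule, not the Asterisk code or the attached patch. It assumes plain pthread mutexes rather than Asterisk's own lock wrappers. The lock names come from this ticket; the helpers `lock_both`, `unlock_both`, `producer`, `flusher`, and `run_demo`, and the two counters, are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Lock names are taken from the ticket; these are plain pthread mutexes,
 * used here for illustration only. */
static pthread_mutex_t cdr_batch_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cdr_pending_lock = PTHREAD_MUTEX_INITIALIZER;

static int batch_size;    /* hypothetical counter, guarded by cdr_batch_lock */
static int pending_count; /* hypothetical counter, guarded by cdr_pending_lock */

/* Every caller takes the batch lock first and the pending lock second.
 * Because no thread ever takes them in the opposite order, the A->B
 * versus B->A cycle cannot form. */
static void lock_both(void)
{
    int rc = pthread_mutex_lock(&cdr_batch_lock);
    assert(rc == 0);
    rc = pthread_mutex_lock(&cdr_pending_lock);
    assert(rc == 0);
    (void)rc;
}

/* Release in reverse order of acquisition. */
static void unlock_both(void)
{
    pthread_mutex_unlock(&cdr_pending_lock);
    pthread_mutex_unlock(&cdr_batch_lock);
}

/* Adds one record to the batch per iteration. */
static void *producer(void *arg)
{
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        lock_both();
        batch_size++;
        unlock_both();
    }
    return NULL;
}

/* Moves whatever is in the batch over to the pending counter. */
static void *flusher(void *arg)
{
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        lock_both();
        pending_count += batch_size;
        batch_size = 0;
        unlock_both();
    }
    return NULL;
}

/* Runs both threads to completion and returns the number of records
 * accounted for, which should equal the number produced. */
int run_demo(int iterations)
{
    pthread_t a, b;
    int rc;

    batch_size = 0;
    pending_count = 0;

    rc = pthread_create(&a, NULL, producer, &iterations);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, flusher, &iterations);
    assert(rc == 0);
    (void)rc;

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return batch_size + pending_count;
}
```

Calling `run_demo(1000)` from a `main` function (built with `-pthread`) should finish and return 1000, since every produced record ends up in one of the two counters. If `flusher` instead locked cdr_pending_lock before cdr_batch_lock, the two threads could each hold one lock while waiting for the other, which is the pattern reported here.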
By: Rusty Newton (rnewton) 2013-02-27 18:05:54.143-0600
I'm not a developer, but as a bug marshal part of what I do is make sure we have as much information as possible to save time for the developers who end up looking at the issue.
We do see issues, including deadlock scenarios, that are nearly impossible to solve without the additional information provided by the compilation options mentioned. If the reporter is able to get it (most of the time they are), then that's best.
I understand the difficulty if the system is in production, and we really appreciate you reporting the issue in as detailed a fashion as you have. I'll go ahead and acknowledge this to get it on the core team's radar. If a developer responds that we need any additional information, then we'll comment here. Thanks!
By: Matt Jordan (mjordan) 2013-03-15 14:27:33.652-0500
Can you try the attached patch (ASTERISK-21162-1.8.diff) and see if it resolves the problem?
By: Matt Jordan (mjordan) 2013-03-26 09:06:11.908-0500
I went ahead and committed this in r383839 in Asterisk 1.8. I was able to reproduce the problem and, with the patch, the problem no longer occurs. Testing included using batches of 30 CDRs, triggered both by time delay and by accumulation of CDRs.
If you can test this as well to make sure this is resolved, that'd be appreciated. If it isn't, just leave a comment here or contact a bug marshal in #asterisk-bugs (or e-mail me!) and I'll reopen this issue.