Summary:ASTERISK-22093: Deadlock due to locking inversion between PBX context lock and channel lock while reloading dialplan through pbx_lua
Reporter:David Brillert (aragon)Labels:
Date Opened:2013-07-17 15:28:32Date Closed:2016-08-19 14:16:32
Status:Closed/CompleteComponents:Core/PBX PBX/pbx_lua
Versions:11.5.0 Frequency of
Environment:Attachments:( 0) 7_26_2013_gdb_thread_apply_all_bt_full.txt
( 1) gdb_thread_apply_all_bt.txt
( 2) gdb_thread_apply_all_bt_full.txt
( 3) valgrind.rar
Description:Third deadlock in two weeks.
Asterisk not built with NOT_OPTIMIZED or thread debug options.
The best I could do was attach GDB to Asterisk PID and get a thread apply all BT (attached)
Comments:By: David Brillert (aragon) 2013-07-18 08:26:10.037-0500

DEBUG_THREADS is not enabled in the build and the gdb shows a lot of <value optimized out>.
The system is a very busy production system and would not survive DEBUG_THREADS.
Is there another build option I could use to get the required locking data?
Or is there enough info in the GDB trace to help with a patch?
This is a subversion build of Asterisk 11 r391794

By: Matt Jordan (mjordan) 2013-07-18 08:45:23.062-0500

I don't think this has anything to do with Local channels.

Rather, it looks like a locking inversion between reloading the dialplan from {{pbx_lua}} (and one of the dialplan context locks) and a channel lock. Unfortunately, without a {{core show locks}} it's difficult to find the culprit.

A {{bt full}} might give a bit more information such that the culprit can be found.

By: David Brillert (aragon) 2013-07-18 08:58:51.946-0500

I'll collect that bt full as soon as I see another deadlock :D
I have updated to SVN 393929

By: David Brillert (aragon) 2013-07-22 14:37:25.067-0500

Here is the gdb thread apply all bt full from today's deadlock

By: David Brillert (aragon) 2013-07-26 09:56:55.475-0500

Another deadlock today.
Reproducible on a reload.

So I modified modules.conf to remove lua.
noload          => pbx_lua.so

Now the system doesn't completely deadlock (processes calls) but the CPU load goes through the roof on any reload.

Attached gdb during the reload

By: Rusty Newton (rnewton) 2013-07-26 10:21:35.838-0500

Mark Michelson triaged the new trace and it looks like it should be helpful. Though "core show locks" output could reduce hours of work to minutes of work.

pbx_lua is under extended support so I've put the issue into an open state for a community dev to take a look when they are available.

By: David Brillert (aragon) 2013-07-26 10:32:13.631-0500

Thank you kindly, that is good news.

By: David Brillert (aragon) 2013-07-26 10:36:08.693-0500

We aren't using pbx_lua so I changed modules.conf so it won't load.
I hoped that would be enough to fix the problem but the gdb trace still doesn't look 'happy, happy, happy'.
The load gets crazy and I still see some <Address 0x6 out of bounds> that look like memory corruption in the gdb trace.

By: Rusty Newton (rnewton) 2013-07-26 14:00:14.032-0500

Sorry I had missed your 26/Jul/13 9:56 AM before I commented.

Can you get [valgrind|https://wiki.asterisk.org/wiki/display/AST/Valgrind] output?

To confirm, after noloading pbx_lua you don't get a full deadlock, but CPU usage spikes.

How long does the CPU usage hit the roof for?

By: David Brillert (aragon) 2013-07-26 14:07:03.708-0500

I have downloaded the customer's config to a lab server and starting valgrind.
It's taking a long time to start under Valgrind ;)

To confirm, after noloading pbx_lua you don't get a full deadlock, but CPU usage spikes.
- So far so good on the full deadlock

How long does the CPU usage hit the roof for?
- a long time, until I restart Asterisk

The trace = gdb thread apply all bt full.txt
   22/Jul/13 3:37 PM
Is a gdb attachment during a full deadlock with lua loaded.

The trace =  7_26_2013 gdb thread apply all bt full.txt
   26/Jul/13 10:56 AM
Is a gdb attachment during the high CPU load with noload => pbx_lua.so

By: David Brillert (aragon) 2013-07-26 14:47:02.287-0500

I hope this valgrind attachment will help.
It's not from the production server and had zero voice traffic on it, and no registered extensions.

By: Rusty Newton (rnewton) 2013-07-29 16:18:21.168-0500

Since you are not getting the deadlock it sounds like we have two possibly separate issues. To avoid confusion please create a new issue for the high CPU consumption. Attach your new trace and valgrind output to that one, then we'll look at it there.

Since the locking was related to pbx_lua and your not even using that, we won't expect this new debug to help for the previous deadlock.


By: David Brillert (aragon) 2013-07-29 16:32:46.148-0500

New ticket is opened ASTERISK-22208

By: David Brillert (aragon) 2013-09-16 14:22:58.856-0500

Why am I the assignee on this bug report?