Summary:ASTERISK-18829: ConfBridge deadlocks waiting endlessly for a condition to be signalled inside bridge_channel_join_multithreaded
Reporter:zvision (zvision)Labels:
Date Opened:2011-11-07 10:15:22.000-0600Date Closed:2011-11-10 15:56:53.000-0600
Versions:10.0.0-beta2 Frequency of
must be completed before resolvingASTERISK-18847 Asterisk 10.0.0 Release Blockers
Environment:Debian 64-bit x86Attachments:( 0) backtrace.txt
( 1) core_show_channels.txt
( 2) core-show-locks.txt
( 3) issue_18829_v2.diff
( 4) issue_18829.diff
( 5) messages.txt
Description:I have a setup that spawns multiple ConfBridge conferences involving 2 participants. Conferences are usually short and are created/destroyed frequently. From time to time a channel involved in the conference gets stuck. It ignores hangups both from underlaying signalling channel and from the CLI console (hangup request). It deadlocks in the ConfBridge application until Asterisk restart. After doing some debugging (connecting with GDB to the running process and analyzing Asterisk debug logs) I found the following:
1. The situation occurs when another party has left and the leave sound has been played.
2. GDB shows that the deadlocked thread waits inside the bridge_channel_join_multithreaded function from main/bridging.c in the following code:
     ast_debug(10, "Going into a multithreaded signal wait for bridge channel %p of bridge %p\n", bridge_channel, bridge_channel->bridge);
     ast_cond_wait(&bridge_channel->cond, ao2_object_get_lockaddr(bridge_channel));
  it waits for the condition to be signalled and for some reason it never gets signalled.
3. Debug logs look like:
     [Nov  4 15:13:45] DEBUG[20017] bridging.c: Going into a multithreaded waitfor for bridge channel 0x146b7e8 of bridge 0x7fcf5c003c38
     [Nov  4 15:13:45] DEBUG[19916] app_confbridge.c: Departing underlying channel 'Bridge/0x1449af0-output' from bridge '0x7fcf5c003c38'
     [Nov  4 15:13:45] DEBUG[20017] bridging.c: Giving bridge technology softmix notification that 0x146b7e8 is leaving bridge 0x7fcf5c003c38
     [Nov  4 15:13:45] DEBUG[20017] bridging.c: Removed channel 0x1477c08 from bridge array on 0x7fcf5c003c38, new count is 1
     [Nov  4 15:13:45] DEBUG[19916] bridging.c: Removed channel 0x7fcf6c28e5f8 from bridge array on 0x7fcf5c003c38, new count is 0
     [Nov  4 15:13:45] VERBOSE[19916] res_musiconhold.c:     -- Started music on hold, class 'default', on DAHDI/i4/224804962-4
     [Nov  4 15:13:45] DEBUG[19916] channel.c: Scheduling timer at (50 requested / 50 actual) timer ticks per second
     [Nov  4 15:13:45] DEBUG[19988] bridging.c: Going into a multithreaded signal wait for bridge channel 0x1469fa8 of bridge 0x7fcf5c003c38
  The problem looks like when everything is fine, the bridge_channel_join_multithreaded function waits for file descriptors to be ready (like in the first line),
  and when the problem occurs, the if branch inside the function waits for the condition to be signalled and the last message appears.
  After this message appears the channel gets deadlocked inside the conference (the condition is never signalled).
4. From the above, it seems to me that the problem occurs, when the bridge enters a suspended state.

I don't have an idea how to fix it, but would it broke something, if I change the condition wait to use some timeout - maybe then before a next call
to bridge_channel_join_multithreaded the bridge will have its state updated or something like that?

The problem occurs randomly, but quite frequently and does not depend on pbx load. Sometimes tt happens with only 2-3 conferences active and sometimes
when there are 40-50 conferences active. But usually after 100-200 calls has been processed, it shows up.
Comments:By: Leif Madsen (lmadsen) 2011-11-07 10:26:16.479-0600

Debugging deadlocks: Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install).  This will then give you the console command "core show locks." When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt
# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt
gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!

By: Leif Madsen (lmadsen) 2011-11-07 10:26:23.920-0600

Thank you for your bug report. In order to move your issue forward, we require a backtrace[1] from the core file produced after the crash. Also, be sure you have DONT_OPTIMIZE enabled in menuselect within the Compiler Flags section, then:

make install

After enabling, reproduce the crash, and then execute the backtrace[1] instructions. When complete, attach that file to this issue report.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: zvision (zvision) 2011-11-07 17:20:58.033-0600

I attached the files as requested. The test case was simple - I generated about 20-30 channels that were entering/leaving confbridge conferences (each max 2 channels). Some of the channels (SIP) were disconnected due to lack of RTP activity (as can be seen in the attached messages.txt log). All these channels got stuck in the same place (waiting on the mentioned earlier condition object). No locks are held, as cond_wait function releases locks before waiting on the condition object. In the debug log I can see exactly 13 messages indicating that the thread entered multithreaded signal wait for bridge channel. And these channels never get woke up. I also attached output from core show channels and sip show channels commands. All SIP dialogs are in a BYE state, but none are disconnected. CLI channel request hangup command is unable to hangup the channel. The problem occurs also for DAHDI calls.

By: zvision (zvision) 2011-11-08 03:52:54.768-0600

I just made an experiment and changed ast_cond_wait to ast_cond_timedwait with 10 seconds timeout inside bridge_channel_join_multithreaded function and the problem is "gone", which means channels hang up after the condition times out. I also noticed, that the SIP RTP timeout messages I attached in the logs are caused because channels deadlock waiting for the condition to be signalled and this causes RTP inactivity. I know this does not fix the problem, but it's some kind of a tip or workaround.

By: David Vossel (dvossel) 2011-11-08 13:12:29.480-0600

Please test the patch I uploaded.

By: zvision (zvision) 2011-11-08 15:05:45.031-0600

I tested the patch and it fixes the issue partially. Now, my test setup leaves only 1-2 calls deadlocked instead of 10-15, so there may be some other place
to signal the condition too. But that was a good shot, as my conferences use DTMF menus and I saw in the code there are bridge suspend/unsusped actions during
DTMF bridge state. Unfortunatelly, the patch does not fix the problem to the end.

By: David Vossel (dvossel) 2011-11-09 11:21:42.476-0600

Yep, well DTMF menus need to be solid.  I'll write another patch.

By: zvision (zvision) 2011-11-09 13:34:48.824-0600

I'll be happy to test it:) as it seems you are close to the solution. I tried to analyze the code and find some other places
to signal the condition object, but have not found anything.

By: David Vossel (dvossel) 2011-11-09 17:32:22.749-0600

Okay, updated a version 2 of the previous patch.  Let me know how it goes.

By: zvision (zvision) 2011-11-10 03:37:08.777-0600

This one works perfect! I made various tests with different call loads
and lots of conference create/dtmf menu/destroy events and no single call deadlocked:)
Thanks for such a fast response!

By: David Vossel (dvossel) 2011-11-10 15:55:55.862-0600

That's just the way we role!!! sometimes.