Summary: | ASTERISK-30081: app_confbridge: Channel can join wrong bridge due to race condition | ||
Reporter: | N A (InterLinked) | Labels: | |
Date Opened: | 2022-05-25 09:33:42 | Date Closed: | |
Priority: | Major | Regression? | |
Status: | In Progress/In Progress | Components: | Applications/app_confbridge |
Versions: | 18.9.0 | Frequency of Occurrence | Frequent |
Related Issues: | |||
Environment: | Attachments: | ( 0) app_confbridge.c ( 1) wyeed8jbjr.txt ( 2) zq10e1w9qo.txt | |
Description: | This doesn't happen all the time, but it happens often enough. The actual timing seems to be random - possible race condition.
This reproduces very frequently - not every time, but about half the time or more - with this scenario: - A and B are in a confbridge - All in a split second, but in this order: - A leaves the bridge - B gets channel redirected out of the bridge - (confbridge is now empty, ceases to exist momentarily) - A re-enters the bridge. - C enters the bridge. C ends up in the bridge correctly, but A ends up in the wrong bridge. Two channels are supposed to get into the same ConfBridge, but they each end up by themselves, obviously not bridged together. It works correctly when the timing is different, and there ISN'T a brief moment where the bridge is "EMPTY" in the middle. Therefore, you ONLY see this in a trace for a bad, broken bridge: {noformat} [2022-05-25 15:32:49] DEBUG[23368][C-0000002a]: app_confbridge.c:1570 alloc_playback_chan: Created announcer channel 'CBAnn/oe7-0000007a;1' to conference bridge 'oe7' [2022-05-25 15:32:49] -- Channel CBAnn/oe7-0000007a;2 joined 'softmix' base-bridge <66c254c2-ef39-4252-9cc4-405fb58d1bbc> {noformat} I have several call debugs of "good" and "bad" calls that broke, and only in the bad/broken calls can I find "EMPTY" in the app_confbridge debug messages. The race condition can be detected by checking if the bridge exists later - note in this annotated example just how exactly right the threads need to interleave to trigger this. If the timing is just right, this will reproduce frequently. Otherwise, it is unlikely to occur. {noformat} CLI output with core set debug 1 app_confbridge: Thread exiting bridge: [2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY Thread joining bridge: [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1751 join_conference_bridge: Trying to find conference bridge 'oe7' [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1902 join_conference_bridge: Conference 'oe7' already exists in container. Thread exiting bridge: [2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: app_confbridge.c:1528 conf_ended: CONF END (this is the conf_ended function) Thread joining bridge: [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from EMPTY to SINGLE [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1352 conf_update_user_mute: User PJSIP/ATAxLB2-00000003 is unmuted: user:0 system:0. [2022-05-26 00:53:42] WARNING[28852][C-0000000b]: app_confbridge.c:1943 join_conference_bridge: Due to a race condition, conf oe7 disappeared before we could enter it [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY ^ Here, we successfully detect the race condition and abort. {noformat} More details: core show channels shows them both executing the ConfBridge application on the same ConfBridge, but only one channel is in the bridge according to confbridge list and confbridge show. A backtrace confirms the other channel has somehow entered a *different* bridge by itself. On occasion, channels joining the same ConfBridge at nearly the same time appear to get into different bridges as opposed to the same bridge, even though they have been requested to go into the same ConfBridge. A backtrace confirms that for channels with the same confbridge name, confbridge->bridge is different for them here (they should be the same): https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L2848 I am marking this "major" since such a race condition can completely break the usage of ConfBridge, and suggests that perhaps the locking here is insufficient: https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L1736 CLI trace showing bizarre confbridge/bridge/channel disonnance: https://paste.interlinked.us/zq10e1w9qo.txt Output of bridge show: {noformat} *CLI> bridge show 17378691-4657-41df-a2ba-ff871b5cda79 2916a9dc-9bad-4aea-8fae-fc4ace5ea748 3022d1db-2659-4dad-988f-e8e6994641bb all de65981b-7f2d-420c-8ce4-e2c6120953dd [2022-05-25 14:24:28] -- Remote UNIX connection [2022-05-25 14:24:28] -- Remote UNIX connection disconnected *CLI> bridge show 17378691-4657-41df-a2ba-ff871b5cda79 Id: 17378691-4657-41df-a2ba-ff871b5cda79 Type: base Technology: softmix Subclass: base Creator: ConfBridge Name: oe6 Video-Mode: none Video-Source-Id: Num-Channels: 3 Num-Active: 3 Duration: 00:31:02 Channel: PJSIP/ATAxLB1-00000016 Channel: CBAnn/oe6-00000048;2 Channel: Local/2129@originate-local-00000049;1 *CLI> bridge show 2916a9dc-9bad-4aea-8fae-fc4ace5ea748 Id: 2916a9dc-9bad-4aea-8fae-fc4ace5ea748 Type: base Technology: softmix Subclass: base Creator: ConfBridge Name: oe7 Video-Mode: none Video-Source-Id: Num-Channels: 2 Num-Active: 2 Duration: 00:30:49 Channel: Local/4002702129@lines-pre-0000004a;2 Channel: CBAnn/oe7-0000004d;2 *CLI> bridge show 3022d1db-2659-4dad-988f-e8e6994641bb Id: 3022d1db-2659-4dad-988f-e8e6994641bb Type: basic Technology: simple_bridge Subclass: basic Creator: Name: Video-Mode: none Video-Source-Id: Num-Channels: 2 Num-Active: 2 Duration: 00:30:51 Channel: Local/2129@originate-local-00000049;2 Channel: Local/4002702129@lines-pre-0000004a;1 *CLI> bridge show de65981b-7f2d-420c-8ce4-e2c6120953dd Id: de65981b-7f2d-420c-8ce4-e2c6120953dd Type: base Technology: softmix Subclass: base Creator: ConfBridge Name: oe7 Video-Mode: none Video-Source-Id: Num-Channels: 2 Num-Active: 2 Duration: 00:30:59 Channel: PJSIP/ATAxLB2-00000015 Channel: CBAnn/oe7-0000004c;2 {noformat} | ||
Comments: | By: Asterisk Team (asteriskteam) 2022-05-25 09:33:43.151-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur. Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/]. By: N A (InterLinked) 2022-05-25 09:37:59.559-0500 Backtrace of running process coredump attached. Key thing is search for: "oe7,silentbridge,incogtalker") at app_confbridge.c:2842 If you go up to stack frame 4 in the 2 matches, you'll notice the bridge pointer is different for them. In contrast, for oe6 as opposed to oe7, they are the same (correct). By: N A (InterLinked) 2022-05-25 09:39:39.449-0500 Here is some CLI output as well: By: N A (InterLinked) 2022-05-25 09:46:46.433-0500 Attached it is the app_confbridge.c in use, since it has some newer patches from master cherry picked into it. It's otherwise unmodified. |