Summary:ASTERISK-28888: res_corosync: causes asterisk crash in huge distributed environment.
Reporter:Università di Bologna - CESIA VoIP (cesia.voip@unibo.it)
Labels:patch
Date Opened:2020-05-12 09:18:33
Date Closed:2020-06-22 12:59:13
Versions:13.22.0 Frequency of
Environment:FreePBX 14
Attachments:( 0) res_corosync.diff
Description:The VoIP infrastructure of the University of Bologna is a distributed system based on FreePBX: it currently consists of 8 FreePBX servers running an internally developed custom module that implements the high availability of the system and keeps the PBXes synchronized. We have about 5300 SIP identities, and they will grow to 8000 in the coming months.
We are using the res_corosync Asterisk module to synchronize device states and MWI states across the PBXes, but due to the large number of SIP identities in our system, we encountered some problems.
We developed a patch for the res_corosync module with the changes needed to make it work in a huge distributed environment.

1) Fix memory leaks
Added code to release the ast_events extracted from corosync and stasis messages.

2) Clean stasis cache when a member of the corosync cluster leaves the group
Added code that removes, from the stasis cache of the members remaining in the group, all messages carrying the EID of the member that left.
If the device states of the departed member remain in the stasis cache of the other members, they are never updated again, and high-priority cached values, like BUSY, take precedence over the current device states.

3) Stop corosync event propagation when node is not joined to the group
Updated the dispatch_thread_handler code to detect when Asterisk is not joined to the corosync group, and added conditions to the publish_event_to_corosync code so that corosync messages are sent only when joined.
When a node is not joined, its corosync daemon can't send messages: the cpg_mcast_joined function appends new messages to the FIFO buffer until it is full and then blocks indefinitely.
In this scenario, if the stasis_message_cb callback, registered by res_corosync to handle stasis messages, tries to send a corosync message, the thread from the stasis thread pool is blocked until the node joins the corosync cluster.
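
Points 2 and 3 above can be illustrated with a small, self-contained C model. This is only a sketch under stated assumptions, not the code from the patch: the types struct eid and struct cached_state, and the functions purge_states_from, set_cluster_joined, publish_event, fake_mcast_send and sent_count are all hypothetical names invented for illustration. It does not use the Asterisk stasis cache or the corosync library, and it is single-threaded (real code would need proper synchronization around the joined flag).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 6-byte entity ID, standing in for a node's EID. */
struct eid { unsigned char bytes[6]; };

/* Hypothetical cache entry: one device state as reported by one node. */
struct cached_state {
    struct eid origin;   /* node that published this state */
    char device[32];     /* device name, e.g. "SIP/100" */
    int state;           /* opaque state value */
};

static bool cluster_joined;  /* static storage: starts as false */
static int messages_sent;    /* counts calls to the stand-in send */

/* Point 2: drop every entry published by the node that left.
 * Compacts the array in place and returns the number of entries kept. */
size_t purge_states_from(struct cached_state *cache, size_t count,
                         const struct eid *left)
{
    size_t kept = 0;

    assert(cache != NULL && left != NULL);
    for (size_t i = 0; i < count; i++) {
        if (memcmp(&cache[i].origin, left, sizeof(*left)) != 0) {
            if (kept != i) {
                cache[kept] = cache[i];
            }
            kept++;
        }
    }
    return kept;
}

/* Point 3: record whether the local node is currently in the group. */
void set_cluster_joined(bool joined)
{
    cluster_joined = joined;
}

/* Hypothetical stand-in for the real multicast call; it never blocks. */
static int fake_mcast_send(const void *buf, size_t len)
{
    (void) buf;
    (void) len;
    messages_sent++;
    return 0;
}

/* Refuse to send while not joined, so the caller is never parked in a
 * send that cannot complete. Returns 0 on success, -1 when skipped. */
int publish_event(const void *buf, size_t len)
{
    if (!cluster_joined) {
        return -1;
    }
    return fake_mcast_send(buf, len);
}

int sent_count(void)
{
    return messages_sent;
}
```

In this model a membership-change handler would call purge_states_from once for each departed EID and set_cluster_joined with the local node's status, while every publisher goes through publish_event.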

This is still a work in progress, as we haven't solved all the issues: in a huge distributed environment like ours, some problems still occur occasionally:

1) When the delivery of a device state to the corosync group fails without a cluster membership change, that device state is no longer propagated to the other PBXes until it changes again or a node joins the cluster.

2) The cpg_mcast_joined function of the corosync library blocks the calling thread until the message is delivered to the local corosync daemon. Under some circumstances, the local corosync daemon is unable to receive the message from the corosync library and the calling thread blocks indefinitely. The cpg_mcast_joined call is inside a critical section guarded by a lock, and the same lock protects the code inside the res_corosync module that reinitializes the connection of the corosync library to the corosync daemon: when a thread is blocked inside the cpg_mcast_joined call, res_corosync is unable to detect corosync daemon failures and to reinitialize the connection. It is also not possible to unload the res_corosync module, because the blocked thread keeps the module's shared library locked.
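
The lock interaction described in point 2 can be modelled in a few lines of C. This is a hedged sketch, not the module's code: cpg_lock_model, sender_enter_blocking_send, sender_send_completed and try_reinit_connection are hypothetical names, and a C11 atomic_flag used as a try-lock stands in for the module's real lock. In the real module the recovery path would simply wait on the lock; a try-lock is used here only so the effect can be observed without hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the lock shared by the send path and the
 * reconnect path. */
static atomic_flag cpg_lock_model = ATOMIC_FLAG_INIT;

/* Returns true if the lock was free and is now held by the caller. */
static bool model_try_lock(void)
{
    return !atomic_flag_test_and_set(&cpg_lock_model);
}

static void model_unlock(void)
{
    atomic_flag_clear(&cpg_lock_model);
}

/* Send path: takes the lock and then "blocks" in the send. The blocking
 * is modelled by returning while still holding the lock. */
bool sender_enter_blocking_send(void)
{
    return model_try_lock();
}

/* Called only if the blocked send ever completes. */
void sender_send_completed(void)
{
    model_unlock();
}

/* Recovery path: needs the same lock to reinitialize the connection.
 * Returns false, and does nothing, while the sender still holds it. */
bool try_reinit_connection(int *reinit_count)
{
    assert(reinit_count != NULL);
    if (!model_try_lock()) {
        return false;
    }
    (*reinit_count)++;
    model_unlock();
    return true;
}
```

While the sender is parked inside the send, try_reinit_connection keeps returning false, which mirrors why a daemon failure cannot be handled until the blocked call returns.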
Comments:
By: Asterisk Team (asteriskteam) 2020-05-12 09:18:34.859-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua C. Colp (jcolp) 2020-05-12 09:27:53.654-0500

Do you plan on putting this patch up on Gerrit for inclusion?

By: Università di Bologna - CESIA VoIP (cesia.voip@unibo.it) 2020-05-12 09:34:49.104-0500

Sorry, I forgot the patch...

By: Joshua C. Colp (jcolp) 2020-05-12 10:07:33.374-0500

Per my last comment, do you plan on putting this up on Gerrit or just attaching the patch here?

By: Joshua C. Colp (jcolp) 2020-05-12 12:00:27.904-0500

I have also had to remove the patch as it was not marked as a contribution. In order to accept contributions the License Agreement has to be signed, after which a patch can be uploaded. This can be done by clicking the "Sign a License Agreement" link at the top of the page and filling out the agreement. Once approved by legal then a patch can be uploaded.

By: Università di Bologna - CESIA VoIP (cesia.voip@unibo.it) 2020-05-13 03:39:18.131-0500

I signed the license agreement and resubmitted the patch

By: Joshua C. Colp (jcolp) 2020-05-13 04:19:39.695-0500

License agreements are manually reviewed by legal. You must wait until that is done before attaching a patch.

By: Università di Bologna - CESIA VoIP (cesia.voip@unibo.it) 2020-05-26 07:17:15.818-0500

Legal reviewed the license agreement, and I resubmitted the patch as requested.

By: Joshua C. Colp (jcolp) 2020-05-27 03:42:23.759-0500

If you would like to submit the patch through code review then please follow the Patch Contribution Process[1]. Otherwise I have acknowledged this issue and it will be up to someone else to take the patch through that process.

[1] https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process

By: Università di Bologna - CESIA VoIP (cesia.voip@unibo.it) 2020-06-03 07:45:37.919-0500

I submitted the patch to Gerrit for code review.

By: Università di Bologna - CESIA VoIP (cesia.voip@unibo.it) 2020-06-05 03:26:33.598-0500

I resubmitted the patch to Gerrit with only the functional changes.

By: Friendly Automation (friendly-automation) 2020-06-22 12:59:14.549-0500

Change 14603 merged by Friendly Automation:
res_corosync: Fix crash in huge distributed environment.


By: Friendly Automation (friendly-automation) 2020-06-22 13:06:42.882-0500

Change 14455 merged by Kevin Harwell:
res_corosync: Fix crash in huge distributed environment.


By: Friendly Automation (friendly-automation) 2020-06-22 13:07:23.215-0500

Change 14601 merged by Kevin Harwell:
res_corosync: Fix crash in huge distributed environment.


By: Friendly Automation (friendly-automation) 2020-06-22 13:11:00.747-0500

Change 14602 merged by Kevin Harwell:
res_corosync: Fix crash in huge distributed environment.