[Home]

Summary:ASTERISK-30445: stasis: Off-nominal channel leave causes bridge to be destroyed
Reporter:Quality Unit (adminQU)Labels:
Date Opened:2023-02-28 04:30:27.000-0600Date Closed:
Priority:MinorRegression?
Status:Open/NewComponents:Resources/res_stasis
Versions:18.16.0 20.1.0 Frequency of
Occurrence
Occasional
Related
Issues:
is related toASTERISK-30501 Frack assertion error in while loop in stasis.c
Environment:CentOS 7, kernel 3.10.0-1160.83.1.el7.x86_64.Attachments:( 0) asterisk_dump_logs_2023-04-17_core.18037.zip
( 1) asterisk_dump_logs_2023-04-17_core.26817.zip
( 2) asterisk_dump_logs_2023-04-17-10_core.1913.zip
( 3) asterisk_incident_1000_lines_before_and_after.txt
( 4) core_asterisk.zip
( 5) module_show_output&logs_before_incident.txt.zip
Description:This started to happen after upgrade to `18.16.0` from `16.29.0`. Asterisk starting to spam logs with similar messages like:

{code:none}
[2023-02-24 09:30:57.001] ERROR[11361] stasis.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x7f7ec4031d10 (0)
[2023-02-24 09:30:57.006] ERROR[11361] : Got 14 backtrace records
# 0: /usr/sbin/asterisk() [0x459ae6]
# 1: /usr/sbin/asterisk(__ao2_ref+0x79) [0x459fc9]
# 2: /usr/sbin/asterisk(stasis_forward_all+0x83) [0x555283]
# 3: /usr/lib64/asterisk/modules/res_stasis.so(+0x107e1) [0x7f7f62a7e7e1]
# 4: /usr/lib64/asterisk/modules/res_stasis.so(+0x14e2a) [0x7f7f62a82e2a]
# 5: /usr/lib64/asterisk/modules/res_stasis.so(+0x12966) [0x7f7f62a80966]
# 6: /usr/lib64/asterisk/modules/res_stasis.so(+0x1554e) [0x7f7f62a8354e]
# 7: /usr/lib64/asterisk/modules/res_stasis.so(stasis_app_exec+0x569) [0x7f7f62a79789]
# 8: /usr/lib64/asterisk/modules/app_stasis.so(+0xbd7) [0x7f7f4d6f4bd7]
# 9: /usr/sbin/asterisk(pbx_exec+0xb9) [0x50db89]
#10: /usr/lib64/asterisk/modules/res_ari_channels.so(+0xa528) [0x7f7f48594528]
#11: /usr/sbin/asterisk() [0x57e80d]
#12: /lib64/libpthread.so.0(+0x7ea5) [0x7f7fb932fea5]
#13: /lib64/libc.so.6(clone+0x6d) [0x7f7fb88ff98d]
{code}

After around ~3 minutes of these (Asterisk during that time is already unresponsive), we have final errors before service crash:

{code:none}
[2023-02-24 09:32:48.153] ERROR[11361] astmm.c: Memory Allocation Failure in function topic_add_subscription at line 1214 of stasis.c
{code}

And, from `/var/log/messages`:

{code:none}
Feb 24 09:32:48 localhost kernel: asterisk[11361]: segfault at 1 ip 0000000000552827 sp 00007f7ed3e0a6b0 error 4 in asterisk[400000+2a2000]
{code}

It happens randomly, once, twice a day, sometime doesn't happen at all.
Though, it's always happening when Asterisk is processing calls, so it's not happening during night (when asterisk is almost idle)
Comments:By: Asterisk Team (asteriskteam) 2023-02-28 04:30:30.636-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Quality Unit (adminQU) 2023-02-28 04:31:46.040-0600

Attaching core files after crash.  

By: Joshua C. Colp (jcolp) 2023-02-28 04:38:57.569-0600

You stated this is in Asterisk 16.29.0, yet this is tagged against 18.16.0. Has this occurred in 18? Asterisk 16 is security fix only.

By: Quality Unit (adminQU) 2023-02-28 04:44:13.799-0600

Yes, this is occurring in 18.16.0, I said that it has started after upgrade, we haven't spotted that on Asterisk 16.

By: Joshua C. Colp (jcolp) 2023-02-28 04:55:23.567-0600

I misread the description.

This crash is most likely a symptom of a memory leak somewhere. It's crashing because of a failure to allocate memory.

We'd need memory leak information[1] to try to determine where it is leaking. Information about how exactly you use Asterisk would also be useful - what parts of Asterisk, what channel drivers, etc.

[1] https://wiki.asterisk.org/wiki/display/AST/Memory+Leak+Debugging

By: Quality Unit (adminQU) 2023-02-28 09:21:46.161-0600

That's interesting, I reported possible memory leak issue in https://community.asterisk.org/t/possible-memory-leak-in-asterisk-v16-29-0/94828/2, but it was low priority since adding more memory was enough to mitigate it.

We use a self-written Python application, that uses ARI to control call flows, manage channels, play audio files, and much more.
OpenSIPS host used as a load balancer and two Asterisk hosts are behind it, registrations processed on OpenSIPS.

I'm attaching another archive with `module show` output and VERBOSE logs seconds before the incident. Please, let me know if you want me to provide more details, I'll get memory leak info in the meantime.

By: Asterisk Team (asteriskteam) 2023-03-16 12:00:19.658-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Quality Unit (adminQU) 2023-04-18 04:54:37.648-0500

It was reproduced on version `20.1.0`, "Failed assertion..." happened 3 times in a row:
{code:none}
$ grep 'Failed assertion' full
[2023-04-17 07:26:46.332] ERROR[1600] stasis.c: FRACK!, Failed assertion bad magic number 0xdeaddea for object 0x7fd6def2ca60 (0)
[2023-04-17 07:26:46.334] ERROR[1600] stasis.c: FRACK!, Failed assertion bad magic number 0xdeaddea for object 0x7fd6def2ca60 (0)
[2023-04-17 08:27:29.982] ERROR[17890] stasis.c: FRACK!, Failed assertion bad magic number 0xdeaddea for object 0x7f36b809f520 (0)
[2023-04-17 08:27:29.983] ERROR[17890] stasis.c: FRACK!, Failed assertion bad magic number 0xdeaddea for object 0x7f36b809f520 (0)
[2023-04-17 08:59:43.084] ERROR[13188] stasis.c: FRACK!, Failed assertion bad magic number 0xdeaddea for object 0x7f354c04a3d0 (0)
[2023-04-17 08:59:43.085] ERROR[13188] stasis.c: FRACK!, Failed assertion bad magic number 0xdeaddea for object 0x7f354c04a3d0 (0)
{code}
Yet, there is nothing logged in `mmlog`:

{code:none}
$ sudo cat /var/log/asterisk/mmlog
1680796806 - New session
1680800798 - New session
1680801419 - New session
1680802085 - New session
1680802457 - New session
1680803107 - New session
1680805268 - New session
1680886819 - New session
1680887275 - New session
1681117108 - New session
1681124183 - New session
1681127745 - New session
1681131634 - New session
1681458248 - New session
1681716666 - New session
1681720295 - New session
1681722233 - New session
{code}

Asterisk was compiled with `DON'T OPTIMIZE` flag, so core dumps probably will have more info, attaching them here.

By: Asterisk Team (asteriskteam) 2023-04-18 04:54:38.479-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Joshua C. Colp (jcolp) 2023-04-18 05:03:02.925-0500

The backtraces just show the trying to publish a message failed, due to the subscriber being NULL when it shouldn't be. It doesn't show how things got to that state and with the available information would most likely be impossible to determine. Providing all of the logging around the assertions (100-1000 lines, depending on usage) MAY provide further insight. Essentially have to determine the history of things.

By: Joshua C. Colp (jcolp) 2023-04-18 05:09:23.876-0500

Does the memory allocation failure message still occur?

By: Joshua C. Colp (jcolp) 2023-04-18 05:10:34.224-0500

And what is the output of "memory show summary"?

By: Quality Unit (adminQU) 2023-04-18 05:11:04.848-0500

Was obtained with:
{code:none}
$ grep -A 1000 -B 1000 'Failed assertion' full > asterisk_incident_1000_lines_before_and_after.txt
{code}

Full log size is 1.6GB, let me know if you need more, I can try to compress/truncate it and share full.

Asterisk is running with `-vvvvvdddddg` verbosity levels, so hopefully logs are as verbose as possible.

By: Quality Unit (adminQU) 2023-04-18 05:16:03.540-0500

{code:none}
$ sudo asterisk -x 'memory show summary'
 48662696 bytes in       4049 allocations in file ../src/pj/pool_policy_malloc.c
   480135 bytes in        635 allocations in file /opt/asterisk/include/asterisk/strings.h
   158260 bytes in        520 allocations in file /opt/asterisk/include/asterisk/threadstorage.h
   157508 bytes in        521 allocations in file abstract_jb.c
      712 bytes in          1 allocations in file app_bridgewait.c
       80 bytes in          1 allocations in file ari/ari_websockets.c
     1057 bytes in          6 allocations in file ari/config.c
     4174 bytes in        117 allocations in file ari/resource_channels.c
      813 bytes in          3 allocations in file ari/resource_events.c
     7730 bytes in        245 allocations in file asterisk.c
   849408 bytes in      35392 allocations in file astobj2.c
 12876768 bytes in     178844 allocations in file astobj2_hash.c
    10800 bytes in        135 allocations in file astobj2_rbtree.c
   198392 bytes in        605 allocations in file bridge.c
     5760 bytes in        108 allocations in file bridge_after.c
    20400 bytes in         51 allocations in file bridge_channel.c
     4080 bytes in        120 allocations in file bridge_roles.c
  1622200 bytes in        106 allocations in file bridge_softmix.c
     1283 bytes in          4 allocations in file bucket.c
      816 bytes in          4 allocations in file ccss.c
   131819 bytes in        151 allocations in file cdr.c
     6816 bytes in          6 allocations in file cel.c
       80 bytes in          1 allocations in file chan_audiosocket.c
      160 bytes in          2 allocations in file chan_bridge_media.c
   175898 bytes in        321 allocations in file chan_pjsip.c
      160 bytes in          2 allocations in file chan_rtp.c
    35208 bytes in        552 allocations in file channel.c
   105990 bytes in        269 allocations in file channel_internal_api.c
    11298 bytes in        530 allocations in file cli.c
     6144 bytes in         47 allocations in file codec.c
       88 bytes in          1 allocations in file codec_opus_config.c
    26496 bytes in          1 allocations in file codec_resample.c
   322730 bytes in       2811 allocations in file config.c
   107256 bytes in        701 allocations in file config_options.c
      216 bytes in          2 allocations in file core_local.c
    99840 bytes in         45 allocations in file core_unreal.c
  1778956 bytes in      23497 allocations in file devicestate.c
      418 bytes in          4 allocations in file dial.c
    92160 bytes in         64 allocations in file dsp.c
 14287856 bytes in      46992 allocations in file endpoints.c
    41175 bytes in        277 allocations in file features_config.c
   516210 bytes in         85 allocations in file file.c
    28352 bytes in        443 allocations in file fixedjitterbuf.c
     5143 bytes in         70 allocations in file format.c
      992 bytes in          1 allocations in file format_cache.c
  1573056 bytes in      12296 allocations in file format_cap.c
     8136 bytes in        123 allocations in file framehook.c
      712 bytes in          1 allocations in file func_dialgroup.c
    87592 bytes in          1 allocations in file hashtab.c
     2769 bytes in         16 allocations in file http.c
      968 bytes in          1 allocations in file indications.c
     4256 bytes in          2 allocations in file iostream.c
      312 bytes in          5 allocations in file json.c
     2296 bytes in          1 allocations in file libasteriskssl.c
    74902 bytes in        463 allocations in file loader.c
    16746 bytes in          7 allocations in file logger.c
      294 bytes in         10 allocations in file logger_category.c
   169503 bytes in        209 allocations in file manager.c
      968 bytes in         36 allocations in file max_forwards.c
     1096 bytes in          1 allocations in file media_cache.c
       96 bytes in          2 allocations in file message.c
     3776 bytes in         21 allocations in file named_locks.c
 14009618 bytes in     110441 allocations in file pbx.c
   494316 bytes in        392 allocations in file pbx_app.c
   604576 bytes in      24179 allocations in file pbx_config.c
   482240 bytes in        268 allocations in file pbx_functions.c
     9288 bytes in          1 allocations in file pbx_realtime.c
      600 bytes in         12 allocations in file pbx_sw.c
     6998 bytes in        131 allocations in file pbx_variables.c
      240 bytes in          2 allocations in file pjsip/cli_commands.c
       56 bytes in          1 allocations in file presencestate.c
       88 bytes in          1 allocations in file res_aeap.c
      504 bytes in          2 allocations in file res_ari.c
       55 bytes in          3 allocations in file res_ari_events.c
      968 bytes in          1 allocations in file res_clialiases.c
       24 bytes in          6 allocations in file res_format_attr_ilbc.c
      336 bytes in          7 allocations in file res_format_attr_opus.c
     1944 bytes in          5 allocations in file res_geolocation/geoloc_config.c
      640 bytes in          4 allocations in file res_hep.c
    66915 bytes in         11 allocations in file res_http_websocket.c
    57317 bytes in         22 allocations in file res_musiconhold.c
      754 bytes in          5 allocations in file res_odbc.c
     2090 bytes in         35 allocations in file res_pjproject.c
      192 bytes in          6 allocations in file res_pjsip.c
      584 bytes in          2 allocations in file res_pjsip/config_auth.c
      465 bytes in          2 allocations in file res_pjsip/config_global.c
       88 bytes in          1 allocations in file res_pjsip/config_system.c
     8540 bytes in         17 allocations in file res_pjsip/config_transport.c
 11514248 bytes in      11873 allocations in file res_pjsip/location.c
      352 bytes in          1 allocations in file res_pjsip/pjsip_cli.c
   515499 bytes in      12344 allocations in file res_pjsip/pjsip_configuration.c
    10888 bytes in         69 allocations in file res_pjsip/pjsip_distributor.c
     1392 bytes in          3 allocations in file res_pjsip/pjsip_global_headers.c
 25089849 bytes in     105602 allocations in file res_pjsip/pjsip_options.c
      992 bytes in          1 allocations in file res_pjsip/pjsip_scheduler.c
   165888 bytes in       1728 allocations in file res_pjsip/pjsip_session.c
     2152 bytes in          1 allocations in file res_pjsip/pjsip_transport_events.c
     2152 bytes in          1 allocations in file res_pjsip/pjsip_transport_management.c
      101 bytes in          1 allocations in file res_pjsip_authenticator_digest.c
      591 bytes in         48 allocations in file res_pjsip_caller_id.c
      296 bytes in          6 allocations in file res_pjsip_config_wizard.c
      120 bytes in          1 allocations in file res_pjsip_endpoint_identifier_ip.c
      616 bytes in          1 allocations in file res_pjsip_exten_state.c
     2048 bytes in          1 allocations in file res_pjsip_history.c
     4232 bytes in          1 allocations in file res_pjsip_logger.c
     1936 bytes in          2 allocations in file res_pjsip_mwi.c
     1088 bytes in          2 allocations in file res_pjsip_outbound_registration.c
     1424 bytes in          2 allocations in file res_pjsip_pubsub.c
     4560 bytes in         57 allocations in file res_pjsip_sdp_rtp.c
   156530 bytes in        942 allocations in file res_pjsip_session.c
     1520 bytes in         19 allocations in file res_pjsip_session/pjsip_session_caps.c
     1968 bytes in         41 allocations in file res_pjsip_t38.c
   585929 bytes in        581 allocations in file res_rtp_asterisk.c
       25 bytes in          1 allocations in file res_sorcery_astdb.c
     6070 bytes in         45 allocations in file res_sorcery_config.c
     6776 bytes in          7 allocations in file res_sorcery_memory.c
      968 bytes in          1 allocations in file res_sorcery_memory_cache.c
      124 bytes in          7 allocations in file res_sorcery_realtime.c
   329604 bytes in        131 allocations in file res_srtp.c
    11816 bytes in         87 allocations in file res_stasis.c
      712 bytes in          1 allocations in file res_stasis_device_state.c
     6227 bytes in         21 allocations in file res_stasis_playback.c
    13882 bytes in         61 allocations in file res_stasis_recording.c
      328 bytes in          2 allocations in file res_statsd.c
       88 bytes in          1 allocations in file res_stir_shaken.c
     2000 bytes in          1 allocations in file res_stir_shaken/general.c
     9128 bytes in          1 allocations in file res_timing_pthread.c
    15232 bytes in        136 allocations in file res_timing_timerfd.c
   153927 bytes in        539 allocations in file rtp_engine.c
   223424 bytes in        425 allocations in file sched.c
      408 bytes in         17 allocations in file sdp_srtp.c
      376 bytes in          4 allocations in file serializer.c
    22260 bytes in         42 allocations in file slinfactory.c
  4552043 bytes in      24483 allocations in file sorcery.c
 22068831 bytes in     236249 allocations in file stasis.c
     6680 bytes in         55 allocations in file stasis/app.c
     4000 bytes in         20 allocations in file stasis/command.c
    32696 bytes in        134 allocations in file stasis/control.c
     2208 bytes in          2 allocations in file stasis/messaging.c
    51192 bytes in        162 allocations in file stasis_bridges.c
  3343182 bytes in      58751 allocations in file stasis_cache.c
  1127904 bytes in      11749 allocations in file stasis_cache_pattern.c
   367436 bytes in       1029 allocations in file stasis_channels.c
  2544751 bytes in      35421 allocations in file stasis_message.c
  2256720 bytes in      23503 allocations in file stasis_message_router.c
     1184 bytes in          3 allocations in file stasis_state.c
    37104 bytes in          2 allocations in file stdtime/localtime.c
   812740 bytes in      15092 allocations in file stream.c
  1672856 bytes in      49269 allocations in file stringfields.c
 24835429 bytes in      12267 allocations in file strings.c
  7139359 bytes in      47269 allocations in file taskprocessor.c
     1820 bytes in         11 allocations in file tcptls.c
   960256 bytes in      23621 allocations in file threadpool.c
     2360 bytes in        140 allocations in file timing.c
  1131456 bytes in        138 allocations in file translate.c
      160 bytes in          2 allocations in file udptl.c
    15153 bytes in        185 allocations in file utils.c
  1253746 bytes in       6322 allocations in file xmldoc.c
213632459 bytes allocated in 1127831 selected allocations

213632459 bytes in all allocations
   583543 bytes in deferred free large allocations
    35188 bytes in deferred free small allocations
   618731 bytes in deferred free allocations
214251190 bytes in all allocations and deferred free allocations
{code}

> Does the memory allocation failure message still occur?

It does, it looks like it's tied to number of channels/calls processed. It usually happens once during the day, when we process the most calls (up to 30-40 per one Asterisk machine, we have two machines behind OpenSIPS).
Sometimes it might get more often, like 2-3 times per day, sometime might not happen at all. As I noticed, it's tied to high calls volume, at least it looks like.

By: Joshua C. Colp (jcolp) 2023-04-18 05:24:50.887-0500

There is an off-nominal scenario where channel departure fails, and the bridge is destroyed while it is subsequently still used:

{code}
[2023-04-17 08:59:43.075] DEBUG[2971] stasis/control.c: 1681721248.708: Channel is no longer in departable state
[2023-04-17 08:59:43.075] DEBUG[2971] bridge.c: Bridge 1681721248.708: actually destroying stasis bridge, nobody wants it anymore
[2023-04-17 08:59:43.075] DEBUG[2971] stasis_bridges.c: Update: 0x7f35400b72f8  Old: 1681721248.708  New: <none>
[2023-04-17 08:59:43.075] DEBUG[2971] stasis_bridges.c: Update: 0x7f35400b72f8  Old: 1681721248.708  New: <none>
[2023-04-17 08:59:43.075] DEBUG[2971] stasis.c: Destroying topic. name: bridge:all/bridge:1681721248.708, detail:
[2023-04-17 08:59:43.076] DEBUG[2971] stasis.c: Topic 'bridge:all/bridge:1681721248.708': 0x7f354c04a3d0 destroyed
{code}

That's what is causing this. I've accepted the issue.

By: Quality Unit (adminQU) 2023-04-18 05:50:23.181-0500

Great news!
Do I understand correctly, we can expect a fix in the nearest minor release?

By: Joshua C. Colp (jcolp) 2023-04-18 05:53:37.533-0500

There is no time frame on when a fix would occur.

By: Quality Unit (adminQU) 2023-04-18 05:55:29.002-0500

I expected that answer, but had to make sure.
OK then, thanks, we'll be waiting.

By: Abhay Gupta (agupta) 2023-04-24 07:47:30.112-0500

We are also facing this issue and server is highly loaded . Were you able to find the sequence of event that could lead to this crash so we can try to stop the bug from hitting till a solution is found .


By: Abhay Gupta (agupta) 2023-04-26 01:22:32.239-0500

We are able to generate the issue at our will and can do the testing if some patch can be suggested
1. Create bridge , put channel in stasis bridge
2. Delete the channel
3. Delete the bridge

If we do a few calls in succession it gives this FRACK error in while loop and asterisk crashes after sometime .

By: Abhay Gupta (agupta) 2023-04-26 04:07:16.062-0500

Also confirmed that the issue is not present in asterisk 16.30