Summary:ASTERISK-29882: Occasional segfaults in production
Reporter:Duncan (duncanvxt)Labels:patch
Date Opened:2022-01-26 16:40:23.000-0600Date Closed:
Versions:18.8.0 Frequency of
Environment:Debian Stretch in DockerAttachments:( 0) asterisk-chan-save.patch
( 1) coreFile-thread1.txt
( 2) Screenshot_from_2021-12-20_09-09-29.png
( 3) slow-snoop.zip
Description:Hi team,

We are seeing Asterisk instances occasionally segfault in production. This happens in our voicemail handling stasis app, which includes playing media to the caller, and then recording the caller, managed through ARI. We don't know what the steps are to reproduce the issue as we have never reproduced it ourselves, but suspect it is probably some kind of race condition.

I have attached coreFile-thread1.txt from the core dump. We have all the other core dump *.txt files but would prefer not to post these publicly in case they could contain any sensitive data. I would be happy to provide these directly to a maintainer via email. We’re happy to assist debugging this issue if at all possible, and have some C experience but no experience working on Asterisk. Please get in touch if we can do anything to help troubleshoot further.

Many thanks
Comments:By: Asterisk Team (asteriskteam) 2022-01-26 16:40:24.626-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Kevin Harwell (kharwell) 2022-01-26 17:55:59.877-0600

{panel:title=Private Submission of Information Disclaimer}You have indicated that you wish to submit unredacted information privately. It is not recommended to do this as it will substantially restrict the number of individuals who can help with your issue, as submitted information is only available to Sangoma. Note that submission of such information does not change the priorization of this issue. If you still wish to proceed you may do so by sending it to asteriskteam@digium.com with the issue number in the subject. For large files please send a link where they can then be downloaded. By sending this information you agree to the Website Terms of Use available on the Sangoma website at https://www.sangoma.com/legal/. Any exchange of private information between you and an Asterisk community member outside of the Asterisk JIRA is not subject to the Website Terms of Use and should be privately discussed between yourself and the Asterisk community member.{panel}

By: Kevin Harwell (kharwell) 2022-01-26 18:00:41.848-0600

If you're fine with the above disclaimer then please send the backtrace files to the address mentioned.

Also, are you comfortable using {{gdb}}?
$ gdb asterisk <core file>
If so what's the output for the following (barring it contains no sensitive info):
(gdb) frame 7
(gdb) p snoop->spyee_chan
(gdb) p *snoop->spyee_chan

By: Kevin Harwell (kharwell) 2022-01-26 18:03:15.775-0600

Oh and also if possible please attach an [Asterisk debug|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information] log leading up to the crash.

By: Duncan (duncanvxt) 2022-01-26 19:00:50.336-0600

Hi Kevin,

I've sent the other coreFile-*.txt files to that email address, thanks.

This is the output I get for those gdb commands:

(gdb) frame 7
#7  0x00007f4039ffbeb5 in ?? ()
(gdb) p snoop->sypee_chan
No symbol "snoop" in current context.
(gdb) p *snoop->sypee_chan
No symbol "snoop" in current context.

All we have on hand in the way of logs leading up to the crash is the attached screenshot, hope it's of some use.

Many thanks,

By: Duncan (duncanvxt) 2022-01-26 19:01:09.198-0600

Asterisk logs leading up to the segmentation fault

By: Kevin Harwell (kharwell) 2022-01-27 11:43:44.169-0600

In the {{gdb}} output above it appears the frame is correct however not sure why it shows "??" here.

As far as the "No symbol" goes though it looks like you misspelled {{snoop->spyee_chan}}. You've got it as _sypee_chan_.

By: Kevin Harwell (kharwell) 2022-01-27 11:59:13.365-0600

Also as a note (I'm not sure if this would be the actual fix to this problem or the entire fix) it appears a NULL check on the {{snoop->spyee_chan}} is warranted before trying to access it in {{publish_chanspy_message}}. Given the backtrace on this issue it does appear that there are path(s) in which {{snoop->spyee_chan}} could be NULL in that function.

For instance, here in this backtrace that function is called from {{stasis_app_control_snoop}} before the {{snoop->spyee_chan}} is set.

That said the call to {{ao2_alloc_options}} should have NULL'ed that field out, and given current information it appears though that {{snoop->spyee_chan}}  might be pointing to "junk".

By: Duncan (duncanvxt) 2022-01-27 15:24:11.597-0600

You're right, sorry about that. However I seem to get the same output with it spelled correctly, unfortunately:

(gdb) frame 7
#7  0x00007f4039ffbeb5 in ?? ()
(gdb) p snoop->spyee_chan
No symbol "snoop" in current context.
(gdb) p *snoop->spyee_chan
No symbol "snoop" in current context.

By: Duncan (duncanvxt) 2022-01-27 15:39:42.010-0600

I've obtained a more useful output by running gdb on the docker container that was running Asterisk, I think it may have had something to do with a version mismatch or missing access to shared library files from outside the container but I'm not exactly sure, sorry I'm pretty unfamiliar with gdb etc. Here is the output:

(gdb) frame 7
#7  0x00007f4039ffbeb5 in publish_chanspy_message (snoop=0x7f408035b688, start=<optimized out>) at res_stasis_snoop.c:138
138 res_stasis_snoop.c: No such file or directory.
(gdb) p snoop->spyee_chan
$1 = (struct ast_channel *) 0x0
(gdb) p *snoop->spyee_chan
Cannot access memory at address 0x0

By: Kevin Harwell (kharwell) 2022-01-27 15:40:11.664-0600

Hrm would really like to know if that value is indeed NULL. It appears to be at the correct address, but maybe missing symbols now. Did you perhaps copy the core file to another system, or reinstall Asterisk since obtaining and loading the core in {{gdb}}? If so that might account for it.

Otherwise, maybe next time it crashes load the new core file into {{gdb}}, look for the frame number in the stack trace that matches to the {{publish_chanspy_message}} function, and then see what the output is for that variable on that frame.

How often does it crash? Also, next time it does go ahead and post thread1 again if you can. I'm curious if it crashes in the same place every time.

By: Duncan (duncanvxt) 2022-01-27 15:44:11.112-0600

See the message I posted just before you, looks like it is NULL :)

At a guess I would say it crashes roughly every two weeks, but it's pretty random (was once twice in the same day).

By: Duncan (duncanvxt) 2022-01-27 15:48:09.283-0600

We don't currently have any core files from other crashes, but will be sure to capture the next one.

By: Kevin Harwell (kharwell) 2022-01-27 16:05:19.121-0600

Ah yeah ha I posted without refreshing so missed your update.

It is indeed NULL! I do wonder though why it didn't crash on the earlier call to {{ast_channel_uniqueid}}.

I'll open the issue. Not sure when it'll be looked into though. The fastest way to a fix is by submitting a patch :-) (if you're interested)

Since it is NULL a NULL check in that function before accessing the variable should keep it from crashing. So if you wanted to avoid crashes you could patch the code to do such. However, without someone looking more into I don't know the full reprecussions of doing such. For instance, if that information does not exist should the message be published? Or should it be published without said data? Or perhaps should it be always guranteed that data exists before that function is called? etc...

By: William ML Leslie (wleslie) 2022-10-23 20:40:07.183-0500

Greetings, we have been hitting this issue too; I've found what's going on.  Here's the relevant portion of a traceback.

#7  0x00007f096c4439a5 in publish_chanspy_message (snoop=0x7f09540a8e58, start=0) at res_stasis_snoop.c:138
#8  0x00007f096c443f76 in snoop_hangup (chan=0x7f0954009900) at res_stasis_snoop.c:228
#9  0x0000564b94998096 in ast_hangup (chan=0x7f0954009900) at channel.c:2612
#10 0x00007f096c4447bf in stasis_app_control_snoop (chan=0x7f0968294b50, spy=STASIS_SNOOP_DIRECTION_IN, whisper=STASIS_SNOOP_DIRECTION_NONE, app=0x7f0954066340 "our-application", app_args=0x0,
   snoop_id=0x7f095404f3d8 "1666316293.2651-in-snoop") at res_stasis_snoop.c:380
#11 0x00007f096c3709d8 in ari_channels_handle_snoop_channel (args_channel_id=0x7f0954026c8a "1666316293.2651", args_spy=0x7f095407ebf0 "in", args_whisper=0x7f09540accb0 "none",
   args_app=0x7f0954066340 "our-application", args_app_args=0x0, args_snoop_id=0x7f095404f3d8 "1666316293.2651-in-snoop", response=0x7f092b7dab20) at ari/resource_channels.c:1638

Key here is that if stasis_app_control_snoop fails in some way, it hangs up on the fresh channel it has created.  The snoop channel has been set as this channel's tech on line 351:
ast_channel_tech_set(snoop->chan, &snoop_tech);

As part of the channel hangup process, the tech's hangup callback is invoked, so hanging up on the channel ends up calling snoop_hangup on a snoop that has not been fully initialised, notably that doesn't have spyee_chan set.

There are a couple of ways we could fix this.  I'll get a reproducer to see if I can rule either one out.

One possibility is simply checking for null snoop->spyee_chan in publish_chanspy_message.  The function is already set up to handle messaging with no spyee_snapshot, so it's valid in a schema sense.  The real question is whether the resulting messages will be confusing - talking about a channel that is being removed, with no association to an existing channel.

We could alternatively set snoop->spyee_chan earlier in stasis_app_control_snoop and bump its reference count.  This would lead to more sensible messages, and the additional reference would be destroyed when the function returns and deallocates the snoop.  I don't know if I'm missing any other invariants that prevent this from working.  This is probably the better approach.

I'm assuming that we do intend to send these messages if we fail to start a snoop.

By: William ML Leslie (wleslie) 2022-10-25 23:00:14.419-0500

Config + ari-py code for reproducing this crash with pjsua.  Should reproduce in a few minutes on asterisk 18.0+.

Note there is some config for the network - we test asterisk with a manually configured network in order to reproduce some of our iax2 trunking setups and I've preserved that here.  Remember to change the IP addresses and ports in these configs / in the program to match asterisk as you are running it.

By: William ML Leslie (wleslie) 2022-10-25 23:47:47.047-0500

A patch, implementing my second idea by moving a single line of code, made available under the same license as asterisk.  Instead of crashing the system, the attempt to add a snoop channel will receive a 500 error with body {"message":"Snoop channel could not be created"}.  We could try to detect this case and return a 404 to match the other case where the parent channel has gone away if desired.

I have run the repro code for many hours without a crash and am now running my load tests (we test asterisk with thousands of concurrent calls), and will let you know if we discover any issues.

FWIW ASTERISK-29604 appears to also be this bug.  It contains a suggested fix, but I think this patch is a better ARI experience overall as it describes both channels involved.

By: Duncan (duncanvxt) 2022-11-01 16:32:40.267-0500

Thank you for this William!