Summary: ASTERISK-28900: res_fax: Double frame free when gateway in use with off-nominal format usage
Reporter: Gregory Massel (gmza)
Labels: fax patch
Date Opened: 2020-05-16 07:56:52
Date Closed: 2020-06-05 13:17:10
Versions: 16.10.0 Frequency of
Environment: Ubuntu 18.04.4 LTS, Asterisk 16.10.0, kernels 5.0.0-29-generic and 5.0.0-37-generic, Intel E5-1680 and E3-1271. No DAHDI.
Attachments:
( 0) 2020-05-12_Call_that_crashed_Asterisk_-_client_leg.pcap
( 1) 2020-05-12_Call_that_crashed_Asterisk_-_SBC_leg.pcap
( 2) 2020-05-13_Call_that_crashed_Asterisk_-_client_leg.pcapng
( 3) 2020-05-13_Call_that_crashed_Asterisk_-_SBC_leg.pcapng
( 4) 2020-05-18_Yeastar_call_new_firmware_does_NOT_crash_Asterisk_-_client_leg.pcap
( 5) 2020-05-18_Yeastar_call_new_firmware_does_NOT_crash_Asterisk_-_SBC_leg.pcap
( 6) 2020-05-20_08h57_0153044300_to_0865545610_-_client_leg.pcap
( 7) 2020-05-20_08h57_0153044300_to_0865545610_-_SBC_leg.pcap
( 8) 2020-05-20_09h11_0153044300_to_0865545610_-_client_leg.pcap
( 9) 2020-05-20_09h11_0153044300_to_0865545610_-_SBC_leg.pcap
(10) ASTERISK-28900.diff
(11) ASTERISK-28900-2.diff
(12) core-brief.txt
(13) core-brief.txt
(14) core-full.txt
(15) core-full.txt
(16) core-info.txt
(17) core-info.txt
(18) core-locks.txt
(19) core-locks.txt
(20) core-thread1.txt
(21) core-thread1.txt
(22) detailed-log.txt
(23) ExtractedBacktraceDetails.txt
Description: When the T.38 gateway is enabled on two PJSIP endpoints [ set_var=FAXOPT(gateway)=yes,15 ], the called endpoint initiates a T.38 re-INVITE, and the calling endpoint does NOT support T.38, Asterisk tries to gateway between the audio (RTP) and UDPTL (T.38). In most instances, this works fine. However, when the calling party involves a Yeastar PBX device, the call reliably (every time) causes the Asterisk box performing the T.38 gatewaying to deadlock immediately and, shortly thereafter, Asterisk core dumps.

At present this only happens if the calling device is a Yeastar PBX, which seems to indicate that a corrupt, malformed or missing RTP frame generated by that device is what is crashing Asterisk.

In many instances the core dump is zero bytes or the backtrace is largely unusable. However, I have managed to extract a few usable backtraces as well as packet captures of the calls that trigger the crash.
Comments:
By: Asterisk Team (asteriskteam) 2020-05-16 07:56:53.943-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Gregory Massel (gmza) 2020-05-16 08:00:14.546-0500

2020-05-13 07h07 incident

By: Gregory Massel (gmza) 2020-05-16 08:02:28.746-0500

2020-05-06 09h11 incident

By: Gregory Massel (gmza) 2020-05-16 08:09:53.386-0500

packet captures of two calls that triggered these crashes

By: Gregory Massel (gmza) 2020-05-16 08:18:26.963-0500

Some further notes:
1. Although the calls in the PCAP were G.729a codec from the Yeastar device, the issue wasn't with transcoding G.729 because this issue does NOT occur if I make a test call from a regular voice handset to the same destination number, irrespective of whether the handset is set to use G.729 or G.711 A-law.
2. I have identified six originating sources so far that have caused this issue and each and every one involved a Yeastar switchboard. I've tried to replicate it with other hardware and no other hardware seems to trigger the issue.
3. Unfortunately I don't have a Yeastar device of my own to test with, however, I have customers using these. I am arranging for one to upgrade the Yeastar firmware to the very latest and will re-test to see if there has been any fix on their side.
4. The backtrace seems to indicate a memory corruption when dealing with the RTP frame.
5. It is important to fix this from the Asterisk side because, if the source of the issue is weaponised, it could create a security risk.

By: Joshua C. Colp (jcolp) 2020-05-18 04:36:24.049-0500

Please attach a packet capture of a working case, as well as an Asterisk console log with full debug enabled.

By: Joshua C. Colp (jcolp) 2020-05-18 05:22:12.674-0500

Also, while you mention the issue isn't with transcoding, which codec module is in use?

By: Gregory Massel (gmza) 2020-05-18 06:55:34.659-0500

The attached calls are from the same endpoint after upgrading the Yeastar PBX device to the latest firmware. Firmware Yeastar S20- does NOT cause Asterisk to crash so the attached PCAPs can be regarded as the working case to compare against.

By: Gregory Massel (gmza) 2020-05-18 06:56:41.488-0500

This particular caller uses G.729, however, I must stress that we've had incidents with callers using both G.729 and G.711 A-law.

By: Joshua C. Colp (jcolp) 2020-05-18 07:01:53.137-0500

The caller is using alaw in this instance, not g729 as you mention. I was hoping to see the same codec in use.

By: Gregory Massel (gmza) 2020-05-18 07:04:35.026-0500

Yeastar firmware ChangeLog can be found at: https://help.yeastar.com/en/s-series/topic/v30.13.0.34.html
The changes I can see that may be relevant are:
- Fixed the Fax issue: The Fax transmission would be failed when T.38 was enabled for the trunk.
- Fixed the SIP Trunk issue: There would be no sound through the SIP trunk on which T.38 was enabled when the PBX received a re-INVITE packet that contained no image media.

With regard to the console, we process approximately 10 calls per second on these systems, so the console logs a LOT of information.

For the incident of 2020-05-13, the last information logged was:
[May 13 13:26:38] VERBOSE[17038][C-00004329] app_dial.c: PJSIP/telkom-jdf-00008c36 answered PJSIP/switch_tissip1-00008c35
[May 13 13:26:38] VERBOSE[17057][C-00004329] bridge_channel.c: Channel PJSIP/telkom-jdf-00008c36 joined 'simple_bridge' basic-bridge <b3e435a0-d898-4ba6-b970-e3
[May 13 13:26:38] VERBOSE[17038][C-00004329] bridge_channel.c: Channel PJSIP/switch_tissip1-00008c35 joined 'simple_bridge' basic-bridge <b3e435a0-d898-4ba6-b97

By: Joshua C. Colp (jcolp) 2020-05-18 07:08:20.084-0500

That's not really enough information, which is why I was hoping for the debug log, as it states exactly what is being negotiated and what is happening. We can see what we can do with the information available, but when T.38 is involved at all, that drastically limits what we can do for reproduction, so we rely more on the logging.

By: Joshua C. Colp (jcolp) 2020-05-18 07:12:09.350-0500

Can you provide an actual core dump with binaries to examine? This can be done using the ast_coredumper utility[1] by passing "--tarball-coredumps" to it.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationAfterACrash

By: Gregory Massel (gmza) 2020-05-18 09:23:19.854-0500

I cannot attach files > 50MB. The tarballs are 100MB and 147MB respectively.

By: Joshua C. Colp (jcolp) 2020-05-18 09:30:58.123-0500

Can you place them on a Google Drive or something similar to download?

By: Gregory Massel (gmza) 2020-05-18 09:47:02.597-0500


By: Joshua C. Colp (jcolp) 2020-05-18 09:55:16.914-0500

Which Linux distribution do you use? I want the environment to be the best possible match for examining the core dump.

By: Gregory Massel (gmza) 2020-05-18 09:59:11.174-0500

Ubuntu 18.04.4 LTS
Linux tissbc2 5.0.0-29-generic #31~18.04.1-Ubuntu SMP Thu Sep 12 18:29:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
(We've got another machine running 5.0.0-37-generic with the same issue)
Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz (on the machine that these coredumps were taken, but the other machine is an E5-1680)

By: Joshua C. Colp (jcolp) 2020-05-19 08:21:36.941-0500

Please try this attached patch.

By: Joshua C. Colp (jcolp) 2020-05-19 08:58:58.291-0500

Attaching all of the useful backtrace information I extracted.

By: Gregory Massel (gmza) 2020-05-19 09:41:54.011-0500

Thank you. I am trying to get hold of a customer with the applicable Yeastar PBX and older firmware version to perform before-and-after testing. I am trying to arrange this ASAP, however, owing to the time-zone difference and it being almost close of office hours, I will probably only have a conclusive test result tomorrow.

By: Gregory Massel (gmza) 2020-05-20 02:41:15.662-0500

I have tested and the patch appears to fix the issue.
I managed to replicate the issue, crashing Asterisk, then applied the patch, and was unable to replicate the crash thereafter.
In case this was just luck and coincidence, I have still recorded logs (with debug 9 and verbose 9) and PCAPs from before and after, and am including these for you.

By: Gregory Massel (gmza) 2020-05-20 02:42:05.819-0500

Log file extract with verbose 9 and debug 9 showing log entries prior to crash

By: Gregory Massel (gmza) 2020-05-20 02:43:18.611-0500

PCAPs from before-and-after. 08h57 was before the patch was applied (resulting in a crash) and 09h11 was after the patch was applied (no crash).

By: Gregory Massel (gmza) 2020-05-20 02:46:04.626-0500

I will also keep running live traffic throughout the day with the patch applied to see if any production traffic manages to cause a crash. However, from the testing, it does appear that this patch is successful. Thank you for the prompt assistance and resolution!

By: Joshua C. Colp (jcolp) 2020-05-20 04:00:32.141-0500

While it is still undecided, I don't think this will end up being a security issue either. You need control over both ends, you need someone to have the ability to send calls through your system in specific ways, you need specific codec conditions (that I haven't been able to determine or reproduce even with the provided information), you need fax gateway in use, and you need T.38 negotiation to have succeeded on one side and failed on the other.

By: Joshua C. Colp (jcolp) 2020-05-20 05:30:13.500-0500

Once you're through further testing just comment.

By: Gregory Massel (gmza) 2020-05-25 12:58:34.779-0500

I've had three production machines running for 4 days now with the patch and all have been stable.

By: Joshua C. Colp (jcolp) 2020-05-27 04:20:59.786-0500

Can you double check with this final patch? It's only in one direction that the change needs to be done.

By: Joshua C. Colp (jcolp) 2020-06-01 11:09:20.364-0500

[~gmza] Have you had a chance to test this out? I also asked others and we've agreed not to treat this as a security issue. To that end on Wednesday most likely I will get this up for review and such.

By: Friendly Automation (friendly-automation) 2020-06-05 13:17:11.538-0500

Change 14454 merged by Friendly Automation:
res_fax: Don't consume frames given to fax gateway on write.


By: Friendly Automation (friendly-automation) 2020-06-05 13:21:21.736-0500

Change 14483 merged by Friendly Automation:
res_fax: Don't consume frames given to fax gateway on write.


By: Friendly Automation (friendly-automation) 2020-06-05 13:25:04.239-0500

Change 14484 merged by Friendly Automation:
res_fax: Don't consume frames given to fax gateway on write.


By: Friendly Automation (friendly-automation) 2020-06-05 13:36:10.387-0500

Change 14482 merged by Kevin Harwell:
res_fax: Don't consume frames given to fax gateway on write.
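
Note: The merged change title ("Don't consume frames given to fax gateway on write") and the issue summary ("Double frame free") describe a frame-ownership problem. The following is a minimal, self-contained sketch of that general pattern in C, not the actual res_fax code. Every name in it (struct frame, frame_release, gateway_process, write_hook_consuming, write_hook_borrowing, write_path) is hypothetical and chosen for illustration only. It assumes the caller of the write path owns the frame and releases it after the hook returns, so a hook that also releases the frame causes a double release. Releases are counted rather than passed to free(), so the effect can be observed safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a media frame; NOT Asterisk's struct ast_frame. */
struct frame {
    int release_count; /* number of times the frame was released */
    int payload;
};

/* Counts releases instead of calling free(), so a double release is
 * observable without invoking undefined behaviour. */
static void frame_release(struct frame *f)
{
    f->release_count++;
}

/* Hypothetical gateway consumer; here it only reads the payload. */
static int gateway_process(const struct frame *f)
{
    return f->payload;
}

/* Problem pattern: the write hook releases a frame it was only lent. */
static void write_hook_consuming(struct frame *f)
{
    (void)gateway_process(f);
    frame_release(f); /* the caller will release it again */
}

/* Safer pattern: the write hook uses the frame and leaves ownership
 * with the caller. */
static void write_hook_borrowing(struct frame *f)
{
    (void)gateway_process(f);
}

/* Assumed write path: the caller owns the frame and releases it once
 * the hook has run. Returns how many times the frame was released. */
static int write_path(void (*hook)(struct frame *))
{
    struct frame f = { 0, 42 };

    assert(hook != NULL);
    hook(&f);
    frame_release(&f);
    return f.release_count;
}
```

In this model, write_path(write_hook_consuming) returns 2 (the frame is released twice), while write_path(write_hook_borrowing) returns 1. With real heap-allocated frames, the first case would be a double free, which typically shows up as memory corruption or a crash.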