Summary: | ASTERISK-26978: rtp: Crash in ast_rtp_codecs_payload_code() | ||||||
Reporter: | Ross Beer (rossbeer) | Labels: | |||||
Date Opened: | 2017-05-03 08:07:51 | Date Closed: | 2017-07-10 10:53:40 | ||||
Priority: | Major | Regression? | |||||
Status: | Closed/Complete | Components: | Core/RTP | ||||
Versions: | GIT | Frequency of Occurrence | Frequent | ||||
Related Issues: |
| ||||||
Environment: | Fedora 23 | Attachments: | ( 0) backtrace_20170502_clean.txt ( 1) backtrace_20170511_clean.txt ( 2) backtrace_20170515_clean.txt ( 3) backtrace_20170518_02_Clean.txt ( 4) backtrace_20170524_clean.txt ( 5) backtrace_20170605_clean.txt ( 6) backtrace_20170607_8_CLEAN.txt | ||||
Description: | A crash has been introduced in a recent GIT commit:
[5561] rtp_engine.c: Fix deadlock potential copying RTP payload maps. Please see attached backtrace, if required I can provide an unredacted trace. | ||||||
Comments: | By: Asterisk Team (asteriskteam) 2017-05-03 08:07:52.168-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Richard Mudgett (rmudgett) 2017-05-03 14:00:00.272-0500 The crash is not even in the function changed by gerrit review 5561. The review changed ast_rtp_codecs_payloads_copy() while the crash is in ast_rtp_codecs_payload_code(). The crash appears to be because of a stale bridged RTP instance pointer pointing to freed memory. Using MALLOC_DEBUG could confirm this. How it could happen is not readily apparent in the attached backtrace nor in the private fuller backtraces. By: Ross Beer (rossbeer) 2017-05-03 15:38:04.056-0500 Thank you for looking at the backtraces, I am now recompiling with MALLOC_DEBUG to see if I can catch the issue. By: Ross Beer (rossbeer) 2017-05-05 05:23:38.727-0500 The MALLOC_DEBUG has shown the following memory corruption: WARNING: Memory corrupted after free of 0x7fb7e0186330 allocated at rtp_engine.c ast_rtp_instance_new() line 446 WARNING: Memory corrupted after free of 0x7f611400d5f0 allocated at rtp_engine.c ast_rtp_instance_new() line 446 This also causes a deadlock. I will send the unredacted backtrace to you directly. By: Richard Mudgett (rmudgett) 2017-05-05 11:37:20.048-0500 The new backtrace isn't showing anything different than the one two days ago. MALLOC_DEBUG does give more of a corroborating hint that the rtp instance bridged with pointer has already been freed. It says we have corrupted an already freed rtp instance. I was more hoping for a 0xdeaddead crash as those tend to be very helpful. The new backtrace and the one two days ago show that the called channel has left the native rtp bridge without clearing the caller's rtp instance bridged with pointer. The caller channel then tries to pass a rtp frame to the dead callee channel's rtp instance and deadlocks on the already freed memory. What I suspect is happening is when a channel leaves the native rtp bridge it cannot find the rtp instance of one of the channels and thus cannot clear the bridged with pointer for that channel. I think what the native rtp bridge needs to do is save a pointer and ao2 ref of each rtp instance when the second channel gets in the bridge. Then when one of the channels leaves the bridge we can ensure that both rtp instances will get their bridged with pointer cleared. By: Ross Beer (rossbeer) 2017-05-09 05:21:58.501-0500 Does the patch https://gerrit.asterisk.org/#/c/5601/ attempt to resolve this issue? By: Richard Mudgett (rmudgett) 2017-05-09 09:47:35.607-0500 No it does not. If it were an attempt to resolve the issue I would have tagged it as such. By: Ross Beer (rossbeer) 2017-05-11 10:35:47.267-0500 Further backtrace attached showing an issue here: {noformat} #0 0x000000000059b3b8 in ast_rtp_codecs_payload_code (codecs=0x7f2d8e8586e8, asterisk_format=1, format=0x226bf08, code=0) at rtp_engine.c:1019 1019 type = AST_VECTOR_GET(&codecs->payloads, i); {noformat} By: Ross Beer (rossbeer) 2017-05-15 10:01:46.581-0500 Please see attached a further backtrace. This is all related to the 'ast_rtp_codecs_payload_code' By: Ross Beer (rossbeer) 2017-05-18 09:37:35.300-0500 Another crash in RTP, however, this time in res_rtp_asterisk.c:4776 #0 0x00007fbcc62c2619 in bridge_p2p_rtp_write (hdrlen=12, len=172, rtpheader=0x7fbd1c3440b0, instance1=0x7fbd342d46d0, instance=0x7fbd1c342b50) at res_rtp_asterisk.c:4776 By: Ross Beer (rossbeer) 2017-05-24 04:02:00.467-0500 Further, crashes, this is happening multiple times a day. Would adding the following resolve the issue? Its an additional NULL check on the codec. #0 0x000000000051f883 in ast_format_cmp (format1=0x2bf73b8, format2=0x151) at format.c:247 {noformat} if (format1->codec == NULL || format2->codec == NULL) { return AST_FORMAT_CMP_NOT_EQUAL; } {noformat} By: Richard Mudgett (rmudgett) 2017-05-24 11:41:43.682-0500 [~rossbeer] I've outlined what I think is needed to fix the issue on May 5 above. When the channels join the native rtp bridge the bridge technology needs to save a pointer with a ref to the rtp instance structure for both channels in the bridge. Then when a channel leaves the bridge the rtp instance's bridged with pointer can be guaranteed to be cleared. As it is now when a channel leaves the bridge there is no guarantee that the rtp instance's bridged with pointer gets cleared. Somehow both channel's rtp instance pointers are not being found so one of the rtp instance's bridged with pointer is not being cleared. As a result, the rtp code tries to natively bridge a frame to a destroyed rtp instance and deadlocks on a destroyed lock. By: Ross Beer (rossbeer) 2017-06-05 05:46:38.072-0500 This issue is still happening on the latest GIT version. I have put out a bounty with no takers. This issue causes Asterisk to crash multiple times a day. By: Ross Beer (rossbeer) 2017-06-07 10:06:41.717-0500 Uploading another backtrace. By: George Joseph (gjoseph) 2017-06-08 08:07:04.614-0500 Do you have the exact asterisk version string for yesterday's backtrace? By: Ross Beer (rossbeer) 2017-06-08 08:39:49.353-0500 The version was Asterisk GIT-13-13.15.0-rc1-196-g3e8eea0M with the TLS patch from ASTERISK-27001 for pjsip |