Summary: | ASTERISK-27412: core: Audiohook freeing interpolated frame when it shouldn't. | ||||||||
Reporter: | Mikhail (j3st3r) | Labels: | patch pjsip | ||||||
Date Opened: | 2017-11-10 11:47:30.000-0600 | Date Closed: | 2017-11-16 07:59:50.000-0600 | ||||||
Priority: | Critical | Regression? | |||||||
Status: | Closed/Complete | Components: | Channels/chan_sip/General Core/General | ||||||
Versions: | 15.1.0 15.1.1 | Frequency of Occurrence | Frequent | ||||||
Related Issues: |
| ||||||||
Environment: | ubuntu server 16.04 x64 | Attachments: | ( 0) autosupport.tar.gz ( 1) ba.tar.gz ( 2) debug_log_123456 ( 3) jira_asterisk_27412_v15.1.patch ( 4) jira_asterisk_27413_v15.1.patch ( 5) valgrind.txt ( 6) valgrind3.txt ( 7) valgrind4.txt | ||||||
Description: | I've installed Asterisk 15.1.0 on Ubuntu server 16.04. I configured realtime sippeers (tried via odbc and res_config_mysql - all the same). And asterisk core crashes after 20-40 minutes of working.
System has 4gb of RAM and 4 gb swap: total used free shared buff/cache available Mem: 3849 438 1079 24 2330 3094 Swap: 3993 466 3527 processor is Intel(R) Core(TM) i3 CPU 560 @ 3.33GHz backtraces: https://drive.google.com/file/d/179L3aN2iey2nN_zdGKyJZ5EGNGGOpEHh/view?usp=sharing https://drive.google.com/file/d/1RuLZ-eqyhEyFYHc1bLANof1COuXfYKUq/view?usp=sharing | ||||||||
Comments: | By: Asterisk Team (asteriskteam) 2017-11-10 11:47:31.300-0600 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Richard Mudgett (rmudgett) 2017-11-10 12:15:20.518-0600 Please follow the Asterisk Issue Guidelines [1] and attach backtraces to the issue. Links to third party locations tend to vaporize and become invalid when someone eventually gets around to looking at them. That being said, the backtraces show memory corruption so you need to either run under valgrind [2] or with the MALLOC_DEBUG flag [3] enabled in the menuselect compiler options section. [1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines [2] https://wiki.asterisk.org/wiki/display/AST/Valgrind [3] https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag By: Mikhail (j3st3r) 2017-11-10 13:56:15.027-0600 Hello, Richard. I've started asterisk with valgrand, and the issue repeated, I attach valgrind.txt By: Richard Mudgett (rmudgett) 2017-11-10 16:41:46.670-0600 Realtime has nothing to do with this problem as the problem is in the media pathway. From the valgrind output it looks like someone is freeing a media frame when they shouldn't. The freed media frame is then cached in a thread local cache. However, since the frame is still being used it might get passed to another thread and freed in the other thread's frame cache. As a result all the frame caches can get poisoned. Another thing from the valgrind output is that it appears to be affecting a two party call with mixmonitor recording the call. Therefor, a simplified setup appears possible to replicate the problem as it could just be between chan_sip and mixmonitor. We need to know how your system is setup (Attach config files, etc.) [1] as someone will have to replicate this problem. Also debug logs [2] captured when the problem happens could be used to help simplify the setup. [1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines [2] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information By: Mikhail (j3st3r) 2017-11-11 01:47:57.362-0600 I attach debug log and autosupport script info. Asterisk is used for website callbacks (customer inserts his phone number into widget on a website, this information goes via AMI into Asterisk, and asterisk originates two legs of the call) By: Mikhail (j3st3r) 2017-11-11 04:46:13.720-0600 I've upgraded Asterisk to version 15.1.1 and have the same problem By: Richard Mudgett (rmudgett) 2017-11-11 14:02:09.930-0600 I created ASTERISK-27413 to help with finding the memory corruption bug you are having. That issue describes how to disable the media frame caching after applying the patch. [^jira_asterisk_27413_v15.1.patch] - This patch adds the cache_media_frames option to help with finding use after free of media frames. This patch applies to v15.1.x. Use {{patch -p1 -i jira_asterisk_27413_v15.1.patch}} to apply. Once you have done so, please reinstall asterisk, disable the media frame cache, recreate the issue under valgrind and attach the new output. By: Mikhail (j3st3r) 2017-11-12 12:49:38.522-0600 Surprisingly, but after applying the patch - Asterisk works (although I assumed that the patch is needed to find the problem, but does not solve it). ps aux |grep asterisk |grep -v grep asterisk 10530 12.6 10.4 1335724 410940 pts/3 Sl+ 12:16 72:27 /usr/bin/valgrind.bin --suppressions=/usr/src/asterisk-15.1.1/contrib/valgrind.supp --log-fd=9 asterisk -vvvvcg asterisk 10531 0.0 0.0 4224 660 pts/3 S+ 12:16 0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet 10530 cat /etc/asterisk/asterisk.conf |grep cache_media cache_media_frames=no asterisk -rx 'core show uptime' System uptime: 9 hours, 31 minutes, 37 seconds Last reload: 9 hours, 31 minutes, 37 seconds asterisk -rx 'core show calls' 0 active calls 560 calls processed I continue to observe. By: Richard Mudgett (rmudgett) 2017-11-12 16:41:35.950-0600 valgrind can prevent crashes by how it operates and still find things. The patch adds a new option to disable the media frames caches. The default of the new option has the caches still enabled so you need to explicitly set the option off in asterisk.conf as indicated by ASTERISK-27413. By: Mikhail (j3st3r) 2017-11-12 23:58:36.870-0600 Asterisk died after ~22 hours of work. At the end of CLI output - I have seen "FRACK!, Failed assertion bad magic number 0x0 for object 0xd206110 (0) at line 5522 in action_originate of manager.c Killed" valgrind3.txt attached. By: Mikhail (j3st3r) 2017-11-13 04:11:50.296-0600 here is one more output: {noformat} FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c [0/1921] [2017-11-13 13:04:35] #3: [0x4651e0] asterisk(__ao2_cleanup_debug+0x48) [0x4651e0] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c [2017-11-13 13:04:35] #4: [0x53fc99] asterisk() [0x53fc99] [2017-11-13 13:04:35] #5: [0x53fcf9] asterisk(ast_frame_free+0x2e) [0x53fcf9] [2017-11-13 13:04:35] #6: [0x5336a4] asterisk() [0x5336a4] [2017-11-13 13:04:35] #7: [0x533866] asterisk() [0x533866] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c [2017-11-13 13:04:35] #8: [0x4c7ff5] asterisk() [0x4c7ff5] [2017-11-13 13:04:35] #9: [0x4ca714] asterisk(ast_read+0x22) [0x4ca714] [2017-11-13 13:04:35] #10: [0x5356c2] asterisk() [0x5356c2] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 146 in format_cap_framed_destroy of format_cap.c [2017-11-13 13:04:35] #11: [0x535c44] asterisk(ast_waitstream+0x40) [0x535c44] [2017-11-13 13:04:35] #12: [0x1aff5f65] /usr/lib/asterisk/modules/app_playback.so(+0x2f65) [0x1aff5f65] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 6071 in _dtor_tmp_fmt of channel.c [2017-11-13 13:04:35] #13: [0x5b0a9d] asterisk(pbx_exec+0x119) [0x5b0a9d] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c [2017-11-13 13:04:35] #14: [0x59b7db] asterisk() [0x59b7db] [2017-11-13 13:04:35] #15: [0x59f6b6] asterisk(ast_spawn_extension+0x50) [0x59f6b6] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 5616 in ast_set_write_format of channel.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c [2017-11-13 13:04:35] #16: [0x5a03ec] asterisk() [0x5a03ec] [2017-11-13 13:04:35] #17: [0x5a1e6b] asterisk(ast_pbx_run_args+0x9a) [0x5a1e6b] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c [2017-11-13 13:04:35] #18: [0x5a1ea9] asterisk(ast_pbx_run+0x1d) [0x5a1ea9] [2017-11-13 13:04:35] #19: [0x5aabdf] asterisk() [0x5aabdf] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c [2017-11-13 13:04:35] #20: [0x6447a6] asterisk() [0x6447a6] [2017-11-13 13:04:24] ERROR[27986][C-00000204]: frame.c:347 ast_frdup: FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) [2017-11-13 13:04:35] Got 19 backtrace records [2017-11-13 13:04:35] #0: [0x6478e8] asterisk(__ast_assert_failed+0x8d) [0x6478e8] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c [2017-11-13 13:04:35] #1: [0x464121] asterisk() [0x464121] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c [2017-11-13 13:04:35] #2: [0x4649c6] asterisk(__ao2_ref+0xa1) [0x4649c6] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 157 in __frame_free of frame.c FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c [2017-11-13 13:04:35] #3: [0x5403ea] asterisk(ast_frdup+0x291) [0x5403ea] [2017-11-13 13:04:35] #4: [0x4c043c] asterisk() [0x4c043c] FRACK!, Failed assertion bad magic number 0x0 for object 0xd24a180 (0) at line 347 in ast_frdup of frame.c [2017-11-13 13:04:35] #5: [0x4c0a7a] asterisk(ast_queue_frame+0x2a) [0x4c0a7a] {noformat} valgrind4.txt is attached By: Richard Mudgett (rmudgett) 2017-11-13 18:22:56.481-0600 [^valgrind3.txt] and [^valgrind4.txt] are showing three things: # 0x5BDFEC: normalise_history (plc.c:99) is using memcpy() when it should use memmove() because the memory blocks overlap. This is minor and not really causing a problem. # 0x46E17A: audio_audiohook_write_list (audiohook.c:953) is freeing a frame it shouldn't. Apparently middle_frame == start_frame. In that case it shouldn't free it. This is the main memory corruption mechanism seen in the backtraces. The regression change was introduced by ASTERISK-26926. # Chan_sip is destroying its pvt data structure before it is time due to a ref counting error. This is causing the FRACKs and many invalid reads/writes after the FRACKs. Be aware that chan_sip is extended support so someone from the community needs to address this problem. The first two I'm going to consider as what needs to be fixed to complete this issue. The third issue with chan_sip is reported in other issues and is a duplicate. However, the valgrind output is very useful for the chan_sip problem. By: Richard Mudgett (rmudgett) 2017-11-13 18:58:18.204-0600 Oops. The FRACKs you pasted are more likely associated with the bad free of frames and not from chan_sip. However, there is a ref counting problem in chan_sip shown by the valgrind output. By: Mikhail (j3st3r) 2017-11-14 01:29:00.445-0600 Hi, Richard. Thank you for your answers and giving time to my problem. I'll change the channel's driver to pjsip. Can you orient in time when the remaining bugs will be fixed and what do I need to do? By: Mikhail (j3st3r) 2017-11-14 03:04:17.733-0600 Today I've seen this behavior: Asterisk is working, but calls can not be originated, and in the CLI (asterisk is working via valgrind) I see: {noformat} *CLI> [2017-11-14 12:02:08] == Manager 'callback' logged on from 127.0.0.1 FRACK!, Failed assertion bad magic number 0x0 for object 0xd2a9b70 (0) at line 5522 in action_originate of manager.c [2017-11-14 12:02:09] ERROR[9919]: manager.c:5522 action_originate: FRACK!, Failed assertion bad magic number 0x0 for object 0xd2a9b70 (0) FRACK!, Failed assertion bad magic number 0x0 for object 0xd2a9b70 (0) at line 146 in format_cap_framed_destroy of format_cap.c [2017-11-14 12:02:11] Got 15 backtrace records *CLI> [2017-11-14 12:02:11] #0: [0x6478e8] asterisk(__ast_assert_failed+0x8d) [0x6478e8] [2017-11-14 12:02:11] #1: [0x464121] asterisk() [0x464121] [2017-11-14 12:02:11] #2: [0x4649c6] asterisk(__ao2_ref+0xa1) [0x4649c6] [2017-11-14 12:02:11] #3: [0x53d2ef] asterisk(__ast_format_cap_append+0xa7) [0x53d2ef] [2017-11-14 12:02:11] #4: [0x56ffd0] asterisk() [0x56ffd0] [2017-11-14 12:02:11] #5: [0x574371] asterisk() [0x574371] [2017-11-14 12:02:11] #6: [0x574d66] asterisk() [0x574d66] [2017-11-14 12:02:11] #7: [0x575210] asterisk() [0x575210] *CLI> [2017-11-14 12:02:11] #8: [0x62ec71] asterisk() [0x62ec71] [2017-11-14 12:02:11] #9: [0x6447a6] asterisk() [0x6447a6] [2017-11-14 12:02:10] ERROR[9919]: translate.c:1329 ast_translator_best_choice: Cannot determine best translation path since one capability supports no formats [2017-11-14 12:02:10] WARNING[9919]: channel.c:6093 request_channel: No translator path exists for channel type Local (native (codec2|g723|ulaw|alaw|gsm|g726|g726aal2|adpcm|(null)|slin|slin|slin|slin|slin|slin|slin|slin|lpc10|g729|speex|speex|speex|ilbc|g722|siren7|siren14|testlaw|g719|opus|jpeg|png|h261|h263|h263p|h264|mpeg4|vp8|vp9|red|t140|t38|silk|silk|silk|silk)) to ((null)) [2017-11-14 12:02:11] ERROR[9919]: format_cap.c:146 format_cap_framed_destroy: FRACK!, Failed assertion bad magic number 0x0 for object 0xd2a9b70 (0) [2017-11-14 12:02:11] Got 20 backtrace records [2017-11-14 12:02:11] #0: [0x6478e8] asterisk(__ast_assert_failed+0x8d) [0x6478e8] [2017-11-14 12:02:11] #1: [0x464121] asterisk() [0x464121] [2017-11-14 12:02:11] #2: [0x4649c6] asterisk(__ao2_ref+0xa1) [0x4649c6] [2017-11-14 12:02:11] #3: [0x4651e0] asterisk(__ao2_cleanup_debug+0x48) [0x4651e0] [2017-11-14 12:02:11] #4: [0x53ce2d] asterisk() [0x53ce2d] [2017-11-14 12:02:11] #5: [0x464f97] asterisk(__ao2_ref+0x672) [0x464f97] [2017-11-14 12:02:11] #6: [0x53cb6e] asterisk() [0x53cb6e] [2017-11-14 12:02:11] #7: [0x464f97] asterisk(__ao2_ref+0x672) [0x464f97] [2017-11-14 12:02:11] #8: [0x4651e0] asterisk(__ao2_cleanup_debug+0x48) [0x4651e0] [2017-11-14 12:02:11] #9: [0x571f7d] asterisk() [0x571f7d] [2017-11-14 12:02:11] #10: [0x574371] asterisk() [0x574371] [2017-11-14 12:02:11] #11: [0x574d66] asterisk() [0x574d66] [2017-11-14 12:02:11] #12: [0x575210] asterisk() [0x575210] [2017-11-14 12:02:11] #13: [0x62ec71] asterisk() [0x62ec71] [2017-11-14 12:02:11] #14: [0x6447a6] asterisk() [0x6447a6] [2017-11-14 12:02:11] == Manager 'callback' logged off from 127.0.0.1 [2017-11-14 12:02:12] == Manager 'callback' logged on from 127.0.0.1 {noformat} By: Richard Mudgett (rmudgett) 2017-11-14 08:34:42.719-0600 The issue is accepted and we have enough information to fix the first two issues. There should be enough information for the chan_sip issue as well. I cannot give a time frame on when the issue will be fixed. The FRACK backtraces would be more useful if you had BETTER_BACKTRACES enabled in menuselect and debug symbols available. From the valgrind output these FRACKs would appear to be associated with the ref counting issue with chan_sip. By: Mikhail (j3st3r) 2017-11-14 08:38:27.537-0600 No problem, I'll recompile asterisk with these options. How should I start it after recompiling? Via valgrind or as usual? By: Richard Mudgett (rmudgett) 2017-11-14 08:43:52.395-0600 You have already collected the information needed from valgrind so you don't need to run under valgrind. By: Mikhail (j3st3r) 2017-11-14 08:49:32.172-0600 Got it, thank you for clarification By: Richard Mudgett (rmudgett) 2017-11-14 18:12:22.760-0600 [^jira_asterisk_27412_v15.1.patch] - This patch should fix the first two issues found by valgrind. The overlapping blocks in plc.c and the incorrect fame free in the audiohook code. patch -p1 -i jira_asterisk_27412_v15.1.patch Let me know if it works. By: Mikhail (j3st3r) 2017-11-15 04:27:28.437-0600 >The issue is accepted and we have enough information to fix the first two >issues. There should be enough information for the chan_sip issue as well. I >cannot give a time frame on when the issue will be fixed. >The FRACK backtraces would be more useful if you had >BETTER_BACKTRACES enabled in menuselect and debug symbols >available. From the valgrind output these FRACKs would appear to be >associated with the ref counting issue with chan_sip. Backtrace is ataached - ba.tar.gz By: Mikhail (j3st3r) 2017-11-15 04:28:56.859-0600 > Let me know if it works. Richard, thanks for patch. I'll notify you if the issue persists after patching By: Richard Mudgett (rmudgett) 2017-11-15 07:59:28.613-0600 The backtrace you attached today [^ba.tar.gz] has no symbols so it is not usable. However, it looks like you have MALLOC_DEBUG enabled because it is trying to access address 0xdeaddeaddeaddexx. By: Mikhail (j3st3r) 2017-11-15 08:00:48.222-0600 Could you please explain me how to enable those symbols? By: Richard Mudgett (rmudgett) 2017-11-15 08:17:29.488-0600 Strange. All the other backtraces and valgrind output you have posted has had symbols. Why not this one? The symbols should be there if you built and installed it yourself. Did you build and install this version yourself? By: Mikhail (j3st3r) 2017-11-15 08:21:03.928-0600 Yes, I did. The difference between configurations is choosing options DEBUG_THREADS, DEBUG_FD_LEAKS and MALLOC_DEBUG in menuselect By: Mikhail (j3st3r) 2017-11-15 14:28:08.228-0600 surprisingly, but after parching 15.1.1 I've got this error: rtp_engine.c:447 ast_rtp_instance_new: No RTP engine was found. Do you have one loaded? but rtp modules are loaded... Module Description Use Count Status Support Level bridge_native_rtp.so Native RTP bridging module 0 Running core chan_rtp.so RTP Media Channel 0 Running core res_rtp_multicast.so Multicast RTP Engine 0 Running core res_srtp.so Secure RTP (SRTP) 0 Running core 4 modules loaded By: Richard Mudgett (rmudgett) 2017-11-15 14:46:21.879-0600 DEBUG_THREADS, DEBUG_FD_LEAKS and MALLOC_DEBUG alter the public API to add debugging parameters. Any externally compiled modules like codec_opus or res_digium_phone cannot be used as they can result in crashes. By: Richard Mudgett (rmudgett) 2017-11-15 16:04:11.386-0600 We have confirmation that the memory corruption caused by incorrectly freeing the frame in audiohook.c is a duplicate of ASTERISK-27238. By: Friendly Automation (friendly-automation) 2017-11-16 07:59:51.454-0600 Change 7231 merged by Joshua Colp: audiohook.c: Fix freeing a frame and still using it. [https://gerrit.asterisk.org/7231|https://gerrit.asterisk.org/7231] By: Friendly Automation (friendly-automation) 2017-11-16 08:17:18.865-0600 Change 7232 merged by Joshua Colp: audiohook.c: Fix freeing a frame and still using it. [https://gerrit.asterisk.org/7232|https://gerrit.asterisk.org/7232] By: Friendly Automation (friendly-automation) 2017-11-16 08:27:58.024-0600 Change 7233 merged by Joshua Colp: audiohook.c: Fix freeing a frame and still using it. [https://gerrit.asterisk.org/7233|https://gerrit.asterisk.org/7233] |