Summary: | ASTERISK-30432: res_speech_aeap: Tight loop; high CPU usage | ||
Reporter: | Peter Howell (wpeterhowell) | Labels: | |
Date Opened: | 2023-02-20 09:25:09.000-0600 | Date Closed: | 2023-02-22 09:29:11.000-0600 |
Priority: | Minor | Regression? | |
Status: | Closed/Complete | Components: | Resources/NewFeature |
Versions: | 18.16.0 | Frequency of Occurrence | Occasional |
Related Issues: | |||
Environment: | CentOS-7 3.10.0-1160.76.1.el7.x86_64 | Attachments: | ( 0) asterisk-open-files.txt ( 1) core-asterisk-running-2023-02-20T17-18-13Z-brief.txt ( 2) core-asterisk-running-2023-02-20T17-18-13Z-full.txt ( 3) core-asterisk-running-2023-02-20T17-18-13Z-info.txt ( 4) core-asterisk-running-2023-02-20T17-18-13Z-locks.txt ( 5) core-asterisk-running-2023-02-20T17-18-13Z-thread1.txt ( 6) dump.pcap |
Description: | We see that on occasion closed AEAP connections to Asterisk are not closing. We are sending an OP_CLOSE before closing the connection, but it seems under certain conditions, the connection does not close. At this point, Asterisk goes into a tight loop on these zombie connections and the result is high CPU usage. | ||
Comments: | By: Asterisk Team (asteriskteam) 2023-02-20 09:25:14.972-0600 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur. Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/]. By: Peter Howell (wpeterhowell) 2023-02-20 09:27:12.090-0600 The aeap process is listening on 9099 -- you'll see two open connections that we're closed on our side in the attached file. By: Peter Howell (wpeterhowell) 2023-02-20 09:42:12.164-0600 This is a network capture of a session that put Asterisk into a tight loop. By: Joshua C. Colp (jcolp) 2023-02-20 10:46:48.248-0600 We require additional debug to continue with triage of your issue. Please follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis. Thanks! [1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information By: Joshua C. Colp (jcolp) 2023-02-20 10:47:07.400-0600 Additionally, a deadlock backtrace would also be useful to see what the threads are doing. By: Peter Howell (wpeterhowell) 2023-02-20 11:56:41.706-0600 Could not upload the main file, as it's too large. By: Joshua C. Colp (jcolp) 2023-02-21 08:56:03.702-0600 Which file are you referring to? The debug log? By: Peter Howell (wpeterhowell) 2023-02-21 09:18:43.845-0600 The core-asterisk-running-2023-02-20T17-18-13Z file as part of that set is 668MB in size -- it wouldn't let me upload it. By: Joshua C. Colp (jcolp) 2023-02-21 09:22:57.759-0600 Ah, that is not needed at this time. A debug log however is. I can't pick out anything specifically from the backtrace, except to show that it's polling on the socket - so it's possible there is an off-nominal there where it should be returning but isn't. A debug log may elaborate further on it. By: Peter Howell (wpeterhowell) 2023-02-21 10:40:28.792-0600 I had configured this: debug_log_30432 => notice,warning,error,debug,verbose,dtmf but, the file seems to have no debug in it? By: Joshua C. Colp (jcolp) 2023-02-21 10:52:31.838-0600 Did you do a logger reload? Are your logging levels set? By: Peter Howell (wpeterhowell) 2023-02-21 11:53:13.247-0600 It seems it logged debug initially, but after that, it stopped for some reason. On the lastest test, we did notice this: > core show channels verbose Channel Context Extension Prio State Application Data CallerID Duration Accountcode PeerAccount BridgeID PJSIP/BULKVS-0000000 yr-play-speech s 4 Up Playback custom/WellSaidLabs/Ava/1 15303053339 00:12:45 vsr and a request hangup doesn't kill this, so I'm wondering if the loop could somehow be associated with this channel that won't die? By: Peter Howell (wpeterhowell) 2023-02-21 15:24:19.556-0600 In further testing, it does seem that this channel stays up, so maybe connected to not releasing the AEAP connection? Since it's an abrupt hangup that seems to cause this on the inbound caller, maybe we should try a hangup hook with SpeechDestroy()? By: Joshua C. Colp (jcolp) 2023-02-21 15:46:31.145-0600 What is the specific dialplan that goes with this? On further inspection of the backtrace it makes it look like you are executing Playback() from inside the 'h' extension which is very a much a can't do/shouldn't do. By: Peter Howell (wpeterhowell) 2023-02-21 15:51:19.652-0600 It's played in a GoSub routine; not in a hangup handler. Can we call SpeechDestroy in a hangup handler? Since the call ends abruptly, this isn't called, so I'm wondering if calling it in a hangup handler is possible and might clean things up. By: Joshua C. Colp (jcolp) 2023-02-21 15:56:45.460-0600 This is assuming AEAP is the source of the issue. I'm not 100% certain of that, because of this thread: {code} Thread 67 (Thread 0x7f3656b19700 (LWP 26451)): #0 __ast_frdup (f=0x7f3648001970, file=0x6632f2 "file.c", line=921, func=0x663c77 <__PRETTY_FUNCTION__.18977> "read_frame") at frame.c:329 #1 0x00000000004f4c75 in __ast_frisolate (fr=fr@entry=0x7f3648001970, file=file@entry=0x6632f2 "file.c", line=line@entry=921, func=func@entry=0x663c77 <__PRETTY_FUNCTION__.18977> "read_frame") at frame.c:206 #2 0x00000000004e6f4d in read_frame (whennext=0x7f3656b157b0, s=0x7f3648001910) at file.c:921 #3 ast_readaudio_callback (s=0x7f3648001910) at file.c:960 #4 0x00000000004ec785 in ast_playstream (s=0x7f3648001910) at file.c:1066 #5 ast_streamfile (chan=chan@entry=0x7f36500414e0, filename=filename@entry=0x7f3656b15860 "custom/WellSaidLabs/Ava/1b8391878eeefd208e106401bc4795a1", preflang=0x7f365001517a "en-US") at file.c:1337 #6 0x00007f3694edf80e in playback_exec (chan=0x7f36500414e0, data=0x7f3656b16ab0 "custom/WellSaidLabs/Ava/1b8391878eeefd208e106401bc4795a1") at app_playback.c:506 #7 0x0000000000535136 in pbx_exec (c=c@entry=0x7f36500414e0, app=app@entry=0x3464120, data=data@entry=0x7f3656b16ab0 "custom/WellSaidLabs/Ava/1b8391878eeefd208e106401bc4795a1") at pbx_app.c:492 #8 0x0000000000527f69 in pbx_extension_helper (c=c@entry=0x7f36500414e0, context=0x7f3650041ea0 "yr-play-speech", exten=exten@entry=0x7f3650041ef0 "s", priority=priority@entry=4, label=label@entry=0x0, callerid=callerid@entry=0x7f3648003fa0 "17197890799", action=action@entry=E_SPAWN, found=found@entry=0x7f3656b18b10, combined_find_spawn=combined_find_spawn@entry=1, con=0x0) at pbx.c:2948 #9 0x0000000000529d5b in ast_spawn_extension (combined_find_spawn=1, found=0x7f3656b18b10, callerid=0x7f3648003fa0 "17197890799", priority=4, exten=0x7f3650041ef0 "s", context=<optimized out>, c=0x7f36500414e0) at pbx.c:4203 #10 ast_pbx_h_exten_run (chan=chan@entry=0x7f36500414e0, context=<optimized out>) at pbx.c:4233 #11 0x000000000052b2f0 in __ast_pbx_run (c=c@entry=0x7f36500414e0, args=args@entry=0x0) at pbx.c:4594 #12 0x000000000052b653 in pbx_thread (data=data@entry=0x7f36500414e0) at pbx.c:4701 #13 0x00000000005b26f8 in dummy_start (data=<optimized out>) at utils.c:1574 #14 0x00007f369871cea5 in start_thread () from /lib64/libpthread.so.0 #15 0x00007f36977bab0d in clone () from /lib64/libc.so.6 {code} Which shows the 'h' extension executing and the Playback() dialplan application being run. So, is your dialplan constructed in such a way that the 'h' extension could lead to Playback() occurring or as well other things? That would most likely cause issues, and could cause looping. Lots of applications are not written to run in 'h' extension or hangup handler and could cause undefined behavior. They should not be run in the 'h' extension or hangup handler. I also don't think a hangup handler would execute before this. By: Peter Howell (wpeterhowell) 2023-02-21 16:03:57.352-0600 We do not normally use hangup handlers. This example is in "yr-play-speech", which is a subroutine that handles playing the text-to-speech -- there is nowhere we would run a Playback application in a hangup. We have added a SpeechDestroy in a hangup handler now, however, as a test. By: Joshua C. Colp (jcolp) 2023-02-21 16:06:07.666-0600 The 'h' extension and hangup handlers are two separate things. The 'h' extension is literally an extension named 'h'. That's what is happening in that thread. It can also occur if you do: {noformat} exten => _.,1,Noop {noformat} Since that would match everything. By: Joshua C. Colp (jcolp) 2023-02-21 16:07:50.254-0600 You need to examine the dialplan, identify how an 'h' extension is running, elaborate, or provide the dialplan. It is EXTREMELY suspect and like I said would result in undefined behavior. By: Peter Howell (wpeterhowell) 2023-02-21 16:09:11.118-0600 Ah, ok... so, then, the answer is Yes. We have been doing _. because we use GUID strings that we have to match and not just telephone numbers. This is now making perfect sense! By: Peter Howell (wpeterhowell) 2023-02-21 16:10:27.695-0600 We have added an "h" handler that just does Return(), so that _. won't be re-executed on hangup. By: Joshua C. Colp (jcolp) 2023-02-21 16:13:14.566-0600 I don't know if that will be sufficient, but only you know your dialplan now with this knowledge of the 'h' extension. By: Peter Howell (wpeterhowell) 2023-02-21 16:17:59.262-0600 Actually, in this case, it does not appear that we're using _. so I can't see how it could possibly be executed in the hangup. Going to try to figure out how to get debug working in the logs. By: Asterisk Team (asteriskteam) 2023-02-21 16:17:59.520-0600 This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable. By: Joshua C. Colp (jcolp) 2023-02-21 16:19:06.694-0600 You can also do "dialplan show h@context" to show how the dialplan would execute such things. By: Peter Howell (wpeterhowell) 2023-02-21 16:22:29.704-0600 v004*CLI> dialplan show h@yr There is no existence of h@yr extension Command 'dialplan show h@yr' failed. By: Joshua C. Colp (jcolp) 2023-02-21 16:23:41.448-0600 Well, the backtrace says it's happening. The rest is up to you. By: Peter Howell (wpeterhowell) 2023-02-21 16:25:20.357-0600 There is definitely nothing there -- we rarely use that. In this case, we don't have _. on this context. Could the way that the AEAP connection closure is being handled cause this somehow? By: Joshua C. Colp (jcolp) 2023-02-21 16:26:39.355-0600 No. That would not cause it. You should check all of your contexts. The 'h' extension is executed in the context in which the channel is executing at the time of hang up. By: Peter Howell (wpeterhowell) 2023-02-21 16:26:43.439-0600 Also, we have not seen a recurrence since pushing a hangup handler that is running SpeechDestroy -- so far so good. By: Peter Howell (wpeterhowell) 2023-02-21 17:22:52.321-0600 Well, that did not actually fix anything, so still having issues -- they're just not very reproducible. Will get a debug log as soon as possible. By: Peter Howell (wpeterhowell) 2023-02-22 09:24:56.764-0600 We had one context that handles the speech to text and it did have a _. in the extension, so if the call ended at the right time, while it was in that sub-routine, it should also capture the h and re-execute; so, we are definitely good to close this out. Thanks for all your help! We had thought it was connected to the AEAP module. |