Summary:                  ASTERISK-26445: rtp: Deadlock in getting payload code
Reporter:                 Ross Beer (rossbeer)
Labels:
Date Opened:              2016-10-06 04:00:23
Date Closed:              2017-10-08 20:01:14
Priority:                 Major
Regression?
Status:                   Closed/Complete
Components:               Core/RTP
Versions:                 13.11.2
Frequency of Occurrence:  Frequent
Related Issues:
Environment:              Fedora 23
Attachments:              (0) backtrace-threads_2016-10-05-CLEAN.txt
                          (1) backtrace-threads-CLEAN.txt
                          (2) backtrace-threads-PJSIP.txt
Description:

Asterisk deadlocks and no further call processing can occur.

This looks to be related to rtp_engine.c:

{noformat}
#2  0x000000000053ce5a in __ast_rwlock_rdlock (filename=0x6f1e4b "rtp_engine.c", line=924, func=0x6f2ef0 <__PRETTY_FUNCTION__.16829> "ast_rtp_codecs_payload_code", t=0x7fe9945032f0, name=0x6f21a4 "&codecs->codecs_lock") at lock.c:911
{noformat}

I think this may have been caused by using hangup handlers on the channel: one is pushed before the Dial, and another inside the Dial b-option subroutine. Both hangup handlers call the same routine, which updates the hangup reason for each channel:

{noformat}
#5  0x000000000058cac5 in ast_func_read (chan=0x7f4cc420fdb8, function=0x7f4d38cae0e0 "CHANNEL(rtcp,all)", workspace=0x7f4d38cad0c0 "", len=4096) at pbx_functions.c:617
#6  0x0000000000590bb0 in pbx_substitute_variables_helper_full (c=0x7f4cc420fdb8, headp=0x7f4cc4210580, cp1=0x7f4d38caf260 "CDR(mos)=${CHANNEL(rtcp,all)};cr=${CHANNEL(audioreadformat)};cw=${CHANNEL(audiowriteformat)};cn=${CHANNEL(audionativeformat)}", cp2=0x7f4d38caf349 "", count=8182, used=0x7f4d38caf238) at pbx_variables.c:693
#7  0x000000000059112b in pbx_substitute_variables_helper (c=0x7f4cc420fdb8, cp1=0x7f4d38caf260 "CDR(mos)=${CHANNEL(rtcp,all)};cr=${CHANNEL(audioreadformat)};cw=${CHANNEL(audiowriteformat)};cn=${CHANNEL(audionativeformat)}", cp2=0x7f4d38caf340 "CDR(mos)=", count=8191) at pbx_variables.c:790
#8  0x0000000000575257 in pbx_extension_helper (c=0x7f4cc420fdb8, con=0x0, context=0x7f4cc4210770 "record-hangupcause", exten=0x7f4cc42107c0 "s", priority=4, label=0x0, callerid=0x7f4b60004a70 "<< PRIVATE INFORMATION REMOVED >>>", action=E_SPAWN, found=0x7f4d38cb1924, combined_find_spawn=1) at pbx.c:2873
#9  0x00000000005788a2 in ast_spawn_extension (c=0x7f4cc420fdb8, context=0x7f4cc4210770 "record-hangupcause", exten=0x7f4cc42107c0 "s", priority=4, callerid=0x7f4b60004a70 "<< PRIVATE INFORMATION REMOVED >>>", found=0x7f4d38cb1924, combined_find_spawn=1) at pbx.c:4110
{noformat}

From the attached backtrace it looks like reading the CHANNEL() function for the RTCP statistics and audio formats is what triggers the lock.
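The pattern suggested by these backtraces is a classic lock-ordering inversion: one thread takes a channel-side lock and then waits on the codecs rwlock, while another thread holds (or is queued for) the codecs lock and waits on the channel-side lock. The sketch below is a self-contained pthreads illustration of that general pattern only; the lock and function names are hypothetical and this is not Asterisk source code.

{noformat}
/* Hypothetical illustration of a lock-ordering inversion; not Asterisk code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;  /* stands in for a channel lock */
static pthread_rwlock_t codecs_lock = PTHREAD_RWLOCK_INITIALIZER; /* stands in for codecs->codecs_lock */

/* Thread A: a read path - channel lock first, then the codecs read lock. */
static void *media_read_thread(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&channel_lock);
	sleep(1);                            /* widen the race window */
	pthread_rwlock_rdlock(&codecs_lock); /* blocks: thread B holds the write lock */
	pthread_rwlock_unlock(&codecs_lock);
	pthread_mutex_unlock(&channel_lock);
	return NULL;
}

/* Thread B: a renegotiation path - codecs write lock first, then the channel lock. */
static void *renegotiate_thread(void *arg)
{
	(void) arg;
	pthread_rwlock_wrlock(&codecs_lock);
	sleep(1);
	pthread_mutex_lock(&channel_lock);   /* blocks: thread A holds the channel lock -> deadlock */
	pthread_mutex_unlock(&channel_lock);
	pthread_rwlock_unlock(&codecs_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, media_read_thread, NULL);
	pthread_create(&b, NULL, renegotiate_thread, NULL);
	pthread_join(a, NULL); /* never returns once both threads are blocked */
	pthread_join(b, NULL);
	puts("not reached");
	return 0;
}
{noformat}

Whether the second thread in the real deadlock is the hangup/bridging path or a reINVITE handler is exactly what a full, unoptimized backtrace would need to show.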
Comments:

By: Asterisk Team (asteriskteam) 2016-10-06 04:00:24.893-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2016-10-07 17:42:00.866-0500

Can you attach the dialplan and, if possible, a log of the call flow leading up to the deadlock?

By: Ross Beer (rossbeer) 2016-10-10 09:21:14.017-0500

The outgoing call uses:

{noformat}
exten => _0.,1,Set(CHANNEL(hangup_handler_push)=record-hangupcause,s,1)
exten => _0.,n,Dial(${SIPDRIVER}/${globalnumber}@${usecarrier},120,L(${timelimit}:300000:60000)b(dialout-trunk-precall-handler^${EXTEN}^1))
{noformat}

The pre-call Gosub uses:

{noformat}
[dialout-trunk-precall-handler]
exten => _X.,1,Set(CHANNEL(hangup_handler_push)=record-hangupcause,s,1)
exten => _X.,n,Return()
{noformat}

The hangup handler is defined as:

{noformat}
[record-hangupcause]
exten => s,1,Set(CDR(hangupcause)=${HANGUPCAUSE})
exten => s,n,GotoIf($["${CHANNEL(channeltype)}" = "Local"]?nomos)
exten => s,n,GotoIf($["${CDR(disposition)}" = "FAILED"]?nocdr)
exten => s,n,Set(CDR(mos)=${CHANNEL(rtcp,all)}\;cr=${CHANNEL(audioreadformat)}\;cw=${CHANNEL(audiowriteformat)}\;cn=${CHANNEL(audionativeformat)})
exten => s,n(nomos),Return()
exten => s,n(nocdr),NoCDR()
exten => s,n,Return()
{noformat}

By: Rusty Newton (rnewton) 2016-10-12 19:56:54.633-0500

Thanks. Roughly how many calls do you see between deadlocks? I'd like to know what sort of frequency to expect if we try to reproduce this.
By: Ross Beer (rossbeer) 2016-10-13 09:23:01.905-0500

I've just had another deadlock accessing RTP without using the hangup handlers:

{noformat}
Thread 10 (Thread 0x7f4168af2700 (LWP 45870)):
#0  0x00007f42dcfb08ad in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f42dcfaca48 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
#2  0x000000000053ce3a in __ast_rwlock_rdlock (filename=0x6f1eab "rtp_engine.c", line=924, func=0x6f2f50 <__PRETTY_FUNCTION__.17039> "ast_rtp_codecs_payload_code", t=0x7f40701f4a40, name=0x6f2204 "&codecs->codecs_lock") at lock.c:911
#3  0x0000000000598671 in ast_rtp_codecs_payload_code (codecs=0x7f40701f4a20, asterisk_format=1, format=0x2a1f640, code=0) at rtp_engine.c:924
#4  0x00007f417ba8bd09 in ast_rtp_read (hdrlen=12, len=172, rtpheader=0x7f3fb02eb520, instance=0x7f3fb03904f8) at res_rtp_asterisk.c:4301
#5  0x00007f417ba8bd09 in ast_rtp_read (instance=0x7f3fb03904f8, rtcp=<optimized out>) at res_rtp_asterisk.c:4502
#6  0x0000000000596d9a in ast_rtp_instance_read (instance=0x7f3fb03904f8, rtcp=0) at rtp_engine.c:482
#7  0x00007f41785845af in chan_pjsip_read (ast=0x7f3fb00f1638) at chan_pjsip.c:695
#8  0x00000000004b7940 in __ast_read (chan=0x7f3fb00f1638, dropaudio=0) at channel.c:3898
#9  0x00000000004b93eb in ast_read (chan=0x7f3fb00f1638) at channel.c:4305
#10 0x000000000048659b in bridge_handle_trip (bridge_channel=0x7f40580088a8) at bridge_channel.c:2416
#11 0x0000000000486a46 in bridge_channel_wait (bridge_channel=0x7f40580088a8) at bridge_channel.c:2586
#12 0x0000000000487131 in bridge_channel_internal_join (bridge_channel=0x7f40580088a8) at bridge_channel.c:2732
#13 0x000000000046dbcf in ast_bridge_join (bridge=0x7f4058003e08, chan=0x7f3fb00f1638, swap=0x0, features=0x7f4168aedfd0, tech_args=0x0, flags=(AST_BRIDGE_JOIN_PASS_REFERENCE | AST_BRIDGE_JOIN_INHIBIT_JOIN_COLP)) at bridge.c:1712
#14 0x000000000050e3e0 in ast_bridge_call_with_flags (chan=0x7f3fb00f1638, peer=0x7f4058006438, config=0x7f4168aef000, flags=0) at features.c:672
#15 0x000000000050e4b2 in ast_bridge_call (chan=0x7f3fb00f1638, peer=0x7f4058006438, config=0x7f4168aef000) at features.c:711
#16 0x00007f41969cb608 in dial_exec_full (chan=0x7f3fb00f1638, data=0x7f4168aef540 "PJSIP/<<< ENDPOINT >>>,,b(vh-add-distinctivering-headers^1^1)", peerflags=0x7f4168aef400, continue_exec=0x0) at app_dial.c:3158
#17 0x00007f41969cb999 in dial_exec (chan=0x7f3fb00f1638, data=0x7f4168aef540 "PJSIP/<<< ENDPOINT >>>,,b(vh-add-distinctivering-headers^1^1)") at app_dial.c:3210
#18 0x00000000005879a6 in pbx_exec (c=0x7f3fb00f1638, app=0x36cdd00, data=0x7f4168aef540 "PJSIP/<<< ENDPOINT >>>,,b(vh-add-distinctivering-headers^1^1)") at pbx_app.c:485
#19 0x00000000005753db in pbx_extension_helper (c=0x7f3fb00f1638, con=0x0, context=0x7f3fb00f1ff0 "crossservers", exten=0x7f3fb00f2040 "<<< ENDPOINT >>>", priority=18, label=0x0, callerid=0x7f4058002c40 "<<< CLI >>>", action=E_SPAWN, found=0x7f4168af1be4, combined_find_spawn=1) at pbx.c:2884
#20 0x00000000005788a2 in ast_spawn_extension (c=0x7f3fb00f1638, context=0x7f3fb00f1ff0 "<<< CONTEXT >>>", exten=0x7f3fb00f2040 "<<< ENDPOINT >>>", priority=18, callerid=0x7f4058002c40 "<<< CLI >>>", found=0x7f4168af1be4, combined_find_spawn=1) at pbx.c:4110
#21 0x000000000057950b in __ast_pbx_run (c=0x7f3fb00f1638, args=0x0) at pbx.c:4285
#22 0x000000000057ac37 in pbx_thread (data=0x7f3fb00f1638) at pbx.c:4605
#23 0x0000000000600915 in dummy_start (data=0x7f3fb031b2a0) at utils.c:1235
#24 0x00007f42dcfa861a in start_thread () at /lib64/libpthread.so.0
#25 0x00007f42dc2e45fd in clone () at /lib64/libc.so.6
{noformat}

This happens somewhere between 20,000 and 100,000 processed calls.

By: Ross Beer (rossbeer) 2016-10-18 08:31:41.853-0500

I've just uploaded a backtrace showing an additional deadlock, 'backtrace-threads-PJSIP.txt'; however, I'm not sure whether this is a new issue.

By: Ross Beer (rossbeer) 2016-12-22 18:21:50.893-0600

Will the following change resolve this issue?

https://gerrit.asterisk.org/#/c/4671/

The above is related to ASTERISK-26673.

By: Ross Beer (rossbeer) 2017-02-16 09:05:49.060-0600

I believe ASTERISK-26673 resolved part of the issue; however, the newer deadlocks look to be caused by a lock inversion between channel hangup and the bridging code. @George Joseph has unsanitised backtraces of the latest lock.

By: Joshua C. Colp (jcolp) 2017-02-16 09:11:56.932-0600

I examined the latest lock and it was the same as this one. It wasn't a result of the bridge; it just cascaded and deadlocked in more places.

By: Ross Beer (rossbeer) 2017-03-29 04:57:48.223-0500

The PJSIP team have just resolved deadlock issues with transactions:

https://trac.pjsip.org/repos/changeset/5572
https://trac.pjsip.org/repos/changeset/5573

Would patching the bundled PJSIP with these commits help resolve this and the other deadlock issues I have reported?

By: Richard Mudgett (rmudgett) 2017-04-29 18:34:59.668-0500

The [^backtrace-threads-PJSIP.txt] backtrace definitely shows a deadlock in the T.38 framehook, between a PJSIP serializer thread processing a reINVITE with SDP and the channel thread writing a frame in the T.38 framehook. The deadlock is between threads 3 and 4. I have a patch for this up on Gerrit against ASTERISK-26974, an issue I created because it is a different deadlock than this one.

The other "CLEAN" backtraces give hints of a deadlock, but I cannot find circular locking because I cannot determine which threads are deadlocked. I cannot find the thread holding the lock that ast_rtp_codecs_payload_code() is trying to acquire. The backtraces are optimized, and many threads have *too much* backtrace information deleted. (I also cannot determine the deadlocking threads in the other private unredacted backtraces, as they are also optimized.)

By: Ross Beer (rossbeer) 2017-10-08 20:01:15.142-0500

I can no longer replicate the issue.
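Lock inversions of the kind discussed in this thread (one thread holding a channel-side lock while waiting on the codecs rwlock, another holding the codecs lock while waiting on the channel) are conventionally broken either by enforcing a single acquisition order everywhere or by a trylock-and-back-off loop on the second lock. The fragment below sketches the back-off pattern in plain pthreads; the lock names are hypothetical, and this is not the change that was applied in Asterisk (the issue was ultimately closed when it could no longer be reproduced).

{noformat}
/* Generic trylock-and-back-off sketch; hypothetical lock names, not Asterisk source. */
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t codecs_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Acquire the codecs read lock while already holding the channel lock,
 * without ever blocking on codecs_lock while channel_lock is held. */
static void lock_codecs_with_channel_held(void)
{
	while (pthread_rwlock_tryrdlock(&codecs_lock) != 0) {
		/* Another thread owns (or is queued for) codecs_lock and may be
		 * waiting on channel_lock: release, yield, and retry so that
		 * thread can make progress. */
		pthread_mutex_unlock(&channel_lock);
		sched_yield();
		pthread_mutex_lock(&channel_lock);
	}
	/* Both locks are now held without the possibility of this deadlock. */
}
{noformat}

Asterisk's lock.h provides a DEADLOCK_AVOIDANCE() macro built around a similar unlock-and-retry step for its own locks.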