Summary:                  ASTERISK-26445: rtp: Deadlock in getting payload code
Reporter:                 Ross Beer (rossbeer)
Labels:
Date Opened:              2016-10-06 04:00:23
Date Closed:              2017-10-08 20:01:14
Priority:                 Major
Regression?
Status:                   Closed/Complete
Components:               Core/RTP
Versions:                 13.11.2
Frequency of Occurrence:  Frequent
Related Issues:
Environment:              Fedora 23
Attachments:              (0) backtrace-threads_2016-10-05-CLEAN.txt
                          (1) backtrace-threads-CLEAN.txt
                          (2) backtrace-threads-PJSIP.txt
Description:

Asterisk deadlocks and no further call processing can occur.

This looks to be related to rtp_engine.c:

{noformat}
#2  0x000000000053ce5a in __ast_rwlock_rdlock (filename=0x6f1e4b "rtp_engine.c", line=924, func=0x6f2ef0 <__PRETTY_FUNCTION__.16829> "ast_rtp_codecs_payload_code", t=0x7fe9945032f0, name=0x6f21a4 "&codecs->codecs_lock") at lock.c:911
{noformat}

I think this may have been caused by using hangup handlers on the channel: one is pushed before the Dial, and another inside the Dial b-option subroutine. Both hangup handlers call the same routine, which updates the hangup reason for each channel:

{noformat}
#5  0x000000000058cac5 in ast_func_read (chan=0x7f4cc420fdb8, function=0x7f4d38cae0e0 "CHANNEL(rtcp,all)", workspace=0x7f4d38cad0c0 "", len=4096) at pbx_functions.c:617
#6  0x0000000000590bb0 in pbx_substitute_variables_helper_full (c=0x7f4cc420fdb8, headp=0x7f4cc4210580, cp1=0x7f4d38caf260 "CDR(mos)=${CHANNEL(rtcp,all)};cr=${CHANNEL(audioreadformat)};cw=${CHANNEL(audiowriteformat)};cn=${CHANNEL(audionativeformat)}", cp2=0x7f4d38caf349 "", count=8182, used=0x7f4d38caf238) at pbx_variables.c:693
#7  0x000000000059112b in pbx_substitute_variables_helper (c=0x7f4cc420fdb8, cp1=0x7f4d38caf260 "CDR(mos)=${CHANNEL(rtcp,all)};cr=${CHANNEL(audioreadformat)};cw=${CHANNEL(audiowriteformat)};cn=${CHANNEL(audionativeformat)}", cp2=0x7f4d38caf340 "CDR(mos)=", count=8191) at pbx_variables.c:790
#8  0x0000000000575257 in pbx_extension_helper (c=0x7f4cc420fdb8, con=0x0, context=0x7f4cc4210770 "record-hangupcause", exten=0x7f4cc42107c0 "s", priority=4, label=0x0, callerid=0x7f4b60004a70 "<< PRIVATE INFORMATION REMOVED >>>", action=E_SPAWN, found=0x7f4d38cb1924, combined_find_spawn=1) at pbx.c:2873
#9  0x00000000005788a2 in ast_spawn_extension (c=0x7f4cc420fdb8, context=0x7f4cc4210770 "record-hangupcause", exten=0x7f4cc42107c0 "s", priority=4, callerid=0x7f4b60004a70 "<< PRIVATE INFORMATION REMOVED >>>", found=0x7f4d38cb1924, combined_find_spawn=1) at pbx.c:4110
{noformat}

From the attached backtrace it looks like reading the CHANNEL() function for the RTCP statistics and audio formats is what triggers the lock.
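The pattern suggested by these backtraces is a classic lock-ordering inversion: one thread takes a channel-side lock and then waits on the codecs rwlock, while another thread holds (or is queued for) the codecs lock and waits on the channel-side lock. The sketch below is a self-contained pthreads illustration of that general pattern only; the lock and function names are hypothetical and this is not Asterisk source code.

{noformat}
/* Hypothetical illustration of a lock-ordering inversion; not Asterisk code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;  /* stands in for a channel lock */
static pthread_rwlock_t codecs_lock = PTHREAD_RWLOCK_INITIALIZER; /* stands in for codecs->codecs_lock */

/* Thread A: a read path - channel lock first, then the codecs read lock. */
static void *media_read_thread(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&channel_lock);
	sleep(1);                            /* widen the race window */
	pthread_rwlock_rdlock(&codecs_lock); /* blocks: thread B holds the write lock */
	pthread_rwlock_unlock(&codecs_lock);
	pthread_mutex_unlock(&channel_lock);
	return NULL;
}

/* Thread B: a renegotiation path - codecs write lock first, then the channel lock. */
static void *renegotiate_thread(void *arg)
{
	(void) arg;
	pthread_rwlock_wrlock(&codecs_lock);
	sleep(1);
	pthread_mutex_lock(&channel_lock);   /* blocks: thread A holds the channel lock -> deadlock */
	pthread_mutex_unlock(&channel_lock);
	pthread_rwlock_unlock(&codecs_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, media_read_thread, NULL);
	pthread_create(&b, NULL, renegotiate_thread, NULL);
	pthread_join(a, NULL); /* never returns once both threads are blocked */
	pthread_join(b, NULL);
	puts("not reached");
	return 0;
}
{noformat}

Whether the second thread in the real deadlock is the hangup/bridging path or a reINVITE handler is exactly what a full, unoptimized backtrace would need to show.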
Comments:

By: Asterisk Team (asteriskteam) 2016-10-06 04:00:24.893-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2016-10-07 17:42:00.866-0500

Can you attach the dialplan and, if possible, a log of the call flow leading up to the deadlock?

By: Ross Beer (rossbeer) 2016-10-10 09:21:14.017-0500

The outgoing call uses:

{noformat}
exten => _0.,1,Set(CHANNEL(hangup_handler_push)=record-hangupcause,s,1)
exten => _0.,n,Dial(${SIPDRIVER}/${globalnumber}@${usecarrier},120,L(${timelimit}:300000:60000)b(dialout-trunk-precall-handler^${EXTEN}^1))
{noformat}

The pre-call Gosub uses:

{noformat}
[dialout-trunk-precall-handler]
exten => _X.,1,Set(CHANNEL(hangup_handler_push)=record-hangupcause,s,1)
exten => _X.,n,Return()
{noformat}

The hangup handler is defined as:

{noformat}
[record-hangupcause]
exten => s,1,Set(CDR(hangupcause)=${HANGUPCAUSE})
exten => s,n,GotoIf($["${CHANNEL(channeltype)}" = "Local"]?nomos)
exten => s,n,GotoIf($["${CDR(disposition)}" = "FAILED"]?nocdr)
exten => s,n,Set(CDR(mos)=${CHANNEL(rtcp,all)}\;cr=${CHANNEL(audioreadformat)}\;cw=${CHANNEL(audiowriteformat)}\;cn=${CHANNEL(audionativeformat)})
exten => s,n(nomos),Return()
exten => s,n(nocdr),NoCDR()
exten => s,n,Return()
{noformat}

By: Rusty Newton (rnewton) 2016-10-12 19:56:54.633-0500

Thanks. Roughly how many calls do you see between deadlocks? I'd like to know what sort of frequency to expect if we try to reproduce this.
By: Ross Beer (rossbeer) 2016-10-13 09:23:01.905-0500

I've just had another deadlock accessing RTP without using the hangup handlers:

{noformat}
Thread 10 (Thread 0x7f4168af2700 (LWP 45870)):
#0  0x00007f42dcfb08ad in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f42dcfaca48 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
#2  0x000000000053ce3a in __ast_rwlock_rdlock (filename=0x6f1eab "rtp_engine.c", line=924, func=0x6f2f50 <__PRETTY_FUNCTION__.17039> "ast_rtp_codecs_payload_code", t=0x7f40701f4a40, name=0x6f2204 "&codecs->codecs_lock") at lock.c:911
#3  0x0000000000598671 in ast_rtp_codecs_payload_code (codecs=0x7f40701f4a20, asterisk_format=1, format=0x2a1f640, code=0) at rtp_engine.c:924
#4  0x00007f417ba8bd09 in ast_rtp_read (hdrlen=12, len=172, rtpheader=0x7f3fb02eb520, instance=0x7f3fb03904f8) at res_rtp_asterisk.c:4301
#5  0x00007f417ba8bd09 in ast_rtp_read (instance=0x7f3fb03904f8, rtcp=<optimized out>) at res_rtp_asterisk.c:4502
#6  0x0000000000596d9a in ast_rtp_instance_read (instance=0x7f3fb03904f8, rtcp=0) at rtp_engine.c:482
#7  0x00007f41785845af in chan_pjsip_read (ast=0x7f3fb00f1638) at chan_pjsip.c:695
#8  0x00000000004b7940 in __ast_read (chan=0x7f3fb00f1638, dropaudio=0) at channel.c:3898
#9  0x00000000004b93eb in ast_read (chan=0x7f3fb00f1638) at channel.c:4305
#10 0x000000000048659b in bridge_handle_trip (bridge_channel=0x7f40580088a8) at bridge_channel.c:2416
#11 0x0000000000486a46 in bridge_channel_wait (bridge_channel=0x7f40580088a8) at bridge_channel.c:2586
#12 0x0000000000487131 in bridge_channel_internal_join (bridge_channel=0x7f40580088a8) at bridge_channel.c:2732
#13 0x000000000046dbcf in ast_bridge_join (bridge=0x7f4058003e08, chan=0x7f3fb00f1638, swap=0x0, features=0x7f4168aedfd0, tech_args=0x0, flags=(AST_BRIDGE_JOIN_PASS_REFERENCE | AST_BRIDGE_JOIN_INHIBIT_JOIN_COLP)) at bridge.c:1712
#14 0x000000000050e3e0 in ast_bridge_call_with_flags (chan=0x7f3fb00f1638, peer=0x7f4058006438, config=0x7f4168aef000, flags=0) at features.c:672
#15 0x000000000050e4b2 in ast_bridge_call (chan=0x7f3fb00f1638, peer=0x7f4058006438, config=0x7f4168aef000) at features.c:711
#16 0x00007f41969cb608 in dial_exec_full (chan=0x7f3fb00f1638, data=0x7f4168aef540 "PJSIP/<<< ENDPOINT >>>,,b(vh-add-distinctivering-headers^1^1)", peerflags=0x7f4168aef400, continue_exec=0x0) at app_dial.c:3158
#17 0x00007f41969cb999 in dial_exec (chan=0x7f3fb00f1638, data=0x7f4168aef540 "PJSIP/<<< ENDPOINT >>>,,b(vh-add-distinctivering-headers^1^1)") at app_dial.c:3210
#18 0x00000000005879a6 in pbx_exec (c=0x7f3fb00f1638, app=0x36cdd00, data=0x7f4168aef540 "PJSIP/<<< ENDPOINT >>>,,b(vh-add-distinctivering-headers^1^1)") at pbx_app.c:485
#19 0x00000000005753db in pbx_extension_helper (c=0x7f3fb00f1638, con=0x0, context=0x7f3fb00f1ff0 "crossservers", exten=0x7f3fb00f2040 "<<< ENDPOINT >>>", priority=18, label=0x0, callerid=0x7f4058002c40 "<<< CLI >>>", action=E_SPAWN, found=0x7f4168af1be4, combined_find_spawn=1) at pbx.c:2884
#20 0x00000000005788a2 in ast_spawn_extension (c=0x7f3fb00f1638, context=0x7f3fb00f1ff0 "<<< CONTEXT >>>", exten=0x7f3fb00f2040 "<<< ENDPOINT >>>", priority=18, callerid=0x7f4058002c40 "<<< CLI >>>", found=0x7f4168af1be4, combined_find_spawn=1) at pbx.c:4110
#21 0x000000000057950b in __ast_pbx_run (c=0x7f3fb00f1638, args=0x0) at pbx.c:4285
#22 0x000000000057ac37 in pbx_thread (data=0x7f3fb00f1638) at pbx.c:4605
#23 0x0000000000600915 in dummy_start (data=0x7f3fb031b2a0) at utils.c:1235
#24 0x00007f42dcfa861a in start_thread () at /lib64/libpthread.so.0
#25 0x00007f42dc2e45fd in clone () at /lib64/libc.so.6
{noformat}

This happens somewhere between 20,000 and 100,000 processed calls.

By: Ross Beer (rossbeer) 2016-10-18 08:31:41.853-0500

I've just uploaded a backtrace showing an additional deadlock, 'backtrace-threads-PJSIP.txt'; however, I'm not sure whether this is a new issue.

By: Ross Beer (rossbeer) 2016-12-22 18:21:50.893-0600

Will the following change resolve this issue?

https://gerrit.asterisk.org/#/c/4671/

The above is related to ASTERISK-26673.

By: Ross Beer (rossbeer) 2017-02-16 09:05:49.060-0600

I believe ASTERISK-26673 resolved part of the issue; however, the newer deadlocks look to be caused by a lock inversion between channel hangup and the bridging code. @George Joseph has unsanitised backtraces of the latest lock.

By: Joshua C. Colp (jcolp) 2017-02-16 09:11:56.932-0600

I examined the latest lock and it was the same as this one. It wasn't a result of the bridge; it just cascaded and deadlocked in more places.

By: Ross Beer (rossbeer) 2017-03-29 04:57:48.223-0500

The PJSIP team have just resolved deadlock issues with transactions:

https://trac.pjsip.org/repos/changeset/5572
https://trac.pjsip.org/repos/changeset/5573

Would patching the bundled PJSIP with these commits help resolve this and the other deadlock issues I have reported?

By: Richard Mudgett (rmudgett) 2017-04-29 18:34:59.668-0500

The [^backtrace-threads-PJSIP.txt] backtrace definitely shows a deadlock in the T.38 framehook, between a PJSIP serializer thread processing a reINVITE with SDP and the channel thread writing a frame in the T.38 framehook. The deadlock is between threads 3 and 4. I have a patch for this up on Gerrit against ASTERISK-26974, an issue I created because it is a different deadlock than this one.

The other "CLEAN" backtraces give hints of a deadlock, but I cannot find circular locking because I cannot determine which threads are deadlocked. I cannot find the thread holding the lock that ast_rtp_codecs_payload_code() is trying to acquire. The backtraces are optimized, and many threads have *too much* backtrace information deleted. (I also cannot determine the deadlocking threads in the other private unredacted backtraces, as they are also optimized.)

By: Ross Beer (rossbeer) 2017-10-08 20:01:15.142-0500

I can no longer replicate the issue.
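Lock inversions of the kind discussed in this thread (one thread holding a channel-side lock while waiting on the codecs rwlock, another holding the codecs lock while waiting on the channel) are conventionally broken either by enforcing a single acquisition order everywhere or by a trylock-and-back-off loop on the second lock. The fragment below sketches the back-off pattern in plain pthreads; the lock names are hypothetical, and this is not the change that was applied in Asterisk (the issue was ultimately closed when it could no longer be reproduced).

{noformat}
/* Generic trylock-and-back-off sketch; hypothetical lock names, not Asterisk source. */
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t codecs_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Acquire the codecs read lock while already holding the channel lock,
 * without ever blocking on codecs_lock while channel_lock is held. */
static void lock_codecs_with_channel_held(void)
{
	while (pthread_rwlock_tryrdlock(&codecs_lock) != 0) {
		/* Another thread owns (or is queued for) codecs_lock and may be
		 * waiting on channel_lock: release, yield, and retry so that
		 * thread can make progress. */
		pthread_mutex_unlock(&channel_lock);
		sched_yield();
		pthread_mutex_lock(&channel_lock);
	}
	/* Both locks are now held without the possibility of this deadlock. */
}
{noformat}

Asterisk's lock.h provides a DEADLOCK_AVOIDANCE() macro built around a similar unlock-and-retry step for its own locks.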