Summary:ASTERISK-24838: chan_sip: Locking inversion occurs when building a peer causes a peer poke during request handling
Reporter:Richard Mudgett (rmudgett)Labels:
Date Opened:2015-02-27 17:31:22.000-0600Date Closed:2015-03-06 13:19:22.000-0600
Versions:11.15.1 Frequency of
Description:A deadlock can happen when an incoming call for a realtime peer happens at the same time as a periodic CLI "sip show channels" is executing.

From [~mjordan]'s analysis:
Looking at the output provided, we can see that Thread 2 is currently pending on a {{sip_pvt}} lock during an iteration over the SIP dialogs container:
Thread 2 (Thread 0x2b9b2d17f700 (LWP 5337)):
#0  0x00002b9adeb28d5c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00002b9adeb243c4 in _L_lock_997 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00002b9adeb24227 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00002b9ae385988c in show_channels_cb (__cur=0x2b9b487e6688, __arg=0x2b9b2d17e3c0, flags=<optimized out>) at chan_sip.c:20862
#4  0x000000000045008c in internal_ao2_callback (c=0x2b9ae4011c48, flags=flags@entry=OBJ_NODATA, cb_fn=cb_fn@entry=0x2b9ae3859850, arg=arg@entry=0x2b9b2d17e3c0, tag=tag@entry=0x0, file=file@entry=0x0, line=line@entry=0, func=func@entry=0x0, type=DEFAULT, data=0x0) at astobj2.c:1107
#5  0x00000000004521ff in __ao2_callback (c=<optimized out>, flags=flags@entry=OBJ_NODATA, cb_fn=cb_fn@entry=0x2b9ae3859850 <show_channels_cb>, arg=arg@entry=0x2b9b2d17e3c0) at astobj2.c:1212
#6  0x00002b9ae3844109 in sip_show_channels (e=<optimized out>, cmd=<optimized out>, a=<optimized out>) at chan_sip.c:20933
#7  0x00000000004a1d3a in ast_cli_command_full (uid=uid@entry=0, gid=gid@entry=0, fd=fd@entry=137, s=s@entry=0x2b9b2d17e740 "sip show channels") at cli.c:2691
#8  0x00000000004a1ee3 in ast_cli_command_multiple_full (uid=0, gid=0, fd=137, size=size@entry=18, s=s@entry=0x2b9b2d17ea1f "sip show channels") at cli.c:2714
#9  0x000000000044b7b5 in netconsole (vconsole=vconsole@entry=0x8662f0) at asterisk.c:1378
#10 0x0000000000591b8a in dummy_start (data=<optimized out>) at utils.c:1223
#11 0x00002b9adeb21b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00002b9adda027bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

In Thread 77, we can see that {{chan_sip}} is blocked attempting to link a new SIP pvt into the same {{dialogs}} container:
Thread 77 (Thread 0x2b9b00886700 (LWP 26092)):
#0  0x00002b9adeb28d5c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00002b9adeb243c4 in _L_lock_997 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00002b9adeb24227 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x000000000044f56b in internal_ao2_link (c=0x2b9ae4011c48, user_data=user_data@entry=0x2b9b48345d88, flags=flags@entry=0, tag=tag@entry=0x0, file=file@entry=0x5b6286 "astobj2.c", line=line@entry=937, func=func@entry=0x5b6442 "__ao2_link") at astobj2.c:905
#4  0x000000000045201f in __ao2_link (c=<optimized out>, obj_new=obj_new@entry=0x2b9b48345d88, flags=flags@entry=0) at astobj2.c:937
#5  0x00002b9ae3891b43 in sip_alloc (callid=callid@entry=0x0, addr=addr@entry=0x0, useglobal_nat=useglobal_nat@entry=0, intended_method=intended_method@entry=3, req=req@entry=0x0, logger_callid=logger_callid@entry=0x0) at chan_sip.c:8634
#6  0x00002b9ae389347c in sip_poke_peer (peer=peer@entry=0x2319558, force=0, force@entry=xxxxxxxxxxx) at chan_sip.c:29585
#7  0x00002b9ae3897575 in build_peer (name=name@entry=0x2b9b486ce1e5 "xxxxxxxxxxx", v=<optimized out>, v@entry=0x2b9b4846f280, alt=<optimized out>, alt@entry=0x0, devstate_only=devstate_only@entry=0, realtime=1) at chan_sip.c:31253
#8  0x00002b9ae389c05e in realtime_peer (newpeername=0x2b9b486ce1e5 "xxxxxxxxxx", newpeername@entry=0x0, addr=addr@entry=0x2b9b487eb030, callbackexten=callbackexten@entry=0x2b9b008837b4 "xxxxxxxxxx", devstate_only=devstate_only@entry=0, which_objects=which_objects@entry=2) at chan_sip.c:5435
#9  0x00002b9ae389cc95 in sip_find_peer_full (addr=addr@entry=0x2b9b487eb030, callbackexten=0x2b9b008837b4 "xxxxxxxxxx", transport=<optimized out>, devstate_only=0, which_objects=2, realtime=1, peer=0x0) at chan_sip.c:5530
#10 0x00002b9ae38c558e in sip_find_peer_by_ip_and_exten (transport=<optimized out>, callbackexten=<optimized out>, addr=0x2b9b487eb030) at chan_sip.c:5574
#11 check_peer_ok (p=p@entry=0x2b9b487e6688, of=0x2b9b4851dda5 "xxxxxxxxxx", req=req@entry=0x2b9b00885300, sipmethod=sipmethod@entry=5, addr=addr@entry=0x2b9b00885270, authpeer=authpeer@entry=0x2b9b00883b80, reliable=reliable@entry=XMIT_RELIABLE, uri2=uri2@entry=0x2b9b008838f0 "sip:xxxxxxxxxx@sip.xxxxxxxx.com:5060", calleridname=<error reading variable: Unhandled dwarf expression opcode 0xfa>) at chan_sip.c:18087
#12 0x00002b9ae38c6025 in check_user_full (p=p@entry=0x2b9b487e6688, req=req@entry=0x2b9b00885300, sipmethod=sipmethod@entry=5, uri=uri@entry=0x2b9b4882b18f "sip:xxxxxxxxxx@sip.xxxxxxxx.com:5060", reliable=reliable@entry=XMIT_RELIABLE, addr=addr@entry=0x2b9b00885270, authpeer=authpeer@entry=0x2b9b00883b80) at chan_sip.c:18401
#13 0x00002b9ae38c7ec9 in handle_request_invite (p=p@entry=0x2b9b487e6688, req=req@entry=0x2b9b00885300, addr=addr@entry=0x2b9b00885270, seqno=1, recount=recount@entry=0x2b9b00885200, e=e@entry=0x2b9b4882b18f "sip:xxxxxxxxxx@sip.xxxxxxxxxx.com:5060", nounlock=nounlock@entry=0x2b9b00885210) at chan_sip.c:25563
#14 0x00002b9ae38cf2dd in handle_incoming (p=p@entry=0x2b9b487e6688, req=req@entry=0x2b9b00885300, addr=addr@entry=0x2b9b00885270, recount=recount@entry=0x2b9b00885200, nounlock=nounlock@entry=0x2b9b00885210) at chan_sip.c:28339
#15 0x00002b9ae38d1ae2 in handle_request_do (req=req@entry=0x2b9b00885300, addr=addr@entry=0x2b9b00885270) at chan_sip.c:28548
#16 0x00002b9ae38d34b6 in sipsock_read (id=<optimized out>, fd=<optimized out>, events=<optimized out>, ignore=<optimized out>) at chan_sip.c:28479
#17 0x00000000004ffe66 in ast_io_wait (ioc=0x2b9ae4001b40, howlong=<optimized out>) at io.c:292
#18 0x00002b9ae38ade1f in do_monitor (data=data@entry=0x0) at chan_sip.c:29077
#19 0x0000000000591b8a in dummy_start (data=<optimized out>) at utils.c:1223
#20 0x00002b9adeb21b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#21 0x00002b9adda027bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#22 0x0000000000000000 in ?? ()
So, Thread 77 is waiting for Thread 2 to complete. However, what is interesting here is that Thread 77 is currently operating on that same {{sip_pvt}}, and currently has it locked. In essence, Thread 77 is creating its own locking inversion:

# Hold {{sip_pvt}} {{0x2b9b487e6688}} lock as the result of {{handle_request_do}}
# As the result of {{sip_poke_peer}}, we go off to create a new {{sip_pvt}}. This requires grabbing the {{dialogs}} container while we hold a lock to a {{sip_pvt}}, creating the possibility of a locking inversion
# At the same time, the CLI command {{sip show channels}} is run on Thread 2.  This grabs the lock on the {{dialogs}} container and iterates through the {{sip_pvt}} objects. This blocks Thread 77 from grabbing the {{dialogs}} container; however, Thread 2 cannot grab the lock for {{0x2b9b487e6688}}, causing the deadlock.

The solution here is to release the {{sip_pvt}} lock on {{0x2b9b487e6688}} before trying to find the peer. Optionally, we could also not try to poke the peer during the process of the build peer, and instead push that off to a scheduler. That would keep the process of looking up the peer from trying to create a new dialog while a {{sip_pvt}} is currently locked.

I've opted to push the peer poke off onto the scheduler rather than unlock the {{sip_pvt}}.