ASTERISK-25213: [patch]Possibility of deadlock in chan

[Home]

Summary: ASTERISK-25213: [patch]Possibility of deadlock in chan_sip INVITE early Replace code

Reporter: Walter Doekes (wdoekes) Labels:

Date Opened: 2015-06-30 09:56:27 Date Closed: 2015-07-04 19:12:22

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: 11.18.0 11.19.0 Frequency of
Occurrence Frequent

Related
Issues:
is related to ASTERISK-25212 [patch]Segfault when using DEBUG_FD_LEAKS

Environment: Attachments: ( 0) callpickup.sh
( 1) callpickup-alice.xml
( 2) callpickup-bob-waits-and-tells-charlie.xml
( 3) callpickup-charlie.xml
( 4) callpickup-extensions.conf
( 5) callpickup-sip.conf
( 6) deadlock-detect.txt
( 7) issueA25213-unlock_refer_around_sip_new.patch
( 8) issueA25213-unlock_refer_around_sip_new-PROOF.patch

Description: It seems there is a lock order problem with the following code in chan_sip:

{code}
static int handle_request_invite(struct sip_pvt *p, struct sip_request *req, struct ast_sockaddr *addr, uint32_t seqno, int *recount, const char *e, int *nounlock)
...
/* This locks both refer_call pvt and refer_call pvt's owner!!!*/
if (!error && ast_strlen_zero(pickup.exten) && (p->refer->refer_call = get_sip_pvt_byid_locked(replace_id, totag, fromtag)) == NULL) {
ast_log(LOG_NOTICE, "Supervised transfer attempted to replace non-existent call id (%s)!\n", replace_id);
transmit_response_reliable(p, "481 Call Leg Does Not Exist (Replaces)", req);
error = 1;
} else {
refer_locked = 1;
}
...
if (!replace_id && (gotdest != SIP_GET_DEST_EXTEN_FOUND)) { /* No matching extension found */
...
} else {
...
/* First invitation - create the channel. Allocation
* failures are handled below. */

c = sip_new(p, AST_STATE_DOWN, S_OR(p->peername, NULL), NULL, p->logger_callid);
{code}

The above code:
- locks a channel
- then calls sip_new, which eventually locks the channel list (through __ast_channel_alloc)

This order is wrong, because most of the other code does the inverse:
- locks the channel list
- locks a channel

This is a problem when:
- one thread piece of code iterates over the channel list; like say a MASTER_CHANNEL() function call
- another thread enters the shown code and obtains the channel lock before thread one has iterated over the locked channel

Now the two threads start waiting on each other.

Reproducing, see the attached files, starting with {{callpickup}}:
{noformat}
$ ls /etc/asterisk/ -1
asterisk.conf
callpickup-alice.xml
callpickup-bob-waits-and-tells-charlie.xml
callpickup-charlie.xml
callpickup.sh
extensions.conf
logger.conf
modules.conf
sip.conf
{noformat}

Run like this:
{noformat}
shell1# ./callpickup.sh
shell2# /usr/local/bin/sipp -s devil -ap devil2 -i 127.0.0.1 -nostdin \
-nd -default_behaviors bye -sf /etc/asterisk/callpickup-alice.xml \
-p 5073 -key tel devil 127.0.0.1
shell3# ulimit -n 2048; ulimit -c unlimited; asterisk -c
{noformat}

With DEBUG_THREADS, DONT_OPTIMIZE and DETECT_DEADLOCKS, on my system this takes less than a minute to trigger a deadlock (which may result in a crash, see ASTERISK-25212).
Sometimes it doesn't crash, in that case, it shows something like:

{noformat}
# grep locked /tmp/deadlock-detect.txt
[Jun 30 16:34:43] ERROR[30200][C-00000226]: lock.c:295 __ast_pthread_mutex_lock: chan_sip.c line 8976 (sip_pvt_lock_full): 'c[x]' was locked here.
[Jun 30 16:34:43] ERROR[30195][C-00000214]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30201][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30190][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30199][C-0000021a]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30197][C-00000217]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30193][C-00000211]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30191][C-0000020e]: lock.c:295 __ast_pthread_mutex_lock: chan_local.c line 455 (local_queue_frame): 'chan' was locked here.
[Jun 30 16:34:45] ERROR[29779]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'iter->c' was locked here.
{noformat}

I guess the refer_call should be unlocked before we enter sip_new; and then re-acquired.

Cheers,
Walter Doekes
OSSO B.V.

Comments: By: Asterisk Team (asteriskteam) 2015-06-30 09:56:28.682-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Rusty Newton (rnewton) 2015-06-30 15:34:56.687-0500

Thanks for being so thorough. I was able to reproduce using the scripts and configs you provided.
By: Walter Doekes (wdoekes) 2015-07-01 02:06:05.110-0500

Thanks for checking and reproducing, Rusty.

This attached proof of concept patch {{issueA25213-unlock_refer_around_sip_new-PROOF.patch}} ran (under valgrind) the entire night without deadlocks \[1\]. This morning I removed the patch again, and it deadlocked in 30 seconds \[2\].

\[1\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
Call-rate(length) Port Total-time Total-calls Remote-host
10.0(0 ms)/1.000s 5073 53245.13 s 164551 127.0.0.1:5060(UDP)

0 new calls during 1.004 s period 1 ms scheduler resolution
45 calls (limit 45) Peak was 45 calls, after 4 s
0 Running, 109 Paused, 22 Woken up
329130 dead call msg (discarded) 0 out-of-call msg (discarded)
2 open sockets

Messages Retrans Timeout Unexpected-Msg
Pause [ 1000ms] 164551 0
INVITE ----------> 164551 419958 104
100 <---------- 0 0 0 0
401 <---------- 164427 0 0 0
407 <---------- 0 0 0 0
180 <---------- 0 0 0 0
183 <---------- 0 0 0 0
200 <---------- 0 0 0 0
ACK ----------> 164427 0
INVITE ----------> 164427 331580 4
...
{noformat}

\[2\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
Call-rate(length) Port Total-time Total-calls Remote-host
10.0(0 ms)/1.000s 5073 43.17 s 74 127.0.0.1:5060(UDP)

0 new calls during 1.004 s period 1 ms scheduler resolution
45 calls (limit 45) Peak was 45 calls, after 4 s
0 Running, 76 Paused, 13 Woken up
18 dead call msg (discarded) 0 out-of-call msg (discarded)
2 open sockets

Messages Retrans Timeout Unexpected-Msg
Pause [ 1000ms] 74 0
INVITE ----------> 74 353 4
100 <---------- 0 0 0 0
401 <---------- 42 0 0 0
407 <---------- 0 0 0 0
180 <---------- 0 0 0 0
183 <---------- 0 0 0 0
200 <---------- 0 0 0 0
ACK ----------> 42 0
INVITE ----------> 42 96 0
...
{noformat}

On to fix the XXX part. I don't think the resulting patch will win any prizes, but in the good ol' chan_sip it may be sufficient.
By: Walter Doekes (wdoekes) 2015-07-01 05:15:27.578-0500

Attaching {{issueA25213-unlock_refer_around_sip_new.patch}}.

That cleans up some of the mess and adds the appropriate call destruction code (I think).

Ran valgrind on it. Looks good.

I didn't explicitly test the edge case though.