Summary:ASTERISK-21406: [patch] chan_sip deadlock on monlock between unload_module and do_monitor
Reporter: Corey Farrell (coreyfarrell)
Labels:
Date Opened: 2013-04-10 19:01:45
Date Closed: 2014-03-07 16:59:03.000-0600
Versions: 11.4.0 Frequency of
Environment: Ubuntu/quantal, eglibc-2.15-0ubuntu20
Attachments:
( 0) chan_sip-unload-deadlock-backtrace.txt
( 1) chan_sip-unload-deadlock-debug.patch
( 2) chan_sip-unload-testfix.patch
Description: unload_module cancels/joins the monitor thread while holding monlock.  If do_monitor attempts to lock monlock while unload_module already holds it, they deadlock: do_monitor waits for monlock while unload_module waits for do_monitor to exit.

I've experienced this issue a couple of times in production when attempting to shut down.  I found the cause while running valgrind tests.  I believe valgrind slowed things down so much that it caused the deadlock to occur somewhat reliably.  I could not replicate the issue with lock debugging enabled.  I added ast_log messages to unload_module and found that they stopped while monlock was held.  The valgrind testing was done with 'make samples', with no changes to /etc/asterisk.  I tried attaching gdb once the lock occurred, but it could not find symbols (probably because of valgrind).
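The lock ordering described above can be illustrated with a minimal pthreads sketch. It is not the attached patch and not chan_sip code: the `_sketch` functions and the `stop_requested` / `monitor_running` flags are hypothetical, and a stop flag is used in place of pthread_cancel to keep the example short. The point it shows is only that the unloading thread copies the thread handle under monlock and releases monlock before it joins, so the monitor can always take the lock one more time and exit.

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical stand-ins for chan_sip's monlock and monitor thread. */
static pthread_mutex_t monlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t monitor_thread;
static bool monitor_running = false; /* guarded by monlock */
static bool stop_requested = false;  /* guarded by monlock */

/* Monitor loop: takes monlock on every pass, as do_monitor does. */
static void *do_monitor_sketch(void *arg)
{
    (void)arg;
    for (;;) {
        bool stop;

        /* pthread_mutex_lock is not a cancellation point, so a thread
         * blocked here cannot be woken by pthread_cancel. */
        pthread_mutex_lock(&monlock);
        stop = stop_requested;
        pthread_mutex_unlock(&monlock);

        if (stop) {
            break;
        }
        sched_yield();
    }
    return NULL;
}

/* Start the monitor; returns 0 on success. */
int start_monitor_sketch(void)
{
    int res;

    pthread_mutex_lock(&monlock);
    stop_requested = false;
    res = pthread_create(&monitor_thread, NULL, do_monitor_sketch, NULL);
    monitor_running = (res == 0);
    pthread_mutex_unlock(&monlock);
    return res;
}

/* Stop the monitor without holding monlock across the join. */
int unload_module_sketch(void)
{
    pthread_t th;
    bool running;

    pthread_mutex_lock(&monlock);
    running = monitor_running;
    th = monitor_thread;
    monitor_running = false;
    stop_requested = true;
    pthread_mutex_unlock(&monlock); /* release BEFORE joining */

    if (!running) {
        return 0;
    }
    return pthread_join(th, NULL);
}
```

Calling start_monitor_sketch() followed by unload_module_sketch() returns 0 from both; moving the pthread_join call inside the locked region would recreate the wait cycle from the description.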
Comments:
By: Corey Farrell (coreyfarrell) 2013-04-10 19:38:03.436-0500

[^chan_sip-unload-testfix.patch] is a possible fix.  At first I did not use sched_yield(); the ast_debug message was printed, but the deadlock was avoided.  After adding sched_yield() I have not been able to reproduce the deadlock or the ast_mutex_trylock failed message.

This patch has not been tested with any SIP peers/activity, it was only tested as a way to fix the specific issue.
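For readers without the attachment, the general idea of combining a trylock with sched_yield() can be sketched as follows. This is an assumption about the technique, not a copy of the patch: the function names are hypothetical, plain pthreads calls stand in for ast_mutex_trylock, and the explicit pthread_testcancel() calls are my addition so that a pending cancel is honored only while the lock is not held.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for chan_sip's monlock. */
static pthread_mutex_t monlock = PTHREAD_MUTEX_INITIALIZER;

/* Monitor loop that never blocks on monlock. */
static void *monitor_trylock_sketch(void *arg)
{
    (void)arg;
    for (;;) {
        if (pthread_mutex_trylock(&monlock) != 0) {
            /* Lock is busy, possibly held by a thread that is unloading
             * the module: give up the CPU, then honor a pending cancel. */
            sched_yield();
            pthread_testcancel();
            continue;
        }

        /* ... periodic work under monlock would go here ... */

        pthread_mutex_unlock(&monlock);
        pthread_testcancel(); /* only reached with the lock released */
        sched_yield();
    }
}

/* Cancel and join while holding monlock, as the original unload_module
 * does. Returns 0 if the monitor exited through cancellation. */
int cancel_while_holding_lock_sketch(void)
{
    pthread_t th;
    void *result = NULL;
    int res;

    if (pthread_create(&th, NULL, monitor_trylock_sketch, NULL) != 0) {
        return -1;
    }

    pthread_mutex_lock(&monlock);
    pthread_cancel(th);
    res = pthread_join(th, &result);
    pthread_mutex_unlock(&monlock);

    return (res == 0 && result == PTHREAD_CANCELED) ? 0 : -1;
}
```

Because the monitor only ever uses trylock, the join completes even though the caller still holds monlock. The trade-off is a loop that spins while the lock is contended, which is one reason to prefer releasing the lock before the join where that is possible.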

By: David Brillert (aragon) 2013-07-18 08:07:13.596-0500

I might be experiencing the same deadlock.
Do you have a gdb trace you can upload so I can compare traces?

By: Corey Farrell (coreyfarrell) 2013-07-31 02:58:57.528-0500

gdb backtrace is from 1.8 branch.

Thread 5 is do_monitor() waiting for monlock.
Thread 16 is attempting to unload chan_sip.  It holds monlock and is waiting for do_monitor() to exit (pthread_join).

Built without thread debugging, run within valgrind.  I've been unable to reproduce this issue with thread debugging enabled.  Thread debugging / deadlock detection adds a bunch of code to ast_mutex_lock; one of those calls must react to pthread_cancel.

By: Corey Farrell (coreyfarrell) 2014-02-25 18:02:37.029-0600

[^chan_sip-unload-deadlock-debug.patch] is not meant to be committed.  If you attempt to unload chan_sip while do_monitor is in the delay, it will deadlock every time.

By: Corey Farrell (coreyfarrell) 2014-03-03 13:55:26.426-0600

Review reposted to https://reviewboard.asterisk.org/r/3284/ for switch to my new RB username.