|Summary:||ASTERISK-21406: [patch] chan_sip deadlock on monlock between unload_module and do_monitor|
|Reporter:||Corey Farrell (coreyfarrell)||Labels:|
|Date Opened:||2013-04-10 19:01:45||Date Closed:||2014-03-07 16:59:03.000-0600|
|Versions:||22.214.171.124 11.4.0||Frequency of|
|Environment:||Ubuntu/quantal, eglibc-2.15-0ubuntu20||Attachments:||( 0) chan_sip-unload-deadlock-backtrace.txt|
( 1) chan_sip-unload-deadlock-debug.patch
( 2) chan_sip-unload-testfix.patch
|Description:||unload_module cancels/joins the monitor thread while holding monlock. If do_monitor attempts to lock monlock while unload_module already has it, they deadlock. do_monitor waits for monlock while unload_module waits for do_monitor to exit.
I've experienced this issue a couple of times in production when attempting to shut down. I found the cause while running valgrind tests. I believe valgrind slowed things down so much that it caused the deadlock to occur somewhat reliably. I could not replicate the issue with lock debugging enabled. I added ast_log messages to unload_module and found that they stopped while monlock was held. The valgrind testing was done with 'make samples', no changes to /etc/asterisk. I tried attaching gdb once the deadlock occurred but it could not find symbols (probably because of valgrind).
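For illustration only, here is a minimal sketch of one common way to avoid this class of deadlock: request shutdown while holding the lock, but release it before joining. This is not the attached patch and not chan_sip code. It swaps pthread_cancel for a stop flag, and the names start_monitor, stop_monitor, monitor_main, monitor_running and monitor_stop are hypothetical stand-ins.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-ins for chan_sip's monlock and monitor thread. */
static pthread_mutex_t monlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t monitor_thread;
static int monitor_running = 0;   /* protected by monlock */
static int monitor_stop = 0;      /* protected by monlock */

/* Monitor loop: takes monlock briefly on every pass, as do_monitor does. */
static void *monitor_main(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&monlock);
        int stop = monitor_stop;
        pthread_mutex_unlock(&monlock);
        if (stop)
            break;
        sched_yield();            /* stand-in for the real periodic work */
    }
    return NULL;
}

/* Start the monitor thread; returns 0 on success. */
int start_monitor(void)
{
    int res;

    pthread_mutex_lock(&monlock);
    monitor_stop = 0;
    res = pthread_create(&monitor_thread, NULL, monitor_main, NULL);
    monitor_running = (res == 0);
    pthread_mutex_unlock(&monlock);
    return res;
}

/* Request shutdown under the lock, but join only after releasing it. */
int stop_monitor(void)
{
    pthread_t th;
    int was_running;

    pthread_mutex_lock(&monlock);
    was_running = monitor_running;
    th = monitor_thread;
    monitor_stop = 1;
    monitor_running = 0;
    pthread_mutex_unlock(&monlock);   /* release BEFORE joining */

    /* The monitor can still take monlock to see the stop flag and exit. */
    return was_running ? pthread_join(th, NULL) : 0;
}
```

Because the join happens with monlock released, the monitor thread can always complete its current pass and observe the stop request.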
|Comments:||By: Corey Farrell (coreyfarrell) 2013-04-10 19:38:03.436-0500|
[^chan_sip-unload-testfix.patch] is a possible fix. At first I did not use sched_yield(); the ast_debug message was printed, but the deadlock was avoided. After adding sched_yield() I have not been able to reproduce the deadlock or the "ast_mutex_trylock failed" message.
This patch has not been tested with any SIP peers/activity, it was only tested as a way to fix the specific issue.
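The patch itself is not reproduced here. As a rough sketch of the general trylock-plus-yield idea, assuming the monitor side is the one that avoids blocking: the monitor uses pthread_mutex_trylock, and when the lock is busy it yields and re-checks an unload flag instead of waiting. The names unloading, monitor_main and unload_with_lock_held are hypothetical, and plain pthreads calls stand in for the ast_mutex_* wrappers.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-ins; not the actual chan_sip symbols or patch. */
static pthread_mutex_t monlock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int unloading = 0;

/* Monitor loop that never blocks on monlock. */
static void *monitor_main(void *arg)
{
    (void)arg;
    while (!atomic_load(&unloading)) {
        if (pthread_mutex_trylock(&monlock) != 0) {
            /* Lock is busy (possibly held by the unloader): do not wait.
             * Yield the CPU and go back to re-check the unload flag. */
            sched_yield();
            continue;
        }
        /* ... periodic work under monlock would go here ... */
        pthread_mutex_unlock(&monlock);
        sched_yield();
    }
    return NULL;
}

/* Start a monitor, then stop it while holding monlock across the join. */
int unload_with_lock_held(void)
{
    pthread_t th;
    int res;

    res = pthread_create(&th, NULL, monitor_main, NULL);
    if (res != 0)
        return res;

    pthread_mutex_lock(&monlock);
    atomic_store(&unloading, 1);
    /* Joining with the lock held is safe in this sketch only because
     * the monitor uses trylock and therefore cannot block on monlock. */
    res = pthread_join(th, NULL);
    pthread_mutex_unlock(&monlock);
    return res;
}
```

The trade-off is that the monitor may skip a pass whenever the lock is contended, so this only suits work that can tolerate being retried on the next iteration.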
By: David Brillert (aragon) 2013-07-18 08:07:13.596-0500
I might be experiencing the same deadlock.
Do you have a gdb trace you can upload so I can compare traces?
By: Corey Farrell (coreyfarrell) 2013-07-31 02:58:57.528-0500
gdb backtrace is from 1.8 branch.
Thread 5 is do_monitor() waiting for monlock.
Thread 16 is attempting to unload chan_sip. It holds monlock and is waiting for do_monitor() to exit (pthread_join).
Built without thread debugging, run within valgrind. I've been unable to reproduce this issue with thread debugging enabled. Thread debugging / deadlock detection adds a bunch of code to ast_mutex_lock, so presumably one of those calls reacts to pthread_cancel.
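This is consistent with how deferred cancellation works: POSIX does not list pthread_mutex_lock as a cancellation point, so a cancel request sent to a thread blocked on a mutex stays pending until that thread reaches an actual cancellation point. The small sketch below (hypothetical names waiter and cancel_is_deferred, plain pthreads rather than the ast_mutex_* wrappers) shows a cancelled thread still acquiring the lock and only terminating at pthread_testcancel.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int got_lock = 0;   /* written by waiter, read by main after join */

/* Blocks on the mutex, then reaches an explicit cancellation point. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);    /* not a cancellation point */
    got_lock = 1;
    pthread_mutex_unlock(&lock);
    pthread_testcancel();         /* a pending cancel is acted on here */
    return NULL;                  /* not reached when a cancel is pending */
}

/* Returns 1 if the cancel took effect only after the lock was acquired. */
int cancel_is_deferred(void)
{
    pthread_t th;
    void *ret = NULL;

    pthread_mutex_lock(&lock);
    if (pthread_create(&th, NULL, waiter, NULL) != 0) {
        pthread_mutex_unlock(&lock);
        return -1;
    }
    pthread_cancel(th);           /* request is queued, not acted on yet */
    pthread_mutex_unlock(&lock);  /* without this, the join would hang */
    if (pthread_join(th, &ret) != 0)
        return -1;

    return got_lock == 1 && ret == PTHREAD_CANCELED;
}
```

If the unlock before the join were removed, this program would hang in the same way as the backtrace: the canceller waits in pthread_join while the target waits for the mutex.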
By: Corey Farrell (coreyfarrell) 2014-02-25 18:02:37.029-0600
[^chan_sip-unload-deadlock-debug.patch] is not meant to be committed. If you attempt to unload chan_sip while do_monitor is in the delay, it will deadlock every time.
By: Corey Farrell (coreyfarrell) 2014-03-03 13:55:26.426-0600
Review reposted to https://reviewboard.asterisk.org/r/3284/ for switch to my new RB username.