Summary: ASTERISK-15224: Interlock between directed pickup and device state threads
Reporter: Laurent Steffan (lmsteffan)
Labels:
Date Opened: 2009-11-26 18:12:33.000-0600
Date Closed: 2009-12-03 17:25:36.000-0600
Versions:
Frequency of
Environment:
Attachments: ( 0) Pb_locks_asterisk2.txt
Description: Our Asterisk systems suffer from an interlock between two threads that eventually leads to a complete halt of the system. I include a trace of the "currently held locks" (core show locks) obtained right after such a halt (I have several other such traces, all similar).


Analysis of the locks shows that one thread, do_devstate_changes, first gets a lock on the "channels" variable (in function ast_parse_device_state). It then proceeds to examine the list of channels and tries to take a lock on one of those (presumably the one involved in the channel masquerade).

Meanwhile, the other thread, which performs the directed pickup itself, first seizes the lock on one channel in the masquerade (I am not sure whether it's the target or the original channel, but in any case it is precisely the one the other thread was trying to get). It then takes the lock on the other channel involved in the masquerade ("clonechan"), which is not a problem. Finally, when trying to hang up, this thread tries to get the lock on the "channels" variable, which is of course the one already taken by the other thread. Deadlock.

Right after that deadlock, of course, many other threads try to get the "channels" lock and very quickly grind the system to a halt.
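
As a rough illustration of the cycle described above, the state of the two threads at the moment of the halt can be written down as "locks held" and "lock wanted" sets. This is only a model, not Asterisk code: the lock identifiers, the struct, and is_interlocked() are all made-up names for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock identifiers, one bit each (not Asterisk symbols). */
enum {
    LOCK_CHANNELS_LIST = 1 << 0, /* the global "channels" lock */
    LOCK_CHANNEL       = 1 << 1, /* first channel in the masquerade */
    LOCK_CLONECHAN     = 1 << 2  /* the other channel ("clonechan") */
};

/* What one thread currently holds and what it is blocked on. */
struct thread_locks {
    unsigned held;
    unsigned wanted;
};

/* State of the two threads at the moment of the halt, as described above. */
static const struct thread_locks devstate_thread = {
    .held = LOCK_CHANNELS_LIST, .wanted = LOCK_CHANNEL
};
static const struct thread_locks pickup_thread = {
    .held = LOCK_CHANNEL | LOCK_CLONECHAN, .wanted = LOCK_CHANNELS_LIST
};

/* True when each thread waits for a lock that the other one holds. */
static bool is_interlocked(const struct thread_locks *a,
                           const struct thread_locks *b)
{
    assert(a && b);
    return (a->held & b->wanted) != 0 && (b->held & a->wanted) != 0;
}
```

If the pickup thread asked for the "channels" lock before taking any channel lock, its held set would be empty while it waits, and the same check would no longer report a cycle.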

I am still unable to solve this deadlock because I am not sure:
   1) of the order in which the locks should be taken. I think it's more logical, and in keeping with the Coding Guidelines, to first take the "channels" lock and then the lock on the channel itself. If that's correct, it means the lock order should be changed somehow in app_directed_pickup.c.
   2) of the effects of changing this order on other parts of the code, in particular whether it might introduce other deadlocks
   3) of the particular construct to be used for solving this problem (trylock? unlock/lock again? other?)
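
For point 3, one construct that might apply is a trylock with back-off: while holding the channel lock, never block on the "channels" lock. If it is busy, release the channel lock, yield, re-take it, and try again. The sketch below uses plain POSIX mutexes rather than the Asterisk locking wrappers, and every name in it is hypothetical. It only shows the pattern and does not establish that this is the right change for app_directed_pickup.c.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-ins for the global "channels" lock and one channel's lock. */
static pthread_mutex_t channels_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t chan_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Call with `chan` already locked by the caller. Acquires `list` without
 * ever blocking on it while `chan` is held. On success both locks are held
 * and the number of back-offs is returned; on an unexpected trylock error
 * -1 is returned and only `chan` is held.
 */
static int lock_list_while_holding_channel(pthread_mutex_t *chan,
                                           pthread_mutex_t *list)
{
    int backoffs = 0;
    int rc;

    assert(chan != list); /* the pattern needs two distinct locks */

    while ((rc = pthread_mutex_trylock(list)) != 0) {
        if (rc != EBUSY) {
            return -1; /* real error, not mere contention */
        }
        /* Give the thread that holds `list` a chance to take `chan`. */
        pthread_mutex_unlock(chan);
        sched_yield();
        pthread_mutex_lock(chan);
        backoffs++;
    }
    return backoffs;
}
```

Because the channel lock is dropped during each back-off, any state read from the channel before the call has to be re-checked afterwards. That is the main cost of this construct compared with changing the lock order.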
Comments:
By: David Vossel (dvossel) 2009-12-02 18:06:05.000-0600


r222761 may have resolved this.

By: Russell Bryant (russell) 2009-12-03 17:25:36.000-0600

This was fixed in trunk rev 222761.