[Home]

Summary:ASTERISK-11200: Multiple deadlock / show channels freeze on Queues 1.4.16/17
Reporter:Doug (doug)Labels:
Date Opened:2008-01-10 08:51:21.000-0600Date Closed:2008-01-16 11:20:11.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_agent
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 10min_after_lock.rtf
( 1) 11730.patch
( 2) 6min_after_lock.rtf
( 3) CLI_output.rtf
Description:Hi all

I have a deadlock bug on queues that will happen 2-3 times daily, where there will be a call the gets stuck in the queue and will not hangup untill restart of Asterisk. When this occurs if we use the "show channels" command the CLI will freeze but we are able to logout and back in to get the CLI running again. If left to long there will be problems with Agent state reporting and some agents will not recive calls unless they log out and in again (where their state is reset) or we reload the queue app. We cannot see the ZAP channel due to the "show channels" freezing CLI and the same will happen using tab with "soft hangup". I am able to see the ZAP channel in the queue and have tried "core destroy channel" with no success, this will kill the ZAP channel but leave the Queued call. The only way that we can clear the lock is a full restart of Asterisk. Below please find the CLI outputs and attached the Thread Debug info on the lock.

****** ADDITIONAL INFORMATION ******

*CLI> show queues

default-grey has 1 calls (max unlimited) in 'rrmemory' strategy (26s holdtime), W:0, C:433, A:91, SL:97.2% within 60s
  Members:
     Local/8020@default-agent/n (dynamic) (Unknown) has taken 1 calls (last was 2781 secs ago)
     Local/8010@default-agent/n (dynamic) (Unknown) has taken 1 calls (last was 6 secs ago)
     Local/8021@default-agent/n (dynamic) (Unknown) has taken no calls yet
     Local/8005@default-agent/n (dynamic) (Unknown) has taken 2 calls (last was 1470 secs ago)
     Local/8009@default-agent/n (dynamic) (Unknown) has taken 5 calls (last was 44 secs ago)
     Local/8022@default-agent/n (dynamic) (Unknown) has taken 5 calls (last was 132 secs ago)
     Local/8002@default-agent/n (dynamic) (Unknown) has taken 2 calls (last was 87 secs ago)
     Local/8016@default-agent/n (dynamic) (In use) has taken 12 calls (last was 27 secs ago)
     Local/8006@default-agent/n (dynamic) (Not in use) has taken 12 calls (last was 297 secs ago)
     Local/8017@default-agent/n (dynamic) (In use) has taken 2 calls (last was 49 secs ago)
     Local/8007@default-agent/n (dynamic) (In use) has taken 7 calls (last was 2089 secs ago)
     Local/8030@default-agent/n (dynamic) (Unknown) has taken no calls yet
  Callers:
     1. Zap/26-1 (wait: 409:33, prio: 0)


*CLI> show channels
Channel              Location             State   Application(Data)
SIP/5000-094f8f20    (None)               Up      Bridged Call(SIP/8016-b7c47c18
SIP/8016-b7c47c18    s@macro-default-dial Up      Dial(SIP/0833868114@5000|60)
SIP/8556-093c28c0    8556@default-default Ringing AppDial((Outgoing Line))
Local/8556@default-d s@macro-default-dial Ring    Dial(SIP/8556|30|tkK|)
Local/8556@default-d 8556@default-default Down    (None)
Zap/9-1              8538@zap-incoming:18 Up      Dial(SIP/8538|60)
SIP/8002-093a1da0    (None)               Up      Bridged Call(Local/8002@defaul
Local/8002@default-a 8002@default-local-d Up      Dial(SIP/8002||t)
Local/8002@default-a 8002@default-agent:1 Up      Bridged Call(Zap/5-1)
SIP/8009-0939f5e0    (None)               Up      Bridged Call(Local/8009@defaul
Local/8009@default-a 8009@default-local-d Up      Dial(SIP/8009||t)
Local/8009@default-a 8009@default-agent:1 Up      Bridged Call(Zap/6-1)
Zap/5-1              s@default-applicatio Up      Queue(default-greyhound|t|||30
Zap/6-1              s@default-applicatio Up      Queue(default-greyhound|t|||30
SIP/8021-09304a80    (None)               Up      Bridged Call(Local/8021@defaul
Local/8021@default-a 8021@default-local-d Up      Dial(SIP/8021||t)
Local/8021@default-a 8021@default-agent:1 Up      Bridged Call(Zap/4-1)
SIP/8030-0964efd0    (None)               Up      Bridged Call(Local/8030@defaul
Local/8030@default-a 8030@default-local-d Up      Dial(SIP/8030||t)
Local/8030@default-a 8030@default-agent:1 Up      Bridged Call(Zap/3-1)
Zap/4-1              s@default-applicatio Up      Queue(default-greyhound|t|||30
Zap/3-1              s@default-applicatio Up      Queue(default-greyhound|t|||30
SIP/5000-09337938    (None)               Up      Bridged Call(SIP/8005-b7c28da8
SIP/8005-b7c28da8    s@macro-default-dial Up      Dial(SIP/0828358380@5000|60)
SIP/8022-093637f8    (None)               Up      Bridged Call(Local/8022@defaul
Local/8022@default-a 8022@default-local-d Up      Dial(SIP/8022||t)
Local/8022@default-a 8022@default-agent:1 Up      Bridged Call(Zap/13-1)
Zap/13-1             s@default-applicatio Up      Queue(default-greyhound|t|||30
*CLI>

Thread Debug Info:

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3085925280 (listener             started at [ 1063] asterisk.c ast_makesocket())
=== -------------------------------------------------------------------
===
=== Thread ID: 3085421472 (do_devstate_changes  started at [  378] devicestate.c ast_device_state_engine_init())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x9297908 (1)
=== ---> Tried and failed to get Lock #1 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92906e0 (1)
=== ---> Tried and failed to get Lock #2 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92a9900 (1)
=== ---> Tried and failed to get Lock #3 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92c8dc8 (0)
=== ---> Tried and failed to get Lock #4 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x9298558 (1)
=== ---> Tried and failed to get Lock ASTERISK-1 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92a5f28 (29)
=== ---> Tried and failed to get Lock ASTERISK-2 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7a14d60 (0)
=== ---> Tried and failed to get Lock ASTERISK-3 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x9326ad0 (1)
=== ---> Tried and failed to get Lock ASTERISK-4 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x933d348 (0)
=== ---> Tried and failed to get Lock ASTERISK-5 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92f9e48 (3)
=== ---> Tried and failed to get Lock ASTERISK-6 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7a1ed50 (0)
=== ---> Tried and failed to get Lock ASTERISK-7 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92f3538 (2)
=== ---> Tried and failed to get Lock ASTERISK-8 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92dbf78 (1)
=== ---> Tried and failed to get Lock ASTERISK-9 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x9299468 (0)
=== ---> Tried and failed to get Lock ASTERISK-10 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7c184f8 (0)
=== ---> Tried and failed to get Lock ASTERISK-11 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7a11b08 (0)
=== ---> Tried and failed to get Lock ASTERISK-12 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x928f9d8 (3)
=== ---> Tried and failed to get Lock ASTERISK-13 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92efab8 (2)
=== ---> Tried and failed to get Lock ASTERISK-14 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7a51ed0 (0)
=== ---> Tried and failed to get Lock ASTERISK-15 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92f9938 (0)
=== ---> Tried and failed to get Lock ASTERISK-16 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92aa3f8 (1)
=== ---> Tried and failed to get Lock ASTERISK-17 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x93388c8 (0)
=== ---> Tried and failed to get Lock ASTERISK-18 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7a1aeb8 (0)
=== ---> Tried and failed to get Lock ASTERISK-19 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7a2fca8 (0)
=== ---> Tried and failed to get Lock ASTERISK-20 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7a310c8 (0)
=== ---> Tried and failed to get Lock ASTERISK-21 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7b124b8 (0)
=== ---> Tried and failed to get Lock ASTERISK-22 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x933fe78 (0)
=== ---> Tried and failed to get Lock ASTERISK-23 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7c18158 (0)
=== ---> Tried and failed to get Lock ASTERISK-24 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7b26a18 (0)
=== ---> Tried and failed to get Lock ASTERISK-25 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7c1c100 (0)
=== ---> Tried and failed to get Lock ASTERISK-26 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0xb7b13d98 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 3085667232 (accept_thread        started at [ 3041] manager.c init_manager())
=== -------------------------------------------------------------------
===
=== Thread ID: 3084929952 (do_parking_thread    started at [ 2435] res_features.c load_module())
=== -------------------------------------------------------------------
===
=== Thread ID: 3085175712 (monmp3thread         started at [  913] res_musiconhold.c moh_register())
=== -------------------------------------------------------------------
===
=== Thread ID: 15358880 (do_monitor           started at [15638] chan_sip.c restart_monitor())
=== ---> Tried and failed to get Lock #0 (chan_sip.c): MUTEX 15506 do_monitor &sip->lock 0x92ce748 (11)
=== -------------------------------------------------------------------
===
=== Thread ID: 10337184 (sla_thread           started at [ 4793] app_meetme.c sla_load_config())
=== -------------------------------------------------------------------
===
=== Thread ID: 11885472 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 13884320 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 14130080 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 14375840 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 14621600 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 14867360 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 15113120 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 16210848 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 16456608 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 84069280 (iax2_process_thread  started at [ 8889] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 24525728 (sched_thread         started at [ 8899] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 24771488 (network_thread       started at [ 8900] chan_iax2.c start_network_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 118590368 (scan_thread          started at [  499] pbx_spool.c load_module())
=== -------------------------------------------------------------------
===
=== Thread ID: 25807776 (device_state_thread  started at [ 4816] app_queue.c load_module())
=== -------------------------------------------------------------------
===
=== Thread ID: 41139104 (pri_dchannel         started at [ 9478] chan_zap.c start_pri())
=== -------------------------------------------------------------------
===
=== Thread ID: 25209760 (do_monitor           started at [ 6994] chan_zap.c restart_monitor())
=== -------------------------------------------------------------------
===
=== Thread ID: 114748320 (pri_dchannel         started at [ 9478] chan_zap.c start_pri())
=== -------------------------------------------------------------------
===
=== Thread ID: 119643040 (monitor_sig_flags    started at [ 2996] asterisk.c main())
=== -------------------------------------------------------------------
===
=== Thread ID: 67152800 (session_do           started at [ 2393] manager.c accept_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 88021920 (autoservice_run      started at [  194] autoservice.c ast_autoservice_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 44878752 (session_do           started at [ 2393] manager.c accept_thread())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1083 channel_find_locked &c->lock 0x92c88d0 (139561)
=== ---> Lock #1 (manager.c): RDLOCK 2158 process_message &actionlock 0x815fee0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 27081632 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 100363168 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 95910816 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2219 __ast_read &chan->lock 0x92ae4f0 (2)
=== ---> Waiting for Lock #1 (chan_local.c): MUTEX 287 local_write &p->lock 0x92c3ea8 (1)
=== --- ---> Locked Here: chan_local.c line 492 (local_hangup)
=== -------------------------------------------------------------------
===
=== Thread ID: 28556192 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 72723360 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 1733 ast_hangup &chan->lock 0x92c88d0 (1)
=== ---> Lock #1 (chan_local.c): MUTEX 492 local_hangup &p->lock 0x92c3ea8 (1)
=== ---> Waiting for Lock #2 (pbx.c): MUTEX 5858 pbx_builtin_setvar_helper &chan->lock 0x92ae4f0 (1)
=== --- ---> Locked Here: channel.c line 2219 (__ast_read)
=== --- ---> Locked Here: channel.c line 2815 (ast_write)
=== -------------------------------------------------------------------
===
=== Thread ID: 37383072 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 29346720 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 133745568 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 39357344 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 100617120 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 114129824 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 28310432 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 32021408 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 27818912 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 62958496 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 43244448 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 102763424 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 27327392 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 107244448 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 119049120 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 71981984 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 27573152 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 124558240 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 70548384 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 67808160 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 127876000 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 97115040 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 28064672 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 30084000 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 29838240 (pbx_thread           started at [ 2647] pbx.c ast_pbx_start())
=== -------------------------------------------------------------------
===
=== Thread ID: 81742752 (mixmonitor_thread    started at [  301] app_mixmonitor.c launch_monitor_thread())
=== -------------------------------------------------------------------
===
=== Thread ID: 88570784 (netconsole           started at [ 1005] asterisk.c listener())
=== -------------------------------------------------------------------
===
=======================================================================
Comments:By: callguy (callguy) 2008-01-10 09:35:53.000-0600

putnopvut: if you look at the core show locks output from 11712 at least part of it looks similar to what we are seeing there. i can't confirm that the same use case is triggering it but thought it was worth noting.

By: Doug (doug) 2008-01-10 09:47:25.000-0600

This is the same problem however we only get the problem when a call is stuck in the queue other that that all calls proccess fine and no other issues will occur so far. So my issue is most likely related to a locked queue call.

By: callguy (callguy) 2008-01-10 10:07:37.000-0600

udi-doug: i can't definitively say if we have a hung call in a queue that is triggering it (we have a lot of activity on this particular box), but it is plausible as we do have queues on there. we are also seeing the same behavior - aside from the strange console behavior the system does continue to function without issue until restart (which is the only way to clear it up).

By: Mark Michelson (mmichelson) 2008-01-10 10:20:47.000-0600

I am willing to tentatively state that this and 11712 are actually different deadlocks.

This deadlock appears to be a classic case of lock order being incorrect, If you check thread ID's 95910816 and 72723360 in 6min_after_lock.rtf, you'll see that each is holding the lock that the other is trying to obtain, and a cursory look at the output seems to indicate that the problem has to do with local channels. I'll investigate this further and give more detailed information about it later.

11712 on the other hand, appears to be a problem where a lock is never released in chan_sip for some reason. What's interesting about it is that the problem there is actually an infinite loop (due to locking issues though). I'll add a more detailed note to that issue explaining the problem.

By: Mark Michelson (mmichelson) 2008-01-10 12:54:15.000-0600

The problem had to do with the fact that the code for hanging up a local channel violated the required locking order regarding private structures and channels. I'm guessing that you saw this behavior in your queues because your queue members are local channels.

I am uploading 11730.patch. This introduces deadlock avoidance into the local_hangup function. Please test this and report if it has fixed the problem.

By: Doug (doug) 2008-01-11 00:26:50.000-0600

Thanks for the help and patch i will istall this and report back within the next 2 days due to my only being able to patch this system at odd times as it happens to be a 24/7 call center. Thanks again.

By: Doug (doug) 2008-01-16 01:00:41.000-0600

Thanks for the patch! All is running 100% now have had no locks at all. This ticket can be closed.

By: Mark Michelson (mmichelson) 2008-01-16 11:12:37.000-0600

All right. I will commit the patch and close the issue soon afterwards. I have added a relationship to issue ASTERISK-10707, since the latest core show locks output from there indicates the same deadlock was occurring. I will be closing that issue as well with a similar note attached.

Edit: Made the reference to 11181 a link so that it could be easily accessed.



By: Mark Michelson (mmichelson) 2008-01-16 11:20:10.000-0600

The patch has been committed to 1.4 in revision 98964 and to trunk in revision 98965. Thanks for bringing the bug to our attention and thanks for testing the patch!