[Home]

Summary:ASTERISK-00579: Deadlock in chan_agent
Reporter:tclark (tclark)Labels:
Date Opened:2003-11-26 19:08:37.000-0600Date Closed:2011-06-07 14:04:43
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_queue
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:
Description:Well I thought we were good to go here we ran 70+ hrs here with out a dead lock after tholo s deadlock fix,
but here are the gory dtls
Looks like Thread 17 want a mutex lock on the agent parent channel but some has it locked else where ...


****** ADDITIONAL INFORMATION ******

(gdb) thread apply all bt

Thread 24 (Thread 3589144 (LWP 2928)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0x00000000 in ?? ()

Thread 23 (Thread 3576854 (LWP 2910)):
#0 0x400c787e in sigsuspend () from /lib/libc.so.6
#1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
#3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x403ed85d in agent_hangup (ast=0x82ee760) at chan_agent.c:495
ASTERISK-1 0x08056dd9 in ast_hangup (chan=0x82ee760) at channel.c:664
ASTERISK-2 0x405a5fc2 in try_calling (qe=0xbcffede0, options=0xbcffeef2 "t", announceoverride=0x0, url=0x0) at app_queue.c:339
ASTERISK-3 0x405a6d45 in queue_exec (chan=0x82fb988, data=0xbcfff304) at app_queue.c:1162
ASTERISK-4 0x08060c30 in pbx_exec (c=0x82fb988, app=0x80f75d0, data=0xbcfff304, newstack=1) at pbx.c:396
ASTERISK-5 0x08062c73 in pbx_extension_helper (c=0x82fb988, context=0x82fbae0 "dispatchq", exten=0x82fbbd4 "s", priority=1,
callerid=0x82fb5d8 "\"BONNER J \" <2507530954>", action=1) at pbx.c:1157
ASTERISK-6 0x080639ad in ast_pbx_run (c=0x82fb988) at pbx.c:1641
ASTERISK-7 0x4060173c in ss_thread (data=0x82fb988) at chan_zap.c:4488
ASTERISK-8 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 22 (Thread 3588117 (LWP 2927)):
#0 0x400c787e in sigsuspend () from /lib/libc.so.6
#1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
#3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x403eeb92 in agents_show (fd=56, argc=2, argv=0xbd1ff674) at chan_agent.c:975
ASTERISK-1 0x0806b318 in ast_cli_command (fd=56, s=0xbd1ff824 "show agents") at cli.c:1007
ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7060) at asterisk.c:193
ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 21 (Thread 3593236 (LWP 2953)):
#0 0x400c787e in sigsuspend () from /lib/libc.so.6
#1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
#3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x403eeb92 in agents_show (fd=65, argc=2, argv=0xbd3ff634) at chan_agent.c:975
ASTERISK-1 0x0806b318 in ast_cli_command (fd=65, s=0xbd3ff7e4 "show agents") at cli.c:1007
---Type <return> to continue, or q <return> to quit---
ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7090) at asterisk.c:193
ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0
ASTERISK-4 0x40020101 in pthread_start_thread_event () from /lib/libpthread.so.0

Thread 20 (Thread 3592211 (LWP 2946)):
#0 0x400c787e in sigsuspend () from /lib/libc.so.6
#1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
#3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x403eeb92 in agents_show (fd=60, argc=2, argv=0xbd5ff634) at chan_agent.c:975
ASTERISK-1 0x0806b318 in ast_cli_command (fd=60, s=0xbd5ff7e4 "show agents") at cli.c:1007
ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7080) at asterisk.c:193
ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0
ASTERISK-4 0x40020101 in pthread_start_thread_event () from /lib/libpthread.so.0

Thread 19 (Thread 46098 (LWP 30609)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0x00000000 in ?? ()

Thread 18 (Thread 3590161 (LWP 2931)):
#0 0x400c787e in sigsuspend () from /lib/libc.so.6
#1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
#3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x403eeb92 in agents_show (fd=53, argc=2, argv=0xbd9ff674) at chan_agent.c:975
ASTERISK-1 0x0806b318 in ast_cli_command (fd=53, s=0xbd9ff824 "show agents") at cli.c:1007
ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7070) at asterisk.c:193
ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 17 (Thread 2795536 (LWP 2056)):
#0 0x400c787e in sigsuspend () from /lib/libc.so.6
#1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
#3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x403ee42a in check_availability (newlyavailable=0x80e6658, needlock=0) at chan_agent.c:812
ASTERISK-1 0x403eff23 in __login_exec (chan=0x82bade8, data=0xbdbff304, callbackmode=0) at chan_agent.c:1233
ASTERISK-2 0x403f0758 in login_exec (chan=0x82bade8, data=0xbdbff304) at chan_agent.c:1305
ASTERISK-3 0x08060c30 in pbx_exec (c=0x82bade8, app=0x80e54f0, data=0xbdbff304, newstack=1) at pbx.c:396
---Type <return> to continue, or q <return> to quit---
ASTERISK-4 0x08062c73 in pbx_extension_helper (c=0x82bade8, context=0x82baf40 "agentlogin1", exten=0x82bb034 "*3", priority=2,
callerid=0x82b0330 "\"OE 1\" <(250)-753-1231>", action=1) at pbx.c:1157
ASTERISK-5 0x080639ad in ast_pbx_run (c=0x82bade8) at pbx.c:1641
ASTERISK-6 0x406001f7 in ss_thread (data=0x82bade8) at chan_zap.c:4202
ASTERISK-7 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 16 (Thread 78863 (LWP 30670)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0xbddfaa74 in ?? ()
#2 0x0807dcef in session_do (data=0x812b058) at manager.c:686
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 15 (Thread 13326 (LWP 30566)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0x40612a04 in __dso_handle () from /usr/lib/asterisk/modules/chan_zap.so
#2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 14 (Thread 12301 (LWP 30565)):
#0 0x4013ede1 in nanosleep () from /lib/libc.so.6
#1 0x4013ed78 in sleep () from /lib/libc.so.6
#2 0x405815bc in qcall (ignore=0x0) at app_qcall.c:167
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 13 (Thread 11276 (LWP 30564)):
#0 0x4013ede1 in nanosleep () from /lib/libc.so.6
#1 0x4013ed78 in sleep () from /lib/libc.so.6
#2 0x40544fda in scan_thread (unused=0x0) at pbx_spool.c:288
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 12 (Thread 10251 (LWP 30563)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0x401d3104 in __dso_handle () from /usr/lib/asterisk/modules/pbx_wilcalu.so
#2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 11 (Thread 9226 (LWP 30562)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0x4053ac94 in __dso_handle () from /usr/lib/asterisk/modules/chan_phone.so
#2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0
---Type <return> to continue, or q <return> to quit---

Thread 10 (Thread 8201 (LWP 30561)):
#0 0x40166bb0 in poll () from /lib/libc.so.6
#1 0x0805148b in ast_io_wait (ioc=0x80e5a98, howlong=1000) at io.c:254
#2 0x404046cb in network_thread (ignore=0x0) at chan_iax2.c:5027
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 9 (Thread 7176 (LWP 30560)):
#0 0x40166bb0 in poll () from /lib/libc.so.6
#1 0x0805148b in ast_io_wait (ioc=0x80e1c08, howlong=1000) at io.c:254
#2 0x403d848b in do_monitor (data=0x0) at chan_sip.c:5480
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 8 (Thread 6151 (LWP 30559)):
#0 0x40166bb0 in poll () from /lib/libc.so.6
#1 0x0805148b in ast_io_wait (ioc=0x80df1d0, howlong=1000) at io.c:254
#2 0x402da857 in network_thread (ignore=0x0) at chan_iax.c:4521
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 7 (Thread 5126 (LWP 30556)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0x401cc3d4 in __dso_handle () from /usr/lib/asterisk/modules/res_parking.so
#2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 6 (Thread 4101 (LWP 30554)):
#0 0x401613c4 in read () from /lib/libc.so.6
#1 0x4002dae0 in __DTOR_END__ () from /lib/libpthread.so.0
#2 0x401bd988 in monmp3thread (data=0x80d40b0) at res_musiconhold.c:234
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 5 (Thread 3076 (LWP 30552)):
#0 0x401613c4 in read () from /lib/libc.so.6
#1 0x4002dae0 in __DTOR_END__ () from /lib/libpthread.so.0
#2 0x401bd988 in monmp3thread (data=0x80d3e38) at res_musiconhold.c:234
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 4 (Thread 2051 (LWP 30551)):
#0 0x4016e652 in accept () from /lib/libc.so.6
---Type <return> to continue, or q <return> to quit---
#1 0x400254fb in accept () from /lib/libpthread.so.0
#2 0x0807dfbd in accept_thread (ignore=0x0) at manager.c:729
#3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 3 (Thread 1026 (LWP 30550)):
#0 0x401687ee in select () from /lib/libc.so.6
#1 0x00000041 in ?? ()
#2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0

Thread 2 (Thread 2049 (LWP 30549)):
#0 0x40166bb0 in poll () from /lib/libc.so.6
#1 0x4001fd96 in __pthread_manager () from /lib/libpthread.so.0

Thread 1 (Thread 1024 (LWP 30547)):
#0 0x401613c4 in read () from /lib/libc.so.6
#1 0x4002dae0 in __DTOR_END__ () from /lib/libpthread.so.0
#2 0x0809009f in read_char (el=0x80c7e68, cp=0xbffff70b "\b|?\b\bh~\f\bP") at read.c:302
#3 0x0809013d in el_getc (el=0x80c7e68, cp=0xbffff70b "\b|?\b\bh~\f\bP") at read.c:347
#4 0x0808ffaa in read_getcmd (el=0x80c7e68, cmdnum=0xbffff70a "", ch=0xbffff70b "\b|?\b\bh~\f\bP") at read.c:243
ASTERISK-1 0x080903e5 in el_gets (el=0x80c7e68, nread=0xbffff750) at read.c:443
ASTERISK-2 0x080824ef in main (argc=2, argv=0xbffffae4) at asterisk.c:1473
#0 0x401613c4 in read () from /lib/libc.so.6
(gdb)

gdb) info thread
24 Thread 3589144 (LWP 2928) 0x401687ee in select () from /lib/libc.so.6
23 Thread 3576854 (LWP 2910) 0x400c787e in sigsuspend () from /lib/libc.so.6
22 Thread 3588117 (LWP 2927) 0x400c787e in sigsuspend () from /lib/libc.so.6
21 Thread 3593236 (LWP 2953) 0x400c787e in sigsuspend () from /lib/libc.so.6
20 Thread 3592211 (LWP 2946) 0x400c787e in sigsuspend () from /lib/libc.so.6
19 Thread 46098 (LWP 30609) 0x401687ee in select () from /lib/libc.so.6
18 Thread 3590161 (LWP 2931) 0x400c787e in sigsuspend () from /lib/libc.so.6
17 Thread 2795536 (LWP 2056) 0x400c787e in sigsuspend () from /lib/libc.so.6
16 Thread 78863 (LWP 30670) 0x401687ee in select () from /lib/libc.so.6
15 Thread 13326 (LWP 30566) 0x401687ee in select () from /lib/libc.so.6
14 Thread 12301 (LWP 30565) 0x4013ede1 in nanosleep () from /lib/libc.so.6
13 Thread 11276 (LWP 30564) 0x4013ede1 in nanosleep () from /lib/libc.so.6
12 Thread 10251 (LWP 30563) 0x401687ee in select () from /lib/libc.so.6
11 Thread 9226 (LWP 30562) 0x401687ee in select () from /lib/libc.so.6
10 Thread 8201 (LWP 30561) 0x40166bb0 in poll () from /lib/libc.so.6
9 Thread 7176 (LWP 30560) 0x40166bb0 in poll () from /lib/libc.so.6
8 Thread 6151 (LWP 30559) 0x40166bb0 in poll () from /lib/libc.so.6
7 Thread 5126 (LWP 30556) 0x401687ee in select () from /lib/libc.so.6
6 Thread 4101 (LWP 30554) 0x401613c4 in read () from /lib/libc.so.6
5 Thread 3076 (LWP 30552) 0x401613c4 in read () from /lib/libc.so.6
4 Thread 2051 (LWP 30551) 0x4016e652 in accept () from /lib/libc.so.6
3 Thread 1026 (LWP 30550) 0x401687ee in select () from /lib/libc.so.6
2 Thread 2049 (LWP 30549) 0x40166bb0 in poll () from /lib/libc.so.6
1 Thread 1024 (LWP 30547) 0x401613c4 in read () from /lib/libc.so.6
Comments:By: tclark (tclark) 2003-11-27 09:19:46.000-0600

I found in the logs exactly where the system deadlock occurs
the moment after i

Nov 26 12:23:27 NOTICE[3583000]: Queue 'taxidispatchq' Join, Channel 'Zap/29-1',
CallerID; "MILLER R - C   " <2507585676>
Position '5', Cnt '5'
Nov 26 12:23:29 WARNING[3575828]: No class: none
Nov 26 12:23:29 VERBOSE[3575828]:   == Spawn extension (dispatchq, s, 1) exited non-zero on 'Zap/42-1'
Nov 26 12:23:29 VERBOSE[3575828]:     -- Hungup 'Zap/42-1'
Nov 26 12:23:30 VERBOSE[2795536]:     -- Playing 'beep' (language 'en')
***Right here is the dead lock Normally we'd see Agent/P: answer
-- Playing 'beep' (language 'en')
-- Agent/P:2-28722 answered Zap/42-1
***
Nov 26 12:23:31 VERBOSE[3580945]:     -- Hungup 'Zap/1-1'
Nov 26 12:23:31 VERBOSE[3580945]:   == Spawn extension (dispatchlogin1, 1, 2) exited non-zero on 'Local/1@dispatchlogin1-6
f6f,2'
Nov 26 12:23:55 VERBOSE[3581973]:     -- User disconnected while waiting their turn
Nov 26 12:23:55 NOTICE[3581973]: Queue 'taxidispatchq' Leave, Channel 'Zap/30-1', CallerID; "BONNER J       " <2507530954>
Position '4', Cnt '4'


This is the debug output from the moment of deadlock
ov 26 12:23:30 DEBUG[2795536]: Wrapup time expired!
Nov 26 12:23:30 DEBUG[2795536]: Checking availability of '6'
Nov 26 12:23:30 DEBUG[2795536]: Call 'Agent/P:2-25920' looks like a winner for agent '6'
Nov 26 12:23:30 DEBUG[2795536]: Playing beep, lang 'en'
Nov 26 12:23:30 DEBUG[2795536]: Scheduling timer at 160 sample intervals
Nov 26 12:23:30 DEBUG[2795536]: Played beep, result '0'


This is the srcs where we deadlock in chan_agent.812
ast_log( LOG_DEBUG, "Playing beep, lang '%s'\n", newlyavailable->chan->language);
res = ast_streamfile(newlyavailable->chan, "beep", newlyavailable->chan->language);
ast_log( LOG_DEBUG, "Played beep, result '%d'\n", res);
if (!res) {
res = ast_waitstream(newlyavailable->chan, "");
ast_log( LOG_DEBUG, "Waited for stream, result '%d'\n", res);
}
}
if (!res) {
/* Note -- parent may have disappeared */
if (p->abouttograb) {
newlyavailable->acknowledged = 1;
ast_setstate(parent, AST_STATE_UP);
ast_setstate(chan, AST_STATE_UP);
strncpy(parent->context, chan->context, sizeof(parent->context) - 1);
/* Go ahead and mark the channel as a zombie so that masquerade will
  destroy it for us, and we need not call ast_hangup */

--->DEAD LOCK HERE   ast_mutex_lock(&parent->lock);

OK So any gdb god tell me how i can see what fscking thread
has the ' parent->lock ', is there some command/technique to what process has this lock

By: tclark (tclark) 2003-11-28 20:04:50.000-0600

I am gonna ans myself here so any other developer w/o linux thread knowledge
will at least get my understanding...
1) in the asterisk makefile you need to uncomment
# Optional debugging parameters
DEBUG_THREADS = #-DDEBUG_THREADS #-DDO_CRASH

2) When you deadlock dont restart the box  or restart *
  attach gdb to the running * process
  gdb /usr/sbin/astrisk <pid of main * process>

3) after gdb loads  do
  info thread
  thread apply all bt

4) Identify dead locked threads by this pattern
  note the "_pthread_wait_for_restart_signal"
  that mean we are in wait loop wanting the mutex lock
#0 0x400c787e in sigsuspend () from /lib/libc.so.6
#1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
#3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
You will find more than one of these patterns usually because once a thread is dead locked on a mutex lock, other threads that want the same lock will pile up quickly


5) Try to identify the first thread, that is dead locked.
  The sequence number of bt threads is not relevent, bcus threads are re-used.
  Look in your log files at the time stamps and try to corrolate the
  bt THREAD number to the earliest entry in the log file with that same
  THREAD number eg ( VERBOSE[3575828]),
  do a  bt thread  that has a FRAME doing "pthread_mutex_lock()"
  Log files are usually in /var/log/asterisk/[messages][debug]

6) Now that we have our potential guilty party as the
  1st in line for the lock do
  thread <sequence number> (for the THREAD of interest)
  frame <fame number> (for the frame # b4 the pthread_mutex_lock () )
  this now should be in our * sources right where we call ast_mutex()
  record the name of the lock it was trying to get

7) Now if we have properly turned on thread debugging we are going to
  be able to see into the include/asterisk/lock.f ast_mutex_t struc
  which looks like this
pthread_mutex_t mutex;
char *file;
int lineno;
char *func;
pthread_t thread;
 so now that we have our lock we can see who has it & what we are waiting on
 do the following bt cmds
 p somelockIjustFound->thread
 p somelockIjustFound->file
 p somelockIjustFound->func
 p somelockIjustFound->lineno

This is the guility code that is holding the lock that we want
to look at


8) Now comes the hard part
  Why is this Code in that thread, file, function, lineno
  not releasing or fscking lock !
  We have to now scour the code checking all places where that lock is
  set & released looking for
  hmm ...
  1) places where there is a lock hierarachy and
     locks are set and released in different order,
     (ie. is same rules for locking rows in a sql db,
     lock & release in the same order everwhere, )
  2) places where a lock is held toooo long
     for/while loops, b4 function calls etc
     (this is the same rule as sql db transactions get in get out quick)
  ...OK I am out of steam
      if some more season'd linux vet could now pick this up
      & tell us what else to look for specfic to * mutex locks

edited on: 11-28-03 21:01

By: Anthony Minessale (anthm) 2003-12-01 09:14:53.000-0600

if you find the code that plays the "beep" in the function 'agent_call'

that is what causes it.  I played around with trying to unlock it
but it's 3 functions deep and the whole mod is locked at the time that
little beep is played so even if you get it to unlock, it pauses the whole
system while the beep is playing.  I decided to yank that beep from chan_agents
and add it to the app_queue cos I can live with no beep in non-queue calls

see http://bugs.digium.com/bug_view_page.php?bug_id=0000549 for patch

edited on: 12-01-03 11:49

By: Olle Johansson (oej) 2004-03-21 16:20:25.000-0600

Is this still a bug with current CVS?

By: Olle Johansson (oej) 2004-04-14 03:03:49

Hello, anyone out there?

By: tclark (tclark) 2004-04-17 16:21:49

Yes its still a bug with race condition and you have to combine
: groups and non : groups using agentcallbacks and agentlogins
but i close it a sweep it under the mat :)
bcus there are a ton more places it can break with : groups

By: Brian West (bkw918) 2004-04-17 22:49:55

If this isn't what you wanted let me know.
bkw