Summary: | ASTERISK-00579: Deadlock in chan_agent | ||
Reporter: | tclark (tclark) | Labels: | |
Date Opened: | 2003-11-26 19:08:37.000-0600 | Date Closed: | 2011-06-07 14:04:43 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Applications/app_queue |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ||
Description: | Well I thought we were good to go here we ran 70+ hrs here with out a dead lock after tholo s deadlock fix, but here are the gory dtls Looks like Thread 17 want a mutex lock on the agent parent channel but some has it locked else where ... ****** ADDITIONAL INFORMATION ****** (gdb) thread apply all bt Thread 24 (Thread 3589144 (LWP 2928)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0x00000000 in ?? () Thread 23 (Thread 3576854 (LWP 2910)): #0 0x400c787e in sigsuspend () from /lib/libc.so.6 #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0x403ed85d in agent_hangup (ast=0x82ee760) at chan_agent.c:495 ASTERISK-1 0x08056dd9 in ast_hangup (chan=0x82ee760) at channel.c:664 ASTERISK-2 0x405a5fc2 in try_calling (qe=0xbcffede0, options=0xbcffeef2 "t", announceoverride=0x0, url=0x0) at app_queue.c:339 ASTERISK-3 0x405a6d45 in queue_exec (chan=0x82fb988, data=0xbcfff304) at app_queue.c:1162 ASTERISK-4 0x08060c30 in pbx_exec (c=0x82fb988, app=0x80f75d0, data=0xbcfff304, newstack=1) at pbx.c:396 ASTERISK-5 0x08062c73 in pbx_extension_helper (c=0x82fb988, context=0x82fbae0 "dispatchq", exten=0x82fbbd4 "s", priority=1, callerid=0x82fb5d8 "\"BONNER J \" <2507530954>", action=1) at pbx.c:1157 ASTERISK-6 0x080639ad in ast_pbx_run (c=0x82fb988) at pbx.c:1641 ASTERISK-7 0x4060173c in ss_thread (data=0x82fb988) at chan_zap.c:4488 ASTERISK-8 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 22 (Thread 3588117 (LWP 2927)): #0 0x400c787e in sigsuspend () from /lib/libc.so.6 #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0x403eeb92 in agents_show (fd=56, argc=2, argv=0xbd1ff674) at chan_agent.c:975 ASTERISK-1 0x0806b318 in ast_cli_command (fd=56, s=0xbd1ff824 "show agents") at cli.c:1007 ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7060) at asterisk.c:193 ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 21 (Thread 3593236 (LWP 2953)): #0 0x400c787e in sigsuspend () from /lib/libc.so.6 #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0x403eeb92 in agents_show (fd=65, argc=2, argv=0xbd3ff634) at chan_agent.c:975 ASTERISK-1 0x0806b318 in ast_cli_command (fd=65, s=0xbd3ff7e4 "show agents") at cli.c:1007 ---Type <return> to continue, or q <return> to quit--- ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7090) at asterisk.c:193 ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 ASTERISK-4 0x40020101 in pthread_start_thread_event () from /lib/libpthread.so.0 Thread 20 (Thread 3592211 (LWP 2946)): #0 0x400c787e in sigsuspend () from /lib/libc.so.6 #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0x403eeb92 in agents_show (fd=60, argc=2, argv=0xbd5ff634) at chan_agent.c:975 ASTERISK-1 0x0806b318 in ast_cli_command (fd=60, s=0xbd5ff7e4 "show agents") at cli.c:1007 ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7080) at asterisk.c:193 ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 ASTERISK-4 0x40020101 in pthread_start_thread_event () from /lib/libpthread.so.0 Thread 19 (Thread 46098 (LWP 30609)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0x00000000 in ?? () Thread 18 (Thread 3590161 (LWP 2931)): #0 0x400c787e in sigsuspend () from /lib/libc.so.6 #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0x403eeb92 in agents_show (fd=53, argc=2, argv=0xbd9ff674) at chan_agent.c:975 ASTERISK-1 0x0806b318 in ast_cli_command (fd=53, s=0xbd9ff824 "show agents") at cli.c:1007 ASTERISK-2 0x0807f8c9 in netconsole (vconsole=0x80c7070) at asterisk.c:193 ASTERISK-3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 17 (Thread 2795536 (LWP 2056)): #0 0x400c787e in sigsuspend () from /lib/libc.so.6 #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0x403ee42a in check_availability (newlyavailable=0x80e6658, needlock=0) at chan_agent.c:812 ASTERISK-1 0x403eff23 in __login_exec (chan=0x82bade8, data=0xbdbff304, callbackmode=0) at chan_agent.c:1233 ASTERISK-2 0x403f0758 in login_exec (chan=0x82bade8, data=0xbdbff304) at chan_agent.c:1305 ASTERISK-3 0x08060c30 in pbx_exec (c=0x82bade8, app=0x80e54f0, data=0xbdbff304, newstack=1) at pbx.c:396 ---Type <return> to continue, or q <return> to quit--- ASTERISK-4 0x08062c73 in pbx_extension_helper (c=0x82bade8, context=0x82baf40 "agentlogin1", exten=0x82bb034 "*3", priority=2, callerid=0x82b0330 "\"OE 1\" <(250)-753-1231>", action=1) at pbx.c:1157 ASTERISK-5 0x080639ad in ast_pbx_run (c=0x82bade8) at pbx.c:1641 ASTERISK-6 0x406001f7 in ss_thread (data=0x82bade8) at chan_zap.c:4202 ASTERISK-7 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 16 (Thread 78863 (LWP 30670)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0xbddfaa74 in ?? () #2 0x0807dcef in session_do (data=0x812b058) at manager.c:686 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 15 (Thread 13326 (LWP 30566)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0x40612a04 in __dso_handle () from /usr/lib/asterisk/modules/chan_zap.so #2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 14 (Thread 12301 (LWP 30565)): #0 0x4013ede1 in nanosleep () from /lib/libc.so.6 #1 0x4013ed78 in sleep () from /lib/libc.so.6 #2 0x405815bc in qcall (ignore=0x0) at app_qcall.c:167 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 13 (Thread 11276 (LWP 30564)): #0 0x4013ede1 in nanosleep () from /lib/libc.so.6 #1 0x4013ed78 in sleep () from /lib/libc.so.6 #2 0x40544fda in scan_thread (unused=0x0) at pbx_spool.c:288 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 12 (Thread 10251 (LWP 30563)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0x401d3104 in __dso_handle () from /usr/lib/asterisk/modules/pbx_wilcalu.so #2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 11 (Thread 9226 (LWP 30562)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0x4053ac94 in __dso_handle () from /usr/lib/asterisk/modules/chan_phone.so #2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- Thread 10 (Thread 8201 (LWP 30561)): #0 0x40166bb0 in poll () from /lib/libc.so.6 #1 0x0805148b in ast_io_wait (ioc=0x80e5a98, howlong=1000) at io.c:254 #2 0x404046cb in network_thread (ignore=0x0) at chan_iax2.c:5027 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 9 (Thread 7176 (LWP 30560)): #0 0x40166bb0 in poll () from /lib/libc.so.6 #1 0x0805148b in ast_io_wait (ioc=0x80e1c08, howlong=1000) at io.c:254 #2 0x403d848b in do_monitor (data=0x0) at chan_sip.c:5480 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 8 (Thread 6151 (LWP 30559)): #0 0x40166bb0 in poll () from /lib/libc.so.6 #1 0x0805148b in ast_io_wait (ioc=0x80df1d0, howlong=1000) at io.c:254 #2 0x402da857 in network_thread (ignore=0x0) at chan_iax.c:4521 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 7 (Thread 5126 (LWP 30556)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0x401cc3d4 in __dso_handle () from /usr/lib/asterisk/modules/res_parking.so #2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 6 (Thread 4101 (LWP 30554)): #0 0x401613c4 in read () from /lib/libc.so.6 #1 0x4002dae0 in __DTOR_END__ () from /lib/libpthread.so.0 #2 0x401bd988 in monmp3thread (data=0x80d40b0) at res_musiconhold.c:234 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 5 (Thread 3076 (LWP 30552)): #0 0x401613c4 in read () from /lib/libc.so.6 #1 0x4002dae0 in __DTOR_END__ () from /lib/libpthread.so.0 #2 0x401bd988 in monmp3thread (data=0x80d3e38) at res_musiconhold.c:234 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 4 (Thread 2051 (LWP 30551)): #0 0x4016e652 in accept () from /lib/libc.so.6 ---Type <return> to continue, or q <return> to quit--- #1 0x400254fb in accept () from /lib/libpthread.so.0 #2 0x0807dfbd in accept_thread (ignore=0x0) at manager.c:729 #3 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 3 (Thread 1026 (LWP 30550)): #0 0x401687ee in select () from /lib/libc.so.6 #1 0x00000041 in ?? () #2 0x400200ba in pthread_start_thread () from /lib/libpthread.so.0 Thread 2 (Thread 2049 (LWP 30549)): #0 0x40166bb0 in poll () from /lib/libc.so.6 #1 0x4001fd96 in __pthread_manager () from /lib/libpthread.so.0 Thread 1 (Thread 1024 (LWP 30547)): #0 0x401613c4 in read () from /lib/libc.so.6 #1 0x4002dae0 in __DTOR_END__ () from /lib/libpthread.so.0 #2 0x0809009f in read_char (el=0x80c7e68, cp=0xbffff70b "\b|?\b\bh~\f\bP") at read.c:302 #3 0x0809013d in el_getc (el=0x80c7e68, cp=0xbffff70b "\b|?\b\bh~\f\bP") at read.c:347 #4 0x0808ffaa in read_getcmd (el=0x80c7e68, cmdnum=0xbffff70a "", ch=0xbffff70b "\b|?\b\bh~\f\bP") at read.c:243 ASTERISK-1 0x080903e5 in el_gets (el=0x80c7e68, nread=0xbffff750) at read.c:443 ASTERISK-2 0x080824ef in main (argc=2, argv=0xbffffae4) at asterisk.c:1473 #0 0x401613c4 in read () from /lib/libc.so.6 (gdb) gdb) info thread 24 Thread 3589144 (LWP 2928) 0x401687ee in select () from /lib/libc.so.6 23 Thread 3576854 (LWP 2910) 0x400c787e in sigsuspend () from /lib/libc.so.6 22 Thread 3588117 (LWP 2927) 0x400c787e in sigsuspend () from /lib/libc.so.6 21 Thread 3593236 (LWP 2953) 0x400c787e in sigsuspend () from /lib/libc.so.6 20 Thread 3592211 (LWP 2946) 0x400c787e in sigsuspend () from /lib/libc.so.6 19 Thread 46098 (LWP 30609) 0x401687ee in select () from /lib/libc.so.6 18 Thread 3590161 (LWP 2931) 0x400c787e in sigsuspend () from /lib/libc.so.6 17 Thread 2795536 (LWP 2056) 0x400c787e in sigsuspend () from /lib/libc.so.6 16 Thread 78863 (LWP 30670) 0x401687ee in select () from /lib/libc.so.6 15 Thread 13326 (LWP 30566) 0x401687ee in select () from /lib/libc.so.6 14 Thread 12301 (LWP 30565) 0x4013ede1 in nanosleep () from /lib/libc.so.6 13 Thread 11276 (LWP 30564) 0x4013ede1 in nanosleep () from /lib/libc.so.6 12 Thread 10251 (LWP 30563) 0x401687ee in select () from /lib/libc.so.6 11 Thread 9226 (LWP 30562) 0x401687ee in select () from /lib/libc.so.6 10 Thread 8201 (LWP 30561) 0x40166bb0 in poll () from /lib/libc.so.6 9 Thread 7176 (LWP 30560) 0x40166bb0 in poll () from /lib/libc.so.6 8 Thread 6151 (LWP 30559) 0x40166bb0 in poll () from /lib/libc.so.6 7 Thread 5126 (LWP 30556) 0x401687ee in select () from /lib/libc.so.6 6 Thread 4101 (LWP 30554) 0x401613c4 in read () from /lib/libc.so.6 5 Thread 3076 (LWP 30552) 0x401613c4 in read () from /lib/libc.so.6 4 Thread 2051 (LWP 30551) 0x4016e652 in accept () from /lib/libc.so.6 3 Thread 1026 (LWP 30550) 0x401687ee in select () from /lib/libc.so.6 2 Thread 2049 (LWP 30549) 0x40166bb0 in poll () from /lib/libc.so.6 1 Thread 1024 (LWP 30547) 0x401613c4 in read () from /lib/libc.so.6 | ||
Comments: | By: tclark (tclark) 2003-11-27 09:19:46.000-0600 I found in the logs exactly where the system deadlock occurs the moment after i Nov 26 12:23:27 NOTICE[3583000]: Queue 'taxidispatchq' Join, Channel 'Zap/29-1', CallerID; "MILLER R - C " <2507585676> Position '5', Cnt '5' Nov 26 12:23:29 WARNING[3575828]: No class: none Nov 26 12:23:29 VERBOSE[3575828]: == Spawn extension (dispatchq, s, 1) exited non-zero on 'Zap/42-1' Nov 26 12:23:29 VERBOSE[3575828]: -- Hungup 'Zap/42-1' Nov 26 12:23:30 VERBOSE[2795536]: -- Playing 'beep' (language 'en') ***Right here is the dead lock Normally we'd see Agent/P: answer -- Playing 'beep' (language 'en') -- Agent/P:2-28722 answered Zap/42-1 *** Nov 26 12:23:31 VERBOSE[3580945]: -- Hungup 'Zap/1-1' Nov 26 12:23:31 VERBOSE[3580945]: == Spawn extension (dispatchlogin1, 1, 2) exited non-zero on 'Local/1@dispatchlogin1-6 f6f,2' Nov 26 12:23:55 VERBOSE[3581973]: -- User disconnected while waiting their turn Nov 26 12:23:55 NOTICE[3581973]: Queue 'taxidispatchq' Leave, Channel 'Zap/30-1', CallerID; "BONNER J " <2507530954> Position '4', Cnt '4' This is the debug output from the moment of deadlock ov 26 12:23:30 DEBUG[2795536]: Wrapup time expired! Nov 26 12:23:30 DEBUG[2795536]: Checking availability of '6' Nov 26 12:23:30 DEBUG[2795536]: Call 'Agent/P:2-25920' looks like a winner for agent '6' Nov 26 12:23:30 DEBUG[2795536]: Playing beep, lang 'en' Nov 26 12:23:30 DEBUG[2795536]: Scheduling timer at 160 sample intervals Nov 26 12:23:30 DEBUG[2795536]: Played beep, result '0' This is the srcs where we deadlock in chan_agent.812 ast_log( LOG_DEBUG, "Playing beep, lang '%s'\n", newlyavailable->chan->language); res = ast_streamfile(newlyavailable->chan, "beep", newlyavailable->chan->language); ast_log( LOG_DEBUG, "Played beep, result '%d'\n", res); if (!res) { res = ast_waitstream(newlyavailable->chan, ""); ast_log( LOG_DEBUG, "Waited for stream, result '%d'\n", res); } } if (!res) { /* Note -- parent may have disappeared */ if (p->abouttograb) { newlyavailable->acknowledged = 1; ast_setstate(parent, AST_STATE_UP); ast_setstate(chan, AST_STATE_UP); strncpy(parent->context, chan->context, sizeof(parent->context) - 1); /* Go ahead and mark the channel as a zombie so that masquerade will destroy it for us, and we need not call ast_hangup */ --->DEAD LOCK HERE ast_mutex_lock(&parent->lock); OK So any gdb god tell me how i can see what fscking thread has the ' parent->lock ', is there some command/technique to what process has this lock By: tclark (tclark) 2003-11-28 20:04:50.000-0600 I am gonna ans myself here so any other developer w/o linux thread knowledge will at least get my understanding... 1) in the asterisk makefile you need to uncomment # Optional debugging parameters DEBUG_THREADS = #-DDEBUG_THREADS #-DDO_CRASH 2) When you deadlock dont restart the box or restart * attach gdb to the running * process gdb /usr/sbin/astrisk <pid of main * process> 3) after gdb loads do info thread thread apply all bt 4) Identify dead locked threads by this pattern note the "_pthread_wait_for_restart_signal" that mean we are in wait loop wanting the mutex lock #0 0x400c787e in sigsuspend () from /lib/libc.so.6 #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0 You will find more than one of these patterns usually because once a thread is dead locked on a mutex lock, other threads that want the same lock will pile up quickly 5) Try to identify the first thread, that is dead locked. The sequence number of bt threads is not relevent, bcus threads are re-used. Look in your log files at the time stamps and try to corrolate the bt THREAD number to the earliest entry in the log file with that same THREAD number eg ( VERBOSE[3575828]), do a bt thread that has a FRAME doing "pthread_mutex_lock()" Log files are usually in /var/log/asterisk/[messages][debug] 6) Now that we have our potential guilty party as the 1st in line for the lock do thread <sequence number> (for the THREAD of interest) frame <fame number> (for the frame # b4 the pthread_mutex_lock () ) this now should be in our * sources right where we call ast_mutex() record the name of the lock it was trying to get 7) Now if we have properly turned on thread debugging we are going to be able to see into the include/asterisk/lock.f ast_mutex_t struc which looks like this pthread_mutex_t mutex; char *file; int lineno; char *func; pthread_t thread; so now that we have our lock we can see who has it & what we are waiting on do the following bt cmds p somelockIjustFound->thread p somelockIjustFound->file p somelockIjustFound->func p somelockIjustFound->lineno This is the guility code that is holding the lock that we want to look at 8) Now comes the hard part Why is this Code in that thread, file, function, lineno not releasing or fscking lock ! We have to now scour the code checking all places where that lock is set & released looking for hmm ... 1) places where there is a lock hierarachy and locks are set and released in different order, (ie. is same rules for locking rows in a sql db, lock & release in the same order everwhere, ) 2) places where a lock is held toooo long for/while loops, b4 function calls etc (this is the same rule as sql db transactions get in get out quick) ...OK I am out of steam if some more season'd linux vet could now pick this up & tell us what else to look for specfic to * mutex locks edited on: 11-28-03 21:01 By: Anthony Minessale (anthm) 2003-12-01 09:14:53.000-0600 if you find the code that plays the "beep" in the function 'agent_call' that is what causes it. I played around with trying to unlock it but it's 3 functions deep and the whole mod is locked at the time that little beep is played so even if you get it to unlock, it pauses the whole system while the beep is playing. I decided to yank that beep from chan_agents and add it to the app_queue cos I can live with no beep in non-queue calls see http://bugs.digium.com/bug_view_page.php?bug_id=0000549 for patch edited on: 12-01-03 11:49 By: Olle Johansson (oej) 2004-03-21 16:20:25.000-0600 Is this still a bug with current CVS? By: Olle Johansson (oej) 2004-04-14 03:03:49 Hello, anyone out there? By: tclark (tclark) 2004-04-17 16:21:49 Yes its still a bug with race condition and you have to combine : groups and non : groups using agentcallbacks and agentlogins but i close it a sweep it under the mat :) bcus there are a ton more places it can break with : groups By: Brian West (bkw918) 2004-04-17 22:49:55 If this isn't what you wanted let me know. bkw |