Summary:ASTERISK-10689: Deadlock in channel.c
Reporter:callguy (callguy)Labels:
Date Opened:2007-11-05 14:18:21.000-0600Date Closed:2011-06-07 14:00:51
Versions:Frequency of
Environment:Attachments:( 0) coreshowlocks-11052007.txt
( 1) lock.txt
Description:We had what appears to be a new deadlock in channel.c today. output of core show locks attached.
Comments:By: Tilghman Lesher (tilghman) 2007-11-06 13:19:20.000-0600

Please get another "core show locks" after updating to SVN 1.4 revision 89045 or higher.

By: Denis Galvao (denisgalvao) 2007-11-07 08:24:34.000-0600

Im stilll getting SIP deadlocks using Asterisk SVN-branch-1.4-r88719.

Im gonna try to get more info.

By: Bing Li(enst) (enst) 2007-11-08 23:13:31.000-0600

I met a dead lock much like this.

p = find_call(&req, &sin, req.method); /* returns p locked */
    |----__sip_ack(p, seqno, FLAG_RESPONSE, 0);

same p, same p->lock

#0  0x00002afa4c968bd8 in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#1  0x00002afa4c964ac8 in _L_mutex_lock_106 () from /lib/libpthread.so.0
#2  0x00002afa4c964423 in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00002aaabc7c1f22 in __ast_pthread_mutex_lock (filename=0x2aaabc815c13 "chan_sip.c", lineno=2133, func=0x2aaabc816a0f "__sip_ack",
   mutex_name=0x2aaabc816a19 "&p->lock", t=0x2aaad804bc60) at /usr/src/asterisk/include/asterisk/lock.h:308
#4  0x00002aaabc7c3139 in __sip_ack (p=0x2aaad804bc60, seqno=102, resp=1, sipmethod=0) at chan_sip.c:2133
ASTERISK-1  0x00002aaabc807af1 in handle_request (p=0x2aaad804bc60, req=0x40255b10, sin=0x40256f40, recount=0x40255afc, nounlock=0x40255b00)
   at chan_sip.c:15069
ASTERISK-2  0x00002aaabc8083ac in sipsock_read (id=0x81aef0, fd=12, events=1, ignore=0x0) at chan_sip.c:15181
ASTERISK-3  0x000000000047ae2a in ast_io_wait (ioc=0x818ab0, howlong=1) at io.c:279
ASTERISK-4  0x00002aaabc808e13 in do_monitor (data=0x0) at chan_sip.c:15384
ASTERISK-5  0x00000000004deaae in dummy_start (data=0x822bb0) at utils.c:806
ASTERISK-6 0x00002afa4c9622a5 in start_thread () from /lib/libpthread.so.0
ASTERISK-7 0x00002afa4d33a61d in clone () from /lib/libc.so.6
ASTERISK-8 0x0000000000000000 in ?? ()

By: Tilghman Lesher (tilghman) 2007-11-09 16:41:10.000-0600

Guys, once again, for ANY of these deadlocks, I need a 'core show locks', not a backtrace.

By: Bing Li(enst) (enst) 2007-11-09 17:31:24.000-0600

here is the core show locks

have you seen chan_sip.c

p = find_call(&req, &sin, req.method); /* returns p locked */
       |----ast_mutex_lock(&p->lock);  0x2aaad804bc60
if (handle_request(p, &req, &sin, &recount, &nounlock) == -1)
         |----__sip_ack(p, seqno, FLAG_RESPONSE, 0);
                             |=----ast_mutex_lock(&p->lock); 0x2aaad804bc60

By: Mark Michelson (mmichelson) 2007-11-10 22:45:34.000-0600


13 lines below the call to find_call, p is unlocked. The conflict is most likely coming from elsewhere.

Edit: Also, just to clarify, your backtrace shows a thread waiting on p->lock, but your core show locks indicates something else, that there's actually a conflict trying to lock a channel, not a sip_pvt. Was the backtrace taken from a different time than the core show locks?

By: Bing Li(enst) (enst) 2007-11-11 03:01:10.000-0600

but 10 lines below find_call  can break the for loop
if (!p->owner || !ast_channel_trylock(p->owner))
break; /* locking succeeded */  

in gdb  I found the thread die at the function __sip_ack   wait for the same lock of find_call

By: Mark Michelson (mmichelson) 2007-11-11 06:08:04.000-0600

You're absolutely right. I need to stop browsing the tracker at night :)

By: Tilghman Lesher (tilghman) 2007-11-11 10:17:45.000-0600

enst:  is this "core show locks" from a revision at 89045 or higher?  There was a bug in the "core show locks" routine prior to that revision where we didn't show locks that were stuck in a trylock loop.  Your output doesn't show any apparent deadlocks, which is why I'm wondering about the revision.

By: Bing Li(enst) (enst) 2007-11-11 19:12:06.000-0600

sorry.  I only have 1.4.13 test environment.   the deadlock happened in every 1-2 days.
now I fixed the bug myself, and it has not happened for 3 days.

yes my 'core show lock' is output by 1.4.13.  so in the version that fixed the 'core show lock' bug, maybe can find the deadlock bug without gdb debuging? :)

By: Daniel Kühl Lima (dklima) 2007-11-13 18:10:00.000-0600

enst, how did you fixed by your self? share with us! We are getting the same deadlock on asterisk...
After a time of use of asterisk, it's by random time, can be in 10 minutes or about once a day.
I do not have asterisk compiled with DEBUG_*, so, I won't be able to show you the core show locks, sorry :( it's a production PBX.

I'm using asterisk SVN-branch-1.4-r89105.

By: Tilghman Lesher (tilghman) 2007-11-13 19:37:04.000-0600

dklima:  there is no performance penalty to running with DEBUG_THREADS, but if you do so, we're more likely to be able to find your issue.  Please reconsider adding that flag to your compile.

By: Bing Li(enst) (enst) 2007-11-13 23:51:31.000-0600

dklima: when I find the dead lock
core show locks    
=== Thread ID: 1102494016 (pbx_thread           started at [ 2627] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): MUTEX 5009 ast_channel_lock &chan->lock 0x2aaad804dab8 (1)

=== Thread ID: 1076197696 (do_monitor           started at [15443] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 15144 sipsock_read &netlock 0x2aaabca29580 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 4526 find_call &p->lock 0x2aaad804bc60 (2)
=== ---> Lock #2 (channel.c): MUTEX 5041 ast_channel_trylock &chan->lock 0x2aaad804dab8 (1)

then use 'gdb asterisk pidnum'    debug the thread 1076197696 ,
3 0x00002aaabc7c1f22 in __ast_pthread_mutex_lock (filename=0x2aaabc815c13 "chan_sip.c", lineno=2133, func=0x2aaabc816a0f "__sip_ack",
I find in func __sip_ack try to lock the same lock which is locked by find_call 0x2aaad804bc60

By: Tilghman Lesher (tilghman) 2007-11-14 00:15:18.000-0600

enst:  please upload the full output of that command to the file upload area of this issue, and please remember to update to revision 89045 or higher, first.

By: Bing Li(enst) (enst) 2007-11-14 00:23:33.000-0600

Corydon76: sorry, I can't do that. I can only test with 1.4.13 now. And I have done many changes on this version. The server is running 24hours/day.  For this issue, I don't need to update to revision 89045, the infomation is already enough. I'll not update untill 1.4.14 released.

By: callguy (callguy) 2007-11-14 15:37:20.000-0600

enst: I believe this issue is the same as 11080 (also reported by me). The symptoms were slightly different. Please look at my most recent post in that bug and let me know if you came to the same conclusion (that p->lock wasn't getting unlocked within the loop).

By: bhrugu mehta (bhrugu) 2007-12-26 05:17:18.000-0600

Bhrugu: I face same prolem in asterisk 1.4.15 but when i test that in asterisk prolem has not occured.

By: Russell Bryant (russell) 2007-12-26 12:11:57.000-0600

There have been a lot of things fixed in this area since this was reported.  I'm going to mark this as suspended.  If you still have a problem, please let us know and provide updated "core show locks" output.