ASTERISK-16337: [patch] Deadlock is happended.( between channel and sip

[Home]

Summary: ASTERISK-16337: [patch] Deadlock is happended.( between channel and sip_pvt lock)

Reporter: Kwangsik (warmguy) Labels:

Date Opened: 2010-07-08 23:36:31 Date Closed: 2011-06-07 14:00:23

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 20100813__issue17610.diff.txt
( 1) backtrace.txt
( 2) backtrace1.txt
( 3) core-show-locks1.txt

Description: I have been testing asterisk. (Trunk version)
When testing more than 4cps asterisk does not work.

I am confirmed reason for using GDB.
The reason is estimated to deadlock. Help needed.

Between thread ID 24 and thread ID 5 the circular Deadlock occurs.
Thread 24 (Thread 0x8cc8b90 (LWP 1886)): chan lock(waiting) <- sip_pvt_lock(acquired) <- monlock(acquired)
Thread 5 (Thread 0x648ab90 (LWP 1948)): sip_pvt lock(waiting) <- chan lock(acquired)

Here is the debugging information.
<removed> pabelanger

****** ADDITIONAL INFORMATION ******

<removed> pabelanger

Comments: By: Paul Belanger (pabelanger) 2010-07-09 06:24:28

Your backtrace is optimized (see below). Also, do not post your output as a note, upload it.

---
Debugging deadlocks:

Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install)

This will then give you the console command:

core show locks

When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx \\\"core show locks\\\" | tee /tmp/core-show-locks.txt

# gdb -se \\\"asterisk\\\" <pid of asterisk> | tee /tmp/backtrace.txt

gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!
By: Leif Madsen (lmadsen) 2010-07-09 12:53:58

As pabelanger stated, if this is a deadlock we need deadlock debugging per his notes.
By: Kwangsik (warmguy) 2010-07-09 19:30:55

In your advice,
When I compiled the asterisk, I selected DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect.
I captured and uploaded core-show-locks1.txt and backtrace1.txt
Thanks!
By: Kwangsik (warmguy) 2010-07-09 19:39:31

In backtrace1.txt file,
Between thread ID 39 and thread ID 11 the circular Deadlock occurs.
By: Stefan Schmidt (schmidts) 2010-07-12 02:24:57

i´ve got the same problem but found that it should be fixed in revision 274727.

have a look at the file main/channel.c at row 1258 if there is this:
prev = c;
then you should update your trunk, cause there is a deadlock situation which is fixed if your row looks like this:
_prev = c;
this could also happens when you do "core show channels verbose" and one channel is locked, cause the deadlock avoidence does an infinite loop.
if stress tested 1.6.2.9 with sipp with around 300 concurent calls and 15 cps and everything works wealthy after this fix.
By: Kwangsik (warmguy) 2010-07-12 03:21:35

Thank you for answers,
My test version is received from the http://svn.digium.com/svn/asterisk/trunk address.
In my opinion, fixed version is received from the http://svn.digium.com/svn/asterisk/branches/1.6.2 address.
It is much different between the file main/channel.c of two versions.
By: Stefan Schmidt (schmidts) 2010-07-12 03:47:00

sorry for my mistake but you have set production verion to 1.6.2.9 not to trunk so i thought you mean the 1.6.2.9 branch ;)
By: Kwangsik (warmguy) 2010-07-12 03:50:55

I knew it, It's my mistake. But I couldn't change production version ;)
I'm sorry.
Thanks for your reply.
By: Leif Madsen (lmadsen) 2010-07-16 13:19:28

What kind of hardware are you running this on and which Linux distribution are you using?

Also, are you able to reproduce this consistently?
By: Kwangsik (warmguy) 2010-07-18 23:20:13

My test mahcine spec. is as follows
HW : Dell PowerEdge 6950
SW : fedora 9(2.6.27.25-78.2.56.fc9.i686.PAE)
I configured asterisk only using sip mode(tcp-based).
I found this error consistently.
By: Tilghman Lesher (tilghman) 2010-07-29 16:50:39

Patch for trunk uploaded. I realize this may look somewhat ridiculous, but it's the result of discovering that our tendency of usleep(1) does not actually yield the processor to other threads, as intended. This might have worked before the days of multiple cores, but we need to wait much longer today, because we may be the only thread that is in a short wait state, owing to the other threads being on runnable cores. We must thus wait longer, to ensure that the resource is actually released correctly and made available to other cores.

In any case, please test and verify that this patch fixes the issue for you.
By: Tilghman Lesher (tilghman) 2010-08-11 10:53:02

warmguy: ping. Does this patch fix the issue for you?
By: Stefan Schmidt (schmidts) 2010-08-12 13:28:05

i´ve tried this patch cause of another issue and after some deep debuging of the do_monitor function which includes the callback for many sip_pvt locks i think going up to 1ms just slow down the entire send/receive part of do_monitor. the problem is that the ast_io_wait function which polls for incoming events only read one single packet when its recevied and maybe send up to 10 or more packets.
if its happens that the sip_pvt structure is locked, 1ms sleep time is longer than the scheduler time frame to exec events on the scheduler heap.
this would cause slowing down the sending procedure which will cause to reading less packets in the polling time frame.
By: Tilghman Lesher (tilghman) 2010-08-13 17:17:56

Okay, I have a new patch uploaded now that should assist in debugging. What it will do is make the output of 'core show locks' a bit more traceable, so we can figure out where the locking is going awry. Please apply the patch, recompile, reproduce the issue, and get another 'core show locks'.
By: Tilghman Lesher (tilghman) 2010-08-18 16:43:38

warmguy: have you been able to reproduce this with this debugging patch yet?
By: Kwangsik (warmguy) 2010-08-24 21:30:06

Sorry~ Until recently, I can't follow this problem, because of other work.
Few days ago, I showed asterisk-1.8.0-beta3 image in download site. I installed this version and I tested 4cps during 12 hours. It worked! I thought my issue is solved, Thank you~. If I find same problem, I will report new issue.