
Summary: ASTERISK-02360: Avoided deadlock for Zap channel
Reporter: thaeger (thaeger)
Labels:
Date Opened: 2004-09-07 04:26:15
Date Closed: 2008-01-15 15:07:12.000-0600
Priority: Critical
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) event_log
             (1) messages
Description: It is very strange and not reproducible, but take a look at the attached files "event_log" and "messages". Our Asterisk restarted at 14:29:33, and before that there were some messages like "Avoided deadlock for 'Zap/119-1', 10 retries!".
Unfortunately I can't say more about this problem.
This system is under heavy load in production, so I can't do any experiments... ;-)

AstVersion: "CVS-HEAD-08/19/04-22:02:08"
System: dual Xeon, 1 GB RAM
Kernel: "2.6.8.1 #4 SMP"


****** ADDITIONAL INFORMATION ******

Does anyone know this problem?

Comments:

By: thaeger (thaeger) 2004-09-07 04:31:24

Mark,

please take a closer look at the "messages" file.
There are also messages like "Avoiding IAX destroy deadlock"...

Greets,

Thomas.

By: Brian West (bkw918) 2004-09-07 06:51:07

Well, the most useful information would be a backtrace of the core file.  If you can't reproduce it then it's hard to track down.  You're also not using the latest code.

bkw

By: thaeger (thaeger) 2004-09-07 07:08:24

hi,

the code is from 08/19/04... that is not very old, is it?

By: thaeger (thaeger) 2004-09-07 07:16:35

I know it is not easy to reproduce the problem in this case,
but it is strange behavior, and in fact not acceptable for us.
I posted this problem hoping someone can help us with it.

Another problem is that we are using * in production, so we cannot change the code every five minutes ;-)

By: Mark Spencer (markster) 2004-09-07 08:58:20

Unfortunately once you restart, you completely eliminate our ability to diagnose the problem beyond the most basic problem that something on Zap/119-1 was holding a lock on the channel and not releasing it.

If you are afraid this might occur again, I strongly suggest you prepare a "hot spare" so that you can place the hot spare into use without restarting asterisk on the unit with a problem, permitting us to connect and perform analysis on the problematic system.

By: thaeger (thaeger) 2004-09-07 09:30:23

Hi Mark,
the problem seems to be on both of our machines, and every time they crash it costs up to 100 EUR because of the calls running on the machine.

Can you tell me what a "hot spare" is?
We are running asterisk with the safe_asterisk script, so whenever asterisk crashes it is restarted by the script...

Maybe I should give you the login credentials for the machine, and you can log in and set up such a "hot spare"?

Greets,

Thomas.

By: Mark Spencer (markster) 2004-09-07 10:06:18

A "hot spare" means an identically configured machine so that if there is a runtime problem you can simply connect the backup machine without disturbing the original machine.  How often have you seen this problem?

By: silke (silke) 2004-09-07 10:20:41

Dear Thomas,


on the next restart you could use the command-line
option "-g" (dump core in case of a crash).

When asterisk crashes, it creates a big core file of
several megabytes. In my experience the file ends up
either in the asterisk directory (/usr/sbin), once
in the root directory, and maybe in /tmp or somewhere
else. You have to search ;-) I think it is simply wherever the last
CWD was.

With this file and the GNU debugger you can track down
the line in the C code where the crash occurred. The GNU debugger
then shows the lines of C code that caused the crash.

This would help a lot!


Regards,


Silke

By: thaeger (thaeger) 2004-09-07 10:34:42

Mark: I've seen this problem every two days on average for the last few days. Maybe it has something to do with the ISDN line, because sometimes one of our provider ports gets confused somehow. That was the reason why crich (my co-worker) wanted a way to restart a single port from the CLI while asterisk is running. It would be a very useful feature, and in fact all switches and PBXes have such a port restart ;-)

Silke:
safe_asterisk starts asterisk with the "-g" option. Can you tell me more exactly where the file should be and what it is named? I have looked for it and didn't find anything...

By: bfranks (bfranks) 2004-09-07 10:55:49

thaeger,

You should be able to find a core.##### file (replace ##### with some set of numbers).

Mine show up in /tmp.

- Brent

By: Mark Spencer (markster) 2004-09-07 11:31:02

The -g won't help you in this case because it isn't a crash -- it's a deadlock in effect.

By: thaeger (thaeger) 2004-09-07 12:10:24

...but asterisk was restarted. Is that not a crash?
In any case, I cannot find a core dump file.

And now?

By: Brian West (bkw918) 2004-09-07 12:38:30

It's a deadlock, so you can't diagnose the issue unless you attach to the running process with gdb.

"gdb /usr/sbin/asterisk <pid>"

Then you should be able to do a "thread apply all bt"

and post that here.

bkw

By: Mark Spencer (markster) 2004-09-07 12:43:55

Out of curiosity, are you using MySQL CDR records perchance?

By: thaeger (thaeger) 2004-09-08 03:35:01

No, not cdr_mysql.so, but a module named "cdr_sybase.so". Do you suppose that this could be the problem? And if so, in which case?
We need this module for logging CDRs to our billing system. And the billing system is running on, oh my god!, an MS SQL Server 2000 ;-)

Greets,

Thomas

By: Mark Spencer (markster) 2004-09-08 07:37:21

Yesterday someone had a problem where cdr_mysql was blocking; I didn't know whether it would be related.

By: Anthony Minessale (anthm) 2004-09-08 09:26:23

I have been seeing this one a lot more recently too.

My suspicion is that you trigger this error when a channel is being hung up and
you do something like a "show channels", or anything else that
"walks" the channels, at the same time.

I have a CDR application that uses the channel name to go back
and fetch the channel pointer that the CDR record is associated with,
so it can read some variables. I get the "avoid" message once in a while.
My application creates a situation where, the moment a channel has ended,
it does a channel walk very shortly thereafter.

Also, in the chanspy application I made, if you are scanning for channels
to spy on while you are currently spying on a call and they hang up, the scanner
immediately looks for more to scan (again, a hangup followed by immediate walking).
This also triggers the error.

Neither case causes permanent failure for me, just the warning message, but it still makes me nervous.

Here is some brainstorming... flamers or English majors need not reply.

Since channels are so commonly used, I think they need to either be pooled and
reused or at least put into a destruct queue where their pointers linger for
a while, to make sure nobody is looking for them anymore.

Another idea would be some kind of ast_channel_associate()
or ast_channel_use_count(), so the channel being hung up is aware of every
thread that is currently interested in it (e.g. holds a pointer to it).
That could be snuck into get_channel_by_name_locked or walk_locked
to append your thread id to a list of interested parties. Then, if the
thread ids were also managed, the channel could look at all the threads that
recently asked for a pointer to it and check whether each thread is still running.
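
[Editor's sketch] The use-count idea above could look roughly like the following. This is only an illustration under stated assumptions: `struct counted_chan` and the `counted_chan_*` functions are hypothetical names invented here, not Asterisk's `struct ast_channel` or any real Asterisk API, and the sketch counts users rather than tracking thread ids.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a channel; not Asterisk's struct ast_channel. */
struct counted_chan {
    pthread_mutex_t lock;
    int use_count;   /* number of parties currently holding a pointer */
};

/* Allocate a channel owned by one party (think: the channel driver). */
static struct counted_chan *counted_chan_new(void)
{
    struct counted_chan *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    pthread_mutex_init(&c->lock, NULL);
    c->use_count = 1;
    return c;
}

/* Register interest, e.g. from a channel walker or a CDR backend. */
static void counted_chan_ref(struct counted_chan *c)
{
    pthread_mutex_lock(&c->lock);
    c->use_count++;
    pthread_mutex_unlock(&c->lock);
}

/* Drop interest. Frees the channel and returns 1 once the last user is gone. */
static int counted_chan_unref(struct counted_chan *c)
{
    int remaining;

    pthread_mutex_lock(&c->lock);
    remaining = --c->use_count;
    pthread_mutex_unlock(&c->lock);
    if (remaining > 0)
        return 0;
    pthread_mutex_destroy(&c->lock);
    free(c);
    return 1;
}

/* Demo: a hangup arrives while a walker still holds the pointer.
 * Returns 0 if the channel survives the hangup and is freed by the walker. */
int use_count_demo(void)
{
    struct counted_chan *c = counted_chan_new();

    if (!c)
        return -1;
    counted_chan_ref(c);              /* walker takes a reference */
    if (counted_chan_unref(c) != 0)   /* hangup path: must not free yet */
        return -1;
    if (counted_chan_unref(c) != 1)   /* walker is done: now it is freed */
        return -1;
    return 0;
}
```

With this scheme the hangup path never frees memory that a walker is still looking at; whoever drops the last reference does the cleanup.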

Finally, if this issue is especially apparent in bridged calls, perhaps the code
that bridges calls should hang on to "both" channels until they are both properly posted to CDR etc.

By: Mark Spencer (markster) 2004-09-09 01:10:42

The message shows up when something has locked a particular channel and is not releasing it.
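
[Editor's sketch] A warning of this kind typically comes from a bounded try-lock loop: instead of blocking forever on a channel lock, the caller tries a fixed number of times and then gives up. The following is a minimal sketch of that pattern, not the actual Asterisk code; `struct demo_chan` and `demo_chan_trylock_retry` are hypothetical names, and only the message text is taken from the report above.

```c
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Hypothetical stand-in for a channel; not Asterisk's struct ast_channel. */
struct demo_chan {
    const char *name;
    pthread_mutex_t lock;
};

/* Try to take the channel lock without ever blocking.
 * Returns 0 with the lock held, or -1 after max_retries failed attempts. */
static int demo_chan_trylock_retry(struct demo_chan *c, int max_retries)
{
    int retries;

    for (retries = 0; retries < max_retries; retries++) {
        if (pthread_mutex_trylock(&c->lock) == 0)
            return 0;       /* got the lock */
        sched_yield();      /* give the current holder a chance to release it */
    }
    fprintf(stderr, "Avoided deadlock for '%s', %d retries!\n", c->name, retries);
    return -1;
}

/* Demo: one attempt against a lock that is never released, one against a free lock.
 * Returns 0 if the first attempt gives up and the second succeeds. */
int trylock_demo(void)
{
    struct demo_chan c;
    int busy, idle;

    c.name = "Zap/119-1";
    pthread_mutex_init(&c.lock, NULL);   /* default (non-recursive) mutex */

    pthread_mutex_lock(&c.lock);               /* simulate a holder that does not let go */
    busy = demo_chan_trylock_retry(&c, 10);    /* gives up and logs the warning */
    pthread_mutex_unlock(&c.lock);

    idle = demo_chan_trylock_retry(&c, 10);    /* lock is free now */
    if (idle == 0)
        pthread_mutex_unlock(&c.lock);

    pthread_mutex_destroy(&c.lock);
    return (busy == -1 && idle == 0) ? 0 : -1;
}
```

The warning itself only says that the lock could not be obtained in time; to find out which thread is holding it, you still need to inspect the running process.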

By: Anthony Minessale (anthm) 2004-09-09 09:38:12

Exactly! Which is why I am suspicious that something in the core that hangs up a bridged call is doing that in certain cases.

By: Mark Spencer (markster) 2004-09-09 16:19:20

Why would you think it has to do with hangup?  In any case if you can make it happen then you should be able to attach with gdb and see what is stuck.

By: Mark Spencer (markster) 2004-09-09 17:22:23

You don't happen to be running any external modules (e.g. "rate_engine.so") on this box are you?

By: Mark Spencer (markster) 2004-09-09 22:06:10

Also, if you build with -DDEBUG_THREADS and then build with "make valgrind" instead of the usual "make" it may help us debug the problem if it happens again, but we will *have* to be able to connect and login to work on it while it's occurring.

By: Anthony Minessale (anthm) 2004-09-10 12:21:27

You need to use the scrollbars to be able to read all the messages in this bug.  You ask me why I think it's related to hangup, but I had just posted a 10-page explanation in the previous post.

There are now 2 or 3 different bugs talking about this whole deadlock message.
It has been all over the place in the last 2 weeks, so I am highly suspicious that
there are new bugs in channel.c... that is all I mean...

By: Mark Spencer (markster) 2004-09-10 14:45:13

It is very common that on a PRI channel the system must grab (a) The Channel lock, (b) the sub lock, and (c) the PRI lock all simultaneously when coming from the asterisk API side.  In order to handle that properly coming from the PRI side, the PRI lock must be released before trying to grab the sub lock and the channel lock.  This required some small but important changes to chan_zap to avoid this deadlock and is now fixed in CVS.
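
[Editor's sketch] The lock-ordering rule described above can be illustrated as follows. This is a simplified, single-threaded sketch and not the change that went into chan_zap.c; `struct demo_line`, `api_side_hangup` and `pri_side_event` are hypothetical names, and the assumed order (channel, then sub, then PRI) is read off the comment above.

```c
#include <pthread.h>

/* Hypothetical stand-in for one line on a PRI span; not chan_zap's real structures. */
struct demo_line {
    pthread_mutex_t chan_lock;
    pthread_mutex_t sub_lock;
    pthread_mutex_t pri_lock;
    int hangups;   /* touched only while pri_lock is held */
    int events;    /* touched only while pri_lock is held */
};

/* API side: take the channel lock, then the sub lock, then the PRI lock. */
static void api_side_hangup(struct demo_line *l)
{
    pthread_mutex_lock(&l->chan_lock);
    pthread_mutex_lock(&l->sub_lock);
    pthread_mutex_lock(&l->pri_lock);
    l->hangups++;
    pthread_mutex_unlock(&l->pri_lock);
    pthread_mutex_unlock(&l->sub_lock);
    pthread_mutex_unlock(&l->chan_lock);
}

/* PRI side: called with pri_lock held, returns with pri_lock held.
 * The PRI lock is released first, so this path never waits for the channel
 * or sub lock while holding it. State may change while the PRI lock is
 * dropped, so real code would have to re-check it after re-locking. */
static void pri_side_event(struct demo_line *l)
{
    pthread_mutex_unlock(&l->pri_lock);
    pthread_mutex_lock(&l->chan_lock);   /* same order as the API side from here on */
    pthread_mutex_lock(&l->sub_lock);
    pthread_mutex_lock(&l->pri_lock);
    l->events++;
    pthread_mutex_unlock(&l->sub_lock);
    pthread_mutex_unlock(&l->chan_lock);
}

/* Demo: run both paths once and check the lock state afterwards.
 * Returns 0 if both paths ran, pri_lock is still held after the PRI-side
 * call, and the channel lock has been released again. */
int lock_order_demo(void)
{
    struct demo_line l;
    int still_held, chan_free;

    pthread_mutex_init(&l.chan_lock, NULL);
    pthread_mutex_init(&l.sub_lock, NULL);
    pthread_mutex_init(&l.pri_lock, NULL);
    l.hangups = 0;
    l.events = 0;

    api_side_hangup(&l);

    pthread_mutex_lock(&l.pri_lock);     /* a PRI event arrives under the PRI lock */
    pri_side_event(&l);
    still_held = (pthread_mutex_trylock(&l.pri_lock) != 0);
    pthread_mutex_unlock(&l.pri_lock);

    chan_free = (pthread_mutex_trylock(&l.chan_lock) == 0);
    if (chan_free)
        pthread_mutex_unlock(&l.chan_lock);

    pthread_mutex_destroy(&l.chan_lock);
    pthread_mutex_destroy(&l.sub_lock);
    pthread_mutex_destroy(&l.pri_lock);
    return (l.hangups == 1 && l.events == 1 && still_held && chan_free) ? 0 : -1;
}
```

Because both paths end up acquiring the three locks in the same order, two threads running them concurrently cannot each hold a lock the other one is waiting for.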

By: Digium Subversion (svnbot) 2008-01-15 15:07:12.000-0600

Repository: asterisk
Revision: 3760

U   trunk/channels/chan_zap.c

------------------------------------------------------------------------
r3760 | markster | 2008-01-15 15:07:12 -0600 (Tue, 15 Jan 2008) | 2 lines

Make sure we don't try to grab the sub and channel locks while still having the PRI lock! (bug ASTERISK-2360)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=3760