ASTERISK-03758: 'Avoided initial deadlock' message and Asterisk stops to handle calls

[Home]

Summary: ASTERISK-03758: 'Avoided initial deadlock' message and Asterisk stops to handle calls

Reporter: alextitov (alextitov) Labels:

Date Opened: 2005-03-24 11:39:33.000-0600 Date Closed: 2008-01-15 15:30:26.000-0600

Priority: Blocker Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) deadlock_reproduce_log.txt
( 1) extensions.conf
( 2) full
( 3) h323-dead-cvs.diff
( 4) h323-dead-cvs2.diff
( 5) threads_backtrace_2.txt
( 6) threads_backtrace.txt

Description: This is the exact duplicate of a closed bug 2738
Sometimes (we can't catch the right moment and reproduce it yet) the Asterisk stops to serve calls.
'show channels' command shows 0 (or 2 or 4) channels and a warning message beneath. Here it is:
pbx*CLI> show channels
Channel (Context Extension Pri ) State Appl. Data
0 active channel(s)
Mar 24 20:03:27 WARNING[11873]: channel.c:488 ast_channel_walk_locked: Avoided initial deadlock for 'H323/ip$10.110.5.17:15434/10153', 10 retries!
Mar 24 20:14:08 WARNING[11847]: channel.c:488 ast_channel_walk_locked: Avoided initial deadlock for 'H323/ip$10.110.5.17:15434/10153', 10 retries!

And it doesn't respond to 'stop now' console command also.

****** ADDITIONAL INFORMATION ******

We use DLink DVG-1104DH (H.323) as a FXO gateway to outer world and Cisco ATA 186 (SIP) with DLink 104SH (H.323) as a phone FXS gateways.

Also we use attended transfer feature:
atxfer => #

Linux Debian 2.4.29
pwlib 1.8.1
openh323 1.15.1

Comments: By: Paul Cadach (pcadach) 2005-03-24 11:45:02.000-0600

This is H.323-related bug, not Asterisk's core.
By: Mark Spencer (markster) 2005-03-24 14:05:16.000-0600

Can you duplicate this problem without H.323?
By: alextitov (alextitov) 2005-03-25 02:25:17.000-0600

We can't duplicate (reproduce) it yet at all.
I will keep an eye on it to catch the moment and notice of what leads to it.

Yes, it seems to be h.323-related bug because all the time the deadlock message says about "H323/ip$10.110.5.17..."

I can't disable H.323 because it is near-production usage and our current hardware (FXO/FXS gateways) works on H.323

Should I uncomment
DEBUG_THREADS = #-DDEBUG_THREADS #-DDO_CRASH #-DDETECT_DEADLOCKS
and wait for core dumps ?
I think I should.
By: alextitov (alextitov) 2005-03-25 03:57:21.000-0600

Now deadlock message at the other source's line (and other h323 ip), but asterisk continues to operate ok:
Mar 25 11:53:21 WARNING[14777]: channel.c:511 ast_channel_walk_locked: Avoided deadlock for 'H323/ip$10.110.5.16:2846/454', 10 retries!
I attached new threads_backtrace_2.txt
By: Paul Cadach (pcadach) 2005-03-25 04:59:30.000-0600

Alex, could you track deadlocked connections from logs (is it incoming or outgoing call, at which call phase deadlock was found, etc.)? This should be very helpful to localize the bug.
By: alextitov (alextitov) 2005-03-25 05:20:30.000-0600

I've just setup full logs recording and set
> set verbose 100
> set debug 100
So now it writes a lot of data to /var/log/asterisk/full
Sorry I didn't do that earlier.
I'll report here as soon as new deadlock occurs.
But probably not before Monday.
By: Andrey S Pankov (casper) 2005-03-25 09:48:15.000-0600

Another deadlock issue is documented in ASTERISK-3762. (It does NOT duplicate this report!)
By: Paul Cadach (pcadach) 2005-03-25 11:22:29.000-0600

casper, the root of the deadlocks is some sort of invalid interaction between Asterisk's H.323 channel driver and OpenH323 stack, so it could lock everything depends on call state when deadlock is happened. RTP still works because it is not based on OpenH323 implementation. But sometimes a deadlock rised within OpenH323 could cause internal Asterisk's structures get locked.

So, both tickets have the same nature - deadlock at interaction between channel driver and OpenH323 stack, so Mark closed ASTERISK-3762 as duplicate of this bug.
By: essobi (essobi) 2005-03-26 00:49:22.000-0600

I can't havn't been able to reproduce this bug except when h323 is used in conjunction with a queue.. Are you using queues?

I pushed 1000 outbound calls, and 1000 inbound calls and got no deadlock..

I tried twice with a queue with a single member, inbound h323 and outbound h323, the first time the deadlock occured 35 calls in, the second time, it was 300 calls.

I'm using slowstart on my H323.. that may be a contributing factor, as I've seen a few odd things happen when * answers the call too quickly, or the h323 end answers too quickly, before the call is fully acked.
By: essobi (essobi) 2005-03-26 01:50:46.000-0600

I was able to core * twice in a row just now with a queue.. I'm going to keep at it and see if it persists.. I'll upload the console logs in the morning after I wipe the IPs.. if someone cares to see the cores, I'm usually around on IRC.
By: alextitov (alextitov) 2005-03-26 02:23:48.000-0600

Essobi, I'm not sure but I think we don't use queues. At least I changed nothing in queue.conf and current default settings in there are only:
[general]
persistentmembers = yes

And I did nothing special in extensions.conf either.
However we receive a lot of incoming calls during the conversation. So it is some type of natural queue when incoming calls are waiting for receiving party to finish previous call. And I setup two sip operator phones to receive all incoming calls:

; operators landing
exten => 0,1,Dial(SIP/721&SIP/722,,t)

We also use atxfer feature extensively and receiving party is busy and not responding many times. And I had to apply a one-line patch of bug 3841 to get transferred call returned. But that deadlock problem occured before that patch also. So I don't believe it is due to that patch.

We have FastStart enabled on both (FXO/FXS) devices, I tried to disble it at FXS one and it refused to work with * then.
Our FXO device has:
FastStart support : Enabled.
Tunnelled H.245 support : Disabled.
H.245 message after FastStart : Enabled.

I record full logs now and wait for next deadlock occurrence. But yestarday * worked fine all the day. So I have to wait till Monday when people start calling again :)
By: alextitov (alextitov) 2005-03-28 05:35:15.000-0600

It happened again and now I can reproduce it.
It occurs almost regularly when I dial unexistent number from our H.323 FXS gw.
'Almost' means that everything is fine (and no dedalock occurs by that way) right after I started up the Asterisk. But a while later (when some other calls are made) I can dial any wrong number and Asterisk stops at
-- Executing Playback("H323/ip$10.110.5.17:15524/10639", "invalid") in new stack

I uploaded deadlock_reproduce_log.txt with the three cases:
1. no deadlock occurs (within a few seconds after Asterisk startup)
2. deadlock occurs (another dial attempt after a while)
3. I uncommented-out Playback(invalid) in extensions.conf and deadlock still occurs

And I also recorded full log for this:
startup asterisk, dial wrong number (8), hear busy tones, hangup, wait a little, dial wrong number again (8), hear silence, hangup, see deallock messages, kill asterisk processes.

What can I do to examine that better ?
By: essobi (essobi) 2005-03-28 08:35:45.000-0600

There is currently discussion on going in -dev mailing list, by Paul Cadach and Chih-Wei Haung, reguarding early media notification of channels. It would appear they are working through a patch for this in particular.
By: Paul Cadach (pcadach) 2005-03-28 11:28:38.000-0600

As possible workaround just add Wait(1) or Wait(2) to dialplan before you send something into channel to allow it to finish opening audio path. Correct fix for the problem is in discussion/trial stage.
By: alextitov (alextitov) 2005-03-30 09:22:51.000-0600

I tried to add Wait(3) and it doesn't help.
However it seems we don't dial wrong numbers often and Asterisk works fine this week.
By: Paul Cadach (pcadach) 2005-04-01 14:08:51.000-0600

Try attached patch (h323-dead-cvs.diff). Slightly tested on incoming calls only (calls was generated by callgen323 with faststart enabled).
By: jerjer (jerjer) 2005-04-01 14:45:31.000-0600

The while(1) in cleanup_connection does not seem sane to me.

how about:

do {
pvt = find_call()
ast_mutex_lock()
...
...
} while (!pvt)
By: Paul Cadach (pcadach) 2005-04-01 14:55:47.000-0600

Not so much important... In your case:
do {
pvt = find_call(..)
if (!pvt) // missed those 2 lines
return;
ast_mutex_lock(&pvt->lock);
...
} while(!pvt); // Isn't needed because function will return if pvt structure isn't available.
By: wgfreewill (wgfreewill) 2005-04-01 22:47:03.000-0600

Patch went fine, latest cvs
/usr/src/asterisk/channels# patch -p1 < h323-dead-cvs.diff
patching file chan_h323.c
patching file h323/ast_h323.cpp
patching file h323/ast_h323.h
patching file h323/chan_h323.h

but compiling wouldn't go

chan_h323.c: In function `cleanup_connection':
chan_h323.c:1643: error: structure has no member named `thread'
chan_h323.c:1643: error: structure has no member named `reentrancy'
chan_h323.c:1643: error: structure has no member named `func'
chan_h323.c:1643: error: structure has no member named `file'
chan_h323.c:1643: error: structure has no member named `lineno'
make[1]: *** [chan_h323.o] Error 1
make[1]: Leaving directory `/usr/src/asterisk/channels'
make: *** [subdirs] Error 1
By: Paul Cadach (pcadach) 2005-04-02 00:07:58.000-0600

Oops... I used thread debugging to locate deadlocks and possible deadlock conditions. Fixed.
By: wgfreewill (wgfreewill) 2005-04-03 23:28:22

Okay it built fine, I have it running, calls are working, but as we know it takes some amount of time between crashes.

I'll try to add another note in a week or so if we're stable.
By: Paul Cadach (pcadach) 2005-04-04 00:36:38

Just for notice - this bug probably merges ASTERISK-3563 and ASTERISK-3511.
By: yogel (yogel) 2005-04-04 04:09:00

It looks like Jeremy applied patches for this bug to CVS, but forgot to add chan_h323.c as well, so it fails to build at the moment.

...
chan_h323.c: In function `load_module':
chan_h323.c:2127: too few arguments to function `h323_callback_register'
By: Paul Cadach (pcadach) 2005-04-04 05:11:33

Yep, Jeremy's mistake.
By: jerjer (jerjer) 2005-04-04 11:05:40

applied to cvs -head
By: jerjer (jerjer) 2005-04-04 11:05:59

not in -stable
By: Digium Subversion (svnbot) 2008-01-15 15:30:23.000-0600

Repository: asterisk
Revision: 5379

U trunk/channels/h323/Makefile
U trunk/channels/h323/ast_h323.cpp
U trunk/channels/h323/ast_h323.h
U trunk/channels/h323/chan_h323.h

------------------------------------------------------------------------
r5379 | jeremy | 2008-01-15 15:30:22 -0600 (Tue, 15 Jan 2008) | 2 lines

Better dependancy and version checking method, solve call problems when dealing with a gatekeeper, sanity check call cleanup, in hopes to avoid deadlocks (Bugs ASTERISK-3758 ASTERISK-3563, ASTERISK-3511)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=5379
By: Digium Subversion (svnbot) 2008-01-15 15:30:26.000-0600

Repository: asterisk
Revision: 5383

U trunk/channels/chan_h323.c

------------------------------------------------------------------------
r5383 | jeremy | 2008-01-15 15:30:26 -0600 (Tue, 15 Jan 2008) | 2 lines

Better dependancy and version checking method, solve call problems when dealing with a gatekeeper, sanity check call cleanup, in hopes to avoid deadlocks (Bugs ASTERISK-3758 ASTERISK-3563, ASTERISK-3511)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=5383