[Home]

Summary:ASTERISK-02817: [patch] deadlock H.323
Reporter:cybershield (cybershield)Labels:
Date Opened:2004-11-15 14:48:09.000-0600Date Closed:2004-12-17 09:50:59.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) chan_h323.diff
( 1) deadlock.patch
( 2) markdiff.txt
Description:I have tried to use chan_h323 (update to the last version of CVS). After to have received one called in h323, using the command "show channels" I received the following message of error: "Avoided deadlock for 'H323/ip$83.218.124.183:43434/365', 10 retries!".
I have found two serious errors to the inside of chan_h323, i attached the corrections.
Comments:By: twisted (twisted) 2004-11-15 22:55:36.000-0600

A message that states "Avoided deadlock" does not mean a deadlock.

By: Andrey S Pankov (casper) 2004-11-16 11:41:31.000-0600

This means * is not functional any more under such a condition... :(

By: nirsim (nirsim) 2004-11-17 01:22:28.000-0600

Well twisted, I've encountered the same problem, and have tried your patch, but am still experiencing the same issue. Here is a debug trace from the asterisk console:

  -- Attempting call on H323/4315448707666261@81.27.72.11 for 123456@AutoDial:1 (Retry 1)
Nov 22 07:18:57 DEBUG[4169]: chan_h323.c:964 oh323_request: type=H323, format=4, data=4315448707666261@81.27.72.11.
Nov 22 07:18:57 DEBUG[4169]: chan_h323.c:997 oh323_request: Extension: 4315448707666261 Host: 81.27.72.11
Nov 22 07:18:57 DEBUG[4169]: chan_h323.c:882 find_peer: Could not find peer 81.27.72.11 by name or address
Nov 22 07:18:57 DEBUG[4169]: chan_h323.c:441 oh323_call: Dest is 4315448707666261@81.27.72.11
Nov 22 07:18:57 DEBUG[4169]: chan_h323.c:467 oh323_call: Placing outgoing call to 4315448707666261@81.27.72.11
   -- Attempting call on H323/4315448707666378@81.27.72.11 for 123456@AutoDial:1 (Retry 1)
Nov 22 07:18:59 DEBUG[4178]: chan_h323.c:964 oh323_request: type=H323, format=4, data=4315448707666378@81.27.72.11.
Nov 22 07:18:59 DEBUG[4178]: chan_h323.c:997 oh323_request: Extension: 4315448707666378 Host: 81.27.72.11
Nov 22 07:18:59 DEBUG[4178]: chan_h323.c:882 find_peer: Could not find peer 81.27.72.11 by name or address
Nov 22 07:18:59 DEBUG[4178]: chan_h323.c:441 oh323_call: Dest is 4315448707666378@81.27.72.11
Nov 22 07:18:59 DEBUG[4178]: chan_h323.c:467 oh323_call: Placing outgoing call to 4315448707666378@81.27.72.11
   -- Attempting call on H323/4315448707666802@81.27.72.11 for 123456@AutoDial:1 (Retry 1)
Nov 22 07:19:01 DEBUG[4179]: chan_h323.c:964 oh323_request: type=H323, format=4, data=4315448707666802@81.27.72.11.
Nov 22 07:19:01 DEBUG[4179]: chan_h323.c:997 oh323_request: Extension: 4315448707666802 Host: 81.27.72.11
Nov 22 07:19:01 DEBUG[4179]: chan_h323.c:882 find_peer: Could not find peer 81.27.72.11 by name or address
Nov 22 07:19:01 DEBUG[4179]: chan_h323.c:441 oh323_call: Dest is 4315448707666802@81.27.72.11
Nov 22 07:19:01 DEBUG[4179]: chan_h323.c:467 oh323_call: Placing outgoing call to 4315448707666802@81.27.72.11
gw00*CLI> show channels
       Channel  (Context    Extension    Pri )   State Appl.         Data
0 active channel(s)
Nov 22 07:19:02 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:02 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 DEBUG[3073]: channel.c:469 ast_channel_walk_locked: Avoiding initial deadlock for 'H323/81.27.72.11'
Nov 22 07:19:03 WARNING[3073]: channel.c:471 ast_channel_walk_locked: Avoided initial deadlock for 'H323/81.27.72.11', 10 retries!

Now, you can clearly see that asterisk is dialing those calls out, and there is no disconnection cause yet. However, when you show channels, there are none active, and the channels appear to be not connecting without a disconnect cause. Now, here is the funky thing about this, the calls that I'm sending all go to a GnuGK. GnuGK and chan_h323 are both compiled with the same openh323 version and pwlib version, and according to the GK, although asterisk shows no channels, the GK shows 20 active calls. I've been working closely with JerJer on the new chan_h323 code, as I need it badly. The new version finally dials out correctly, but exhibits the same issue as the old chan_h323 code. I fear the problem is located somewhere within asterisk, and not within chan_h323. Problem is, I can't put my finger yet on where the problem is.

By: cybershield (cybershield) 2004-11-17 03:05:37.000-0600

Sorry but i don't use Gatekeeper. I have only tried to use chan_h323 and i have found this serious error. I try to install GnuGk and i try chan_h323 with GK.

@dimitel:
Can you try to send one call to asterisk in H323 without GK (in chan_h323 set gatekeeper=DISABLE) ? Only for test !!!
With this patch i resolv the problem of 'avoided deadlock' when the calls are sended to asterisk.

By: nirsim (nirsim) 2004-11-17 11:29:49.000-0600

Well, GnuGK can work in two methods, either as a true gatekeeper that will answer LRQ requests generated by registered endpoints. Or as a straight signaling switch that will forward Q931 setup messages over H323. My asterisk is not setup to register to the gatekeeper, and sees the remote GK as another H323 entity.

By: twisted (twisted) 2004-11-30 00:14:04.000-0600

Has anyone attempted to get with JerJer regarding this?  h323 is his baby.  I was just wondering if this is still going on, as well, since it's marked major.

Thanks!

By: nirsim (nirsim) 2004-11-30 06:04:37.000-0600

Well, I've been in contact with JerJer in past 2 weeks, but he's somewhat of on the road, so his ability to work on the issue is somewhat limited I believe.
I'll see if I can reach him tonight, and see what gives with this issue.

By: Mark Spencer (markster) 2004-11-30 23:42:18.000-0600

I don't believe the patch is correct because h323_new should be called with the pvt lock already held.  I think there is a place where it was not being held when being called.  Can you tell me if this patch fixes it for you?

By: nirsim (nirsim) 2004-12-01 01:27:50.000-0600

Well, I've tried the above mentioned patch, but it didn't work for me. I also noticed that once the but is exhibited, no more calls can be made via H323 channel.

By: cybershield (cybershield) 2004-12-01 02:58:19.000-0600

@Mark:
if you dial with chan_h323 technology, when call oh323_new the mutex is already locked. When receive a call in h323, into answer_call function, call comes oh323_new without locking the mutex.

By: nirsim (nirsim) 2004-12-07 06:28:46.000-0600

Well, I was wondering if the problem of the dead-lock is caused by chan_h323 itself, so I went off and installed chan_oh323. Now, imagine my surprise that
chan_oh323 handles fastStart and Tunneling better than chan_h323, however, this
is not the case at this point.

I've noticed that the dead issue still remains even in chan_oh323. I'll update asterisk to the newest CVS, and see if the problem persists.

By: nirsim (nirsim) 2004-12-07 06:38:50.000-0600

Ok, I've just updated the asterisk version to latest CVS, and problem still persists. Now the problem is even worse, as the dead lock occurs immediately after running 5-6 calls.

What i've also noticed is that the calls although disconnected at Asterisk side, appear to still be active on the remote site, which by itself is very annoying and causing a billing havoc.

By: mdu113 (mdu113) 2004-12-07 11:55:06.000-0600

I wanted to add my 2 cents that may or may not shed some light on the issue.
All my findings are related to 1.0.x branch, I haven't tried/used cvs-head yet, but issue exists in 1.0.x too (may happen less often though).
I believe that the problem is that sometimes, not always, when chan_h323 calls
ast_queue_hangup the lock on the channel owner is already held somewhere and asterisk deadlocks in ast_queue_frame.
Here's consequence of events as I see it:
1. h323 library is about to clear the connection and calls MyH323EndPoint::OnConnectionCleared()
2. MyH323EndPoint::OnConnectionCleared() calls on_connection_cleared() which is a pointer to cleanup_connection() function in chan_h323.c
3. cleanup_connection() calls ast_queue_hangup(p->owner) to enqueue hangup
to the other channel.
4. ast_queue_hangup() calls ast_queue_frame that expects that channel is not locked and deadlock on ast_mutex_lock(&chan->lock)
5. After this channel deadlocked, the other side cannot see disconnection
and show channels will show you "Avoided initial deadlock" message.

Unfortunately I was unable to find the root of the problem. I worked around it with the following simple patch (made for asterisk-1.0.2):

--- channel.c.bak       2004-10-07 13:30:24.000000000 -0400
+++ channel.c   2004-10-07 13:36:41.000000000 -0400
@@ -373,13 +373,28 @@
       struct ast_frame *prev, *cur;
       int blah = 1;
       int qlen = 0;
+       int retries = 0;
+
       /* Build us a copy and free the original one */
       f = ast_frdup(fin);
       if (!f) {
               ast_log(LOG_WARNING, "Unable to duplicate frame\n");
               return -1;
       }
-       ast_mutex_lock(&chan->lock);
+
+//     ast_mutex_lock(&chan->lock);
+retry:
+        if (ast_mutex_trylock(&chan->lock)) {
+               if (retries < 100) {
+                       usleep(1);
+                        retries++;
+                        goto retry;
+                } else {
+                               ast_log(LOG_WARNING, "Avoided deadlock on channel %s, %d retries",chan->name,retries);
+                               return -1;
+               }
+       }
+
       prev = NULL;
       cur = chan->pvt->readq;
       while(cur) {

--- chan_h323.c.bak     2004-10-07 13:09:18.000000000 -0400
+++ chan_h323.c 2004-10-07 13:37:49.000000000 -0400
@@ -583,6 +586,12 @@
       /* Retrieve audio/etc from channel.  Assumes p->lock is already held. */
       struct ast_frame *f;
       static struct ast_frame null_frame = { AST_FRAME_NULL, };
+       static struct ast_frame hangup_frame = { AST_FRAME_CONTROL, AST_CONTROL_HANGUP };
+
+       if (!p->rtp) {
+               ast_log(LOG_WARNING,"request to read rtp after rtp was destroyed");
+               return &hangup_frame;
+        }

      /* Only apply it for the first packet, we just need the correct ip/port */
      if(p->nat)

I'm running asterisk with this patch for more that a month now without single deadlock, although sometimes I see "Avoided deadlock on channel" message in the log. I understand that this is a dirty hack, but hope it may help to find the root of the problem.

By: Mark Spencer (markster) 2004-12-15 15:54:58.000-0600

Can you attach your diff and confirm you have a disclaimer on file?  I can probably patch it up from there.

By: mdu113 (mdu113) 2004-12-15 17:04:11.000-0600

Haven't actually expected that you would want to apply it as this clearly
doesn't fix the problem, just work around it.
Anyway deadlock.patch is the diff made against asterisk-1.0.3 and I just
faxed the disclaimer.

By: jerjer (jerjer) 2004-12-15 20:00:19.000-0600

Fixed correctly in cvs -head.