[Home]

Summary:ASTERISK-02650: Crash in ast_queue_frame (channel.c)
Reporter:Andrey S Pankov (casper)Labels:
Date Opened:2004-10-22 04:40:02Date Closed:2008-01-15 15:12:47.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) chan_mgcp.c.1.79.2.4.diff
( 1) chan_mgcp.c.diff
( 2) core.27051.bz2
( 3) Master.csv
( 4) messages.20041002
( 5) messages.20041002.txt
Description:0  0x08059825 in ast_queue_frame (chan=0x4383c2a0, fin=0x424ae49c) at channel.c:384
1  0x0805f7a5 in ast_queue_hangup (chan=0x0) at channel.c:431
2  0x422a78b5 in handle_request (sub=0x81318f0, req=0x424ae60c, sin=0x4383c2a0)
   at chan_mgcp.c:3045
3  0x422a40e1 in mgcpsock_read (id=0x80fdeb0, fd=26, events=1, ignore=0x0) at chan_mgcp.c:3215
4  0x08052340 in ast_io_wait (ioc=0x811f0f0, howlong=1132708512) at io.c:267
5  0x422a256b in do_monitor (data=0x0) at chan_mgcp.c:3300
6  0x4002b9b1 in pthread_start_thread () from /lib/i686/libpthread.so.0

****** ADDITIONAL INFORMATION ******

I have no idea why and how this may happen...
Comments:By: Mark Spencer (markster) 2004-10-22 09:04:00

Your bug report says you can always reporduce the problem.  Can you run under valgrind and get a backtrace?

By: Andrey S Pankov (casper) 2004-10-22 10:40:13

No, the problem is reproducible only in production environment (heavy load, lots of calls), let's say more accurately its reproducibility is sometimes/random. I doubt I'll be able to run a production instance under valgrind.
It seems like chan pointer points to an area being dynamically allocated in:

handle_request (..., sin=0x4383c2a0)
ast_queue_frame (chan=0x4383c2a0, ...)
ast_queue_hangup (chan=0x0)

The only thing I can code in this case is:

Index: channel.c
===================================================================
RCS file: /usr/cvsroot/asterisk/channel.c,v
retrieving revision 1.139
diff -u -5 -r1.139 channel.c
--- channel.c   15 Sep 2004 14:12:45 -0000      1.139
+++ channel.c   22 Oct 2004 15:36:45 -0000
@@ -379,11 +379,14 @@
               ast_log(LOG_WARNING, "Unable to duplicate frame\n");
               return -1;
       }
       ast_mutex_lock(&chan->lock);
       prev = NULL;
-       cur = chan->pvt->readq;
+       cur = NULL;
+       if (chan->pvt) {
+               cur = chan->pvt->readq;
+       }
       while(cur) {
               if ((cur->frametype == AST_FRAME_CONTROL) && (cur->subclass == AST_CONTROL_HANGUP)) {
                       /* Don't bother actually queueing anything after a hangup */
                       ast_frfree(f);
                       ast_mutex_unlock(&chan->lock);
@@ -403,15 +406,16 @@
                       ast_frfree(f);
                       ast_mutex_unlock(&chan->lock);
                       return 0;
               }
       }
-       if (prev)
+       if (prev) {
               prev->next = f;
-       else
-               chan->pvt->readq = f;
-       if (chan->pvt->alertpipe[1] > -1) {
+       } else if (chan->pvt) {
+                       chan->pvt->readq = f;
+       }
+       if (chan->pvt && (chan->pvt->alertpipe[1] > -1)) {
               if (write(chan->pvt->alertpipe[1], &blah, sizeof(blah)) != sizeof(blah))
                       ast_log(LOG_WARNING, "Unable to write to alert pipe on %s, frametype/subclass %d/%d (qlen = %d): %s!\n",
                               chan->name, f->frametype, f->subclass, qlen, strerror(errno));
#ifdef ZAPTEL_OPTIMIZATIONS
       } else if (chan->timingfd > -1) {

But again you'll say "performance!" I suspect... Anyway, it's clear that something is miscoded somewhere, not properly locked or "I have no idea" as I stated it earlier.

By: Andrey S Pankov (casper) 2004-10-22 10:41:33

The disclaimer is on file at Digium.

By: Mark Spencer (markster) 2004-10-22 13:16:29

This is NOT the right fix.  This is a race, and the *race* must be fixed, not just disguised with this patch.

By: Andrey S Pankov (casper) 2004-10-23 05:27:59

Of course I understand this is a race...
Do you need any additional information from me?

Call flow?

chan_h323 -> app_queue -> chan_mgcp

Last console output? Seems to be something like:

-- Endpoint 'aaln/2@office2-1' observed 'hu'

without:

== Spawn extension (office-local, 103, 4) exited non-zero on 'MGCP/aaln/2@office2-1'

Please see attached warnings/notices from messages file just before the crash...

By: Mark Spencer (markster) 2004-10-23 16:38:12

Can you duplicate this without H.323?

By: Andrey S Pankov (casper) 2004-10-24 07:14:39

43024384 2004-10-22 09:00:57.000000000 +0000 core.27051
Please see bzipped core attached...

By: Andrey S Pankov (casper) 2004-10-24 07:45:25

Heh, it seems that the last call was an mcgp to h323 call... and the crash occured when mgcp endpoint was hanging up (please see my bugnote 10-23-04 05:27)

By: Mark Spencer (markster) 2004-10-24 09:03:08

The corefile is useless off your machine.

My question, again, is:

Can you duplicate this problem *without* using the H.323 channel driver at all.  Thanks.

By: Andrey S Pankov (casper) 2004-10-24 09:10:00

I can't... since if I remove chan_h323 my system will become totally unusable (we use h323 for incoming and outgoing calls). Sorry... :(

By: Andrey S Pankov (casper) 2004-10-25 08:57:27

Please tell what you think about this fix (chan_mgcp.c.diff)... Thanks.
Disclaimer is on file at Digium.

By: Andrey S Pankov (casper) 2004-10-25 09:03:48

Oops... it was not against current v1-0 CVS... Please see chan_mgcp.c.1.79.2.4.diff file. Thanks. ;)

By: Mark Spencer (markster) 2004-10-25 11:22:05

This patch fixes your problem?

By: Andrey S Pankov (casper) 2004-10-25 12:19:19

This patch fixes chan_mgcp... since at list one should check "if (f)" before "if (sub->owner)". I don't know if it fixes exactly this problem (ASTERISK-2650), but I hope after applying the fix we may mark this bug resolved for HEAD and closed after changes made to v1-0 branch. If a crash occurs in do_monitor() I promise to reopen the bug. :) Thanks.

By: Mark Spencer (markster) 2004-10-25 22:53:43

Fixed in CVS.  Hope that's really it :)

By: Russell Bryant (russell) 2004-10-26 11:31:40

added to 1.0.

Could you add .txt to your patches, please?  Thanks!

By: Andrey S Pankov (casper) 2004-11-01 05:02:05.000-0600

chan_mgcp.c.1.79.2.4.diff fixed another bug :(
So, I reopen this one.

By: Andrey S Pankov (casper) 2004-11-01 05:03:02.000-0600

0  0x08059825 in ast_queue_frame (chan=0x43a09390, fin=0x424ae49c) at channel.c:384
384             cur = chan->pvt->readq;
(gdb) bt
0  0x08059825 in ast_queue_frame (chan=0x43a09390, fin=0x424ae49c) at channel.c:384
1  0x0805f7a5 in ast_queue_hangup (chan=0x0) at channel.c:431
2  0x422a78c5 in handle_request (sub=0x8130408, req=0x424ae60c, sin=0x43a09390)
   at chan_mgcp.c:3043
3  0x422a40e1 in mgcpsock_read (id=0x80fdeb0, fd=26, events=1, ignore=0x0) at chan_mgcp.c:3213
4  0x08052340 in ast_io_wait (ioc=0x811f0f0, howlong=1134597008) at io.c:267
5  0x422a256b in do_monitor (data=0x0) at chan_mgcp.c:3298
6  0x4002b9b1 in pthread_start_thread () from /lib/i686/libpthread.so.0

By: Andrey S Pankov (casper) 2004-11-01 05:05:45.000-0600

I suspect that smth free chan->pvt before we are in ast_queue_frame...

By: Mark Spencer (markster) 2004-11-01 09:47:14.000-0600

It means we're trying to simultaneously trying to queue a frame while the channel is being destroyed.  What is the circumstance athat creates such a race?

By: Andrey S Pankov (casper) 2004-11-01 11:14:22.000-0600

Please see bugnote "casper 10-23-04 05:27"

By: Andrey S Pankov (casper) 2004-11-01 11:16:40.000-0600

and "casper 10-24-04 07:45" as well...

By: Mark Spencer (markster) 2004-11-01 11:25:04.000-0600

If you use SIP instead of MGCP for the phones, does the problem still occur?

By: Andrey S Pankov (casper) 2004-11-01 11:33:55.000-0600

I can't know it for sure, it occurs only on production systems and MGCP is always involved (because back traces are the same as I can see).

By: Andrey S Pankov (casper) 2004-11-01 11:39:29.000-0600

ast_queue_hangup (chan=0x0) - here is the mistery...
Why "chan->_softhangup |= AST_SOFTHANGUP_DEV;" does not fail if chan is NULL?

And then ast_queue_frame(chan=0x43a09390, ...)...
Why chan is not NULL while it was originally?

By: Mark Spencer (markster) 2004-11-02 14:31:04.000-0600

We're going to have to narrow down this problem if you ever want it to be fixed, especially given that is is not duplicated elsewhere.

Begin by replacing your MGCP devices with SIP devices, to see if this is a problem within MGCP or H.323.  If it is in MGCP, it will be easier for me to try to help you with and if not, then well, we'll have to find someone who knows h.323...

By: Andrey S Pankov (casper) 2004-11-03 10:31:18.000-0600

Today this happend again :(
What do you think about change below (considering that chan is NULL from backtrace)?

int ast_queue_hangup(struct ast_channel *chan)
{
        struct ast_frame f = { AST_FRAME_CONTROL, AST_CONTROL_HANGUP };
-        chan->_softhangup |= AST_SOFTHANGUP_DEV;
-        return ast_queue_frame(chan, &f);
+        if (chan) {
+                chan->_softhangup |= AST_SOFTHANGUP_DEV;
+                return ast_queue_frame(chan, &f);
+        }
+        return -1;
}

By: Mark Spencer (markster) 2004-11-03 22:50:10.000-0600

It's not really NULL, it's a race of some sort where one side is hanging up while the other is queuing a frame.

Apparently I haven't been clear enough in what I am saying, so I'm going to try just once more:

REPLACE YOUR MGCP UNIT WITH A SIP UNIT AND SEE IF THE PROBLEM STILL OCCURS, THEN AFTER SEVERAL DAYS OR HOWEVER LONG IT TAKES TO KNOW, LET US KNOW WHETHER THE PROBLEM IS SPECIFIC TO MGCP OR H.323 OR WHAT.

Thanks :)

By: Pavel Litvinenko (pavell) 2004-11-04 04:04:39.000-0600

I have the same problem too, but I use chan_oh323

Asterisk version 1.0.2

Loaded symbols for /usr/lib/asterisk/modules/format_sln.so
#0  0x08059cb6 in ast_queue_frame (chan=0x41b063c8, fin=0xbe7ff0ac) at channel.c:401
401             cur = chan->pvt->readq;
(gdb) bt full
#0  0x08059cb6 in ast_queue_frame (chan=0x41b063c8, fin=0xbe7ff0ac) at channel.c:401
       f = (struct ast_frame *) 0x41b03a08
       prev = (struct ast_frame *) 0x0
       cur = (struct ast_frame *) 0x41b063c8
       blah = 1
       qlen = 0
#1  0x08060335 in ast_queue_hangup (chan=0x0) at channel.c:448
       f = {frametype = 4, subclass = 1, datalen = 0, samples = 0, mallocd = 0, offset = 0, src = 0x0, data = 0x0, delivery = {tv_sec = 0, tv_usec = 0}, prev = 0x0, next = 0x0}
#2  0x403eb826 in handle_request (sub=0x81177e0, req=0xbe7ff20c, sin=0x41b063c8) at chan_mgcp.c:3043
       ev = 0x8117220 "aaln/4"
       f = {frametype = 0, subclass = 0, datalen = 0, samples = 0, mallocd = 0, offset = 0, src = 0x0, data = 0x0, delivery = {tv_sec = 0, tv_usec = 0}, prev = 0x0, next = 0x0}
       p = (struct mgcp_endpoint *) 0x8117208
       iabuf = "-&ASTERISK-1052;\177&ASTERISK-9301;$&ASTERISK-1061;>@p&ASTERISK-1065;\177&ASTERISK-9301;\000\000\000"
       res = 1102078920
       start = 4
#3  0x403e7f90 in mgcpsock_read (id=0x8120df0, fd=15, events=1, ignore=0x0) at chan_mgcp.c:3213
       req = {len = 81, verb = 0xbe7ff428 "NTFY", identifier = 0xbe7ff42d "8126", endpoint = 0xbe7ff432 "aaln/4@pavlenko2", version = 0xbe7ff443 "MGCP 1.0", headers = 4, header = {0xbe7ff428 "NTFY",
   0xbe7ff454 "N:[80.68.3.65]:2427", 0xbe7ff468 "X:431bd7b7", 0xbe7ff473 "O: hu", 0xbe7ff479 "", 0x0 <repeats 59 times>}, lines = 0, line = {0xbe7ff479 "", 0x0 <repeats 63 times>},
 data = "NTFY\0008126\000aaln/4@pavlenko2\000MGCP 1.0\000NCS 1.0\000N:[80.68.3.65]:2427\000X:431bd7b7\000O: hu", '\0' <repeats 1419 times>, cmd = 0, trid = 0, next = 0x0}
       sin = {sin_family = 2, sin_port = 31497, sin_addr = {s_addr = 51030208}, sin_zero = "\000\000\000\000\000\000\000"}
       sub = (struct mgcp_subchannel *) 0x81177e0
       res = 1102078920
       len = 16
       result = 200
       ident = 586
       iabuf = '\0' <repeats 15 times>
#4  0x08052450 in ast_io_wait (ioc=0x810f218, howlong=1102078920) at io.c:267
       res = 1
       x = 0
       origcnt = 1
ASTERISK-1  0x403e646d in do_monitor (data=0x0) at chan_mgcp.c:3298
       res = 1000
       reloading = 1077875588
ASTERISK-2  0x40027ca4 in pthread_detach () from /lib/libpthread.so.0
No symbol table info available.
ASTERISK-3  0x401f3fc7 in clone () from /lib/libc.so.6
No symbol table info available.
(gdb)

edited on: 11-04-04 04:56

By: Mark Spencer (markster) 2004-11-07 16:44:41.000-0600

Okay try latest CVS.  I've done some major work on trying to improve the locking.

By: Mark Spencer (markster) 2004-11-09 00:12:12.000-0600

Any feedback here from either of you?

By: Pavel Litvinenko (pavell) 2004-11-09 00:55:58.000-0600

Let us to test - today is a first working day of this week ...
trying to port your patches to 1.0.2

By: Andrey S Pankov (casper) 2004-11-09 08:03:10.000-0600

I hope PavelL's tests will be successful and your locking improvements will go to v1-0 branch too.

I'm still happy with my fix ("casper 11-03-04 10:31"), we'll see I hope. May be I'll backport your locking improvements to my source tree when I find enough time to think a bit more on them. :) Thank you any way for your great work.

By: Andrey S Pankov (casper) 2004-11-10 09:49:49.000-0600

PavelL: Any news? Don't you experience 100% CPU load when running * with MGCP endpoints or deadlocks?

markster: I doubt that for(;;) polling with usleep(1) will not be eating 100% CPU cycles in case when lock is being held, e.g. on network timeouts or retransmits.

Another thing - possible deadlocks in mgcp_queue_frame. I'm not sure here, but the code there seems a bit suspect to me. Thanks.

By: Mark Spencer (markster) 2004-11-10 15:43:55.000-0600

Just run this code, with no other changes, and let me know what happens.  If you don't care about this bug report and/or my attempts to resolve it through the appropriate changes let me know so I can close it out.  Thanks and look forward to feedback soon.

By: Mark Spencer (markster) 2004-11-14 19:57:46.000-0600

Seems to have lost interest.  I've got some more MGCP fixes in as of today, maybe these will all help out.

By: Russell Bryant (russell) 2004-11-14 21:39:26.000-0600

will consider merging Mark's MGCP fixes into stable ...

By: Digium Subversion (svnbot) 2008-01-15 15:11:45.000-0600

Repository: asterisk
Revision: 4100

U   trunk/channels/chan_mgcp.c
U   trunk/channels/chan_zap.c

------------------------------------------------------------------------
r4100 | markster | 2008-01-15 15:11:45 -0600 (Tue, 15 Jan 2008) | 2 lines

Fix MGCP dtmf buglet (bug ASTERISK-2650)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=4100

By: Digium Subversion (svnbot) 2008-01-15 15:11:49.000-0600

Repository: asterisk
Revision: 4105

U   branches/v1-0/channels/chan_mgcp.c

------------------------------------------------------------------------
r4105 | russell | 2008-01-15 15:11:49 -0600 (Tue, 15 Jan 2008) | 2 lines

fix mgcp bug (bug ASTERISK-2650)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=4105

By: Digium Subversion (svnbot) 2008-01-15 15:12:47.000-0600

Repository: asterisk
Revision: 4174

U   trunk/channels/chan_mgcp.c

------------------------------------------------------------------------
r4174 | markster | 2008-01-15 15:12:46 -0600 (Tue, 15 Jan 2008) | 2 lines

Major MGCP locking fixes (bug ASTERISK-2650)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=4174