Summary: ASTERISK-00419: * crashes once in a week at this function: ast_waitfor
Reporter: levon (levon)
Labels:
Date Opened: 2003-10-23 06:03:21
Date Closed: 2004-09-25 02:56:57
Priority: Critical
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
Description:
#0  ast_waitfor (c=???, ms=???) at channel.c:920


****** ADDITIONAL INFORMATION ******

No additional bt is available. I *think* it happens when someone dials in and hangs up before any agent has taken the call, but I'm not *sure* about it, as this is hard to reproduce. Still, ast_waitfor shouldn't crash; maybe it should issue a warning when asked to wait on a channel that isn't valid anymore. Thanks for looking into it.
Comments:

By: x martinp (martinp) 2003-11-05 11:06:25.000-0600

You need to run asterisk with the -g option to produce a core dump. Also, if you run Red Hat 9, you need to set this variable before you run asterisk and gdb:

export LD_ASSUME_KERNEL=2.4.1

By: levon (levon) 2003-11-14 04:57:17.000-0600

Thanks Martin, but I'm very well aware of how to run asterisk to produce a core. It's simply that when this error occurs, everything is so messed up that one cannot PRODUCE a backtrace out of the core dump. I can give you the core, but I doubt it will help. This error is no longer happening just once a week, but once a day or at very irregular intervals.

gdb asterisk core.2739
[...]
#0  0x0805b180 in ast_waitfor_n (c=???, n=???, ms=???) at channel.c:912
912             return ast_waitfor_nandfds(c, n, NULL, 0, NULL, NULL, ms);
(gdb) bt
#0  0x0805b180 in ast_waitfor_n (c=???, n=???, ms=???) at channel.c:912
Cannot access memory at address 0x0

By: levon (levon) 2003-11-14 04:59:45.000-0600

Asterisk *has* to be MUCH MORE fault tolerant with channels that have been deleted. Even when only ONE thread remains that accesses such a "deleted" channel, * segfaults. This has been in for *months* now and needs to be fixed asap; it is a really bad showstopper for everyone using it in production.

By: Brian West (bkw918) 2003-11-14 09:43:08.000-0600

What distro and kernel version are you running? Do you have any outbound SIP registrations? And are you running CVS? If not, then what version?

By: levon (levon) 2003-11-14 09:55:06.000-0600

The production system is on Debian stable, kernel 2.4.20 vanilla. No outbound SIP registrations, only some registered SIP phones, chan_zap and chan_capi. The error has been happening for several months now, always with current CVS.

The most disturbing part for me is the lack of a correct bt.

I've added a call to ast_log() in ast_waitfor_n() to investigate the circumstances that lead to this problem and am waiting for the next crash to occur. ;)

By: Brian West (bkw918) 2003-11-14 09:58:31.000-0600

I'm wondering if chan_capi could have something to do with this.  I have bleeding edge CVS boxes that are weeks old without these issues.  Running on slackware.

asterisk*CLI> show version
Asterisk CVS-10/26/03-00:25:07 built by root@asterisk on a i686 running Linux
asterisk*CLI> show uptime
System uptime: 2 weeks, 18 hours, 4 minutes, 50 seconds
Last reload: 6 days, 20 hours, 49 minutes, 16 seconds

By: Brian West (bkw918) 2003-11-14 10:07:10.000-0600

I asked kapejod (author of chan_capi, for those watching) to look over this as well, and he says that it doesn't look like anything associated with chan_capi.

bkw

By: Brian West (bkw918) 2003-11-20 13:50:21.000-0600

does this still happen?

By: levon (levon) 2003-11-24 05:55:48.000-0600

Yes, it does. I took some time to add debug traces and if {} clauses to the code to make sure no invalid arguments could screw things up internally, and I now seem to be able to produce a valid stack trace again. I've narrowed it down to ast_waitfor_nandfds, which is called by ast_waitfor_n with empty (NULL) file descriptors.

The bt:
(gdb) bt
#0  0x08057350 in ast_waitfor_nandfds (c=0xbf3feb24, n=13631490, fds=0x0, nfds=0, exception=0x0, outfd=0x0, ms=0xbf3fefdc) at channel.c:848
#1  0x0805b2b9 in ast_waitfor_n (c=0xbf3feb24, n=2, ms=0xbf3fefdc) at channel.c:919
#2  0x405215f4 in wait_for_answer (in=0x81ae910, outgoing=0x81b49c8, to=0xbf3fefdc, allowredir_in=0xbf3fefe0, allowredir_out=0xbf3fefe4,
   allowdisconnect=0xbf3fefe8) at app_dial.c:182
#3  0x40522c9a in dial_exec (chan=0x81ae910, data=0xbf3ff7b4) at app_dial.c:635
#4  0x08060ed0 in pbx_exec (c=0x81ae910, app=0x811e8a8, data=0xbf3ff7b4, newstack=1) at pbx.c:396
#5  0x08062f39 in pbx_extension_helper (c=0x81ae910, context=0x81aea68 "foo", exten=0x81aeb5c "xxxxxxxxxxxxxxxx", priority=1,
   callerid=0x8562370 "yyyyyyyyyy", action=1) at pbx.c:1154
#6  0x08063c7d in ast_pbx_run (c=0x81ae910) at pbx.c:1638
#7  0x08069fee in pbx_thread (data=0x81ae910) at pbx.c:1859
#8  0x400220ba in pthread_start_thread () from /lib/libpthread.so.0

As you can see, the argument fds is NULL, but

#0  0x08057350 in ast_waitfor_nandfds (c=0xbf3feb24, n=13631490, fds=0x0, nfds=0, exception=0x0, outfd=0x0, ms=0xbf3fefdc) at channel.c:848
848                             if (c[x]->fds[y] > -1) {

it tries to access it as an array at some position, which results in SIG11.

Now that it's narrowed down, I looked into ast_waitfor_n and it's true, fds is *always* NULL. Now I ask myself why ast_waitfor_nandfds doesn't check for that condition.

By: Paul Cadach (pcadach) 2003-11-24 14:42:29.000-0600

As you pointed out, line 848 of channel.c checks not the fds argument but the fds array of each channel in the channel array argument (c).

This looks like broken hardware or a memory overwrite: ast_waitfor_n() was called with n=2, but ast_waitfor_nandfds() was then called with n=13631490, although it must receive the same value that was passed to ast_waitfor_n().
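
To make the distinction above concrete, here is a minimal, self-contained sketch. It is not Asterisk's code: `struct sketch_channel`, `SKETCH_MAX_FDS`, `sketch_count_pollable()` and its `max_chans` parameter are hypothetical names introduced only for illustration. It shows that the loop reads each channel's own `fds` array rather than the separate `fds` argument (which may legitimately be NULL when `nfds` is 0), and how a cheap sanity check on `n` would reject a value like 13631490 before the loop walks off the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_MAX_FDS 4 /* hypothetical size, not Asterisk's real constant */

/* Hypothetical stand-in for a channel: only the per-channel fds array. */
struct sketch_channel {
    int fds[SKETCH_MAX_FDS];
};

/*
 * Count pollable descriptors over n channels plus nfds extra descriptors.
 * Returns -1 when the arguments look corrupt.
 * extra_fds may be NULL as long as nfds == 0 (the ast_waitfor_n() case).
 */
int sketch_count_pollable(struct sketch_channel **c, int n,
                          const int *extra_fds, int nfds, int max_chans)
{
    int x, y, count = 0;

    assert(max_chans > 0); /* caller must supply a sane upper bound */

    if (n < 0 || n > max_chans || (n > 0 && c == NULL)) {
        fprintf(stderr, "warning: implausible channel count %d\n", n);
        return -1;
    }
    if (nfds < 0 || (nfds > 0 && extra_fds == NULL)) {
        fprintf(stderr, "warning: nfds=%d but no fds array\n", nfds);
        return -1;
    }

    for (x = 0; x < n; x++) {
        if (c[x] == NULL) {
            fprintf(stderr, "warning: channel slot %d is NULL, skipping\n", x);
            continue;
        }
        /* This is the per-channel array, not the extra_fds argument. */
        for (y = 0; y < SKETCH_MAX_FDS; y++) {
            if (c[x]->fds[y] > -1)
                count++;
        }
    }

    /* The separate fds argument is only touched when nfds > 0. */
    for (x = 0; x < nfds; x++) {
        if (extra_fds[x] > -1)
            count++;
    }
    return count;
}
```

Checks like these only catch NULL slots and implausible counts. They cannot detect a pointer to a channel that has already been freed, and they would not explain why `n` changed between the two frames, so the underlying overwrite would still need to be found.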

By: levon (levon) 2003-11-25 05:54:26.000-0600

It's always easy to say that it's broken hardware, isn't it? ;) Well, the crash reproduces too consistently for it to be hardware. Also, I make sure our hardware is tested thoroughly before I run production servers.

Another thing is that I needed to add if {} clauses to the code until I got a valid backtrace. As you can see from the start of this bug report, this wasn't the case at the beginning. I could not generate a bt at all, which indicates that there is in fact overwritten memory.

I'd agree if this happened with random values and at random places in the code, but it doesn't. It's always in this stack (ast_waitfor_....).

I'll try gathering more info until we have a more clear picture of what happens.

By: Brian West (bkw918) 2004-01-06 23:57:21.000-0600

This still an issue?

By: jrollyson (jrollyson) 2004-01-14 03:12:34.000-0600

Unable to duplicate, may no longer be occurring.

By: Paul Cadach (pcadach) 2004-02-08 01:01:14.000-0600

Reminder sent to levon

Does your * PC have dual CPUs or hyper-threading enabled? Does this bug appear on a single-CPU machine (non-SMP kernel)? I saw something like that with H.323 on SMP machines...