ASTERISK-12355: Asterisk segfault at startup/ael reload

Summary: ASTERISK-12355: Asterisk segfault at startup/ael reload

Reporter: Chris Boot (bootc) Labels:

Date Opened: 2008-07-10 04:14:22 Date Closed: 2009-02-19 12:03:50.000-0600

Priority: Critical Regression? No

Status: Closed/Complete Components: PBX/pbx_ael

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) ael-sample.tar.gz
( 1) btfull.txt
( 2) btfull2.txt

Description: I have a reasonably large and complex dialplan, and after I added a switch statement pbx_ael started crashing most of the time (but not always) after an "ael reload" command, or on startup when loading the dialplan.

****** STEPS TO REPRODUCE ******

Probably highly dependent on my dialplan, but:
1. start Asterisk
or
1. start Asterisk
2. login on the CLI
3. issue "ael reload"

Comments: By: Leif Madsen (lmadsen) 2008-07-10 08:04:41

Please provide a backtrace by following the instructions in the doc/backtrace.txt file of your Asterisk source. You will need to enable the DONT_OPTIMIZE compiler flag within menuselect, run 'make install' again, restart Asterisk, and reproduce the issue.

Once you have the backtrace, you can upload it to this bug.

Thanks!
/L
By: Leif Madsen (lmadsen) 2008-07-10 08:05:08

And of course after I post that I notice you've done just that :) Sorry for the noise!
By: Leif Madsen (lmadsen) 2008-07-10 08:06:49

Sorry again -- it appears it's too early for me. You have attached the backtrace, but you need to enable the DONT_OPTIMIZE flag that I mentioned in my previous note. Once you've done that, you can perform the backtrace again and attach it to this bug.

Thanks!
By: Chris Boot (bootc) 2008-07-10 10:34:09

I've attached the 'bt full' from a core from asterisk with DONT_OPTIMIZE set. It took some doing this time; I only managed to reproduce it after a dozen or more restarts and reloads...
By: Steve Murphy (murf) 2008-07-11 12:41:55

Your core file indicates that the crash is in the ael parser.

Attach your dialplan to this bug, so I can try to reproduce and fix the problem.

If you have something proprietary, wash it and see if you still have problems,
and then attach the 'washed', non-proprietary dialplan.

If you aren't even aware that you had an ael file, then remove the /etc/asterisk/extensions.ael file, and the problem should go away, but still,
attach the extensions.ael file you had (probably the default .ael file, but
still, it's good to know.)

The AEL stuff should not crash.
By: Chris Boot (bootc) 2008-07-14 06:12:39

I've just attached our AEL dialplan. It consists of a skeleton extensions.ael which includes a directory of AEL files (also attached) which form the meat of the dialplan. It also extensively uses func_odbc and stuff like that.
By: Steve Murphy (murf) 2008-07-15 11:38:13

Just tried your dialplan. I did 36 'ael reload' commands, and still no problems
in the svn head of 1.4.

So, your task will be to find the offending module. It could be that one of the other modules is walking over memory of the AEL module, and messing things up. Try turning off modules or removing their .conf files in /etc/asterisk, and see if one of them, when not loaded, stops the crashes, and when it is loaded, the crashes resume.
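
To illustrate what "walking over memory" means here, the following is a minimal, self-contained C sketch. It is not taken from Asterisk; the struct, field names, and function are hypothetical. It models two owners sharing adjacent memory and shows that a write with an unchecked length into the first owner's buffer silently changes the second owner's state. The copy is clamped to the enclosing struct so the demonstration itself stays well-defined.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: two unrelated owners that happen to be neighbours. */
struct shared_region {
    char other_module_buf[8];  /* buffer owned by some other module */
    char parser_state[8];      /* state owned by the parser */
};

/*
 * Writes `len` bytes at the start of the region without checking `len`
 * against the size of other_module_buf. Returns 1 if parser_state was
 * altered by that write, 0 if it survived intact.
 */
int write_clobbers_parser_state(size_t len)
{
    struct shared_region r;
    char payload[sizeof r];
    char before[sizeof r.parser_state];

    /* Clamp to the whole struct so this demo never leaves the object. */
    if (len > sizeof r) {
        len = sizeof r;
    }

    memset(&r, 0, sizeof r);
    memcpy(r.parser_state, "PARSER", 7);
    memcpy(before, r.parser_state, sizeof before);
    memset(payload, 'X', sizeof payload);

    /* The careless write: nothing limits len to sizeof r.other_module_buf. */
    memcpy(&r, payload, len);

    return memcmp(before, r.parser_state, sizeof before) != 0;
}
```

Calling write_clobbers_parser_state(8) leaves the parser state intact, while a length of 12 overwrites part of it. In a real process the victim would only fail later, when it next uses the damaged state, which is why unloading modules one at a time is a reasonable way to find the culprit.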
By: Chris Boot (bootc) 2008-07-15 11:54:35

Hmm, that's a shame. It happens really quite often and it puts me off doing reloads during the day on our system. I assume you're trying this on 64-bit linux as I am?

I'll try to disable all the modules we don't use and see if it still happens I guess.
By: Steve Murphy (murf) 2008-07-15 12:51:43

Actually, I was doing the testing on a 32-bit machine, so I built the latest
1.4 on a 64-bit machine, and ran it thru 40 ael reloads, with no problems...
By: Steve Murphy (murf) 2008-07-15 18:52:29

I'm going to close this until you or someone else can come up with a way to
reproduce the problem. Feel free to re-open it the moment you have something
new.
By: Chris Boot (bootc) 2008-07-16 09:07:08

I've just reproduced the problem on a different machine (VMware VM) running RHEL5 64-bit, and CentOS 5 64-bit. I tried with Debian Etch 64-bit but that did NOT reproduce the problem. Is this likely to be an issue within Asterisk, or is this now more likely to be a glibc problem in RHEL?
By: Steve Murphy (murf) 2008-07-25 15:38:46

Well, it might explain why I'm not reproducing it. I'm running Ubuntu on both my 32 bit and 64 bit test systems, and that's debian based. Please, forgive me if I'm a bit redundant: you have demonstrated that you can repro this problem on centos, using exactly the data you sent? Do you have an extensions.conf file? If so, I will need that also, as merge-and-delete is a possible problem and would involve both files. Are there any contexts with the same name between the two files?
By: Steve Murphy (murf) 2008-07-26 11:36:49

After dozens of attempts to reload on a 64-bit centos 4.4 virtual machine,
I still could not repro this bug. Our RHEL5 vm isn't operational, and we don't
have centos 5 available at the moment. If it's all the same to you, you might
go ahead and move over to a debian based distribution.

I'll see if we can add centos 5 to the mix.
By: Steve Murphy (murf) 2008-07-30 12:30:00

OK, I've installed 64-bit CentOS 5.2, and run asterisk 1.4, and reloaded
extensions.ael over 100 times and no problems. Our RHEL 64-bit VM isn't
working.

My suspicion is that some app is walking on memory, and that the trigger is a
unique mixture of OS, app, and dialplan.

You are welcome to re-open this issue with any updates that might help
localize the problem.

In the meantime, if this bug gets irritating, I suggest moving your
application to a platform that doesn't exhibit this problem.
By: Chris Boot (bootc) 2009-02-09 11:04:34.000-0600

I got an update from our ticket on Red Hat support, who say the following:

"I have looked at the asterisk-1.4.21.1 sources and found out that the asterisk developer treat size_t as int and cast and compare between them deliberately. In the flex sources there seems to be even cast from pointer to int. Such an application will never work reliably on 64-bit architecture.

This really doesn't look to be an error in glibc. The asterisk sources are 64-bit unaware. All in all MALLOC_CHECK_ was invented for catching errors that would be difficult to catch otherwise."

If this is the case this would explain why the crashes appear to happen when I reload the dialplan (since that's when flex would be called to parse it).
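
The class of problem Red Hat support describes can be shown in isolation. The C sketch below is not taken from the Asterisk or flex sources, and every function name in it is hypothetical. It shows three things: checking whether a size_t value fits in an int, why a mixed int/size_t comparison misbehaves for negative values, and how a pointer can be carried through an integer type that is wide enough (uintptr_t) rather than through int.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a size_t length can be stored in an int without loss. */
int fits_in_int(size_t len)
{
    return len <= (size_t)INT_MAX;
}

/*
 * Hazard: when an int is compared with a size_t, the int is converted to
 * the unsigned type first, so -1 becomes a very large value. The explicit
 * cast below spells out what the compiler otherwise does implicitly.
 */
int naive_less_than(int n, size_t len)
{
    return (size_t)n < len;
}

/* Safer variant: handle negative values before converting to size_t. */
int safe_less_than(int n, size_t len)
{
    return n < 0 || (size_t)n < len;
}

/*
 * A pointer survives a round trip through uintptr_t. Casting through int
 * instead would drop the upper bits on typical 64-bit platforms.
 */
int pointer_round_trips(const void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (const void *)bits == p;
}
```

With these definitions, naive_less_than(-1, 10) reports that -1 is not less than 10, while safe_less_than(-1, 10) gives the expected answer. Whether any such pattern is actually responsible for the crash reported here is a separate question.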
By: Steve Murphy (murf) 2009-02-13 13:04:55.000-0600

I've just gone over the flex-generated source we are publishing, and while I see a fair amount of pointer arithmetic, I was not able to find any instances of where any int was cast to a pointer or vice-versa. I did see some comparisons between a size_t and an int, but I've never given it any thought, as the compiler (gcc) seems to handle it just fine. Recently, while compiling stuff usually not compiled into asterisk by default, I found some void ptr <=> int and size_t <=> void* type issues, which the compiler pointed out, which I fixed and committed. We have an automated build farm that is a mix of linux and freebsd, 32 and 64 bit, and if there's going to be issues, they usually get highlighted there.

I have noted that ael loads are not locked, and doing two loads at once can lead to problems, but I haven't gotten around to addressing that yet. Could this be an issue for you?
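
As a rough illustration of what locking a reload would involve, here is a minimal C sketch using a plain POSIX mutex. It is an assumption-laden stand-in, not Asterisk code: the function names, the counter, and the placeholder reload body are all hypothetical, and Asterisk's own locking wrappers are deliberately not used.

```c
#include <pthread.h>

/* One lock guarding the whole reload path (hypothetical). */
static pthread_mutex_t reload_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state that a reload would rebuild. */
static int reload_count = 0;

/* Placeholder for the real work: parsing files and swapping in contexts. */
static void do_reload_unlocked(void)
{
    reload_count++;
}

/*
 * Runs one reload while holding the lock, so a second caller blocks until
 * the first has finished. Returns 0 on success, -1 if locking failed.
 */
int guarded_reload(void)
{
    if (pthread_mutex_lock(&reload_lock) != 0) {
        return -1;
    }
    do_reload_unlocked();
    pthread_mutex_unlock(&reload_lock);
    return 0;
}

/* Reads the counter under the same lock. Returns -1 if locking failed. */
int reloads_completed(void)
{
    int n;

    if (pthread_mutex_lock(&reload_lock) != 0) {
        return -1;
    }
    n = reload_count;
    pthread_mutex_unlock(&reload_lock);
    return n;
}
```

The design choice here is to block a second reload rather than reject it; using pthread_mutex_trylock and returning early would be the alternative if overlapping requests should simply be dropped.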
By: Steve Murphy (murf) 2009-02-16 10:48:50.000-0600

Wait a minute. I looked at your stack traces again, and it reminds me of a bug I fixed a while back. Please try the latest svn of asterisk, and see if your problem persists. (try: svn co http://svn.digium.com/svn/asterisk/branches/1.4 latest-1.4-svn)
By: Chris Boot (bootc) 2009-02-16 12:33:12.000-0600

Do you recall what revision the suspected fix might be in? We run this in a production environment so I'd rather not touch it without feeling fairly certain this will fix it...
By: Steve Murphy (murf) 2009-02-16 15:28:51.000-0600

See http://svn.digium.com/view/asterisk?view=revision&revision=162013

The 4 files that were changed were:

branches/1.4/include/asterisk/ael_structs.h
branches/1.4/pbx/ael/ael.flex
branches/1.4/pbx/ael/ael_lex.c
branches/1.4/pbx/pbx_ael.c

So, if you want to test just that change, you can either fetch that revision of 1.4, and copy in just those files, or fetch the patch, and apply that.
By: Steve Murphy (murf) 2009-02-19 12:03:49.000-0600

OK, I'm convinced that this bug is a duplicate of 14019.

The fixes should be contained in 1.4.22.1, 1.4.22.2, 1.4.23-rc3, 1.4.23.1 (and above). Since you are using 1.4.21, iirc, then that would explain why you still see this.

If I'm wrong, and an upgrade or patch still has no effect, then re-open this bug and we'll look further into it.