ASTERISK-13165: Starting or restarting asterisk causes seg fault and core dump, apparently in ael/pval.c:4833

[Home]

Summary: ASTERISK-13165: Starting or restarting asterisk causes seg fault and core dump, apparently in ael/pval.c:4833

Reporter: Christopher K. Johnson (ckjohnsonme) Labels:

Date Opened: 2008-12-03 20:11:54.000-0600 Date Closed: 2009-02-19 12:00:21.000-0600

Priority: Critical Regression? No

Status: Closed/Complete Components: PBX/pbx_ael

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 14019.diff
( 1) extensions.ael

Description: [root@freedom asterisk]# asterisk -V
Asterisk SVN-trunk-r160791
[root@freedom asterisk]# service asterisk start
Starting asterisk: [ OK ]
[root@freedom asterisk]# service asterisk restart
Shutting down asterisk: Asterisk ended with exit status 0
Asterisk shutdown normally.
[ OK ]
Starting asterisk: [ OK ]
[root@freedom asterisk]# /usr/sbin/safe_asterisk: line 138: 11599 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY}
Asterisk ended with exit status 139
Asterisk exited on signal 11.
Automatically restarting Asterisk.
mpg123: no process killed
/usr/sbin/safe_asterisk: line 138: 11643 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY}
Asterisk ended with exit status 139
Asterisk exited on signal 11.
Automatically restarting Asterisk.
mpg123: no process killed
/usr/sbin/safe_asterisk: line 138: 11688 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY}
Asterisk ended with exit status 139
Asterisk exited on signal 11.
Automatically restarting Asterisk.
mpg123: no process killed
/usr/sbin/safe_asterisk: line 138: 11732 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY}
Asterisk ended with exit status 139
Asterisk exited on signal 11.
Automatically restarting Asterisk.
mpg123: no process killed
/usr/sbin/safe_asterisk: line 138: 11775 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY}
Asterisk ended with exit status 139
Asterisk exited on signal 11.
Automatically restarting Asterisk.
mpg123: no process killed

The problem occurs frequently but not reliably. Starting asterisk service usually causes one or more segfaults before safe_asterisk successfully starts asterisk.

****** ADDITIONAL INFORMATION ******

gdb /usr/sbin/asterisk -core /tmp/core.freedom.somerville.dirigo.net-2008-12-03T20:49:24-0500
...
#0 destroy_pval (item=0x7fdd00000000) at ael/pval.c:4833
4833 nxt = i->next;

System is Linux Fedora release 9 (Sulphur)
Kernel 2.6.27.5-41.fc9.x86_64
Hardware is multiprocessor - AMD Phenom(tm) 9850 Quad-Core Processor

I had the same results with tag 1.6.1-beta2 and beta3.

extensions.conf is an empty file. All extensions are defined in extensions.ael

Once asterisk succeeds in starting, it operates correctly.

Comments: By: Christopher K. Johnson (ckjohnsonme) 2008-12-03 20:45:25.000-0600

From the asterisk console an 'ael reload' also usually, but not reliably, causes one or more segfaults, and asterisk restarting with each of course.
Core dump shows same "at ael/pval.c:4833".

'dialplan reload' repeatedly does NOT cause segfault. Just 'ael reload'.
By: Christopher K. Johnson (ckjohnsonme) 2008-12-03 21:47:15.000-0600

The attached extensions.ael has been pared down from my original configuration, but still results in the same segfault sometimes upon service asterisk restart or ael reload. The sample extensions.ael provided by 'make samples' does not seem to segfault.
By: Steve Murphy (murf) 2008-12-04 13:58:21.000-0600

Very interesting. I'll look at it. You've done a good job giving me something to chew on.
By: Steve Murphy (murf) 2008-12-04 17:14:40.000-0600

I've got to re-install linux on my 64-bit machine. It's seriously messed up.
In the meantime, if you could give a complete stack trace from your machine,
it might help me. The snippet of the gdb trace is not sufficient.
By: Christopher K. Johnson (ckjohnsonme) 2008-12-04 21:43:50.000-0600

Here is one:
#0 destroy_pval (item=0x7fb200000000) at ael/pval.c:4833
#1 0x000000000481614f in destroy_pval_item (item=0x7fb2a007a5b0)
at ael/pval.c:4809
#2 0x0000000004816321 in destroy_pval (item=0x7fb200000000) at ael/pval.c:4835
#3 0x000000000481614f in destroy_pval_item (item=0x7fb2a007a720)
at ael/pval.c:4809
#4 0x0000000004816321 in destroy_pval (item=0x7fb200000000) at ael/pval.c:4835
ASTERISK-1 0x000000000481614f in destroy_pval_item (item=0x7fb2a0079ae0)
at ael/pval.c:4809
ASTERISK-2 0x0000000004816321 in destroy_pval (item=0x7fb200000000) at ael/pval.c:4835
ASTERISK-3 0x00007fb294ab92e9 in pbx_load_module () at pbx_ael.c:150
ASTERISK-4 0x00007fb294ab932d in handle_cli_ael_reload (e=0x7fb200000000,
cmd=-55342744, a=0x25) at pbx_ael.c:206
ASTERISK-5 0x00000000004544b0 in ast_cli_command_full (uid=0, gid=0, fd=21,
s=0x41f80ae0 "ael reload") at cli.c:2224
ASTERISK-6 0x00000000004546e3 in ast_cli_command_multiple_full (uid=0, gid=0, fd=21,
size=<value optimized out>, s=0x41f80d20 "ael reload") at cli.c:2247
ASTERISK-7 0x000000000042f733 in netconsole (vconsole=<value optimized out>)
at asterisk.c:1071
ASTERISK-8 0x00000000004f14ac in dummy_start (data=<value optimized out>)
at utils.c:1024
ASTERISK-9 0x00000036fd60729a in start_thread () from /lib64/libpthread.so.0
ASTERISK-10 0x00000036fcae439d in clone () from /lib64/libc.so.6

And another:
#0 destroy_pval (item=0x7fe900000000) at ael/pval.c:4833
#1 0x000000000318014f in destroy_pval_item (item=0x7fe9082b3df0) at ael/pval.c:4809
#2 0x0000000003180321 in destroy_pval (item=0x7fe900000000) at ael/pval.c:4835
#3 0x000000000318014f in destroy_pval_item (item=0x7fe9082b3f60) at ael/pval.c:4809
#4 0x0000000003180321 in destroy_pval (item=0x7fe900000000) at ael/pval.c:4835
ASTERISK-1 0x000000000318014f in destroy_pval_item (item=0x7fe9082b4d40) at ael/pval.c:4809
ASTERISK-2 0x0000000003180321 in destroy_pval (item=0x7fe900000000) at ael/pval.c:4835
ASTERISK-3 0x00007fe8fe36d2e9 in pbx_load_module () at pbx_ael.c:150
ASTERISK-4 0x00000000004906ba in load_resource (resource_name=0x1e9d430 "pbx_ael.so",
global_symbols_only=<value optimized out>) at loader.c:694
ASTERISK-5 0x0000000000490eef in load_modules (preload_only=0) at loader.c:898
ASTERISK-6 0x0000000000431db5 in main (argc=<value optimized out>, argv=0x7fff16af58b8)
at asterisk.c:3407
ASTERISK-7 0x00000036fca1e32a in __libc_start_main () from /lib64/libc.so.6
ASTERISK-8 0x0000000000419369 in pthread_attr_setdetachstate ()
ASTERISK-9 0x00007fff16af58a8 in ?? ()
ASTERISK-10 0x000000000000001c in ?? ()
ASTERISK-11 0x0000000000000004 in ?? ()
ASTERISK-12 0x00007fff16af6f61 in ?? ()
ASTERISK-13 0x00007fff16af6f74 in ?? ()
ASTERISK-14 0x00007fff16af6f77 in ?? ()
ASTERISK-15 0x00007fff16af6f7d in ?? ()
ASTERISK-16 0x0000000000000000 in ?? ()
By: Steve Murphy (murf) 2008-12-08 12:31:14.000-0600

OK. I installed ubuntu 8.10-server on my 64-bit test machine, and then proved all I needed to do was replace the stupid ethernet cable.

But I reproduced this. It seems only a bug that will bite on a 64-bit machine (so far), but it may be just an artifact of the way the machines handle things.

I also notice that it sometimes you'll crash on startup, sometimes on an ael reload, and sometimes, not at all. When it crashes, it's choking on a bad pointer to the args of the Dial command on line 27 (ext. 100).

I've also seen it die at other points.

I also note that if you start trunk with 'asterisk -cgvvvvvvvvvvvvvvvvW' instead of just 'asterisk -cgvvvvvvvvvvvvvvvv', that will prevent the crash for some very strange reason.

Valgrind doesn't find the problem. Instead, it is also useful in preventing the crash.

Using the debugger doesn't seem to prevent the crash tho.

And, at some point, the args to that dial command disappear in the pval tree...
or maybe, just the pointer to them.

This has very much the 'feel' of something walking over freed memory. That's why I tried valgrind...

more to come.

By: Steve Murphy (murf) 2008-12-09 08:18:50.000-0600

I've attached 14019.diff; please apply it to trunk, and verify that it prevents crashes.

By: Christopher K. Johnson (ckjohnsonme) 2008-12-09 09:03:14.000-0600

I applied the patch, make, make install, and now starting asterisk succeeds quietly, and reloading ael produces warnings and errors, but informs me of the reload failure politely without segfault.

After I fixed the errors in extensions.ael so it would load, I reloaded ael 10 times without any segfault.

Thanks for the prompt resolution. I'm not familiar with the code, but from your patch it looks like some nasty memory stomping was precipitated by my configuration errors.
By: Digium Subversion (svnbot) 2008-12-09 10:32:08.000-0600

Repository: asterisk
Revision: 162013

U branches/1.4/include/asterisk/ael_structs.h
U branches/1.4/pbx/ael/ael.flex
U branches/1.4/pbx/ael/ael_lex.c
U branches/1.4/pbx/pbx_ael.c

------------------------------------------------------------------------
r162013 | murf | 2008-12-09 10:32:08 -0600 (Tue, 09 Dec 2008) | 45 lines

(closes issue ASTERISK-13165)
Reported by: ckjohnsonme
Patches:
14019.diff uploaded by murf (license 17)
Tested by: ckjohnsonme, murf

This crash was the result of a few small errors that
would combine in 64-bit land to result in a crash.

32-bit land might have seen these combine to mysteriously
drop the args to an application call, in certain
circumstances.

Also, in trying to find this bug, I spotted
a situation in the flex input, where, in passing
back a 'word' to the parser, it would allocate
a buffer larger than necessary. I changed the
usage in such situations, so that strdup was
not used, but rather, an ast_malloc, followed
by ast_copy_string.

I removed a field from the pval struct, in
u2, that was never getting used, and set in
one spot in the code. I believe it was an
artifact of a previous fix to make switch
cases work invisibly with extens.

And, for goto's I removed a '!' from
before a strcmp, that has been there
since the initial merging of AEL2, that
might prevent the proper target of a
goto from being found. This was pretty
harmless on its own, as it would just
louse up a consistency check for users.

Many thanks to ckjohnsonme for providing
a simplified and complete set of information
about the bug, that helped considerably in
finding and fixing the problem.

Now, to get aelparse up and running again
in trunk, and out of its "horribly broken" state,
so I can run the regression suite!

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=162013