Summary: | ASTERISK-13165: Starting or restarting asterisk causes seg fault and core dump, apparently in ael/pval.c:4833 | ||
Reporter: | Christopher K. Johnson (ckjohnsonme) | Labels: | |
Date Opened: | 2008-12-03 20:11:54.000-0600 | Date Closed: | 2009-02-19 12:00:21.000-0600 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | PBX/pbx_ael |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) 14019.diff ( 1) extensions.ael | |
Description: | [root@freedom asterisk]# asterisk -V Asterisk SVN-trunk-r160791 [root@freedom asterisk]# service asterisk start Starting asterisk: [ OK ] [root@freedom asterisk]# service asterisk restart Shutting down asterisk: Asterisk ended with exit status 0 Asterisk shutdown normally. [ OK ] Starting asterisk: [ OK ] [root@freedom asterisk]# /usr/sbin/safe_asterisk: line 138: 11599 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY} Asterisk ended with exit status 139 Asterisk exited on signal 11. Automatically restarting Asterisk. mpg123: no process killed /usr/sbin/safe_asterisk: line 138: 11643 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY} Asterisk ended with exit status 139 Asterisk exited on signal 11. Automatically restarting Asterisk. mpg123: no process killed /usr/sbin/safe_asterisk: line 138: 11688 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY} Asterisk ended with exit status 139 Asterisk exited on signal 11. Automatically restarting Asterisk. mpg123: no process killed /usr/sbin/safe_asterisk: line 138: 11732 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY} Asterisk ended with exit status 139 Asterisk exited on signal 11. Automatically restarting Asterisk. mpg123: no process killed /usr/sbin/safe_asterisk: line 138: 11775 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY} Asterisk ended with exit status 139 Asterisk exited on signal 11. Automatically restarting Asterisk. mpg123: no process killed The problem occurs frequently but not reliably. Starting asterisk service usually causes one or more segfaults before safe_asterisk successfully starts asterisk. ****** ADDITIONAL INFORMATION ****** gdb /usr/sbin/asterisk -core /tmp/core.freedom.somerville.dirigo.net-2008-12-03T20:49:24-0500 ... #0 destroy_pval (item=0x7fdd00000000) at ael/pval.c:4833 4833 nxt = i->next; System is Linux Fedora release 9 (Sulphur) Kernel 2.6.27.5-41.fc9.x86_64 Hardware is multiprocessor - AMD Phenom(tm) 9850 Quad-Core Processor I had the same results with tag 1.6.1-beta2 and beta3. extensions.conf is an empty file. All extensions are defined in extensions.ael Once asterisk succeeds in starting, it operates correctly. | ||
Comments: | By: Christopher K. Johnson (ckjohnsonme) 2008-12-03 20:45:25.000-0600 From the asterisk console an 'ael reload' also usually, but not reliably, causes one or more segfaults, and asterisk restarting with each of course. Core dump shows same "at ael/pval.c:4833". 'dialplan reload' repeatedly does NOT cause segfault. Just 'ael reload'. By: Christopher K. Johnson (ckjohnsonme) 2008-12-03 21:47:15.000-0600 The attached extensions.ael has been pared down from my original configuration, but still results in the same segfault sometimes upon service asterisk restart or ael reload. The sample extensions.ael provided by 'make samples' does not seem to segfault. By: Steve Murphy (murf) 2008-12-04 13:58:21.000-0600 Very interesting. I'll look at it. You've done a good job giving me something to chew on. By: Steve Murphy (murf) 2008-12-04 17:14:40.000-0600 I've got to re-install linux on my 64-bit machine. It's seriously messed up. In the meantime, if you could give a complete stack trace from your machine, it might help me. The snippet of the gdb trace is not sufficient. By: Christopher K. Johnson (ckjohnsonme) 2008-12-04 21:43:50.000-0600 Here is one: #0 destroy_pval (item=0x7fb200000000) at ael/pval.c:4833 #1 0x000000000481614f in destroy_pval_item (item=0x7fb2a007a5b0) at ael/pval.c:4809 #2 0x0000000004816321 in destroy_pval (item=0x7fb200000000) at ael/pval.c:4835 #3 0x000000000481614f in destroy_pval_item (item=0x7fb2a007a720) at ael/pval.c:4809 #4 0x0000000004816321 in destroy_pval (item=0x7fb200000000) at ael/pval.c:4835 ASTERISK-1 0x000000000481614f in destroy_pval_item (item=0x7fb2a0079ae0) at ael/pval.c:4809 ASTERISK-2 0x0000000004816321 in destroy_pval (item=0x7fb200000000) at ael/pval.c:4835 ASTERISK-3 0x00007fb294ab92e9 in pbx_load_module () at pbx_ael.c:150 ASTERISK-4 0x00007fb294ab932d in handle_cli_ael_reload (e=0x7fb200000000, cmd=-55342744, a=0x25) at pbx_ael.c:206 ASTERISK-5 0x00000000004544b0 in ast_cli_command_full (uid=0, gid=0, fd=21, s=0x41f80ae0 "ael reload") at cli.c:2224 ASTERISK-6 0x00000000004546e3 in ast_cli_command_multiple_full (uid=0, gid=0, fd=21, size=<value optimized out>, s=0x41f80d20 "ael reload") at cli.c:2247 ASTERISK-7 0x000000000042f733 in netconsole (vconsole=<value optimized out>) at asterisk.c:1071 ASTERISK-8 0x00000000004f14ac in dummy_start (data=<value optimized out>) at utils.c:1024 ASTERISK-9 0x00000036fd60729a in start_thread () from /lib64/libpthread.so.0 ASTERISK-10 0x00000036fcae439d in clone () from /lib64/libc.so.6 And another: #0 destroy_pval (item=0x7fe900000000) at ael/pval.c:4833 #1 0x000000000318014f in destroy_pval_item (item=0x7fe9082b3df0) at ael/pval.c:4809 #2 0x0000000003180321 in destroy_pval (item=0x7fe900000000) at ael/pval.c:4835 #3 0x000000000318014f in destroy_pval_item (item=0x7fe9082b3f60) at ael/pval.c:4809 #4 0x0000000003180321 in destroy_pval (item=0x7fe900000000) at ael/pval.c:4835 ASTERISK-1 0x000000000318014f in destroy_pval_item (item=0x7fe9082b4d40) at ael/pval.c:4809 ASTERISK-2 0x0000000003180321 in destroy_pval (item=0x7fe900000000) at ael/pval.c:4835 ASTERISK-3 0x00007fe8fe36d2e9 in pbx_load_module () at pbx_ael.c:150 ASTERISK-4 0x00000000004906ba in load_resource (resource_name=0x1e9d430 "pbx_ael.so", global_symbols_only=<value optimized out>) at loader.c:694 ASTERISK-5 0x0000000000490eef in load_modules (preload_only=0) at loader.c:898 ASTERISK-6 0x0000000000431db5 in main (argc=<value optimized out>, argv=0x7fff16af58b8) at asterisk.c:3407 ASTERISK-7 0x00000036fca1e32a in __libc_start_main () from /lib64/libc.so.6 ASTERISK-8 0x0000000000419369 in pthread_attr_setdetachstate () ASTERISK-9 0x00007fff16af58a8 in ?? () ASTERISK-10 0x000000000000001c in ?? () ASTERISK-11 0x0000000000000004 in ?? () ASTERISK-12 0x00007fff16af6f61 in ?? () ASTERISK-13 0x00007fff16af6f74 in ?? () ASTERISK-14 0x00007fff16af6f77 in ?? () ASTERISK-15 0x00007fff16af6f7d in ?? () ASTERISK-16 0x0000000000000000 in ?? () By: Steve Murphy (murf) 2008-12-08 12:31:14.000-0600 OK. I installed ubuntu 8.10-server on my 64-bit test machine, and then proved all I needed to do was replace the stupid ethernet cable. But I reproduced this. It seems only a bug that will bite on a 64-bit machine (so far), but it may be just an artifact of the way the machines handle things. I also notice that it sometimes you'll crash on startup, sometimes on an ael reload, and sometimes, not at all. When it crashes, it's choking on a bad pointer to the args of the Dial command on line 27 (ext. 100). I've also seen it die at other points. I also note that if you start trunk with 'asterisk -cgvvvvvvvvvvvvvvvvW' instead of just 'asterisk -cgvvvvvvvvvvvvvvvv', that will prevent the crash for some very strange reason. Valgrind doesn't find the problem. Instead, it is also useful in preventing the crash. Using the debugger doesn't seem to prevent the crash tho. And, at some point, the args to that dial command disappear in the pval tree... or maybe, just the pointer to them. This has very much the 'feel' of something walking over freed memory. That's why I tried valgrind... more to come. By: Steve Murphy (murf) 2008-12-09 08:18:50.000-0600 I've attached 14019.diff; please apply it to trunk, and verify that it prevents crashes. By: Christopher K. Johnson (ckjohnsonme) 2008-12-09 09:03:14.000-0600 I applied the patch, make, make install, and now starting asterisk succeeds quietly, and reloading ael produces warnings and errors, but informs me of the reload failure politely without segfault. After I fixed the errors in extensions.ael so it would load, I reloaded ael 10 times without any segfault. Thanks for the prompt resolution. I'm not familiar with the code, but from your patch it looks like some nasty memory stomping was precipitated by my configuration errors. By: Digium Subversion (svnbot) 2008-12-09 10:32:08.000-0600 Repository: asterisk Revision: 162013 U branches/1.4/include/asterisk/ael_structs.h U branches/1.4/pbx/ael/ael.flex U branches/1.4/pbx/ael/ael_lex.c U branches/1.4/pbx/pbx_ael.c ------------------------------------------------------------------------ r162013 | murf | 2008-12-09 10:32:08 -0600 (Tue, 09 Dec 2008) | 45 lines (closes issue ASTERISK-13165) Reported by: ckjohnsonme Patches: 14019.diff uploaded by murf (license 17) Tested by: ckjohnsonme, murf This crash was the result of a few small errors that would combine in 64-bit land to result in a crash. 32-bit land might have seen these combine to mysteriously drop the args to an application call, in certain circumstances. Also, in trying to find this bug, I spotted a situation in the flex input, where, in passing back a 'word' to the parser, it would allocate a buffer larger than necessary. I changed the usage in such situations, so that strdup was not used, but rather, an ast_malloc, followed by ast_copy_string. I removed a field from the pval struct, in u2, that was never getting used, and set in one spot in the code. I believe it was an artifact of a previous fix to make switch cases work invisibly with extens. And, for goto's I removed a '!' from before a strcmp, that has been there since the initial merging of AEL2, that might prevent the proper target of a goto from being found. This was pretty harmless on its own, as it would just louse up a consistency check for users. Many thanks to ckjohnsonme for providing a simplified and complete set of information about the bug, that helped considerably in finding and fixing the problem. Now, to get aelparse up and running again in trunk, and out of its "horribly broken" state, so I can run the regression suite! ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=162013 |