[Home]

Summary:ASTERISK-08472: random clicking of pages in GUI produces a crash in manager
Reporter:zandbelt (zandbelt)Labels:
Date Opened:2007-01-03 13:02:55.000-0600Date Closed:2007-01-29 16:53:50.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/ManagerInterface
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk-svn-52502-manager-locks.patch
Description:A segmentation fault occurs randomly after some clicking through the
different screens; GDB stack trace see below.
Running Asterisk 1.4.0 and asterisk-gui svn revision 204 (on mipsel-linux).

****** ADDITIONAL INFORMATION ******

#0  0x004cbe60 in ast_dynamic_str_thread_build_va (buf=0x10071954,
max_len=0, ts=0x0, append=1, fmt=0x518768 " destroy, but can't do it
yet!\n",
    ap=0x7d3f7e74) at utils.c:971
971             if ((res + offset + 1) > (*buf)->len && (max_len ?
((*buf)->len < max_len) : 1)) {
(gdb) where
#0  0x004cbe60 in ast_dynamic_str_thread_build_va (buf=0x10071954,
max_len=0, ts=0x0, append=1, fmt=0x518768 " destroy, but can't do it
yet!\n",
    ap=0x7d3f7e74) at utils.c:971
#1  0x004cbf64 in ast_dynamic_str_append (buf=0x10071954, max_len=0,
fmt=0x518768 " destroy, but can't do it yet!\n") at threadstorage.h:411
#2  0x0047482c in astman_append (s=0x100e85d0, fmt=0x5179b0
"SrcFilename") at manager.c:423
#3  0x00476654 in action_getconfig (s=0x10071910, m=0x7d3f8a80) at
manager.c:1011
#4  0x0047f1ac in process_message (s=0x10071910, m=0x7d3f8a80) at
manager.c:1982
ASTERISK-1  0x00481640 in generic_http_callback (format=0, requestor=0x510000,
uri=0x0, params=0x100e8b28, status=0x7d3ffc98, title=0x20c6,
    contentlength=0x7d3ffca0) at manager.c:2469
ASTERISK-2  0x00482598 in rawman_http_callback (requestor=0x0, uri=0x40 <Address
0x40 out of bounds>, params=0x20c6, status=0x100e8614, title=0x40,
    contentlength=0x20c6) at manager.c:2552
ASTERISK-3  0x00468bc0 in ast_httpd_helper_thread (data=0x10093e28) at http.c:339
ASTERISK-4  0x004cb4e0 in dummy_start (data=0x40) at utils.c:545
ASTERISK-5  0x2ab354f8 in ?? ()


...<snip>...

(gdb) print *buf
$1 = (struct ast_dynamic_str *) 0x0
Comments:By: Serge Vecher (serge-v) 2007-01-03 13:10:23.000-0600

I think this is actually a seg-fault in Asterisk-proper that was fixed after 1.4.0 release. Can you please double-check by downloading the latest 1.4 code from svn?

By: Russell Bryant (russell) 2007-01-03 13:26:13.000-0600

Is this something you can recreate?

By: zandbelt (zandbelt) 2007-01-03 13:37:32.000-0600

Yes, three times in a row, clicking randomly through the pages.
I'm not able to reproduce it with a fixed set of consecutive clicks.
(it takes up to 5 mins. of fairly randomly clicking...)

I was also able to reproduce it with asterisk-gui svn rev 207 (still on
asterisk release 1.4.0):

#0  0x004cbe60 in ast_dynamic_str_thread_build_va (buf=0x100729ec, max_len=0, ts=0x0, append=1,
   fmt=0x518768 " destroy, but can't do it yet!\n", ap=0x7cdf7e74) at utils.c:971
971             if ((res + offset + 1) > (*buf)->len && (max_len ? ((*buf)->len < max_len) : 1)) {
(gdb) bt
#0  0x004cbe60 in ast_dynamic_str_thread_build_va (buf=0x100729ec, max_len=0, ts=0x0, append=1,
   fmt=0x518768 " destroy, but can't do it yet!\n", ap=0x7cdf7e74) at utils.c:971
#1  0x004cbf64 in ast_dynamic_str_append (buf=0x100729ec, max_len=0, fmt=0x518768 " destroy, but can't do it yet!\n")
   at threadstorage.h:411
#2  0x0047482c in astman_append (s=0x101d94b8, fmt=0x5179b0 "SrcFilename") at manager.c:423
#3  0x00476654 in action_getconfig (s=0x100729a8, m=0x7cdf8a80) at manager.c:1011
#4  0x0047f1ac in process_message (s=0x100729a8, m=0x7cdf8a80) at manager.c:1982
ASTERISK-1  0x00481640 in generic_http_callback (format=0, requestor=0x510000, uri=0x0, params=0x101ddb48, status=0x7cdffc98,
   title=0x1842, contentlength=0x7cdffca0) at manager.c:2469
ASTERISK-2  0x00482598 in rawman_http_callback (requestor=0x0, uri=0x3c <Address 0x3c out of bounds>, params=0x1842,
   status=0x101d94f8, title=0x3c, contentlength=0x1842) at manager.c:2552
ASTERISK-3  0x00468bc0 in ast_httpd_helper_thread (data=0x101e24e0) at http.c:339
ASTERISK-4  0x004cb4e0 in dummy_start (data=0x3c) at utils.c:545
ASTERISK-5  0x2ab354f8 in ?? ()
... <snip> ...
(gdb) print buf
$1 = (struct ast_dynamic_str **) 0x100729ec
(gdb) print *buf
$2 = (struct ast_dynamic_str *) 0x0
(gdb)



By: Anthony LaMantia (alamantia) 2007-01-03 16:11:47.000-0600

zandbelt, just a quick question, do you use #include directives inside any of your configuration files?

if you you may want to test this branch of asterisk.
http://svn.digium.com/view/asterisk/team/anthonyl/8678-commentbuffer/

which solves a problem with the get_config processing via the manager api when processing #include directives.

from the backtrace you have posted i am unsure if this is the source of the problem you are having but it would be nice to know if this solves the issue that you are encountering.

By: zandbelt (zandbelt) 2007-01-04 02:08:03.000-0600

No, I don't use #includes:

root@OpenWrt:/mnt/disc0_1/etc/asterisk# grep "#include" *
extensions.conf:; You can include other config files, use the #include command
extensions.conf:; that includes contexts within other contexts. The #include command works
extensions.conf:;#include "filename.conf"

Do you still want me to try the latest SVN version? I'll take some more time to try that (since I have to build OpenWRT packages from them first); if you don't strongly expect that it has already solved that bug, then I would rather not make this effort...

By: zandbelt (zandbelt) 2007-01-06 12:37:53.000-0600

FYI: I have tried 3 different recent asterisk SVN versions (49463 49667 49765) but apparently these do not work with the current asterisk-gui SVN versions (204, 207, 216). These asterisk-gui SVN versions do work with the asterisk 1.4.0 release, but that one contains the crash.
So am currently not able to test against a recent asterisk SVN release...

By: Jason Parker (jparker) 2007-01-09 13:59:45.000-0600

Please try svn branch 1.4, NOT svn trunk.

By: zandbelt (zandbelt) 2007-01-09 14:59:43.000-0600

Thanks for the suggestion; the asterisk-gui works now but asterisk still shows the same behaviour: a crash with stack trace below. I used now:
 asterisk-gui SVN rev 220,
 asterisk     SVN branch 1.4 version 50212

The asterisk console mentions:
 SVN-branch-1.4-r50186M
The asterisk-gui sysinfo mentions:
 Asterisk GUI-version Revision: 205 $

#0  0x004ccd90 in ast_dynamic_str_thread_build_va (buf=0x1007231c, max_len=0, ts=0x0, append=1, fmt=0x519698 "%s",
   ap=0x7cffcd0c) at utils.c:974
974             if ((res + offset + 1) > (*buf)->len && (max_len ? ((*buf)->len < max_len) : 1)) {
(gdb) bt
#0  0x004ccd90 in ast_dynamic_str_thread_build_va (buf=0x1007231c, max_len=0, ts=0x0, append=1, fmt=0x519698 "%s",
   ap=0x7cffcd0c) at utils.c:974
#1  0x004cce94 in ast_dynamic_str_append (buf=0x1007231c, max_len=0, fmt=0x519698 "%s") at threadstorage.h:477
#2  0x00475dac in astman_append (s=0x1018c0b8, fmt=0x518f48 "\r\n") at manager.c:423
#3  0x00477c3c in action_getconfig (s=0x100722d8, m=0x7cffd940) at manager.c:1015
#4  0x0048067c in process_message (s=0x100722d8, m=0x7cffd940) at manager.c:1984
ASTERISK-1  0x00482b44 in generic_http_callback (format=0, requestor=0x7cffd940, uri=0x0, params=0x10185da0, status=0x7cfffc98,
   title=0x15, contentlength=0x7cfffca0) at manager.c:2470
ASTERISK-2  0x00483b80 in rawman_http_callback (requestor=0x0, uri=0x2 <Address 0x2 out of bounds>, params=0x15, status=0x1018c0bd,
   title=0x2, contentlength=0x15) at manager.c:2564
ASTERISK-3  0x0046a064 in ast_httpd_helper_thread (data=0x1014c810) at http.c:354
ASTERISK-4  0x004cc410 in dummy_start (data=0x2) at utils.c:545
ASTERISK-5  0x2ab354f8 in ?? ()
...<snip>...
(gdb) print *buf
$1 = (struct ast_dynamic_str *) 0x0

By: Jason Parker (jparker) 2007-01-09 15:03:18.000-0600

What modifications have you made to the source?

By: zandbelt (zandbelt) 2007-01-09 15:14:58.000-0600

I only modified the original SVN tree with the patches below, to be able to compile and run it on the mipsel platform with the OpenWRT SDK. These mods seem rather harmless and unrelated to me.

--- asterisk-svn-50212.org/main/Makefile        2007-01-09 21:07:24.000000000 +0100
+++ asterisk-svn-50212/main/Makefile    2007-01-09 21:15:19.000000000 +0100
@@ -95,7 +95,7 @@
endif

editline/libedit.a:
-       cd editline && test -f config.h || CFLAGS="$(ASTCFLAGS:-Werror=)" ./configure --build=$(BUILD_PLATFORM) --host=$(HOST_PLATFORM)
+       cd editline && test -f config.h || CFLAGS="$(ASTCFLAGS:-Werror=)" LDFLAGS="$(ASTLDFLAGS)" ./configure --build=$(BUILD_PLATFORM) --host=$(HOST_PLATFORM)
       $(MAKE) -C editline libedit.a

db1-ast/libdb1.a:
@@ -135,7 +135,7 @@
       @rm -f $(ASTTOPDIR)/include/asterisk/build.h.tmp
       @$(CC) -c -o buildinfo.o $(ASTCFLAGS) buildinfo.c
       $(ECHO_PREFIX) echo "   [LD] $^ -> $@"
-       $(CMD_PREFIX) $(CXX) $(STATIC_BUILD) -o $@ $(ASTLINK) $(AST_EMBED_LDFLAGS) $(ASTLDFLAGS) $(H323LDFLAGS) $^ buildinfo.o $(AST_LIBS) $(AST_EMBED_LIBS) $(H323LDLIBS)
+       $(CMD_PREFIX) $(CC) $(STATIC_BUILD) -o $@ $(ASTLINK) $(AST_EMBED_LDFLAGS) $(ASTLDFLAGS) $(H323LDFLAGS) $^ buildinfo.o $(AST_LIBS) $(AST_EMBED_LIBS) $(H323LDLIBS)
       @$(ASTTOPDIR)/build_tools/strip_nonapi $@

clean::
--- asterisk-svn-50212.org/channels/chan_iax2.c 2007-01-09 21:07:11.000000000 +0100
+++ asterisk-svn-50212/channels/chan_iax2.c     2007-01-09 21:18:03.000000000 +0100
@@ -1342,7 +1342,7 @@
               last++;
       else
               last = s;
-       snprintf(s2, strlen(s) + 100, "/var/tmp/%s-%ld", last, (unsigned long)ast_random());
+       snprintf(s2, strlen(s) + 100, "/tmp/%s-%ld", last, (unsigned long)ast_random());
       res = stat(s, &stbuf);
       if (res < 0) {
               ast_log(LOG_WARNING, "Failed to stat '%s': %s\n", s, strerror(errno));

By: zandbelt (zandbelt) 2007-01-10 08:43:16.000-0600

I did some "printf debugging" and it seems that the outputstr member variable in the manager session struct is causing the crash because it is being accessed in two different threads at the same time. It looks like one http request is not finished yet while the other already starts and finishes, thereby destroying the outputstr member. This type of bugs is actually likely to show up on the mipsel platform that I run Asterisk on because it is considerably slower than "normal" asterisk boxes :-)
Below is the patch that I used to find the following printout in the normal case (where I click slowly through the web interface):

## checking outputstr in thread: 112661 (0x1008fdf8)
## appending outputstr in thread: 112661
## return from appending outputstr in thread: 112661
## freeing outputstr in thread: 112661
## return from freeing outputstr in thread: 112661

When clicking faster the following is likely to occur:

## checking outputstr in thread: 113685 (0x10098068)
## appending outputstr in thread: 113685
## freeing outputstr in thread: 88084
## return from freeing outputstr in thread: 88084
Segmentation fault

So I guess its a matter of locking (part of) the session context; I tested a naive patch doing just that with brute coarse grained locking (see the commented out lock/unlock in the patch below) but that leads to deadlocks. So this seems a more fundamental issue to me also because according to the documentation in manager.c (see below) concurrent access should not be possible.

According to the documentation concurrent access in action callbacks would not happen since "s->busy" would be set (manager.c:737), but I can't find this variable in the session struct. I also noticed more calls to astman_send_* in process_message are not locked whereas according the the documentation of these calls, the caller should lock.

--- asterisk-svn-50212.org/main/manager.c       2007-01-09 21:07:24.000000000 +0100
+++ asterisk-svn-50212/main/manager.c   2007-01-10 15:32:31.000000000 +0100
@@ -417,10 +417,12 @@
       if (s->fd > -1)
               ast_carefulwrite(s->fd, buf->str, strlen(buf->str), s->writetimeout);
       else {
+               fprintf(stderr, " ## checking outputstr in thread: %lu (%p)\n", pthread_self(), s->outputstr);
               if (!s->outputstr && !(s->outputstr = ast_calloc(1, sizeof(*s->outputstr))))
                       return;
-
+               fprintf(stderr, " ## appending outputstr in thread: %lu \n", pthread_self());
               ast_dynamic_str_append(&s->outputstr, 0, "%s", buf->str);
+               fprintf(stderr, " ## return from appending outputstr in thread: %lu \n", pthread_self());
       }
}

@@ -654,6 +656,7 @@
       struct eventqent *eqe;
       if (s->fd > -1)
               close(s->fd);
+       fprintf(stderr, " ## free_session: free outputstr in thread: %lu \n", pthread_self());
       if (s->outputstr)
               free(s->outputstr);
       ast_mutex_destroy(&s->__lock);
@@ -2467,6 +2470,7 @@
                       m.hdrcount = x + 1;
               }

+               //ast_mutex_lock(&s->__lock);
               if (process_message(s, &m)) {
                       if (s->authenticated) {
                               if (option_verbose > 1) {
@@ -2513,9 +2517,12 @@
                       }
                       if (tmp != s->outputstr->str)
                               free(tmp);
+                       fprintf(stderr, " ## freeing outputstr in thread: %lu \n", pthread_self());
                       free(s->outputstr);
                       s->outputstr = NULL;
+                       fprintf(stderr, " ## return from freeing outputstr in thread: %lu \n", pthread_self());
               }
+               //ast_mutex_unlock(&s->__lock);
               /* Still okay because c would safely be pointing to workspace even
                  if retval failed to allocate above */
               if (format == FORMAT_XML) {

By: nite (nite) 2007-01-15 13:33:38.000-0600

I get the samething too

Core was generated by `asterisk -g'.
Program terminated with signal 11, Segmentation fault.
#0  0x0810cbd2 in ast_dynamic_str_thread_build_va (buf=0x821ded8, max_len=0, ts=0x0, append=1,
   fmt=0x814f4f4 "%s", ap=0xb612192c ",\022&\b\036&#65533;\024\bh\031\022&#65533;\001") at utils.c:969
969             res = vsnprintf((*buf)->str + offset, (*buf)->len - offset, fmt, ap);
(gdb) bt
#0  0x0810cbd2 in ast_dynamic_str_thread_build_va (buf=0x821ded8, max_len=0, ts=0x0, append=1,
   fmt=0x814f4f4 "%s", ap=0xb612192c ",\022&\b\036&#65533;\024\bh\031\022&#65533;\001") at utils.c:969
#1  0x0810ab35 in ast_dynamic_str_append (buf=0x821ded8, max_len=0, fmt=0x814f4f4 "%s")
   at /usr/src/branch/asterisk-1.4/include/asterisk/threadstorage.h:477
#2  0x080b6d55 in astman_append (s=0x821ddf0, fmt=0x814fd1e "Line-%06d-%06d: %s=%s\r\n") at manager.c:423
#3  0x080b8d70 in action_getconfig (s=0x821ddf0, m=0xb6121db0) at manager.c:1011
#4  0x080bd4de in process_message (s=0x821ddf0, m=0xb6121db0) at manager.c:1984
ASTERISK-1  0x080bf1b4 in generic_http_callback (format=0, requestor=0x8262450, uri=0xb6123314 "", params=0x825f208,
   status=0xb61222cc, title=0xb61222d0, contentlength=0xb61222c8) at manager.c:2469
ASTERISK-2  0x080bf8fd in rawman_http_callback (requestor=0x8262450, uri=0xb6123314 "", params=0x825f208,
   status=0xb61222cc, title=0xb61222d0, contentlength=0xb61222c8) at manager.c:2563
ASTERISK-3  0x080a9e03 in handle_uri (sin=0x8262450, uri=0xb6123314 "", status=0xb61222cc, title=0xb61222d0,
   contentlength=0xb61222c8, cookies=0xb61222d4) at http.c:354
ASTERISK-4  0x080aa373 in ast_httpd_helper_thread (data=0x8262448) at http.c:464
ASTERISK-5  0x0810bdb1 in dummy_start (data=0x823a240) at utils.c:545
ASTERISK-6 0xb7f08504 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
ASTERISK-7 0xb7d3851e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb)


Asterisk SVN-branch-1.4-r50921
[*] 1.  DEBUG_CHANNEL_LOCKS
[*] 2.  DEBUG_SCHEDULER
[*] 3.  DEBUG_THREADS
[*] 4.  DEBUG_THREADLOCALS
[*] 7.  DONT_OPTIMIZE

By: Anthony LaMantia (alamantia) 2007-01-17 18:46:03.000-0600

nite, what kind of platform are you running asterisk on? is it a standard x86 pc? or a more like the wrt554g zandbelt is using?

By: Russell Bryant (russell) 2007-01-23 16:13:52.000-0600

This should be resolved in the 1.4 branch and trunk in revisions 51781 and 51787. Thanks for the thorough report and all of the debugging help!

By: zandbelt (zandbelt) 2007-01-29 04:43:32.000-0600

The issue remains. I don't think that the provided solution of locking the astman_send_* calls is enough; that was just an inconsistency that I noticed. As far as I can see, the real issue is outside of the process_message call, in the generic_http_callback method where outputstr is also accessed; see my previous (outcommented) attempt to apply coarse grained locks.

Unfortunately revision 51750 in between introduced a deadlock that complicates matters. I'll attach a patch that removes the deadlock, adds more locking in places that have been forgotten, but does not resolve this issue. (however this patch can be used to reproduce the issue)



By: Russell Bryant (russell) 2007-01-29 13:15:09.000-0600

I'm sorry I broke this.  I'll work on getting it cleaned up now.  Thanks for the update!

By: Russell Bryant (russell) 2007-01-29 13:37:19.000-0600

So I guess the deadlock you saw was because of the added lock and unlock of the actionlock around action handling.  Can you explain the cause of the deadlock here?

I'm on IRC most of the day if you want to find me on there for live discussion.

By: Russell Bryant (russell) 2007-01-29 13:39:29.000-0600

Also, I can't seem to get this crash to happen on my system.  Would it be possible to give me access to the system where you can make this happen to make it easier for me to get it fixed?

By: Russell Bryant (russell) 2007-01-29 14:08:19.000-0600

Nevermind, I understand the deadlock now ...

By: Russell Bryant (russell) 2007-01-29 14:54:32.000-0600

The deadlock has been fixed in 1.4 in revision 52611, and the trunk in revision 52613.

As for the crash, it would be nice if you could test again with the current code.    However, per our discussion, I do think it is still possible for there to be a problem.  But, if there was, I would think it would be more likely incomplete output as opposed to a crash as it is now.

By: zandbelt (zandbelt) 2007-01-29 15:50:09.000-0600

I tested r52643 and I'm not able to reproduce the crash; looking at the code I'm also pretty confident that the patch solves the crash since outputstr is well-locked now. However, I also believe that messed-up output is just as likely to occur, although I have no evidence for that so far from my experiments.

Some additional remarks:

- at line 2134 I spotted a call to the (modified) astman_append function that itself still locks the session (whereas the function itself also does that). I would expect a deadlock on that although I haven't experienced it...

- the free_session function (line 659) does not lock before destroying outputstr; I guess this could still cause a potential crash.

- to me it seems that a session is created per connection (see accept_thread function) and not per-user like you indicated on IRC; however I don't see how different threads (= different sessions) could then access the same session context, so I must be missing something here...

By: Russell Bryant (russell) 2007-01-29 16:53:39.000-0600

I'll respond to each comment ...

1) I have modified that code not to do the extra lock/unlock of the session around the call to astman_append().  The reason it doesn't deadlock is because we use recursive locks in Asterisk, meaning we can lock them multiple times in the same thread without a deadlock.

2) There is no need to lock the session in destroy session.  This is because the code path before getting this far ensures that nobody is using and nobody will start using will it is being removed and destroyed from the sessions lsit.

3) Well, that may be the case for regular manager sessions, but it's different at least for HTTP based sessions.  generic_http_callback loops through the existing sessions matching by the session ID, and will use the same one if it is found to match.  So, they can definitely be used at the same time.

However, Pari, the GUI developer, told me that the GUI should never be sending a request before receiving the response for a previous one.  So, I'm just going to close this out.

Thanks for the help and let me know if you have any further problems.