[Home]

Summary:ASTERISK-11483: Asterisk crash when Paging
Reporter:Joel Vandal (jvandal)Labels:
Date Opened:2008-02-20 13:14:51.000-0600Date Closed:2008-02-28 14:10:20.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:PBX/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 12038.patch
( 1) 12038v2.patch
( 2) bt-12038.txt
( 3) gdb.txt
( 4) valgrind.txt
Description:We using latest Asterisk SVN 1.4 branches and when paging we got coredump.

****** ADDITIONAL INFORMATION ******

(gdb) bt
#0  0x00bd1e01 in __ast_pthread_mutex_lock (filename=0xc1bfbc "chan_sip.c", lineno=1900, func=0xc1c4e0 "retrans_pkt",
   mutex_name=0xc1c4ec "&pkt->owner->lock", t=0x0) at /dar/build/asterisk-1.4.18/include/asterisk/lock.h:302
#1  0x00bd0f06 in retrans_pkt (data=0x89413f8) at chan_sip.c:1900
#2  0x080f1a75 in ast_sched_runq ()
#3  0x00c0fec8 in do_monitor (data=0x0) at chan_sip.c:15629
#4  0x08101a1b in dummy_start ()
ASTERISK-1  0x0098c3cc in start_thread () from /lib/tls/libpthread.so.0
ASTERISK-2  0x008b51ae in clone () from /lib/tls/libc.so.6


(gdb) bt full
#0  0x00bd1e01 in __ast_pthread_mutex_lock (filename=0xc1bfbc "chan_sip.c", lineno=1900, func=0xc1c4e0 "retrans_pkt",
   mutex_name=0xc1c4ec "&pkt->owner->lock", t=0x0) at /dar/build/asterisk-1.4.18pre1/include/asterisk/lock.h:302
       res = 11682376
       canlog = 10018551
       __PRETTY_FUNCTION__ = "__ast_pthread_mutex_lock"
#1  0x00bd0f06 in retrans_pkt (data=0x89413f8) at chan_sip.c:1900
       pkt = (struct sip_pkt *) 0x89413f8
       prev = (struct sip_pkt *) 0x80f1a61
       cur = (struct sip_pkt *) 0x0
       reschedule = 1000
       xmitres = 0
       __PRETTY_FUNCTION__ = "retrans_pkt"
#2  0x080f1a75 in ast_sched_runq ()
No symbol table info available.
#3  0x00c0fec8 in do_monitor (data=0x0) at chan_sip.c:15629
       res = 0
       sip = (struct sip_pvt *) 0x0
       peer = (struct sip_peer *) 0x0
       t = 1203534360
       fastrestart = 0
       lastpeernum = -1
       curpeernum = 20
       reloading = 0
       __PRETTY_FUNCTION__ = "do_monitor"
#4  0x08101a1b in dummy_start ()
No symbol table info available.
ASTERISK-1  0x0098c3cc in start_thread () from /lib/tls/libpthread.so.0
No symbol table info available.
ASTERISK-2  0x008b51ae in clone () from /lib/tls/libc.so.6
No symbol table info available.
Comments:By: Tilghman Lesher (tilghman) 2008-02-20 14:17:49.000-0600

Do you have any unusual messages in your log just before the crash?

By: Joel Vandal (jvandal) 2008-02-20 14:44:27.000-0600

No weird messages on console, except that we see devicestate notification from paged extensions from 'InUse' to 'Idle', what look "normal" IMHO.

To reproduce the problem we had to execute the Page(channels) application multiple times.

Without the DEBUG_THREADS option, asterisk crash at each  ~20-30 attempts and backtrace look a little bit differents, we got an error on app_dial.c:172 when he execute the ast_dial_join(dial) option.

We have recompile Asterisk/Asterisk-addons with DEBUG_THREADS/DONT_OPTIMIZE and now get the BT initially posted but must try ~80-100 attempts before it crash.

By: Tilghman Lesher (tilghman) 2008-02-20 14:52:40.000-0600

What I'm seeing from the crash is that this is occurring when the mutex is set to NULL, which is not something that we ever do, even when we destroy a mutex.  So I'm inclined to say that this is memory corruption.

Please run under valgrind, as documented in doc/valgrind.txt.

By: Joel Vandal (jvandal) 2008-02-20 15:27:32.000-0600

Hmm, 1st time I use Valgrind (3.1.1) and when I start Asterisk with :

valgrind --log-file-exactly=valgrind.txt asterisk -vvvvvvcg 2>malloc_debug.txt


But I got this error :

res_adsi.so => (ADSI Resource)
 == Parsing '/etc/asterisk/manager.conf': Found
--24902:0:aspacem  Valgrind: FATAL: aspacem assertion failed:
--24902:0:aspacem    aspacem_minAddr <= holeStart
--24902:0:aspacem    at m_aspacemgr/aspacemgr.c:1991 (vgPlain_am_get_advisory)
--24902:0:aspacem  Exiting now.


So I have upgrade Valgrind from 3.1.1 to 3.3.0 and now asterisk/valgrind start... Now I must reproduce the problem and send logs =)

By: Tilghman Lesher (tilghman) 2008-02-20 23:20:00.000-0600

Well, valgrind may actually prevent the coredump, but the output it produces will still be useful.

By: Joel Vandal (jvandal) 2008-02-21 07:58:05.000-0600

This trace from valgrind look similar to the trace we get when running * without DEBUG_THREADS options...

==27187== Invalid read of size 1
==27187==    at 0x80999C3: __ast_pthread_mutex_lock (in /usr/sbin/asterisk)
==27187==    by 0x809AFD2: ast_dial_join (in /usr/sbin/asterisk)
==27187==    by 0x4AC0344: ??? (app_page.c:172)
==27187==    by 0x80C2384: pbx_exec (in /usr/sbin/asterisk)
==27187==    by 0x80C570A: pbx_extension_helper (in /usr/sbin/asterisk)
==27187==    by 0x80C67AA: ast_spawn_extension (in /usr/sbin/asterisk)
==27187==    by 0x80C6C7C: __ast_pbx_run (in /usr/sbin/asterisk)
==27187==    by 0x80C7882: pbx_thread (in /usr/sbin/asterisk)
==27187==    by 0x8101A1A: dummy_start (in /usr/sbin/asterisk)
==27187==    by 0x98C3CB: start_thread (in /lib/tls/libpthread-2.3.4.so)
==27187==    by 0x8B51AD: clone (in /lib/tls/libc-2.3.4.so)
==27187==  Address 0x90 is not stack'd, malloc'd or (recently) free'd

By: Tilghman Lesher (tilghman) 2008-02-21 14:25:59.000-0600

I'm actually more interested in seeing a error of the type "Invalid write", but please upload the entire file to the file upload area.

By: Joel Vandal (jvandal) 2008-02-21 14:45:20.000-0600

Hmm, I have send the whole valdrind.txt file generated with :

valgrind --log-file=valgrind.txt asterisk -vvvvcg 2>malloc_debug.txt

Please note that on Valgrind 3.3.0+, the --log-file-exactly has been replaced by --log-file, possible that we need to add more cmdline args.

By: Joel Vandal (jvandal) 2008-02-21 14:49:57.000-0600

I have re-read the doc/valgrind.txt file and must recompile Asterisk using MALLOC_DEBUG, doing it right now and resend new dump.

By: Joel Vandal (jvandal) 2008-02-25 08:55:41.000-0600

I have recompile Asterisk w/ DEBUG_THREADS and MALLOC_DEBUG but the valgrind.txt are very similar, I dont see any "write" error on the file and filesize is only ~17k.

Perhaps, I have upload a new backtrace (bt, bt full, thread bt all) that contain more informations.

Each time we Page, server crash (tested multiple server/arch/distro).

P.S. you will see "asterisk-1.4.19pre1" on debugfile, this is only a private tag on my test server, in fact this is latest "svn branches/1.4"

By: Joel Vandal (jvandal) 2008-02-25 10:38:13.000-0600

Trying to debug, I'm not an expert but theses values look invalid, someone can correct me ?  


meetmeopts = "1795225903d|Atqxd\000|mqxdw(5)\000@ÂÁ\tðNw\001P\000\000\000§,\025\b\\.w\001call,all", '\0' <repeats 31 times>

originator = "SIP/6003\000b7dc8e10\000\026\b!>\024\b\230-w\001(\234\v\b´ÄÁ\tDÂÁ\t\001\000\000\000\000\000\000\000 øµ·\000\000\000\000H.w\001×\r\f\b³;\024\b¤\t\000\000îT\024\býA\024\b"
       flags = {flags = 0}

By: Dmitry Andrianov (dimas) 2008-02-25 11:29:08.000-0600

These are valid actually.
It is just a full content of a char array. Since strings are zero terminated, anything after \000 does not matter - it can be any kind of garbage. So basically you have

meetmeopts = "1795225903d|Atqxd"
originator = "SIP/6003"

By: Mark Michelson (mmichelson) 2008-02-25 15:38:48.000-0600

I suspect that the problem here is due to a race condition involving the linked list used in the ast_dial structure.

I have uploaded 12038.patch which adds locking around list operations. I tested briefly but I would like further testing to be done before I commit a fix. Thanks.

By: Joel Vandal (jvandal) 2008-02-26 12:58:21.000-0600

Patch applied to latest SVN branches/1.4 and still have segfault when paging. Backtrace (gdb.txt) is same as previous.

By: Joel Vandal (jvandal) 2008-02-27 08:53:05.000-0600

I have post new backtrace (gdb) using latest branches/1.4 from SVN (104502)

Will try to post valgrind output ASAP.

By: Mark Michelson (mmichelson) 2008-02-27 10:18:19.000-0600

One interesting piece of information in the latest backtrace which was not present in the first is that the ast_channel structure on which the lock is attempted is NULL. This would point to a different race condition, one where the channel is hung up but we still think the channel is running an application. There's a really simple solution to this problem, check to see if the channel is NULL before attempting to lock it. The problem  is that I'm not sure yet if this is a symptom of a deeper problem or not. I will look more closely and try to get a patch uploaded as soon as I can.

By: Mark Michelson (mmichelson) 2008-02-27 11:02:02.000-0600

New patch uploaded (12038v2.patch).

This one adds the list locking (since I still feel it could cause a problem) and the check for a NULL channel in ast_dial_join. Please test to be sure this fixes the issue.

Thanks for your patience on this one!

By: Joel Vandal (jvandal) 2008-02-27 11:50:30.000-0600

Congratulation, this new patch (12038v2) has definitively fixed our crash problem =) Thanks all !

By: Digium Subversion (svnbot) 2008-02-27 15:45:49.000-0600

Repository: asterisk
Revision: 104841

U   branches/1.4/main/dial.c

------------------------------------------------------------------------
r104841 | mmichelson | 2008-02-27 15:45:47 -0600 (Wed, 27 Feb 2008) | 17 lines

Two fixes:

1. Make the list of ast_dial_channels a lockable list. This is because in some cases,
  the ast_dial may exist in multiple threads due to asynchronous execution of its application, and
  I found some cases where race conditions could exist.

2. Check in ast_dial_join to be sure that the channel still exists before attempting to lock it, since
  it could have gotten hung up but the is_running_app flag on the ast_dial_channel may not have been
  cleared yet.

(closes issue ASTERISK-11483)
Reported by: jvandal
Patches:
     12038v2.patch uploaded by putnopvut (license 60)
Tested by: jvandal


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=104841

By: Digium Subversion (svnbot) 2008-02-28 14:10:20.000-0600

Repository: asterisk
Revision: 105060

_U  trunk/
U   trunk/main/dial.c

------------------------------------------------------------------------
r105060 | mmichelson | 2008-02-28 14:10:19 -0600 (Thu, 28 Feb 2008) | 25 lines

Merged revisions 104841 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r104841 | mmichelson | 2008-02-27 15:49:20 -0600 (Wed, 27 Feb 2008) | 17 lines

Two fixes:

1. Make the list of ast_dial_channels a lockable list. This is because in some cases,
  the ast_dial may exist in multiple threads due to asynchronous execution of its application, and
  I found some cases where race conditions could exist.

2. Check in ast_dial_join to be sure that the channel still exists before attempting to lock it, since
  it could have gotten hung up but the is_running_app flag on the ast_dial_channel may not have been
  cleared yet.

(closes issue ASTERISK-11483)
Reported by: jvandal
Patches:
     12038v2.patch uploaded by putnopvut (license 60)
Tested by: jvandal


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105060