[Home]

Summary:ASTERISK-11504: Two Asterisk crashes
Reporter:Norman Franke (norman)Labels:
Date Opened:2008-02-25 10:43:39.000-0600Date Closed:2008-03-06 11:50:39.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk_log_0222114202.short.gz
( 1) asterisk_log_0225092825.txt.gz
( 2) valgrind_0222114202.txt
( 3) valgrind_0225092825.txt
Description:I had two Asterisk crashes or lockups within 2 hours today. However, it ran all weekend without a problem. I was running under Valgrind, but the first crash apparently Valgrind can't track down. I had another like this one last week. Basically, Valgrind runs out of thread slots and one is corrupt:

Thread 18: status = VgTs_WaitSys
==20183==    at 0x4000792: (within /lib/ld-2.3.6.so)
==20183==    by 0x28ADD17: ???

(I believe this is while trying to lock a mutex.) I have compiled with MALLOC_DEBUG and those files are zero length. I have the verbose 3 output for these, but the first is 500 MB. I do have some previous (in the valgrind log) that I reported with ASTERISK-11408 that were not fixed. Those deal with ast_log accessing free'd memory in chan_sip.

The second log is more interesting. Lots of errors like this:

==7874== Thread 7:
==7874== Invalid read of size 4
==7874==    at 0x59D0671: ??? (chan_sip.c:4564)
==7874==    by 0x5A03A9A: ??? (chan_sip.c:15366)
==7874==    by 0x80AD830: ast_io_wait (io.c:279)
==7874==    by 0x5A0477C: ??? (chan_sip.c:15613)
==7874==    by 0x810003C: dummy_start (utils.c:852)
==7874==    by 0x403123F: start_thread (in /lib/tls/i686/cmov/libpthread-2.3.6.so)
==7874==    by 0x4FAF49D: clone (in /lib/tls/i686/cmov/libc-2.3.6.so)
==7874==  Address 0x719d4d0 is 136 bytes inside a block of size 5,620 free'd
==7874==    at 0x401D40C: free (vg_replace_malloc.c:323)
==7874==    by 0x8072362: __ast_free_region (astmm.c:174)
==7874==    by 0x80726EE: __ast_free (astmm.c:208)
==7874==    by 0x8081D5F: ast_channel_free (channel.c:1202)
==7874==    by 0x8082A22: ast_hangup (channel.c:1496)
==7874==    by 0x697BD42: ??? (app_dial.c:1726)
==7874==    by 0x697BF6D: ??? (app_dial.c:1760)
==7874==    by 0x80BE5EE: pbx_exec (pbx.c:532)
==7874==    by 0x80C19E7: pbx_extension_helper (pbx.c:1851)
==7874==    by 0x80C2AE1: ast_spawn_extension (pbx.c:2306)
==7874==    by 0x80C2FFD: __ast_pbx_run (pbx.c:2408)
==7874==    by 0x80C3DB2: pbx_thread (pbx.c:2623)

and

==7874== Invalid read of size 4
==7874==    at 0x4033197: pthread_mutex_trylock (in /lib/tls/i686/cmov/libpthread-2.3.6.so)
==7874==    by 0x59C4937: ??? (lock.h:706)
==7874==    by 0x5A04394: ??? (chan_sip.c:15540)
==7874==    by 0x810003C: dummy_start (utils.c:852)
==7874==    by 0x403123F: start_thread (in /lib/tls/i686/cmov/libpthread-2.3.6.so)
==7874==    by 0x4FAF49D: clone (in /lib/tls/i686/cmov/libc-2.3.6.so)
==7874==  Address 0x719d4bc is 116 bytes inside a block of size 5,620 free'd
==7874==    at 0x401D40C: free (vg_replace_malloc.c:323)
==7874==    by 0x8072362: __ast_free_region (astmm.c:174)
==7874==    by 0x80726EE: __ast_free (astmm.c:208)
==7874==    by 0x8081D5F: ast_channel_free (channel.c:1202)
==7874==    by 0x8082A22: ast_hangup (channel.c:1496)
==7874==    by 0x697BD42: ??? (app_dial.c:1726)
==7874==    by 0x697BF6D: ??? (app_dial.c:1760)
==7874==    by 0x80BE5EE: pbx_exec (pbx.c:532)
==7874==    by 0x80C19E7: pbx_extension_helper (pbx.c:1851)
==7874==    by 0x80C2AE1: ast_spawn_extension (pbx.c:2306)
==7874==    by 0x80C2FFD: __ast_pbx_run (pbx.c:2408)
==7874==    by 0x80C3DB2: pbx_thread (pbx.c:2623)

I'll upload these files.
Comments:By: Tilghman Lesher (tilghman) 2008-02-25 12:14:24.000-0600

Your first valgrind output shows that you're running debug output somewhere.  Could you please upload the corresponding debug log file?

By: Norman Franke (norman) 2008-02-25 12:39:22.000-0600

It's 500 MB, even gziped this system won't allow it. Do you have an anon FTP  site I could send it to? I could probably put it on a web site for download. Or I can edit it down to like the last 15 minutes or something.

The memory errors in the first part of the first file occurred very early on, BTW. Toward the end, it seemed to have got stuck on a lock and tons of channels were created that ended up waiting for a lock, which thus exceeded valgrind's thread capacity.



By: Tilghman Lesher (tilghman) 2008-02-25 12:43:02.000-0600

I don't need the whole log, just around the times (say, back to 10 minutes before) that the crash happened.

By: Norman Franke (norman) 2008-02-25 12:57:19.000-0600

I've uploaded the last 28 minutes of the log. Note that all of these errors listed before the too many threads error happened way, way earlier (Friday at the latest.) I'll also upload the much shorter logs from the second crash. In this one, something got locked and never came back.

By: Tilghman Lesher (tilghman) 2008-02-25 16:04:55.000-0600

Looks like you don't have the verbosity of DEBUG turned up, so this is going to be harder to debug.

Please increase the debug level, with "core set debug 5" and obtain another pair of valgrind and log output.  You may want to issues a "logger rotate" prior to this, to ensure we're only seeing the log from when valgrind starts.

By: Norman Franke (norman) 2008-02-25 16:45:33.000-0600

That first crash has only happened once in 3 weeks, so reproducing it is going to be nearly impossible.

How about the second one? Lots of chan_sip using memory after free'd issues.

By: Russell Bryant (russell) 2008-02-25 17:20:50.000-0600

In your valgrind output, I see a lot of invalid memory accesses that involve a module called "app_amd", which is not a module that is a part of Asterisk.  We do not support patched systems.

By: Russell Bryant (russell) 2008-02-25 17:23:44.000-0600

Oops, I meant "app_acd"

By: Norman Franke (norman) 2008-02-25 17:49:06.000-0600

russell- there are no memory issues with that module in looking at the valgrind output. There are a number of hung threads after something else bad happened, but it didn't flag any memory issues. As I noted in the description, this is related to ASTERISK-11408 and I did converse with Corydon76 about app_acd there, so this is known to him. He did offer a patch in that case, so I think a -4 was rather harsh.

By: Russell Bryant (russell) 2008-02-25 17:53:53.000-0600

We still do not support patched systems of any kind.  Sorry.

By: Norman Franke (norman) 2008-03-06 11:26:11.000-0600

Russell- after extensive searching, I found what appears to be a bug in chan_sip. Do you want another bug report, or just disregard since I'm running a patched system?

By: Russell Bryant (russell) 2008-03-06 11:50:37.000-0600

Feel free to open another report if you have specific information about a bug in chan_sip.