[Home]

Summary:ASTERISK-14129: [patch] Abort by memory allocator, possibly in moh_files_generator
Reporter:Joel Vandal (jvandal)Labels:
Date Opened:2009-05-14 10:49:50Date Closed:2009-09-08 13:32:02
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Resources/res_musiconhold
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 06082009lock.log
( 1) 06262009mohgdb.txt
( 2) bt_crashnoformatmp3.txt
( 3) bt_full-28082009-crashwhitpatch.TXT
( 4) bt_full-from28to31-five-crashwithpatch.TXT
( 5) bt_scrubbed.txt
( 6) bt_scubbed.txt
( 7) btfull_06052009.txt
( 8) gdb_dump.1.txt
( 9) gdb_dump.2.txt
(10) gdb07132009.txt
(11) malloc_debug.txt
(12) malloc_debug10062009.txt
(13) mmlog
(14) mp3_filestream_corruption.rev1.txt
(15) valgrind.txt
(16) valgrind06242009.txt
(17) valgrind13072009.txt
(18) valgrind15072009.zip
(19) valgrind15377.txt
(20) valgrind18082009.txt
(21) valgrindcrash06302009.txt
(22) valgrindcrash07022009allflags.txt
(23) valgrindcrash19082009asterisk1.4.24.1official.txt
(24) valgrindfilesmoh06232009.txt
(25) valgrindnocrash08022009withflags.txt
Description:I have a server running with Asterisk 1.4.24.1 where it randomly segfault for "unknown" reason.

I'm not sure if this is related to moh_files_generator function or with filestream_descructor.

Let me know what needed in order to fix this crash, if GDB traces aren't enough.

Asterisk is compiled with DONT_OPTIMIZE and others flag needed for "gdb".


****** ADDITIONAL INFORMATION ******

Backtrace (GDB) output:

Thread 1 (process 32543):
#0  0x009a1410 in __kernel_vsyscall ()
#1  0x002a6d10 in raise () from /lib/libc.so.6
#2  0x002a8621 in abort () from /lib/libc.so.6
#3  0x002dee5b in __libc_message () from /lib/libc.so.6
#4  0x002e6d06 in _int_free () from /lib/libc.so.6
ASTERISK-1  0x002ea1e0 in free () from /lib/libc.so.6
ASTERISK-2  0x002d5866 in fclose@@GLIBC_2.1 () from /lib/libc.so.6
ASTERISK-3  0x080b14d3 in filestream_destructor ()
ASTERISK-4  0x0807704c in __ast_pthread_mutex_unlock ()
ASTERISK-5  0x080b4017 in ast_filestream_frame_freed ()
ASTERISK-6 0x080b55a3 in ast_frame_free ()
ASTERISK-7 0x004513e7 in moh_files_generator (chan=0xb5a86948, data=0x8881278,
   len=160, samples=160) at res_musiconhold.c:294
ASTERISK-8 0x0808d353 in ast_read_generator_actions ()
ASTERISK-9 0x0808ebcd in __ast_read ()
ASTERISK-10 0x0808ee75 in ast_read ()
ASTERISK-11 0x08093f44 in ast_generic_bridge ()
ASTERISK-12 0x08095442 in ast_channel_bridge ()
ASTERISK-13 0x00624142 in ast_bridge_call (chan=0xb5a86948, peer=0x8a04d80,
   config=0xb66fe894) at res_features.c:1728
ASTERISK-14 0x00c3e29d in try_calling (qe=0xb66febe0, options=0xb66feb87 "",
   announceoverride=0xb66feb89 "", url=0xb66feb88 "", tries=0xb66fed80,
   noption=0xb66fed7c, agi=0xb66feb8e "") at app_queue.c:3209
ASTERISK-15 0x00c4189a in queue_exec (chan=0xb5a86948, data=0xb6700f08)
   at app_queue.c:4054
ASTERISK-16 0x080d59d1 in pbx_exec ()
ASTERISK-17 0x080d977f in pbx_extension_helper ()
ASTERISK-18 0x080dab16 in ast_spawn_extension ()
ASTERISK-19 0x080daf76 in __ast_pbx_run ()
ASTERISK-20 0x080dbf87 in pbx_thread ()
ASTERISK-21 0x08120c93 in dummy_start ()
ASTERISK-22 0x003f745b in start_thread () from /lib/libpthread.so.0
ASTERISK-23 0x0034ee5e in clone () from /lib/libc.so.6

Comments:By: Joel Vandal (jvandal) 2009-05-14 10:52:27

Can have a relation with ticket 14958, closed by request, unreproducible by aragon.

Maybe related to ticket 14395 (crash with filestream_destructor) but this patch is already include with 1.4.24.1.



By: David Woolley (davidw) 2009-05-15 05:04:58

This isn't a segfault.

It's an explicit abort by the memory allocator which means you need to follow the valgrind.txt procedures.

By: Leif Madsen (lmadsen) 2009-05-15 09:53:40

I'm curious if you're doing any 'moh reload' here at all? I ran into the same kind of issue this morning when doing some testing with MoH. I'm not exactly sure how I reproduced it, but it was happening when I did a 'moh reload'.

By: David Brillert (aragon) 2009-05-15 10:27:19

When I first reported this problem the time stamp was at night and after hours.
No one would have issued a reload at that time imho.

But if I had to reproduce based on the reload scenario I think I would try this...
Run Asterisk under valgrind
Set up a call between two end points and place the call on hold to play back moh.
issue reload command every 60 seconds for half an hour and post results of valgrind.

Does this make sense or would the test have to keep running until Asterisk crash before valgrind could be posted?

By: Leif Madsen (lmadsen) 2009-05-15 10:44:22

Well, basically valgrind will most likely keep Asterisk from crashing, so what you'll want to do is see if you can easily reproduce outside of valgrind, then reproduce the same test inside valgrind. Once you've executed your test, then you can attach the valgrind files here.

The reason for doing it this way is that you probably won't get a crash from Asterisk while inside of valgrind.

By: Joel Vandal (jvandal) 2009-05-15 11:13:09

It possible that the 'moh reload' command has been executed but I will verify with the customer.

Aragon, on GUI, the Full Commit will always execute the 'moh reload'.

By: David Brillert (aragon) 2009-05-15 13:30:14

I labbed this up.
Ran Asterisk in Valgrind
Four Polycom phones
Used script to do full reload every 60 seconds while I made a bunch of test calls.
I tried every hold scenario I could think of while transferring calls to queues and multiple extensions and park.
I could not generate a crash.

I'm uploading Valgrind anyway...

By: Mark Michelson (mmichelson) 2009-05-15 16:54:18

The valgrind output here doesn't have anything related to the filestream refcount problem in it, unfortunately. It just has the typical libc startup errors.

We had a customer seeing this issue a little while back, and it was impossible for me to reproduce it while running valgrind. I did a lot of code inspection and I could not figure out how this particular problem could be happening.

lmadsen: you say you were able to reproduce this by doing some reloading of moh? I actually did not try doing that during my tests. I may revisit this since I was very frustrated that I could not see how the crash could be happening.

By: Leif Madsen (lmadsen) 2009-05-19 12:40:41

mmichelson: I can't guarantee that is how I reproduced the issue, as it just seemed that I had crashed Asterisk while testing another issue (which was related to MoH becoming inactive when you did a 'reload res_musiconhold.so'. Asterisk had crashed, but the backtrace was not useful because I didn't have DONT_OPTIMIZE compiled in.

However, that was something I had seen, then I did some bug triage and saw this issue which appeared to be a similar issue to what I had seen, so I was simply making a suggestion that perhaps reloading MoH while it was being used could potentially cause this crash to happen?

I can have a look again later this week to see if I can reproduce, but I'm not sure how much time I'll have to get back to this during the week.

By: David Brillert (aragon) 2009-06-05 12:53:56

I just had this happen on a newer install.
Sorry no Valgrind but I am uploading a full bt... btfull 06052009.txt

I noticed that since the crash I am getting this warning but I checked /var/lib/asterisk/moh/ and the file is definitely there.

After seeing the warnings I did a reload and the warnings went away and moh is working normally.

WARNING[26161]: file.c:655 ast_openstream_full: File /var/lib/asterisk/moh/fpm-calm-river.wav does not exist in any format
[Jun  5 11:32:47] WARNING[26161]: res_musiconhold.c:250 ast_moh_files_next: Unable to open file '/var/lib/asterisk/moh/fpm-calm-river.wav': No such file or directory

dir
fpm-calm-river.alaw     fpm-sunshine.alaw     fpm-world-mix.alaw
fpm-calm-river.gsm      fpm-sunshine.gsm      fpm-world-mix.gsm
fpm-calm-river.sln      fpm-sunshine.sln      fpm-world-mix.sln
fpm-calm-river.ulaw     fpm-sunshine.ulaw     fpm-world-mix.ulaw
fpm-calm-river.wav      fpm-sunshine.wav      fpm-world-mix.wav
fpm-calm-river.wav.md5  fpm-sunshine.wav.md5  fpm-world-mix.wav.md5

By: David Brillert (aragon) 2009-06-08 10:41:02

The lock is caused during a reload!
Just happened to me on yet another site using Asterisk 1.4.21

The CLI became unresponsive after a reload
There was no segfault or core show locks data
All commands were unresponsive

thread apply all bt 06082009lock.log will be uploaded shortly



By: David Brillert (aragon) 2009-06-10 15:11:15

I have been able to lab this up and reproduce the crash 12 times before running under valgrind.  I should be able to test any patch very quickly in this lab environment.
Using Asterisk 1.4 SVN branch revision 199742

Uploading a valgrind dump and malloc_debug.txt
valgrind.txt.core.3857.zip
malloc_debug10062009.txt

Here is the CLI the system was shooting out prior to the crash.

   -- Executing [5000@zap-incoming:1] Answer("Zap/242-1", "") in new stack
   -- Got SIP response 489 "Bad event" back from 192.168.30.146
   -- Accepting call from '1021' to '5000' on channel 0/2, span 11
   -- Executing [5000@zap-incoming:3] NoOp("Zap/241-1", ""INCOMING CALL FROM CALLER ID: 1002 (1002)"") in new stack
   -- Got SIP response 489 "Bad event" back from 192.168.30.146
   -- Executing [5000@zap-incoming:4] Set("Zap/241-1", "__INCOMING_DNIS=5000") in new stack
   -- Executing [5000@zap-incoming:5] Set("Zap/241-1", "CALLERID(dnid)=5000") in new stack
   -- Executing [5000@zap-incoming:6] Set("Zap/241-1", "CDR(userfield)=5000") in new stack
   -- Got SIP response 489 "Bad event" back from 192.168.30.146
   -- Got SIP response 489 "Bad event" back from 192.168.30.146
[Jun 10 15:43:34] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1611' is now Reachable. (1884ms / 2000ms)
[Jun 10 15:43:34] ERROR[14172]: pbx.c:1565 ast_func_write: Function CDR not registered
   -- Executing [5000@zap-incoming:7] Goto("Zap/241-1", "10") in new stack
[Jun 10 15:43:35] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1645' is now Lagged. (3272ms / 2000ms)
   -- Executing [5000@zap-incoming:2] Set("Zap/242-1", "CHANNEL(musicclass)=default") in new stack
   -- Goto (zap-incoming,5000,10)
   -- Executing [5000@zap-incoming:10] AGI("Zap/241-1", "/var/www/scopserv/telephony/scripts/agi/fixcidname.php") in new stack
 == Parsing '/etc/asterisk/res_snmp.conf': Found
Loading [Sub]Agent Module
   -- Executing [5000@zap-incoming:3] NoOp("Zap/242-1", ""INCOMING CALL FROM CALLER ID: 1021 (1021)"") in new stack
   -- Executing [5000@zap-incoming:1] Answer("Zap/243-1", "") in new stack
[Jun 10 15:43:36] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1623' is now Reachable. (1736ms / 2000ms)
res_snmp.so => (SNMP [Sub]Agent for Asterisk)
   -- Got SIP response 489 "Bad event" back from 192.168.30.146
[Jun 10 15:43:36] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1632' is now Reachable. (1955ms / 2000ms)
[Jun 10 15:43:36] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1625' is now Reachable. (1963ms / 2000ms)
[Jun 10 15:43:36] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1624' is now Reachable. (1964ms / 2000ms)
[Jun 10 15:43:36] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1611' is now Lagged. (3386ms / 2000ms)
[Jun 10 15:43:36] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1644' is now Reachable. (1964ms / 2000ms)
   -- Executing [5000@zap-incoming:4] Set("Zap/242-1", "__INCOMING_DNIS=5000") in new stack
   -- Accepting call from '1040' to '5000' on channel 0/3, span 11
 == Parsing '/etc/asterisk/codecs.conf': [Jun 10 15:43:36] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1639' is now Reachable. (1829ms / 2000ms)
   -- Executing [5000@zap-incoming:1] Answer("Zap/244-1", "") in new stack
   -- Executing [5000@zap-incoming:5] Set("Zap/242-1", "CALLERID(dnid)=5000") in new stack
   -- Executing [5000@zap-incoming:2] Set("Zap/243-1", "CHANNEL(musicclass)=default") in new stack
Found
   -- codec_g722: using generic PLC
   -- Got SIP response 489 "Bad event" back from 192.168.30.146
   -- Executing [5000@zap-incoming:6] Set("Zap/242-1", "CDR(userfield)=5000") in new stack
   -- Launched AGI Script /var/www/scopserv/telephony/scripts/agi/fixcidname.php
   -- Got SIP response 489 "Bad event" back from 192.168.30.146
 == Registered translator 'g722tolin' from format g722 to slin, cost 486
[Jun 10 15:43:37] ERROR[14203]: pbx.c:1565 ast_func_write: Function CDR not registered
   -- Executing [5000@zap-incoming:7] Goto("Zap/242-1", "10") in new stack
   -- Goto (zap-incoming,5000,10)
   -- Executing [5000@zap-incoming:10] AGI("Zap/242-1", "/var/www/scopserv/telephony/scripts/agi/fixcidname.php") in new stack
   -- Executing [5000@zap-incoming:3] NoOp("Zap/243-1", ""INCOMING CALL FROM CALLER ID: 1040 (1040)"") in new stack
   -- Launched AGI Script /var/www/scopserv/telephony/scripts/agi/fixcidname.php
   -- Accepting call from '1027' to '5000' on channel 0/4, span 11
   -- Executing [5000@zap-incoming:2] Set("Zap/244-1", "CHANNEL(musicclass)=default") in new stack
 == Registered translator 'lintog722' from format slin to g722, cost 501
   -- Executing [5000@zap-incoming:4] Set("Zap/243-1", "__INCOMING_DNIS=5000") in new stack
codec_g722.so => (ITU G.722-64kbps G722 Transcoder)
 == Registered application 'ZapBarge'
app_dahdibarge.so => (Barge in on channel application)
   -- AGI Script /var/www/scopserv/telephony/scripts/agi/fixcidname.php completed, returning 0
   -- Executing [5000@zap-incoming:11] Set("Zap/242-1", "FAXNUMBER=5000") in new stack
   -- Executing [5000@zap-incoming:3] NoOp("Zap/244-1", ""INCOMING CALL FROM CALLER ID: 1027 (1027)"") in new stack
   -- AGI Script /var/www/scopserv/telephony/scripts/agi/fixcidname.php completed, returning 0
   -- Executing [5000@zap-incoming:11] Set("Zap/241-1", "FAXNUMBER=5000") in new stack
   -- Executing [5000@zap-incoming:4] Set("Zap/244-1", "__INCOMING_DNIS=5000") in new stack
   -- Loaded PUBLIC key 'iaxtel'
   -- Executing [5000@zap-incoming:12] Set("Zap/242-1", "__INCOMINGLINE=1") in new stack
   -- Executing [5000@zap-incoming:5] Set("Zap/243-1", "CALLERID(dnid)=5000") in new stack
   -- Loaded PUBLIC key 'freeworlddialup'
res_crypto.so => (Cryptographic Digital Signatures)
   -- Executing [5000@zap-incoming:12] Set("Zap/241-1", "__INCOMINGLINE=1") in new stack
   -- Executing [5000@zap-incoming:1] Answer("Zap/245-1", "") in new stack
   -- Executing [5000@zap-incoming:5] Set("Zap/244-1", "CALLERID(dnid)=5000") in new stack
 == Registered application 'NVFaxDetect'
app_nv_fax_detect.so => (NV Fax Detection Application)
   -- Executing [5000@zap-incoming:13] GotoIf("Zap/241-1", "0?14:15") in new stack
   -- Executing [5000@zap-incoming:13] GotoIf("Zap/242-1", "0?14:15") in new stack
 == Registered application 'Transfer'
app_transfer.so => (Transfer)
   -- Executing [5000@zap-incoming:6] Set("Zap/243-1", "CDR(userfield)=5000") in new stack
[Jun 10 15:43:39] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1623' is now Lagged. (5426ms / 2000ms)
   -- Executing [5000@zap-incoming:6] Set("Zap/244-1", "CDR(userfield)=5000") in new stack
   -- Goto (zap-incoming,5000,15)
   -- Executing [5000@zap-incoming:15] GotoIf("Zap/241-1", "0?16:17") in new stack
   -- Goto (zap-incoming,5000,15)
 == Registered application 'SetCDRUserField'
 == Registered application 'AppendCDRUserField'
[Jun 10 15:43:39] ERROR[10126]: manager.c:2649 ast_manager_register_struct: Could not obtain lock on manager list
   -- Executing [5000@zap-incoming:15] GotoIf("Zap/242-1", "0?16:17") in new stack
[Jun 10 15:43:39] ERROR[14227]: pbx.c:1565 ast_func_write: Function CDR not registered
   -- Executing [5000@zap-incoming:7] Goto("Zap/243-1", "10") in new stack
   -- Executing [5000@zap-incoming:2] Set("Zap/245-1", "CHANNEL(musicclass)=default") in new stack
   -- Accepting call from '1029' to '5000' on channel 0/5, span 11
   -- Executing [5000@zap-incoming:3] NoOp("Zap/245-1", ""INCOMING CALL FROM CALLER ID: 1029 (1029)"") in new stack
   -- Goto (zap-incoming,5000,17)
   -- Executing [5000@zap-incoming:17] GotoIf("Zap/241-1", "0?18:19") in new stack
[Jun 10 15:43:40] ERROR[14244]: pbx.c:1565 ast_func_write: Function CDR not registered
   -- Executing [5000@zap-incoming:7] Goto("Zap/244-1", "10") in new stack
   -- Goto (zap-incoming,5000,17)
   -- Executing [5000@zap-incoming:17] GotoIf("Zap/242-1", "0?18:19") in new stack
   -- Goto (zap-incoming,5000,10)
   -- Executing [5000@zap-incoming:10] AGI("Zap/243-1", "/var/www/scopserv/telephony/scripts/agi/fixcidname.php") in new stack
[Jun 10 15:43:40] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1632' is now Lagged. (6226ms / 2000ms)
   -- Executing [5000@zap-incoming:1] Answer("Zap/246-1", "") in new stack
   -- Goto (zap-incoming,5000,19)
   -- Executing [5000@zap-incoming:19] GotoIf("Zap/241-1", "0?20:21") in new stack
   -- Executing [5000@zap-incoming:4] Set("Zap/245-1", "__INCOMING_DNIS=5000") in new stack
   -- Goto (zap-incoming,5000,10)
   -- Executing [5000@zap-incoming:10] AGI("Zap/244-1", "/var/www/scopserv/telephony/scripts/agi/fixcidname.php") in new stack
   -- Goto (zap-incoming,5000,19)
   -- Executing [5000@zap-incoming:19] GotoIf("Zap/242-1", "0?20:21") in new stack
   -- Accepting call from '1021' to '5000' on channel 0/6, span 11
[Jun 10 15:43:40] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1625' is now Lagged. (6548ms / 2000ms)
   -- Goto (zap-incoming,5000,21)
   -- Executing [5000@zap-incoming:21] Set("Zap/241-1", "CHANNEL(language)=en") in new stack
   -- Executing [5000@zap-incoming:5] Set("Zap/245-1", "CALLERID(dnid)=5000") in new stack
   -- Goto (zap-incoming,5000,21)
   -- Executing [5000@zap-incoming:21] Set("Zap/242-1", "CHANNEL(language)=en") in new stack
[Jun 10 15:43:41] ERROR[14244]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:431 __ast_pthread_mutex_lock: loader.c line 194 (__ast_module_user_add): Error obtaining mutex: Invalid argument
[Jun 10 15:43:41] ERROR[14244]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:514 __ast_pthread_mutex_unlock: loader.c line 196 (__ast_module_user_add): mutex '&(&mod->users)->lock' freed more times than we've locked!
loader.c line 196 (__ast_module_user_add): mutex '&(&mod->users)->lock' freed more times than we've locked!
[Jun 10 15:43:41] ERROR[14244]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:531 __ast_pthread_mutex_unlock: loader.c line 196 (__ast_module_user_add): Error releasing mutex: Invalid argument
   -- Executing [5000@zap-incoming:22] Set("Zap/242-1", "QUEUE_PRIO=0") in new stack
[Jun 10 15:43:41] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1624' is now Lagged. (7648ms / 2000ms)
   -- Executing [5000@zap-incoming:22] Set("Zap/241-1", "QUEUE_PRIO=0") in new stack
   -- Executing [5000@zap-incoming:6] Set("Zap/245-1", "CDR(userfield)=5000") in new stack
   -- Executing [5000@zap-incoming:2] Set("Zap/246-1", "CHANNEL(musicclass)=default") in new stack
[Jun 10 15:43:42] NOTICE[13661]: chan_sip.c:13085 handle_response_peerpoke: Peer '1644' is now Lagged. (7810ms / 2000ms)
   -- Executing [5000@zap-incoming:23] Gosub("Zap/242-1", "default-queue|s|1") in new stack
[Jun 10 15:43:42] ERROR[14304]: pbx.c:1565 ast_func_write: Function CDR not registered
   -- Executing [5000@zap-incoming:7] Goto("Zap/245-1", "10") in new stack
   -- Executing [5000@zap-incoming:1] Answer("Zap/247-1", "") in new stack
   -- Executing [5000@zap-incoming:23] Gosub("Zap/241-1", "default-queue|s|1") in new stack
   -- Executing [5000@zap-incoming:3] NoOp("Zap/246-1", ""INCOMING CALL FROM CALLER ID: 1021 (1021)"") in new stack
[Jun 10 15:43:42] ERROR[14203]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:431 __ast_pthread_mutex_lock: loader.c line 194 (__ast_module_user_add): Error obtaining mutex: Invalid argument
[Jun 10 15:43:42] ERROR[14203]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:514 __ast_pthread_mutex_unlock: loader.c line 196 (__ast_module_user_add): mutex '&(&mod->users)->lock' freed more times than we've locked!
[Jun 10 15:43:42] ERROR[14203]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:531 __ast_pthread_mutex_unlock: loader.c line 196 (__ast_module_user_add): Error releasing mutex: Invalid argument
[Jun 10 15:43:42] ERROR[14203]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:431 __ast_pthread_mutex_lock: loader.c line 207 (__ast_module_user_remove): Error obtaining mutex: Invalid argument
[Jun 10 15:43:42] ERROR[14203]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:514 __ast_pthread_mutex_unlock: loader.c line 209 (__ast_module_user_remove): mutex '&(&mod->users)->lock' freed more times than we've locked!
[Jun 10 15:43:42] ERROR[14203]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:531 __ast_pthread_mutex_unlock: loader.c line 209 (__ast_module_user_remove): Error releasing mutex: Invalid argument
   -- Launched AGI Script /var/www/scopserv/telephony/scripts/agi/fixcidname.php
   -- Goto (zap-incoming,5000,10)
   -- Executing [5000@zap-incoming:10] AGI("Zap/245-1", "/var/www/scopserv/telephony/scripts/agi/fixcidname.php") in new stack
   -- Executing [s@default-queue:1] NoOp("Zap/242-1", ""=== START QUEUE (default-queue) ==="") in new stack
[Jun 10 15:43:42] ERROR[14172]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:431 __ast_pthread_mutex_lock: loader.c line 194 (__ast_module_user_add): Error obtaining mutex: Invalid argument
[Jun 10 15:43:42] ERROR[14172]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:514 __ast_pthread_mutex_unlock: loader.c line 196 (__ast_module_user_add): mutex '&(&mod->users)->lock' freed more times than we've locked!
[Jun 10 15:43:42] ERROR[14172]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:531 __ast_pthread_mutex_unlock: loader.c line 196 (__ast_module_user_add): Error releasing mutex: Invalid argument
[Jun 10 15:43:42] ERROR[14172]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:431 __ast_pthread_mutex_lock: loader.c line 207 (__ast_module_user_remove): Error obtaining mutex: Invalid argument
[Jun 10 15:43:42] ERROR[14172]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:514 __ast_pthread_mutex_unlock: loader.c line 209 (__ast_module_user_remove): mutex '&(&mod->users)->lock' freed more times than we've locked!
[Jun 10 15:43:42] ERROR[14172]: /dar/build/asterisk-1.4.26/include/asterisk/lock.h:531 __ast_pthread_mutex_unlock: loader.c line 209 (__ast_module_user_remove): Error releasing mutex: Invalid argument
   -- Executing [s@default-queue:1] NoOp("Zap/241-1", ""=== START QUEUE (default-queue) ==="") in new stack
   -- Executing [5000@zap-incoming:4] Set("Zap/246-1", "__INCOMING_DNIS=5000") in new stack
   -- Launched AGI Script /var/www/scopserv/telephony/scripts/agi/fixcidname.php



By: David Brillert (aragon) 2009-06-10 15:23:50

I have a Valgrind dump that is too big to be uploaded and contains confidential data.

How can I get this info to a developer privately?

By: Leif Madsen (lmadsen) 2009-06-12 08:07:14

aragon: can you send this to tlesher@digium.com and mention this is for issue 15109 which is a blocker for 1.4.27? Ask him to forward to someone more appropriate if necessary. He is the first person I would ask to look at it, but that doesn't mean he has to, or that he'll have the time. Thanks!

By: David Brillert (aragon) 2009-06-15 08:18:23

I ran the lab test over the weekend and ran several reload commands to get the lab system to crash under Valgrind.
No luck getting the system to crash, but this morning the entire system was locked.
core show channels displayed nothing
core show locks displayed nothing
I could not attach to the asterisk/valgrind pid with gdb because of a memory error.  So I had to kill memcheck in order to restart my experiment.

Therefore I could not get valgrind.txt.core file

However, malloc_debug.txt had one error that was different than before
WARNING: Freeing unused memory at (nil), in lock_info_destroy of utils.c, line 536

valgrind.txt displayed some moh reload data that I hope will be useful.

==16435==    at 0x4006998: strlen (mc_replace_strmem.c:242)
==16435==    by 0x4050BCD: vfprintf (in /lib/libc-2.5.so)
==16435==    by 0x40713B3: vsnprintf (in /lib/libc-2.5.so)
==16435==    by 0x8121C21: ast_dynamic_str_thread_build_va (in /usr/sbin/asterisk)
==16435==    by 0x80C6416: ast_log (in /usr/sbin/asterisk)
==16435==    by 0x4B26BA2: spawn_mp3 (res_musiconhold.c:433)
==16435==    by 0x4B2703A: monmp3thread (res_musiconhold.c:530)
==16435==    by 0x812049A: dummy_start (in /usr/sbin/asterisk)
==16435==    by 0x8B249A: start_thread (in /lib/libpthread-2.5.so)
==16435==    by 0x40E242D: clone (in /lib/libc-2.5.so)
==16435==  Address 0x9b48460 is 416 bytes inside a block of size 1,068 free'd
==16435==    at 0x400562C: free (vg_replace_malloc.c:323)
==16435==    by 0x8072790: __ast_free_region (in /usr/sbin/asterisk)
==16435==    by 0x8073182: __ast_free (in /usr/sbin/asterisk)
==16435==    by 0x80749ED: ao2_ref (in /usr/sbin/asterisk)
==16435==    by 0x8075578: ao2_callback (in /usr/sbin/asterisk)
==16435==    by 0x4B29C2E: load_moh_classes (res_musiconhold.c:1309)
==16435==    by 0x4B2A2AE: reload (res_musiconhold.c:1447)
==16435==    by 0x80C272A: ast_module_reload (in /usr/sbin/asterisk)
==16435==    by 0x809567D: handle_reload_deprecated (in /usr/sbin/asterisk)
==16435==    by 0x809BDEF: ast_cli_command (in /usr/sbin/asterisk)
==16435==    by 0x806CD1D: consolehandler (in /usr/sbin/asterisk)
==16435==    by 0x80723F7: main (in /usr/sbin/asterisk)
==16435==
==16435== Invalid read of size 1
==16435==    at 0x40069A3: strlen (mc_replace_strmem.c:242)
==16435==    by 0x4050BCD: vfprintf (in /lib/libc-2.5.so)
==16435==    by 0x40713B3: vsnprintf (in /lib/libc-2.5.so)
==16435==    by 0x8121C21: ast_dynamic_str_thread_build_va (in /usr/sbin/asterisk)
==16435==    by 0x80C6416: ast_log (in /usr/sbin/asterisk)
==16435==    by 0x4B26BA2: spawn_mp3 (res_musiconhold.c:433)
==16435==    by 0x4B2703A: monmp3thread (res_musiconhold.c:530)
==16435==    by 0x812049A: dummy_start (in /usr/sbin/asterisk)
==16435==    by 0x8B249A: start_thread (in /lib/libpthread-2.5.so)
==16435==    by 0x40E242D: clone (in /lib/libc-2.5.so)
==16435==  Address 0x9b48461 is 417 bytes inside a block of size 1,068 free'd
==16435==    at 0x400562C: free (vg_replace_malloc.c:323)
==16435==    by 0x8072790: __ast_free_region (in /usr/sbin/asterisk)
==16435==    by 0x8073182: __ast_free (in /usr/sbin/asterisk)
==16435==    by 0x80749ED: ao2_ref (in /usr/sbin/asterisk)
==16435==    by 0x8075578: ao2_callback (in /usr/sbin/asterisk)
==16435==    by 0x4B29C2E: load_moh_classes (res_musiconhold.c:1309)
==16435==    by 0x4B2A2AE: reload (res_musiconhold.c:1447)
==16435==    by 0x80C272A: ast_module_reload (in /usr/sbin/asterisk)
==16435==    by 0x809567D: handle_reload_deprecated (in /usr/sbin/asterisk)
==16435==    by 0x809BDEF: ast_cli_command (in /usr/sbin/asterisk)
==16435==    by 0x806CD1D: consolehandler (in /usr/sbin/asterisk)
==16435==    by 0x80723F7: main (in /usr/sbin/asterisk)
==16435==
==16435== Invalid read of size 1
==16435==    at 0x4007521: mempcpy (mc_replace_strmem.c:677)
==16435==    by 0x4076634: _IO_default_xsputn (in /lib/libc-2.5.so)
==16435==    by 0x40506F5: vfprintf (in /lib/libc-2.5.so)
==16435==    by 0x40713B3: vsnprintf (in /lib/libc-2.5.so)
==16435==    by 0x8121C21: ast_dynamic_str_thread_build_va (in /usr/sbin/asterisk)
==16435==    by 0x80C6416: ast_log (in /usr/sbin/asterisk)
==16435==    by 0x4B26BA2: spawn_mp3 (res_musiconhold.c:433)
==16435==    by 0x4B2703A: monmp3thread (res_musiconhold.c:530)
==16435==    by 0x812049A: dummy_start (in /usr/sbin/asterisk)
==16435==    by 0x8B249A: start_thread (in /lib/libpthread-2.5.so)
==16435==    by 0x40E242D: clone (in /lib/libc-2.5.so)
==16435==  Address 0x9b48460 is 416 bytes inside a block of size 1,068 free'd
==16435==    at 0x400562C: free (vg_replace_malloc.c:323)
==16435==    by 0x8072790: __ast_free_region (in /usr/sbin/asterisk)
==16435==    by 0x8073182: __ast_free (in /usr/sbin/asterisk)
==16435==    by 0x80749ED: ao2_ref (in /usr/sbin/asterisk)
==16435==    by 0x8075578: ao2_callback (in /usr/sbin/asterisk)
==16435==    by 0x4B29C2E: load_moh_classes (res_musiconhold.c:1309)
==16435==    by 0x4B2A2AE: reload (res_musiconhold.c:1447)
==16435==    by 0x80C272A: ast_module_reload (in /usr/sbin/asterisk)
==16435==    by 0x809567D: handle_reload_deprecated (in /usr/sbin/asterisk)
==16435==    by 0x809BDEF: ast_cli_command (in /usr/sbin/asterisk)
==16435==    by 0x806CD1D: consolehandler (in /usr/sbin/asterisk)
==16435==    by 0x80723F7: main (in /usr/sbin/asterisk)
==16435==
==16435== Invalid read of size 1
==16435==    at 0x400752D: mempcpy (mc_replace_strmem.c:677)
==16435==    by 0x4076634: _IO_default_xsputn (in /lib/libc-2.5.so)
==16435==    by 0x40506F5: vfprintf (in /lib/libc-2.5.so)
==16435==    by 0x40713B3: vsnprintf (in /lib/libc-2.5.so)
==16435==    by 0x8121C21: ast_dynamic_str_thread_build_va (in /usr/sbin/asterisk)
==16435==    by 0x80C6416: ast_log (in /usr/sbin/asterisk)
==16435==    by 0x4B26BA2: spawn_mp3 (res_musiconhold.c:433)
==16435==    by 0x4B2703A: monmp3thread (res_musiconhold.c:530)
==16435==    by 0x812049A: dummy_start (in /usr/sbin/asterisk)
==16435==    by 0x8B249A: start_thread (in /lib/libpthread-2.5.so)
==16435==    by 0x40E242D: clone (in /lib/libc-2.5.so)
==16435==  Address 0x9b48462 is 418 bytes inside a block of size 1,068 free'd
==16435==    at 0x400562C: free (vg_replace_malloc.c:323)
==16435==    by 0x8072790: __ast_free_region (in /usr/sbin/asterisk)
==16435==    by 0x8073182: __ast_free (in /usr/sbin/asterisk)
==16435==    by 0x80749ED: ao2_ref (in /usr/sbin/asterisk)
==16435==    by 0x8075578: ao2_callback (in /usr/sbin/asterisk)
==16435==    by 0x4B29C2E: load_moh_classes (res_musiconhold.c:1309)
==16435==    by 0x4B2A2AE: reload (res_musiconhold.c:1447)
==16435==    by 0x80C272A: ast_module_reload (in /usr/sbin/asterisk)
==16435==    by 0x809567D: handle_reload_deprecated (in /usr/sbin/asterisk)
==16435==    by 0x809BDEF: ast_cli_command (in /usr/sbin/asterisk)
==16435==    by 0x806CD1D: consolehandler (in /usr/sbin/asterisk)
==16435==    by 0x80723F7: main (in /usr/sbin/asterisk)

The basis of my tests is to load 4 PRI interfaces and pass multiple calls to an ACD queue to logged agents.  As a result some calls are answered and some are held with MOH. A percentage of ACD calls answered by agents are transferred to other extensions.  A reload command is issued every ten minutes.
Without running Asterisk under valgrind this is sufficient to crash Asterisk consistently about twice per hour.  
With valgrind I was able to produce two crashes and I sent two valgrind.txt.core files to tlesher for analysis (one per crash under valgrind).
It is likely that at least one of those valgrind.txt.core files is not directly related to the crash reported in 15109.  The second valgrind.txt.core file looked more related to the original bug report.

I'll keep trying to produce additional valgrind dumps, however I am finding it extremely difficult to reproduce the crash under valgrind.



By: Leif Madsen (lmadsen) 2009-06-16 14:19:59

You should also note that running under valgrind may not actually cause the system to crash!  You should run the test without valgrind to know *when* the system would crash, then under valgrind, run the same test to that same point, then just provide the valgrind output as valgrind will very likely cause the system to not crash.

By: David Brillert (aragon) 2009-06-18 00:44:20

This crash didn't happen under 1.4.24.1
So I checked the changelog for 1.4.26rc2 to look for a bad revision.

I found it.

I was running a script to run the conversion tool on each reload to convert existing MOH and IVR files audio codecs to multiple formats/codecs.
This combined with the fact that a reload also reloads MOH I think was causing a race condition which could not be reproduced with valgrind.
Since disabling the conversion script I have no more segfaults (I think).
My test procedure as follows:
No Valgrind...
A script to run the conversion and reload every ten minutes with revision 199742
produced 12 crashes with core file generation every 8 hours.
A script to do a reload every 5 minutes with revision 199742 produced 0 crashes over 8 hours.

The bad revision is here

2009-05-26 18:14 +0000 [r196826] Russell Bryant <russell@digium.com>
  * res/res_convert.c: Resolve a file handle leak. The frames here
   should have always been freed. However, out of luck, there was
   never any memory leaked. However, after file streams became
   reference counted, this code would leak the file stream for the
   file being read. (closes issue ASTERISK-14190) Reported by: jkroon

Edit: After using moh files mode in my configs instead of non-files mode moh have produced crashes when convert.c is not used.  Therefore imho there is no relation to this revision to bug 15109.



By: David Brillert (aragon) 2009-06-18 00:47:51

I should mention I did not revert/rebuild Asterisk 199742 with r196826 patch removed.  I simply stopped using res_convert.c to convert existing audio files and no more segfaults.



By: Digium Subversion (svnbot) 2009-06-18 10:24:32

Repository: asterisk
Revision: 201600

U   branches/1.4/res/res_musiconhold.c

------------------------------------------------------------------------
r201600 | russell | 2009-06-18 10:24:31 -0500 (Thu, 18 Jun 2009) | 29 lines

Fix memory corruption and leakage related reloads of non files mode MoH classes.

For Music on Hold classes that are not files mode, meaning that we are executing
an application that will feed us audio data, we use a thread to monitor the
external application and read audio from it.  This thread also makes use of the
MoH class object.  In the MoH class destructor, we used pthread_cancel() to ask
the thread to exit.  Unfortunately, the code did not wait to ensure that the
thread actually went away.  What needed to be done is a pthread_join() to ensure
that the thread fully cleans up before we proceed.  By adding this one line, we
resolve two significant problems:

 1) Since the thread was never joined, it never fully goes away.  So, on every
    reload of non-files mode MoH, an unused thread was sticking around.

 2) There was a race condition here where the application monitoring thread
    could still try to access the MoH class, even though the thread executing
    the MoH reload has already destroyed it.

(issue ASTERISK-14129)
Reported by: jvandal

(issue ASTERISK-14141)
Reported by: axisinternet

(issue ASTERISK-14203)
Reported by: amorsen

(issue AST-208)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=201600

By: Russell Bryant (russell) 2009-06-18 10:26:05

Please give that patch a try and let me know if it makes the crashes go away.

By: Digium Subversion (svnbot) 2009-06-18 10:27:10

Repository: asterisk
Revision: 201610

_U  trunk/
U   trunk/res/res_musiconhold.c

------------------------------------------------------------------------
r201610 | russell | 2009-06-18 10:27:10 -0500 (Thu, 18 Jun 2009) | 36 lines

Merged revisions 201600 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
 r201600 | russell | 2009-06-18 10:24:31 -0500 (Thu, 18 Jun 2009) | 29 lines
 
 Fix memory corruption and leakage related reloads of non files mode MoH classes.
 
 For Music on Hold classes that are not files mode, meaning that we are executing
 an application that will feed us audio data, we use a thread to monitor the
 external application and read audio from it.  This thread also makes use of the
 MoH class object.  In the MoH class destructor, we used pthread_cancel() to ask
 the thread to exit.  Unfortunately, the code did not wait to ensure that the
 thread actually went away.  What needed to be done is a pthread_join() to ensure
 that the thread fully cleans up before we proceed.  By adding this one line, we
 resolve two significant problems:
 
   1) Since the thread was never joined, it never fully goes away.  So, on every
      reload of non-files mode MoH, an unused thread was sticking around.
 
   2) There was a race condition here where the application monitoring thread
      could still try to access the MoH class, even though the thread executing
      the MoH reload has already destroyed it.
 
 (issue ASTERISK-14129)
 Reported by: jvandal
 
 (issue ASTERISK-14141)
 Reported by: axisinternet
 
 (issue ASTERISK-14203)
 Reported by: amorsen
 
 (issue AST-208)
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=201610

By: Digium Subversion (svnbot) 2009-06-18 10:32:38

Repository: asterisk
Revision: 201612

_U  branches/1.6.0/
U   branches/1.6.0/res/res_musiconhold.c

------------------------------------------------------------------------
r201612 | russell | 2009-06-18 10:32:38 -0500 (Thu, 18 Jun 2009) | 43 lines

Merged revisions 201610 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r201610 | russell | 2009-06-18 10:27:10 -0500 (Thu, 18 Jun 2009) | 36 lines
 
 Merged revisions 201600 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r201600 | russell | 2009-06-18 10:24:31 -0500 (Thu, 18 Jun 2009) | 29 lines
   
   Fix memory corruption and leakage related reloads of non files mode MoH classes.
   
   For Music on Hold classes that are not files mode, meaning that we are executing
   an application that will feed us audio data, we use a thread to monitor the
   external application and read audio from it.  This thread also makes use of the
   MoH class object.  In the MoH class destructor, we used pthread_cancel() to ask
   the thread to exit.  Unfortunately, the code did not wait to ensure that the
   thread actually went away.  What needed to be done is a pthread_join() to ensure
   that the thread fully cleans up before we proceed.  By adding this one line, we
   resolve two significant problems:
   
     1) Since the thread was never joined, it never fully goes away.  So, on every
        reload of non-files mode MoH, an unused thread was sticking around.
   
     2) There was a race condition here where the application monitoring thread
        could still try to access the MoH class, even though the thread executing
        the MoH reload has already destroyed it.
   
   (issue ASTERISK-14129)
   Reported by: jvandal
   
   (issue ASTERISK-14141)
   Reported by: axisinternet
   
   (issue ASTERISK-14203)
   Reported by: amorsen
   
   (issue AST-208)
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=201612

By: Digium Subversion (svnbot) 2009-06-18 10:36:11

Repository: asterisk
Revision: 201613

_U  branches/1.6.1/
U   branches/1.6.1/res/res_musiconhold.c

------------------------------------------------------------------------
r201613 | russell | 2009-06-18 10:36:11 -0500 (Thu, 18 Jun 2009) | 43 lines

Merged revisions 201610 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r201610 | russell | 2009-06-18 10:27:10 -0500 (Thu, 18 Jun 2009) | 36 lines
 
 Merged revisions 201600 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r201600 | russell | 2009-06-18 10:24:31 -0500 (Thu, 18 Jun 2009) | 29 lines
   
   Fix memory corruption and leakage related reloads of non files mode MoH classes.
   
   For Music on Hold classes that are not files mode, meaning that we are executing
   an application that will feed us audio data, we use a thread to monitor the
   external application and read audio from it.  This thread also makes use of the
   MoH class object.  In the MoH class destructor, we used pthread_cancel() to ask
   the thread to exit.  Unfortunately, the code did not wait to ensure that the
   thread actually went away.  What needed to be done is a pthread_join() to ensure
   that the thread fully cleans up before we proceed.  By adding this one line, we
   resolve two significant problems:
   
     1) Since the thread was never joined, it never fully goes away.  So, on every
        reload of non-files mode MoH, an unused thread was sticking around.
   
     2) There was a race condition here where the application monitoring thread
        could still try to access the MoH class, even though the thread executing
        the MoH reload has already destroyed it.
   
   (issue ASTERISK-14129)
   Reported by: jvandal
   
   (issue ASTERISK-14141)
   Reported by: axisinternet
   
   (issue ASTERISK-14203)
   Reported by: amorsen
   
   (issue AST-208)
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=201613

By: Digium Subversion (svnbot) 2009-06-18 10:40:16

Repository: asterisk
Revision: 201614

_U  branches/1.6.2/
U   branches/1.6.2/res/res_musiconhold.c

------------------------------------------------------------------------
r201614 | russell | 2009-06-18 10:40:16 -0500 (Thu, 18 Jun 2009) | 43 lines

Merged revisions 201610 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r201610 | russell | 2009-06-18 10:27:10 -0500 (Thu, 18 Jun 2009) | 36 lines
 
 Merged revisions 201600 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r201600 | russell | 2009-06-18 10:24:31 -0500 (Thu, 18 Jun 2009) | 29 lines
   
   Fix memory corruption and leakage related reloads of non files mode MoH classes.
   
   For Music on Hold classes that are not files mode, meaning that we are executing
   an application that will feed us audio data, we use a thread to monitor the
   external application and read audio from it.  This thread also makes use of the
   MoH class object.  In the MoH class destructor, we used pthread_cancel() to ask
   the thread to exit.  Unfortunately, the code did not wait to ensure that the
   thread actually went away.  What needed to be done is a pthread_join() to ensure
   that the thread fully cleans up before we proceed.  By adding this one line, we
   resolve two significant problems:
   
     1) Since the thread was never joined, it never fully goes away.  So, on every
        reload of non-files mode MoH, an unused thread was sticking around.
   
     2) There was a race condition here where the application monitoring thread
        could still try to access the MoH class, even though the thread executing
        the MoH reload has already destroyed it.
   
   (issue ASTERISK-14129)
   Reported by: jvandal
   
   (issue ASTERISK-14141)
   Reported by: axisinternet
   
   (issue ASTERISK-14203)
   Reported by: amorsen
   
   (issue AST-208)
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=201614

By: David Brillert (aragon) 2009-06-18 16:24:21

I have this all labbed up and will test overnight for segfaults.

By: Miguel Molina (coolmig) 2009-06-18 23:34:43

Testing here too (file based MoH), do you have files based MoH or external application? IMHO most of the cases presented on this and the related issues belong to file based MoH.

By: David Brillert (aragon) 2009-06-19 07:52:37

I ran the test overnight (about 14 hours) on 4 fully loaded PRI's at 50 calls per second while issuing a reload command every 5 minutes.  The file conversion was also scripted during each reload.
No segfaults.
Patch appears to fix problem.

By: Russell Bryant (russell) 2009-06-19 08:28:36

aragon, thanks for the feedback!

coolmig, you are correct that most of the backtraces are related to files mode MoH.  The valgrind output on this bug from aragon is what pointed me to the problems addressed by the last fix.  It's possible that the same problem could cause corruption in res_musiconhold if any non-files mode MoH classes were configured.  If you _only_ have files mode configured, then it's unlikely that the last patch will change anything.

Testing is still certainly appreciated.  However, I'll still be thinking and looking for what else could cause these crashes.

By: David Brillert (aragon) 2009-06-19 09:22:36

russell

All of my tests ran using mp3 moh file playback.
I will run the test again with fpm-calm-river.wav fpm-sunshine.wav fpm-world-mix.wav etc...
I'll post results either 5pm EST today or wait until Monday AM while the lab soaks over the weekend.

By: Russell Bryant (russell) 2009-06-19 11:12:21

aragon - your valgrind output in a previous note showed MoH monitoring threads running.  Perhaps you have some classes configured that aren't mode=files?

Also, if the problem still occurs, if anyone is able to make this happen with valgrind (see doc/valgrind.txt), that would be very helpful.

By: David Brillert (aragon) 2009-06-19 13:37:33

Last test ran stable with musiconhold.conf
[default]
mode            =  custom
application     =  /usr/bin/madplay --mono -A -20 -R 8000 --output=raw:-
directory       =  /var/lib/asterisk/moh

Current test stable after several hours (will soak for more hours)
musiconhold.conf
[default]
mode            =  files
directory       =  /var/lib/asterisk/moh/
random          =  yes

By: David Brillert (aragon) 2009-06-19 13:58:45

Also just for fun I just copied some mp3 files along side of the stock wav files in /var/lib/asterisk/moh

We'll see how that runs over the weekend...
50 calls per second with a reload and file conversion script every 5 minutes ;)

By: David Brillert (aragon) 2009-06-21 15:38:53

Patched version crashed on 2009-06-20T09:42:59-0400 in files mode moh.

musiconhold.conf
[default]
mode = files
directory = /var/lib/asterisk/moh/
random = yes

I'll try to post some valgrind data on Monday if I can produce any data under Valgrind...
Here is the full backtrace (not under valgrind)


(gdb) bt
#0  0x006d9402 in __kernel_vsyscall ()
#1  0x00760d80 in raise () from /lib/libc.so.6
#2  0x00762691 in abort () from /lib/libc.so.6
#3  0x0079924b in __libc_message () from /lib/libc.so.6
#4  0x007a4ceb in free () from /lib/libc.so.6
ASTERISK-1  0x0041642a in ExitMP3 () from /usr/lib/asterisk/modules/format_mp3.so
ASTERISK-2  0x0040ef53 in __i686.get_pc_thunk.bx ()
  from /usr/lib/asterisk/modules/format_mp3.so
ASTERISK-3  0xb26418a8 in ?? ()
ASTERISK-4  0x00228974 in ?? () from /usr/lib/asterisk/modules/res_musiconhold.so
ASTERISK-5  0xb67d0008 in ?? ()
ASTERISK-6 0x080b04b2 in filestream_destructor ()
Backtrace stopped: frame did not save the PC
(gdb) bt full
#0  0x006d9402 in __kernel_vsyscall ()
No symbol table info available.
#1  0x00760d80 in raise () from /lib/libc.so.6
No symbol table info available.
#2  0x00762691 in abort () from /lib/libc.so.6
No symbol table info available.
#3  0x0079924b in __libc_message () from /lib/libc.so.6
No symbol table info available.
#4  0x007a4ceb in free () from /lib/libc.so.6
No symbol table info available.
ASTERISK-1  0x0041642a in ExitMP3 () from /usr/lib/asterisk/modules/format_mp3.so
No symbol table info available.
ASTERISK-2  0x0040ef53 in __i686.get_pc_thunk.bx ()
  from /usr/lib/asterisk/modules/format_mp3.so
No symbol table info available.
ASTERISK-3  0xb26418a8 in ?? ()
No symbol table info available.
ASTERISK-4  0x00228974 in ?? () from /usr/lib/asterisk/modules/res_musiconhold.so
No symbol table info available.
ASTERISK-5  0xb67d0008 in ?? ()
No symbol table info available.
ASTERISK-6 0x080b04b2 in filestream_destructor ()
No symbol table info available.
---Type <return> to continue, or q <return> to quit---
Backtrace stopped: frame did not save the PC
(gdb) thread apply all bt

Thread 1 (process 4093):
#0  0x006d9402 in __kernel_vsyscall ()
#1  0x00760d80 in raise () from /lib/libc.so.6
#2  0x00762691 in abort () from /lib/libc.so.6
#3  0x0079924b in __libc_message () from /lib/libc.so.6
#4  0x007a4ceb in free () from /lib/libc.so.6
ASTERISK-1  0x0041642a in ExitMP3 () from /usr/lib/asterisk/modules/format_mp3.so
ASTERISK-2  0x0040ef53 in __i686.get_pc_thunk.bx ()
  from /usr/lib/asterisk/modules/format_mp3.so
ASTERISK-3  0xb26418a8 in ?? ()
ASTERISK-4  0x00228974 in ?? () from /usr/lib/asterisk/modules/res_musiconhold.so
ASTERISK-5  0xb67d0008 in ?? ()
ASTERISK-6 0x080b04b2 in filestream_destructor ()

By: David Brillert (aragon) 2009-06-23 08:29:43

Running in Valgrind for about 18 hours now but no crash.
4 PRI spans 50 active calls, reload every 5 minutes and conversion script being used. Files based MOH in use.
I think I see some errors in current running valgrind.txt file
Uploading valgrindfixedmoh06232009.txt

By: David Brillert (aragon) 2009-06-23 10:05:45

Uploading valgrind.txt from bug report 15377 where I caught a crash using an IAX2 trunk under valgrind.
I'm going to name this debug file valgrind15377.txt

Edit: Bug 15377 is fixed and not related to this issue



By: David Brillert (aragon) 2009-06-23 13:41:10

Just an edit here to comply with lmadsen's request to not paste debug traces directly into bug reports.
Valgrind data that was in the original note can be found in my valgrind attachments in this report.



By: David Brillert (aragon) 2009-06-24 14:26:17

More CLI errors while in files mode and playing back a mix of mp3 and default wav files
The carefulwrite errors only appear during a reload as best I can tell.
Also the mp3 file playback sound like garbage and I always see the "decodeMP3: Junk at the beginning of frame" whenever files mode is used to play back any mp3 files.
wav files sound fine.

Jun 24 15:15:34] ERROR[1064]: utils.c:966 ast_carefulwrite:write() returned error: Broken pipe
[Jun 24 15:15:34] WARNING[986]: mp3/interface.c:215 decodeMP3: Junk at the beginning of frame 54414753
[Jun 24 15:15:39] ERROR[986]: utils.c:966 ast_carefulwrite: write() returned error: Broken pipe

Still trying to duplicate crash under valgrind while in moh files mode and issuing reload commands.  I have pretty much given up any hope for a crash under valgrind so I hope all the previous uploaded debugs will help debug this once and for all.
Since I cannot reproduce any crash using non files mode moh I am very tempted to do away with files mode once and for all to avoid any related crashes...

I'll wait for a few more hours with valgrind and upload final valgrind.txt file before restoring lab to normal operation.  I will then wait for a patch if one comes along to test.

By: David Brillert (aragon) 2009-06-24 14:55:18

valgrind06242009.txt uploading

By: Mark Michelson (mmichelson) 2009-06-24 15:21:16

Man, even if we don't fix this specific issue, there's enough good stuff from this valgrind output to fix at least a few bugs. Someone will have to be sure to go through all of this and find and fix any of the problems seen here. THis is great. Thanks for all the help, aragon.

By: David Brillert (aragon) 2009-06-24 15:39:20

mmichelson this valgrind is from revision 201600
and includes infinite loop patch from https://issues.asterisk.org/view.php?id=15345



By: Leif Madsen (lmadsen) 2009-06-24 15:51:00

aragon: thanks for the info! Sounds like you've found some good stuff there.

However, can you please attach your debugging information as a text file so we can keep the bug notes clean?

Thanks!

By: David Brillert (aragon) 2009-06-26 13:26:26

I uploaded another gdb output after patched * crashed moh in files mode
06262009mohgdb.txt



By: David Brillert (aragon) 2009-06-28 12:03:04

mmichelson

How do you want to handle the additional issues exposed by the Valgrind trace?
I can provide my lab scenario for testing only for a few more weeks... and I would like to test as many fixes as possible (or until valgrind runs clean).
I'm sorry for the short window of opportunity but I have to free up some of these servers.
The lab has been upgraded to r204008 and is currently testing a patch written for bug 15377

Should I/we open a separate bug report on each issue?
Or I should work with you on iirc...

Could you break down the issues exposed by the Valgrind trace so they can be documented?  Again I am happy to create another bug report(s) for this...

By: David Brillert (aragon) 2009-06-30 15:50:08

I was running another hammer test and crashed again on moh destruct (same backtrace as before).  So I elected to run Asterisk in Valgrind for a couple of days in lab since tomorrow is a national holiday in Canada.

Asterisk crashed a couple of times while starting with Valgrind so I am uploading valgrindcrash06302009.txt from that crash.  Maybe it will prove useful.

While still loading; Asterisk crashed under Valgrind with these last lines of CLI text

 == Registered application 'Transfer'
app_transfer.so => (Transfer)
 == Registered application 'SetCDRUserField'
 == Registered application 'AppendCDRUserField'
[Jun 30 16:50:48] ERROR[5646]: manager.c:2649 ast_manager_register_struct: Could not obtain lock on manager list

Latest testing is done using r204117 as I am gradually getting patches written for other open bug reports which remove some other crash scenarios.  My hope is to eventually get to a 1.4.26 release which will run stable...  Anyway this revision removes some mixmonitor problems and iax crashes to help me run my tests.



By: Alan Graham (zerohalo) 2009-07-02 08:25:19

getting this in 1.4.26rc4, no reload, files-based moh. bt attached.

By: David Brillert (aragon) 2009-07-02 09:50:31

My valgrind ran for two days with several reported memory errors in file.  Except that additional flags were needed to make the file more useful.

I restarted asterisk with valgrind again and it crashed before beginning call processing.  To be honest I don't know if it is common for Asterisk to crash while starting under Valgrind.  So I am posting the full valgrind output with all required flags "valgrindcrash07022009allflags.txt"

Next I started valgrind again with all required flags and I will allow this to run for several hours and will upload that valgrind file later.  This I hope will allow a developer to see where the memory issues are and will get us a patch(es).
I don't expect anything to crash under valgrind since I expect this is a race condition in Asterisk that will not appear when running under Valgrind...



By: David Brillert (aragon) 2009-07-02 11:58:24

Uploading valgrind output from a few hours of testing.
There is lots of good data here since all flags were enabled in Valgrind.

There are two files uploaded of interest that I would be happy if a developer or admin could look at.
valgrindcrash07022009allflags.txt (Asterisk crashed under valgrind during reload)
valgrindnocrash08022009withflags.txt (Asterisk did not crash)

Asterisk 1.4 SVN is still crashing in moh files mode.



By: Russell Bryant (russell) 2009-07-02 23:31:33

Thanks for the updated backtrace and valgrind output.  Unfortunately, there isn't actually much in the valgrind output that we can do anything with.  The errors at the beginning happen at startup and are expected.  Then, the end is a memory allocation summary after Asterisk was stopped.

Mark and I have both put a lot of time into looking at the related code and trying to come up with what might be going wrong.  We haven't come up with anything more yet.  If we do, we will certainly post an update here.

By: David Brillert (aragon) 2009-07-03 07:48:07

Thanks for the feedback Russell

I'm on vacation next week so I will let the lab soak under Valgrind and hope for a crash to get some useful output.

Essentially there are two bugs that prevent me from upgrading to Asterisk 1.4.26
This bug 15109, and https://issues.asterisk.org/view.php?id=15396
While I wait for the moh crash it is hit or miss if moh causes the crash or whatever is happening in 15396.  It's probably asking a lot but I would appreciate it you or Mark could have a look at bug 15396 before I leave for my vacation.  It would be great to get the lab at least stable enough to concentrate on 15109.  Either bug causes at least two crashes per day in my lab under load tests.

By: David Brillert (aragon) 2009-07-03 15:20:23

Russell

I'm really confused.
You said the valgrind output is normal because Asterisk was stopped.
But Asterisk was not stopped, in fact it crashed and I caught the crash in Valgrind. I caught that data here https://issues.asterisk.org/file_download.php?file_id=23158&type=bug

I have been trying to find ways of reproducing the bug and I think I am close.  I'm pretty sure it is caused during a reload.  And now I am suspecting that this crash only involves musiconhold if an mp3 file is played back in files mode and not a wav file.  I have been trying to reproduce a crash with that hypothesis by removing all mp3 files from /var/lib/asterisk/moh/ and only using stock wav files.  With that test in place and no Valgrind running so I could more easily reproduce the problem I ran into two crashes today.
Neither of these crashes bt evidenced any crash caused by moh.
1. Another crash caused by the bug in 15396
2. A crash during a reload

This is the backtrace from 2.
(Apologies in advance to Leif for posting full bt in bug notes but the bt data is small and I fixed the formatting to remove hyperlinks...)

_0  0x00727827 in _dl_close_worker () from /lib/ld-linux.so.2
(gdb) bt
_0  0x00727827 in _dl_close_worker () from /lib/ld-linux.so.2
_1  0x00728487 in _dl_close () from /lib/ld-linux.so.2
_2  0x0087ece4 in dlclose_doit () from /lib/libdl.so.2
_3  0x00722e46 in _dl_catch_error () from /lib/ld-linux.so.2
_4  0x0087f2cc in _dlerror_run () from /lib/libdl.so.2
_5  0x0087ed1a in dlclose () from /lib/libdl.so.2
_6  0x080c1ff3 in load_dynamic_module ()
_7  0x080c338c in load_resource ()
_8  0x080c3dea in load_modules ()
_9  0x0807212c in main ()
(gdb) bt full
_0  0x00727827 in _dl_close_worker () from /lib/ld-linux.so.2
No symbol table info available.
_1  0x00728487 in _dl_close () from /lib/ld-linux.so.2
No symbol table info available.
_2  0x0087ece4 in dlclose_doit () from /lib/libdl.so.2
No symbol table info available.
_3  0x00722e46 in _dl_catch_error () from /lib/ld-linux.so.2
No symbol table info available.
_4  0x0087f2cc in _dlerror_run () from /lib/libdl.so.2
No symbol table info available.
_5  0x0087ed1a in dlclose () from /lib/libdl.so.2
No symbol table info available.
_6  0x080c1ff3 in load_dynamic_module ()
No symbol table info available.
_7  0x080c338c in load_resource ()
No symbol table info available.
_8  0x080c3dea in load_modules ()
No symbol table info available.
_9  0x0807212c in main ()
No symbol table info available.
(gdb) thread apply all bt

Thread 1 (process 15353):
_0  0x00727827 in _dl_close_worker () from /lib/ld-linux.so.2
_1  0x00728487 in _dl_close () from /lib/ld-linux.so.2
_2  0x0087ece4 in dlclose_doit () from /lib/libdl.so.2
_3  0x00722e46 in _dl_catch_error () from /lib/ld-linux.so.2
_4  0x0087f2cc in _dlerror_run () from /lib/libdl.so.2
_5  0x0087ed1a in dlclose () from /lib/libdl.so.2
_6  0x080c1ff3 in load_dynamic_module ()
_7  0x080c338c in load_resource ()
_8  0x080c3dea in load_modules ()
_9  0x0807212c in main ()

And this looks exactly like my Valgrind file.
So my hypothesis looks more provable and the crash/memory abort occurs during a reload (maybe moh is just incidental).  Some more days trying to reproduce the crash without mp3 (wav only) files in /var/lib/asterisk/moh/ I think will finish proving my hypothesis.

Does the valgrind info provide any clue why asterisk would crash during a normal reload?



By: Alan Graham (zerohalo) 2009-07-03 17:03:57

Our crashes happen without reloads, though I suspect reloads speed up the time until crash - and we're using .wav file-based moh classes as well - no mp3s at all.

By: Alan Graham (zerohalo) 2009-07-07 14:20:50

Is there anything I can do to help with this or help push it along? This is the only outstanding bug preventing upgrading past 1.4.20rc2.

By: David Brillert (aragon) 2009-07-08 08:32:33

I cant keep valgrind running long enough without a crash to continue debugging bug 15109.  I keep getting crashes caused by https://issues.asterisk.org/view.php?id=15396 Asterisk crashed under Valgrind on that bug report so I uploaded valgrind trace.  Then I started Valgrind/Asterisk again to hope for 15109 crash under Valgrind.

I hope a developer can look at https://issues.asterisk.org/view.php?id=15396 so I can continue to work on 15109 but to be honest 15396 is a much more serious bug.

By: Alan Graham (zerohalo) 2009-07-10 14:48:25

Here's a snip of console right up to the latest crash, if it helps... Crash happened as moh (files based) was stopped:

   -- Started music on hold, class 'xxxx', on SIP/XXXX-3334-4293-098231d0
 == Spawn extension (macro-checkmessage, s, 3) exited non-zero on 'SIP/XXXX-7662-09391090' in macro 'checkmessage'
 == Spawn extension (xxxx, *123, 5) exited non-zero on 'SIP/XXXX-7662-09391090'
   -- SIP/XXXX-1745-4300-097ab520 is ringing
 == Spawn extension (macro-vm, s, 5) exited non-zero on 'SIP/gateway2-093f6aa0' in macro 'vm'
 == Spawn extension (somecontext-main, s, 6) exited non-zero on 'SIP/gateway2-093f6aa0'
   -- Executing AppendCDRUserField("SIP/cam-g2-b57c4e50", "DST=XXXXXXXXXX|")
   -- Executing Set("SIP/gateway2-b57c4e50", "CDR(accountcode)=pppp")
   -- Executing SetMusicOnHold("SIP/gateway2-b57c4e50", "default")
   -- Executing Dial("SIP/gateway2-b57c4e50", "SIP/XXXX@XXXXXXXXXX|70|r")
   -- Called 2000@XXXXXXXXXX
   -- SIP/XXXXXXXXXX-09d72c38 is ringing
   -- Stopped music on hold on SIP/XXXX-3334-4293-098231d0
sfo-c1*CLI> *** glibc detected *** double free or corruption (out): 0xb5740fe8 ***

By: Alan Graham (zerohalo) 2009-07-10 15:33:25

[default]
mode=files
directory=/var/lib/asterisk/mohmp3

files in /var/lib/asterisk/mohmp3:

xxxx.wav:   RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
classical:   directory
customers:   directory
xxxx.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
xxxx.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
silence:     directory

By: David Brillert (aragon) 2009-07-13 12:00:37

Another crash immediately after restarting Asterisk to test a new revision.
The crash happens immediately after issuing an asterisk restart command.
I see this type of crash often and it appears to be a crash during a normal load.  The mpg123 error looks suspicious when * crashes...

[root@sip ~]# service asterisk restart
Shutting down asterisk:                                    [  OK  ]
Starting asterisk:                                         [  OK  ]
[root@sip ~]# /usr/sbin/safe_asterisk: line 125:  6208 Segmentation fault      (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} >&/dev/${TTY} < /dev/${TTY}
Asterisk ended with exit status 139
Asterisk exited on signal EXITSTATUS-128.
Automatically restarting Asterisk.
mpg123: no process killed

gdb bt uploaded as gdb07132009.txt

By: David Brillert (aragon) 2009-07-13 15:46:20

Running under Valgrind since last crash caused by 15396
No crash since the 8th but latest Valgrind trace upload has some musiconhold references in it.  I hope this will be the magic bullet since I don't think I will see a moh crash in valgrind...
valgrind13072009.txt uploaded

By: David Brillert (aragon) 2009-07-14 12:59:39

Russel/Mark

Have either of you had a chance to look at valgrind13072009.txt?
Sorry for the duress but I am very anxious to get a stable 1.4.26 release...
Currently I am forced to use 1.4.24.1

There are two records of interest in the valgrind13072009.txt file

==25462== 6,688 bytes in 19 blocks are still reachable in loss record 60 of 70
==25462==    at 0x4005A88: malloc (vg_replace_malloc.c:207)
==25462==    by 0x79058E: __fopen_internal (in /lib/libc-2.5.so)
==25462==    by 0x792B3B: fopen64 (in /lib/libc-2.5.so)
==25462==    by 0x80B09C6: ast_filehelper (in /usr/sbin/asterisk)
==25462==    by 0x80B1342: ast_openstream_full (in /usr/sbin/asterisk)
==25462==    by 0x46D2106: ast_moh_files_next (res_musiconhold.c:250)
==25462==    by 0x46D22CE: moh_files_readframe (res_musiconhold.c:275)
==25462==    by 0x46D2339: moh_files_generator (res_musiconhold.c:291)
==25462==    by 0x808B6CB: ast_read_generator_actions (in /usr/sbin/asterisk)
==25462==    by 0x808D039: __ast_read (in /usr/sbin/asterisk)
==25462==    by 0x808D2E1: ast_read (in /usr/sbin/asterisk)
==25462==    by 0x808B311: ast_waitfordigit_full (in /usr/sbin/asterisk)

==13644== 144 bytes in 1 blocks are possibly lost in loss record 21 of 70
==13644==    at 0x4004BAE: calloc (vg_replace_malloc.c:397)
==13644==    by 0x725D49: _dl_allocate_tls (in /lib/ld-2.5.so)
==13644==    by 0x8B2B92: pthread_create@@GLIBC_2.1 (in /lib/libpthread-2.5.so)
==13644==    by 0x8120E1A: ast_pthread_create_stack (in /usr/sbin/asterisk)
==13644==    by 0x80DBC69: ast_pbx_start (in /usr/sbin/asterisk)
==13644==    by 0x68279A8: local_call (chan_local.c:523)
==13644==    by 0x808FF36: ast_call (in /usr/sbin/asterisk)
==13644==    by 0x554F8BD: ring_entry (app_queue.c:1988)
==13644==    by 0x554FDA0: ring_one (app_queue.c:2073)
==13644==    by 0x5552CCE: try_calling (app_queue.c:2957)
==13644==    by 0x5557B61: queue_exec (app_queue.c:4086)
==13644==    by 0x80D563C: pbx_exec (in /usr/sbin/asterisk)

By: Russell Bryant (russell) 2009-07-14 13:31:54

Those are reports of what memory was allocated in res_musiconhold when Asterisk stopped running.  The purpose of that output from valgrind is to assist in finding memory leaks.  Unfortunately, it's not a record of invalid memory access that could cause a crash.  Sorry.  :-(

By: David Brillert (aragon) 2009-07-14 13:44:52

Thanks Russel, your response is relaxing even if it isn't good news...
What about bug https://issues.asterisk.org/view.php?id=15396
I have posted full valgrind crash output and all bt's to that report and I am super anxious to get some feedback?

Also would you say I am not at risk of crash if I use non-files mode moh and your previous commit?
I don't mind using madplay for moh playback...

By: Russell Bryant (russell) 2009-07-14 14:29:23

I can't say for sure, but this crash is in the files-mode MOH code.  So, if you didn't enable any files mode MOH, you shouldn't see this crash.

By: David Brillert (aragon) 2009-07-15 09:26:42

While waiting for crash in Valgrind I noticed some channels were hung for a very long time.  Then it became apparent that Asterisk wasn't processing calls any more.

7 days of Valgrind tests down the drain..

Should I open another ticket for the locks or will someone post a patch here so I can continue debugging 15109 ?
I'll post the full valgrind text to this ticket for now to maybe get the locks sorted out.

core show locks

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 109050768 (do_monitor           started at [16810] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 16479 sipsock_read &netlock 0x67c0d40 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 4757 find_call &p->lock 0x7d9f4d0 (1)
=== ---> Waiting for Lock #2 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 118102928 (pri_dchannel         started at [10045] chan_dahdi.c start_pri())
=== ---> Lock #0 (chan_dahdi.c): MUTEX 9022 pri_dchannel &pri->lock 0x68a9de4 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 144808848 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 155130768 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 167926672 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 170630032 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 155376528 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 157834128 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 147266448 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 154639248 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 153656208 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 169155472 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 154393488 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 170384272 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 159800208 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 156359568 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 156113808 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 159554448 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 168909712 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 149478288 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 160537488 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 157588368 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 166697872 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 157096848 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 166943632 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 169892752 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 161766288 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 145300368 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 159062928 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 158571408 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 158325648 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 156851088 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------
===
=== Thread ID: 161274768 (session_do           started at [ 2522] manager.c accept_thread())
=== ---> Lock #0 (manager.c): RDLOCK 2281 process_message &actionlock 0x81855e0 (1)
=== ---> Waiting for Lock #1 (db.c): MUTEX 151 ast_db_put &dblock 0x817b3a0 (1)
=== --- ---> Locked Here: db.c line 151 (ast_db_put)
=== -------------------------------------------------------------------



By: David Brillert (aragon) 2009-07-15 09:48:22

valgrind15072009.zip uploaded

By: David Brillert (aragon) 2009-07-15 09:57:33

I made ticket 15506 to open bug report on locks :-(

By: David Brillert (aragon) 2009-07-17 08:35:37

I have given up on Valgrind to debug this issue.
Instead I am testing revision 206273.
Also I have ceased using PRI spans for my tests and now I am using a SIP trunk between Asterisk servers and limiting the number of concurrent channels on those trunks to 30.  This has reduced the load under testing from 20-30 to about 5.

In a 24 hour period there has been no MOH crash and no crash as in bug 15396.
This leads me to believe that both issues are caused by locks induced by high CPU load.
The only trick now is for me to capture any existing locks to a log that exist EXACTLY before any segfault or I will not get any relevant data to debug.
Can someone help me with such a debug tool?  I'm looking for a diagnostic Asterisk patch...
With such a patch I can then control the CPU loads until I can reproduce the crashes and maybe then find root cause.



By: David Brillert (aragon) 2009-07-17 08:45:27

I'm going to try something like this and increase load on test servers.

watch -n 15 'asterisk -rx "core show locks" > /tmp/lock.`date +"%Y%m%d-%T"`'

By: David Brillert (aragon) 2009-07-20 08:09:18

I have not been able to reproduce a segfault under very high load using SVN revision 206273.  In fact there were no segfaults at all with this SVN release...
However, while capturing lock data for the last 72 hours I noticed a consistent lock message that occurs constantly.  Asterisk doesn't appear to lock or stop processing calls although there appears to be some delay in processing audio.
The most consistent lock is "(autoservice.c): MUTEX 89 autoservice_run &(&aslist)->lock 0x8179948 (1)".  That message appears in almost every core show locks output.

Is this lock info unusual or does it indicate any problems?


=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 3085396880 (do_devstate_changes  started at [  363] devicestate.c ast_device_state_engine_init())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1128 channel_find_locked &c->lock 0x8e771c8 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 3050437520 (autoservice_run      started at [  238] autoservice.c ast_autoservice_start())
=== ---> Lock #0 (autoservice.c): MUTEX 89 autoservice_run &(&aslist)->lock 0x8179948 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 2882485136 (pbx_thread           started at [ 2626] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 1493 ast_hangup &chan->lock 0x8e771c8 (1)
=== ---> Tried and failed to get Lock #1 (chan_local.c): MUTEX 575 local_hangup &p->chan->lock 0xa64db18 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 2890775440 (pbx_thread           started at [ 2626] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (autoservice.c): MUTEX 263 ast_autoservice_stop &(&aslist)->lock 0x8179948 (1)
=== --- ---> Locked Here: autoservice.c line 89 (autoservice_run)
=== -------------------------------------------------------------------
===
=== Thread ID: 2825239440 (pbx_thread           started at [ 2626] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (autoservice.c): MUTEX 263 ast_autoservice_stop &(&aslist)->lock 0x8179948 (1)
=== --- ---> Locked Here: autoservice.c line 89 (autoservice_run)
=== -------------------------------------------------------------------
===
=== Thread ID: 2926230416 (pbx_thread           started at [ 2626] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (autoservice.c): MUTEX 263 ast_autoservice_stop &(&aslist)->lock 0x8179948 (1)
=== --- ---> Locked Here: autoservice.c line 89 (autoservice_run)
=== -------------------------------------------------------------------
===
=== Thread ID: 2836298640 (pbx_thread           started at [ 2626] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (autoservice.c): MUTEX 263 ast_autoservice_stop &(&aslist)->lock 0x8179948 (1)
=== --- ---> Locked Here: autoservice.c line 89 (autoservice_run)
=== -------------------------------------------------------------------
===
=== Thread ID: 2837281680 (pbx_thread           started at [ 2626] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (autoservice.c): MUTEX 263 ast_autoservice_stop &(&aslist)->lock 0x8179948 (1)
=== --- ---> Locked Here: autoservice.c line 89 (autoservice_run)
=== -------------------------------------------------------------------

By: David Brillert (aragon) 2009-07-20 08:28:37

This lock info is for 15109

=== Thread ID: 3085396880 (do_devstate_changes started at [ 363] devicestate.c ast_device_state_engine_init())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1128 channel_find_locked &c->lock 0x8e771c8 (0)

This message is identical to a lock I found when creating ticket https://issues.asterisk.org/view.php?id=15411 (core show locks and thread apply all bt pid attached)
The difference is in 15411 the lock was fatal and that crash occurred on a production system in * 1.4.24
In 15109 core show lock output the lock fails and no crash...

Lock info from 15411
=== Thread ID: 3085810592 (do_devstate_changes started at [ 363] devicestate.c ast_device_state_engine_init())
=== ---> Lock #0 (pbx.c): RDLOCK 6151 ast_rdlock_contexts &conlock 0x8167a40 (2)
=== ---> Lock #1 (pbx.c): MUTEX 2034 ast_hint_state_changed &(&hints)->lock 0x8167c48 (1)
=== ---> Lock #2 (chan_sip.c): MUTEX 8843 cb_extensionstate &p->lock 0x8817610 (1)

Something is going on here... although maybe not relevant to 15109 (I have no idea if they are related, also I have no idea which change in code removes crashes in revision 206273)



By: David Brillert (aragon) 2009-07-21 11:15:56

Testing SVN revision 206273 the moh files based segfaults are gone.
Can anyone else test 206273 SVN and confirm?

I don't know if the locks are normal from my previous notes...

By: Alan Graham (zerohalo) 2009-07-21 18:25:32

So far, so good here on a r206273 test/production machine, but without stress testing and only about 16 hours of normal call load (around 60 calls concurrent on this machine).



By: David Brillert (aragon) 2009-07-22 08:05:11

I'm still gettings logs like this in valgrind.txt while trying to debug another 1.4.26 crash https://issues.asterisk.org/view.php?id=15544
Using r206273

This info looks different than previous valgrind output and I am not experiencing any MOH crashes without valgrind...


==12561== Thread 53:
==12561== Invalid read of size 4
==12561==    at 0x80B4B62: __frame_free (in /usr/sbin/asterisk)
==12561==    by 0x80B4CBA: ast_frame_free (in /usr/sbin/asterisk)
==12561==    by 0x46D4396: moh_files_generator (res_musiconhold.c:295)
==12561==    by 0x808B79B: ast_read_generator_actions (in /usr/sbin/asterisk)
==12561==    by 0x808D179: __ast_read (in /usr/sbin/asterisk)
==12561==    by 0x808D421: ast_read (in /usr/sbin/asterisk)
==12561==    by 0x8079BD0: autoservice_run (in /usr/sbin/asterisk)
==12561==    by 0x8121A6D: dummy_start (in /usr/sbin/asterisk)
==12561==    by 0x8B249A: start_thread (in /lib/libpthread-2.5.so)
==12561==    by 0x80942D: clone (in /lib/libc-2.5.so)
==12561==  Address 0x42386b8 is 400 bytes inside a block of size 872 free'd
==12561==    at 0x400562C: free (vg_replace_malloc.c:323)
==12561==    by 0x80728BA: __ast_free_region (in /usr/sbin/asterisk)
==12561==    by 0x80732AC: __ast_free (in /usr/sbin/asterisk)
==12561==    by 0x8074B19: ao2_ref (in /usr/sbin/asterisk)
==12561==    by 0x80B35CB: ast_filestream_frame_freed (in /usr/sbin/asterisk)
==12561==    by 0x80B4B5E: __frame_free (in /usr/sbin/asterisk)
==12561==    by 0x80B4CBA: ast_frame_free (in /usr/sbin/asterisk)
==12561==    by 0x46D4396: moh_files_generator (res_musiconhold.c:295)
==12561==    by 0x808B79B: ast_read_generator_actions (in /usr/sbin/asterisk)
==12561==    by 0x808D179: __ast_read (in /usr/sbin/asterisk)
==12561==    by 0x808D421: ast_read (in /usr/sbin/asterisk)
==12561==    by 0x8079BD0: autoservice_run (in /usr/sbin/asterisk)



By: David Brillert (aragon) 2009-07-23 09:39:30

after a valgrind restart and two days of testing no more logs of musiconhold memory issues in running valgrind.txt file (perhaps I forgot to delete the older valgrind log file before posting my previous note... but I'm not sure)
Of course under Valgrind there is no crash...

Maybe we are getting to the point where we can close this bug as fixed in SVN?
Anyone else have anything to report under SVN r206273?

By: Alan Graham (zerohalo) 2009-07-28 16:22:28

still no crashes here after 7 days - definitely have hope for this rev.

By: Miguel Molina (coolmig) 2009-07-28 16:31:06

That's very good news for newer versions to come. This is one of the bugs that made me downgrade my production server to 1.4.22... BTW, what was the fix?

By: David Brillert (aragon) 2009-07-29 09:05:35

I have some bad news.
I just saw another crash but this appears in my case related to app_queue and moh (and maybe ast_indicate).  ast_indicate could make sense since I had opened another ticket based on 1.4.24.1 that was closed because I could not reproduce under 1.4.26 SVN until now.
While debugging 15109 I opened 15396 and in that ticket I have valgrind crash traces and additional gdb backtraces.
Hopefully the info in 15396 logs can help solve both tickets.

If anyone wants to see the latest gdb output then
https://issues.asterisk.org/file_download.php?file_id=23373&type=bug

Like I said there is plenty of good valgrind (crash under valgrind) details in 15396



By: Raimund Sacherer (hatrix) 2009-07-31 06:36:36

Hi Aragon, I have as well issues with segfaults, very strange ones with app_queue, but my problem is I can not reproduce them in my lab, i am working the machine up, have lot's of logged in agents and create lot's of calls using sipp, but I believe the problem might be with some reload in combination with a call transfer, can you, if you would be so kind, tell me how you did tests regarding call-transfers? I have a nice script setup for agent registering, login logoff, accepting calls and making lot's of calls, but I do not know exactly how i would simulate a call transfer ...

thanks for helping
best

Ray

By: David Woolley (davidw) 2009-07-31 06:50:03

The original fault, here, was NOT a segfault.  It's signal 3, which is the result of the library code software deliberately declaring an error, whereas a segfault is signal 11, which is detected by the hardware, when the code accesses an address (in a way) that the address translation hardware thinks it should not.

By: David Brillert (aragon) 2009-07-31 09:25:55

I spent weeks trying to get a crash under Valgrind.  IMHO it is hopeless and none of my Valgrind output was helpful to the developers.

Maybe once tilghman's work here is finished I can get some memory output related to this crash that is useful (I'm holding out)
https://reviewboard.asterisk.org/r/309/

Sorry Hatrix I'm on vacation starting today...

By: Joel Vandal (jvandal) 2009-08-01 07:48:40

aragon, new test version is available as RPM package

By: Alan Graham (zerohalo) 2009-08-03 10:48:35

after about a week, crash in 1.4.26GA with slightly different backtrace, still in moh. bt attached.

By: Tilghman Lesher (tilghman) 2009-08-03 11:25:37

My work is now mostly done.  Try a checkout from:

http://svn.digium.com/svn/asterisk/team/tilghman/malloc_hold

You'll need to activate the extra memory debugging by entering 'make menuselect' and under "Compiler Flags - Development", enable "MALLOC_HOLD".

After the crash, upload the mmlog file in /var/log/asterisk to this issue.

By: David Brillert (aragon) 2009-08-03 13:36:38

Hi jvandal,

I am now using revision 209955 in lab tests. Also I have increased concurrent calls limit to increase likely frequency of crashes.
Can you provide me with an rpm based on tilghman's requirements for https://reviewboard.asterisk.org/r/309/ ?

By: Joel Vandal (jvandal) 2009-08-03 14:22:58

Hi aragon,

I will try but the net connection here is pretty slow. Else I will try to do it friday when I'm in Israel (sorry for offtopic)

By: David Brillert (aragon) 2009-08-04 17:07:50

jvandal

There is something wrong with your new test rpm.
After Asterisk crashes I am not able to get any useful gdb info from core dumps.

By: Leif Madsen (lmadsen) 2009-08-05 12:26:49

Changing the status of this issue to FEEDBACK while we wait for some feeback from a MALLOC_HOLD enabled system that appears to crash. Thanks!

By: David Brillert (aragon) 2009-08-11 09:26:44

i tried to install the malloc_hold checkout but couldn't get past menuselect

[root@sip malloc_hold]# make menuselect
**************************************************
*** Install ncurses to use the menu interface! ***
**************************************************
menuselect changes NOT saved!
[root@sip malloc_hold]# rpm -qa ncurses
ncurses-5.5-24.20060715
[root@sip malloc_hold]# yum update ncurses ncurses-devel
centos-os                                                | 1.1 kB     00:00
centos-updates                                           |  951 B     00:00
centos-extras                                            | 1.1 kB     00:00
Excluding Packages in global exclude list
Finished
Setting up Update Process
No Packages marked for Update

By: Tilghman Lesher (tilghman) 2009-08-11 10:13:44

Try "yum install ncurses-devel" instead.  Yum won't install, if you ask only for update.

By: David Brillert (aragon) 2009-08-11 10:19:11

rpm -qa ncurses ncurses-devel
ncurses-5.5-24.20060715
ncurses-devel-5.5-24.20060715

By: David Brillert (aragon) 2009-08-11 10:36:56

tilghman, don't worry about this for now.
I assume you have already tested the malloc_hold checkout and it installs properly for you with make menuselect

I'm going to work with jvandal to get this properly installed so I can resume testing.

By: David Brillert (aragon) 2009-08-12 13:30:00

tilghman, Asterisk malloc_hold checkout keeps segfaulting with this error during load

Cannot set memory protections 'PROT_NONE' (0) on 0xae7cf000: Cannot allocate memory
Aborted

By: Tilghman Lesher (tilghman) 2009-08-12 13:44:05

aragon:  that indicates that your KERNEL is running out of memory, which will cause lots of mysterious problems.  I would recommend increasing the amount of physical RAM in this machine or reducing the number of applications running, if you cannot increase the physical RAM.

By: David Brillert (aragon) 2009-08-12 13:51:30

I have 4 GB of installed memory and htop shows only 660 MB used when Asterisk is stopped.
The memory goes through the roof only when I start asterisk using asterisk -vvc
Asterisk aborts when memory consumption reaches about 1 GB.



By: David Brillert (aragon) 2009-08-12 14:40:02

(gdb) bt
#0  0x00b87402 in ?? ()
Cannot access memory at address 0xbff43040

By: David Brillert (aragon) 2009-08-12 15:21:22

I can't provide any more feedback since I cannot get malloc_hold version to start.
I am willing to provide tilghman remote access to my lab.

By: Amilcar S Silvestre (amilcar) 2009-08-12 18:15:21

I'm having the same segfaults described here. Always one of this two backtraces:

First box:

Core was generated by `/usr/sbin/asterisk -f -vvvg -c'.
Program terminated with signal 6, Aborted.
(gdb) bt
#0  0x00002abdd9fde215 in raise () from /lib64/libc.so.6
#1  0x00002abdd9fdfcc0 in abort () from /lib64/libc.so.6
#2  0x00002abdda0187fb in __libc_message () from /lib64/libc.so.6
#3  0x00002abdda01fce2 in _int_free () from /lib64/libc.so.6
#4  0x00002abdda02390c in free () from /lib64/libc.so.6
ASTERISK-1  0x00002abdda00ed0b in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
ASTERISK-2  0x00000000004554c8 in filestream_destructor (arg=<value optimized out>) at file.c:346
ASTERISK-3  0x0000000000427883 in ao2_ref (user_data=0x2aaadc0f4068, delta=22576) at astobj2.c:229
ASTERISK-4  0x0000000000458aba in ast_frame_free (frame=0x2aaadc0f40b8, cache=1) at frame.c:350
ASTERISK-5  0x00002aaaab6f2385 in moh_files_generator (chan=0x2aaad0042ca0, data=<value optimized out>, len=<value optimized out>, samples=<value optimized out>) at res_musiconhold.c:295
ASTERISK-6 0x000000000043d732 in ast_read_generator_actions (chan=0x2aaad0042ca0, f=0x53c1250) at channel.c:1997
ASTERISK-7 0x000000000043dde9 in __ast_read (chan=0x2aaad0042ca0, dropaudio=0) at channel.c:2476
ASTERISK-8 0x000000000043f51c in ast_channel_bridge (c0=0x2aaad0042ca0, c1=0x0, config=0x443df6d0, fo=0x443de318, rc=0x443de310) at channel.c:2510
ASTERISK-9 0x00002aaaaacbc6a5 in ast_bridge_call (chan=0x2aaad0042ca0, peer=0x545c2c0, config=0x443df6d0) at res_features.c:1806
ASTERISK-10 0x00002aaac22473b7 in dial_exec_full (chan=0x2aaad0042ca0, data=<value optimized out>, peerflags=0x443df990, continue_exec=0x0) at app_dial.c:1804
ASTERISK-11 0x00002aaac22489a5 in dial_exec (chan=0x109a, data=0x5830) at app_dial.c:1843
ASTERISK-12 0x00000000004801dd in pbx_extension_helper (c=0x2aaad0042ca0, con=<value optimized out>, context=0x2aaad0042ef0 "ivr-main", exten=0x2aaad0042f40 "s", priority=10,
   label=<value optimized out>, callerid=0x2aaad0042c30 "2197595646", action=E_SPAWN) at pbx.c:536
ASTERISK-13 0x0000000000482862 in __ast_pbx_run (c=0x2aaad0042ca0) at pbx.c:2283
ASTERISK-14 0x0000000000483729 in pbx_thread (data=0x109a) at pbx.c:2599
ASTERISK-15 0x00000000004aef9c in dummy_start (data=<value optimized out>) at utils.c:856
ASTERISK-16 0x00002abdd9900367 in start_thread () from /lib64/libpthread.so.0
ASTERISK-17 0x00002abdda080f7d in clone () from /lib64/libc.so.6


Another box, little bit different:

Program terminated with signal 6, Aborted.
#0  0x0000003629430215 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003629430215 in raise () from /lib64/libc.so.6
#1  0x0000003629431cc0 in abort () from /lib64/libc.so.6
#2  0x000000362946a7fb in __libc_message () from /lib64/libc.so.6
#3  0x0000003629471ce2 in _int_free () from /lib64/libc.so.6
#4  0x000000362947590c in free () from /lib64/libc.so.6
ASTERISK-1  0x0000003629460d0b in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
ASTERISK-2  0x00000000004554c8 in filestream_destructor (arg=<value optimized out>) at file.c:346
ASTERISK-3  0x0000000000427883 in ao2_ref (user_data=0x2aaae4029e58, delta=29153) at astobj2.c:229
ASTERISK-4  0x0000000000458aba in ast_frame_free (frame=0x2aaae4029ea8, cache=1) at frame.c:350
ASTERISK-5  0x00002aaaab90a385 in moh_files_generator (chan=0x98dfe40, data=<value optimized out>, len=<value optimized out>, samples=<value optimized out>) at res_musiconhold.c:295
ASTERISK-6 0x000000000043d732 in ast_read_generator_actions (chan=0x98dfe40, f=0x2aaaac03a4e0) at channel.c:1997
ASTERISK-7 0x000000000043dde9 in __ast_read (chan=0x98dfe40, dropaudio=0) at channel.c:2476
ASTERISK-8 0x0000000000429dec in autoservice_run (ign=<value optimized out>) at autoservice.c:121
ASTERISK-9 0x00000000004aef9c in dummy_start (data=<value optimized out>) at utils.c:856
ASTERISK-10 0x000000362a006367 in start_thread () from /lib64/libpthread.so.0
ASTERISK-11 0x00000036294d2f7d in clone () from /lib64/libc.so.6

All the boxes war very stable in 1.4.22, just changed to 1.4.26 (the problem seems to occur 14.23.2+).

By: Bereterbide Marcelo (marhbere) 2009-08-14 13:06:18

I'll try to add me experience.

We run 1.6.0 SVN now r209896, and we are having this same issue:
Always ocurr when the member on the queue does a tranfer and then crash:

#0  0x00633416 in __kernel_vsyscall ()
#1  0x00348460 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x00349e28 in abort () at abort.c:88
#3  0x00385fed in __libc_message (do_abort=2, fmt=0x460e68 "*** glibc detected *** %s: %s: 0x%s ***\n")
   at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
#4  0x0038c3a4 in malloc_printerr (action=2, str=0x460f40 "double free or corruption (out)", ptr=0xb72fb600) at malloc.c:5994
ASTERISK-1  0x0038e356 in __libc_free (mem=0xb72fb600) at malloc.c:3625
ASTERISK-2  0x0037bda4 in _IO_new_fclose (fp=0xb72fb600) at iofclose.c:88
ASTERISK-3  0x080d5aa5 in filestream_destructor (arg=0xb74e93d8) at file.c:329
ASTERISK-4  0x08080640 in ao2_ref (user_data=0xb74e93d8, delta=-1) at astobj2.c:227
ASTERISK-5  0x080d873b in ast_filestream_frame_freed (fr=0xb74e9408) at file.c:1309
ASTERISK-6 0x080d9bd0 in __frame_free (fr=0xb74e9408, cache=1) at frame.c:338
ASTERISK-7 0x080d9d2e in ast_frame_free (frame=0xb74e9408, cache=1) at frame.c:380
ASTERISK-8 0x01acf6e5 in moh_files_generator (chan=0xb4425cd8, data=0xb74e6130, len=0, samples=160) at res_musiconhold.c:302
ASTERISK-9 0x08096f95 in generator_force (data=0xb4425cd8) at channel.c:1930
ASTERISK-10 0x080987bf in __ast_read (chan=0xb4425cd8, dropaudio=0) at channel.c:2647
ASTERISK-11 0x0809a1e5 in ast_read (chan=0xb4425cd8) at channel.c:3021
ASTERISK-12 0x080a03af in ast_generic_bridge (c0=0xb4425cd8, c1=0xb465d818, config=0xb5943540, fo=0xb5941a5c, rc=0xb5941a58, bridge_end=
     {tv_sec = 0, tv_usec = 0}) at channel.c:4749
ASTERISK-13 0x080a1de3 in ast_channel_bridge (c0=0xb4425cd8, c1=0xb465d818, config=0xb5943540, fo=0xb5941a5c, rc=0xb5941a58)
   at channel.c:5120
ASTERISK-14 0x080cd3ac in ast_bridge_call (chan=0xb4425cd8, peer=0xb465d818, config=0xb5943540) at features.c:2330
ASTERISK-15 0x00ada40b in try_calling (qe=0xb5943a70, options=0xb59439be "", announceoverride=0xb59439c0 "", url=0xb59439bf "",
   tries=0xb5943c34, noption=0xb5943c30, agi=0x0, macro=0x0, gosub=0x0, ringing=0) at app_queue.c:3956
ASTERISK-16 0x00ade233 in queue_exec (chan=0xb4425cd8, data=0xb5945e38) at app_queue.c:4879
ASTERISK-17 0x080ff85c in pbx_exec (c=0xb4425cd8, app=0xb7df9058, data=0xb5945e38) at pbx.c:951
ASTERISK-18 0x08106afe in pbx_extension_helper (c=0xb4425cd8, con=0x0, context=0xb4425f2c "FunctionRouteQueue", exten=0xb4425f7c "s",
   priority=6, label=0x0, callerid=0xb7407b08 "13475735805", action=E_SPAWN, found=0xb594827c, combined_find_spawn=1)
   at pbx.c:3120
ASTERISK-19 0x081085d4 in ast_spawn_extension (c=0xb4425cd8, context=0xb4425f2c "FunctionRouteQueue", exten=0xb4425f7c "s", priority=6,
   callerid=0xb7407b08 "13475735805", found=0xb594827c, combined_find_spawn=1) at pbx.c:3584
ASTERISK-20 0x08108c86 in __ast_pbx_run (c=0xb4425cd8, args=0x0) at pbx.c:3671
ASTERISK-21 0x08109e9d in pbx_thread (data=0xb4425cd8) at pbx.c:3944
ASTERISK-22 0x0815ac0c in dummy_start (data=0xb46a0cb0) at utils.c:861
ASTERISK-23 0x004cb51f in start_thread (arg=0xb5948b90) at pthread_create.c:297
ASTERISK-24 0x0040104e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

Before I really understand that happend, I had made a report: https://issues.asterisk.org/view.php?id=15460

Aragon, for resolve "No symbol table info available." you can try apply: debuginfo-install coreutils. We have FC9

Sorry for my intrusion and for my bad language.

By: Russell Bryant (russell) 2009-08-17 09:15:55

For those that mentioned that this seems to happen when transfers occur, can you specify what type of transfer is being used?  Is it a SIP transfer, or a DTMF transfer?  blind or attended?

By: David Brillert (aragon) 2009-08-17 09:22:01

In my lab tests I issue a DTMF blind transfer.

However where this crash first occurred my customer does not use any DTMF transfers. They use SIP transfers.  My lab isn't configured to do any SIP transfers during tests.



By: Alan Graham (zerohalo) 2009-08-17 10:11:33

SIP attended here, with and without queue involvement.

By: David Brillert (aragon) 2009-08-17 10:20:47

Just saw another crash and the backtrace shows the abort occuring in DTMF blind transfer.

Asterisk SVN-branch-1.4-r211807

Core was generated by `/usr/sbin/asterisk -f -vvvg -c'.
Program terminated with signal 6, Aborted.
[New process 5054]
#0  0x00d0e402 in __kernel_vsyscall ()
(gdb) bt
#0  0x00d0e402 in __kernel_vsyscall ()
#1  0x00760d80 in raise () from /lib/libc.so.6
#2  0x00762691 in abort () from /lib/libc.so.6
#3  0x0079924b in __libc_message () from /lib/libc.so.6
#4  0x007a10f1 in _int_free () from /lib/libc.so.6
ASTERISK-1  0x007a4bc0 in free () from /lib/libc.so.6
ASTERISK-2  0x0078fc56 in fclose@@GLIBC_2.1 () from /lib/libc.so.6
ASTERISK-3  0x080739c9 in __ast_fdleak_fclose (ptr=0x9f172e8) at astfd.c:201
ASTERISK-4  0x080b1d92 in filestream_destructor (arg=0xa167b28) at file.c:346
ASTERISK-5  0x0807658c in ao2_ref (user_data=0xa167b28, delta=-1) at astobj2.c:229
ASTERISK-6 0x080b339d in ast_closestream (f=0xa167b28) at file.c:908
ASTERISK-7 0x00c51f25 in local_ast_moh_stop (chan=0xb5ee62f0)
   at res_musiconhold.c:1059
ASTERISK-8 0x08096654 in ast_moh_stop (chan=0xb5ee62f0) at channel.c:4845
ASTERISK-9 0x0036ba34 in sip_indicate (ast=0xb5ee62f0, condition=17, data=0x0,
   datalen=0) at chan_sip.c:4097
ASTERISK-10 0x0808eab4 in ast_indicate_data (chan=0xb5ee62f0, _condition=17, data=0x0,
   datalen=0) at channel.c:2580
ASTERISK-11 0x0808e97a in ast_indicate (chan=0xb5ee62f0, condition=17)
   at channel.c:2525
ASTERISK-12 0x001eb534 in finishup (chan=0xb5ee62f0) at res_features.c:760
ASTERISK-13 0x001eb94f in builtin_blindtransfer (chan=0xb5ee62f0, peer=0xa0b42d0,
   config=0xb651e888, code=0xb651cea0 "*1", sense=2, data=0x0)
   at res_features.c:828
ASTERISK-14 0x001ed278 in feature_interpret_helper (chan=0xb5ee62f0, peer=0xa0b42d0,
   config=0xb651e888, code=0xb651cea0 "*1", sense=2,
   dynamic_features_buf=0xb651ca9c "atxfer#blindxfer#automon#automixmon#hookflash#supervisor#conference#tagcode-1000#tagcode-1001#tagcode-1002#tagcode-1003#tagcode", features=0xb651ca98, operation=1, feature=0xb651c8bc)
   at res_features.c:1257
ASTERISK-15 0x001edb2e in feature_interpret (chan=0xb5ee62f0, peer=0xa0b42d0,
   config=0xb651e888, code=0xb651cea0 "*1", sense=2) at res_features.c:1341
ASTERISK-16 0x001f0127 in ast_bridge_call (chan=0xb5ee62f0, peer=0xa0b42d0,
   config=0xb651e888) at res_features.c:1940
ASTERISK-17 0x00608436 in try_calling (qe=0xb651ebe0, options=0xb651eb8a "",
   announceoverride=0xb651eb8c "", url=0xb651eb8b "", tries=0xb651ed80,
   noption=0xb651ed7c, agi=0xb651eb90 "") at app_queue.c:3229
ASTERISK-18 0x0060ba4a in queue_exec (chan=0xb5ee62f0, data=0xb6520f08)
   at app_queue.c:4077
ASTERISK-19 0x080d7005 in pbx_exec (c=0xb5ee62f0, app=0x9f99a38, data=0xb6520f08)
   at pbx.c:536
ASTERISK-20 0x080dad7e in pbx_extension_helper (c=0xb5ee62f0, con=0x0,
   context=0xb5ee6530 "default-application-acd-irs-new-english",
   exten=0xb5ee6580 "s", priority=13, label=0x0, callerid=0x9f76ec8 "1040",
   action=E_SPAWN) at pbx.c:1863
ASTERISK-21 0x080dbfc6 in ast_spawn_extension (c=0xb5ee62f0,
   context=0xb5ee6530 "default-application-acd-customer-new-english",
   exten=0xb5ee6580 "s", priority=13, callerid=0x9f76ec8 "1040") at pbx.c:2283
ASTERISK-22 0x080dc426 in __ast_pbx_run (c=0xb5ee62f0) at pbx.c:2373
ASTERISK-23 0x080dd435 in pbx_thread (data=0xb5ee62f0) at pbx.c:2599
ASTERISK-24 0x081230d6 in dummy_start (data=0xb5622820) at utils.c:856
ASTERISK-25 0x008b249b in start_thread () from /lib/libpthread.so.0
ASTERISK-26 0x0080942e in clone () from /lib/libc.so.6

By: Amilcar S Silvestre (amilcar) 2009-08-17 11:52:35

In my case, the transfers are dtmf attended (mostly).

By: David Brillert (aragon) 2009-08-18 13:42:55

I think we are dealing with a memory leak caused by reloads while files based moh is in use. While I watch the system with htop installed I can see roughly 100MB of RAM is leaked on my test setup per hour.  As the memory reaches critical the swap reaches 2GB.
As I said in the past I could not get Asterisk to crash under valgrind.  I think this has to do with the swap file and the large amount of RAM available.
While in valgrind the memory has more time to free... and we avoid the memory abort.

My server had 4GB of ram installed so I brought this figure down to get a crash with Valgrind.
Anyway, long story short: Asterisk locked before it could crash so I am uploading valgrind18092009.txt
Memory usage reached 950/1010 while capturing this output.  I hope the details of the leak are captured in this text file...  I'm running another valgrind setup in hopes of catching this in a crash.

valgrind was running asterisk r 211807



By: Bereterbide Marcelo (marhbere) 2009-08-18 15:24:44

rusell, In our case is SIP attended transfers, and our scenario is:

SIP -> QUEUE (Members are LOCAL without '/n' and Dynamics) -> SIP (Attended Transfer) -> SIP.
In 24hs where were made 75 attended transfer, only 1 of these has produced the crash.

This is last log before crash:

[2009-08-15 17:29:04.717] DEBUG[22402] chan_sip.c: Trying to put 'SIP/2.0 20' onto UDP socket destined for 10.240.18.189:5060
[2009-08-15 17:29:04.717] DEBUG[22402] channel.c: Planning to masquerade channel SIP/EXT-THSMSW-b7577a10 into the structure of SIP/325-b75ca530
[2009-08-15 17:29:04.718] DEBUG[22402] channel.c: Done planning to masquerade channel SIP/EXT-THSMSW-b7577a10 into the structure of SIP/325-b75ca530
[2009-08-15 17:29:04.718] DEBUG[22402] chan_sip.c: Trying to put 'NOTIFY sip' onto UDP socket destined for 10.240.18.189:5060
[2009-08-15 17:29:04.718] DEBUG[22402] chan_sip.c: SIP attended transfer: Unlocking channel SIP/325-b75ca530
[2009-08-15 17:29:04.718] DEBUG[4109] channel.c: Got clone lock for masquerade on 'SIP/EXT-THSMSW-b7577a10' at 0x91c1c3c
[2009-08-15 17:29:04.718] DEBUG[4109] chan_sip.c: Hangup call SIP/EXT-THSMSW-b7577a10<MASQ>, SIP callid 12f5f432-d97c1df6@10.240.18.189
[2009-08-15 17:29:04.718] DEBUG[4109] chan_sip.c: Trying to put 'SIP/2.0 60' onto UDP socket destined for 10.240.18.189:5060
[2009-08-15 17:29:04.718] DEBUG[4109] res_config_mysql.c: MySQL RealTime: Connection okay.
[2009-08-15 17:29:04.718] DEBUG[4109] res_config_mysql.c: MySQL RealTime: Insert SQL: INSERT into PBX_QDR (time, callid, queuename, agent, event, data) values ('1250357344', 'ippbx01v2.xxxxxxxxxx.net-1250357236.24456', 'qTHS-CustomerService', '6502', 'TRANSFER', 's|FunctionRouteChannel|15|60')
[2009-08-15 17:29:04.744] DEBUG[22402] chan_sip.c: Stopping retransmission on '3688fc1b714ee6962f9b011e50ad76ee@10.240.10.54' of Request 103: Match Found
[2009-08-15 17:29:04.744] DEBUG[22402] chan_sip.c: Got 200 OK on NOTIFY for transfer
[2009-08-15 17:29:04.756] DEBUG[22402] chan_sip.c: Stopping retransmission on '12f5f432-d97c1df6@10.240.18.189' of Response 102: Match Found
[2009-08-15 17:29:04.841] DEBUG[4109] res_config_mysql.c: MySQL RealTime: row inserted on table: PBX_QDR, id: 894502
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Set channel SIP/EXT-THSMSW-b7577a10 to write format ulaw
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Set channel SIP/EXT-THSMSW-b7577a10 to read format ulaw
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Putting channel SIP/EXT-THSMSW-b7577a10 in 4/4 formats
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Driver for channel 'SIP/EXT-THSMSW-b7577a10' does not support indication 3, emulating it
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Set channel SIP/EXT-THSMSW-b7577a10 to write format slin
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Scheduling timer at 160 sample intervals
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Released clone lock on 'SIP/325-b75ca530<ZOMBIE>'
[2009-08-15 17:29:04.841] DEBUG[4109] channel.c: Done Masquerading SIP/EXT-THSMSW-b7577a10 (6)
[2009-08-15 17:29:11.987] VERBOSE[4139] logger.c: [2009-08-15 17:29:11.987]  Asterisk Event Logger Started /var/log/asterisk/event_log

The gdb bt:
#0  0x00c36416 in __kernel_vsyscall ()
#1  0x00348460 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x00349e28 in abort () at abort.c:88
#3  0x00385fed in __libc_message (do_abort=2, fmt=0x460e68 "*** glibc detected *** %s: %s: 0x%s ***\n")
   at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
#4  0x0038c3a4 in malloc_printerr (action=2, str=0x460f40 "double free or corruption (out)", ptr=0xb635a138) at malloc.c:5994
ASTERISK-1  0x0038e356 in __libc_free (mem=0xb635a138) at malloc.c:3625
ASTERISK-2  0x0037bda4 in _IO_new_fclose (fp=0xb635a138) at iofclose.c:88
ASTERISK-3  0x080d5aa5 in filestream_destructor (arg=0xb6396390) at file.c:329
ASTERISK-4  0x08080640 in ao2_ref (user_data=0xb6396390, delta=-1) at astobj2.c:227
ASTERISK-5  0x080d873b in ast_filestream_frame_freed (fr=0xb63963c0) at file.c:1309
ASTERISK-6 0x080d9bd0 in __frame_free (fr=0xb63963c0, cache=1) at frame.c:338
ASTERISK-7 0x080d9d2e in ast_frame_free (frame=0xb63963c0, cache=1) at frame.c:380
ASTERISK-8 0x01dff6e5 in moh_files_generator (chan=0x91c1bb0, data=0xb5b9fad8, len=0, samples=160) at res_musiconhold.c:302
ASTERISK-9 0x08096f95 in generator_force (data=0x91c1bb0) at channel.c:1930
ASTERISK-10 0x080987bf in __ast_read (chan=0x91c1bb0, dropaudio=0) at channel.c:2647
ASTERISK-11 0x0809a1e5 in ast_read (chan=0x91c1bb0) at channel.c:3021
ASTERISK-12 0x080a03af in ast_generic_bridge (c0=0x91c1bb0, c1=0xb733ed30, config=0xb5cce540, fo=0xb5ccca5c, rc=0xb5ccca58, bridge_end=
     {tv_sec = 0, tv_usec = 0}) at channel.c:4749
ASTERISK-13 0x080a1de3 in ast_channel_bridge (c0=0x91c1bb0, c1=0xb733ed30, config=0xb5cce540, fo=0xb5ccca5c, rc=0xb5ccca58)
   at channel.c:5120
ASTERISK-14 0x080cd3ac in ast_bridge_call (chan=0x91c1bb0, peer=0xb733ed30, config=0xb5cce540) at features.c:2330
ASTERISK-15 0x00c1640b in try_calling (qe=0xb5ccea70, options=0xb5cce9c8 "", announceoverride=0xb5cce9ca "", url=0xb5cce9c9 "",
   tries=0xb5ccec34, noption=0xb5ccec30, agi=0x0, macro=0x0, gosub=0x0, ringing=0) at app_queue.c:3956
ASTERISK-16 0x00c1a233 in queue_exec (chan=0x91c1bb0, data=0xb5cd0e38) at app_queue.c:4879
ASTERISK-17 0x080ff85c in pbx_exec (c=0x91c1bb0, app=0xb7e76c10, data=0xb5cd0e38) at pbx.c:951
ASTERISK-18 0x08106afe in pbx_extension_helper (c=0x91c1bb0, con=0x0, context=0x91c1e04 "FunctionRouteQueue", exten=0x91c1e54 "s",
   priority=6, label=0x0, callerid=0xb7553b10 "18133258149", action=E_SPAWN, found=0xb5cd327c, combined_find_spawn=1)
   at pbx.c:3120
ASTERISK-19 0x081085d4 in ast_spawn_extension (c=0x91c1bb0, context=0x91c1e04 "FunctionRouteQueue", exten=0x91c1e54 "s", priority=6,
   callerid=0xb7553b10 "18133258149", found=0xb5cd327c, combined_find_spawn=1) at pbx.c:3584
ASTERISK-20 0x08108c86 in __ast_pbx_run (c=0x91c1bb0, args=0x0) at pbx.c:3671
ASTERISK-21 0x08109e9d in pbx_thread (data=0x91c1bb0) at pbx.c:3944
ASTERISK-22 0x0815ac0c in dummy_start (data=0x924b5d0) at utils.c:861
ASTERISK-23 0x004cb51f in start_thread (arg=0xb5cd3b90) at pthread_create.c:297
ASTERISK-24 0x0040104e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

To resolve to find the real problem, we tried to establish MALLOC_CHECK_ to 1, but is not working. the system crashed again and has occurred without causing what is said in "man malloc". Why could this have happened?

Reg

By: Amilcar S Silvestre (amilcar) 2009-08-18 20:27:00

We have this crash in 7 or 8 boxes running 1.4.26. Some are using app_queue with static members based on chan_agent (agentcallback), others are using queues with dynamic SIP members (interface SIP directly, not using chan_local).

Aragon, none of this boxes had a single reload of res_musiconhold.so before the crash or in any other moment.

edit: the crash seems to always happen in the moment of a transfer (sip or dtmf), but only very few of them produces the crash.



By: David Brillert (aragon) 2009-08-19 07:54:20

Here it is, the long awaited valgrind output after a crash.
valgrindcrash19082009asterisk1.4.24.1official.txt

I downgraded to Asterisk 1.4.24.1 since the svn revisions were less likely to crash.
Decreasing hardware RAM to 1GB did the trick to reproduce quickly (within one day in test lab).

My tests include DTMF transfers and a scripted reload which reloads every 5 minutes.

By: David Brillert (aragon) 2009-08-19 08:10:53

For amilcar... I'll redo the tests with the reload script removed.  But the valgrind capture is still valid.
It will be interesting to see if the reloads accelerate the crash.

By: Russell Bryant (russell) 2009-08-25 14:49:02

I have attached a patch, mp3_filestream_corruption.rev1.txt, which I am hoping will resolve this, and a whole bunch of other crashes that have been reported.

This patch is for the Asterisk-addons 1.4 version of format_mp3.

A bug in format_mp3 made it so the internals of an ast_filestream could get corrupted.  Since the corruption happened within valid memory, it is a bug that valgrind would never notice.  That would explain why we have had so much trouble getting help from valgrind on this issue.

If you're having trouble with crashes within a filestream destructor (or really, anything filestream related at this point), and you are using format_mp3, please try this patch.

Thanks!

By: Miguel Molina (coolmig) 2009-08-25 15:06:36

Russell and aragon, thanks a lot for the efforts you are making to solve this issue. Just a little question: could this ast_filestream corruption also affect systems that have format_mp3.so loaded but never use it to play MoH or files?

By: D KULL (kulldominique) 2009-08-25 15:28:47

yes, that would be interesting to know, I think I've experienced the issue with both situations with and without MP3 files being used.

By: Richard Odekerken (rgj) 2009-08-25 15:38:29

Same thing here in 0015616 and 0015719 - we're not using MP3 but format_mp3.so is loaded. We'll get rid of it and keep you posted.

By: Bereterbide Marcelo (marhbere) 2009-08-25 15:56:01

Russell, is a great news; I want try applied but I need know if the patch can be aply to addons-1.6.0-SVN. Thanks for advise!

By: Alan Graham (zerohalo) 2009-08-25 16:37:27

we're not loading format_mp3 at all; digging through the old cores/backtraces to demonstrate this.

By: Russell Bryant (russell) 2009-08-25 16:37:39

I just tried it and this patch will apply to asterisk-addons 1.6.0, as well.

By: David Brillert (aragon) 2009-08-25 18:18:43

I have patched a 1.4.24 version in my lab.
I'll update bug notes in a few days or sooner if it crashes

By: Amilcar S Silvestre (amilcar) 2009-08-25 20:17:26

We don't use any MP3 files in any boxes, and all of them crashes with exactly the same coredump. But we do have format_mp3.so module loaded. Can this cause the crash?

By: Russell Bryant (russell) 2009-08-25 20:59:10

If format_mp3 is not in use, then this fix shouldn't make a difference.  So, if people not using MP3s are seeing this same backtrace, then we haven't gotten to the bottom of it, yet.  :-(

However, if you are using format_mp3, the symptoms of the problem fixed by the patch are exactly what is shown in these backtraces.

We may have multiple bugs on our hands causing the same thing.  I'm not sure.  We will continue looking for more things to fix.

In the mean time, testing is much appreciated.  Just post your results here.

Thanks to everyone for your help and feedback!

By: David Brillert (aragon) 2009-08-26 07:07:16

Overnight results are in:
Memory leak is gone and no crashes so far.
In fact CPU load average is much lower than before the patch to asterisk-addons
I think it needs at least another day to soak.

By: David Brillert (aragon) 2009-08-27 08:08:04

Second day of testing and no crash. If there is no crash before 21/8/2009 I think this patch can definitely be committed.

By: D KULL (kulldominique) 2009-08-27 08:11:00

> If format_mp3 is not in use, then this fix shouldn't make a difference. So, if people not using MP3s are seeing this same backtrace, then we haven't gotten to the bottom of it, yet. :-(

When you say not in use, not loaded or not actually being used?

By: Bereterbide Marcelo (marhbere) 2009-08-27 10:32:04

I used the patch, and after 8hs we have had core dump:

(gdb) bt full
#0  0x001ad51d in _IO_seekoff_unlocked (fp=0xb58b4640, offset=0, dir=1, mode=0) at ioseekoff.c:71
No locals.
#1  0x001ac067 in _IO_ftell (fp=0xb58b4640) at ioftell.c:41
       pos = <value optimized out>
#2  0x08316ff8 in __i686.get_pc_thunk.bx () from /usr/lib/asterisk/modules/format_mp3.so
No symbol table info available.
#3  0xb58b4640 in ?? ()
No symbol table info available.
#4  0x08000020 in ?? ()
No symbol table info available.
ASTERISK-1  0xb4881348 in ?? ()
No symbol table info available.
ASTERISK-2  0x0831719f in __i686.get_pc_thunk.bx () from /usr/lib/asterisk/modules/format_mp3.so
No symbol table info available.
ASTERISK-3  0xb62b6d20 in ?? ()
No symbol table info available.
ASTERISK-4  0x00000000 in ?? ()
No symbol table info available.
(gdb)

Last part log before crash:
"
[2009-08-26 22:03:04.816] VERBOSE[30612] logger.c: [2009-08-26 22:03:04.816]     -- SIP/325-b5360da0 is ringing
[2009-08-26 22:03:04.816] DEBUG[30612] rtp.c: Setting early bridge SDP of 'SIP/329-0a2ab9e0' with that of 'SIP/325-b5360da0'
[2009-08-26 22:03:04.816] DEBUG[30612] chan_sip.c: Trying to put 'SIP/2.0 18' onto UDP socket destined for 10.240.18.186:5060
[2009-08-26 22:03:04.962] DEBUG[24659] audiohook.c: Failed to get 160 samples from read factory 0xa2360f8
[2009-08-26 22:03:04.982] DEBUG[24659] audiohook.c: Failed to get 160 samples from read factory 0xa2360f8
[2009-08-26 22:03:05.082] DEBUG[8514] rtp.c: Got RTCP report of 116 bytes
[2009-08-26 22:03:05.386] DEBUG[13057] chan_sip.c: Attended transfer: Will use Replace-Call-ID : 63635032-269ef602@10.240.18.186 (No check of from/to tags)
[2009-08-26 22:03:05.386] DEBUG[13057] chan_sip.c: Trying to put 'SIP/2.0 20' onto UDP socket destined for 10.240.18.186:5060
[2009-08-26 22:03:05.386] DEBUG[13057] channel.c: Planning to masquerade channel SIP/EXT-THSMSW-b7134fe8 into the structure of SIP/329-0a2ab9e0
[2009-08-26 22:03:05.386] DEBUG[13057] channel.c: Done planning to masquerade channel SIP/EXT-THSMSW-b7134fe8 into the structure of SIP/329-0a2ab9e0
[2009-08-26 22:03:05.386] DEBUG[13057] chan_sip.c: Trying to put 'NOTIFY sip' onto UDP socket destined for 10.240.18.186:5060
[2009-08-26 22:03:05.386] DEBUG[13057] chan_sip.c: SIP attended transfer: Unlocking channel SIP/329-0a2ab9e0
[2009-08-26 22:03:05.386] DEBUG[30612] channel.c: Got clone lock for masquerade on 'SIP/EXT-THSMSW-b7134fe8' at 0xb71a3e9c
[2009-08-26 22:03:05.386] DEBUG[30612] chan_sip.c: Hangup call SIP/EXT-THSMSW-b7134fe8<MASQ>, SIP callid 63635032-269ef602@10.240.18.186
[2009-08-26 22:03:05.386] DEBUG[30612] chan_sip.c: AST hangup cause 16 (no match found in SIP)
[2009-08-26 22:03:05.386] DEBUG[30612] chan_sip.c: Trying to put 'SIP/2.0 60' onto UDP socket destined for 10.240.18.186:5060
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Set channel SIP/EXT-THSMSW-b7134fe8 to write format ulaw
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Set channel SIP/EXT-THSMSW-b7134fe8 to read format ulaw
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Putting channel SIP/EXT-THSMSW-b7134fe8 in 4/4 formats
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Driver for channel 'SIP/EXT-THSMSW-b7134fe8' does not support indication 3, emulating it
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Set channel SIP/EXT-THSMSW-b7134fe8 to write format slin
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Scheduling timer at 160 sample intervals
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Released clone lock on 'SIP/329-0a2ab9e0<ZOMBIE>'
[2009-08-26 22:03:05.387] DEBUG[30612] channel.c: Done Masquerading SIP/EXT-THSMSW-b7134fe8 (6)
[2009-08-26 22:03:12.507] VERBOSE[31222] logger.c: [2009-08-26 22:03:12.507]  Asterisk Event Logger Started /var/log/asterisk/event_log
"
Could this be something to about patch?
what do you think?

By: Russell Bryant (russell) 2009-08-27 22:15:08

kulldominique, I mean not being used.  Being loaded would not matter if you never played MP3 files.

marhbere, I can't get anything from that backtrace.  Did you compile with DONT_OPTIMIZE enabled?

All, if you are reporting that are having problems and know that you are _not_ using format_mp3, can you explicitly verify that you are seeing the same backtrace as the one posted in the summary of this bug?

A number of crashes have come up here.  Some related, and some are separate issues.  A lot of times when we get long standing bugs like this, a single bug report tends to morph into multiple, sometimes.  My immediate concern with this bug report is fixing the crash shown in the backtrace in the summary (which has been seen a bunch of times, and in a number of other bug reports).  There may be other issues that we will need to extract and handle as separate reports.

By: David Brillert (aragon) 2009-08-28 10:09:09

Still no crashes after three days and 50 calls per second on 1.4.24

By: Alan Graham (zerohalo) 2009-08-28 10:21:14

1.4.24.1 crash, bt attached -> bt_crashnoformatmp3.txt

By: Russell Bryant (russell) 2009-08-28 10:25:20

zerohalo, thanks for the information.  Is there a reason that you can't test with the latest 1.4 version?  Also, can you provide some detail as to how we might be able to replicate what you're seeing locally?

By: David Brillert (aragon) 2009-08-28 10:27:35

I test on 1.4.24 because it crashes more often than 1.4.26 or SVN with moh.
More crashes means quicker test results.
Also I have lots of locking issues in 1.4.26 that prevent me from running a clean test.

By: David Brillert (aragon) 2009-08-28 10:31:00

Russell
I just saw a new ticket
https://issues.asterisk.org/view.php?id=15789
The backtrace looks very similar to my other related ticket 15396

The reporter posted a patch on 15789; does the patch make sense?

By: Alan Graham (zerohalo) 2009-08-28 10:31:34

russell-

unfortunately, I'm waiting for a recent version machine to crash to get the bt. It appears that the previous backtraces already posted for 1.4.26 aren't good examples and 1.4.26 takes much longer to crash (at this point I restart them regularly). I have been unable to reproduce this in a lab setup at all. We've never used the format_mp3 module, btw.

By: Bereterbide Marcelo (marhbere) 2009-08-28 10:43:05

russell, Yes we compiled with DONT_OPTIMIZE.

Now we have new crash with the patch applied.
We run 1.6.0-SVN r213976M

After 24hs having 50 "Attenden Transfers" from Queue using MHO with mp3. I see crash it with the same core-Dump that before. You can see it into attached file.

By: Alan Graham (zerohalo) 2009-08-28 12:09:33

I don't see what's different from the first bt_scrubbed.txt attached (1.4.26rc4) - in thread 1, I see pretty much the same thing as in the 1.4.24.1 I posted today.

By: David Brillert (aragon) 2009-08-31 08:07:38

Still no crashes here.
Patch looks stable.

By: Richard Odekerken (rgj) 2009-08-31 10:18:38

We just got another ASTERISK-1552719 crash so the patch is definitely not patching that issue.

By: Bereterbide Marcelo (marhbere) 2009-08-31 11:20:44

Since my last post we have had five crashes:
[1] 08-28 16:58 still core-dump that was before
[2] 08-29 00:12 something about format_mp3
[3] 08-29 00:55 something about format_mp3
[4] 08-29 22:10 still core-dump that was before
[5] 08-31 14:28 still core-dump that was before

08-28 ~ 50 TRANSFER in queue_log
08-29 ~ 35 TRANSFER in queue_log
08-30 ~ 5  TRANSFER in queue_log
08-31 ~ 15 TRANSFER in queue_log

You can see in the file attach

By: Digium Subversion (svnbot) 2009-09-01 15:38:54

Repository: asterisk-addons
Revision: 1023

U   branches/1.4/formats/format_mp3.c

------------------------------------------------------------------------
r1023 | russell | 2009-09-01 15:38:54 -0500 (Tue, 01 Sep 2009) | 45 lines

Fix memory corruption caused by format_mp3.

format_mp3 claimed that it provided AST_FRIENDLY_OFFSET in frames returned by
read().  However, it lied.  This means that other parts of the code that
attempted to make use of the offset buffer would end up corrupting the fields
in the ast_filestream structure.  This resulted in quite a few crashes due to
unexpected values for fields in ast_filestream.

This patch closes out quite a few bugs.  However, some of these bugs have been
open for a while and have been an area where more than one bug has been
discussed.  So with that said, anyone that is following one of the issues
closed here, if you still have a problem, please open a new bug report for the
specific problem you are still having.  If you do, please ensure that the bug
report is based on the newest version of Asterisk, and that this patch is
applied if format_mp3 is in use.  Thanks!

(closes issue ASTERISK-14129)
Reported by: jvandal
Tested by: aragon, russell, zerohalo, marhbere, rgj

(closes issue ASTERISK-14007)
Reported by: aragon

(closes issue ASTERISK-14141)
Reported by: axisinternet

(closes issue ASTERISK-14074)
Reported by: maxnuv

(closes issue ASTERISK-14374)
Reported by: aragon

(closes issue ASTERISK-14203)
Reported by: amorsen
Tested by: amorsen

(closes issue ASTERISK-14718)
Reported by: jensvb

(closes issue ASTERISK-14673)
Reported by: thom4fun

(closes issue ASTERISK-14428)
Reported by: marhbere

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk-addons?view=rev&revision=1023

By: Digium Subversion (svnbot) 2009-09-01 15:40:16

Repository: asterisk-addons
Revision: 1024

U   branches/1.6.0/formats/format_mp3.c

------------------------------------------------------------------------
r1024 | russell | 2009-09-01 15:40:16 -0500 (Tue, 01 Sep 2009) | 45 lines

Fix memory corruption caused by format_mp3.

format_mp3 claimed that it provided AST_FRIENDLY_OFFSET in frames returned by
read().  However, it lied.  This means that other parts of the code that
attempted to make use of the offset buffer would end up corrupting the fields
in the ast_filestream structure.  This resulted in quite a few crashes due to
unexpected values for fields in ast_filestream.

This patch closes out quite a few bugs.  However, some of these bugs have been
open for a while and have been an area where more than one bug has been
discussed.  So with that said, anyone that is following one of the issues
closed here, if you still have a problem, please open a new bug report for the
specific problem you are still having.  If you do, please ensure that the bug
report is based on the newest version of Asterisk, and that this patch is
applied if format_mp3 is in use.  Thanks!

(closes issue ASTERISK-14129)
Reported by: jvandal
Tested by: aragon, russell, zerohalo, marhbere, rgj

(closes issue ASTERISK-14007)
Reported by: aragon

(closes issue ASTERISK-14141)
Reported by: axisinternet

(closes issue ASTERISK-14074)
Reported by: maxnuv

(closes issue ASTERISK-14374)
Reported by: aragon

(closes issue ASTERISK-14203)
Reported by: amorsen
Tested by: amorsen

(closes issue ASTERISK-14718)
Reported by: jensvb

(closes issue ASTERISK-14673)
Reported by: thom4fun

(closes issue ASTERISK-14428)
Reported by: marhbere

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk-addons?view=rev&revision=1024

By: Digium Subversion (svnbot) 2009-09-01 15:42:24

Repository: asterisk-addons
Revision: 1025

U   branches/1.6.1/formats/format_mp3.c

------------------------------------------------------------------------
r1025 | russell | 2009-09-01 15:42:24 -0500 (Tue, 01 Sep 2009) | 45 lines

Fix memory corruption caused by format_mp3.

format_mp3 claimed that it provided AST_FRIENDLY_OFFSET in frames returned by
read().  However, it lied.  This means that other parts of the code that
attempted to make use of the offset buffer would end up corrupting the fields
in the ast_filestream structure.  This resulted in quite a few crashes due to
unexpected values for fields in ast_filestream.

This patch closes out quite a few bugs.  However, some of these bugs have been
open for a while and have been an area where more than one bug has been
discussed.  So with that said, anyone that is following one of the issues
closed here, if you still have a problem, please open a new bug report for the
specific problem you are still having.  If you do, please ensure that the bug
report is based on the newest version of Asterisk, and that this patch is
applied if format_mp3 is in use.  Thanks!

(closes issue ASTERISK-14129)
Reported by: jvandal
Tested by: aragon, russell, zerohalo, marhbere, rgj

(closes issue ASTERISK-14007)
Reported by: aragon

(closes issue ASTERISK-14141)
Reported by: axisinternet

(closes issue ASTERISK-14074)
Reported by: maxnuv

(closes issue ASTERISK-14374)
Reported by: aragon

(closes issue ASTERISK-14203)
Reported by: amorsen
Tested by: amorsen

(closes issue ASTERISK-14718)
Reported by: jensvb

(closes issue ASTERISK-14673)
Reported by: thom4fun

(closes issue ASTERISK-14428)
Reported by: marhbere

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk-addons?view=rev&revision=1025

By: Digium Subversion (svnbot) 2009-09-01 15:43:13

Repository: asterisk-addons
Revision: 1026

U   branches/1.6.2/formats/format_mp3.c

------------------------------------------------------------------------
r1026 | russell | 2009-09-01 15:43:13 -0500 (Tue, 01 Sep 2009) | 45 lines

Fix memory corruption caused by format_mp3.

format_mp3 claimed that it provided AST_FRIENDLY_OFFSET in frames returned by
read().  However, it lied.  This means that other parts of the code that
attempted to make use of the offset buffer would end up corrupting the fields
in the ast_filestream structure.  This resulted in quite a few crashes due to
unexpected values for fields in ast_filestream.

This patch closes out quite a few bugs.  However, some of these bugs have been
open for a while and have been an area where more than one bug has been
discussed.  So with that said, anyone that is following one of the issues
closed here, if you still have a problem, please open a new bug report for the
specific problem you are still having.  If you do, please ensure that the bug
report is based on the newest version of Asterisk, and that this patch is
applied if format_mp3 is in use.  Thanks!

(closes issue ASTERISK-14129)
Reported by: jvandal
Tested by: aragon, russell, zerohalo, marhbere, rgj

(closes issue ASTERISK-14007)
Reported by: aragon

(closes issue ASTERISK-14141)
Reported by: axisinternet

(closes issue ASTERISK-14074)
Reported by: maxnuv

(closes issue ASTERISK-14374)
Reported by: aragon

(closes issue ASTERISK-14203)
Reported by: amorsen
Tested by: amorsen

(closes issue ASTERISK-14718)
Reported by: jensvb

(closes issue ASTERISK-14673)
Reported by: thom4fun

(closes issue ASTERISK-14428)
Reported by: marhbere

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk-addons?view=rev&revision=1026

By: Digium Subversion (svnbot) 2009-09-01 15:44:57

Repository: asterisk
Revision: 215212

U   trunk/addons/format_mp3.c

------------------------------------------------------------------------
r215212 | russell | 2009-09-01 15:44:57 -0500 (Tue, 01 Sep 2009) | 45 lines

Fix memory corruption caused by format_mp3.

format_mp3 claimed that it provided AST_FRIENDLY_OFFSET in frames returned by
read().  However, it lied.  This means that other parts of the code that
attempted to make use of the offset buffer would end up corrupting the fields
in the ast_filestream structure.  This resulted in quite a few crashes due to
unexpected values for fields in ast_filestream.

This patch closes out quite a few bugs.  However, some of these bugs have been
open for a while and have been an area where more than one bug has been
discussed.  So with that said, anyone that is following one of the issues
closed here, if you still have a problem, please open a new bug report for the
specific problem you are still having.  If you do, please ensure that the bug
report is based on the newest version of Asterisk, and that this patch is
applied if format_mp3 is in use.  Thanks!

(closes issue ASTERISK-14129)
Reported by: jvandal
Tested by: aragon, russell, zerohalo, marhbere, rgj

(closes issue ASTERISK-14007)
Reported by: aragon

(closes issue ASTERISK-14141)
Reported by: axisinternet

(closes issue ASTERISK-14074)
Reported by: maxnuv

(closes issue ASTERISK-14374)
Reported by: aragon

(closes issue ASTERISK-14203)
Reported by: amorsen
Tested by: amorsen

(closes issue ASTERISK-14718)
Reported by: jensvb

(closes issue ASTERISK-14673)
Reported by: thom4fun

(closes issue ASTERISK-14428)
Reported by: marhbere

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=215212

By: Digium Subversion (svnbot) 2009-09-01 15:45:27

Repository: asterisk
Revision: 215213

_U  branches/1.6.2/

------------------------------------------------------------------------
r215213 | russell | 2009-09-01 15:45:26 -0500 (Tue, 01 Sep 2009) | 51 lines

Blocked revisions 215212 via svnmerge

........
 r215212 | russell | 2009-09-01 15:44:13 -0500 (Tue, 01 Sep 2009) | 45 lines
 
 Fix memory corruption caused by format_mp3.
 
 format_mp3 claimed that it provided AST_FRIENDLY_OFFSET in frames returned by
 read().  However, it lied.  This means that other parts of the code that
 attempted to make use of the offset buffer would end up corrupting the fields
 in the ast_filestream structure.  This resulted in quite a few crashes due to
 unexpected values for fields in ast_filestream.
 
 This patch closes out quite a few bugs.  However, some of these bugs have been
 open for a while and have been an area where more than one bug has been
 discussed.  So with that said, anyone that is following one of the issues
 closed here, if you still have a problem, please open a new bug report for the
 specific problem you are still having.  If you do, please ensure that the bug
 report is based on the newest version of Asterisk, and that this patch is
 applied if format_mp3 is in use.  Thanks!
 
 (closes issue ASTERISK-14129)
 Reported by: jvandal
 Tested by: aragon, russell, zerohalo, marhbere, rgj
 
 (closes issue ASTERISK-14007)
 Reported by: aragon
 
 (closes issue ASTERISK-14141)
 Reported by: axisinternet
 
 (closes issue ASTERISK-14074)
 Reported by: maxnuv
 
 (closes issue ASTERISK-14374)
 Reported by: aragon
 
 (closes issue ASTERISK-14203)
 Reported by: amorsen
 Tested by: amorsen
 
 (closes issue ASTERISK-14718)
 Reported by: jensvb
 
 (closes issue ASTERISK-14673)
 Reported by: thom4fun
 
 (closes issue ASTERISK-14428)
 Reported by: marhbere
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=215213