[Home]

Summary:ASTERISK-10477: crash in ast_var_name on SIP hangup
Reporter:Atis Lezdins (atis)Labels:
Date Opened:2007-10-09 08:37:41Date Closed:2007-11-16 12:25:52.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 1_bt.core.5249.txt
( 1) 1_valgrind.clean.txt
( 2) 2_valgrind.clean.txt
( 3) 20071107__bug10923.diff.txt
( 4) 20071109__bug10923.diff.txt
( 5) 3_bt.core.13334.txt
( 6) 3_valgrind.clean.txt
( 7) 4_valgrind.clean.txt
( 8) 5_valgrind.clean.txt
( 9) bt_new2.txt
(10) bt.core.19445.txt
(11) bt.core.21274.txt
(12) bt.core.26535.txt
(13) call2_full.log
(14) full.log
Description:I just upgraded to asterisk 1.4.12.1 (from 1.4.10), and within few hours i got  crash.

It occurred on hangup of SIP channel, and it seems to be related to some SIP data.

****** ADDITIONAL INFORMATION ******

log:

[Oct  9 05:40:25] DEBUG[25507] chan_agent.c: Bridge on 'SIP/90110-00b2d0b0' being cleared (2)
[Oct  9 05:40:26] DEBUG[21279] res_config_mysql.c: MySQL RealTime: Everything is fine.
[Oct  9 05:40:26] VERBOSE[25507] logger.c:   == Spawn extension (db_ext_queues, 28901, 14) exited non-zero on 'SIP/69.80.215.6-ac090c10'
[Oct  9 05:40:26] DEBUG[21279] res_config_mysql.c: MySQL RealTime: Retrieve SQL: SELECT * FROM sip_buddies WHERE name = '90110'
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Everything is fine.
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Retrieve SQL: SELECT * FROM extensions_table WHERE exten = 'h' AND context = 'db_ext_queues' AND priority = '1'
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Everything is fine.
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Retrieve SQL: SELECT * FROM extensions_table WHERE exten = 'h' AND context = 'db_ext_queues' AND priority = '1'
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Everything is fine.
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Retrieve SQL: SELECT * FROM extensions_table WHERE exten = 'h' AND context = 'db_ext_queues' AND priority = '1'
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Everything is fine.
[Oct  9 05:40:26] DEBUG[25507] res_config_mysql.c: MySQL RealTime: Retrieve SQL: SELECT * FROM extensions_table WHERE exten = 'h' AND context = 'db_ext_queues' AND priority = '1'
[Oct  9 05:40:26] VERBOSE[25507] logger.c:     -- Executing Hangup("SIP/69.80.215.6-ac090c10", "")
[Oct  9 05:40:26] VERBOSE[25507] logger.c:   == Spawn extension (db_ext_queues, h, 1) exited non-zero on 'SIP/69.80.215.6-ac090c10'
[Oct  9 05:40:30] VERBOSE[26535] logger.c: Asterisk Event Logger Started /var/log/asterisk/event_log


bt full:
(gdb) bt full
#0  0x000000000044dc64 in ast_var_name (var=0x2aabac032bf0) at chanvars.c:69
       name = 0x2aabac032c00 <Address 0x2aabac032c00 out of bounds>
#1  0x0000000000499810 in pbx_builtin_setvar_helper (chan=0x2aaaac01cd70, name=0x2aaab79a54d8 "RTPAUDIOQOS", value=0x2aaaac012050 "ssrc=1012538103;themssrc=626054918;lp=0;rxjitter
=0.001544;rxcount=14707;txjitter=0.011597;txcount=14509;rlp=0;rtt=0.004000") at pbx.c:5825
       newvariable = (struct ast_var_t *) 0x2aabac032bf0
       headp = (struct varshead *) 0x2aaaac01d2d0
       nametail = 0x2aaab79a54d8 "RTPAUDIOQOS"
       __PRETTY_FUNCTION__ = "pbx_builtin_setvar_helper"
#2  0x00002aaab795a18d in sip_hangup (ast=0x2aaaac01cd70) from /usr/lib/asterisk/modules/chan_sip.so
       audioqos = 0x2aaaac012050 "ssrc=1012538103;themssrc=626054918;lp=0;rxjitter=0.001544;rxcount=14707;txjitter=0.011597;txcount=14509;rlp=0;rtt=0.004000"
       videoqos = 0x2aaab79a36da ""
       p = (struct sip_pvt *) 0x2aaaac090c10
       needcancel = 0
       needdestroy = 0
       oldowner = (struct ast_channel *) 0x2aaaac01cd70
       __PRETTY_FUNCTION__ = "sip_hangup"
#3  0x0000000000441759 in ast_hangup (chan=0x2aaaac01cd70) at channel.c:1764
       res = 0
       cdr = (struct ast_cdr *) 0xb32be0
       __PRETTY_FUNCTION__ = "ast_hangup"
Comments:By: Tilghman Lesher (tilghman) 2007-10-09 09:52:52

Are you running/using any custom third party apps?  What we're seeing in your backtrace is corruption of your channel variable list.  Basically something is getting added to your channel variables (before this one) that contained out of bounds addresses, and while checking to see if the variable was already set, it came across that record and segfaulted.

By: Atis Lezdins (atis) 2007-10-09 10:10:53

No, there is no third party applications, only asterisk-addons (app_addon_sql_mysql, cdr_addon_mysql, res_config_mysql). I also upgraded those from 1.4.2 to 1.4.3.

I have a few patches on top of it, but if i understand correctly - you mean dialplan variables - none of my custom patches goes that far. Plus, they were working fine with 1.4.10. In this upgrade i also added patches from ASTERISK-1078847 and ASTERISK-1082888, but those doesn't touch asterisk dialplan variables neither.

I guess, you will need full log of affected channel? I'll try to provide that shortly.

By: Tilghman Lesher (tilghman) 2007-10-09 10:29:26

Well, if it's not that, then it's generally corrupted memory, which is a LOT more difficult to track down.  You'd have to run Asterisk under valgrind.

By: Atis Lezdins (atis) 2007-10-09 11:27:59

Sorry, can you please delete file callid_1191940941.1404.log - i forgot to remove secure information.

By: Atis Lezdins (atis) 2007-10-09 11:36:13

I got another crash today, the same line, but different conditions. Can you update category, as this isn't SIP related?

I have reverted my production back to 1.4.10

Can't anything be done without valgrind? Inside gdb or so? I definitely won't be able to run valgrind on production, because it's performance is really unacceptable. I will probably try some advanced load-testing to simulate actions taken on production, but i'm not sure how soon i'll have time for that.

By: 850t (850t) 2007-11-01 03:32:22

Hi, we have just upgraded our system from Asterisk 1.2.X to 1.4.12 and 1.4.13, and both experienced similar symptoms.  The problems only happen in our production machine but we aren't able to replicate the issues in our development machine.

I've attached a backtrace, bt_new2.txt for review.

By: Tilghman Lesher (tilghman) 2007-11-01 07:36:57

Please recompile with MALLOC_DEBUG and run under valgrind.  That is the only output that is going to help us track down this issue.

By: Atis Lezdins (atis) 2007-11-01 08:49:41

Corydon76, is there any tutorial - what should i do with valgrind? I.e - how should i capture the output, what flags to use..

Finally today i got some automatic load/feature tests, that are able to crash asterisk, i'll try to debug it as you suggested. Unfortunately my asterisk is crashing all the time,  and those variable problems are only in 1/15th of crashes :D

By: Tilghman Lesher (tilghman) 2007-11-01 12:04:24

valgrind --log-file-exactly=valgrind.txt asterisk -vvvvc 2> malloc_debug.txt

After you stop Asterisk, there will be two files named valgrind.txt and malloc_debug.txt in the current working directory.  Upload those files to the file upload area of this issue.

By: Digium Subversion (svnbot) 2007-11-05 16:07:01.000-0600

Repository: asterisk
Revision: 88805

U   branches/1.4/include/asterisk/pbx.h
U   branches/1.4/main/pbx.c

------------------------------------------------------------------------
r88805 | russell | 2007-11-05 16:07:00 -0600 (Mon, 05 Nov 2007) | 12 lines

After seeing crashes related to channel variables, I went looking around at the
ways that channel variables are handled.  In general, they were not handled in
a thread-safe way.  The channel _must_ be locked when reading or writing from/to
the channel variable list.

What I have done to improve this situation is to make pbx_builtin_setvar_helper()
and friends lock the channel when doing their thing.  Asterisk API calls almost
all lock the channel for you as necessary, but this family of functions did not.

(closes issue ASTERISK-10477, reported by atis)
(closes issue ASTERISK-10683, reported by 850t)

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-11-06 08:07:00.000-0600

Repository: asterisk
Revision: 88934

_U  trunk/
U   trunk/channels/busy.h
U   trunk/channels/ringtone.h
U   trunk/include/asterisk/pbx.h
U   trunk/main/pbx.c

------------------------------------------------------------------------
r88934 | russell | 2007-11-06 08:06:59 -0600 (Tue, 06 Nov 2007) | 20 lines

Merged revisions 88805 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r88805 | russell | 2007-11-05 16:07:54 -0600 (Mon, 05 Nov 2007) | 12 lines

After seeing crashes related to channel variables, I went looking around at the
ways that channel variables are handled.  In general, they were not handled in
a thread-safe way.  The channel _must_ be locked when reading or writing from/to
the channel variable list.

What I have done to improve this situation is to make pbx_builtin_setvar_helper()
and friends lock the channel when doing their thing.  Asterisk API calls almost
all lock the channel for you as necessary, but this family of functions did not.

(closes issue ASTERISK-10477, reported by atis)
(closes issue ASTERISK-10683, reported by 850t)

........

------------------------------------------------------------------------

By: Atis Lezdins (atis) 2007-11-07 07:07:54.000-0600

This is not helping. I just got crash on 1.4 r89079.

Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `/usr/sbin/asterisk -U asterisk -G asterisk -vvvcgt'.
Program terminated with signal 11, Segmentation fault.
#0  0x08090b7a in ast_var_name (var=0x4731d243) at chanvars.c:69
69    if (name[0] == '_') {
#0  0x08090b7a in ast_var_name (var=0x4731d243) at chanvars.c:69
#1  0x080d969c in pbx_builtin_setvar_helper (chan=0x873c300, name=0xb50dc6c0 "INPUTANUM_488", value=0xb50dc7a0 "") at pbx.c:5850
#2  0xb5d2646e in read_exec (chan=0x873c300, data=0xb50dcb40) at app_read.c:192
#3  0x080ca8ea in pbx_exec (c=0x873c300, app=0x8239828, data=0xb50dcb40) at pbx.c:532
#4  0xb6f6c836 in realtime_exec (chan=0x873c300, context=0x873c540 "db_ivr", exten=0x873c590 "ivr---488", priority=15, callerid=0x85bd640 "112233", data=0x8503c51 "db_ivr")
   at pbx_realtime.c:216
ASTERISK-1  0x080ce59b in pbx_extension_helper (c=0x873c300, con=0x0, context=0x873c540 "db_ivr", exten=0x873c590 "ivr---488", priority=15, label=0x0, callerid=0x85bd640 "112233",
   action=E_SPAWN) at pbx.c:1849
ASTERISK-2  0x080cf8b8 in ast_spawn_extension (c=0x873c300, context=0x873c540 "db_ivr", exten=0x873c590 "ivr---488", priority=15, callerid=0x85bd640 "112233") at pbx.c:2293
ASTERISK-3  0x080cfdda in __ast_pbx_run (c=0x873c300) at pbx.c:2393
ASTERISK-4  0x080d0baf in pbx_thread (data=0x873c300) at pbx.c:2608
ASTERISK-5  0x081110b3 in dummy_start (data=0x856b2a0) at utils.c:843
ASTERISK-6 0x497fd3db in start_thread () from /lib/libpthread.so.0
ASTERISK-7 0x4975706e in clone () from /lib/libc.so.6

By: Tilghman Lesher (tilghman) 2007-11-07 07:22:22.000-0600

Okay, then, please follow the valgrind steps noted earlier to produce valgrind.txt and malloc_debug.txt and upload them to the file upload area of this bug.

By: Atis Lezdins (atis) 2007-11-07 07:25:34.000-0600

Ok, within two days of testing i got two crashes under valgrind (2 per hour when running without valgrind). I'm not sure that they are related to this issue, as core dumps looks really different (i assume that valgrind kills asterisk as soon as something goes really bad) I'll upload them within few minutes.

By: Atis Lezdins (atis) 2007-11-07 07:39:32.000-0600

Ok, here they go.

From valgrind.txt i grepped out
==19682== Warning: invalid file descriptor 1877 in syscall close()
==19682== Warning: invalid file descriptor 1878 in syscall close()
==19682== Warning: invalid file descriptor 1879 in syscall close()
==19682== Warning: invalid file descriptor 1880 in syscall close()
==19682== Warning: invalid file descriptor 1881 in syscall close()

All that malloc_debug contains is:

mISDN_close: fid(12) isize(131072) inbuf(0x46feaa0) irp(0x46feaa0) iend(0x46feaa0)
WARNING: Freeing unused memory at (nil), in ast_yyfree of ast_expr2f.c, line 3091
WARNING: Freeing unused memory at (nil), in ast_yyfree of ast_expr2f.c, line 3091
WARNING: Freeing unused memory at (nil), in ast_yyfree of ast_expr2f.c, line 3091
(repeating 10MB)

By: Tilghman Lesher (tilghman) 2007-11-07 08:50:42.000-0600

There are missing sections in these valgrind outputs.  What are in the missing sections?

By: Atis Lezdins (atis) 2007-11-07 08:54:13.000-0600

It's just the lines mentioned in previous comment.

Otherwise i got size 400-500 MB for valgrind.txt

Here's how i process them:

grep -F -v "Warning: invalid file descriptor" valgrind.txt | grep -v "Use --log-fd=<number> to select an alternative log fd." > valgrind.clean.txt
grep -F -v "WARNING: Freeing unused memory at (nil), in ast_yyfree of ast_expr2f.c, line 3091" mmlog > mmlog.clean.txt
grep -F -v "WARNING: Freeing unused memory at (nil), in ast_yyfree of ast_expr2f.c, line 3091" malloc_debug.txt > malloc_debug.clean.txt

By: Tilghman Lesher (tilghman) 2007-11-07 17:29:26.000-0600

atis:  please test this patch to see if it solves these crashes.  This missing code could very definitely cause memory corruption.

By: Atis Lezdins (atis) 2007-11-08 04:49:51.000-0600

Ok, i'll try the patch today.

Last night i got another crash in stress-testing environment. I'm attaching valgrind log and backtrace.

This time the malloc_debug had 40MB of the following line:
XXX ERROR XXX A thread holds more locks than '16'.  Increase AST_MAX_LOCKS!

I'll probably do that on next testing.

By: Atis Lezdins (atis) 2007-11-08 08:42:51.000-0600

I tested today with patch, and seems that pbx_builtin_setvar_helper -> ast_var_name crash is gone. However now i got two pbx_builtin_getvar_helper -> ast_var_name (which i didn't get before).

Here goes two backtraces:

#0  0x08090b86 in ast_var_name (var=0x47331f58) at chanvars.c:69
69    if (name[0] == '_') {
#0  0x08090b86 in ast_var_name (var=0x47331f58) at chanvars.c:69
#1  0x080d92b3 in pbx_builtin_getvar_helper (chan=0x8544c10, name=0xb5b0ddda "CHANLOCALSTATUS") at pbx.c:5770
#2  0xb5b048ca in agent_hangup (ast=0x86aea98) at chan_agent.c:804
#3  0x08085d6d in ast_hangup (chan=0x86aea98) at channel.c:1766
#4  0xb6ef745e in try_calling (qe=0xb54a6700, options=0xb54a66b7 "", announceoverride=0xb54a66b9 "", url=0xb54a66b8 "", tries=0xb54a68ac, noption=0xb54a68a8, agi=0x0)
   at app_queue.c:2875
ASTERISK-1  0xb6efa17f in queue_exec (chan=0x85d76c8, data=0xb54a6b40) at app_queue.c:3636
ASTERISK-2  0x080ca8f6 in pbx_exec (c=0x85d76c8, app=0x81fcc60, data=0xb54a6b40) at pbx.c:532
ASTERISK-3  0xb6ea5836 in realtime_exec (chan=0x85d76c8, context=0x85d7908 "db_ext_queues", exten=0x85d7958 "22901", priority=16, callerid=0x87098d0 "112233",
   data=0x84f6a99 "db_ext_queues") at pbx_realtime.c:216
ASTERISK-4  0x080ce5a7 in pbx_extension_helper (c=0x85d76c8, con=0x0, context=0x85d7908 "db_ext_queues", exten=0x85d7958 "22901", priority=16, label=0x0, callerid=0x87098d0 "112233",
   action=E_SPAWN) at pbx.c:1849
ASTERISK-5  0x080cf8c4 in ast_spawn_extension (c=0x85d76c8, context=0x85d7908 "db_ext_queues", exten=0x85d7958 "22901", priority=16, callerid=0x87098d0 "112233") at pbx.c:2293
ASTERISK-6 0x080cfde6 in __ast_pbx_run (c=0x85d76c8) at pbx.c:2393
ASTERISK-7 0x080d0bbb in pbx_thread (data=0x85d76c8) at pbx.c:2608
ASTERISK-8 0x081110bf in dummy_start (data=0x86ebbd8) at utils.c:843
ASTERISK-9 0x4a8903db in start_thread () from /lib/libpthread.so.0
ASTERISK-10 0x4a7ea06e in clone () from /lib/libc.so.6

#0  0x08090b86 in ast_var_name (var=0x47332978) at chanvars.c:69
69    if (name[0] == '_') {
#0  0x08090b86 in ast_var_name (var=0x47332978) at chanvars.c:69
#1  0x080d92b3 in pbx_builtin_getvar_helper (chan=0x86486d8, name=0xb5b54dda "CHANLOCALSTATUS") at pbx.c:5770
#2  0xb5b4b8ca in agent_hangup (ast=0x827e0c8) at chan_agent.c:804
#3  0x08085d6d in ast_hangup (chan=0x827e0c8) at channel.c:1766
#4  0xb6f3e45e in try_calling (qe=0xb4ac4700, options=0xb4ac46b7 "", announceoverride=0xb4ac46b9 "", url=0xb4ac46b8 "", tries=0xb4ac48ac, noption=0xb4ac48a8, agi=0x0)
   at app_queue.c:2875
ASTERISK-1  0xb6f4117f in queue_exec (chan=0x88cad80, data=0xb4ac4b40) at app_queue.c:3636
ASTERISK-2  0x080ca8f6 in pbx_exec (c=0x88cad80, app=0x81fb8d8, data=0xb4ac4b40) at pbx.c:532
ASTERISK-3  0xb6eec836 in realtime_exec (chan=0x88cad80, context=0x88cafc0 "db_ext_queues", exten=0x88cb010 "22901", priority=16, callerid=0x880f880 "112233",
   data=0x84f6b61 "db_ext_queues") at pbx_realtime.c:216
ASTERISK-4  0x080ce5a7 in pbx_extension_helper (c=0x88cad80, con=0x0, context=0x88cafc0 "db_ext_queues", exten=0x88cb010 "22901", priority=16, label=0x0, callerid=0x880f880 "112233",
   action=E_SPAWN) at pbx.c:1849
ASTERISK-5  0x080cf8c4 in ast_spawn_extension (c=0x88cad80, context=0x88cafc0 "db_ext_queues", exten=0x88cb010 "22901", priority=16, callerid=0x880f880 "112233") at pbx.c:2293
ASTERISK-6 0x080cfde6 in __ast_pbx_run (c=0x88cad80) at pbx.c:2393
ASTERISK-7 0x080d0bbb in pbx_thread (data=0x88cad80) at pbx.c:2608
ASTERISK-8 0x081110bf in dummy_start (data=0x872b0b0) at utils.c:843
ASTERISK-9 0x4a8903db in start_thread () from /lib/libpthread.so.0
ASTERISK-10 0x4a7ea06e in clone () from /lib/libc.so.6


Additionally, i started to pay attention to other crashes that i'm getting - they contain ast_var_delete(). I never got them while i had 1.4.12 on production, but i get them on testing server even more often than ast_var_name crashes.

Generally they are like this:

#0  0xb7f47410 in __kernel_vsyscall ()
#0  0xb7f47410 in __kernel_vsyscall ()
#1  0x4a745d40 in raise () from /lib/libc.so.6
#2  0x4a747591 in abort () from /lib/libc.so.6
#3  0x4a77b18b in __libc_message () from /lib/libc.so.6
#4  0x4a782dbe in _int_free () from /lib/libc.so.6
ASTERISK-1  0x4a786550 in free () from /lib/libc.so.6
ASTERISK-2  0x08072a04 in __ast_free_region (ptr=0x8565488, file=0x814b47f "chanvars.c", lineno=59, func=0x814b4e2 "ast_var_delete") at astmm.c:174
ASTERISK-3  0x08073365 in __ast_free (ptr=0x8565488, file=0x814b47f "chanvars.c", lineno=59, func=0x814b4e2 "ast_var_delete") at astmm.c:208
ASTERISK-4  0x08090b5c in ast_var_delete (var=0x8565488) at chanvars.c:59
ASTERISK-5  0x08083877 in ast_channel_free (chan=0x86f9c88) at channel.c:1261
ASTERISK-6 0x08085e3c in ast_hangup (chan=0x86f9c88) at channel.c:1783
ASTERISK-7 0x080d0969 in __ast_pbx_run (c=0x86f9c88) at pbx.c:2548
ASTERISK-8 0x080d0bbb in pbx_thread (data=0x86f9c88) at pbx.c:2608
ASTERISK-9 0x081110bf in dummy_start (data=0x87c9f50) at utils.c:843
ASTERISK-10 0x4a8903db in start_thread () from /lib/libpthread.so.0
ASTERISK-11 0x4a7ea06e in clone () from /lib/libc.so.6

and like this (less often):
#0  0x08072867 in __ast_free_region (ptr=0x87b1af8, file=0x814b47f "chanvars.c", lineno=59, func=0x814b4e2 "ast_var_delete") at astmm.c:152
152   for (reg = regions[hash]; reg; reg = reg->next) {
#0  0x08072867 in __ast_free_region (ptr=0x87b1af8, file=0x814b47f "chanvars.c", lineno=59, func=0x814b4e2 "ast_var_delete") at astmm.c:152
#1  0x08073365 in __ast_free (ptr=0x87b1af8, file=0x814b47f "chanvars.c", lineno=59, func=0x814b4e2 "ast_var_delete") at astmm.c:208
#2  0x08090b5c in ast_var_delete (var=0x87b1af8) at chanvars.c:59
#3  0x08083877 in ast_channel_free (chan=0x868e700) at channel.c:1261
#4  0x08085e3c in ast_hangup (chan=0x868e700) at channel.c:1783
ASTERISK-1  0x080d0969 in __ast_pbx_run (c=0x868e700) at pbx.c:2548
ASTERISK-2  0x080d0bbb in pbx_thread (data=0x868e700) at pbx.c:2608
ASTERISK-3  0x081110bf in dummy_start (data=0x868f4a8) at utils.c:843
ASTERISK-4  0x4a8903db in start_thread () from /lib/libpthread.so.0
ASTERISK-5  0x4a7ea06e in clone () from /lib/libc.so.6


There's one that includes pbx_builtin_setvar_helper, but no ast_var_name anymore:

#0  0xb7ff6410 in __kernel_vsyscall ()
#0  0xb7ff6410 in __kernel_vsyscall ()
#1  0x4a745d40 in raise () from /lib/libc.so.6
#2  0x4a747591 in abort () from /lib/libc.so.6
#3  0x4a77b18b in __libc_message () from /lib/libc.so.6
#4  0x4a782dbe in _int_free () from /lib/libc.so.6
ASTERISK-1  0x4a786550 in free () from /lib/libc.so.6
ASTERISK-2  0x08072a04 in __ast_free_region (ptr=0x85175d0, file=0x814b47f "chanvars.c", lineno=59, func=0x814b4e2 "ast_var_delete") at astmm.c:174
ASTERISK-3  0x08073365 in __ast_free (ptr=0x85175d0, file=0x814b47f "chanvars.c", lineno=59, func=0x814b4e2 "ast_var_delete") at astmm.c:208
ASTERISK-4  0x08090b5c in ast_var_delete (var=0x85175d0) at chanvars.c:59
ASTERISK-5  0x080d975b in pbx_builtin_setvar_helper (chan=0x864c1c8, name=0xb50796f0 "__target_num_id", value=0xb5079700 "84") at pbx.c:5853
ASTERISK-6 0x080d9a81 in pbx_builtin_setvar (chan=0x864c1c8, data=0xb507d9b0) at pbx.c:5903
ASTERISK-7 0x080ca8f6 in pbx_exec (c=0x864c1c8, app=0x81ae5c8, data=0xb507d9b0) at pbx.c:532
ASTERISK-8 0x080ce4b0 in pbx_extension_helper (c=0x864c1c8, con=0x0, context=0x864c408 "macro-init_vars", exten=0x864c458 "s", priority=54, label=0x0, callerid=0x855b638 "112233",
   action=E_SPAWN) at pbx.c:1838
ASTERISK-9 0x080cf8c4 in ast_spawn_extension (c=0x864c1c8, context=0x864c408 "macro-init_vars", exten=0x864c458 "s", priority=54, callerid=0x855b638 "112233") at pbx.c:2293
ASTERISK-10 0xb5bba2d1 in _macro_exec (chan=0x864c1c8, data=0xb5084f40, exclusive=0) at app_macro.c:308
ASTERISK-11 0xb5bbb103 in macro_exec (chan=0x864c1c8, data=0xb5084f40) at app_macro.c:486
ASTERISK-12 0x080ca8f6 in pbx_exec (c=0x864c1c8, app=0x82421a0, data=0xb5084f40) at pbx.c:532
ASTERISK-13 0x080ce4b0 in pbx_extension_helper (c=0x864c1c8, con=0x0, context=0x864c408 "macro-init_vars", exten=0x864c458 "s", priority=7, label=0x0, callerid=0x855b638 "112233",
   action=E_SPAWN) at pbx.c:1838
ASTERISK-14 0x080cf8c4 in ast_spawn_extension (c=0x864c1c8, context=0x864c408 "macro-init_vars", exten=0x864c458 "s", priority=7, callerid=0x855b638 "112233") at pbx.c:2293
ASTERISK-15 0x080cfde6 in __ast_pbx_run (c=0x864c1c8) at pbx.c:2393
ASTERISK-16 0x080d0bbb in pbx_thread (data=0x864c1c8) at pbx.c:2608
ASTERISK-17 0x081110bf in dummy_start (data=0x8547df8) at utils.c:843
ASTERISK-18 0x4a8903db in start_thread () from /lib/libpthread.so.0
ASTERISK-19 0x4a7ea06e in clone () from /lib/libc.so.6

And here's one that i never got before:

#0  0x08072a8f in get_unaligned_uint32 (p=0x4f87a06e) at /usr/dist/asterisk-1.4-svn-2007-11-07/include/asterisk/unaligned.h:36
36    return pp->d;
#0  0x08072a8f in get_unaligned_uint32 (p=0x4f87a06e) at /usr/dist/asterisk-1.4-svn-2007-11-07/include/asterisk/unaligned.h:36
#1  0x0807295e in __ast_free_region (ptr=0x8547110, file=0x8147eb3 "channel.c", lineno=1180, func=0x81494d4 "free_cid") at astmm.c:170
#2  0x08073365 in __ast_free (ptr=0x8547110, file=0x8147eb3 "channel.c", lineno=1180, func=0x81494d4 "free_cid") at astmm.c:208
#3  0x08083307 in free_cid (cid=0x855db8c) at channel.c:1180
#4  0x080836bd in ast_channel_free (chan=0x855da00) at channel.c:1239
ASTERISK-1  0x08085e3c in ast_hangup (chan=0x855da00) at channel.c:1783
ASTERISK-2  0xb6f1845e in try_calling (qe=0xb557e700, options=0xb557e6b7 "", announceoverride=0xb557e6b9 "", url=0xb557e6b8 "", tries=0xb557e8ac, noption=0xb557e8a8, agi=0x0)
   at app_queue.c:2875
ASTERISK-3  0xb6f1b17f in queue_exec (chan=0x8534c18, data=0xb557eb40) at app_queue.c:3636
ASTERISK-4  0x080ca8f6 in pbx_exec (c=0x8534c18, app=0x81fd108, data=0xb557eb40) at pbx.c:532
ASTERISK-5  0xb6ec6836 in realtime_exec (chan=0x8534c18, context=0x8534e58 "db_ext_queues", exten=0x8534ea8 "22901", priority=16, callerid=0x8715e90 "112233",
   data=0x84f7499 "db_ext_queues") at pbx_realtime.c:216
ASTERISK-6 0x080ce5a7 in pbx_extension_helper (c=0x8534c18, con=0x0, context=0x8534e58 "db_ext_queues", exten=0x8534ea8 "22901", priority=16, label=0x0, callerid=0x8715e90 "112233",
   action=E_SPAWN) at pbx.c:1849
ASTERISK-7 0x080cf8c4 in ast_spawn_extension (c=0x8534c18, context=0x8534e58 "db_ext_queues", exten=0x8534ea8 "22901", priority=16, callerid=0x8715e90 "112233") at pbx.c:2293
ASTERISK-8 0x080cfde6 in __ast_pbx_run (c=0x8534c18) at pbx.c:2393
ASTERISK-9 0x080d0bbb in pbx_thread (data=0x8534c18) at pbx.c:2608
ASTERISK-10 0x081110bf in dummy_start (data=0x855c9d0) at utils.c:843
ASTERISK-11 0x4a8903db in start_thread () from /lib/libpthread.so.0
ASTERISK-12 0x4a7ea06e in clone () from /lib/libc.so.6

By: Tilghman Lesher (tilghman) 2007-11-08 09:11:23.000-0600

Okay, any of these that crash on malloc or free, we need the valgrind output on.  The backtraces only have enough information to indicate that we need the valgrind output.

By: Atis Lezdins (atis) 2007-11-08 09:25:46.000-0600

Sure, i'm back to running valgrind, with your patch.

By: Tilghman Lesher (tilghman) 2007-11-09 17:38:39.000-0600

Okay, another one found in your latest valgrind.  Please make sure to apply both patches, as this is in addition to, not instead of.

By: Digium Subversion (svnbot) 2007-11-12 11:27:12.000-0600

Repository: asterisk
Revision: 89184

U   branches/1.4/channels/chan_sip.c
U   branches/1.4/main/channel.c

------------------------------------------------------------------------
r89184 | tilghman | 2007-11-12 11:27:11 -0600 (Mon, 12 Nov 2007) | 5 lines

Fix two cases of memory corruption caused by background threads.
Reported by: atis
Patch by: tilghman
Fixes issue ASTERISK-10477

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-11-12 11:41:59.000-0600

Repository: asterisk
Revision: 89185

_U  trunk/
U   trunk/channels/chan_sip.c
U   trunk/main/channel.c

------------------------------------------------------------------------
r89185 | tilghman | 2007-11-12 11:41:58 -0600 (Mon, 12 Nov 2007) | 13 lines

Merged revisions 89184 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r89184 | tilghman | 2007-11-12 11:29:17 -0600 (Mon, 12 Nov 2007) | 5 lines

Fix two cases of memory corruption caused by background threads.
Reported by: atis
Patch by: tilghman
Fixes issue ASTERISK-10477

........

------------------------------------------------------------------------

By: Tilghman Lesher (tilghman) 2007-11-12 11:55:43.000-0600

Patches committed.  If I don't see any more valgrind outputs that diagnose more corruption issues in the next 3 days, I'll assume that this is fixed.

By: Atis Lezdins (atis) 2007-11-15 07:45:32.000-0600

The problem is not completely gone. I can still crash it without valgrind, but no success when running valgrind.

The big problem when running asterisk+valgrind - the system is becoming extremely slow, SIP registrations are lost, and there are problems with sending media. I can repeat the problem with some automated tests, but valgrind slows everything down numerous times - and no success yet in crashing it.

I will try to run tests on our production system that has 8 cores at night time and i hope that this would compensate slowdown, but that needs some preparations, and i have only few hours for tests.

As far as i have tested, this problem is related to having multiple queues, transfers between them and agent logins/logouts. I'm also using agentcallbacklogin, and i know it's not recommended, but it's not crashing for 1.4.10 like this.

So, here goes errors that are starting within short time of runnig valgrind. I get none of those registration and audio problems when not running valgrind.


[Nov 15 14:06:18] NOTICE[23025]: chan_sip.c:7322 sip_reg_timeout:    -- Registration for '22218@agent_22218' timed out, trying again (Attempt #4)

[Nov 15 14:06:19] NOTICE[23360]: pbx_spool.c:341 attempt_thread: Call failed to go through, reason (0) Call Failure (not BUSY, and not NO_ANSWER, maybe Circuit busy or down?)

   -- Playing 'tt-monkeys' (escape_digits=) (sample_offset 0)
[Nov 15 14:06:26] WARNING[23239]: file.c:638 ast_readaudio_callback: Failed to write frame



By: Tilghman Lesher (tilghman) 2007-11-15 08:27:55.000-0600

Yes, I know valgrind slows it down to the point where it doesn't crash, but valgrind *ALSO* points out where it *MIGHT* have otherwise crashed, which is the output I need to fix the problem.

Please upload your latest valgrind.txt (derived after all patches were added).

By: Atis Lezdins (atis) 2007-11-15 09:07:55.000-0600

Ok, it's nice to hear this. This info is worth to add into valgrind.txt - if i would have known that i wouldn't have deleted previous outputs.

Here goes latest one, cleaned as before. I also increased AST_MAX_LOCKS to 64, but i still got those errors mentioned before.

By: Digium Subversion (svnbot) 2007-11-15 12:03:48.000-0600

Repository: asterisk
Revision: 89298

U   branches/1.4/channels/chan_iax2.c

------------------------------------------------------------------------
r89298 | tilghman | 2007-11-15 12:03:46 -0600 (Thu, 15 Nov 2007) | 5 lines

Yet another memory corruption issue.
Reported by: atis
Patch by: tilghman
Fixes issue ASTERISK-10477

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-11-15 12:09:26.000-0600

Repository: asterisk
Revision: 89299

_U  trunk/
U   trunk/channels/chan_iax2.c

------------------------------------------------------------------------
r89299 | tilghman | 2007-11-15 12:09:26 -0600 (Thu, 15 Nov 2007) | 13 lines

Merged revisions 89298 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r89298 | tilghman | 2007-11-15 12:05:56 -0600 (Thu, 15 Nov 2007) | 5 lines

Yet another memory corruption issue.
Reported by: atis
Patch by: tilghman
Fixes issue ASTERISK-10477

........

------------------------------------------------------------------------

By: Tilghman Lesher (tilghman) 2007-11-15 12:16:13.000-0600

Okay, please update to revision 89298 or higher before trying another session of valgrind.  I note that one of the problems found in the latest valgrind.txt was in a section that was already fixed in an earlier patch.

Remember, we're running more valgrind debugs to find additional memory corruption, not memory corruption that we've already found and fixed.  ;-)

By: Atis Lezdins (atis) 2007-11-16 07:23:17.000-0600

Ok, here goes valgrind output from SVN-branch-1.4-r89302.

I'm just not sure that we're moving into right direction with this. Those connection problems that arises are leading to most agents not-answering, and thus not the problem might not occur.

By: Atis Lezdins (atis) 2007-11-16 08:13:26.000-0600

The last version shows some stability. I didn't tested it before launching valgrind (output i posted), but today i'm running for some hour already and no crashes so far.

By: Tilghman Lesher (tilghman) 2007-11-16 12:25:52.000-0600

Based upon the latest valgrind output, I'm satisfied that we've fixed the majority of problems that cause these types of crashes.  I'm closing this bug out.  If you have additional problems in the future, please open a new bug.