[Home]

Summary:ASTERISK-15819: [patch] internal_ao2_ref fails to check if null returned from INTERNAL_OBJ
Reporter:Alec Davis (alecdavis)Labels:
Date Opened:2010-03-16 05:32:55Date Closed:2010-09-20 17:34:35
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20100427__issue17037.diff.txt
( 1) bug17037.diff.txt
( 2) gdb.10128.txt
( 3) gdb.10771.txt
( 4) gdb.270510.txt
( 5) gdb.9473.txt
( 6) valgrind.txt
Description: == Spawn extension (phones, 10015, 2) exited non-zero on 'Local/10015@phones-f860;2'
[Mar 16 23:14:00] ERROR[30417]: astobj2.c:116 INTERNAL_OBJ: bad magic number 0x0 for 0x85dc240

Bad magic number typically would trigger a segfault.

****** ADDITIONAL INFORMATION ******

repeatable with following test dialplan.
When 10000 is diall the local channel is looped 20 times, and would immediately segfault after some 'Exceptionly long queued frames...

[phones]
exten => _10020,1,Answer()
exten => _10020,n,Playback(test-tones-follow)
exten => _10020,n,Milliwatt()

exten => _1XXXX,1,Set(i=${MATH(${EXTEN}+1,int)})
exten => _1XXXX,n,Dial(Local/${i}@phones,,r)
Comments:By: Alec Davis (alecdavis) 2010-03-16 05:37:13

1st part of bugs17037.diff.txt is the fix.

2nd part of patch
'if (!obj)' changed to 'if (obj == NULL)' to be consistant with rest of module.

By: Russell Bryant (russell) 2010-03-18 08:14:08

This patch is covering a symptom rather than the real bug.  The "bad magic number" message is a very bad thing and should _never_ occur.  I don't have a problem with this patch, but someone needs to figure out what is passing in the invalid pointer and fix that, too.

By: Alec Davis (alecdavis) 2010-03-18 16:03:03

I agree, it's covering a symptom.
In the few instances that internal_ao2_ref is called, the 'bad magic number' has immediately been verifed by INTERNAL_OBJ.

As corydon-76 has previously suggested in IRC, 'C' stack space may be exhausted, that discussion was in the context of 'how many looped local calls should we be able to set up'.

I'm wondering, if a LocalChannel has been optimized out between the first check by INTERNAL_OBJ and then call to internal_ao2_ref.

By: Digium Subversion (svnbot) 2010-03-19 02:37:02

Repository: asterisk
Revision: 253490

U   trunk/main/astobj2.c

------------------------------------------------------------------------
r253490 | alecdavis | 2010-03-19 02:37:01 -0500 (Fri, 19 Mar 2010) | 19 lines

prevent segfault if bad magic number is encountered.

internal_ao2_ref uses INTERNAL_OBJ which mzy report 'bad magic number', but
internal_ao2_ref continues on, causing segfault.

Although AO2_MAGIC number is checked by INTERNAL_OBJ before internal_ao2_ref is
called, A02_MAGIC is being destroyed (or a wrong pointer) by the time
internal_ao2_ref uses INTERNAL_OBJ.

internal_ao2_ref now returns -1 if INTERNAL_OBJ encouters a bad magic number.

(issue ASTERISK-15819)
Reported by: alecdavis
Patches:
     bug17037.diff.txt uploaded by alecdavis (license 585)
Tested by: alecdavis



------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=253490

By: Digium Subversion (svnbot) 2010-03-19 02:59:35

Repository: asterisk
Revision: 253491

_U  branches/1.6.1/
U   branches/1.6.1/main/astobj2.c

------------------------------------------------------------------------
r253491 | alecdavis | 2010-03-19 02:59:35 -0500 (Fri, 19 Mar 2010) | 24 lines

Merged revisions 253490 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r253490 | alecdavis | 2010-03-19 20:37:00 +1300 (Fri, 19 Mar 2010) | 19 lines
 
 prevent segfault if bad magic number is encountered.
 
 internal_ao2_ref uses INTERNAL_OBJ which mzy report 'bad magic number', but
 internal_ao2_ref continues on, causing segfault.
 
 Although AO2_MAGIC number is checked by INTERNAL_OBJ before internal_ao2_ref is
 called, A02_MAGIC is being destroyed (or a wrong pointer) by the time
 internal_ao2_ref uses INTERNAL_OBJ.
 
 internal_ao2_ref now returns -1 if INTERNAL_OBJ encouters a bad magic number.
 
 (issue ASTERISK-15819)
 Reported by: alecdavis
 Patches:
       bug17037.diff.txt uploaded by alecdavis (license 585)
 Tested by: alecdavis
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=253491

By: Digium Subversion (svnbot) 2010-03-19 03:05:07

Repository: asterisk
Revision: 253492

_U  branches/1.6.2/
U   branches/1.6.2/main/astobj2.c

------------------------------------------------------------------------
r253492 | alecdavis | 2010-03-19 03:05:06 -0500 (Fri, 19 Mar 2010) | 24 lines

Merged revisions 253490 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r253490 | alecdavis | 2010-03-19 20:37:00 +1300 (Fri, 19 Mar 2010) | 19 lines
 
 prevent segfault if bad magic number is encountered.
 
 internal_ao2_ref uses INTERNAL_OBJ which mzy report 'bad magic number', but
 internal_ao2_ref continues on, causing segfault.
 
 Although AO2_MAGIC number is checked by INTERNAL_OBJ before internal_ao2_ref is
 called, A02_MAGIC is being destroyed (or a wrong pointer) by the time
 internal_ao2_ref uses INTERNAL_OBJ.
 
 internal_ao2_ref now returns -1 if INTERNAL_OBJ encouters a bad magic number.
 
 (issue ASTERISK-15819)
 Reported by: alecdavis
 Patches:
       bug17037.diff.txt uploaded by alecdavis (license 585)
 Tested by: alecdavis
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=253492

By: Russell Bryant (russell) 2010-03-20 06:38:05

Have you run this test under valgrind?  I would be curious what errors it produces.

By: Alec Davis (alecdavis) 2010-03-22 05:55:45

uploaded valgrind.txt

I was unable to get asterisk to segfault while running under valgrind.
But would lockup often.

By: Tilghman Lesher (tilghman) 2010-04-27 15:42:59

I think I've found the actual cause.  When we masquerade, it appears that we weren't saving a reference to the original channel, so when we attempted to unlink it from the channels list, it was effectively disappearing while we worked on it.

Patch uploaded to fix this oversight (trunk only).

By: Alec Davis (alecdavis) 2010-04-28 03:24:20

Applied patch. tried with 64 loops and 32 loops, with full debug set in logger.conf and got segfaults.
gdb.10128.txt
gdb.10771.txt

removed full debugging from logger.conf.
got 'bad magic number' and ''Exceptionly long queued frames...' but system became totally unusable, disk trashing, unable to login after 15 minutes.
System required reset :(

Asterisk SVN-trunk-r259617



By: Alec Davis (alecdavis) 2010-04-28 04:58:29

After restart, reverted patch.
reporting this as trunk changed since initial report.

No full debug set logger.conf
segfault gdb.9473.txt

By: Alec Davis (alecdavis) 2010-04-29 06:26:26

with patch applied again.
Started at extension 10004 and dialplan stopped at extension 10020.

Note the warning below, and then "Exceptionally long voice q.."
Once the "Exceptionally long" message appears, there is not recovery.

Also the console order of PBX execution, and "Local/100XX .. is Ringing" are out of order.

[Apr 29 23:10:03] WARNING[3927]: channel.c:5401 ast_do_masquerade: Fixup failed on channel Local/10019@phones-256c;1<MASQ>, strange things may happen.
[Apr 29 23:10:03] WARNING[3927]: channel.c:5406 ast_do_masquerade: Hangup failed!  Strange things may happen!

console output:
*CLI>     -- Starting simple switch on 'DAHDI/35-1'
   -- Executing [10004@phones:1] Set("DAHDI/35-1", "i=10005") in new stack
   -- Executing [10004@phones:2] Dial("DAHDI/35-1", "Local/10005@phones,,r") in new stack
   -- Called 10005@phones
   ...
   -- Called 10016@phones
   -- Executing [10016@phones:1] Set("Local/10016@phones-1495;2", "i=10017") in new stack
   -- Local/10015@phones-e74e;1 is ringing
   -- Local/10014@phones-bcd5;1 is ringing
   -- Executing [10016@phones:2] Dial("Local/10016@phones-1495;2", "Local/10017@phones,,r") in new stack
   -- Called 10017@phones
   -- Executing [10017@phones:1] Set("Local/10017@phones-b9b3;2", "i=10018") in new stack
   -- Executing [10017@phones:2] Dial("Local/10017@phones-b9b3;2", "Local/10018@phones,,r") in new stack
   -- Local/10016@phones-1495;1 is ringing
   -- Local/10015@phones-e74e;1 is ringing
   -- Called 10018@phones
   -- Local/10017@phones-b9b3;1 is ringing
   -- Local/10016@phones-1495;1 is ringing
   -- Executing [10018@phones:1] Set("Local/10018@phones-9bfc;2", "i=10019") in new stack
   -- Executing [10018@phones:2] Dial("Local/10018@phones-9bfc;2", "Local/10019@phones,,r") in new stack
   -- Called 10019@phones
   -- Executing [10019@phones:1] Set("Local/10019@phones-256c;2", "i=10020") in new stack
   -- Local/10018@phones-9bfc;1 is ringing
   -- Executing [10019@phones:2] Dial("Local/10019@phones-256c;2", "Local/10020@phones,,r") in new stack
   -- Local/10017@phones-b9b3;1 is ringing
   -- Executing [10020@phones:1] Answer("Local/10020@phones-485d;2", "") in new stack
   -- Called 10020@phones
   -- Local/10019@phones-256c;1 is ringing
   -- Local/10018@phones-9bfc;1 is ringing
   -- Local/10020@phones-485d;1 answered Local/10019@phones-256c;2
   -- Local/10019@phones-256c;1 answered Local/10018@phones-9bfc;2
   -- Local/10018@phones-9bfc;1 answered Local/10017@phones-b9b3;2
   -- Local/10017@phones-b9b3;1 answered Local/10016@phones-1495;2
   -- Local/10016@phones-1495;1 answered Local/10015@phones-e74e;2
   -- Executing [10020@phones:2] Playback("Local/10020@phones-485d;2", "test-tones-follow") in new stack
   -- Local/10015@phones-e74e;1 answered Local/10014@phones-bcd5;2
   -- Local/10014@phones-bcd5;1 answered Local/10013@phones-7fb2;2
   -- Local/10013@phones-7fb2;1 answered Local/10012@phones-c40c;2
   -- Local/10012@phones-c40c;1 answered Local/10011@phones-ab4f;2
   -- Local/10011@phones-ab4f;1 answered Local/10010@phones-67b6;2
   -- Local/10010@phones-67b6;1 answered Local/10009@phones-d243;2
   -- Local/10009@phones-d243;1 answered Local/10008@phones-2977;2
   -- Local/10008@phones-2977;1 answered Local/10007@phones-d0c5;2
   -- Local/10007@phones-d0c5;1 answered Local/10006@phones-f39d;2
   -- Local/10006@phones-f39d;1 answered Local/10005@phones-79fa;2
   -- Local/10005@phones-79fa;1 answered Local/10004@phones-0731;2
   -- Local/10004@phones-0731;1 answered DAHDI/35-2
   -- <Local/10020@phones-485d;2> Playing 'test-tones-follow.slin' (language 'en')
 == Spawn extension (phones, 10017, 2) exited non-zero on 'Local/10017@phones-b9b3;2'
<b>[Apr 29 23:10:03] WARNING[3927]: channel.c:5401 ast_do_masquerade: Fixup failed on channel Local/10019@phones-256c;1<MASQ>, strange things may happen.
[Apr 29 23:10:03] WARNING[3927]: channel.c:5406 ast_do_masquerade: Hangup failed!  Strange things may happen!</b>
 == Spawn extension (phones, 10016, 2) exited non-zero on 'Local/10016@phones-1495;2'
 == Spawn extension (phones, 10015, 2) exited non-zero on 'Local/10015@phones-e74e;2'
 == Spawn extension (phones, 10018, 2) exited non-zero on 'Local/10018@phones-9bfc;2'
 == Spawn extension (phones, 10014, 2) exited non-zero on 'Local/10014@phones-bcd5;2'
 == Spawn extension (phones, 10013, 2) exited non-zero on 'Local/10013@phones-7fb2;2'
 == Spawn extension (phones, 10012, 2) exited non-zero on 'Local/10012@phones-c40c;2'
 == Spawn extension (phones, 10011, 2) exited non-zero on 'Local/10011@phones-ab4f;2'
 == Spawn extension (phones, 10010, 2) exited non-zero on 'Local/10010@phones-67b6;2'
 == Spawn extension (phones, 10009, 2) exited non-zero on 'Local/10009@phones-d243;2'
 == Spawn extension (phones, 10008, 2) exited non-zero on 'Local/10008@phones-2977;2'
 == Spawn extension (phones, 10007, 2) exited non-zero on 'Local/10007@phones-d0c5;2'
 == Spawn extension (phones, 10006, 2) exited non-zero on 'Local/10006@phones-f39d;2'
 == Spawn extension (phones, 10005, 2) exited non-zero on 'Local/10005@phones-79fa;2'
 == Spawn extension (phones, 10004, 2) exited non-zero on 'Local/10004@phones-0731;2'
 == Spawn extension (phones, 10003, 2) exited non-zero on 'DAHDI/35-2'
   -- Hanging up on 'DAHDI/35-2'
   -- Hungup 'DAHDI/35-2'
   -- Executing [10020@phones:3] Milliwatt("Local/10020@phones-485d;2", "") in new stack
[Apr 29 23:10:05] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:06] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:08] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:09] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:10] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:12] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:13] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:14] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:15] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
[Apr 29 23:10:17] WARNING[3932]: channel.c:1174 __ast_queue_frame: Exceptionally long voice queue len
...
core show channels
Channel              Location             State   Application(Data)
<b>Local/10019@phones-2 10019@phones:2       Up      Dial(Local/10020@phones,,r)</b>
Local/10020@phones-c (None)               Up      AppDial((Outgoing Line))
Local/10020@phones-c 10020@phones:3       Ringing Milliwatt()
<b>Local/10019@phones-2 (None)               Up      AppDial((Outgoing Line))</b>
Local/10020@phones-4 10020@phones:3       Up      Milliwatt()
Local/10020@phones-4 (None)               Up      AppDial((Outgoing Line))
DAHDI/35-1           10003@phones:2       Up      Dial(Local/10004@phones,,r)
7 active channels
4 active calls
53 calls processed



By: Alec Davis (alecdavis) 2010-05-12 15:00:47

I've found if I wait for all of the channels to be optimized out by adding the additional Wait(1), then seems to work every time.

If the Playback starts before all channels are optimized out, you then get
'Failed to Fixup...' as highlighted in ~121149/

[phones]
exten => _10020,1,Answer()
exten => _10020,n,Wait(1)
exten => _10020,n,Playback(test-tones-follow)
exten => _10020,n,Milliwatt()

exten => _1XXXX,1,Set(i=${MATH(${EXTEN}+1,int)})
exten => _1XXXX,n,Dial(Local/${i}@phones,,r)

By: Tilghman Lesher (tilghman) 2010-05-26 16:40:53

Does the problem go away if you add '/n' onto the end of the Local channel?

By: Alec Davis (alecdavis) 2010-05-27 01:56:56

Problem goes away using '/n' on end of Local channel.

using following dialplan, dialling 10020 creates 20 calls, and stays up.
Using same dialplan but without '/n' causes crash. <pre>exten => _10000,1,Answer()
exten => _10000,n,Playback(test-tones-follow)
exten => _10000,n,Milliwatt()

exten => _1XXXX,1,Set(i=${MATH(${EXTEN}-1,int)})
exten => _1XXXX,n,Dial(Local/${i}@phones/n,,r)</pre>

Too many loops causes, but still no segfaults.
<b>[May 27 18:55:44] WARNING[14836]: channel.c:1044 __ast_queue_frame: Exceptionally long queue length queuing to Local/10000@phones-c3f8;2</b>

By: Alec Davis (alecdavis) 2010-05-27 03:46:33

gdb.270510.txt backtrace of latest segfault
This backtrace is repeatable with the dialplan from ~122544 without usng the '/n' on the end of LocalChannel



By: Ramon Peek-Fares (ramonpeek) 2010-07-23 09:34:51

I think I have the same problem reported through issue: ASTERISK-14975

If you read my notes https://issues.asterisk.org/view.php?id=16057#123996
You notice the same problem with local channels and the "/n" option that seems to fix it.
Since I've upgraded to 1.4.33 the CLI now also displays: "Exceptionally long voice queue length queuing to Local/xxx@!HG-xxxx", just like mentioned in this case.

However I have applied the patch mentioned in ASTERISK-16118 which might be the reason why I don't see the message "Hangup failed! Strange things may happen!" anymore.


Maybe someone could confirm (or not) my suspected relationship between these cases?

By: Tilghman Lesher (tilghman) 2010-07-23 10:28:09

Since the original problem has long since been fixed and since the remaining problem has a workaround by using the "n" option to the Local channel, I'm closing this issue out.