Summary:ASTERISK-08012: SIP Segfault with high call setup volume in ast_rtp_lookup_code()
Reporter:jerjer (jerjer)Labels:
Date Opened:2006-10-25 23:50:53Date Closed:2006-12-18 11:38:39.000-0600
Versions:Frequency of
Environment:Attachments:( 0) bt_200612111610.txt
( 1) bt_200612121119.txt
( 2) bt_200612121320.txt
( 3) bt_200612121445.txt
( 4) bt_200612141105.txt
( 5) bt_200612141256.txt
( 6) bt_full.txt
( 7) bt_thread.txt
Description:Setup 75 calls at 500ms interval.   This only occured once in the testing but file said he may know what could cause this.


(gdb) bt full
#0  0xa7d48060 in free () from /lib/tls/libc.so.6
No symbol table info available.
#1  0xa7d4991e in calloc () from /lib/tls/libc.so.6
No symbol table info available.
#2  0x080e1762 in _ast_calloc (num=1, len=1028, file=0x812f453 "utils.c", lineno=832,
   func=0x812f6d5 "add_string_pool") at /usr/src/asterisk-1.4.0-beta3/include/asterisk/utils.h:342
       p = (void *) 0xb
       __PRETTY_FUNCTION__ = "_ast_calloc"
#3  0x080e2ec5 in add_string_pool (mgr=0x835b840, size=1024) at utils.c:832
       pool = (struct ast_string_field_pool *) 0x13c4
       __PRETTY_FUNCTION__ = "add_string_pool"
#4  0x080e3070 in __ast_string_field_index_build (mgr=0x835b840, fields=0x835b7ac, num_fields=37, index=32,
   format=0xa77d5f90 "SIP/2.0/UDP %s:%d;branch=z9hG4bK%08x%s") at utils.c:899
       new_size = 1024
       needed = 57
       ap1 = 0xa6ad26d4 "X\0366\b?\023"
       ap2 = 0xa6ad26d4 "X\0366\b?\023"
ASTERISK-1  0xa7793248 in build_via (p=0x835b790) at chan_sip.c:1721
       rport = 0xa77d5f86 ";rport"
ASTERISK-2  0xa77a1ed4 in reqprep (req=0xa6ad2dbc, p=0x835b790, sipmethod=8, seqno=103, newbranch=1)
   at chan_sip.c:5467
       orig = (struct sip_request *) 0x835ba98
       stripped = '\0' <repeats 79 times>
       tmp = '\0' <repeats 79 times>
       newto = "13028 RTP/AVP 0\r\n\000\000\000?????+  ", '\0' <repeats 48 times>, "????\000\000\000\000???q????\033v}??\214}?\030(  ?b?'/  !\017\000\000?\212}?,(  ,(  4E~?h(  ?\016z?'/  !\017\000\000?\212}?????\033v}??\214}?X(  ?b??/  \236\016\000\000?\212}?l(  l(  4E~??(  ?\016z??/  \236\016\000\000?\212}??z}?? 3\b\vf}?? 4\b?\rz??*"...
       c = 0x0
       ot = 0x0
       of = 0x0
       is_strict = 0
       __PRETTY_FUNCTION__ = "reqprep"
---Type <return> to continue, or q <return> to quit---
ASTERISK-3  0xa77aa9ae in transmit_request_with_auth (p=0x835b790, sipmethod=8, seqno=0, reliable=XMIT_RELIABLE,
   newbranch=1) at chan_sip.c:7319
       resp = {rlPart1 = 0x0, rlPart2 = 0x0, len = 0, headers = 0, method = 0, lines = 0, flags = 0,
 header = {0x0 <repeats 64 times>}, line = {0x0 <repeats 64 times>}, data = '\0' <repeats 4095 times>,
 sdp_start = 0, sdp_end = 0}
       __PRETTY_FUNCTION__ = "transmit_request_with_auth"
ASTERISK-4  0xa7799d5c in sip_hangup (ast=0x834e740) at chan_sip.c:3318
       audioqos = 0x836e1d0 "ssrc=1129215709;themssrc=558780065;lp=0;rxjitter=0.000000;rxcount=1;txjitter=0.000000;txcount=0;rlp=0;rtt=0.000000"
       videoqos = 0xa77d5df2 ""
       p = (struct sip_pvt *) 0x835b790
       needcancel = 0
       needdestroy = 0
       oldowner = (struct ast_channel *) 0x834e740
       __PRETTY_FUNCTION__ = "sip_hangup"
ASTERISK-5  0x08079e05 in ast_hangup (chan=0x834e740) at channel.c:1560
       res = 0
       __PRETTY_FUNCTION__ = "ast_hangup"
ASTERISK-6 0xa773f6df in dial_exec_full (chan=0x835d290, data=0xa6ad6ff8, peerflags=0xa6ad4e74) at app_dial.c:1624
       number = 0x83700e9 "1234@provider"
       end_time = 1161834240
       answer_time = 1161834220
       res = -1
       u = (struct ast_module_user *) 0x8352e80
       rest = 0x0
       cur = 0x0
       outgoing = (struct dial_localuser *) 0x0
       peer = (struct ast_channel *) 0x834e740
       to = -1
       numbusy = 0
       numcongestion = 0
       numnochan = 0
       cause = 0
---Type <return> to continue, or q <return> to quit---
       numsubst = "1234@provider\000\000\000\000????\000\000\000?\003?\003???L  e\025??\003??\000????\000\000\000\033\000\000\000?\000\000\000\000??????\000\000\000?\003?? L  "
       cidname = "????\234L  \000\000\000\000\000\000\000\000\234L  \f\0306\b \0306\b\000\000\000\000 N  ????\234L  \f\0306\b\210M  ????Xo   @\021\b?L  ?b??\000\000\000?L  "
       privdb_val = 0
       calldurationlimit = 0
       timelimit = 0
       play_warning = 0
       warning_freq = 0
       warning_sound = 0x0
       end_sound = 0x0
       start_sound = 0x0
       dtmfcalled = 0x0
       dtmfcalling = 0x0
       status = "ANSWER\000R\000K  \000\000\000\000\000\000\000\000\234K  \224M  ?M  \000\000\000\000 L  ????\234K  \224M  \210L  a??\234K  ??\022\b?L  \224M  \000\000\000\000<L  O\000\000\000\001\200 ?\224M  \224M  \224M  \224M  ?M  ?M  \224M  ?M  ", '\0' <repeats 56 times>, "????\000\000\000\000$|?\000\000\000\000????",'\0' <repeats 16 times>, "\001\000\000\000\000\000\000\000?\030???\025???q?????\000\000\000\000?\024??$F?"...
       play_to_caller = 0
       play_to_callee = 0
       sentringing = 0
       moh = 0
       outbound_group = 0x0
       result = 0
       start_time = 1161834220
       privintro = '\0' <repeats 30 times>, " s\f\000\000\000\000\000\000\000(\000\000\000\000\000\000\000?M  ", '\0' <repeats 12 times>, "????\030\000\000\000 ?\022\b??\022\bdL  \006", '\0' <repeats 19 times>, "\n\000\000\000bL  ", '\0' <repeats 40 times>, "????\000\000 d\000\000\000\000\000\000\000\000?N  \000\000 d\000\000\000\000PX\022\b", '\0' <repeats 96 times>, "\001", '\0' <repeats 15 times>, "XN  \000\000\000\000\000\000\000\000??\022\bTN  \000\000\000\000\000\000\000\000??\022\b", '\0' <repeats 144 times>, "$|?", '\0' <repeats24 times>...
       privcid = '\0' <repeats 30 times>, " d\a", '\0' <repeats 11 times>, "RX dRX\022\b\000\000\000\000????---Type <return> to continue, or q <return> to quit---
QX\022\b\001\000\000\000xL  \037??\234L  QX\022\b\001", '\0' <repeats 15 times>, "???\000\000\000\000\000\000\000\000PL  \000\000\000\000\000\000\000\000?K  ", '\0' <repeats 24 times>, "\001\000\000\000\223A?\000\000\000\000\000\000\000\000?L   ?\022\b\000\000\000\000??\022\b?L  \000\000\000\000??????\022\bRX\022\b6X\022\b?K  \006", '\0' <repeats 19 times>, "\n\000\000\000?K  ", '\0' <repeats 15 times>
       parse = 0xa6ad4130 "SIP"
       opermode = 0
       args = {argc = 1, argv = 0xa6ad4628, peers = 0xa6ad4130 "SIP", timeout = 0x0, options = 0x0,
 url = 0x0}
       opts = {flags = 0}
       opt_args = {0x1b <Address 0x1b out of bounds>, 0x812e7bd "", 0x812e7a7 "%c[%d;%sm%s%c[0;%d;%dm",
 0xa6ad4ae4 "", 0x6 <Address 0x6 out of bounds>, 0x0, 0x0, 0x0, 0x0}
       __PRETTY_FUNCTION__ = "dial_exec_full"
ASTERISK-7 0xa773f835 in dial_exec (chan=0x835d290, data=0xa6ad6ff8) at app_dial.c:1653
       peerflags = {flags = 0}
ASTERISK-8 0x080aaeee in pbx_exec (c=0x835d290, app=0x817b4a0, data=0xa6ad6ff8) at pbx.c:505
       res = 135922698
       saved_c_appl = 0x0
       saved_c_data = 0x0
ASTERISK-9 0x080add43 in pbx_extension_helper (c=0x835d290, con=0x0, context=0x835d410 "testing",
   exten=0x835d460 "2345", priority=1, label=0x0, callerid=0x8370910 "sipp", action=E_SPAWN) at pbx.c:1791
       e = (struct ast_exten *) 0x817d800
       app = (struct ast_app *) 0x817b4a0
       res = 137784008
       q = {incstack = {0x0 <repeats 128 times>}, stacklen = 0, status = 5, swo = 0x0, data = 0x0,
 foundcontext = 0x835d410 "testing"}
       passdata = "SIP/1234@provider", '\0' <repeats 8174 times>
       matching_action = 0
       __PRETTY_FUNCTION__ = "pbx_extension_helper"
ASTERISK-10 0x080aec74 in ast_spawn_extension (c=0x835d290, context=0x835d410 "testing", exten=0x835d460 "2345",
   priority=1, callerid=0x8370910 "sipp") at pbx.c:2245
No locals.
ASTERISK-11 0x080af0f0 in __ast_pbx_run (c=0x835d290) at pbx.c:2345
       dst_exten = "\000\000\000\000\000\000\000\000h\000\000\000\000\017}\000?\223  ?????\222  4\223  h\000---Type <return> to continue, or q <return> to quit---
\000\000?\222  4\223  \234\223  $|??\223  ?\223  \000\000\000\000?\223  ?\223  ?\223  ?\223  ", '\0' <repeats 28 times>, "\020\000\000\000?\024?? \223  ??\n\b\000\000\000\000\000\000\000\000?o??\000???\000\000\000\000\000\000\000\000\234\223  \000\000\000\000?\223   \223  \000\000\000\000?\223  ?\223  \220Y??\223  \f", '\0' <repeats 19 times>, "?????\024??\021C\001\000?\223  \036\231??\024?"...
       pos = 0
       digit = 0
       found = 1
       res = 0
       autoloopflag = 0
       error = 0
       __PRETTY_FUNCTION__ = "__ast_pbx_run"
ASTERISK-12 0x080afcd4 in pbx_thread (data=0x835d290) at pbx.c:2556
       c = (struct ast_channel *) 0x835d290
ASTERISK-13 0x080e2595 in dummy_start (data=0x81a04b0) at utils.c:544
       _buffer = {__routine = 0x8067aa8 <ast_unregister_thread>, __arg = 0xa6ad9bb0,
 __canceltype = -1478565698, __prev = 0x0}
       ret = (void *) 0xa6ad9468
       a = {start_routine = 0x80afcbd <pbx_thread>, data = 0x835d290,
 name = 0x833bac0 "pbx_thread", ' ' <repeats 11 times>, "started at [ 2580] pbx.c ast_pbx_start()"}
ASTERISK-14 0xa7f7d0bd in start_thread () from /lib/tls/libpthread.so.0
No symbol table info available.
ASTERISK-15 0xa7dad8fe in clone () from /lib/tls/libc.so.6
Comments:By: jerjer (jerjer) 2006-10-26 11:21:16

After further testing I am able to very easily duplicate this same type of crash (calloc/free) at various call setup levels on a mini-itx VIA C3 processor by generating calls from a Dual Xeon Dell machine running SIPp.

Russell and I are starting to suspect a bug in glibc (v2.3.6)

By: Joseph Yannaccone (jyannaccone) 2006-10-27 14:19:33

I believe I am experiencing the same problem.  I am placing a random test load of 1-36 calls every 1-5 seconds (with a 5 second call duration) on an Asterisk 1.4b3 test server.  The test load is running between a pair of SIP peers (the calls are initiated by a peer named test_peer and terminated by a peer named test_gw) that connect through the Asterisk server.  I am using a default build and (mostly) default configuration.  The only non-default configurations are sip.conf and extensions.conf (configured as shown below).







exten => s,1,Hangup()

exten => _1561NXXXXXX,1,Dial(SIP/${EXTEN:1}@test_gw)
exten => _1561NXXXXXX,n,Congestion()
exten => _561NXXXXXX,1,Dial(SIP/${EXTEN}@test_gw)
exten => _561NXXXXXX,n,Congestion()


The Asterisk server crashes with a segfault after anywhere from 1-5 minutes of this test load.  However, I am not certain that this is necessarily a load issue  and that it wouldn't eventually occur with low volumes of traffic (the high load just seems to accelerate reaching the condition).  I will run an extended test this weekend to determine if the same problem occurs with a low traffic load after a longer period of time.

I was previously running Asterisk v1.2.7.1 on the same test server and it completed 24 hours of the exact same test load with no problem.

I ran this test more than a dozen times, each time experiencing the segmentation fault.  However, I did find that the segfault did seem to occur at a different place each time.  Below are the backtraces of the several core dumps that occurred during my testing.


Core dump from test 1:

(gdb) bt full
#0  0x00322450 in _int_malloc () from /lib/libc.so.6
No symbol table info available.
#1  0x003239a2 in calloc () from /lib/libc.so.6
No symbol table info available.
#2  0x008702e3 in __sip_reliable_xmit (p=0x96f8cc0, seqno=102, resp=0,
   data=0xb67e1ef8 "INVITE sip:5617893727@ SIP/2.0\r\nVia: SIP/2.0/UDP;branch=z9hG4bK0f8afb6d;rport\r\nFrom: \"5617618610\" <sip:test_gw@>;tag=as76a4a48b\r\nTo: <sip:5617893727@>"...,
   len=759, fatal=1, sipmethod=5)
   at /usr/src/asterisk-1.4.0-beta3/include/asterisk/utils.h:342
       __PRETTY_FUNCTION__ = "__sip_reliable_xmit"
#3  0x0087a7ea in send_request (p=0x96f8cc0, req=0xb67e1cdc,
   reliable=XMIT_CRITICAL, seqno=102) at chan_sip.c:2147
       res = 58384
#4  0x0088ab8a in transmit_invite (p=0x96f8cc0, sipmethod=5, sdp=1, init=2)
   at chan_sip.c:6679
       content = Variable "content" is not available.


Core dump from test 2:

(gdb) bt full
#0  0x080c872c in p2p_rtp_callback (id=0x8be4dd8, fd=406, events=1,
   cbdata=0x8bc8e80) at rtp.c:2797
       res = 172
       sin = {sin_family = 2, sin_port = 18473, sin_addr = {
   s_addr = 218173612}, sin_zero = "\000\000\000\000\000\000\000"}
       len = 16
       header = Variable "header" is not available.


Core dump from test 3:

(gdb) bt full
#0  0x0032018e in malloc_consolidate () from /lib/libc.so.6
No symbol table info available.
#1  0x0032235a in _int_malloc () from /lib/libc.so.6
No symbol table info available.
#2  0x003239a2 in calloc () from /lib/libc.so.6
No symbol table info available.
#3  0x009d94f3 in sip_alloc (
   callid=0xb7b5a2b5 "37ee39032a47a19211ef1ae663dca5ff@",
   sin=0xb7b5b2c0, useglobal_nat=1, intended_method=5)
   at /usr/src/asterisk-1.4.0-beta3/include/asterisk/utils.h:342
       p = Variable "p" is not available.


The server is running a vanilla Fedora Core 5 setup (using glibc 2.4).

I have the core files for each test (or could perform additional testing) if that would be helpful.

By: Olle Johansson (oej) 2006-10-29 10:11:18.000-0600

Well, one of the core dumps seems to have no relationship to memory allocation. The others have. Have you tested with the memory alloc tracing turned on?

By: Olle Johansson (oej) 2006-10-29 10:13:12.000-0600

Does calloc call free() ? Check JerJer's backtrace... And when does calloc cause segfault?

By: Olle Johansson (oej) 2006-10-29 14:52:19.000-0600

Please check if this is related to ASTERISK-7801 - crashes in memory allocation with old glibc. Thanks.

By: Joseph Yannaccone (jyannaccone) 2006-10-30 12:03:16.000-0600

I have not tested with memory allocation tracing turned on.  Not sure how to do that but if you tell me what to do I will do it and post the results.

I am not certain if optimizations are on or off because 'make dont_optimize' doesn't work with 1.4 and I'm not sure how to pass the DONT_OPTIMIZE flag to menuselect.

Also, I looked at ASTERISK-7801, but that appears to be an issue related to using an old version of glibc.  I am using glibc v2.4 (the version that comes with a standard Fedora Core 5 distribution).

By: Joseph Yannaccone (jyannaccone) 2006-11-03 12:55:54.000-0600

I figured out how to turn on the DONT_OPTIMIZE and MALLOC_DEBUG (for memory allocation tracing) options.  I ran the test again.  To simplify things, I set the call generator to setup 72 calls every 10 seconds (each call with a 5 second duration).  This made the maximum number of simultaneous calls more deterministic.  This test scenario caused Asterisk to crash in less than a minute.

I captured gdb output from the resulting core file and attached it to this ticket (in the standard bt_full.txt and bt_thread.txt format).

I performed another test with the call generator set to setup 48 calls every 10 seconds and experienced the same results (segfault occurred in the same place).

By: Olle Johansson (oej) 2006-11-06 09:24:07.000-0600

Anyone that can find a pattern in all of this? I don't see Asterisk crashing in the SIP channel, even though that's where you put a lot of pressure...

By: Olle Johansson (oej) 2006-11-12 15:20:39.000-0600

Maybe we should test on FreeBSD/OpenBSD to see if there's a difference.

By: Joseph Yannaccone (jyannaccone) 2006-11-15 14:43:31.000-0600

Testing on another platform sounds like a reasonable idea but I don't have access to any non-Fedora systems.  Can anyone else perform a high call volume test on FreeBSD or OpenBSD?

By: Terry Wilson (twilson) 2006-12-07 10:08:58.000-0600

We are seeing this crash about 5 times a day on our server that is handling Queue calls.  Here is the bt full (with DONT_OPTIMIZE)

#0  0x081383d9 in p2p_rtp_callback (id=0x84685e0, fd=47, events=1, cbdata=0x8435228) at rtp.c:2805
2805            else if (rtp->rtcp && rtp->rtcp->s == fd)
(gdb) bt full
#0  0x081383d9 in p2p_rtp_callback (id=0x84685e0, fd=47, events=1, cbdata=0x8435228) at rtp.c:2805
       res = 32
       hdrlen = 12
       sin = {sin_family = 2, sin_port = 19008, sin_addr = {s_addr = 3089234751}, sin_zero = "\000\000\000\000\000\000\000"}
       len = 16
       header = (unsigned int *) 0x84352ac
       rtp = (struct ast_rtp *) 0x8435228
       is_rtp = 0
       is_rtcp = 0
       __PRETTY_FUNCTION__ = "p2p_rtp_callback"
#1  0x08111f83 in ast_io_wait (ioc=0x83faca8, howlong=16) at io.c:279
       res = 3
       x = 2
       origcnt = 9
#2  0xb618acb8 in do_monitor (data=0x0) at chan_sip.c:14799
       res = 16
       sip = (struct sip_pvt *) 0x0
       peer = (struct sip_peer *) 0x0
       t = 1165507277
       fastrestart = 0
       lastpeernum = -1
       curpeernum = 8
       reloading = 0
---Type <return> to continue, or q <return> to quit---
       __PRETTY_FUNCTION__ = "do_monitor"
#3  0x0815cfd9 in dummy_start (data=0x83fca90) at utils.c:545
       _buffer = {__routine = 0x80d9d25 <ast_unregister_thread>, __arg = 0xb6137bb0, __canceltype = 0, __prev = 0x0}
       ret = (void *) 0x0
       a = {start_routine = 0xb618a7e8 <do_monitor>, data = 0x0,
 name = 0x83fcaa0 "do_monitor", ' ' <repeats 11 times>, "started at [14858] chan_sip.c restart_monitor()"}
#4  0xb7fcb371 in start_thread () from /lib/tls/libpthread.so.0
No symbol table info available.
ASTERISK-1  0xb7e5bffe in clone () from /lib/tls/libc.so.6
No symbol table info available.

By: Curt Moore (jcmoore) 2006-12-07 10:59:51.000-0600

To add to twilson's comment in (0056071), the server this is on is CentOS 4.4, which is RedHat/Fedora really, with glibc-common-2.3.4-2.25 and gcc-3.4.6-3.  So it would appear that this issue is not only limited to a particular distro such as the BSD based distros.  What version of glibc and gcc are being used on BSD?  Don't know if that has anything to do with this but may be helpful to have in the bug.

By: Curt Moore (jcmoore) 2006-12-11 09:45:55.000-0600

On Friday, file made some changes to rtp.c which we tested on one of my servers over the weekend and it appeared to help this particular segfault out a good deal.  He commited these changes to 1.4 earlier today in r48381.  I'm still seeing a segfault but it is no longer in rtp.c and happens much less frequently than before.  I'll have to probe deeper and see exactly what is causing this new segfault.  Thanks file!

Could everyone else update to the latest 1.4 so we can see if it fixes everyone else's issue as well?

By: Serge Vecher (serge-v) 2006-12-11 10:47:19.000-0600

tgrman: please don't forget to attach a new bt. Thanks for the update

By: Curt Moore (jcmoore) 2006-12-11 16:19:42.000-0600

I just added a backtrace bt_200612111610.txt of my latest segfault.  I don't think it has to do with the RTP changes file made over the weekend but I've now seen a segfault in the same place twice.  This leads me to beleive that there is something systematic going on as the segfaults are happening in the same place each time.  Nothing is jumping out at me off hand so comments are welcome.

As I mentioned, I think that this is a seperate segfault from the one file fixed with his changes to rtp.c so should we leave this here or open another bug?  Bug marshalls please comment.

By: Curt Moore (jcmoore) 2006-12-12 11:23:22.000-0600

Latest segfault, did again appear to be in rtp.c.  Backtrace file is bt_200612121119.txt.

By: Curt Moore (jcmoore) 2006-12-12 14:53:58.000-0600

It would appear that there are at least 2 other issues at play here as I've now seen 4 segfaults; 2 of which happen in the same place in the code, and the other 2 happen in a different same place in the code.  One appears to still be in rtp.c and the other appears to have something to do with logging to syslog.

For now, I've turned off logging to syslog to see if I still get that particular segfault.  So far, I'm only still seeing the segfault in rtp.c related to ast_rtp_lookup_code().

By: Joshua C. Colp (jcolp) 2006-12-13 21:38:13.000-0600

I just totally redid how RTCP is handled, give 1.4 as of revision 48461 if you saw issues with it.

By: Curt Moore (jcmoore) 2006-12-14 09:57:52.000-0600

Upgraded to latest 1.4 around 1am CST this morning.  So far, so good.  Will post any updates.

By: Curt Moore (jcmoore) 2006-12-14 11:07:53.000-0600

Backtrace of latest rtp related segfault attached, bt_200612141105.txt

By: Joshua C. Colp (jcolp) 2006-12-14 11:36:55.000-0600

Okay I isolated the issue. Try revision 48472 a try :D

By: Curt Moore (jcmoore) 2006-12-14 12:59:22.000-0600

Latest RTP related backtrace attached, bt_200612141256.txt.  Sorry file. :-(

By: Joshua C. Colp (jcolp) 2006-12-15 14:00:50.000-0600

Yet again... try as of revision 48506.

By: Serge Vecher (serge-v) 2006-12-15 15:03:59.000-0600

tgrman: your input will be appreciated ...

By: Curt Moore (jcmoore) 2006-12-15 15:06:13.000-0600

So far, the changes file made yesterday to rtp.c directly on my server have worked perfectly, no crashes.  I don't think that file has commited these back into SVN as of yet.

By: Joshua C. Colp (jcolp) 2006-12-18 11:38:37.000-0600

This should be fixed in 1.4 as of revision 48506 and trunk as of revision 48507. If the issue still exists, please feel free to reopen this bug. Peace all!