[Home]

Summary:ASTERISK-04754: Segfault when doing tranfers with a SNOM phone
Reporter:Guenther Starnberger (gst)Labels:
Date Opened:2005-08-02 05:12:58Date Closed:2011-06-07 14:00:22
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/Transfers
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk-full-crash1.gz
( 1) asterisk-full-crash2.gz
( 2) asterisk-full-crash3.gz
( 3) bt-full-crash1.txt
( 4) bt-full-crash2.txt
( 5) bt-full-crash3.txt
( 6) cdrnull.txt
( 7) gdb4.txt
( 8) gdb-without-pyastre.txt
( 9) gdb-with-pyastre.txt
(10) log.txt
(11) log4.txt
(12) log6.txt.gz
(13) log6-gdb.txt
(14) log-without-pyastre.txt
(15) log-with-pyastre.txt
(16) snom-sip-trace1.txt
Description:This bug was found by a coworker and I'm currently out of office - so I apologize for the description of the bug not being 100% clear. I'll have the possibility to test this myself in the next days - just mail me if you need any further information.

I already reported another bug on call transfers (ASTERISK-4634) - I don't think that this one is related as i don't match on the hangup extension any more.

Our extension.conf section is:
[incoming-sip]
exten => _X.,1,Python(incoming_sip_dial,${EXTEN},${SIPDOMAIN},${SIP_HEADER(Proxy-Authorization)},proxy)
exten => _[a-zA-Z0-9].,1,Python(incoming_sip_dial,${EXTEN},${SIPDOMAIN},${SIP_HEADER(Proxy-Authorization)},proxy)

The bug can be caused by doing the following (according to my coworker Asterisk doesn't crash everytime when doing this, sometimes it just hangs - sometimes nothing does happen):

call 85 (with the snom phone)
press hold
call 86
press 'transfer'
press the button for line 1 (which should have been already transfered)
press 'clear' (i'm currently not sure what the 'clear' button does)
segfault (maybe)

GDB shows the following information:
[New Thread 753695 (LWP 11508)]
[Thread 753695 (LWP 11508) exited]
[Thread 720925 (LWP 11504) exited]
[Thread 704540 (LWP 11503) exited]
[New Thread 770080 (LWP 11509)]
[Thread 770080 (LWP 11509) exited]
[New Thread 786465 (LWP 11510)]
[Thread 786465 (LWP 11510) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 229390 (LWP 11457)]
0x080a42b2 in ast_rtp_pt_clear (rtp=0x0) at rtp.c:652
652                     rtp->current_RTP_PT[i].isAstFormat = 0;
(gdb) bt
#0  0x080a42b2 in ast_rtp_pt_clear (rtp=0x0) at rtp.c:652
#1  0x408d35cd in process_sdp (p=0x82356f8, req=0xbdffe7c4) at
chan_sip.c:3202
#2  0x408d88cf in handle_request_invite (p=0x82356f8, req=0xbdffe7c4,
debug=1, ignore=0, seqno=6,
   sin=0x0, recount=0x0, e=0x0) at chan_sip.c:9083
#3  0x408d2969 in handle_request (p=0x82356f8, req=0xbdffe7c4,
sin=0xbdffe7b4, recount=0x0,
   nounlock=0x0) at chan_sip.c:9651
#4  0x408d0818 in sipsock_read (id=0x81f5610, fd=11, events=1,
ignore=0x0) at chan_sip.c:9783
ASTERISK-1  0x08054fcd in ast_io_wait (ioc=0x81cf018, howlong=0) at io.c:272
ASTERISK-2  0x408c8571 in do_monitor (data=0x0) at chan_sip.c:9928
ASTERISK-3  0x40026e51 in pthread_start_thread () from /lib/libpthread.so.0
ASTERISK-4  0x402f792a in clone () from /lib/libc.so.6
(gdb)


****** ADDITIONAL INFORMATION ******

I'll be able to provide more information about this problem (checking if this also happens when not using PyAstre, etc.) when I'm in the office in one of the next days. Further our firewall does include a more or less broken SIP-ALG. Asterisk is outside the firewall - the phones are inside.
Comments:By: Olle Johansson (oej) 2005-08-02 05:24:38

Attended transfers again...

By: Guenther Starnberger (gst) 2005-08-02 05:41:51

add-on: it seems that the log.txt was truncated and the following lines (at the end) are missing in the file:

Write to 46 failed: Unknown error 500
Short write: 0/6 (Unknown error 500)

further it may be possible that log.txt does start a little bit to late (and not at the beginning of the debugging). as said above i'll re-test this myself in the next days and then i'll attach some more debugging output.

By: Michael Jerris (mikej) 2005-08-02 06:23:25

please include a bt full at that time as well.  Thanks.

By: Olle Johansson (oej) 2005-08-02 06:47:30

Yes, the debug file is very incomplete. Thanks for getting a proper debug file with debug=4 and verbosity 4 and sip debug turned on.

By: Guenther Starnberger (gst) 2005-08-03 05:16:30

ok - i've now uploaded a complete sip log and a gdb bt full. i was able to reproduce the bug a few times. in most cases gdb reported a sigsegv but asterisk continued to run (it seems that just one of the subprocesses crashed). in one case i was able to reproduce a 'complete' crash of asterisk (but at this time i hadn't gdb or asterisk -r running).

the bug in asterisk seems to be caused by a problem with the snom (or with our ALG). sometimes the snom phone thinks (= displays this on the leds) that calls are still 'open' although they have already been closed. when switching back to these calls asterisk sometimes crashed. in other cases i had to play around with transfers and switch back and then asterisk crashed.

if the state of the snom phone is ok (= it doesn't display open channels although they have been closed) i'm not able to cause a segfault with transfers.

further i also removed the python app from the extensions.conf and used the following one instead (which still causes the sigsegv):
exten => 85,1,Dial(SIP/u5@voip.sysfrog.org,20,r)
exten => u5,1,Dial(SIP/u5@voip.sysfrog.org,20,r)
exten => 86,1,Dial(SIP/u6@voip.sysfrog.org,20,r)
exten => u6,1,Dial(SIP/u6@voip.sysfrog.org,20,r)
exten => 88,1,Dial(SIP/u8@voip.sysfrog.org,20,r)
exten => u8,1,Dial(SIP/u8@voip.sysfrog.org,20,r)

By: Mark Spencer (markster) 2005-08-03 13:47:55

Will it be possible to get a backtrace and log from latest CVS head?  There seems to be some bizarre behavior including an INVITE after a REFER on the *same* call number.

By: Guenther Starnberger (gst) 2005-08-04 05:15:08

I've attached 2 * 2 debug files created with the current CVS HEAD version. The first 2 files are the GDB and "asterisk -r" log when using the PyAstre module - the other 2 files are the logs when not using PyAstre but the extensions.conf I posted above. Both lead to a SIGSEGV but the PyAstre files are a little bit shorter because the SIGSEGV occured earlier.

Because of the INVITE after the REFER stuff: On the SNOM I first call 85, then 86 and then "transfer" them together. After this the SNOM still displays "Line 1" and "Line 2" as active. When I switch back to "Line 1" Asterisk does crash. Sometimes it doesn't crash instantly but after switching back and forth between the lines. (I wouldn't assume that either the SNOM or our ALG are fully RFC compliant - we had cases were the SNOM doesn't include an IP in the SDP part when it has problems using STUN).

By: Olle Johansson (oej) 2005-08-17 09:34:41

Well, Asterisk does not really do it correctly either. I'll take a log at the logs and see if I can find something...

By: xrobau (xrobau) 2005-08-19 02:38:30

I've been able to crash todays HEAD with transferring an incoming PRI (E1, wcte1xx from HEAD too) call, quite reliably. (Snom 360)
I'm heading off bush for a week to do a whole pile of ADSL installations (sigh), but I'll be able to supply SIP traces if needed when I get back (hopefully less than 7 days from the time of this post)

By: Olle Johansson (oej) 2005-08-22 01:11:41

When capturing logs, make sure that debug=4 and verbosity=4 so that we see what's going on inside Asterisk.

By: Olle Johansson (oej) 2005-08-22 01:14:57

For some reason, there's seems to be no RTP data structure allocated. Please capture a full debug (as described above) so we can see if there's any other error messages from Asterisk.

By: Guenther Starnberger (gst) 2005-08-23 07:59:52

oej: do you mean xrobau or me? the files 'log-with-pyastre.txt' and 'log-without-pyastre.txt' were both created with the debug and verbose level at 4. (using "asterisk -r" and then "set verbose 4" and "debug level 4").



By: Michael Jerris (mikej) 2005-08-24 01:11:05

gst. can you please provide the requested verbose\debugs based on current head.

By: Olle Johansson (oej) 2005-08-24 01:19:47

Check your logger conf and make sure that the console gets the verbose and debug channel - there is information missing from these log files.

By: xrobau (xrobau) 2005-08-26 06:17:11

Righto. This is a wierd one. I've been trying to get something useful from GDB, and it's not helping. I've crashed asterisk about 6 times, and each time the crash is in a different place, with totally different information.
The common factor doesn't seem to be the transfers, it's the establishing of a new call. The phones references in this (and any following SIP traces) are:

300 - SPA-2000
301 - GXP-2000 Line 1
302 - BT100
303 - Snom 360
304 - GXP-2000 Line 2
305 - GXP-2000 Line 3

The crash in this case was brought about by dialling a new call from 301 to an asterisk internal extension - nothing to do with the Snom at all, _except for the fact_ that I was transferring calls before it crashed. I think that there's possibly something being overflowed somewhere, which is confusing the asterisk server. Obviously, that's not much help, but the bt's seem to be useless.

The crash occured around the last 'INVITE' sent from the snom. I'll see if I can get some more debugging tomorrow. Any pointers on where I can look to get more debugging hints (apart from the voip-info wiki..)?

By: xrobau (xrobau) 2005-08-29 05:18:18

The common factor _does_ seem to be CDR's - I've attached a full trace that, just before it crashes, has a totally corrupt CDR information that it's trying to insert. There's probably a whole pile of stuff you don't need to know, but I didn't want to leave anything out.

By: xrobau (xrobau) 2005-08-29 05:37:16

Crash 2 doesn't really have anything to do with CDR, but you can see that number 3, 'send_response' is totally confused.. The 'full' trace is a bit big, I'm afraid, as this one took a bit of messing around to make it crash.

By: xrobau (xrobau) 2005-08-29 05:46:18

I think three should be enough.. *sigh*. Feel free to harass me on IRC if you want any more debugging.

By: Olle Johansson (oej) 2005-08-29 10:02:50

Hmmm. This might be related to the handling of CDRs in attempt_transfer - but only if it only happens in relationship to transfers.

By: Olle Johansson (oej) 2005-08-29 10:04:05

Wait - resetcdr and nocdr is executed after we have posted the CDR's. Let's see a dialplan.

By: Olle Johansson (oej) 2005-08-29 10:13:29

Try this little patch. Disclaimer on file.

By: Guenther Starnberger (gst) 2005-08-29 11:30:09

oej: i've upgraded to HEAD and fixed my logger.conf settings. attached are log.txt.gz (the debug + verbose output) and log-gdb.txt.

this time it took a little bit longer to crash asterisk - but after transfering calls around for a while it finally crashed.

By: Olle Johansson (oej) 2005-08-29 11:36:36

With or without the patch?

By: Guenther Starnberger (gst) 2005-08-29 12:07:03

i logged the debug stuff before reading about your patch so it isn't included in my version yet.

i can try to get another log with the patch (but i'm not sure if i have enough time to do this today - may take until tomorrow), but i don't think that my problem is related to the one of xrobau as the segfault always happens at the same location.

the debug log shows some (imo) interessting msgs like:
Aug 29 17:43:44 DEBUG[15423] rtp.c: Ooh, empty read...
Aug 29 17:43:44 DEBUG[15423] rtp.c: Difference is 195928, ms is 24511
Aug 29 17:43:44 DEBUG[15415] channel.c: Nobody there, continuing...
Aug 29 17:43:44 DEBUG[15415] rtp.c: Difference is 196384, ms is 24568

By: Guenther Starnberger (gst) 2005-08-29 12:22:15

oej: i was just able to produce the same segfault with your patch included.

By: Olle Johansson (oej) 2005-08-29 12:57:27

The patch may solve the problem in bt-full-3, but not the latest crash. Seems like there are several problems attacking you.

By: Guenther Starnberger (gst) 2005-08-29 14:32:28

bt-full-3 was submitted by xrobau - not by me - my bt's always show the problem in the same function :)

By: Olle Johansson (oej) 2005-08-29 14:38:09

Pls confirm that you are using the latest CVS head code.

By: Guenther Starnberger (gst) 2005-08-29 14:46:26

log6.txt.gz and log6-gdb.txt were done using a CVS checkout from today (CVS-NHEAD-08/29/05-19:16:57 [time here is GMT +2]).

additionally to asterisk i have the pyastre-0.03 module installed but it wasn't used in the extensions.conf at the time of the test. some weeks i also reproduced the problem (same bt as the ones which i submitted) without pyastre installed so this shouldn't be the case of the problem.

By: Olle Johansson (oej) 2005-08-29 16:49:34

gst: Ok, this is a REFER/Replaces gone wrong. The code for that in chan_sip is, well, to be nice, a bit interesting. It will be rewritten soon.

By: Kevin P. Fleming (kpfleming) 2005-08-29 19:14:09

A fix was just put into CVS HEAD to ensure that the channel's CDR pointer is cleared in the hangup processing code. Please re-test and see if you still have any issues.

By: xrobau (xrobau) 2005-08-30 04:37:44

Sorry to say it's still crashing, and it's still significantly random in how it's doing it.. One of my cores had the thread start, try to do a sipsock_read of total rubbish, and unsurprisingly die. I'm currently putting my limited debugging skills to the test to see if I can figure something out.

By: Olle Johansson (oej) 2005-09-07 05:37:22

Have tested on xrobau's system. He's got a macro that calls resetcdr() and then nocdr() in hangup as well as after each call. In some cases the resetcdr() causes a crash after a transfer and the cdr record has all weird fields or just a few.

If we remove resetcdr() everything works as expected. I don't know the inside of the CDR subsystem enough to locate this bug and the gdb backtraces seems weird.

The crash seems to happen when we call nocdr() after resetcdr(). When ast_cdr_free wants to free the cdr, we get a crash in libc somewhere. It suggests that we try to free something that is already gone. I tried a lot of different things, but could not cause the crash to go away. Need help from someone with a bit more insight in the cdr subsystem to solve this issue.

By: Olle Johansson (oej) 2005-09-07 05:41:20

Sorry, do we have two bug reports in one? gst - do you see the same problems and do you have the same extensions (from asterisk@home) ?

By: Guenther Starnberger (gst) 2005-09-07 06:18:32

xrobau's problem seems to be unrelated to mine. the segfaults at my system always happen at the same location and don't seem to be related to cdr's. i'm doing nothing fancy with cdr's right now execept using cdr_pgsql to write them in the database. i'm not using asterisk@home.

By: Olle Johansson (oej) 2005-09-07 07:52:41

gst: Is it possible to get access to your system? Please contact me on the IRC channel or via e-mail so we can work on finding your bug. I am trying to close open issues related to SIP.

xrobau: Can you open a new bug report with your issue, please, so we can continue with that in a separate issue report? Thank you.

By: Olle Johansson (oej) 2005-09-07 07:55:49

I made a clone of the issuereport, that stays with xrobaus problem so we can concentrate on the original problem in this issue report. It is now moved back to the SIP category.

By: Olle Johansson (oej) 2005-09-09 07:41:44

Logged in to the system and we got craches in mp3 subsystem. Removed format_mp3 and the python module and we could not get Asterisk to crash again. There is some form of problem in the combination of those modules, and one of the crashes was in ast_rtp_something caused by a non-existing rtp session. Waiting for more test results.

By: Michael Jerris (mikej) 2005-09-09 08:50:30

Also note that ASTERISK-4748 may be related.

By: Olle Johansson (oej) 2005-09-09 10:55:08

Thank you MikeJ!!!!

Soon I can happily move this out of the SIP category again...

By: Guenther Starnberger (gst) 2005-09-13 07:17:45

oej: I just tried to reproduce the crash. I wasn't able to do this using our config with enabled PyAstre and enabled format_mp3 (the same config with which i was able to reproduce the crash some time ago). The last time when i successfully reproduced the crash was on 08/29/05. Is it possible that one of the commits since 08/29/05 fixed this problem? I can try again with the older setup (08/29/05) and provide a core file if of interest - otherwise this problem may already be fixed - at least i can't reproduce it anymore using the current version (without your rtp null-pointer workaround).

By: Michael Jerris (mikej) 2005-09-13 07:42:25

As this is no longer reporoducable, and may very well be a duplicate of the format_mp3 bug, lets commit cdrnull.txt and close and open this back up if we are able to reproduce in the future without format mp3.

By: Mark Spencer (markster) 2005-09-13 10:54:24

Suspending pending duplication.  The attached patch is unnecessary since the structure is immediately freed.

By: Digium Subversion (svnbot) 2008-01-15 15:46:04.000-0600

Repository: asterisk
Revision: 6444

U   trunk/channel.c

------------------------------------------------------------------------
r6444 | kpfleming | 2008-01-15 15:46:04 -0600 (Tue, 15 Jan 2008) | 2 lines

ensure CDR pointer is cleared after detaching it from channel (related to issue ASTERISK-4754)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=6444