Summary: | ASTERISK-06648: [patch] [need disclaimers] retransmission-related segfault in chan_sip | ||
Reporter: | Roy Sigurd Karlsbakk (rkarlsba) | Labels: | |
Date Opened: | 2006-03-29 04:47:22.000-0600 | Date Closed: | 2011-06-07 14:00:31 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) crashlog ( 1) sched.zip ( 2) sip_retry.patch | |
Description: | hi i'm not sure yet what made this happen, but the backtrace is below.. (gdb) bt full #0 0x00002aaaaaccc8d0 in pthread_mutex_lock () from /lib/libpthread.so.0 No symbol table info available. #1 0x00002aaaabd0e9f0 in ast_mutex_lock (pmutex=0x0) at lock.h:592 No locals. #2 0x00002aaaabd0dfa3 in retrans_pkt (data=0xfcce0c) at chan_sip.c:1164 pkt = (struct sip_pkt *) 0xfcce0c prev = (struct sip_pkt *) 0x2aaaaacccbbe cur = (struct sip_pkt *) 0x0 iabuf = "\000\000\000\000\000\000\000\000?\003\000\000\000\000\000" reschedule = 1000 __PRETTY_FUNCTION__ = "retrans_pkt" #3 0x00000000004123de in ast_sched_runq (con=0x65addc) at sched.c:373 current = (struct sched *) 0x2aaab8b0244c tv = {tv_sec = 1143627533, tv_usec = 508802} x = 1 res = 4 #4 0x00002aaaabd398c6 in do_monitor (data=0x0) at chan_sip.c:11317 res = 1 sip = (struct sip_pvt *) 0x0 peer = (struct sip_peer *) 0x0 t = 1143627528 fastrestart = 0 lastpeernum = -1 curpeernum = 1963 reloading = 0 __PRETTY_FUNCTION__ = "do_monitor" | ||
Comments: | By: Andrey S Pankov (casper) 2006-03-29 08:50:05.000-0600 Can you do 'frame 2' and 'print pkt->owner' in gdb? How often do you encounter such a problem? Is there known scenario to reproduce it? By: Roy Sigurd Karlsbakk (rkarlsba) 2006-03-30 03:17:08.000-0600 (gdb) frame 2 #2 0x00002aaaabd0dfa3 in retrans_pkt (data=0xfcce0c) at chan_sip.c:1164 1164 ast_mutex_lock(&pkt->owner->lock); (gdb) print pkt->owner $1 = (struct sip_pvt *) 0x0 By: Roy Sigurd Karlsbakk (rkarlsba) 2006-03-30 03:17:45.000-0600 The box had been running for several days before this happened. I haven't seen it before. By: Andrey S Pankov (casper) 2006-03-30 14:39:52.000-0600 It won't be enough to check wether pkt->owner is NULL before calling ast_mutex_lock like in some other places in pkt_retrans... The only person who can suggest what to do is oej I think. I'm ready to make the fix if there is suggestions/directions how that should be handled from people involved in the project. By: Olle Johansson (oej) 2006-03-31 19:07:50.000-0600 I would need to know a bit more. Is this repeatable. Can you see what is happening before the crash happens, what we're trying to re-transmit? By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-07 07:25:10 it just happened twice again today with rev 17150 By: Andrey S Pankov (casper) 2006-04-07 10:10:48 Do you have some logs? Please do "tail -n 100 your_log_file_before_crash" and upload it. By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 05:06:23 I'll see if this is repeated with 1.2.6. I haven't had any crash since the last one, and the log files are huge and the syslog server timesync was fscked, so i'll rather wait :P aren't there anything else that can be done here? By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 05:39:11 This is a real PITA. It crashes daily.. By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 05:53:26 forget about the attached crashlog that was for ASTERISK-6751 sorry about that By: Andrey S Pankov (casper) 2006-04-11 06:06:54 Can you reply to oej: "I would need to know a bit more. Is this repeatable. Can you see what is happening before the crash happens, what we're trying to re-transmit?" It would be nice if you set verbose 4, set debug 4, sip debug. By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 06:51:16 I'm not sure how the box will perform will a full sip debug - there's quite high load on it already. Also - does sip debug allow for syslog logging? By: Andrey S Pankov (casper) 2006-04-11 06:53:40 sip debug uses ast_verbose. So I don't see the reason why it should not work via syslog. By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 07:11:09 sip debug works with syslog, it seems, but the box doesn't seem to be able to handle the load with full sip debug. By: Olle Johansson (oej) 2006-04-11 10:52:01 casper: I am in direct contact with rkarlsba to handle this. thanks. By: Olle Johansson (oej) 2006-05-17 14:40:28 Another void pointer that has empty value. There are at least patterns in your crashes. By: tkarlik (tkarlik) 2006-05-19 03:58:00 I have similar problem with retrans_pkt. I wonder if it is related to 6936 (segfault in malloc/calloc)? I am observing those 2 problems since 1 week. I think scheduler due to callback 'retrans_pkt', when 'data' (sip_pkt) has been freed: 1. SIP INVITE / sched_add 2. response received => race condition: 3.1. sched_del called before retrans_pkt (callback is unsychronized in scheduler) 3.2. sched_del called after retrans_pkt 4.1. OK 4.2. segfault possible Is it possible? By: tkarlik (tkarlik) 2006-05-25 07:46:54 I solved this problem here: http://trac.openpbx.org/cgi-bin/trac.cgi/changeset/1630 SIP retransmission DON'T WORK correctly on Asterisk/OpenPBX. Problem appears when something is wrong with sending UDP packets beetwen 2 nodes. Asterisk due to retransmit packets, but takes incorrect timer value (see RFC 3261). This patch solves problem with memory corruption too (ASTERISK-6751). Could anyone look into? By: Serge Vecher (serge-v) 2006-05-25 08:46:50 tkarlik: thanks for finding the solution. Could you please: 1) Confirm that you have disclaimer on file with Digium. If you don't, can you please file it (see bottom of this page http://bugs.digium.com/main_page.php )? 2) Upload the patch file here against the latest 1.2 svn branch? Thanks. By: tkarlik (tkarlik) 2006-05-26 04:40:07 I'm using Asterisk on production system, and is not 1.2 svn branch, so it's difficult to me to make patch. I made patch for OpenPBX, but it can be easily converted to work with Asterisk. If someone could do it, please test this changeset with Asterisk. By: Serge Vecher (serge-v) 2006-05-26 08:35:06 tkarlik: if you could upload the patch agaist the version you are running in production and note for us what version it is, that will be fine. Thank you. By: Serge Vecher (serge-v) 2006-05-26 15:08:15 willcampos: Please review the following page to make a patch: http://www.asterisk.org/developers/Patch_Howto . Thanks for the effort. By: wILMAR cAMPOS (willcampos) 2006-05-26 15:13:19 If I use the version 1.2.7.1 how can I make a path to this specific version? On the link you gave me, there is only info to update with latest svn. Thanks, Wilmar By: Serge Vecher (serge-v) 2006-05-26 15:20:18 in the first step, just checkout the 1.2.7.1 branch as follows instead of trunk: http://svn.digium.com/svn/asterisk/tags/1.2.7.1/ By: wILMAR cAMPOS (willcampos) 2006-05-26 15:24:21 Thank you, I will have a patch ready soon :> By: wILMAR cAMPOS (willcampos) 2006-05-26 15:58:53 Here is the patch for 1.2.7.1. Is working fine until now, segfaults stop after apply patch. By: Serge Vecher (serge-v) 2006-05-26 16:01:51 royk:? By: Serge Vecher (serge-v) 2006-05-26 16:04:53 willcampos, tkarlik: please confirm the disclaimer status. If you haven't filed the disclaimer to Digium yet, please be kind to fax one of the forms from the bottom of this page http://bugs.digium.com/main_page.php ... Thanks very much! By: Andrey S Pankov (casper) 2006-05-28 13:10:11 The patch won't be committed in that form anyway IMO. It alters at least two things: timers and sheduler enrty deletion. If there is a bug in timers that should be filed separately. By: wILMAR cAMPOS (willcampos) 2006-05-28 17:14:37 Anyway, that patch solve my problem, the system is stable and is not crashing anymore. By: Serge Vecher (serge-v) 2006-05-30 08:43:13 willcampos: we still need to know whether you have filed a disclaimer with Digium. tkarlic: please also confirm. Thanks By: tkarlik (tkarlik) 2006-05-30 11:26:50 I filled disclaimer, but currently I don't have time and possibility to send/fax it. Feel free to use patch I made. By: Serge Vecher (serge-v) 2006-05-30 13:20:39 tkarlik: you can send a scanned copy of your filled out disclaimer form to kpflemingA_Tdigium.com . Thanks. By: wILMAR cAMPOS (willcampos) 2006-05-30 14:09:34 I try, but your fax is always busy..... By: Serge Vecher (serge-v) 2006-05-30 14:22:01 Lot of people must be faxing in disclaimers. I just called 256-864-0464 and the fax machine answered. So keep trying or scan your disclaimer in and send to the email address above. Thanks By: Olle Johansson (oej) 2006-06-01 11:34:04 Please explain more why you think Asterisk does not follow RFC for retransmissions. By: tkarlik (tkarlik) 2006-06-02 02:28:40 SIP retransmision should be done after ~ 500ms (depending on network type: smaller value on local network). In chan_sip.c retransmission sometimes is scheduled with timer value near 0 ms... Look this: int siptimer_a = DEFAULT_RETRANS; // DEFAULT_RETRANS in chan_sip is 2*500 ms - this is OK for Internet pkt->timer_t1 = p->timer_t1; /* Set SIP timer T1 */ /* p->timer_t1 (pvt timer) could be equal 0 when transmitting first frame (SIP_INVITE - see bt from ASTERISK-6482 and ASTERISK-6751), but it is possible, that p->timer_t1 > 0) */ if (pkt->timer_t1) siptimer_a = pkt->timer_t1 * 2; /* if (pkt->timer_t1 == 5) then siptimer_a == 10 ms * IMO THIS IS NOT CORRECT TIMER VALUE FOR RETRANSMISSION * Asterisk follow RFC for retrans., but here we have some bug I think */ /* Schedule retransmission */ pkt->retransid = ast_sched_add_variable(sched, siptimer_a, retrans_pkt, pkt, 1); /* Consider this: * 1. Scheduling retransmission with timer value = 10 ms - this breaks in ast_sched_add_variable, but after mutex_unlock - pkt->retransid is not set to correct value, becose ast_sched_add_variable doesn't finished yet! * 2. Scheduller callback * 3. retrans_pkt when pkt->retransid has not been set yet * ==> correct timer value needed */ /* Other scenarion: * 1. Scheduling retransmision with timer_value 1000ms * 2. reply received * 3. retrans_pkt - breaks in first line * 3. attempt to delete retransmission entry and free pkt * 4. retrans_pkt goes on -> we have empty void pointer! * ==> sched_del_lock is needed */ Thats all. I hope this help. Please, excuse my english.... By: Olle Johansson (oej) 2006-06-27 12:00:07 Use the t1min setting to set your minimum t1. The standard says that t1 is the roundtrip time. If not known, the default is 500ms. If you use qualify=yes, we know the roundtrip time and use that as t1, otherwise we use 500. I don't see us *not* following the standard here. By: Olle Johansson (oej) 2006-06-27 12:06:32 Is this still a problem with latest 1.2? We've changed a few things. please test 1.2 from svn. Thanks. By: Roy Sigurd Karlsbakk (rkarlsba) 2006-06-28 00:53:37 I haven't seen this for a while, no, at least not on 1.2.9.1 which I'm currently running, so just close it for now. I'll reopen it if it happens again roy |