[Home]

Summary:ASTERISK-06648: [patch] [need disclaimers] retransmission-related segfault in chan_sip
Reporter:Roy Sigurd Karlsbakk (rkarlsba)Labels:
Date Opened:2006-03-29 04:47:22.000-0600Date Closed:2011-06-07 14:00:31
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) crashlog
( 1) sched.zip
( 2) sip_retry.patch
Description:hi

i'm not sure yet what made this happen, but the backtrace is below..

(gdb) bt full
#0  0x00002aaaaaccc8d0 in pthread_mutex_lock () from /lib/libpthread.so.0
No symbol table info available.
#1  0x00002aaaabd0e9f0 in ast_mutex_lock (pmutex=0x0) at lock.h:592
No locals.
#2  0x00002aaaabd0dfa3 in retrans_pkt (data=0xfcce0c) at chan_sip.c:1164
       pkt = (struct sip_pkt *) 0xfcce0c
       prev = (struct sip_pkt *) 0x2aaaaacccbbe
       cur = (struct sip_pkt *) 0x0
       iabuf = "\000\000\000\000\000\000\000\000?\003\000\000\000\000\000"
       reschedule = 1000
       __PRETTY_FUNCTION__ = "retrans_pkt"
#3  0x00000000004123de in ast_sched_runq (con=0x65addc) at sched.c:373
       current = (struct sched *) 0x2aaab8b0244c
       tv = {tv_sec = 1143627533, tv_usec = 508802}
       x = 1
       res = 4
#4  0x00002aaaabd398c6 in do_monitor (data=0x0) at chan_sip.c:11317
       res = 1
       sip = (struct sip_pvt *) 0x0
       peer = (struct sip_peer *) 0x0
       t = 1143627528
       fastrestart = 0
       lastpeernum = -1
       curpeernum = 1963
       reloading = 0
       __PRETTY_FUNCTION__ = "do_monitor"
Comments:By: Andrey S Pankov (casper) 2006-03-29 08:50:05.000-0600

Can you do 'frame 2' and 'print pkt->owner' in gdb?

How often do you encounter such a problem? Is there known scenario to reproduce it?



By: Roy Sigurd Karlsbakk (rkarlsba) 2006-03-30 03:17:08.000-0600

(gdb) frame 2
#2  0x00002aaaabd0dfa3 in retrans_pkt (data=0xfcce0c) at chan_sip.c:1164
1164            ast_mutex_lock(&pkt->owner->lock);
(gdb) print pkt->owner
$1 = (struct sip_pvt *) 0x0

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-03-30 03:17:45.000-0600

The box had been running for several days before this happened. I haven't seen it before.

By: Andrey S Pankov (casper) 2006-03-30 14:39:52.000-0600

It won't be enough to check wether pkt->owner is NULL before calling ast_mutex_lock like in some other places in pkt_retrans... The only person who can suggest what to do is oej I think. I'm ready to make the fix if there is suggestions/directions how that should be handled from people involved in the project.

By: Olle Johansson (oej) 2006-03-31 19:07:50.000-0600

I would need to know  a bit more. Is this repeatable. Can you see what is happening before the crash happens, what we're trying to re-transmit?

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-07 07:25:10

it just happened twice again today with rev 17150

By: Andrey S Pankov (casper) 2006-04-07 10:10:48

Do you have some logs? Please do "tail -n 100 your_log_file_before_crash" and upload it.

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 05:06:23

I'll see if this is repeated with 1.2.6. I haven't had any crash since the last one, and the log files are huge and the syslog server timesync was fscked, so i'll rather wait :P

aren't there anything else that can be done here?

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 05:39:11

This is a real PITA. It crashes daily..

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 05:53:26

forget about the attached crashlog
that was for ASTERISK-6751

sorry about that

By: Andrey S Pankov (casper) 2006-04-11 06:06:54

Can you reply to oej: "I would need to know a bit more. Is this repeatable. Can you see what is happening before the crash happens, what we're trying to re-transmit?"

It would be nice if you set verbose 4, set debug 4, sip debug.

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 06:51:16

I'm not sure how the box will perform will a full sip debug - there's quite high load on it already.
Also - does sip debug allow for syslog logging?

By: Andrey S Pankov (casper) 2006-04-11 06:53:40

sip debug uses ast_verbose. So I don't see the reason why it should not work via syslog.

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 07:11:09

sip debug works with syslog, it seems, but the box doesn't seem to be able to handle the load with full sip debug.

By: Olle Johansson (oej) 2006-04-11 10:52:01

casper: I am in direct contact with rkarlsba to handle this. thanks.

By: Olle Johansson (oej) 2006-05-17 14:40:28

Another void pointer that has empty value. There are at least patterns in your crashes.

By: tkarlik (tkarlik) 2006-05-19 03:58:00

I have similar problem with retrans_pkt. I wonder if it is related to 6936 (segfault in malloc/calloc)? I am observing those 2 problems since 1 week.


I think scheduler due to callback 'retrans_pkt', when 'data' (sip_pkt) has been freed:
1. SIP INVITE / sched_add
2. response received => race condition:
3.1. sched_del called before retrans_pkt (callback is unsychronized in scheduler)
3.2. sched_del called after retrans_pkt
4.1. OK
4.2. segfault possible

Is it possible?



By: tkarlik (tkarlik) 2006-05-25 07:46:54

I solved this problem here:

http://trac.openpbx.org/cgi-bin/trac.cgi/changeset/1630

SIP retransmission DON'T WORK correctly on Asterisk/OpenPBX. Problem appears when something is wrong with sending UDP packets beetwen 2 nodes. Asterisk due to retransmit packets, but takes incorrect timer value (see RFC 3261).
This patch solves problem with memory corruption too (ASTERISK-6751).

Could anyone look into?

By: Serge Vecher (serge-v) 2006-05-25 08:46:50

tkarlik: thanks for finding the solution. Could you please:
1) Confirm that you have disclaimer on file with Digium. If you don't, can you please file it (see bottom of this page http://bugs.digium.com/main_page.php )?
2) Upload the patch file here against the latest 1.2 svn branch?

Thanks.



By: tkarlik (tkarlik) 2006-05-26 04:40:07

I'm using Asterisk on production system, and is not 1.2 svn branch, so it's difficult to me to make patch.
I made patch for OpenPBX, but it can be easily converted to work with Asterisk.

If someone could do it, please test this changeset with Asterisk.

By: Serge Vecher (serge-v) 2006-05-26 08:35:06

tkarlik: if you could upload the patch agaist the version you are running in production and note for us what version it is, that will be fine. Thank you.

By: Serge Vecher (serge-v) 2006-05-26 15:08:15

willcampos:

 Please review the following page to make a patch: http://www.asterisk.org/developers/Patch_Howto . Thanks for the effort.

By: wILMAR cAMPOS (willcampos) 2006-05-26 15:13:19

If I use the version 1.2.7.1 how can I make a path to this specific version?

On the link you gave me, there is only info to update with latest svn.

Thanks,

Wilmar

By: Serge Vecher (serge-v) 2006-05-26 15:20:18

in the first step, just checkout the 1.2.7.1 branch as follows instead of trunk:
http://svn.digium.com/svn/asterisk/tags/1.2.7.1/

By: wILMAR cAMPOS (willcampos) 2006-05-26 15:24:21

Thank you, I will have a patch ready soon :>

By: wILMAR cAMPOS (willcampos) 2006-05-26 15:58:53

Here is the patch for 1.2.7.1.

Is working fine until now, segfaults stop after apply patch.

By: Serge Vecher (serge-v) 2006-05-26 16:01:51

royk:?

By: Serge Vecher (serge-v) 2006-05-26 16:04:53

willcampos, tkarlik: please confirm the disclaimer status. If you haven't filed the disclaimer to Digium yet, please be kind to fax one of the forms from the bottom of this page http://bugs.digium.com/main_page.php ... Thanks very much!

By: Andrey S Pankov (casper) 2006-05-28 13:10:11

The patch won't be committed in that form anyway IMO. It alters at least two
things: timers and sheduler enrty deletion. If there is a bug in timers that
should be filed separately.

By: wILMAR cAMPOS (willcampos) 2006-05-28 17:14:37

Anyway, that patch solve my problem, the system is stable and is not crashing anymore.

By: Serge Vecher (serge-v) 2006-05-30 08:43:13

willcampos: we still need to know whether you have filed a disclaimer with Digium.
tkarlic: please also confirm. Thanks

By: tkarlik (tkarlik) 2006-05-30 11:26:50

I filled disclaimer, but currently I don't have time and possibility to send/fax it. Feel free to use patch I made.

By: Serge Vecher (serge-v) 2006-05-30 13:20:39

tkarlik: you can send a scanned copy of your filled out disclaimer form to kpflemingA_Tdigium.com . Thanks.

By: wILMAR cAMPOS (willcampos) 2006-05-30 14:09:34

I try, but your fax is always busy.....

By: Serge Vecher (serge-v) 2006-05-30 14:22:01

Lot of people must be faxing in disclaimers. I just called 256-864-0464 and the fax machine answered. So keep trying or scan your disclaimer in and send to the email address above.

Thanks

By: Olle Johansson (oej) 2006-06-01 11:34:04

Please explain more why you think Asterisk does not follow RFC for retransmissions.

By: tkarlik (tkarlik) 2006-06-02 02:28:40

SIP retransmision should be done after ~ 500ms (depending on network type: smaller value on local network). In chan_sip.c retransmission sometimes is scheduled with timer value near 0 ms... Look this:

int siptimer_a = DEFAULT_RETRANS;
          // DEFAULT_RETRANS in chan_sip is 2*500 ms - this is OK for Internet
pkt->timer_t1 = p->timer_t1; /* Set SIP timer T1 */
          /* p->timer_t1 (pvt timer) could be equal 0 when transmitting first frame (SIP_INVITE - see bt from ASTERISK-6482 and ASTERISK-6751), but it is possible, that p->timer_t1 > 0) */

if (pkt->timer_t1)
   siptimer_a = pkt->timer_t1 * 2;

         /* if (pkt->timer_t1 == 5) then siptimer_a == 10 ms
          * IMO THIS IS NOT CORRECT TIMER VALUE FOR RETRANSMISSION
          * Asterisk follow RFC for retrans., but here we have some bug I think
          */

   /* Schedule retransmission */
   pkt->retransid = ast_sched_add_variable(sched, siptimer_a, retrans_pkt, pkt, 1);
         /* Consider this:
          * 1. Scheduling retransmission with timer value = 10 ms - this breaks in ast_sched_add_variable, but after mutex_unlock - pkt->retransid is not set to correct value, becose ast_sched_add_variable doesn't finished yet!
          * 2. Scheduller callback
          * 3. retrans_pkt when pkt->retransid has not been set yet
          * ==> correct timer value needed
          */

          /* Other scenarion:
           * 1. Scheduling retransmision with timer_value 1000ms
           * 2. reply received
           * 3. retrans_pkt - breaks in first line
           * 3. attempt to delete retransmission entry and free pkt
           * 4. retrans_pkt goes on -> we have empty void pointer!
           * ==> sched_del_lock is needed
           */

Thats all. I hope this help. Please, excuse my english....



By: Olle Johansson (oej) 2006-06-27 12:00:07

Use the t1min setting to set your minimum t1.

The standard says that t1 is the roundtrip time. If not known, the default is 500ms.

If you use qualify=yes, we know the roundtrip time and use that as t1, otherwise we use 500.

I don't see us *not* following the standard here.

By: Olle Johansson (oej) 2006-06-27 12:06:32

Is this still a problem with latest 1.2? We've changed a few things. please test 1.2 from svn. Thanks.

By: Roy Sigurd Karlsbakk (rkarlsba) 2006-06-28 00:53:37

I haven't seen this for a while, no, at least not on 1.2.9.1 which I'm currently running, so just close it for now. I'll reopen it if it happens again

roy