
Summary: ASTERISK-11529: Random crashes in different places
Reporter: ptorres (ptorres)
Labels:
Date Opened: 2008-02-27 13:03:06.000-0600
Date Closed: 2008-03-14 11:50:30
Priority: Critical
Regression? No
Status: Closed/Complete
Components: . I did not set the category correctly.
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) bt_full_NEW.txt
( 1) btfull_080226.txt
( 2) btfull_080227.txt
Description: We are experiencing random crashes once or twice a day on one of our ~twenty servers, and the backtraces look different every time.

We have already replaced the hardware (the other sites are very stable, with no crashes at all).


****** STEPS TO REPRODUCE ******

Unknown so far.


****** ADDITIONAL INFORMATION ******

Attaching "bt full" output from two crashes (Asterisk compiled with dont_optimize and debug_threads).
I will add more samples in a couple of days :(
(valgrind degrades the system too much to keep it running until a crash.)
Comments:

By: Joshua C. Colp (jcolp) 2008-02-27 13:14:05.000-0600

Backtraces against the latest version would be useful; tracking down issues that are already solved is never fun. Also, is call parking being used? Can you provide a 'thread apply all bt' as well?

By: Jason Parker (jparker) 2008-02-27 13:14:40.000-0600

Reopen if you are able to reproduce on the most recent version of Asterisk.

There is no sense in even looking at this until then.

By: ptorres (ptorres) 2008-03-13 10:42:42

We have upgraded this particular site to 1.4.18. In about 50 hours we got 6 random crashes; however, the backtraces are similar now (see bt_full_new.txt, 2 included).
We do not use call parking. There are about 40 simultaneous calls, both SIP to Zap and SIP to SIP, with transfers and spies.




By: Russell Bryant (russell) 2008-03-13 10:44:11

Try with 1.4.19-rc3 or 1.4.19 when it gets released.  There are some significant chanspy fixes in there.

By: ptorres (ptorres) 2008-03-13 12:45:58

I can't see how chanspy is related to this issue, but we have disabled it just in case. I checked 'thread apply all bt' on a couple of dumps and didn't find any 'spy' related function call.
We upgraded 2 days ago to the latest available 'stable' release and have (now) had 7 crashes, 3 of them in the last 2 hours.

* UPDATE: just crashed, NO spies, same backtrace as before.



By: Jason Parker (jparker) 2008-03-13 14:17:30

See issue ASTERISK-11537 - the backtrace there looks exactly like yours, and there is a patch available.

By: ptorres (ptorres) 2008-03-14 10:47:38

The patched build looked fine for a while; however, Asterisk now 'freezes' and has to be killed and restarted. Consoles get disconnected and it does not accept new connections.

I guess we can close this and follow the 0012098 one.

By: Digium Subversion (svnbot) 2008-03-14 11:39:57

Repository: asterisk
Revision: 108737

U   branches/1.4/channels/chan_sip.c

------------------------------------------------------------------------
r108737 | mmichelson | 2008-03-14 11:39:56 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.

chan_sip uses the scheduler API in order to schedule retransmission of reliable
packets (such as INVITES). If a retransmission of a packet is occurring, then the
packet is removed from the scheduler and retrans_pkt is called. Meanwhile, if
a response is received from the packet as previously transmitted, then when we
ACK the response, we will remove the packet from the scheduler and free the packet.

The problem is that both the ACK function and retrans_pkt attempt to acquire the
same lock at the beginning of the function call. This means that if the ACK function
acquires the lock first, then it will free the packet which retrans_pkt is about to
read from and write to. The result is a crash.

The solution:

1. If the ACK function fails to remove the packet from the scheduler and the retransmit
  id of the packet is not -1 (meaning that we have not reached the maximum number of
  retransmissions) then release the lock and yield so that retrans_pkt may acquire the
  lock and operate.

2. Make absolutely certain that the ACK function does not recursively lock the lock in
  question. If it does, then releasing the lock will do no good, since retrans_pkt will
  still be unable to acquire the lock.

(closes issue ASTERISK-11537)
Reported by: wegbert
(closes issue ASTERISK-11529)
Reported by: PTorres
Patches:
     12098-putnopvutv3.patch uploaded by putnopvut (license 60)
Tested by: jvandal


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=108737
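
The race and the yield-based fix described in r108737 can be sketched in a few lines of C. This is only an illustrative model, not the code from chan_sip.c or from the attached patch: struct pkt, still_scheduled, sched_del_model, retrans_pkt_model, ack_model and run_demo are all hypothetical names, and the callback is simplified so that it never reschedules itself (it always sets the retransmit id to -1 when it finishes). The lock is a plain non-recursive pthread mutex, which matters for point 2 of the commit message: if the ACK path held the lock recursively, a single unlock would not let the callback in.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reliable SIP packet awaiting an ACK. */
struct pkt {
    int retransid;      /* -1 means no retransmission is pending */
    int retrans_count;  /* how many times the callback touched the packet */
};

/* One non-recursive lock shared by the ACK path and the callback. */
static pthread_mutex_t dialog_lock = PTHREAD_MUTEX_INITIALIZER;

/* True while the scheduler still holds the entry. Only the ACK thread
 * touches it in this sketch, so it needs no lock of its own. */
static bool still_scheduled;

/* Model of removing an entry from the scheduler: returns 0 on success,
 * -1 if the entry was already dequeued (its callback is about to run). */
static int sched_del_model(void)
{
    if (!still_scheduled)
        return -1;
    still_scheduled = false;
    return 0;
}

/* Model of the retransmit callback: reads and writes the packet under
 * the lock. Simplification: it never reschedules itself. */
static void *retrans_pkt_model(void *arg)
{
    struct pkt *p = arg;

    pthread_mutex_lock(&dialog_lock);
    p->retrans_count++;
    p->retransid = -1;
    pthread_mutex_unlock(&dialog_lock);
    return NULL;
}

/* Model of the ACK path. If the entry cannot be removed and the
 * retransmit id is not -1, the callback is in flight: release the lock
 * and yield until it has run, and only then free the packet. */
static int ack_model(struct pkt *p)
{
    int seen;

    assert(p != NULL);
    pthread_mutex_lock(&dialog_lock);
    while (p->retransid != -1 && sched_del_model() != 0) {
        pthread_mutex_unlock(&dialog_lock);
        sched_yield();
        pthread_mutex_lock(&dialog_lock);
    }
    seen = p->retrans_count;
    free(p);
    pthread_mutex_unlock(&dialog_lock);
    return seen;
}

/* Forces the racy interleaving: the scheduler has already dequeued the
 * entry when the ACK arrives. Returns the callback count seen at free
 * time, or -1 on setup failure. */
int run_demo(void)
{
    pthread_t cb;
    int seen;
    struct pkt *p = calloc(1, sizeof(*p));

    if (p == NULL)
        return -1;
    p->retransid = 1;
    still_scheduled = false;  /* entry already dequeued by the scheduler */
    if (pthread_create(&cb, NULL, retrans_pkt_model, p) != 0) {
        free(p);
        return -1;
    }
    seen = ack_model(p);
    pthread_join(cb, NULL);
    return seen;
}
```

In this model run_demo() returns 1: the ACK path cannot free the packet until the callback has touched it exactly once. Without the loop in ack_model, the ACK path could take the lock first and free the packet while the callback is still waiting to use it, which is the crash the commit message describes. Build with -pthread.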

By: Digium Subversion (svnbot) 2008-03-14 11:48:57

Repository: asterisk
Revision: 108738

_U  trunk/
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r108738 | mmichelson | 2008-03-14 11:48:54 -0500 (Fri, 14 Mar 2008) | 41 lines

Merged revisions 108737 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r108737 | mmichelson | 2008-03-14 11:44:08 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.

chan_sip uses the scheduler API in order to schedule retransmission of reliable
packets (such as INVITES). If a retransmission of a packet is occurring, then the
packet is removed from the scheduler and retrans_pkt is called. Meanwhile, if
a response is received from the packet as previously transmitted, then when we
ACK the response, we will remove the packet from the scheduler and free the packet.

The problem is that both the ACK function and retrans_pkt attempt to acquire the
same lock at the beginning of the function call. This means that if the ACK function
acquires the lock first, then it will free the packet which retrans_pkt is about to
read from and write to. The result is a crash.

The solution:

1. If the ACK function fails to remove the packet from the scheduler and the retransmit
  id of the packet is not -1 (meaning that we have not reached the maximum number of
  retransmissions) then release the lock and yield so that retrans_pkt may acquire the
  lock and operate.

2. Make absolutely certain that the ACK function does not recursively lock the lock in
  question. If it does, then releasing the lock will do no good, since retrans_pkt will
  still be unable to acquire the lock.

(closes issue ASTERISK-11537)
Reported by: wegbert
(closes issue ASTERISK-11529)
Reported by: PTorres
Patches:
     12098-putnopvutv3.patch uploaded by putnopvut (license 60)
Tested by: jvandal


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=108738

By: Digium Subversion (svnbot) 2008-03-14 11:50:30

Repository: asterisk
Revision: 108739

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_sip.c

------------------------------------------------------------------------
r108739 | mmichelson | 2008-03-14 11:50:28 -0500 (Fri, 14 Mar 2008) | 49 lines

Merged revisions 108738 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r108738 | mmichelson | 2008-03-14 11:52:51 -0500 (Fri, 14 Mar 2008) | 41 lines

Merged revisions 108737 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r108737 | mmichelson | 2008-03-14 11:44:08 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.

chan_sip uses the scheduler API in order to schedule retransmission of reliable
packets (such as INVITES). If a retransmission of a packet is occurring, then the
packet is removed from the scheduler and retrans_pkt is called. Meanwhile, if
a response is received from the packet as previously transmitted, then when we
ACK the response, we will remove the packet from the scheduler and free the packet.

The problem is that both the ACK function and retrans_pkt attempt to acquire the
same lock at the beginning of the function call. This means that if the ACK function
acquires the lock first, then it will free the packet which retrans_pkt is about to
read from and write to. The result is a crash.

The solution:

1. If the ACK function fails to remove the packet from the scheduler and the retransmit
  id of the packet is not -1 (meaning that we have not reached the maximum number of
  retransmissions) then release the lock and yield so that retrans_pkt may acquire the
  lock and operate.

2. Make absolutely certain that the ACK function does not recursively lock the lock in
  question. If it does, then releasing the lock will do no good, since retrans_pkt will
  still be unable to acquire the lock.

(closes issue ASTERISK-11537)
Reported by: wegbert
(closes issue ASTERISK-11529)
Reported by: PTorres
Patches:
     12098-putnopvutv3.patch uploaded by putnopvut (license 60)
Tested by: jvandal


........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=108739