Summary: | ASTERISK-11529: Random crashes in different places | ||
Reporter: | ptorres (ptorres) | Labels: | |
Date Opened: | 2008-02-27 13:03:06.000-0600 | Date Closed: | 2008-03-14 11:50:30 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | . I did not set the category correctly. |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) bt_full_NEW.txt ( 1) btfull_080226.txt ( 2) btfull_080227.txt | |
Description: | We are experiencing random crashes once or twice a day on one of ~twenty servers, but the backtraces always look different. We have already changed the hardware (the other sites are very stable, with no crashes at all). ****** STEPS TO REPRODUCE ****** Unknown so far. ****** ADDITIONAL INFORMATION ****** Attaching "bt full" from two crashes (compiled * with dont_optimize and debug_threads). Adding more samples in a couple of days :( (valgrind degrades the system too much to have it running until crash) | ||
Comments: | By: Joshua C. Colp (jcolp) 2008-02-27 13:14:05.000-0600 Backtraces against the latest version would be useful; tracking down issues that are already solved is never fun. As well, is call parking being used? Can you provide a thread apply all bt as well?
By: Jason Parker (jparker) 2008-02-27 13:14:40.000-0600 Reopen if you are able to reproduce on the most recent version of Asterisk. There is no sense in even looking at this until then.
By: ptorres (ptorres) 2008-03-13 10:42:42 We have upgraded this particular site to 1.4.18. In about 50 hrs we got 6 random crashes; however, the backtraces are similar now (see bt_full_new.txt 2 included). We do not use call parking. There are about 40 simultaneous calls, both sip to zap and sip to sip, with transfers and spies.
By: Russell Bryant (russell) 2008-03-13 10:44:11 Try with 1.4.19-rc3 or 1.4.19 when it gets released. There are some significant chanspy fixes in there.
By: ptorres (ptorres) 2008-03-13 12:45:58 I can't see why chanspy is related to this issue, but we have disabled it just in case. I checked 'thread apply all bt' on a couple of dumps and didn't find any 'spy' related function call. We upgraded 2 days ago to the latest available 'stable' release and got (now) 7 crashes, 3 of them in the last 2 hours. * UPDATE: just crashed, NO spies, same backtrace as before.
By: Jason Parker (jparker) 2008-03-13 14:17:30 See issue ASTERISK-11537 - the backtrace there looks exactly like yours, and there is a patch available.
By: ptorres (ptorres) 2008-03-14 10:47:38 Patched and it looked fine for a while; however, asterisk 'freezes' and has to be killed/restarted/etc. Consoles get disconnected and it does not accept new connections. I guess we can close this and follow the 0012098 one.
By: Digium Subversion (svnbot) 2008-03-14 11:39:57
Repository: asterisk
Revision: 108737
U branches/1.4/channels/chan_sip.c
------------------------------------------------------------------------
r108737 | mmichelson | 2008-03-14 11:39:56 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.

chan_sip uses the scheduler API in order to schedule retransmission of reliable packets (such as INVITES). If a retransmission of a packet is occurring, then the packet is removed from the scheduler and retrans_pkt is called. Meanwhile, if a response is received from the packet as previously transmitted, then when we ACK the response, we will remove the packet from the scheduler and free the packet.

The problem is that both the ACK function and retrans_pkt attempt to acquire the same lock at the beginning of the function call. This means that if the ACK function acquires the lock first, then it will free the packet which retrans_pkt is about to read from and write to. The result is a crash.

The solution:
1. If the ACK function fails to remove the packet from the scheduler and the retransmit id of the packet is not -1 (meaning that we have not reached the maximum number of retransmissions) then release the lock and yield so that retrans_pkt may acquire the lock and operate.
2. Make absolutely certain that the ACK function does not recursively lock the lock in question. If it does, then releasing the lock will do no good, since retrans_pkt will still be unable to acquire the lock.

(closes issue ASTERISK-11537)
Reported by: wegbert
(closes issue ASTERISK-11529)
Reported by: PTorres
Patches: 12098-putnopvutv3.patch uploaded by putnopvut (license 60)
Tested by: jvandal
------------------------------------------------------------------------
http://svn.digium.com/view/asterisk?view=rev&revision=108737
By: Digium Subversion (svnbot) 2008-03-14 11:48:57
Repository: asterisk
Revision: 108738
_U trunk/
U trunk/channels/chan_sip.c
------------------------------------------------------------------------
r108738 | mmichelson | 2008-03-14 11:48:54 -0500 (Fri, 14 Mar 2008) | 41 lines

Merged revisions 108737 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4
........
r108737 | mmichelson | 2008-03-14 11:44:08 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.
........
------------------------------------------------------------------------
http://svn.digium.com/view/asterisk?view=rev&revision=108738
By: Digium Subversion (svnbot) 2008-03-14 11:50:30
Repository: asterisk
Revision: 108739
_U branches/1.6.0/
U branches/1.6.0/channels/chan_sip.c
------------------------------------------------------------------------
r108739 | mmichelson | 2008-03-14 11:50:28 -0500 (Fri, 14 Mar 2008) | 49 lines

Merged revisions 108738 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk
................
r108738 | mmichelson | 2008-03-14 11:52:51 -0500 (Fri, 14 Mar 2008) | 41 lines

Merged revisions 108737 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4
........
r108737 | mmichelson | 2008-03-14 11:44:08 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.
........
................
------------------------------------------------------------------------
http://svn.digium.com/view/asterisk?view=rev&revision=108739 |
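The locking pattern described in the r108737 log can be sketched outside Asterisk roughly as follows. This is only an illustration under stated assumptions, not the chan_sip.c patch: `struct packet`, `struct dialog`, `fake_sched_del`, `retransmit_thread`, `ack_packet` and `run_race_demo` are all hypothetical names, the scheduler is reduced to a single boolean, and the callback is assumed to be the final retransmission (so it sets the id to -1). The mutex is a default, non-recursive pthread mutex, in line with point 2 of the log.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from chan_sip.c. */
struct packet {
    int retransid;              /* scheduler id, -1 when nothing is pending */
    int retries;                /* written by the retransmit callback */
};

struct dialog {
    pthread_mutex_t lock;       /* shared by the ACK path and the callback */
    pthread_mutex_t sched_lock; /* protects entry_queued */
    bool entry_queued;          /* false once the scheduler dequeued the entry */
    struct packet *pkt;
    int final_retries;          /* copied out just before the packet is freed */
};

/* Returns 0 if the entry was still queued and is now removed, -1 otherwise. */
static int fake_sched_del(struct dialog *d)
{
    int res = -1;
    pthread_mutex_lock(&d->sched_lock);
    if (d->entry_queued) {
        d->entry_queued = false;
        res = 0;
    }
    pthread_mutex_unlock(&d->sched_lock);
    return res;
}

/* Plays the scheduler: dequeue the entry first, then run the callback. */
static void *retransmit_thread(void *arg)
{
    struct dialog *d = arg;
    pthread_mutex_lock(&d->sched_lock);
    d->entry_queued = false;
    pthread_mutex_unlock(&d->sched_lock);

    pthread_mutex_lock(&d->lock);   /* blocks while the ACK path holds it */
    d->pkt->retries++;              /* use-after-free if ACK had freed pkt */
    d->pkt->retransid = -1;         /* assumed: this was the last retransmit */
    pthread_mutex_unlock(&d->lock);
    return NULL;
}

/* ACK path; the caller holds d->lock exactly once (not recursively). */
static int ack_packet(struct dialog *d)
{
    int yields = 0;
    assert(d->pkt != NULL);
    /* Deletion failed and the id is still live: the callback is in flight. */
    while (d->pkt->retransid != -1 && fake_sched_del(d) != 0) {
        pthread_mutex_unlock(&d->lock);
        sched_yield();              /* let the callback take the lock */
        pthread_mutex_lock(&d->lock);
        yields++;
    }
    d->final_retries = d->pkt->retries;
    free(d->pkt);
    d->pkt = NULL;
    return yields;
}

/* Forces the racy interleaving; returns the retry count, or -1 on error. */
int run_race_demo(int *yields_out)
{
    struct dialog d = { .entry_queued = true, .final_retries = -1 };
    pthread_t tid;
    bool dequeued = false;

    pthread_mutex_init(&d.lock, NULL);
    pthread_mutex_init(&d.sched_lock, NULL);
    d.pkt = malloc(sizeof(*d.pkt));
    if (d.pkt == NULL)
        return -1;
    d.pkt->retransid = 7;
    d.pkt->retries = 0;

    pthread_mutex_lock(&d.lock);    /* the ACK path wins the lock first */
    if (pthread_create(&tid, NULL, retransmit_thread, &d) != 0) {
        pthread_mutex_unlock(&d.lock);
        free(d.pkt);
        return -1;
    }
    while (!dequeued) {             /* wait until the entry has been dequeued */
        pthread_mutex_lock(&d.sched_lock);
        dequeued = !d.entry_queued;
        pthread_mutex_unlock(&d.sched_lock);
        if (!dequeued)
            sched_yield();
    }
    *yields_out = ack_packet(&d);
    pthread_mutex_unlock(&d.lock);
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&d.lock);
    pthread_mutex_destroy(&d.sched_lock);
    return d.final_retries;
}
```

Calling `run_race_demo(&yields)` from a test harness should show the callback completing before the packet is freed: the ACK path gives up the lock at least once, and the retry count written by the callback is still readable when the ACK path finally frees the packet. Without the unlock-and-yield loop, the ACK path would free the packet while the callback was still waiting to touch it.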