Summary: | ASTERISK-11391: This crash happened this morning | ||
Reporter: | Private Name (falves11) | Labels: | |
Date Opened: | 2008-02-06 13:16:05.000-0600 | Date Closed: | 2008-02-29 17:57:35.000-0600 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Channels/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) blowup_1254AM.txt ( 1) blowup_FEB19-22-42.txt ( 2) blowup_FEB19-4-02PM.txt ( 3) blowup_FEB19-4-02PM-1.txt ( 4) blowup_FEB20-12-41PM.txt ( 5) blowup_FEB20-1AM.txt ( 6) blowup_valgrind_1.txt ( 7) blowup13.txt ( 8) blowup15.txt ( 9) blowup16.txt (10) extensions.conf (11) frozen.txt (12) lockedat4_26PM.txt (13) modules.conf (14) thread_apply_all_valgrind1.txt (15) valgrind_001.txt (16) valgrind_startup.txt (17) valgrind.txt (18) valgrind1.txt | |
Description: | I upgraded my production servers to the current version based on advice from the bug marshals regarding a dramatic improvement in memory issues, but got this crash today. | ||
Comments: | By: Joshua C. Colp (jcolp) 2008-02-11 16:32:16.000-0600 So any idea of what was happening at that moment? Is the console output available?
By: Private Name (falves11) 2008-02-11 17:03:15.000-0600 No idea what was happening. But my traffic is growing, so if it happens again, what should I do to capture the right info?
By: Private Name (falves11) 2008-02-12 17:08:51.000-0600 My current version, where the latest blowup happened, is 103314.
By: Private Name (falves11) 2008-02-12 17:20:26.000-0600 I noticed that the crash happened at the same time as I ran "core show channels concise". I have a cron job that executes this command every hour. It seems like "core show channels concise" is locking.
By: Private Name (falves11) 2008-02-13 10:55:01.000-0600 I am using version SVN-trunk-r103506M and I am not using any "core show channels" command. Yet it keeps blowing up, and the traces are identical. I have a lot of calls.
By: Tilghman Lesher (tilghman) 2008-02-13 12:26:43.000-0600 As usual, you're going to need to run this under valgrind.
By: Private Name (falves11) 2008-02-13 19:37:23.000-0600 I cannot use valgrind with 250 open calls. Is there any other way to get to the bottom of this? It does not blow up if I stay below 100 open calls. What I am doing now is splitting the traffic between three Asterisks, but I would like to use only one for each 500+ calls, if possible. I think that we are close. Please help. I am not proxying the media.
By: Tilghman Lesher (tilghman) 2008-02-13 19:53:23.000-0600 valgrind output is what I need to find the problem. If you're unable to provide that kind of information, there's very little we can do to track down the problem.
By: Private Name (falves11) 2008-02-15 15:02:06.000-0600 I uploaded the valgrind traces yesterday. It keeps blowing up under pressure. Please help.
By: Norman Franke (norman) 2008-02-15 15:38:53.000-0600 The issue with core show channels sounds like ASTERISK-11181.
I'm getting valgrind issues with chan_sip as well in ASTERISK-11408. Perhaps all of these are related? falves11, you could try the patch in ASTERISK-11408. My issues seem to revolve around interactions between rtp and chan_sip, as your valgrind dump also seemed to indicate.
By: Private Name (falves11) 2008-02-15 15:45:17.000-0600 Dear Norman: I am on trunk, so the patch that you suggest may not work. I am waiting for the bug marshals to indicate an action. Thanks.
By: Tilghman Lesher (tilghman) 2008-02-18 16:37:29.000-0600 I agree with norman. This could have been fixed with revision 103781, committed this morning.
By: Private Name (falves11) 2008-02-19 08:41:05.000-0600 Please look at the file "frozen.txt". I put all my traffic on one single Asterisk to test if it would indeed work. It got all the way to 335 calls in the evening, and in the morning a client woke me up saying "phones don't work". I logged in and Asterisk was up. I typed "show channels" and got the listing. It seemed normal in the sense that I could log in and type commands, but calls would not work and it had 4000 "frozen" channels. I don't know what to do. I restarted Asterisk and it started to work again, but I split the traffic in two. The question is, if this happens again, how do I troubleshoot it? If somebody wants to help, I can force all traffic onto one Asterisk again.
By: Tilghman Lesher (tilghman) 2008-02-19 13:05:07.000-0600 I'd like to see a "core show locks" from a server which has multiple channels apparently locked up like this.
By: Private Name (falves11) 2008-02-19 13:23:18.000-0600 Question: where in the source code tree is the place, if any, where I can expand the global number of locks available? It also seems that it happens after some time running, and it is not related to the number of calls at any particular time. So maybe the locks are being used but not returned to the pool(??). Question: Should I compile only with "don't optimize", or do I need any other additional option?
By: Tilghman Lesher (tilghman) 2008-02-19 13:50:35.000-0600 DONT_OPTIMIZE and DEBUG_THREADS.
By: Private Name (falves11) 2008-02-19 14:38:55.000-0600 The latest version (SVN-trunk-r103798M) is still crashing. I uploaded the trace and the "thread_apply_all" information. I think that we should make Asterisk crash-proof before we make it lock-proof. Please advise if I should run this again under valgrind, because it seems very similar, or identical, to the older crashes. Maybe the same issue that makes it lock makes it crash.
By: Tilghman Lesher (tilghman) 2008-02-19 14:53:15.000-0600 falves11: if you are volunteering to diagnose the problem and produce a patch, I'm all for that approach.
By: Private Name (falves11) 2008-02-19 15:26:22.000-0600 It locked up again. I had not recompiled it with "debug threads" yet, but nevertheless I took a "core show locks" and maybe somebody can see what happens from it. I have already recompiled it with "debug threads" and will continue to watch it to see if it locks again, and then take a "core show locks" again.
By: Private Name (falves11) 2008-02-19 15:36:16.000-0600 This is very strange. I don't have any idea how the function "ast_say_number_full_de" might be called, since my dialplan has no mention of say_number or say digit, etc.
fna = '\0' <repeats 255 times>
__PRETTY_FUNCTION__ = '\0' <repeats 22 times>
#9 0x08114faa in ast_say_number_full_de (chan=0x8235be8, num=86, ints=0xb6c58e90 <Address 0xb6c58e90 out of bounds>, language=0x1 <Address 0x1 out of bounds>, options=0x0, audiofd=0, ctrlfd=0) at say.c:810
By: Private Name (falves11) 2008-02-19 15:57:16.000-0600 It might be a problem with ODBC and the pooling of connections, since the command "odbc show" reports 247 open connections. No database can withstand that without slowing down. I don't think that I have enough calls to justify that number of connections. When that command was run I had only 89 open calls and 34 minutes in operation.
Maybe we should include code to close connections after a number of seconds idle; for example, if they remain unused for more than 120 seconds, they should be disposed of.
Sipserver*CLI> ODBC DSN Settings
-----------------
Name: global
DSN: mssql
Pooled: Yes> Limit: 1000
Connections in use: 247
By: Private Name (falves11) 2008-02-19 21:48:34.000-0600 Version SVN-trunk-r103828M is very unstable; it blows up with a small amount of traffic. Please look at the file blowup_FEB19-22-42.txt. I also had a blowup that left this message on the screen:
[Feb 20 00:35:38] ERROR[13621]: chan_sip.c:6255 process_sdp: Got SDP but have no RTP session allocated.
By: Norman Franke (norman) 2008-02-19 23:57:44.000-0600 I had another similar crash after installing the 11960 patch as well. Perhaps there are multiple crashes. I'm using ODBC as well, perhaps that's the problem? What ODBC driver are you using? Looks like FreeTDS, which I'm also using. I ran my test under valgrind, and valgrind didn't find anything, yet after using 200 threads it crashed and reported a corrupt stack trace, just like in this latest blowup. So, at least in my case, valgrind can't find it. Do you use IAX? My crashed thread was in the middle of several IAX threads.
By: Private Name (falves11) 2008-02-20 00:00:08.000-0600 I use only SIP. The problem is definitely new. I went back to version SVN-trunk-r103772M. I keep getting this error:
[Feb 20 11:55:45] ERROR[18368]: chan_sip.c:6140 process_sdp: Got SDP but have no RTP session allocated.
I googled it but cannot find an explanation.
By: Private Name (falves11) 2008-02-20 11:56:46.000-0600 Please help, it blows up every one or two hours, and I already went back to version 103772. It did not help. I can try valgrind if the marshals think that the traces do not offer insight. In fact, the traces always show the same information.
By: Tilghman Lesher (tilghman) 2008-02-20 12:02:16.000-0600 falves: please compile with DONT_OPTIMIZE and DEBUG_THREADS and get me a "core show locks" from when you have multiple channels locked up, as requested.
By: Private Name (falves11) 2008-02-20 12:05:30.000-0600 I already did that, please look at the file lockedat4_26.txt. Right now it is not locking. I don't go above 120 calls per box, but it crashes. The traces are there.
By: Tilghman Lesher (tilghman) 2008-02-20 12:46:12.000-0600 Are you running any modules that do not exist in core Asterisk?
By: Private Name (falves11) 2008-02-20 12:59:00.000-0600 Please look at my modules.conf. I am only loading what I need, not loading everything. The only extra piece would be the G723 codec, but I have no G723 traffic. I also have chan_h323 loaded but I don't receive h323 calls. My dialplan is a few lines long. I don't think that my operation could be simpler.
By: Private Name (falves11) 2008-02-21 14:45:45.000-0600 It has been very stable (SVN-trunk-r103908M) for over 21 hours, with good traffic. This is what I did: raised MAX_LOCKS to 256 from 64, raised MAX_AUTOMONS to 25000 from 1500, compiled it with optimizations, and removed chan_h323. Maybe when I compiled with "don't optimize" I reached some processing-speed threshold that made it crash often. In any case, if you can merge those two changes, I think this is near a perfect SIP server. Maybe chan_h323 is also messing up the whole thing. Some resources should be applied to chan_h323, since it is a very important piece of everybody's business.
By: Abhay Gupta (agupta) 2008-02-21 21:03:16.000-0600 Can you please tell us where this variable MAX_LOCKS is? We are certainly having this problem when the number of channels crosses 70, so this figure of 64 is interesting.
By: Private Name (falves11) 2008-02-21 21:10:12.000-0600 MAX_LOCKS is in main/util.c and the other one is in main/autoservice.c.
By: Private Name (falves11) 2008-02-22 10:23:47.000-0600 The problem is not gone.
I got a crash two hours ago. Since I did not compile with symbols, I got only this information, which looks similar to the past traces:
No symbol table info available.
#1 0x08088e6d in __ast_read (chan=0xabb59c0, dropaudio=0) at channel.c:2403
tmp = Variable "tmp" is not available.
By: Abhay Gupta (agupta) 2008-02-22 21:15:33.000-0600 falves11, is there any similarity between the hardware and software that we use? We use a lot of AGI with a mysql connection. Whenever we see a crash, the load on mysql at that time is on the higher side. Normally the load on our MYSQL server is around 2.0, and whenever we see a crash dump the load is around 4+. Moreover, we use the tor2 driver with 4E1 connectivity. Almost all the time we have 100+ calls on the server.
By: Private Name (falves11) 2008-02-22 21:27:45.000-0600 I don't use any hardware, my app is pure voip. I also don't use mysql but SQL Server 2005 on a separate computer, linked by freetds. The world owes me for having fixed a big bug in freetds that made it crash asterisk with a few open connections (that was two weeks ago). In the last 48 hours I have gone to 200 calls per box. I have a load balancer (cisco 3845) and several asterisks for processing the calls. The least-cost routing is done at the SQL box, and if the call fails I route it again until there are no more routes. It is a classic wholesale softswitch. I am trying to max out Asterisk and see how far it will go. I don't proxy the media.
By: Norman Franke (norman) 2008-02-25 10:36:17.000-0600 falves11- what TDS fix is this? I'm using it as well.
By: Private Name (falves11) 2008-02-25 10:57:42.000-0600 The freetds that you need is freetds-0.82RC2.tar.bz2. Up to three weeks ago, freetds had a big bug that made Asterisk crash when more than a few connections were open to the database. It all depends on your application, maybe you don't need this.
In any case, I had to hire Frediano Ziglio, one of the freetds developers, and he reproduced the issue on my machine, found the bug and fixed it. I paid a lot of Euros for his work. But the only other option was to keep shelling out license money to Easysoft for their ODBC driver (I already own a license), but they became greedy lately, so in a virtualized environment like mine they want money per virtual copy, which of course is absurd because you are just dividing the same power among several copies, so I was forced to go back to freetds. In my opinion, freetds now works at the same level as the Easysoft driver, which is $1500 per machine, virtual or not. So the world should pay me back the money I invested to fix this, a big bug that had gone undetected for years.
By: Norman Franke (norman) 2008-02-25 11:12:24.000-0600 falves11- I'm only running one connection. Where is RC2 anyway? I only see RC1 and 0.83.dev.20080225. We use FreeTDS on our Mac OS X clients and it does work well, and as a single real-time connection on Asterisk under Linux. Nothing directly attributable to it yet. Sounds like you'll still save money over time despite paying Frediano. We dropped EasySoft for the same reason, way too expensive.
By: Private Name (falves11) 2008-02-25 11:16:01.000-0600 This version is not released to the public. It sounds like your model is different from mine. If you need it, please write to the freetds mailing list. Maybe they have a final version, better than mine.
By: Digium Subversion (svnbot) 2008-02-29 17:30:58.000-0600 Repository: asterisk Revision: 105409 U branches/1.4/main/autoservice.c ------------------------------------------------------------------------ r105409 | russell | 2008-02-29 17:30:48 -0600 (Fri, 29 Feb 2008) | 23 lines Fix a major bug in autoservice. There was a race condition in the handling of the list of channels in autoservice.
The problem was that it was possible for a channel to get removed from autoservice and destroyed, while the autoservice was still messing with the channel. This led to memory corruption, and caused crashes. This explains multiple backtraces I have seen that have references to autoservice, but do to the nature of the issue (memory corruption), could cause crashes in a number of areas. (fixes the crash in BE-386) (closes issue ASTERISK-11165) (closes issue ASTERISK-11391) The following issues could be related. If you are the reporter of one of these, please update to include this fix and try again. (potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11545) (potentially fixes issue ASTERISK-11058) (potentially fixes issue ASTERISK-11453) (potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11437) (potentially fixes issue ASTERISK-11259) ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=105409
By: Digium Subversion (svnbot) 2008-02-29 17:33:02.000-0600 Repository: asterisk Revision: 105410 _U trunk/ U trunk/main/autoservice.c ------------------------------------------------------------------------ r105410 | russell | 2008-02-29 17:33:00 -0600 (Fri, 29 Feb 2008) | 31 lines Merged revisions 105409 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r105409 | russell | 2008-02-29 17:34:32 -0600 (Fri, 29 Feb 2008) | 23 lines Fix a major bug in autoservice. There was a race condition in the handling of the list of channels in autoservice. The problem was that it was possible for a channel to get removed from autoservice and destroyed, while the autoservice was still messing with the channel. This led to memory corruption, and caused crashes. This explains multiple backtraces I have seen that have references to autoservice, but do to the nature of the issue (memory corruption), could cause crashes in a number of areas.
(fixes the crash in BE-386) (closes issue ASTERISK-11165) (closes issue ASTERISK-11391) The following issues could be related. If you are the reporter of one of these, please update to include this fix and try again. (potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11545) (potentially fixes issue ASTERISK-11058) (potentially fixes issue ASTERISK-11453) (potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11437) (potentially fixes issue ASTERISK-11259) ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=105410
By: Digium Subversion (svnbot) 2008-02-29 17:57:04.000-0600 Repository: asterisk Revision: 105409 U branches/1.4/main/autoservice.c ------------------------------------------------------------------------ r105409 | russell | 2008-02-29 17:34:32 -0600 (Fri, 29 Feb 2008) | 23 lines Fix a major bug in autoservice. There was a race condition in the handling of the list of channels in autoservice. The problem was that it was possible for a channel to get removed from autoservice and destroyed, while the autoservice thread was still messing with the channel. This led to memory corruption, and caused crashes. This explains multiple backtraces I have seen that have references to autoservice, but do to the nature of the issue (memory corruption), could cause crashes in a number of areas. (fixes the crash in BE-386) (closes issue ASTERISK-11165) (closes issue ASTERISK-11391) The following issues could be related. If you are the reporter of one of these, please update to include this fix and try again.
(potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11545) (potentially fixes issue ASTERISK-11058) (potentially fixes issue ASTERISK-11453) (potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11437) (potentially fixes issue ASTERISK-11259) ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=105409
By: Digium Subversion (svnbot) 2008-02-29 17:57:35.000-0600 Repository: asterisk Revision: 105410 _U trunk/ U trunk/main/autoservice.c ------------------------------------------------------------------------ r105410 | russell | 2008-02-29 17:36:46 -0600 (Fri, 29 Feb 2008) | 31 lines Merged revisions 105409 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r105409 | russell | 2008-02-29 17:34:32 -0600 (Fri, 29 Feb 2008) | 23 lines Fix a major bug in autoservice. There was a race condition in the handling of the list of channels in autoservice. The problem was that it was possible for a channel to get removed from autoservice and destroyed, while the autoservice thread was still messing with the channel. This led to memory corruption, and caused crashes. This explains multiple backtraces I have seen that have references to autoservice, but do to the nature of the issue (memory corruption), could cause crashes in a number of areas. (fixes the crash in BE-386) (closes issue ASTERISK-11165) (closes issue ASTERISK-11391) The following issues could be related. If you are the reporter of one of these, please update to include this fix and try again. (potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11545) (potentially fixes issue ASTERISK-11058) (potentially fixes issue ASTERISK-11453) (potentially fixes issue ASTERISK-10713) (potentially fixes issue ASTERISK-11437) (potentially fixes issue ASTERISK-11259) ........
------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=105410 |
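Editor's note: the commit message for r105409 describes a race in which a channel could be removed from autoservice and destroyed while the autoservice thread was still touching it. The sketch below shows one common way to close that kind of race, and it is not the actual r105409 patch. Every name in it (struct channel, autoservice_add, autoservice_service_once, autoservice_remove, MAX_SERVICED) is hypothetical and chosen for illustration only. It assumes a single service thread: the remover unlinks the channel under a mutex and then waits on a condition variable until the service loop has stopped using it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_SERVICED 16            /* hypothetical fixed capacity */

/* Hypothetical stand-in for a channel; not Asterisk's struct ast_channel. */
struct channel {
    int id;
    int frames_read;
};

static pthread_mutex_t as_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t as_idle = PTHREAD_COND_INITIALIZER;
static struct channel *serviced[MAX_SERVICED];
static struct channel *in_use;     /* channel the service loop is touching */

/* Register a channel for servicing. Returns 0 on success, -1 if full. */
int autoservice_add(struct channel *chan)
{
    int res = -1;

    pthread_mutex_lock(&as_lock);
    for (size_t i = 0; i < MAX_SERVICED; i++) {
        if (serviced[i] == NULL) {
            serviced[i] = chan;
            res = 0;
            break;
        }
    }
    pthread_mutex_unlock(&as_lock);
    return res;
}

/* One pass of the service loop; a dedicated thread would call this repeatedly. */
void autoservice_service_once(void)
{
    for (size_t i = 0; i < MAX_SERVICED; i++) {
        struct channel *chan;

        pthread_mutex_lock(&as_lock);
        chan = serviced[i];
        in_use = chan;             /* publish what we are about to touch */
        pthread_mutex_unlock(&as_lock);

        if (chan != NULL) {
            chan->frames_read++;   /* stand-in for reading a frame, done unlocked */
        }

        pthread_mutex_lock(&as_lock);
        in_use = NULL;             /* done with this channel */
        pthread_cond_broadcast(&as_idle);
        pthread_mutex_unlock(&as_lock);
    }
}

/* Unregister a channel. On return the service loop no longer references it,
 * so the caller may safely destroy it. */
void autoservice_remove(struct channel *chan)
{
    pthread_mutex_lock(&as_lock);
    for (size_t i = 0; i < MAX_SERVICED; i++) {
        if (serviced[i] == chan) {
            serviced[i] = NULL;
        }
    }
    while (chan != NULL && in_use == chan) {   /* wait out an in-flight access */
        pthread_cond_wait(&as_idle, &as_lock);
    }
    pthread_mutex_unlock(&as_lock);
}
```

In use, the thread that owns a channel would call autoservice_add() to hand it over, and call autoservice_remove() before freeing it. Because the service loop publishes in_use under the same lock that guards the list, the remover either unlinks the channel before the loop picks it up, or blocks until the loop is finished with it.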