ASTERISK-11391: This crash happened this morning

Summary: ASTERISK-11391: This crash happened this morning

Reporter: Private Name (falves11)
Labels:

Date Opened: 2008-02-06 13:16:05.000-0600
Date Closed: 2008-02-29 17:57:35.000-0600

Priority: Critical
Regression? No

Status: Closed/Complete
Components: Channels/General

Versions:
Frequency of Occurrence:

Related Issues:

Environment:
Attachments: ( 0) blowup_1254AM.txt
( 1) blowup_FEB19-22-42.txt
( 2) blowup_FEB19-4-02PM.txt
( 3) blowup_FEB19-4-02PM-1.txt
( 4) blowup_FEB20-12-41PM.txt
( 5) blowup_FEB20-1AM.txt
( 6) blowup_valgrind_1.txt
( 7) blowup13.txt
( 8) blowup15.txt
( 9) blowup16.txt
(10) extensions.conf
(11) frozen.txt
(12) lockedat4_26PM.txt
(13) modules.conf
(14) thread_apply_all_valgrind1.txt
(15) valgrind_001.txt
(16) valgrind_startup.txt
(17) valgrind.txt
(18) valgrind1.txt

Description: I upgraded my production servers to the current version based on advice from the bug marshals regarding a dramatic improvement on memory issues, but got this crash today.

Comments: By: Joshua C. Colp (jcolp) 2008-02-11 16:32:16.000-0600

So any idea of what was happening at that moment? Is the console output available?
By: Private Name (falves11) 2008-02-11 17:03:15.000-0600

No idea what was happening. But my traffic is growing, so if it happens again, what should I do to capture the right info?
By: Private Name (falves11) 2008-02-12 17:08:51.000-0600

My current version, where the latest blowup happened, is 103314.
By: Private Name (falves11) 2008-02-12 17:20:26.000-0600

I noticed that the crash happened at the same time as I ran "core show channels concise". I have a cron job that executes this command every hour. It seems like "core show channels concise" is locking.
By: Private Name (falves11) 2008-02-13 10:55:01.000-0600

I am using version SVN-trunk-r103506M and I am not using any "core show channels" command. Yet it keeps blowing up, and the traces are identical. I have a lot of calls.
By: Tilghman Lesher (tilghman) 2008-02-13 12:26:43.000-0600

As usual, you're going to need to run this under valgrind.
By: Private Name (falves11) 2008-02-13 19:37:23.000-0600

I cannot use valgrind with 250 open calls. Is there any other way to get to the bottom of this? It does not blow up if I stay below 100 open calls. What I am doing now is splitting the traffic between three Asterisks, but I would like to use only one for each 500+ calls, if possible. I think that we are close. Please help. I am not proxying the media.
By: Tilghman Lesher (tilghman) 2008-02-13 19:53:23.000-0600

valgrind output is what I need to find the problem. If you're unable to provide that kind of information, there's very little we can do to track down the problem.
By: Private Name (falves11) 2008-02-15 15:02:06.000-0600

I uploaded yesterday the valgrind traces. It keeps blowing up under pressure. Please help.
By: Norman Franke (norman) 2008-02-15 15:38:53.000-0600

The issue with core show channels sounds like ASTERISK-11181. I'm getting valgrind issues with chan_sip as well in ASTERISK-11408. Perhaps all of these are related?

falves11 you could try the patch in ASTERISK-11408. My issues seem to revolve around interactions between rtp and chan_sip as your valgrind dump also seemed to indicate.
By: Private Name (falves11) 2008-02-15 15:45:17.000-0600

Dear Norman: I am in Trunk, so the patch that you suggest may not work. I am waiting for the bug marshals to indicate an action. Thanks.
By: Tilghman Lesher (tilghman) 2008-02-18 16:37:29.000-0600

I agree with norman. This could have been fixed with revision 103781, committed this morning.
By: Private Name (falves11) 2008-02-19 08:41:05.000-0600

Please look at the file "frozen.txt". I put all my traffic on one single Asterisk to test if it would indeed work. It got all the way to 335 calls in the evening, and in the morning a client woke me up saying "phones don't work". I logged in and Asterisk was up. I typed "show channels" and got the listing. It seemed normal in the sense that I could log in and type commands, but calls would not work and it had 4000 "frozen" channels. I don't know what to do. I restarted Asterisk and it started to work again, but I split the traffic in two. The question is, if this happens again, how do I troubleshoot it? If somebody wants to help, I can force all traffic onto one Asterisk again.
By: Tilghman Lesher (tilghman) 2008-02-19 13:05:07.000-0600

I'd like to see a "core show locks" from a server which has multiple channels apparently locked up like this.
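
For anyone following along, here is a minimal sketch of one way to grab that output the moment a lock-up is noticed. It assumes the `asterisk` binary is on the PATH and relies on its `-rx` option to run a single CLI command against the running instance. The helper names and the output file naming are made up for illustration. As the later comments in this thread note, "core show locks" needs a build with DEBUG_THREADS.

```python
import subprocess
from datetime import datetime


def cli_command(command, binary="asterisk"):
    """Build the argv that asks a running Asterisk to execute one CLI command."""
    return [binary, "-rx", command]


def capture(command, out_dir="."):
    """Run the CLI command and save whatever it prints to a timestamped file."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = f"{out_dir}/{command.replace(' ', '_')}_{stamp}.txt"
    result = subprocess.run(
        cli_command(command), capture_output=True, text=True, check=False
    )
    with open(path, "w") as fh:
        fh.write(result.stdout)
    return path
```

Calling `capture("core show locks")` while the channels are stuck would leave a file that can be attached to the issue.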
By: Private Name (falves11) 2008-02-19 13:23:18.000-0600

Question: where in the source tree, if anywhere, can I expand the global number of locks available? It also seems that it happens after some time running, and is not related to the number of calls at any particular time. So maybe the locks are being used but not returned to the pool(??).
Question: Should I compile only with "don't optimize", or do I need any other additional option?

By: Tilghman Lesher (tilghman) 2008-02-19 13:50:35.000-0600

DONT_OPTIMIZE and DEBUG_THREADS.
By: Private Name (falves11) 2008-02-19 14:38:55.000-0600

The latest version (SVN-trunk-r103798M) is still crashing. I uploaded the trace and the "thread_apply_all" information. I think that we should make Asterisk crash-proof before we make it lock-proof. Please advise if I should run this again under valgrind, because it seems very similar, or identical, to the older crashes. Maybe the same issue that makes it lock makes it crash.
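
As a rough illustration of how traces like these can be collected from a core file without an interactive session, the sketch below runs gdb in batch mode with "bt full" and "thread apply all bt". The binary and core paths are placeholders, and the helper names are hypothetical. Only the gdb options `-batch` and `-ex` are relied on.

```python
import subprocess


def gdb_backtrace_argv(binary, core):
    """argv for a non-interactive gdb run that prints backtraces for all threads."""
    return [
        "gdb", "-batch",
        "-ex", "bt full",               # full backtrace of the crashing thread
        "-ex", "thread apply all bt",   # one backtrace per thread
        binary, core,
    ]


def dump_backtraces(binary, core, out_path):
    """Run gdb and write its output to out_path (paths are the caller's choice)."""
    result = subprocess.run(
        gdb_backtrace_argv(binary, core), capture_output=True, text=True, check=False
    )
    with open(out_path, "w") as fh:
        fh.write(result.stdout)
    return out_path
```

The traces are far more useful when the binary was built with DONT_OPTIMIZE, as requested above.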
By: Tilghman Lesher (tilghman) 2008-02-19 14:53:15.000-0600

falves11: if you are volunteering to diagnose the problem and produce a patch, I'm all for that approach.
By: Private Name (falves11) 2008-02-19 15:26:22.000-0600

It locked up again. I had not recompiled it with "debug threads" yet, but nevertheless I took a "core show locks" and maybe somebody can see what happens from it. I have already recompiled it with "debug threads" and will continue to watch it to see if it locks again, and then take a "core show locks" again.
By: Private Name (falves11) 2008-02-19 15:36:16.000-0600

This is very strange. I don't have any idea how the function "ast_say_number_full_de" might be called, since my dialplan has no mention of say_number or say digit, etc.

fna = '\0' <repeats 255 times>
__PRETTY_FUNCTION__ = '\0' <repeats 22 times>
#9 0x08114faa in ast_say_number_full_de (chan=0x8235be8, num=86,
ints=0xb6c58e90 <Address 0xb6c58e90 out of bounds>, language=0x1 <Address 0x1 out of bounds>,
options=0x0, audiofd=0, ctrlfd=0) at say.c:810
By: Private Name (falves11) 2008-02-19 15:57:16.000-0600

It might be a problem with ODBC and the pooling of connections, since the command "odbc show" reports 247 open connections. No database can withstand that without slowing down. I don't think that I have enough calls to justify that number of connections. When that command was run I had only 89 open calls and 34 minutes in operation. Maybe we should include code to close connections after a number of seconds idle; for example, if they remain unused for more than 120 seconds, they should be disposed of.

Sipserver*CLI>
ODBC DSN Settings
-----------------

Name: global
DSN: mssql
Pooled: Yes
Limit: 1000
Connections in use: 247
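
The idle-timeout idea proposed in this comment can be sketched roughly as below. This is not res_odbc code. It is a toy pool written for illustration, and the class name, the `connect` callable and the injectable `clock` are all assumptions. The 120-second default comes from the comment above.

```python
import time


class IdleReapingPool:
    """Toy connection pool: connections idle longer than max_idle are closed."""

    def __init__(self, connect, max_idle=120.0, clock=time.monotonic):
        self._connect = connect      # callable returning an object with close()
        self._max_idle = max_idle
        self._clock = clock
        self._idle = []              # (connection, time it was released)
        self.in_use = 0

    def acquire(self):
        self.reap()                  # drop stale connections before reusing any
        if self._idle:
            conn, _ = self._idle.pop()
        else:
            conn = self._connect()
        self.in_use += 1
        return conn

    def release(self, conn):
        self.in_use -= 1
        self._idle.append((conn, self._clock()))

    def reap(self):
        """Close idle connections past the limit; return how many were closed."""
        now = self._clock()
        keep, closed = [], 0
        for conn, released_at in self._idle:
            if now - released_at > self._max_idle:
                conn.close()
                closed += 1
            else:
                keep.append((conn, released_at))
        self._idle = keep
        return closed
```

Reaping on every `acquire` keeps the sketch free of background threads. A real pool would more likely run the reaper on a timer so that connections are also closed while no new calls arrive.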

By: Private Name (falves11) 2008-02-19 21:48:34.000-0600

Version SVN-trunk-r103828M is very unstable; it blows up with a small amount of traffic. Please look at the file blowup_FEB19-22-42.txt.
I also had a blowup that had this message on the screen:
[Feb 20 00:35:38] ERROR[13621]: chan_sip.c:6255 process_sdp: Got SDP but have no RTP session allocated.

By: Norman Franke (norman) 2008-02-19 23:57:44.000-0600

I had another similar crash after installing the 11960 patch as well. Perhaps there are multiple crashes. I'm using ODBC as well, perhaps that's the problem? What ODBC driver are you using? Looks like FreeTDS, which I'm also using.

I ran my test under valgrind, and valgrind didn't find anything, yet after using 200 threads it crashed and reported a corrupt stack trace just like in this latest blowup. So, at least in my case, valgrind can't find it.

Do you use IAX? My crashed thread was in the middle of several IAX threads.
By: Private Name (falves11) 2008-02-20 00:00:08.000-0600

I use only SIP. The problem is definitely new. I went back to version SVN-trunk-r103772M. I keep getting this error: [Feb 20 11:55:45] ERROR[18368]: chan_sip.c:6140 process_sdp: Got SDP but have no RTP session allocated. I googled it but cannot find an explanation.

By: Private Name (falves11) 2008-02-20 11:56:46.000-0600

Please help, it blows up every one or two hours, and I already went back to version 103772. It did not help. I can try valgrind if the marshals think that the traces do not offer insight. In fact, the traces always show the same information.
By: Tilghman Lesher (tilghman) 2008-02-20 12:02:16.000-0600

falves: please compile with DONT_OPTIMIZE and DEBUG_THREADS and get me a "core show locks" from when you have multiple channels locked up, as requested.
By: Private Name (falves11) 2008-02-20 12:05:30.000-0600

I already did that; please look at the file lockedat4_26PM.txt.
Right now it is not locking. I don't go above 120 calls per box, but it crashes. The traces are there.
By: Tilghman Lesher (tilghman) 2008-02-20 12:46:12.000-0600

Are you running any modules that do not exist in core Asterisk?
By: Private Name (falves11) 2008-02-20 12:59:00.000-0600

Please look at my modules.conf. I am only loading what I need, not loading everything. The only extra piece would be the G723 codec, but I have no G723 traffic. I also have chan_h323 loaded but I don't receive h323 calls. My dialplan is a few lines long. I don't think that my operation could be simpler.

By: Private Name (falves11) 2008-02-21 14:45:45.000-0600

It has been very stable (SVN-trunk-r103908M) for over 21 hours, with good traffic. This is what I did: raised MAX_LOCKS from 64 to 256 and MAX_AUTOMONS from 1500 to 25000, compiled with optimizations, and removed chan_h323. Maybe when I compiled with "don't optimize" I reached some processing-speed threshold that made it crash often. In any case, if you can merge those two changes, I think this is near a perfect SIP server. Maybe chan_h323 is also messing up the whole thing. Some resources should be applied to chan_h323, since it is a very important piece of everybody's business.

By: Abhay Gupta (agupta) 2008-02-21 21:03:16.000-0600

Can you please tell us where this variable MAX_LOCKS is?

We are certainly having this problem when the number of channels crosses 70, so this figure of 64 could be interesting.
By: Private Name (falves11) 2008-02-21 21:10:12.000-0600

main/util.c MAX_LOCKS
and the other one is in main/autoservice.c
By: Private Name (falves11) 2008-02-22 10:23:47.000-0600

The problem is not gone. I got a crash two hours ago. Since I did not compile with symbols, I got only this information, which looks similar to the past traces:
No symbol table info available.
#1 0x08088e6d in __ast_read (chan=0xabb59c0, dropaudio=0) at channel.c:2403
tmp = Variable "tmp" is not available.
By: Abhay Gupta (agupta) 2008-02-22 21:15:33.000-0600

falves11, is there any similarity between the hardware and software that we use?

We use a lot of AGI with a MySQL connection. Whenever we see a crash, the load on MySQL at that time is on the higher side. Normally the load on our MySQL server is around 2.0, and whenever we see a crash dump the load is around 4+.

Moreover, we use the tor2 driver with 4E1 connectivity. Almost all the time we have 100+ calls on the server.
By: Private Name (falves11) 2008-02-22 21:27:45.000-0600

I don't use any hardware; my app is pure VoIP. I also don't use MySQL but SQL Server 2005 on a separate computer, linked by freetds. The world owes me for having fixed a big bug in freetds that made it crash Asterisk with a few open connections (that was two weeks ago). In the last 48 hours I have gone to 200 calls per box. I have a load balancer (Cisco 3845) and several Asterisk boxes for processing the calls. The least-cost routing is done at the SQL box, and if the call fails I route it again until there are no more routes. It is a classic wholesale softswitch. I am trying to max out Asterisk and see how far it will go. I don't proxy the media.
By: Norman Franke (norman) 2008-02-25 10:36:17.000-0600

falves11- what TDS fix is this? I'm using it as well.
By: Private Name (falves11) 2008-02-25 10:57:42.000-0600

The freetds that you need is freetds-0.82RC2.tar.bz2. Up to three weeks ago, freetds had a big bug that made Asterisk crash when more than a few connections were open to the database. It all depends on your application; maybe you don't need this. In any case, I had to hire Frediano Ziglio, one of the freetds developers, and he reproduced the issue on my machine, found the bug and fixed it. I paid a lot of Euros for his work. The only other option was to keep shelling out license money to Easysoft for their ODBC driver (I already own a license), but they became greedy lately, so in a virtualized environment like mine they want money per virtual copy, which of course is absurd because you are just dividing the same power among several copies. So I was forced to go back to freetds. In my opinion, freetds now works at the same level as the Easysoft driver, which is $1500 per machine, virtual or not. So the world should pay me back the money I invested to fix this, a big bug that had gone undetected for years.
By: Norman Franke (norman) 2008-02-25 11:12:24.000-0600

falves11- I'm only running one connection. Where is RC2 anyway? I only see RC1 and 0.83.dev.20080225. We use FreeTDS on our Mac OS X clients, where it works well, and as a single real-time connection on Asterisk under Linux. Nothing directly attributable to it yet.

Sounds like you'll still save money over time despite paying Frediano. We dropped EasySoft for the same reason, way too expensive.
By: Private Name (falves11) 2008-02-25 11:16:01.000-0600

This version is not released to the public. It sounds like your model is different than mine. If you need it, please write to the freetds mailing list. Maybe they have a final version, better than mine.
By: Digium Subversion (svnbot) 2008-02-29 17:30:58.000-0600

Repository: asterisk
Revision: 105409

U branches/1.4/main/autoservice.c

------------------------------------------------------------------------
r105409 | russell | 2008-02-29 17:30:48 -0600 (Fri, 29 Feb 2008) | 23 lines

Fix a major bug in autoservice. There was a race condition in the handling of
the list of channels in autoservice. The problem was that it was possible for
a channel to get removed from autoservice and destroyed, while the autoservice
was still messing with the channel. This led to memory corruption, and caused
crashes. This explains multiple backtraces I have seen that have references
to autoservice, but due to the nature of the issue (memory corruption), could
cause crashes in a number of areas.

(fixes the crash in BE-386)
(closes issue ASTERISK-11165)
(closes issue ASTERISK-11391)

The following issues could be related. If you are the reporter of one of these,
please update to include this fix and try again.

(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11545)
(potentially fixes issue ASTERISK-11058)
(potentially fixes issue ASTERISK-11453)
(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11437)
(potentially fixes issue ASTERISK-11259)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105409
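
To make the race described in this commit message concrete, here is a small Python model of the pattern. It is not the code from main/autoservice.c and does not claim to reproduce the actual fix. The class and attribute names are hypothetical, and a `destroyed` flag stands in for freed memory. The worker takes a snapshot of the channel list and then touches each channel outside the lock, so a remover has to wait until the worker has taken a fresh snapshot before it is safe to destroy the channel.

```python
import threading
import time


class Channel:
    """Stand-in for a channel; 'destroyed' marks what would be freed memory."""
    def __init__(self, name):
        self.name = name
        self.destroyed = False


class AutoService:
    def __init__(self):
        self._cond = threading.Condition()
        self._channels = []
        self._passes = 0       # number of snapshots the worker has taken
        self._stop = False
        self.violations = 0    # times the worker touched a destroyed channel
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            with self._cond:
                if self._stop:
                    return
                snapshot = list(self._channels)
                self._passes += 1
                self._cond.notify_all()
            for chan in snapshot:      # work happens outside the lock
                if chan.destroyed:
                    self.violations += 1
            time.sleep(0.001)

    def start(self, chan):
        with self._cond:
            self._channels.append(chan)

    def stop(self, chan):
        """Remove chan; return only once the worker can no longer hold it."""
        with self._cond:
            self._channels.remove(chan)
            seen = self._passes
            # A snapshot taken before the removal may still contain chan,
            # so wait until the worker has taken a newer one.
            self._cond.wait_for(lambda: self._passes > seen or self._stop)

    def shutdown(self):
        with self._cond:
            self._stop = True
            self._cond.notify_all()
        self._worker.join()
```

Without the wait in `stop`, the caller could mark the channel destroyed while the worker is still iterating over an older snapshot, which is the kind of use-after-free the commit message describes.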
By: Digium Subversion (svnbot) 2008-02-29 17:33:02.000-0600

Repository: asterisk
Revision: 105410

_U trunk/
U trunk/main/autoservice.c

------------------------------------------------------------------------
r105410 | russell | 2008-02-29 17:33:00 -0600 (Fri, 29 Feb 2008) | 31 lines

Merged revisions 105409 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r105409 | russell | 2008-02-29 17:34:32 -0600 (Fri, 29 Feb 2008) | 23 lines

Fix a major bug in autoservice. There was a race condition in the handling of
the list of channels in autoservice. The problem was that it was possible for
a channel to get removed from autoservice and destroyed, while the autoservice
was still messing with the channel. This led to memory corruption, and caused
crashes. This explains multiple backtraces I have seen that have references
to autoservice, but due to the nature of the issue (memory corruption), could
cause crashes in a number of areas.

(fixes the crash in BE-386)
(closes issue ASTERISK-11165)
(closes issue ASTERISK-11391)

The following issues could be related. If you are the reporter of one of these,
please update to include this fix and try again.

(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11545)
(potentially fixes issue ASTERISK-11058)
(potentially fixes issue ASTERISK-11453)
(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11437)
(potentially fixes issue ASTERISK-11259)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105410
By: Digium Subversion (svnbot) 2008-02-29 17:57:04.000-0600

Repository: asterisk
Revision: 105409

U branches/1.4/main/autoservice.c

------------------------------------------------------------------------
r105409 | russell | 2008-02-29 17:34:32 -0600 (Fri, 29 Feb 2008) | 23 lines

Fix a major bug in autoservice. There was a race condition in the handling of
the list of channels in autoservice. The problem was that it was possible for
a channel to get removed from autoservice and destroyed, while the autoservice
thread was still messing with the channel. This led to memory corruption, and
caused crashes. This explains multiple backtraces I have seen that have
references to autoservice, but due to the nature of the issue (memory corruption),
could cause crashes in a number of areas.

(fixes the crash in BE-386)
(closes issue ASTERISK-11165)
(closes issue ASTERISK-11391)

The following issues could be related. If you are the reporter of one of these,
please update to include this fix and try again.

(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11545)
(potentially fixes issue ASTERISK-11058)
(potentially fixes issue ASTERISK-11453)
(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11437)
(potentially fixes issue ASTERISK-11259)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105409
By: Digium Subversion (svnbot) 2008-02-29 17:57:35.000-0600

Repository: asterisk
Revision: 105410

_U trunk/
U trunk/main/autoservice.c

------------------------------------------------------------------------
r105410 | russell | 2008-02-29 17:36:46 -0600 (Fri, 29 Feb 2008) | 31 lines

Merged revisions 105409 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r105409 | russell | 2008-02-29 17:34:32 -0600 (Fri, 29 Feb 2008) | 23 lines

Fix a major bug in autoservice. There was a race condition in the handling of
the list of channels in autoservice. The problem was that it was possible for
a channel to get removed from autoservice and destroyed, while the autoservice
thread was still messing with the channel. This led to memory corruption, and
caused crashes. This explains multiple backtraces I have seen that have
references to autoservice, but due to the nature of the issue (memory corruption),
could cause crashes in a number of areas.

(fixes the crash in BE-386)
(closes issue ASTERISK-11165)
(closes issue ASTERISK-11391)

The following issues could be related. If you are the reporter of one of these,
please update to include this fix and try again.

(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11545)
(potentially fixes issue ASTERISK-11058)
(potentially fixes issue ASTERISK-11453)
(potentially fixes issue ASTERISK-10713)
(potentially fixes issue ASTERISK-11437)
(potentially fixes issue ASTERISK-11259)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105410