|Summary:||ASTERISK-13187: Crash (included coredump)|
|Reporter:||Iñaki Baz Castillo (ibc)||Labels:|
|Date Opened:||2008-12-09 12:04:20.000-0600||Date Closed:||2009-02-05 13:30:15.000-0600|
|Environment:||Attachments:||( 0) gdb_1.4.23-rc2+patch.txt|
( 1) gdb.txt
( 2) more_gdb.txt
( 3) valgrind_2days.txt
( 4) valgrind_malloc_debug_2days.txt
|Description:||Asterisk 1.4.22 crashed while handling around 60 SIP channels. This server usually supports much more traffic (~ 500 SIP channels).|
Before version 1.4.22, this crash occurred very often during the peak call period (about once every 2 days). That error seemed to be fixed in 1.4.22, until now. Of course, it could be a very different issue.
There are no Zap devices, just SIP users and SIP trunks.
We write CDRs to an external MySQL server via ODBC and use a dynamic realtime dialplan.
GDB output is attached.
|Comments:||By: Tilghman Lesher (tilghman) 2008-12-09 14:14:39.000-0600|
In gdb, please report the output of the following commands:
By: Iñaki Baz Castillo (ibc) 2008-12-10 02:16:34.000-0600
I've attached "more_gdb.txt" file containing the suggested GDB commands.
By: Tilghman Lesher (tilghman) 2008-12-10 12:26:13.000-0600
Crud, you have corrupted memory. We're going to need to do this under valgrind. See doc/valgrind.txt.
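For readers unfamiliar with the procedure, starting Asterisk under valgrind can be sketched roughly as below. This is only an illustration and not the invocation given in doc/valgrind.txt, which is not quoted in this thread. The helper name `build_valgrind_command`, the log file name, and the choice of Asterisk options (`-c` console, `-g` core dump on crash, `-v` verbosity) are assumptions.

```python
def build_valgrind_command(log_file="valgrind.txt", asterisk_bin="asterisk"):
    """Return an argv list that would start Asterisk under valgrind.

    Hypothetical helper: the log file name and the Asterisk options are
    assumptions for illustration, not taken from doc/valgrind.txt.
    """
    return [
        "valgrind",
        "--log-file=" + log_file,  # write valgrind's report to this file
        asterisk_bin,
        "-vvvvcg",                 # verbose, console mode, dump core on crash
    ]


if __name__ == "__main__":
    # Only print the command; actually running it is left to the operator.
    print(" ".join(build_valgrind_command()))
```

The resulting log file is what the developers ask to have attached to the issue.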
By: Iñaki Baz Castillo (ibc) 2008-12-11 04:32:22.000-0600
Thanks, I will enable it and continue the report when the crash occurs again.
By: Mark Michelson (mmichelson) 2008-12-16 11:20:28.000-0600
I'm curious to know if the changes for issue ASTERISK-13204 fix this bug. Could you test with the final patch there and see if the problem is fixed? Thanks.
By: Iñaki Baz Castillo (ibc) 2008-12-17 05:32:28.000-0600
Revision 163080 of app_queue.c doesn't compile against 1.4.22:
[CC] app_queue.c -> app_queue.o
app_queue.c: In function ‘try_calling’:
app_queue.c:3165: error: ‘AST_PBX_NO_HANGUP_PEER_PARKED’ undeclared (first use in this function)
app_queue.c:3165: error: (Each undeclared identifier is reported only once
app_queue.c:3165: error: for each function it appears in.)
Which code should I patch exactly to enable this patch in Asterisk 1.4.22?
By: Iñaki Baz Castillo (ibc) 2008-12-17 05:34:39.000-0600
Maybe I have to upgrade to 1.4 SVN? Is that recommended for a production server with high traffic?
By: TOC Jason (toc) 2008-12-17 07:59:50.000-0600
No, it is best to reproduce the issue in a test / development environment.
By: Iñaki Baz Castillo (ibc) 2008-12-17 08:26:48.000-0600
Well, I've installed the patch in Asterisk-1.4.23-rc2. I think I can test it in production (it's already crashing with 1.4.22 so it cannot be worse XD).
By: Leif Madsen (lmadsen) 2008-12-17 09:28:04.000-0600
Let us know if it is still crashing. If so, then you will need to get a valgrind trace as requested by Corydon76.
This is going to make your system quite slow, so you will need to reproduce in a test environment as mentioned by toc, then once reproduced, you can run under valgrind to get the information necessary to move the issue along (assuming it is a memory corruption issue).
Hopefully the patch putnopvut mentioned "just works".
By: Mark Michelson (mmichelson) 2008-12-17 09:31:52.000-0600
I should have been more specific when I referred to issue DAHLIN-153. It is closed, meaning that the changes are in the 1.4 branch of subversion, and the patch there was made against a recent checkout of the 1.4 branch. Therefore it may be that the patch does not apply cleanly to the 1.4.22 tag.
By: Iñaki Baz Castillo (ibc) 2008-12-17 11:18:07.000-0600
Yes. Anyway the patch is valid for Asterisk-1.4.23-rc2 in which I've already applied it.
By: Iñaki Baz Castillo (ibc) 2008-12-23 03:36:53.000-0600
I've applied the following patch to Asterisk 1.4.23-rc2:
and it is crashing a lot. I attach the new GDB output.
By: Tilghman Lesher (tilghman) 2008-12-23 09:06:31.000-0600
As previously requested, we need the valgrind output.
By: Leif Madsen (lmadsen) 2009-01-06 09:15:03.000-0600
Pinging ibc: have you been able to get the valgrind output as requested? Thanks!
By: Iñaki Baz Castillo (ibc) 2009-01-07 04:05:34.000-0600
Hi, no more crashes for now so I have no data. Please let this bug remain open, I will comment on it as soon as I have more data.
Thanks a lot.
By: Mark Michelson (mmichelson) 2009-01-13 14:03:11.000-0600
By the way, looking closely at this, it seems that this issue may be the same as issue ASTERISK-13226. I have recently added patches there and it is being tested. When I have heard positive feedback from there, I will update this issue too.
Anyone who is following this issue may feel free to try the patches on issue ASTERISK-13226 as well.
By: Leif Madsen (lmadsen) 2009-01-14 14:30:38.000-0600
I've added a relationship for now. Feel free to remove it if you don't feel they are related.
I kinda wish we had a "may be related to" function :)
By: Mark Michelson (mmichelson) 2009-01-14 18:17:35.000-0600
I have committed the second patch from ASTERISK-13226 to 1.4 at this point since it solved the crashes for the two people who reported them there.
I recommend reading the bug note history there since it was discovered that if you are using any officially released version of Asterisk (up to 1.4.22, in other words) then the patches there will actually cause more problems than they will fix.
By: zgrin (zgrin) 2009-01-27 05:32:07.000-0600
It doesn't seem to crash when running with valgrind. At least it hasn't crashed yet for me while running with valgrind. We took off valgrind last week and ran asterisk normally, and it continued crashing.
By: Tilghman Lesher (tilghman) 2009-01-27 09:54:29.000-0600
zgrin: it may not crash while running valgrind, but the information within the valgrind log is STILL useful to us. Please upload that file here.
By: Iñaki Baz Castillo (ibc) 2009-01-27 10:21:52.000-0600
So then, if Asterisk won't crash when running under valgrind, when should I stop valgrind and upload the logs for this report? I was waiting for Asterisk to crash, but as zgrin states, it doesn't crash when running under valgrind.
By: Tilghman Lesher (tilghman) 2009-01-27 10:53:13.000-0600
ibc: if Asterisk normally crashes within 60 minutes when not running under valgrind, then running Asterisk for 60 minutes under valgrind should be sufficient.
By: Mark Michelson (mmichelson) 2009-01-27 11:00:02.000-0600
Please keep in mind that issue ASTERISK-13226, which I am 95% sure is the same issue as this one, has already been closed. The fix from that issue is in the recently-released 1.4.23. If someone can confirm that you are not getting this same crash on 1.4.23, then I can get this issue closed, as I suspect this has been fixed.
By: Iñaki Baz Castillo (ibc) 2009-01-27 11:06:11.000-0600
Corydon76: When not running under valgrind, Asterisk doesn't always crash within 60 minutes. Sometimes it runs for a long time without crashing, but when it does crash, the coredump points to a segmentation fault in the queue application. We suspect the crash occurs when handling several queue calls.
putnopvut: We plan to upgrade to 1.4.23. Asterisk has been running under valgrind for a long time (more than a week) without crashing in its normal production state. So we'll upgrade Asterisk and after some days I'll comment here on whether the bug has happened again.
Thanks to all.
By: Leif Madsen (lmadsen) 2009-01-28 16:55:24.000-0600
ibc: note that Asterisk will quite likely NOT crash while running under valgrind, so you'll want to run Asterisk in the situation where it would normally crash (even though it won't), and then attach the valgrind output to this issue.
By: Iñaki Baz Castillo (ibc) 2009-01-29 05:32:10.000-0600
I've attached the valgrind output from two days of running in normal production.
By: Mark Michelson (mmichelson) 2009-01-29 16:34:59.000-0600
ibc's valgrind output is showing the same sort of invalid memory access to a freed datastore that was seen in issue ASTERISK-13226. At this point, I'll take just a single confirmation that the patches there fix your problems, because I am now 99.999% sure that these are the same problem.
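A log of that size is easier to triage with a small script. The sketch below counts Memcheck "Invalid read/write" reports by the innermost function in their stack trace. It assumes the standard Memcheck line layout (`==<pid>==` prefix, `at 0x...: function (file:line)` frames). The helper name `summarize_invalid_accesses` and the sample log are made up for illustration and are not taken from the attached files.

```python
import re
from collections import Counter

# Memcheck prefixes every line with "==<pid>==".
PREFIX = re.compile(r"^==\d+==\s?")
INVALID = re.compile(r"^Invalid (read|write) of size (\d+)")
FRAME = re.compile(r"^\s*(?:at|by) 0x[0-9A-Fa-f]+: (\S+)")

# Made-up sample in Memcheck's format; NOT taken from the attached logs.
SAMPLE = """\
==4242== Invalid read of size 4
==4242==    at 0x80C1234: example_function (example.c:10)
==4242==    by 0x80C5678: caller_function (example.c:20)
==4242==  Address 0x4A3B028 is 8 bytes inside a block of size 64 free'd
==4242==    at 0x4022A3B: free (vg_replace_malloc.c:233)
"""


def summarize_invalid_accesses(lines):
    """Count invalid reads/writes per (kind, innermost function)."""
    counts = Counter()
    pending = None  # kind of the report whose first stack frame we await
    for raw in lines:
        text = PREFIX.sub("", raw.rstrip("\n"))
        report = INVALID.match(text)
        if report:
            pending = report.group(1)
            continue
        if pending:
            frame = FRAME.match(text)
            if frame:
                counts[(pending, frame.group(1))] += 1
            pending = None  # only the line right after the report is used
    return counts


if __name__ == "__main__":
    for (kind, func), n in summarize_invalid_accesses(SAMPLE.splitlines()).items():
        print(kind, func, n)
```

To run it against a real log, pass the open file object instead of `SAMPLE.splitlines()`. Reports whose "Address ... free'd" line follows are the use-after-free accesses described above.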
By: Iñaki Baz Castillo (ibc) 2009-01-30 05:01:54.000-0600
Thanks putnopvut, we have already upgraded to 1.4.23.1. Let's see whether it still crashes. After some days (10-20) I'll report the results here.
By: Leif Madsen (lmadsen) 2009-01-30 08:34:53.000-0600
Instead of waiting for 10-20 days for a report back, I'm just going to close this issue because putnopvut seems pretty sure this is the same fix as bug 14086.
If 1.4.23 does not fix this for you, PLEASE try the latest SVN branch before re-opening this issue as it MAY have been fixed after that release was created.
By: Iñaki Baz Castillo (ibc) 2009-02-02 03:09:39.000-0600
blitzrage: I agree with closing this bug and reopening it if the problem persists in 1.4.23.1. But please, could I know whether version 1.4.23.1 already includes the patch putnopvut is referring to? I can't use the SVN version since this is a production host.
We have more Asterisk hosts running version 1.4.22 with more traffic than this one, but they never crash (probably because queue usage is not as high). That is, if I test 1.4.23.1 or the trunk version in a test environment, I'm sure it won't crash because of the low traffic and queue usage. But I cannot try a trunk version in production, since it must work.
Thanks a lot.
By: Mark Michelson (mmichelson) 2009-02-05 13:30:13.000-0600
I'm closing this since ibc has agreed to it. I noted in note ~98860 that the fix for ASTERISK-13226 is already in 1.4.23, so upgrading to it or 1.4.23.1 will contain the fix.