ASTERISK-12518: Memory leak in Asterisk 1.4 and Trunk

Summary: ASTERISK-12518: Memory leak in Asterisk 1.4 and Trunk

Reporter: Private Name (falves11) Labels:

Date Opened: 2008-08-04 18:31:24 Date Closed: 2008-09-11 17:58:39

Priority: Major Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) 13235.diff
( 1) asterisk.zip
( 2) bug_conf.zip
( 3) dialplan.txt
( 4) memory.txt
( 5) memory-post_patch.txt
( 6) valgrind.txt
( 7) valgrind-32.txt

Description: Both versions of Asterisk have a huge memory leak. I thought that it was Trunk only and ported my app to 1.4. After 1 day and 17 hours the memory has gone up 1.2 GB. I only have 300 open calls. My machine is open for inspection. I am not using "malloc debug" and "don't optimize", for performance reasons, but I will restart the server tonight. If somebody wants to suggest any diagnostic technique, please let me know before I restart.

Comments: By: Russell Bryant (russell) 2008-08-04 20:22:16

Yes, you will need to rebuild with MALLOC_DEBUG.

You said you ported your app to 1.4. Are you talking about an Asterisk C application or something external to Asterisk?
By: Private Name (falves11) 2008-08-05 03:40:14

My "application" is only a short dialplan that uses func_odbc and cdr_odbc. There is nothing external or written in "C". That's why I am sure the issue is inside Asterisk. It is a wholesale switch that uses SQL Server in the back end. When you go from Trunk back to 1.4, the dialplan and the configuration files change slightly.
By: Private Name (falves11) 2008-08-05 08:23:46

I think, by looking at the memory allocations, that the culprit is cdr.c. The small file was taken after a few minutes and the second one was taken after 4 hours and 10 minutes. The memory will continue growing without limit, until Asterisk is recycled.
By: Private Name (falves11) 2008-08-05 08:25:21

This is my dialplan:
[default]
exten =>_X.,1,Set(ARRAY(CARRIERLIST,Z,NANI,CLIST,PLIST)=${MINIXEL_ROUTING(${EXTEN},${anix},${X})})
exten =>_X.,n,GotoIf($[${Z} = 0]?default,rejected,1)
exten =>_X.,n,Set(CALLERID(all)=${NANI})
exten =>_X.,n,Set(TIMEOUT(absolute)=3600)
exten =>_X.,n,Set(i=1)
exten =>_X.,n,ResetCDR()
exten =>_X.,n,While($[${i} <= ${Z}])
exten =>_X.,n,Set(CDR(userfield)=${CLIST}${CUT(PLIST,-,${i})})
exten =>_X.,n,Set(CDR(accountcode)=${SIPIP:1})
exten =>_X.,n,Dial(${CUT(CARRIERLIST,-,${i})},45,L(3600000)) ;dial once
;exten =>_X.,n,Verbose(0,"${DIALSTATUS} Result: ${SIPIP:1}")
exten =>_X.,n,ResetCDR(w)
exten =>_X.,n,GotoIf($["${DIALSTATUS}" = "BUSY"]?default,dropcall,1)
exten =>_X.,n,GotoIf($["${DIALSTATUS}" = "NOANSWER"]?default,dropcall,1)
exten =>_X.,n,Set(i=$[${i} + 1])
exten =>_X.,n,EndWhile
exten =>_X.,n,Hangup(34)
exten => dropcall,1,Hangup()
exten => rejected,1,Verbose(0,"Rejected:" ${destino}-From: ${SIPIP:1})
exten => rejected,2,Hangup(34)
By: Steve Murphy (murf) 2008-08-06 08:11:49

To help me reproduce this problem, please attach to this bug, all the files in /etc/asterisk, whose names begin with cdr, as in cdr.conf, cdr_odbc.conf, and any others you might have there.
By: Steve Murphy (murf) 2008-08-06 11:58:50

First, try removing the unnecessary cdr config files entirely. I fear that their backends may be registered and getting in the way.

Remove the cdr_mysql.conf, cdr_pgsql.conf, and cdr_tds.conf files. That should keep those backends from loading (if you have the requisite libs, etc).

cdr_manager is OK, it has enable=no, which should keep it silent.

Then, after reloading, see if this reduces your memory consumption, and report back.
By: Private Name (falves11) 2008-08-06 12:03:30

If you check the file supplied "modules.conf", you can see that I am loading only the modules that I need, and I am not loading any of those. Please look again for a culprit. If you want to look at the live system, please contact me at the email on file.
By: Steve Murphy (murf) 2008-08-11 12:27:31

OK, then, on to step 2.

I have taken your dialplan code and converted it to AEL.
I notice that you are using a function I am not familiar with,
MINIXEL_ROUTING. Since I do not have this func, please give
me some example values for:
CARRIERLIST,Z,NANI,CLIST,PLIST
with which I can simulate your environment.

Also, while I'm doing this at my end, can you please
try running your asterisk with valgrind? It involves these
sorts of steps:

1. run "valgrind --leak-check=full --show-reachable=yes asterisk -cgv"

I don't think you'll be able to run this in a production setting. It will
slow down asterisk tremendously.

2. Let Asterisk run for a while, and process a series of calls. Maybe 5 or 10
should suffice.

3. then, enter the command "stop gracefully"

4. Take all console output from beginning to end, and attach to this bug report.

By: Private Name (falves11) 2008-08-12 08:35:54

You do actually have the function MINIXEL_ROUTING. It is defined in func_odbc and it means it is a call to the database. I don't think you can simulate my application entirely. I will run the valgrind command as soon as possible.
By: Steve Murphy (murf) 2008-08-12 12:30:38

Oh, yes, excuse me, I forgot to check the zip files you attached. There are
the defs for the MINIXEL_ROUTING. I guess to fully simulate I'd need a chunk
of your databases, tables, functions, etc., but this is probably a bit impractical. Just the output of one or two queries would be sufficient, I hope.
You could get those by inserting NoOp(${CARRIERLIST}.${Z}#${NANI}#${CLIST}#${PLIST}) to show the resulting
fields on your console.... This can wait till after the valgrind, if it is
necessary at all.

I forgot to mention that you should compile Asterisk with DEBUG flags
turned on in the "make menuselect" stuff before using valgrind. I suspect
it will produce better backtraces.
By: Steve Murphy (murf) 2008-08-26 16:21:15

falves11-

No word from you on this bug; if we can't reproduce this bug, we can't solve
it. Shall I close it "Can't reproduce" ?
By: Private Name (falves11) 2008-08-26 17:30:57

I use the ODBC technology for the CDR and routing, through func_odbc. I noticed, while using both 1.4 and Trunk, that over the hours the memory keeps growing and never stops doing so until Asterisk has to be recycled. Please find attached a file called memory.txt. It stores the output (every 5 mins) of the command free -lm, only the section for application memory, not the cache usage. You will notice that it grows forever, regardless of the number of open calls. I have 61 open calls. Tomorrow I can upload the same file and who knows where it will be.
By: Steve Murphy (murf) 2008-08-26 17:51:57

I believe you, but that does not help me find the problem. You were going to compile asterisk with DONT_OPTIMIZE, and run it under valgrind for a while, (cmd: valgrind --leak-check=full --show-reachable=yes asterisk -cgv), until it processed 10 to 100 calls or so, and then shut down asterisk with "stop gracefully", and capture all the output, which should tell us in great detail where the leak is happening.

It could be in the odbc libs, the backend, but we won't know for sure where or how until you do the above.

If you cannot do the above, we'll have to close the bug until we get more data.

By: Private Name (falves11) 2008-08-26 17:59:13

I cannot do any of that, since I am using the Business Edition "C". Digium has control over it. They have a ticket open. I just want to make sure that when they fix it the knowledge spills over to the open source.
By: Private Name (falves11) 2008-08-26 18:01:54

It is so bad that when I posted the file one hour ago it was using 570 MB, now it is over 747 MB.
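
As a rough cross-check of these figures, here is a minimal sketch that turns two memory readings into an average growth per sampling interval. It assumes the readings are in MB and that samples are 5 minutes apart, as described for memory.txt; the function name is made up for illustration.

```python
def growth_per_sample(start_mb, end_mb, elapsed_minutes, sample_minutes=5):
    """Average memory growth (MB) per sampling interval."""
    samples = elapsed_minutes / sample_minutes
    return (end_mb - start_mb) / samples


# 570 MB when memory.txt was posted, 747 MB about one hour later.
rate = growth_per_sample(570, 747, elapsed_minutes=60)
print(f"{rate:.2f} MB per 5 minutes")  # 14.75 MB per 5 minutes
```

This only averages two points; it says nothing about whether the growth is steady or tied to call volume.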
By: Steve Murphy (murf) 2008-08-26 18:42:58

falves11-- are you going to be able to run asterisk under valgrind? If we can locate the leak, we might be able to fix it quickly.
By: Private Name (falves11) 2008-08-26 18:48:14

I am under a veto from Digium to mess with the Business Edition. They have the authentication information, ticket RYM-150398. Please feel free to ask them to let you in and run any tests you like. It is adding 21 MB every 5 minutes. Asterisk is the only app running in the box, and there is no hardware; this is a pure SIP to SIP application. It might not be Asterisk, but unixODBC or FreeTDS 0.82, since unixODBC uses FreeTDS to talk to SQL Server. In any case, it has to be identified and fixed.
By: Steve Murphy (murf) 2008-08-26 22:35:01

falves11 -- run valgrind anyway and show me what it says (if anything). Forget
trying to compile with debug flags, then. I'll try to locate the corresponding source here at Digium, then. I'll need your exact version number (core show version).

This is not the right place to file bugs against ABE. You reported trunk. If you are having trouble with ABE, it's silly to file a bug against asterisk trunk.
By: Private Name (falves11) 2008-08-26 23:56:05

The bug runs across all versions, as I mentioned. I cannot run valgrind since I have 100% of my business in that box. If I stop the app I lose tons of money. Right now it is 1:00 AM and asterisk has 7 calls and 730 MB.

rhel5-225*CLI> odbc show
Name: routing
DSN: routing
Pooled: yes
Limit: 1000
Connections in use: 0

Name: global
DSN: mssql
Pooled: yes
Limit: 1000
Connections in use: 6
Connection 1: connected
Connection 2: connected
Connection 3: connected
Connection 4: connected
Connection 5: connected
Connection 6: connected
rhel5-225*CLI> cdr status
CDR logging: enabled
CDR mode: simple
CDR output unanswered calls: no
CDR registered backend: cdr_manager
CDR registered backend: ODBC
By: Steve Murphy (murf) 2008-08-27 08:37:34

OK, then, let's not deprive you of income, but rather, go find another box, and put trunk version of Asterisk on it, and set it up just like your production box. You can connect this to the db from the trunk box just like your production machine (you may have to go and delete the cdrs from your db afterwards).

You probably will need to wipe your sip config on the trunk box, so you don't worry about your customer base. Grab two sip phones, and set up so they connect to your box like your customers do. Then, start asterisk, work out the setup bugs, and get out of asterisk. Do the valgrind thing. Make calls from/to the phones,
maybe 10 or so calls. Do the stop gracefully, and collect the results.

Can you do the above?
By: Private Name (falves11) 2008-08-27 09:52:31

Let me get a separate box.
By: Private Name (falves11) 2008-08-27 11:41:59

Your valgrind idea does not work. It runs and then it says "killed". I compiled Asterisk with "no optimize" and ran this command:
valgrind --leak-check=full --show-reachable=yes asterisk -cgv

Please see the attached file: valgrind.txt

By: Private Name (falves11) 2008-08-27 12:44:37

The result is identical with Trunk. The valgrind command fails.
By: Steve Murphy (murf) 2008-08-27 18:03:32

Just adding this note: valgrind fails because it cannot interpret an opcode; the compiler/cpu is too recent a model for valgrind to emulate it.
By: Steve Murphy (murf) 2008-08-28 13:14:38

falves11-- I just ran several tests of ResetCDR() to see if ResetCDR(w) leaks memory, and the answer appears to be no. I ran it 1000 times in a dialplan loop, and could not find one byte of leaked memory because of it.
By: Private Name (falves11) 2008-08-28 13:18:35

The mystery deepens.
By: Private Name (falves11) 2008-08-28 18:18:42

I created a 32-bit test machine, loaded the app and sent about 5 calls. Please see the valgrind log.
By: Private Name (falves11) 2008-08-30 15:34:39

I believe I have found the cause of the dramatic memory leak. It seems to be unrelated to CDR or ODBC. It is the Dial function:
I wrote a simple dialplan like this:
exten =>_X.,1,Set(i=100000)
exten =>_X.,n,While($[${i} >= 0])
exten =>_X.,n,Dial(SIP/333333333333@xx.xx.xx.xx) ; call to cisco fails immediately
exten =>_X.,n,Set(i=$[${i} - 1])
exten =>_X.,n,EndWhile

I sent only 12 calls to the loop, using two SIP phones, and the memory keeps growing, as this command shows:
ps auxf --width=200 | grep -v grep | grep usr/sbin/asterisk

Over time, the memory will exhaust and Asterisk will have to be restarted.
By: Konrad Rozycki (krdian) 2008-09-02 09:36:31

I have the same issue on my asterisk box (1.4).
By: Private Name (falves11) 2008-09-02 18:32:16

I need to know when this patch is going to be applied to Business, 1.4 and Trunk. It is a very urgent matter.
By: snuffy (snuffy) 2008-09-02 18:48:48

murf committed a fix from ASTERISK-12671 which may address your issues with leaking memory usage.
Please checkout latest svn for 1.4/1.6/trunk and see if you still have the same problems.
By: Steve Murphy (murf) 2008-09-02 18:51:43

falves11--

I've already applied the patch to 1.4, trunk, and 1.6.1.

I've written letters to a few folks to let them know the exact revisions they need to apply (there are 3 now). I suspect they'll let you know tomorrow, what the schedule will be. But it does look good that this is the problem that is biting you. Sorry it took so long. My trust in valgrind was definitely misplaced in this instance.
By: Private Name (falves11) 2008-09-02 19:11:10

Just out of curiosity, how did this guy identify the issue? I went as far as simulating it in a loop.
By: Private Name (falves11) 2008-09-03 01:37:58

The issue continues. Please check my dialplan and the memory log attached. The file is updated every 5 minutes, so the amount of memory that it grows every 5 minutes can be calculated easily. The version is SVN-branch-1.4-r140751. Steve Murphy can log into the box and check it.
By: Steve Murphy (murf) 2008-09-03 10:44:21

falves11--

In your note about a leak, where you run a loop:

exten =>_X.,1,Set(i=100000)
exten =>_X.,n,While($[${i} >= 0])
exten =>_X.,n,Dial(SIP/333333333333@xx.xx.xx.xx) ; call to cisco fails immediately
exten =>_X.,n,Set(i=$[${i} - 1])
exten =>_X.,n,EndWhile

You will build up memory while the loop is running. Basically,
the calling channel, which remains alive to allow the loop to
run, builds up a (big) list of dial_feature structs. But when
the loop ends and the channel is hung up and freed, so are all
those dial_feature structs. In normal practice, this is not
a problem.

By: Steve Murphy (murf) 2008-09-03 10:54:12

falves11 --

I have supplied the folks who handle ABE a list of the revisions that are
part of the leak fix. My initial tests on around 10 successful SIP calls
show no leaks. I asked the guys in ABE to supply you with both a plain (production) version with the fix included, and a MALLOC_DEBUG version, which you can
use in case the fix doesn't apply to you. The best way to use the MALLOC_DEBUG
version is to start up asterisk, run "memory show allocations" on the CLI,
process a couple hundred calls, then do "memory show allocations" again, and send the results of both. Hopefully the fix will halt the memory growth and we won't have any testing to do.
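
Comparing two "memory show allocations" captures by hand is tedious, so here is a sketch of how the comparison might be automated. It is not part of Asterisk. It assumes each allocation appears on its own line in the form "<N> bytes allocated in <function> at line <L> of <file>", which should be checked against the real output of the build in use. The function names and the sample line in the comment are hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: one allocation per line, for example (made-up values):
#   "       128 bytes allocated in   some_func at line    10 of cdr.c"
# Adjust the pattern if the real CLI output differs.
ALLOC_RE = re.compile(r"^\s*(\d+) bytes allocated.* at line\s+\d+ of (\S+)\s*$")


def bytes_per_file(snapshot_text):
    """Sum allocated bytes per source file in one snapshot."""
    totals = Counter()
    for line in snapshot_text.splitlines():
        match = ALLOC_RE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


def growth_by_file(before_text, after_text):
    """Return (file, byte growth) pairs, largest growth first."""
    before = bytes_per_file(before_text)
    after = bytes_per_file(after_text)
    growth = {name: after[name] - before[name] for name in set(before) | set(after)}
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)
```

Files at the top of the returned list are the ones whose allocations grew most between the two captures, which is where a leak would be expected to show up.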
By: Private Name (falves11) 2008-09-05 23:20:58

I think that I found a probable cause for the memory growth. I have two boxes, identical dialplan, version, etc. In one of them I type "sip show channels" and I have 60 active channels. In the one where the memory goes crazy I do the same and I get:
7.110.179.253 207.2.123. 44938D3E-7A 00102/00102 0x0 (nothing) No Rx: BYE
67.110.179.253 207.2.123. 43589C8D-7A 00102/00102 0x0 (nothing) No Rx: BYE
67.110.179.253 207.2.123. 3D4438C4-7A 00102/00102 0x0 (nothing) No (d) Rx: BYE
67.110.179.253 207.2.123. 375ED38F-7A 00102/00102 0x0 (nothing) No (d) Rx: BYE
67.110.179.253 207.2.123. 36CD8F87-7A 00102/00102 0x0 (nothing) No (d) Rx: BYE
67.110.179.253 207.2.123. 34A66BA1-7A 00102/00102 0x0 (nothing) No Rx: BYE
5552 active SIP channels

Please notice the "(d)" in almost every line. It does not happen in the first box. In the second box, with the 5552 active channels, I have only 9 active calls, while in the first one I have 30 open calls.

The second box receives calls only from a Cisco 7301, while the other one receives calls from unknown switches. Maybe something in the SIP interop is not understanding the Cisco dialogs and thus the calls never close. Both boxes generate CDRs successfully.
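
To check another box for the same symptom, a small sketch like the following could count how many lines of saved "sip show channels" output end in "Rx: BYE". The function name is made up, and the sample lines are copied from the listing above; it assumes the last-message column is the final field on each line, as it is in that listing.

```python
def count_rx_bye(sip_show_channels_output):
    """Count channel lines whose last message column reads 'Rx: BYE'."""
    return sum(
        1
        for line in sip_show_channels_output.splitlines()
        if line.rstrip().endswith("Rx: BYE")
    )


sample = """\
67.110.179.253 207.2.123. 43589C8D-7A 00102/00102 0x0 (nothing) No Rx: BYE
67.110.179.253 207.2.123. 3D4438C4-7A 00102/00102 0x0 (nothing) No (d) Rx: BYE
5552 active SIP channels
"""
print(count_rx_bye(sample))  # 2
```

A count that keeps rising while the number of real calls stays flat would match the behavior described here.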
By: Joshua C. Colp (jcolp) 2008-09-06 13:42:46

Attached is a patch which will allow a dialog to be properly destroyed if a BYE is received on it before it moves to the early state.
By: Steve Murphy (murf) 2008-09-08 10:45:43

I think we found the problem. I just logged into the production server,
and asterisk has been up for 1 day 18 hours, and was at 245 Meg of virtual memory.
It was handling around 60 sip channels.

A few minutes later, it had risen in memory, but a quick check showed that it
was handling 80 sip channels. None were in Rx: BYE state; the list of channels
looked as expected.

What was the problem? A large number of channels were dead, reporting Rx: BYE
in a "sip show channels". All "dead" channels were to one certain vendor. Doing
a "sip debug" revealed that after asterisk sent an INVITE to that vendor, the
vendor responded with a 100 Trying message, then a BYE, to which asterisk
would send an ACK, and the channel would be left in a limbo state.

What was happening was, the channel had some packets attached, and therefore,
the driver would re-schedule the destruction, expecting the packets to be
removed in the normal course of events. But the packets were NOT removed,
and the rescheduling would occur over and over again ad infinitum, and the
channel would never get freed.

Josh Colp came up with the fix, which he attached above, but the patch, in
that format, did not solve the problem. The current invite state was not
PROCEEDING, but rather either TERMINATED or CONFIRMED. Once we knew this,
Josh suggested we make the call to __sip_pretend_ack(p) unconditionally,
and indeed, this works.
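
The behavior described above can be illustrated with a toy model. This is not chan_sip code: the function names, the packet counter, and the attempt limit are all invented for illustration, and it assumes the pretend-ack step runs when the BYE is handled. It only shows why a dialog with packets still attached keeps getting rescheduled, and why clearing them unconditionally lets the destruction go through.

```python
def handle_bye(pending_packets, pretend_ack):
    """Toy stand-in for BYE handling; returns packets still attached."""
    if pretend_ack:
        # Stand-in for the unconditional __sip_pretend_ack(p) call:
        # treat every outstanding packet as acknowledged.
        return 0
    return pending_packets


def attempts_until_destroyed(pending_packets, max_attempts=5):
    """Return the attempt on which the dialog is freed, or None if it
    is still being rescheduled after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        if pending_packets == 0:
            return attempt  # nothing attached: the dialog can be destroyed
        # Packets still attached: reschedule destruction and try again.
        # Nothing in this model removes them, mirroring the bug.
    return None


print(attempts_until_destroyed(handle_bye(1, pretend_ack=False)))  # None
print(attempts_until_destroyed(handle_bye(1, pretend_ack=True)))   # 1
```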

So, I committed this fix via:

1.4: v. 141565
trunk: v. 141566
1.6.0: v. 141567
1.6.1: v. 141572

We've waited till today, just to be sure that
Asterisk reaches an equilibrium state. It appears so.
If problems remain, though, feel free to reopen this issue.