ASTERISK-12671: [patch] Huge memory leak because memory of channel cdr struct is never returned

[Home]

Summary: ASTERISK-12671: [patch] Huge memory leak because memory of channel cdr struct is never returned

Reporter: Thomas Arimont (tomaso) Labels:

Date Opened: 2008-09-01 09:48:24 Date Closed: 2008-10-08 08:48:14

Priority: Blocker Regression? No

Status: Closed/Complete Components: Core/Channels

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 13406-reorder
( 1) asterisk_svn_141566.txt
( 2) asterisk_svn_141567.txt
( 3) asterisk-1.6.0-rc2-cdrmemleak.patch
( 4) bt-SVN-branch-1.4-r141267.txt
( 5) bt-SVN-branch-1.4-r141565.txt
( 6) dead_channel.txt
( 7) issue13409.rev1.txt
( 8) memory_leak_test_06_10_08.txt
( 9) memory_leak_test_sip_b2b_07_10_08.txt

Description: After two days of stress testing by making lots of calls across sip and dahdi channels the asterisk process memory reached dizzy values: VSZ=3,2GB, RSS=1,6GB before asterisk stucked completely (even the RAM of our server (2GB) is finite ;-) ).

Actually this problem is not a question of load, but appears for each single call.

Using valgrind the reason for that was quickly found: The memory of the channel cdr struct (chan->cdr) is never returned, not for sip, not for dahdi channels, when a channel is cleared.

Reproduce:
a.f.a.p. default configuration (modules.conf, etc.)
Make lots of calls and see ps's VSZ and RSS values.

Interested in a patch ? Or is someone revise this anyway?

****** ADDITIONAL INFORMATION ******

Version tag asterisk 1.6.0-rc2
Version tag dahdi 2.0.0-rc2&2.0.0-rc1
Version tag libpri 1.6.0-beta1

Comments: By: snuffy (snuffy) 2008-09-01 15:43:34

Your more than welcome to produce a patch.
If you dont have the disclaimer signed might take a day or so to be approved.
By: Thomas Arimont (tomaso) 2008-09-01 23:45:19

Attached the patch we can provide to fix this bug. Sorry, we only applied the patch only to asterisk-1.6.0-rc2 yet.

Thomas Arimont
Dipl. - Ing.
Voice & System SW
Research & Development

DATUS AG
Tempelhofer Strasse 4-8
D-52068 Aachen
By: Digium Subversion (svnbot) 2008-09-02 17:06:46

Repository: asterisk
Revision: 140670

U branches/1.4/main/channel.c
U branches/1.4/main/pbx.c

------------------------------------------------------------------------
r140670 | murf | 2008-09-02 17:06:45 -0500 (Tue, 02 Sep 2008) | 14 lines

(closes issue ASTERISK-12671)
Reported by: tomaso
Patches:
asterisk-1.6.0-rc2-cdrmemleak.patch uploaded by tomaso (license 564)

I basically spent the day, verifying that this patch
solves the problem, and doesn't hurt in non-problem
cases. Why valgrind did not plainly reveal this leak
absolutely mystifies and stuns me.

Many, many thanks to tomaso for finding and providing the fix.

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=140670
By: Thomas Arimont (tomaso) 2008-09-04 17:00:20

I reopened the bug by request of Daniel who posted the following on the asterisk developers mailing list:

I tested today 1.4.22-rc3, as I see it includes patch for issue ASTERISK-12671.

This patch is causing me CDR problems, I use cdr_mysql (from addons
package) as backend, and in a normal call A->B, this patch causes to
post the CDR twice, giving me a constraint key violation (uniqueid is
primary key in cdr table).

Can you reopen this? I've a slighty modified patch that calls also
ast_cdr_detach but doesn't post a new cdr. I cannot post it or write a
note to this issue because it is in 'fixed' state.

thanks
bye

--
Ing. Daniel Ferrer
IPContact Software S.R.L.
(+5982) 4025420

By: Thomas Arimont (tomaso) 2008-09-04 17:15:52

The provided patch was first of all only regarding Asterisk 1.6.
It is not related to 0013235 in Branch 1.4. as you can already see in the recent notes of 0013235. This memory leak is caused by something else (rtp stuff?). Someone should make clear if it is reasonable to apply this patch to 1.4 and if the CDR problems reported by Daniel is also a issue for 1.6 (when patch is applied).
By: Steve Murphy (murf) 2008-09-05 08:28:29

tomaso-- 13235 could still easily be because of this bug; falves11 noted truthfully that in a certain situation, memory did pile up, but I investigated, and the dialing code is not the reason.

Daniel's letter says that using detach to free the nodes is probably not the best thing; I think I will change it to use cdr_free instead, and completely remove any diagnostic logging from cdr_free. Daniel might have a nice fix to cdr_detach, which I will welcome, but if he's right about double cdr postings, I need to take action now.

As to the question about 1.4: I have personally investigated the applicability to 1.4, and whatever problem we have in trunk/1.6.x is in 1.4 also. That's why I apply it there.

I will make the changes I described today, and if Daniel has a good patch, we can implement that then.
By: Daniel Ferrer (dferrer) 2008-09-05 09:56:16

Ok, perfect murf. Your patch is better, I didn't notice ast_cdr_discard() function. Mine's calls ast_cdr_free, and also gets "not elegant" messages like "CDR on channel xxx lacks end", etc.
I'll test it
By: Steve Murphy (murf) 2008-09-05 10:18:20

Yeah, bumped into cdr_discard, and it didn't have the logging calls in it, and thought, Eureka! This is the func I need!

Otherwise, I would have stripped the logging from cdr_free, and used that.
By: Daniel Ferrer (dferrer) 2008-09-08 09:52:25

Hi again.
I've testing version SVN-branch-1.4-r141267, and I'm getting asterisk crashes in cdr stuff, I'm not sure if this is not a consecuence of this cdr fix, I've some suspects.
I'm attaching GDB backtrace, I see that in frame #1 (func ast_cdr_specialized_reset) cdr is defined, but in frame #0 cdr points to another adress, giving a segfault. I've got 2 crashes today (in 4 hours) with same symptom.
I'm testing now new version SVN-branch-1.4-r141565, and I'll tell you if I get same segfault
By: Private Name (falves11) 2008-09-08 10:06:41

I have beeing using 141565 since inception, since it was born un my machine, and it is perfect. No complaints whatsoever.
By: Steve Murphy (murf) 2008-09-08 11:10:23

tomaso-- I'm sure you noticed that we finally tracked down the problem in
13235, the dead channel problem, which compared to the cdr leak, was
like an elephant compared to an ant, where the cdr leak was the 'ant'.

But, at any rate, if you check out 1.6.0 at version 141567 or greater, has
this fix resolved this issue, or do we need to dig further? To see if
your problem is related to 13235, before applying these patches, with a version
of 1.6.0 less than 141567, do a 'sip show channels', and see if a large number of channels are in the Rx: BYE state (last msg). If so, this bug is a duplicate
of 13235, and updating to the latest version should fix it. If not, let me
know, and we'll see if we can find and solve the problem.

By: Steve Murphy (murf) 2008-09-08 11:13:20

dferer-- I'll be looking into your issue today.
By: Thomas Arimont (tomaso) 2008-09-08 11:25:26

Sorry, the last days I was to busy for doing more than only tracking these two bug(s) and its rapid development. I try do the checks as fast as I can.
I hope this is o.k. for you.
Regarding the further cdr issues I had to hold off because my initial bug report was related to a (production) system where cdr generation was totally switched off and not of any focus.
By: Daniel Ferrer (dferrer) 2008-09-08 15:05:28

Murf & tomaso, I've can reproduce the bug that I mentioned. Asterisk crashes when doing an and attended transfer.
In a normal call, A -> B, A press "attended_transfer" combination (res_features, #0 in my case), then A calls B, B answers. B accepts call, so A must hungup, when hungup is done, Asterisk crashes with segmentation fault, with same backtrace that I post.

I've tested with SVN revs 141565 and 141267. Also I can confirm that code that frees the cdr is what causes the segfault:

main/channel.c: svn rev 141156
1281 if (chan->cdr) {
1282 ast_cdr_discard(chan->cdr);
1283 chan->cdr = NULL;
1284 }

If I comment this chunk of code, attended transfer works as expected (no asterisk crash).
The problem appears that the free of CDR gives problem in ast_cdr_specialized_reset(), cdr->next is is pointing anywhere, but not NULL. I'm attaching the last bt, for rev 141565.

Should I open another bug or continue from here?
By: Steve Murphy (murf) 2008-09-08 16:13:46

This is quite interesting. This implies that between the time the CDR is freed, and it's pointer set to NULL, that other code in parallel is accessing this pointer, seeing a garbaged CDR struct, and dying on a bad pointer in it (the
cdr.next ptr, in fact).

So, I recoded that short segment to remove the pointer first, then free that
struct. If that is not sufficient, I will lock the channel.

I attached a patch 13406-reorder, please test to see if this solves the problem.
By: Thomas Arimont (tomaso) 2008-09-09 07:41:35

Hi folks, I checked out the SVN revs 141566 and 141567. In the attached files you can see the channels which are in Rx: BYE state (during stress test). In rev 141567 they are less because the channels seem to be cleared earlier. But even in svn 141566 all Rx: BYE state channels are cleared earlier or later. So this was never a real problem for us.

The memory usage of asterisk in svn 141567 is almost the same (here about 13MB) as in svn 141566, i.e. no increasing memory leak during the stress test. That's o.k. for both versions.

Regarding the cdr issues / segmentation fault:
Yes the code in ast_bridge_call() is quite ugly! So it's not a surprise that asterisk stuck. The way the cdr's are handled in some functions is not acceptable.
And someone should take a look on the ast_channel_unlock() and ast_channel_lock() calls. There are leaks which make it easy/possible that segmentation faults like the one above occur.

@murf: So it seems there are still a lot of ants which can even frighten the big elephant! ;-)
By: Russell Bryant (russell) 2008-09-09 07:45:58

murf: in 13406-reorder, this code is at the end of ast_channel_free(). These changes should not be necessary because at this point in the code, there should be _NO_ chance that any other code could still be referencing the channel being free'd (or any of its data!). So, if that is possible, then that is the bug we need to track down.
By: Russell Bryant (russell) 2008-09-09 08:05:18

I have uploaded a patch that I think will fix this issue. However, it needs some testing. I'd like to verify that it does fix the crash, and I'd also like to make sure it doesn't have adverse affects on the CDRs that get logged.
By: Thomas Arimont (tomaso) 2008-09-09 08:44:10

Amendment: After one hour stress testing it seems that SVN revs 141566 indeed has a little increasing memory leak (additional to the original unpatched version). Maybe that's the Bug 0013235 related thingy.
I will keep track of that ...
By: Steve Murphy (murf) 2008-09-09 10:05:13

Many thanks to Russell. The patch looks good. It should prevent the crash, and is a light year ahead of what I earlier proposed.

tomaso--

You don't have to worry about the cdr issue; your leak is much more important right now.

so, the plug we installed did not affect you? Very interesting!
And you still have channels stuck in the Rx: BYE state? Then here is how we
find it. If 100% of your channels end up in the Rx: BYE state, then it will be
easy:

1. "sip set debug", while you are processing calls.If you are processing a lot of channels, you will not have to do this for long.
2. Collect all the output, and copy it into a file.
3. Wait a liitle more than 32 seconds, because that's how long the destruction callback is scheduled for; do a "sip show channels", and find one channel that went into the Rx:BYE state while you had "sip set debug" on, and is still in that state after 32 seconds.
4. The channel name is the dialog name. Find that dialog id in your file, and find just those transactions that involve this dialog id. The text above each such transaction is sometimes also revealing. Hopefully, this will reveal to us, what is happening in the driver that leads to the channel getting stuck.
5. Attach your filtered output to this bug.

If it is so, that I misunderstood your bug, and there are still noticeable leaks, but not the dead-channels, then use "make menuselect" to set compiler option "MALLOC_DEBUG". Recompile your source. start asterisk.
Enter the command "memory show allocations" and collect the output into a file.
Then run a couple hundred calls. then "memory show allocations" again, and collect the output into another file. Diff the two files. share with us the result.

By: Daniel Ferrer (dferrer) 2008-09-09 10:29:04

rusell:
The patch works for me, it does avoid the crash and CDR is only affected by in a "better way" I think. I see this difference:
before this, in a call A(DAHDI/1) -> B(DAHDI/3), A unnatended to C (SIP/158) y get 2 cdrs:

src dst channel dstchannel (descending time)
A s DAHDI/1 SIP/158
A B DAHDI/1 DAHDI/3

after patch:
src dst channel dstchannel (descending time)
A C DAHDI/1 SIP/158
A B DAHDI/1 DAHDI/3

so dst is C not 's' anymore.

Personally I think that some work in cdr stuff has to be done in Asterisk. Some time ago in the past (a couple of months) I read murf's blog and tested his CDRfix patches, but finally I didn't apply them. I'm still not convinced in how to manipulate CDR's when there are transfers, but I think that's not an easy problem to solve.
By: Private Name (falves11) 2008-09-09 10:29:51

Steve: Please keep monitoring my production box. If there is indeed more memory leaks, it has got to show there. I am capturig the overall Asterisk memory consumption in /var/lib/asterisk/agi-bin/memory.txt, every 5 mins. The amount of calls should reach near zero at night, so memory should fluctuate but not exceed some boundaries. Feel free to upgrade the code and restart if need be. I'd rather lose a client than the company.
By: Digium Subversion (svnbot) 2008-09-09 10:31:00

Repository: asterisk
Revision: 142063

U branches/1.4/res/res_features.c

------------------------------------------------------------------------
r142063 | russell | 2008-09-09 10:31:00 -0500 (Tue, 09 Sep 2008) | 5 lines

Ensure that the stored CDR reference is still valid after the bridge before
poking at it. Also, keep the channel locked while messing with this CDR.

(fixes crashes reported in issue ASTERISK-12671)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=142063
By: Steve Murphy (murf) 2008-09-09 10:44:30

falves11-- yes, I just checked. Your asterisk shows uptime:
2 days, 19 hours, 38 minutes, 14 seconds

and since then, we now show:

root 3473 16.0 7.7 618596 318772 ? Sl Sep04 1200:25 /usr/sbin/asterisk -f -vg -c

which is about 200K more memory since yesterday. This is a lot less drastic a leak
than previously, but still not good. sip show channels shows almost 80 channels. (but none are in the bye state).

Later tonight, when the call volume is less, we will do some experiments, and find the leak, unless tomaso can find it today.
By: Steve Murphy (murf) 2008-09-09 10:46:35

Oops, I'm sorry, that's 200 MEG more than yesterday (assuming it's counting in 1024 byte chunks). Is there any way to estimate the total number of calls in the last 24 hours?
By: Private Name (falves11) 2008-09-09 10:55:52

Since noontime yesterday that box has processed 30.000 (thirty thousand) calls.
By: Steve Murphy (murf) 2008-09-09 11:08:20

Ok, so I'll estimate that that in the last 24 hours, we had roughly 40k calls, then, as hopefully that covers roughly the 3 or so more hours under fairly heavy load. That calculates to roughly 5K bytes per call that is being leaked. This is just a 'ballpark' estimate, as they say.
By: Thomas Arimont (tomaso) 2008-09-09 12:07:58

murf--

For clarification the little memory leak I noticed was regarding SVN rev 141566 (the version before the second patch).

With the SVN rev 141567 (with second patch) I've (almost) done what you told me. Yes I have dead-channels, but they are all in Rx: INVITE state (session timer off). The small resulting memory leak with this version seems to be caused by the dead channels.

The sip channel seems to become dead after a session progress is sent to the remote end. See attachment of one of these channels. Maybe the SipP-client has lost its 'interest in this call'. Is the session timer stopped totally after a session progress is sent? That would explain the behaviour.

After a approx. 1 hour I've got 20 dead channels, 10 alive channels.
The memory used by asterisk changed from 10144 kb (idle) to 13232 (last state)
If average memory use for a call is about 300kb (if this can be assumed at all)this would be a fortification of 'memory leak only caused by the dead channels'.

Is this enough information for you?
By: Steve Murphy (murf) 2008-09-09 14:54:06

OK, I've just tried to confirm this the reason for falves's leakage, but I cannot.
If I say "sip show channels" and take the output and put it in a file, and then wait 20 minutes or so, and do it again, I cannot find any common channels.

You try this; first, say "sip history"; then "sip show channels", and collect the output, and copy it into a file. Wait 20 minutes and do the "sip show channels" again. Copy and paste into a different file.

use 'sort' to merge and sort the files by the the Call-ID column. Look for
two rows that have the same Call-ID column; I did this by using an editor to
move the callid field to the front of each line. Then I don't have to play with the -k option in sort.

If you find a pair, then "sip show history xxxx with the callid; it will do command completion for you if you hit tab.

Show us the history for that channel. Falves11's machine has been running 2 days, and I didn't see one channel that survived after 20 minutes. (he has live traffic running). So, in his case, this is not the leak. But it might be yours, so please, verify; jcolp couldn't make sense of your trace. He wants to see the sip history of the channel.
By: Thomas Arimont (tomaso) 2008-09-10 02:41:58

murf --
Was this last notice addressed to me? Sorry murf, I would like to track that bug further on and help you as fo as possible, but I'm again to busy to go on with it the next time. We need the asterisk server for other tests and I still have to report other bugs and provide patches (libpri, chan_dahdi) - for a constantly growing karma ;-) -. And not least there a lot of things to do for my company and my team.
By: Steve Murphy (murf) 2008-09-10 10:26:21

tomaso--

yes, I was aiming at you. I cannot solve the problem without your participation, so I'm closing this bug in a "suspended" state until you can come back to it.

If anyone else has the same problems, and you are willing to do the tests I've outlined, and have conclusive results, then you are welcome to re-open this bug or create a new one.
By: Thomas Arimont (tomaso) 2008-10-06 11:30:55

murf--
Here I am back again with that issue. I checked the current branch 1.6.0 (SVN Revision 146598) where all(?) the patches should have been committed in now.
It seems that there is a little memory leak left which is not dependent on dead calls (see attachment - after 20 minutes of load testing, 2x20 calls/sec, no dead calls).

Since my test setup is a little different the sip info you requested won't help to identify the dead calls. I've got 2x2 PRI loops inbetween:

SIPP client -> Asterisk -> 2xPRI -> loop -> 2xPRI -> Asterisk -> SIPP server.

If i increase load from 2x20 calls/sec to 2x25 calls/sec dead channels begin to start.
I guess you want me to remove the PRI loop ... I'll do what I can ...

By: Thomas Arimont (tomaso) 2008-10-07 03:32:11

murf--
I did the test now only with SIP B2B (no PRI loop). After millions of call with a rate of 70 calls/sec (*) setups there is no significant memory leak (no cdr modules loaded).
On my part you can close this bug now entirely.

* Used hardware:
System
Dell PowerEdge 1950 III SATA (SV31951)
CPU
Quad Core Intel® Xeon® E5410, 2X6MB Cache, 2.33GHz, 1333MHz FSB
Memory
2GB FB 667MHz Memory (4x512MB single rank DIMMs)

Harddisk
PERC 6/i Integrated RAID Controller Card
RAID 1 Configuration, Add-in PERC6i Controller (requires 2 Hard Drives)
2 X 73GB 15.000 rpm 3.5-inch SAS Hard Drive

Ethernet onboard
dual embedded Broadcom® NetXtreme II 5708 Gigabit Ethernet NIC

1 x Digium TE420P (4fach PRI E1 PCIe -Karte)
By: Daniel Ferrer (dferrer) 2008-10-07 07:43:30

Hi all
Please don't close this issue, some change after russell patch give me again duplicate CDR's when doing an unattended transfer. This is happening after 1.4.22-rc5 and also is now in 1.4.22.
Give me some time to test (a couple of hours) and make a better report.
By: Steve Murphy (murf) 2008-10-08 08:48:07

deferer--

If you are seeing a duplicate CDR issue, it would be better to file it as a separate bug report, as this one really was mainly about a fire-hose level of
memory leak in Asterisk, mainly (as it turned out) caused by zombie
channels hanging around when they should have been freed up.

I think we have the "Bloating Zombie Channel Death" problem cleared up,
but if there's a regression, especially with CDR's, please open up a new bug,
categorize it in the CDR/general or whatever category, and assign it to me,
and I'll look at it. Be specific about the sequence of events necessary to
reproduce it, and I'll be happy to have a look at it!

But, if this is indeed a memory leak sort of problem, re-open it again,
and we'll get back on it.