ASTERISK-07677: segfault when zap channels are full (calls are Originate'd via AMI and exacerbated by app

Summary: ASTERISK-07677: segfault when zap channels are full (calls are Originate'd via AMI and exacerbated by app_amd)

Reporter: colin westlake (colinwes)
Labels:

Date Opened: 2006-09-05 09:58:00
Date Closed: 2011-06-07 14:00:56

Priority: Critical
Regression? No

Status: Closed/Complete
Components: Core/General

Versions:
Frequency of Occurrence:

Related Issues:

Environment:
Attachments:
( 0) Asterisk_1.4.0-beta2_2ndOct.txt
( 1) Asterisk_SVN-branch-1.4-r47540_memory_show_CLI_command_14-09-06.txt
( 2) Asterisk_SVN-trunk-42058_bt_chan_iax_socket_process_25-09-06_#1.txt
( 3) Asterisk_SVN-trunk-42058_bt_chan_iax_socket_process_25-09-06_#2.txt
( 4) Asterisk_SVN-trunk-42058_bt_chan_iax_socket_process_25-09-06_#3.txt
( 5) Asterisk_SVN-trunk-42058_bt_chan_iax_socket_process_26-09-06_#1.txt
( 6) Asterisk_SVN-trunk-r42058_overdriven.txt
( 7) SVN-trunk-r39295_AMD_seg_fault.txt
( 8) SVN-trunk-r42058__seg_fault.txt
( 9) SVN-trunk-r43961_trace_29thSep.txt

Description: This may be related to 7652 - seg fault under moderate load every hour or so - calls initiated from AMI. Only occurs when app_AMD is used in the dialplan. Backtrace attached. NB this is happening on both 32 bit and 64 bit platforms.

Comments:

By: Clod Patry (junky) 2006-09-05 10:52:15

Could you also add the output of thread apply all bt full?
By: colin westlake (colinwes) 2006-09-05 12:28:08

Hmmm... this does not seem like much help - all I'm getting is this:

(gdb) thread apply all bt full

Thread 274 (process 25097):
#0 0x00000292 in ?? ()
No symbol table info available.
Cannot access memory at address 0x0
#0 0x00771083 in _int_malloc () from /lib/i686/libc.so.6

Is there something misconfigured here?
By: Serge Vecher (serge-v) 2006-09-05 13:01:20

I think you need to try a more recent revision of trunk; it's currently at r > 42000.
By: Clod Patry (junky) 2006-09-05 15:12:47

Anyway, your problem isn't related to app_amd.c.
By: colin westlake (colinwes) 2006-09-06 03:24:55

Well, by removing the call to app_AMD from the dialplan, the server will run for many days and pass many tens of thousands of calls without a hitch. When I put AMD back in, it falls over within an hour or so, or after a couple of thousand calls. You could call that circumstantial, but it's what happens every time (we are taking in excess of 30 seg faults here).

I'll load up the latest trunk and repeat the experiment.
By: Clod Patry (junky) 2006-09-06 06:28:31

Since the segfault does not happen in app_amd.c, if you do the same thing (originate calls from AMI) but replace the call to AMD(foo) with any Playback(bar), you should get segfaults too.
Also, the thread apply all bt full output looks like you are not running with DONT_OPTIMIZE and DEBUG_THREADS. Make sure you have them enabled.
By: colin westlake (colinwes) 2006-09-07 09:01:15

We have now replicated the fault in a test harness. The attached back trace shows a seg fault after placing approx 18,000 calls over a 45 min period.
By: colin westlake (colinwes) 2006-09-07 09:16:17

I forgot to mention...
Just before the crash we got a LOT (many tens per second) of:
WARNING[27063]: chan_iax2.c:7552 socket_read: Received mini frame before first full voice frame
and a few:
NOTICE[16125]: chan_iax2.c:3120 iax2_read: I should never be called!
These appeared on the server where the agents were logged in (i.e. the target of the IAX dial after AMD had detected human speech). For topology see ASTERISK-7454.

At this point the server doing the dialing seemed to slow down dramatically and the AMI response became sluggish. This lasted several seconds and then recovered again. The pattern recurred several times over a 2-3 minute period just before the crash.

EXTRA INFO - further testing has revealed that there appears to be a memory leak of about 300 bytes per call when using app_AMD (observed over 10k+ calls)
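
To put that figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the leak is a constant 300 bytes per call, which is only the approximate figure observed above; the function name is hypothetical and not part of Asterisk.

```python
def leaked_bytes(calls, bytes_per_call=300):
    """Estimate total leaked memory, assuming a constant per-call leak."""
    return calls * bytes_per_call


# 10,000 is the observation window mentioned above; 18,000 is roughly the
# call volume that preceded the crash in the test harness.
for calls in (10_000, 18_000):
    total = leaked_bytes(calls)
    print(f"{calls} calls -> {total} bytes (~{total / 1_000_000:.1f} MB)")
```

On these assumptions the leak amounts to only a few megabytes over a test run, so it may be a separate problem from the crash rather than its direct cause.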

By: colin westlake (colinwes) 2006-09-07 14:28:11

More experiments... Using the test harness I have confirmed that, without app_AMD, so long as you make sure that you never try to originate a call when all zap channels are full, you can run for many tens of thousands of calls without problems.

IF HOWEVER you attempt to originate calls when all zap channels are full, you can precipitate a crash within a few minutes (see backtrace r42058 overdriven.txt).

Whether this is related to the seg faults that happen with AMD in the dialplan, I'm not sure (those are definitely happening without overdriving).
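
For anyone building a similar harness, the "never originate when all zap channels are full" rule can be sketched roughly as below. This is a minimal illustration, not the harness actually used here: the ChannelGate class, the helper names, and the channel and context values in the usage note are all hypothetical, and the gate relies on the harness's own bookkeeping rather than asking Asterisk how many channels are busy. Only the Originate header names (Action, Channel, Context, Exten, Priority, Async) are standard AMI fields.

```python
import threading


class ChannelGate:
    """Counts channels in use so the harness never over-commits the span."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._in_use = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Reserve one channel; return False if all channels are busy."""
        with self._lock:
            if self._in_use >= self._capacity:
                return False
            self._in_use += 1
            return True

    def release(self):
        """Free a channel once the harness sees the call hang up."""
        with self._lock:
            if self._in_use > 0:
                self._in_use -= 1


def build_originate(channel, context, exten, priority=1):
    """Format an AMI Originate action as the text sent over the manager socket."""
    headers = [
        ("Action", "Originate"),
        ("Channel", channel),
        ("Context", context),
        ("Exten", exten),
        ("Priority", str(priority)),
        ("Async", "true"),
    ]
    # Each header ends with CRLF; a blank line terminates the action.
    return "".join(f"{key}: {value}\r\n" for key, value in headers) + "\r\n"


def originate_if_free(gate, send, channel, context, exten):
    """Send the Originate only when a channel is free; report whether it was sent."""
    if not gate.try_acquire():
        return False  # all channels busy: skip rather than overdrive
    send(build_originate(channel, context, exten))
    return True
```

With a gate of capacity 2 and `send` set to something like `sent.append`, a third call to `originate_if_free(gate, send, "Zap/g1/5551000", "outbound", "s")` returns False until `gate.release()` is called. A guard like this only works around the overdriven crash in the harness; it does not address the underlying segfault.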
By: seb7 (seb7) 2006-09-26 08:54:55

I'm a colleague of colinwes.

I have uploaded four more (new) backtraces from the same Asterisk server (same SVN revision as the last two backtraces). They all seem to show a problem in chan_iax socket_process, the same as in the 2nd backtrace already attached.

Backtraces 1 and 3 already attached show different backtrace information. So it seems (to me and colinwes) that there is definitely more than one problem:
(a) the app_AMD seg fault from bt #1 (05-SEP-06) and apparent memory leak from the test harness tests only apparent when using app_AMD.
(b) the IAX2 seg fault from bt #2 (07-SEP-06 09:02), #4 (#1 on 25-SEP-06), #5 (#2 on 25-SEP-06), #6 (#3 on 25-SEP-06), #7 (#1 on 26-SEP-06).
(c) the Zaptel overdriven segfault from bt #3 (07-SEP-06 14:22).
By: Justin R. Tunney (jtunney) 2006-09-26 12:18:58

Can any of you hackers give me an educated guess as to whether or not this will happen in 1.2?
By: seb7 (seb7) 2006-09-26 12:37:45

I believe issue (c), the Zaptel (PRI) overdriven segfault, may happen on asterisk-1.2.0-beta1 and, by implication, on everything between there and most likely asterisk-1.4.0-beta2. However, I have not verified this. We do also get crashes on asterisk-1.2.0-beta1 (which we still run for compatibility with app_machinedetect), and these crashes only (or mostly) seem to happen when too many calls are going through the platform, probably because a call is originated (via the AMI) and tries to grab a zap channel when none are available. We have not tried to replicate this problem in a test harness on asterisk-1.2.0-beta1, nor posted any bug reports on that version. I've just had a quick look at a couple of core dumps, and the backtrace doesn't look like any of the ones already attached to this issue (they look related to hanging up a channel when a variable called "data" is not available).

By: Serge Vecher (serge-v) 2006-09-28 12:53:26

Alright, just to rule out chan_iax2 problems, please test trunk at r > 43917.
By: colin westlake (colinwes) 2006-09-29 10:59:07

see new trace on trunk
By: colin westlake (colinwes) 2006-10-02 07:51:33

New trace on 1.4.0-beta2, this time outdialing via SIP (no Zap card installed). It still looks like the problem is connected with the IAX leg of the call, where it is passed to another machine hosting the queues.
By: Serge Vecher (serge-v) 2006-10-02 10:12:35

alright, those fixes went in after the 1.4-beta2 release, so you need to check out the latest 1.4 branch.
By: Matt O'Gorman (mogorman) 2006-11-01 12:14:47.000-0600

any luck with asterisk 1.4 branch checkout?
By: colin westlake (colinwes) 2006-11-03 04:41:19.000-0600

Well, we tried it yesterday with Asterisk SVN-branch-1.4-r46857 in the test harness and passed 180,000 calls (10 times the level that it previously took to precipitate a crash). We still apparently have a memory leak in app_AMD, but as far as the other issues are concerned, I think we can close this one!
By: seb7 (seb7) 2006-11-06 13:17:28.000-0600

I was working with colinwes to test whether the recent fixes to chan_iax2 (since 1.4.0-beta2), in particular rev 46775, have solved our problems, and indeed issue
(b) the IAX2 seg fault from bt #2 (07-SEP-06 09:02), #4 (#1 on 25-SEP-06), #5 (#2 on 25-SEP-06), #6 (#3 on 25-SEP-06), #7 (26-SEP-06), #8 (29-SEP-06), #9 (02-OCT-06)...
seems to be finally fixed! A *huge thanks*, file! This has made asterisk 1.4 a lot more usable for us.

However these issues remain:
(a) the app_AMD seg fault from bt #1 (05-SEP-06) and the apparent memory leak from our test harness tests that only showed up when using app_AMD.
(c) the Zaptel overdriven segfault from bt #3 (07-SEP-06 14:22).
By: Joshua C. Colp (jcolp) 2006-11-07 17:05:04.000-0600

I'll take this bug under my wing since I'm waiting for some memory info about it.
By: Joshua C. Colp (jcolp) 2006-11-20 21:05:08.000-0600

Any update/info on your memory leak?
By: seb7 (seb7) 2006-11-22 11:17:33.000-0600

Here is the 'top' output on one of our servers running a recent svn version of 1.4 (r47540). It has been running since Nov 16.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7276 root 16 0 81184 60m 5412 S 0.0 3.0 0:03.07 asterisk

'ps aux' output:
root 7276 0.0 3.0 83624 62548 ? S Nov16 0:03 /usr/sbin/asterisk -vvvg -c

One of our developers (at Syntec Telecom), Andre Boelhouwer, has also been looking into this memory leak. I compiled Asterisk with the MALLOC_DEBUG compiler flag. But when Andre issued one of the 'memory show' commands from the Asterisk CLI after it had been running for a while and leaked some memory, Asterisk crashed, so I'm a bit reluctant to do that again at the moment when there are calls up. I am attaching the core dump trace.

I noticed that Asterisk is writing an mmlog file in /var/asterisk/log. When there are calls going through the server, we get the following line added to the log every 5-10 seconds on average (but variable):
WARNING: Freeing unused memory at (nil), in ast_yyfree of ast_expr2f.c, line 3091
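
For anyone wanting to see how often each call site triggers this warning, a small tally script along these lines might help. It is only a sketch: the regular expression is written against the single warning line quoted above, so other mmlog line formats are an unknown and are simply skipped, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern based solely on the warning line quoted above; any other
# mmlog line format is ignored by this sketch.
WARNING_RE = re.compile(
    r"WARNING: Freeing unused memory at (?P<addr>\S+), "
    r"in (?P<func>\w+) of (?P<file>[\w.]+), line (?P<line>\d+)"
)


def tally_warnings(lines):
    """Count 'Freeing unused memory' warnings per (function, file, line)."""
    counts = Counter()
    for text in lines:
        match = WARNING_RE.search(text)
        if match:
            key = (match["func"], match["file"], int(match["line"]))
            counts[key] += 1
    return counts
```

Passing an open file object for the mmlog to `tally_warnings` returns a Counter keyed by call site. Note that the address in the quoted warning is (nil), which suggests a NULL pointer being passed to free; in standard C that is a no-op, so this particular warning may well be unrelated to the leak.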

Andre has been trying to dig further and says he has found some memory leaks, but he doesn't think he has yet found the one that is affecting this server. I am giving him your contact details, file, in case he gets bogged down, or simply because two heads are better than one.

By: Joshua C. Colp (jcolp) 2007-01-22 13:29:44.000-0600

Since there have been no updates and no emails regarding this, I am suspending it for now. If you have any news, please reopen.