ASTERISK-05455: runaway chan_local "Not posting to queue" causing Asterisk freeze(defunct)

[Home]

Summary: ASTERISK-05455: runaway chan_local "Not posting to queue" causing Asterisk freeze(defunct)

Reporter: Matt Florell (mflorell) Labels:

Date Opened: 2005-11-04 11:27:17.000-0600 Date Closed: 2011-06-07 14:00:31

Priority: Blocker Regression? No

Status: Closed/Complete Components: Channels/chan_local

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) chan_local_runaway_block_messages.zip

Description: We just installed 1.2 beta 2 last night and it ran fine for about 12 hours(the server was previously 1.2 beta 1). After about 12 hours, the Zap phones connected through channelbank one-by-one lost dialtone and the CLI started showing some strange characters when doing a 'show channels verbose' command: (not sure if these will show up)
¸°Èáò

we don't use any non american-latin characters in our system and have never seen this problem before on 1.2b1.

Also, when I attempted to forcibly kill the Asterisk process from the CLI I got this message:
*** glibc detected *** free(): invalid pointer: 0x40237d24 ***

Looking at the DEBUG from messages I saw this line repeated several hundred times per second right when the problems started:
Nov 4 11:29:24 DEBUG[26699] chan_local.c: Not posting to queue since already masked on 'Local/913125551212@demo-5701,2'

I have included the full messages file for this segment of time

Comments: By: BJ Weschke (bweschke) 2005-11-04 12:55:23.000-0600

Can you update to the latest CVS HEAD and report date/time? I think there a few memory related bugs that were corrected/cleaned-up after 1.2b2 was posted but I'm not 100% certain on the timing of them.
By: Kevin P. Fleming (kpfleming) 2005-11-07 22:14:11.000-0600

No, there have not been any memory corruption/leakage issues corrected since beta2 was released.

Do you have any clues as to a particular issue that may have caused this? I'm not keen on reading through a few megabytes of output looking for a clue :-(
By: Matt Florell (mflorell) 2005-11-09 01:35:02.000-0600

All I can see is that in this instance something doesn't seem to be allowing the local/ channel to be destroyed even after the call has been Answered and successfully MASQd and that's what the "Not posting to queue" seems to indicate because chan_local cannot write audio to the local/ channel because it doesn't technically exist anymore. In our setup we'll do over 100 Local channel calls per minute so we do a pretty good job of stressing chan_local out.

I'll load 1.2rc1 on one of our systems today and see if this happens again.

Is there anything else I can do to better debug when a BLOCK occurs?
By: Kevin P. Fleming (kpfleming) 2005-11-10 23:00:01.000-0600

Not specifically, no, we just need as many details as possible to try to reproduce it.
By: Matt Florell (mflorell) 2005-11-11 01:20:49.000-0600

Since upgrading to 1.2rc1 48 hours ago we have not seen this BLOCK occur again. This server has done over 80,000 Local channel calls without incedent.

You can close the bug and I'll reopen if it occurs again.
By: BJ Weschke (bweschke) 2005-11-11 07:23:02.000-0600

closed per original reporter's request. not reproducible with rc1.
By: Matt Florell (mflorell) 2005-11-28 11:19:33.000-0600

The exact same thing happened today, but this time with release 1.2.0. Same thousands of 'DEBUG[14274] chan_local.c: Not posting to queue since already masked on 'Local/917275551212@demo-bf91,2' DEBUG entries in the messages file (hundreds per second for the same Local/ channel instance).

Still no idea what causes this or how it happened, and cannot reproduce. I could post the logs again, but they look pretty much like the ones already on this issue. System ran for 16 days with this not happening under very heavy load. If anyone else runs into this issue and wants to try to figure it out I can send you full logs.

Just wanted to add this to the issue tracker for documentation sake, you can close this again.
By: Mark Spencer (markster) 2005-11-29 21:04:35.000-0600

Are you running any database or other CDR modules?
By: Matt Florell (mflorell) 2005-11-29 21:32:10.000-0600

No RealTime, no CDR modules. We are running several agi scripts that do access external databases, but no modules that do so.

This system ran CVS_2005-09-06(just after 1.2b1) for over a month with this never happening.

Similar system still running 1.2b1 and idential operations and Local/ channel volume has never had this problem in over 2 months.

This system does have a Digium TE406P card if that matters.
By: Matt Florell (mflorell) 2005-12-08 08:17:03.000-0600

OK, happened again earlier this week and I was determined to figure out why before I posted again. I took the server out of service and replaced it with another machine with the same T1 card and 1.2.0 on it. That new machine has run fine for the last 4 days.

I ran memtest on the old machine and discovered that on the last DDR2 memory DIMM at memory location 1945MB on the system that there is a flaw in the memory. Every time it hit that in the test it showed an error. Since I haven't had memory problems on a server in quite some time I didn't even think about that as a possibility especially since the freezes were so few and far between, but looking at out astguiclient system performance logs(which tracks memory usage), the freezes happened when the memory usage went above 1945MB every time.

So now that there is an explanation I think that you can close this bug for good. And anyone else that runs into a malloc memory error, take an hour and run memtest on you system and see if you have bad RAM.
By: BJ Weschke (bweschke) 2005-12-08 09:19:13.000-0600

hardware issue. not a software problem.