ASTERISK-05869: We have random crashes in less than 30 minutes.

Summary: ASTERISK-05869: We have random crashes in less than 30 minutes.

Reporter: Jose Pablo Fernandez (pupeno) Labels:

Date Opened: 2005-12-19 11:44:21.000-0600 Date Closed: 2006-02-24 21:26:30.000-0600

Priority: Critical Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) asterisk-backtrace.txt
( 1) bt-adomjan.txt
( 2) coredump.txt
( 3) extensions.conf
( 4) thread_all_bt_kanelbullar.txt
( 5) thread-apply-all-bt.txt
( 6) wunderkin-mixmonitor-r7847-crash1.txt

Description: We are trying to migrate from Monitor to MixMonitor. Since there were some bugs there, we are running Asterisk 1.2 from the SVN branch, updated as of today.
When we start Asterisk with a configuration that uses MixMonitor, we reliably get a crash within 30 minutes. This Asterisk instance is serving about 18 concurrent calls for a call center with 40 workstations (and about 60 more phones for other workstations that are used less).
The crash generated a core, so I was able to generate backtraces which I'll be attaching.
Anything else you need just ask.

Comments: By: Tilghman Lesher (tilghman) 2005-12-19 11:50:05.000-0600

Which revision of 1.2 are you running?
By: Jose Pablo Fernandez (pupeno) 2005-12-19 11:55:56.000-0600

Latest as of today of the 1.2 SVN branch. The Subversion revision is 7521. Is that what you need?
By: Tilghman Lesher (tilghman) 2005-12-19 12:04:20.000-0600

Yes.

You have some severe memory corruption. The way the code is written, it's not possible for f to be NULL at that point (it would have segfaulted about 10 lines earlier). There isn't even a race condition that could be formulated that would do this. The only thing I can think of is that your memory is either going bad or is inadequately cooled.
By: Jose Pablo Fernandez (pupeno) 2005-12-19 12:11:23.000-0600

That is strange. This is a new, expensive IBM server with expensive IBM memory (with parity, bells and whistles). I have even run memtest86+ for a while and no error was reported.
The memory is cooled by two coolers, but I can't check the temperature right now.
As a note, this server is in production, using 1.2.1 and Monitor, and it ran for several days without problems.
By: Jose Pablo Fernandez (pupeno) 2005-12-19 12:15:05.000-0600

In less than an hour, we'll upgrade to 1.2 SVN but without using MixMonitor, to see if we have the problem or not.
By: Jose Pablo Fernandez (pupeno) 2005-12-19 12:22:20.000-0600

This Asterisk was being run by safe_asterisk, so when it crashed it was run again automatically and it crashed again, so I have two cores, both of them show the crash in exactly the same place:

#0 ast_frdup (f=0x0) at frame.c:351
#1 0x0806899f in queue_frame_to_spies (chan=0x83048c8, f=0x8327b00, dir=SPY_WRITE) at channel.c:1185
#2 0x080628aa in ast_write (chan=0x83048c8, fr=0x8327b00) at channel.c:2253
By: Jose Pablo Fernandez (pupeno) 2005-12-19 14:09:09.000-0600

We upgraded to branch 1.2 from SVN, revision 7521. An hour has passed and it's working.
The only difference between now and the crashing time was that then we were using MixMonitor instead of Monitor.
This might be related to the bug I reported on Friday.
By: BJ Weschke (bweschke) 2005-12-19 14:27:45.000-0600

pupeno: was it you on the forums that is reporting a crash with MixMonitor as well or is that someone else?
By: Jose Pablo Fernandez (pupeno) 2005-12-19 14:49:26.000-0600

I haven't posted to the forums ever, so, it must be someone else.

PS: Everything I do, I do it as Pupeno.
By: Mark Spencer (markster) 2005-12-20 01:55:04.000-0600

Would it be possible to get a "thread apply all bt" on this same core?
By: Jose Pablo Fernandez (pupeno) 2005-12-20 10:23:26.000-0600

Yes, I uploaded it. Is it OK?
By: Kenneth Holm (saitech) 2006-01-03 17:19:20.000-0600

I think I'm having a similar problem to this. I have uploaded a backtrace of my core dump.

coredump.txt
By: Clod Patry (junky) 2006-01-03 19:58:55.000-0600

saitech: could you add "thread apply all bt" too?
and which Version do you run exactly?
Try to specify that when you post backtrace.
thanks.
By: wunderkin (wunderkin) 2006-01-08 11:13:38.000-0600

I just got a couple of crashes too on the latest 'release branch' (rev 7847)

I was doing a load test with 46 Zap and 46 SIP connections, but only monitoring the SIP connections with MixMonitor(${UNIQUEID}.ul)

On the second crash, I saw this on the console:
Jan 8 10:53:54 ERROR[5011]: ../include/asterisk/lock.h:163 __ast_pthread_mutex_destroy: app_mixmonitor.c line 263 (mixmonitor_thread): Error: attempt to destroy locked mutex '&spy.lock'.
Jan 8 10:53:54 ERROR[5011]: ../include/asterisk/lock.h:165 __ast_pthread_mutex_destroy: channel.c line 1049 (ast_channel_spy_remove): Error: '&spy.lock' was locked here.
Jan 8 10:53:54 ERROR[5011]: ../include/asterisk/lock.h:171 __ast_pthread_mutex_destroy: app_mixmonitor.c line 263 (mixmonitor_thread): Error destroying mutex: Device or resource busy

I have attached a BT. Both crashes are the same.
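
For illustration, the console errors above describe a mutex being destroyed while another code path still holds it. The following is a minimal sketch of how a destroy wrapper could detect that condition with `pthread_mutex_trylock`. The helper name `checked_mutex_destroy` is hypothetical and this is not a copy of Asterisk's lock.h; it assumes a default (non-recursive) POSIX mutex.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper, not Asterisk's lock.h: refuse to destroy a mutex
 * that is still held. Returns 0 on success, EBUSY if the mutex is
 * currently locked, or another error code from the pthread calls. */
static int checked_mutex_destroy(pthread_mutex_t *m)
{
    int res = pthread_mutex_trylock(m);

    if (res != 0) {
        /* EBUSY means some thread still holds the lock: do not destroy. */
        return res;
    }

    /* We acquired the lock, so no one else held it at that instant.
     * Release it and destroy the mutex. */
    pthread_mutex_unlock(m);
    return pthread_mutex_destroy(m);
}
```

A check like this is only a diagnostic: another thread could still take the lock between the unlock and the destroy, so the real fix has to make sure the spy is fully detached from the channel before its owner tears it down.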

By: Paulo Mendes da Silva (kanelbullar) 2006-01-17 03:48:58.000-0600

We are also trying to use MixMonitor and experiencing a random crash after executing our test scenario for a few hundred or thousand calls. We are generating 170 simultaneous calls: 60 of them are Zap calls and 110 are SIP calls. Sometimes calls may be generated at the same time, and sometimes they may have intervals between them. This means that in some cases they may all be hung up at pretty much the same time. Our crashes seem to occur in the same function:

#0 ast_channel_spy_remove (chan=0x9887b30, spy=0xb534f3b0) at channel.c:1053
1053 spy->write_queue.head = f->next;
(gdb) bt
#0 ast_channel_spy_remove (chan=0x9887b30, spy=0xb534f3b0) at channel.c:1053
#1 0x08066dbc in ast_hangup (chan=0x9887b30) at channel.c:1093
#2 0x08091461 in __ast_pbx_run (c=0x9887b30) at pbx.c:2457
#3 0x08092bdc in pbx_thread (data=0x9bb2ec0) at pbx.c:2507
#4 0x00b79341 in start_thread () from /lib/tls/libpthread.so.0
#5 0x00a656fe in clone () from /lib/tls/libc.so.6

This appears to be called when a call is hung up.

I will attach a "thread apply all bt" obtained from one of our core files.
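
For illustration, the faulting line at channel.c:1053 is one step of unlinking frames from a spy's queue. The sketch below shows the general pattern of draining a singly linked queue while holding its lock. The types `struct frame` and `struct frame_queue` and the function `queue_drain` are hypothetical stand-ins, not the definitions in channel.c, and the sketch does not identify the root cause of the crash.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a frame and a spy queue. */
struct frame {
    struct frame *next;
};

struct frame_queue {
    pthread_mutex_t lock;
    struct frame *head;
    struct frame *tail;
    unsigned int count;
};

/* Free every queued frame while holding the queue lock.
 * Returns the number of frames released. */
static unsigned int queue_drain(struct frame_queue *q)
{
    unsigned int freed = 0;
    struct frame *f;

    pthread_mutex_lock(&q->lock);
    while ((f = q->head) != NULL) {
        q->head = f->next;  /* unlink first, then free */
        free(f);
        freed++;
    }
    q->tail = NULL;
    q->count = 0;
    pthread_mutex_unlock(&q->lock);

    return freed;
}
```

The drain is only safe if every other thread that touches the queue takes the same lock and the queue itself outlives the call. If the structure holding the queue can be freed or reused by another thread during hangup, a line like `q->head = f->next` can fault in the way the backtrace shows.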

By: Paulo Mendes da Silva (kanelbullar) 2006-01-17 03:50:26.000-0600

Just an additional note, we are using Asterisk 1.2.1.
By: Steve Davies . (stevedavies) 2006-01-22 10:34:08.000-0600

You might like to try the patch I uploaded in bug 6321; this fixes a "read too far" bug in code used when a channel is being spied.

In my case my box had an immediate segfault (I use grsecurity stuff), but perhaps the bug causes other corruption in other cases.

Steve
By: Clod Patry (junky) 2006-02-13 23:04:01.000-0600

pupeno: is it still an issue?
Can you try what stevedavies said?
By: adomjan (adomjan) 2006-02-14 05:14:48.000-0600

Hi, I still get a crash with ChanSpy. You can reproduce the crash with the uploaded extensions.conf.
By: Jose Pablo Fernandez (pupeno) 2006-02-14 10:18:06.000-0600

We still have crashes in our PBX (20 busy phones and 40 not-so-busy phones) in less than half an hour with stevedavies' patch.
By: Martin Vit (festr) 2006-02-22 16:44:21.000-0600

I have exactly the same problem.

The problem is that ast_frdup is called with a NULL frame (ast_frdup (f=0x0) at frame.c:351), and f is then dereferenced (f->something) on line 351. I have not had time to work through frame.c and channel.c to understand why f is 0x0.
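
For illustration, the crash pattern described here is a duplicate function dereferencing a NULL argument. The sketch below shows a defensive version of such a function. `struct frame` and `frame_dup` are hypothetical simplifications, not Asterisk's `struct ast_frame` or `ast_frdup`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified frame: a payload pointer and its length. */
struct frame {
    size_t datalen;
    void *data;
};

/* Return a deep copy of f, or NULL if f is NULL or allocation fails. */
static struct frame *frame_dup(const struct frame *f)
{
    struct frame *out;

    if (f == NULL) {
        /* Guard: report failure instead of dereferencing a NULL pointer. */
        return NULL;
    }

    out = malloc(sizeof(*out));
    if (out == NULL) {
        return NULL;
    }
    out->datalen = 0;
    out->data = NULL;

    if (f->datalen > 0 && f->data != NULL) {
        out->data = malloc(f->datalen);
        if (out->data == NULL) {
            free(out);
            return NULL;
        }
        memcpy(out->data, f->data, f->datalen);
        out->datalen = f->datalen;
    }

    return out;
}
```

A guard like this would turn the segfault into an error return that callers must handle, but it would not explain why the pointer is NULL in the first place, which the earlier comments attribute to memory corruption.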