
Summary: ASTERISK-09300: Asterisk dies unexpectedly, cannot find out reason why. I suspect due to nagios monitoring on management interface.
Reporter: Attila Megyeri (amegyeri)
Labels:
Date Opened: 2007-04-23 14:26:45
Date Closed: 2007-08-02 15:44:56
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) backtrace1
( 1) backtrace2
( 2) backtrace-noqualify-1
( 3) backtrace-noqualify-2
( 4) check_asterisk.pl
( 5) manager.conf
( 6) manager.conf.permit_removed
Description: Asterisk died 10-20 times today, at unpredictable, irregular intervals. It is running on a XEN virtual machine, and the only recent change is that I have introduced nagios monitoring. Nagios checks asterisk every minute.
After a random time, asterisk simply drops my remote unix connection (asterisk -r). There are no entries even in the FULL log.
I suspect this is due to some issue in the management interface, but I'm not sure.
The same nagios checks another asterisk server (though it is a remote one, not localhost), and that one runs flawlessly.
The issue happens with version 1.2.17, the other asterisk is 1.2.14.
Any help would be highly welcome.
How could I try to locate the problem? Even the highest level logs do not show anything at all, and there is no common pattern in the logs before it crashes...

Comments:By: Attila Megyeri (amegyeri) 2007-04-23 14:28:15

One thing to add: I introduced nagios monitoring, because * was dying even before. But I noticed this issue when running "Phonecall", the GPL management utility. That one also connects to the management interface to get 'current call' statuses.

By: Attila Megyeri (amegyeri) 2007-04-23 16:45:59

I added the backtraces of the two most recent crashes.

Interestingly, * does not crash when I turn off the nagios monitoring. I'm using the check_asterisk.pl plugin, but it works fine with the REMOTE 1.2.14 system. Strange. I have also attached check_asterisk.pl.

By: Attila Megyeri (amegyeri) 2007-04-24 01:44:47

Having a look at the backtraces, it seems the issue was with the qualify=yes with the sip peers.
For this reason, I replaced all qualify=yes with qualify=no, and * is still crashing, but now not in the peer poke function.
Please see the new backtraces:
backtrace-noqualify-1
backtrace-noqualify-2
Now it dies here:
(gdb) bt
#0  0x4018950d in mallopt () from /lib/libc.so.6
#1  0x40188d4b in mallopt () from /lib/libc.so.6
#2  0x40188763 in calloc () from /lib/libc.so.6
#3  0x4061fa72 in sip_alloc (callid=0x81bf648 "7286406701c23421163822d30e42fb01@127.0.0.1", sin=0x0, useglobal_nat=0, intended_method=2)
   at chan_sip.c:3108
#4  0x4061661c in transmit_register (r=0x81bf330, sipmethod=2, auth=0x0, authheader=0x81b7800 "o") at chan_sip.c:5554
#5  0x40615664 in __sip_do_register (r=0x81b7800) at chan_sip.c:5475
#6  0x406300f1 in sip_reregister (data=0x81bf330) at chan_sip.c:5465
#7  0x080566c8 in ast_sched_runq (con=0x81a84e8) at sched.c:373
#8  0x40628d82 in do_monitor (data=0x0) at chan_sip.c:11688
#9  0x40026e51 in pthread_start_thread () from /lib/libpthread.so.0
#10 0x401ee8aa in clone () from /lib/libc.so.6

This still happens when using nagios.

Can anyone help ? :(

By: Attila Megyeri (amegyeri) 2007-04-24 09:06:40

I made a couple of new backtraces, the issue is always with a rescheduled sip_destroy:

(gdb) bt
#0  0xb7d772f2 in mallopt () from /lib/libc.so.6
#1  0xb7d760df in free () from /lib/libc.so.6
#2  0xb79106fa in __sip_destroy (p=0x8213228, lockowner=1) at chan_sip.c:2213
#3  0xb7910ff9 in sip_destroy (p=0x8213228) at chan_sip.c:2301
#4  0xb790d1e0 in __sip_autodestruct (data=0x8213228) at chan_sip.c:1342
#5  0x08056c50 in ast_sched_runq (con=0x81a2018) at sched.c:373
#6  0xb7936aab in do_monitor (data=0x0) at chan_sip.c:11688
#7  0xb7ee2e51 in pthread_start_thread () from /lib/libpthread.so.0
#8  0xb7ddc8aa in clone () from /lib/libc.so.6

The full log shows a couple of autodestroying rows:

chan_sip.c: (Provisional) Stopping retransmission (but retaining packet) on '51027cd72505383d7eca4e135b249be7@127.
Apr 24 16:05:14 DEBUG[22989] chan_sip.c: Stopping retransmission on '51027cd72505383d7eca4e135b249be7@127.0.0.1' of Request 105: Match Found
Apr 24 16:05:14 DEBUG[22989]

What are these localhost calls??

By: Attila Megyeri (amegyeri) 2007-05-03 07:02:32

Is there any chance someone will look into this issue, please?

By: Attila Megyeri (amegyeri) 2007-05-06 07:14:35

One important remark:

I tried to set up monitoring from remote hosts (nagios and phonecall connecting to my asterisk box from a non-localhost IP) and the problem does not exist!

So, basically, I have isolated the problem; I just need some help to fix it:

Asterisk, running on a XEN virtual machine, dies after a couple of minutes if a process connects to the management interface FROM LOCALHOST. It usually dies with a libc memory allocation problem; backtraces attached.

By: Tilghman Lesher (tilghman) 2007-06-25 15:00:09

Please upload your manager.conf, with the secrets X'ed out.

By: Attila Megyeri (amegyeri) 2007-06-25 16:57:45

manager.conf uploaded.

It has been a long time since I opened this issue and tried many things, but was unable to find any good replication point. The crash still exists if I use any app connecting to the manager interface regularly.

Unfortunately, the crash happens with remote manager connections as well (even though I said the opposite in my last note).

By: Tilghman Lesher (tilghman) 2007-06-25 17:17:31

I think I see what the problem is.  You have given nagios unlimited read permissions (which are for events) and no write permissions (which are for interactive commands).  Given that you're trying to run interactive commands with your script, you're likely to find that everything is denied.  Plus, given that you're never reading the asynchronous events, you're building up a large queue; while that shouldn't cause Asterisk to crash, it certainly is a reason why stuff isn't working as expected.

What's possibly happening is that you're building up a large number of nagios sessions, each of which is eating a large queue, and you're simply running out of memory.
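
To illustrate the point about unread events, here is a minimal sketch of a monitoring check that logs in, asks for no events, sends one command, and logs off before closing the socket. It is not the attached check_asterisk.pl. The function names, the host, and the credentials are hypothetical, and it assumes the Login action honours an `Events: off` header and that Ping is answered with either `Pong` or `Success` depending on the Asterisk version.

```python
import socket


def build_action(name, **headers):
    """Serialize one manager action: 'Key: Value' lines ended by a blank line."""
    lines = [f"Action: {name}"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_message(raw):
    """Turn the raw bytes of one manager message into a dict of its headers."""
    fields = {}
    for line in raw.decode("ascii", "replace").split("\r\n"):
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def read_message(reader):
    """Read lines from a binary file-like object up to the terminating blank line."""
    chunks = []
    for line in reader:
        if line in (b"\r\n", b"\n"):
            break
        chunks.append(line)
    return parse_message(b"".join(chunks))


def check_asterisk(host, port, username, secret, timeout=10.0):
    """Return True if the manager interface accepts the login and answers a Ping."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        reader.readline()  # greeting banner sent on connect
        # Assumption: "Events: off" stops events being queued for this session.
        sock.sendall(build_action("Login", Username=username,
                                  Secret=secret, Events="off"))
        if read_message(reader).get("Response") != "Success":
            return False
        sock.sendall(build_action("Ping"))
        alive = read_message(reader).get("Response") in ("Pong", "Success")
        # Log off explicitly so the session is not left behind on the server.
        sock.sendall(build_action("Logoff"))
        return alive
```

A check would call something like `check_asterisk("127.0.0.1", 5038, "status", "XXXXXX")`, where 5038 is the usual manager port and the user name and secret are placeholders.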

By: Tilghman Lesher (tilghman) 2007-06-25 17:19:01

Also, your permit lines are doing absolutely nothing to prevent other hosts from connecting.  You need a deny=0.0.0.0/0.0.0.0 before the permit line to prevent other hosts from connecting.  This is because the manager interface defaults to allowing all connections.
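
The effect of the missing deny line can be sketched as follows. This is only an illustration of the behaviour described above, not Asterisk's code: it assumes the rules are checked in the order they appear, that the last matching rule wins, and that a host matching no rule is allowed. The helper names and the sample addresses are hypothetical.

```python
import ipaddress


def parse_rules(lines):
    """Parse 'permit=addr/mask' and 'deny=addr/mask' lines into (is_permit, network) pairs."""
    rules = []
    for line in lines:
        sense, _, value = line.partition("=")
        sense = sense.strip().lower()
        if sense not in ("permit", "deny"):
            raise ValueError(f"unknown rule type: {line!r}")
        network = ipaddress.ip_network(value.strip(), strict=False)
        rules.append((sense == "permit", network))
    return rules


def is_allowed(rules, address):
    """Apply the rules in order; the last matching rule decides, default is allow."""
    addr = ipaddress.ip_address(address)
    allowed = True  # assumed default: no matching rule means the host may connect
    for is_permit, network in rules:
        if addr in network:
            allowed = is_permit
    return allowed


# A permit line on its own restricts nothing, because unmatched hosts are allowed.
only_permit = parse_rules(["permit=127.0.0.1/255.255.255.255"])
print(is_allowed(only_permit, "192.0.2.10"))

# Denying everything first makes the permit line meaningful.
locked_down = parse_rules(["deny=0.0.0.0/0.0.0.0",
                           "permit=127.0.0.1/255.255.255.255"])
print(is_allowed(locked_down, "192.0.2.10"), is_allowed(locked_down, "127.0.0.1"))
```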

By: Attila Megyeri (amegyeri) 2007-06-25 18:04:00

Well, actually, in the most recent config nagios uses the "status" user; the "nagios" user was left in manager.conf by mistake.

The monitoring works perfectly, i.e. I'm not getting any denials. I just turned it off recently to avoid crashing.

Now, I removed the permit line (as I'm using a firewall anyway) but I doubt this would change anything.

So I guess we need to dig deeper... :(

By: Jason Parker (jparker) 2007-07-31 11:22:38

Can this issue be reproduced on 1.4?

Once 1.2 goes into maintenance mode (scheduled for tomorrow - August 1st), all issues that only affect 1.2 may be closed.

By: Steve Murphy (murf) 2007-08-02 15:44:54

I apologize for this, but I'm closing this bug because the time for 1.2
support is now expired.

All hope is not lost, tho! Hopefully, there will come a time and chance for you
to move your installation to 1.4; and if this problem persists in 1.4, you are more than welcome to re-open this bug, or open a new one, and we'll see if it can't be solved.

If the problem disappears in 1.4, all I can say is Yay!; but if not, we'll dive back in and sort it out.

Sorry for this inconvenience! We hope you'll stick with it and try this on 1.4.