ASTERISK-18074: SIP messages stop being processed

Summary: ASTERISK-18074: SIP messages stop being processed

Reporter: ppower (ppower)
Labels:
Date Opened: 2011-06-29 14:08:57
Date Closed: 2011-07-11 13:54:42
Priority: Critical
Regression?
Status: Closed/Complete
Components:
Versions: 1.8.2
Frequency of Occurrence:
Related Issues:
Environment: gentoo 2.6.36 dual quad core hyper threaded 2.67GHz intel processors 8GB RAM
Attachments: ( 0) locks.txt ( 1) locks2.txt

Description: The other day asterisk stopped processing SIP messages. This caused the polycom phones to register elsewhere. The DUNDi and IAX2 threads continued to operate, and the console was still functional. IAX calls that came in to the server and tried to dial SIP end points hung, which caused the 'Max retries exceeded to host' messages to appear.

top reported asterisk using 100% CPU. vmstat reported 6-7% usage (which is 100% of one CPU). Nearly all of this was in system time, not user time. This may well be related to other CPU bound issues listed in the system, but there is little data in those issues that I can use to correlate them with this one.

I have recompiled asterisk with the DEBUG_THREADS switch so that I can get a lock report if this happens again. It happened twice yesterday, and it happened several times when I attempted to use 1.8.2 about 5 weeks ago, but at that time my debugging was focused on the 'Max retries exceeded to host' warnings. To be plain, I receive no SIP related warnings or errors when this happens. By all means let me know how I can help get this sorted out. I cannot really use 1.8.2 until this is resolved.
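The CPU figures above can be cross-checked with a little arithmetic. This is only a sketch: it assumes that "dual quad core hyper threaded" means 2 sockets x 4 cores x 2 hardware threads, which the report does not state explicitly.

```python
# Assumption (not stated in the report): 2 sockets, 4 cores each, 2 hardware threads per core.
sockets = 2
cores_per_socket = 4
threads_per_core = 2

logical_cpus = sockets * cores_per_socket * threads_per_core

# Share of the whole machine that is consumed when exactly one logical CPU is saturated.
one_cpu_share_percent = 100.0 / logical_cpus

print(f"{logical_cpus} logical CPUs; one saturated CPU = {one_cpu_share_percent:.2f}% of the machine")
```

Under that assumption one fully busy logical CPU is 6.25% of the machine, which is consistent with vmstat showing 6-7% while top shows 100% for the single process.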

Comments:

By: ppower (ppower) 2011-06-29 16:14:29.351-0500

We had a lockup a bit ago. Here is the output of the core show locks command.

By: Gregory Hinton Nietsky (irroot) 2011-06-30 01:21:13.822-0500

OK, this is not a deadlock.

When it happens, please "kill -6" asterisk to obtain a core dump, and read:

https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: David Woolley (davidw) 2011-06-30 05:38:25.520-0500

Do not use kill to get a backtrace. It tends to result in only one thread being traced, and that might not be one that is involved in the deadlock/livelock.

Either use gcore, or attach gdb to the running process.
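A minimal sketch of the two approaches, written as Python helpers that only build the command lines. The PID and the `core` output prefix are hypothetical placeholders, and the exact procedure recommended by the Asterisk wiki may differ from this.

```python
import shlex
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a gdb command that attaches to `pid`, prints a backtrace of every thread, then exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt full"]


def gcore_command(pid: int, prefix: str = "core") -> List[str]:
    """Build a gcore command that writes a core file named `<prefix>.<pid>` without killing the process."""
    return ["gcore", "-o", prefix, str(pid)]


if __name__ == "__main__":
    example_pid = 12345  # hypothetical PID; substitute the real asterisk PID
    print(shlex.join(gdb_backtrace_command(example_pid)))
    print(shlex.join(gcore_command(example_pid)))
```

Both commands leave the process running, and both capture all threads rather than only the one that happened to receive a signal.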

Also, the standard signal for forcing a dump, from outside the program, on most or all Unix-like systems is number 3. Signal six is for program initiated aborts, e.g. when freeing memory that wasn't allocated.
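The signal numbers mentioned above can be checked quickly from Python's standard `signal` module. This sketch assumes a Linux (or similar Unix-like) host; numbering can differ on other platforms.

```python
import signal

# Signal 3 is SIGQUIT (quit requested from outside, dumps core by default);
# signal 6 is SIGABRT (what abort() raises from inside the program).
for sig in (signal.SIGQUIT, signal.SIGABRT):
    print(sig.name, int(sig))
```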
By: ppower (ppower) 2011-06-30 07:50:53.425-0500

When it happens again I will use gdb in accordance with the instructions at https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Freddi Hansen (freddi_fonet) 2011-06-30 13:59:53.157-0500

This issue has several duplicates with patches here on jira. It looks like you should either install all the related patches or upgrade to 1.8.5, which doesn't have this problem. The SIP channel on your release (assuming unpatched 1.8.2) freezes signalling for up to 10 minutes and then starts working again. Keeping timing on dahdi and disabling tcp on sip will usually fix the problem, but that's not always possible.

By: ppower (ppower) 2011-06-30 14:14:54.626-0500

I will see if I can find those duplicates. I had my signalling freeze for almost 50 minutes before taking matters into my own hands. I did not explicitly enable tcp/sip, so I hope that is not an issue.

By: ppower (ppower) 2011-06-30 14:56:16.754-0500

I found this as a duplicate: ASTERISK-17934, but it is suspended.

By: Freddi Hansen (freddi_fonet) 2011-06-30 17:01:40.534-0500

I didn't keep track of issue numbers during the mantis/jira migration (the numbers should be the same, but my bookmarks?). Check the following issues: 17129, 17255, 17512, 18497. Some may be duplicates.

By: ppower (ppower) 2011-07-01 09:05:05.076-0500

Thanks for the issue numbers. ASTERISK-17129 has the same lock profile as I do, but the TLS patch seemed to fix things for them. ASTERISK-17255 and ASTERISK-17512 appear to be duplicates as well. ASTERISK-17512 also has the same lock profile.

For the record, I am using: res_timing_timerfd

The last three threads in my core show locks output are new calls trying to get started. The first thread is the sip monitor thread. It has the monitor lock and is trying to get a channel lock. The second thread has the channel lock and says it started life as a pbx thread, but the ast_io_wait call from chan_sip only occurs in code from the monitor thread. This has me confused. It seems as though there are two monitor threads running, sort of. Anyway, this is the type of locking situation that I have seen in two of the other issues. The ASTERISK-17255 lock dump was more complicated.
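One way to reason about a lock report like this is to turn it into a wait-for graph and look for a cycle. The sketch below is purely illustrative: the thread and lock names are hypothetical labels, and the second thread waiting on the monitor lock is an assumption made for the example, not something the lock dump above shows.

```python
from typing import Dict, List, Optional


def find_lock_cycle(holds: Dict[str, List[str]],
                    waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the threads forming a wait-for cycle, or None if there is none.

    holds maps a thread to the locks it owns; waits_for maps a thread to the
    single lock it is blocked on.
    """
    # Invert `holds` so each lock points at the thread that owns it.
    owner = {lock: thread for thread, locks in holds.items() for lock in locks}

    for start in waits_for:
        path = [start]
        current = start
        # Follow "blocked on a lock owned by ..." edges until the chain ends or loops.
        while current in waits_for:
            next_thread = owner.get(waits_for[current])
            if next_thread is None:
                break  # the wanted lock is free, so this chain is not a deadlock
            if next_thread in path:
                return path[path.index(next_thread):]
            path.append(next_thread)
            current = next_thread
    return None


# Hypothetical data loosely modelled on the description above.
holds = {"monitor_thread": ["monitor_lock"], "pbx_thread": ["channel_lock"]}

# Assumption for illustration only: the pbx thread is blocked on the monitor lock.
deadlocked = find_lock_cycle(
    holds, {"monitor_thread": "channel_lock", "pbx_thread": "monitor_lock"})

# If the second thread is not blocked on any lock, there is no cycle.
not_deadlocked = find_lock_cycle(holds, {"monitor_thread": "channel_lock"})

print(deadlocked)
print(not_deadlocked)
```

A cycle means a true deadlock. No cycle, with one thread spinning while holding a lock, would point toward a livelock or a busy loop instead, which is why a full backtrace of all threads is still needed.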

This issue did crop up yesterday, but the https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace instructions failed me. I will work on that today.

By: ppower (ppower) 2011-07-01 10:34:34.350-0500

Uploading the lock dump from yesterday. Definite deadlock. The backtrace instructions work better when gdb is installed...

By: Freddi Hansen (freddi_fonet) 2011-07-01 17:45:19.866-0500

If possible, try to use the dahdi timer for now. Even 1.8.5-rc1 still struggles with a nasty 'idle read on ..' thing; see issue 17867.

By: ppower (ppower) 2011-07-05 09:31:21.330-0500

Using dahdi timing as of this morning. We shall see how this goes. Thanks Freddi.

By: Leif Madsen (lmadsen) 2011-07-11 13:54:42.207-0500

I'm closing this issue as it is most likely to do with the timing module in use, and those issues have already been reported several times. If you run into an issue while using res_timing_dahdi, please open a new issue and attach a backtrace and 'core show locks' output. Thanks!

By: ppower (ppower) 2011-07-13 08:46:02.522-0500

Leif-
Which issue is this one a duplicate of? Or, which issue should I keep an eye on and help with when I have the time?