ASTERISK-17255: Possible deadlock on 1.6.2.12, 14 and 15

Summary: ASTERISK-17255: Possible deadlock on 1.6.2.12, 14 and 15
Reporter: Dieter Jansen (justintonation)
Labels:
Date Opened: 2011-01-17 01:22:08.000-0600
Date Closed: 2013-01-14 14:51:32.000-0600
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) case.201101121019.bt.log
( 1) case.201101121019.caseinfo
( 2) case.201101121019.gcore.log
( 3) case.201101121455.bt.log
( 4) case.201101121455.caseinfo
( 5) case.201101121455.gcore.log
( 6) case.201101131248.bt.log
( 7) case.201101131248.caseinfo
( 8) case.201101131248.gcore.log
( 9) case.201101221250.backtrace-threads.txt
(10) case.201101221250.caseinfo
(11) case.201101221250.core-show-locks.txt

Description: After hours to days the system appears to reach a deadlocked state. SIP phones lose registration, in-flight calls continue, no new calls can be established. Need a "service asterisk restart" to get going again.

Problem is usually first observed when phones lose registration (120 seconds between registrations).

System has never been observed to reach this state overnight or on a weekend so it appears to only happen when the system is actively handling calls.

****** ADDITIONAL INFORMATION ******

I have experienced this (or similar) with 1.6.2.12, 1.6.2.14 and now 1.6.2.15, though I only have backtraces for 1.6.2.15.

Reported against 1.6.2.14 because that is the latest on the pulldown.

System started as:

AsteriskNOW 1.7.1 distribution
Asterisk 1.6.2.12
B410P BRI interface card connected to Australian Telco
DAHDI Version: 2.4.0 Echo Canceller: MG2
libpri version: 1.4.11.4

Updated to:

Asterisk 1.6.2.14
libpri version: 1.4.11.5

And then to:

Asterisk 1.6.2.15

Originally posted details to http://forums.digium.com/viewtopic.php?f=1&t=76457

Gathered backtraces of three occurrences as suggested by malcolmd and will attach files.

Comments: By: Dieter Jansen (justintonation) 2011-01-17 01:30:42.000-0600

I have the core dumps available but need some info about how to remove sensitive information before uploading to a public forum.
By: Dieter Jansen (justintonation) 2011-01-17 01:55:40.000-0600

Should also have said...

System comprises some 35 SNOM 320 SIP phones and a few SNOM 360s (with sidecars). Lots of BLFs.

2 of the 4 BRI ports are in use, all incoming calls arrive on the B410P. Outgoing calls mostly use the B410P (up to two calls) and overflow to a SIP VSP (infrequent).

A previous implementation with Asterisk 1.4 and mISDN (sorry - don't have full details to hand - releases were from about 3 to 4 years ago) worked.
By: wufan (wufan) 2011-01-17 05:32:56.000-0600

Hi,
How often does this crash occur?
Are there any errors on the console?
Thanks
By: Dieter Jansen (justintonation) 2011-01-17 05:44:51.000-0600

About 4 days apart is the longest I have seen, 2 to 3 hours the shortest. We only process about 400 - 600 calls per day.

No errors in the logs at all - the phones just start to show "NR" (no registration) and we have to restart the asterisk service.

Can still do a "core show channels" in the CLI when it's in the locked state.
By: wufan (wufan) 2011-01-17 06:43:42.000-0600

I am not a programmer, but I think you have to recompile your Asterisk with the DONT_OPTIMIZE flag, because the backtraces show <value optimized out>.
https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
(It's only for a better backtrace.)
By: Leif Madsen (lmadsen) 2011-01-17 09:00:45.000-0600

You'll need to provide the information as documented here:

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

You're missing 'core show locks' and unoptimized backtraces.
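
For readers collecting the same data, here is a minimal sketch of the two captures requested above, written as a small Python helper that only builds the command lines. It assumes Asterisk was built with DEBUG_THREADS (so `core show locks` exists) and DONT_OPTIMIZE, that `gdb` is installed, and that the binary lives at `/usr/sbin/asterisk`; the function name and the default path are illustrative assumptions, not something taken from this issue.

```python
import shlex


def deadlock_capture_commands(pid, binary="/usr/sbin/asterisk"):
    """Return argv lists for the two captures asked for in this issue.

    pid    -- process ID of the running Asterisk (supplied by the caller)
    binary -- path to the Asterisk executable (assumed default)
    """
    return [
        # Lock state; only available when built with DEBUG_THREADS.
        ["asterisk", "-rx", "core show locks"],
        # Backtrace of every thread in the live process.
        ["gdb", "-ex", "thread apply all bt", "--batch", binary, str(pid)],
    ]


if __name__ == "__main__":
    # 1234 is a placeholder PID used only to print the command lines.
    for argv in deadlock_capture_commands(1234):
        print(shlex.join(argv))
```

Run the printed commands while the system is in the locked state and redirect each output to a file before restarting the service.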
By: Dieter Jansen (justintonation) 2011-01-17 16:22:05.000-0600

I apologise for the missing information but I am unfortunately reliant on what is packaged with AsteriskNOW.

When I tried 'core show locks' I got a 'No such command' so I assume AsteriskNOW is compiled without DEBUG_THREADS. The same is true of the unoptimized backtrace as AsteriskNOW is compiled without DONT_OPTIMIZE.

I'm a little surprised that AsteriskNOW is not set up appropriately for this sort of problem reporting.

It will still be some time before I can prepare a system built from source.

By: Dieter Jansen (justintonation) 2011-01-17 16:32:32.000-0600

BTW https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace doesn't mention DEBUG_THREADS. Should it?
By: Leif Madsen (lmadsen) 2011-01-19 10:40:38.000-0600

Weird, it used to. Must have been a problem when it got imported. I have added a note now. Thanks!
By: Dieter Jansen (justintonation) 2011-01-22 05:01:12.000-0600

I'm a little reluctant to add this info as I'm not sure how well it relates to what has gone before. But FWIW...

I finally had a chance to lock the users out of the building for a few hours and I finished compiling and configuring a system on another server from source. Release info as above with Asterisk 1.6.2.15. I built with DONT_OPTIMIZE and DEBUG_THREADS.

The system started intermittently dropping SIP registrations immediately (with no calls) and when I did a few incoming calls to a queue of four extensions the system locked up almost immediately.

I took the required dumps BUT I am by no means certain that they relate to the same issue as originally reported.

I recompiled without DEBUG_THREADS and as far as I was able to tell with some simple tests the system worked normally at that point. Of course previously the deadlock was never observed until we had a normal workload on the system.

By way of an experiment I tried building 1.6.2.8 with both DONT_OPTIMIZE and DEBUG_THREADS and also had SIP registration issues (though no lockup). Built with just DONT_OPTIMIZE, everything again appeared to work normally. I ONLY rebuilt Asterisk in these experiments, not DAHDI and libpri, so I got a warning about possibly incompatible modules - all seemed to work despite that. [Thanks to sysreq for his comments on 1.6.2.8 in issue 18619 - I needed something to compare with and this seemed like a suitable candidate after I read your comment.]

I did not formally monitor resource use but when I ran top a few times I never saw more than 5% CPU busy in the DONT_OPTIMIZE configurations and about 10% in the DONT_OPTIMIZE and DEBUG_THREADS configuration.

So the upshot is I've uploaded the files but have no idea if they are related to the original assumed deadlock.

I hope to run some more tests before people get back to work on Monday and am still mulling over what configuration to leave in production for next week (if I change it at all). The frustrating thing is that a DEBUG_THREADS compile seems to be unusable in our config for whatever reasons, so I probably will never get a backtrace of a real life occurrence of the problem.

By: Leif Madsen (lmadsen) 2011-01-24 07:51:29.000-0600

Based on your quick responses, and your obvious ability to enable debugging and add useful information, I'm going to import this issue for evaluation by the development team. When this gets assigned to someone then hopefully the developer will have a better idea what to ask for, as I'm not sure what else I can ask you to provide (other than information on what your testing brings).

Thanks!
By: Dieter Jansen (justintonation) 2011-01-24 18:31:36.000-0600

Thanks lmadsen - I appreciate it. I am happy to do what I can to assist in further debugging this issue.

I ended up putting 1.6.2.8 (with DONT_OPTIMIZE and without DEBUG_THREADS) into production this week as I felt I had to throw the users a bone and try and give them a more stable platform, if only until the next round of testing.

So far (Monday and half of Tuesday here) we have had about 550 calls through the system with no lockups. That's by no means a record and we have a Public Holiday for Australia Day this Wednesday so my sense is that the workload is a little subdued. Still, it's looking promising as a reliable release in our environment for the moment. Maybe this helps to localise the change that has introduced this presumed deadlock?

As I say, I'm happy to test something else if I can (I've not used SVN before but I expect I can make it work given time).
By: Francesco Segato (fsegatoz) 2011-01-25 08:43:47.000-0600

I think I've run into the same issue. The system started as:
AsteriskNOW 1.7.0 distribution
Asterisk 1.6.2.12
Dialogic Diva V-4BRI card (with chan_capi and melware divas driver)
with 12 Snom320 and 3 Snom370 phones (all running firmware 7.3.30 and BLFs).

Every day or so the Asterisk SIP module froze and a full Asterisk restart was necessary to recover it. The call volume is considerably lower than in the other reported case.

I can exclude the Dialogic board as part of the issue, as we temporarily removed it and replaced it with an external SIP-ISDN gateway, with no improvement.

I got rid of this issue after two changes: first, I downgraded Asterisk to 1.6.2.9; second, I reprogrammed the Snom BLFs so that phones no longer subscribe to themselves (originally, every phone showed its own state among the other BLFs). Unfortunately I could not apply these changes independently, as this is a production system and I had limited time to work on it; the customer was too angry for me to ask him for some experimenting time :-(

By: Dieter Jansen (justintonation) 2011-01-25 17:38:28.000-0600

Thanks fsegatoz!

SNOMs and BLFs keep coming up - may not mean anything because people will search for issues that report similar environments to their own, but it does make you think...

I had a quick look and at our site we have 14 out of 35 phones with a BLF subscribed to their own number. This includes 3 of the four extensions that are in the ringall queue on the main switchboard number which is where most of the call action is.

Was there something specific that made you think that this self-monitoring might be associated with the problem or were you in a situation where you just had to try whatever you could think of to get the system running reliably again?
By: Francesco Segato (fsegatoz) 2011-01-26 02:19:24.000-0600

I have a second customer who has not experienced any failure in months, running a very similar system. I checked the differences between the two: Asterisk version (1.6.2.9 vs 1.6.2.12), and Snom without "self"-BLF. So last week I ported these differences to the first system, which has been running fine since then.

Moreover, I googled a bit about interoperability issues between Snom 3x0 phones and Asterisk, finding some cases not really similar, but suggesting that BLF function may cause some trouble to Asterisk 1.6.2.x: see e.g. http://forum.snom.com/index.php?showtopic=4836
By: Dieter Jansen (justintonation) 2011-01-26 04:55:14.000-0600

Thanks fsegatoz - that's a very interesting read.

We are running 7.3.14 SNOM firmware which is pretty old.

Next time I'm testing with an Asterisk release that exhibits the problem I'll look for the things pointed out in that article.

We got to 750 calls over 48 hours without a failure running 1.6.2.8 - if we make it to the end of the week without a lockup I think we'll have established a known-good configuration we can measure our progress against. If I can give the users a good run without failures I should be in a much better position to bargain with them for test windows where they and I know the system might fail.
By: wufan (wufan) 2011-02-03 10:50:55.000-0600

Hi,
I hope your systems are doing fine :)
I have a question. I noticed that before my deadlocks there were transfers like this:
a call came in on a queue
-> someone (10) picked up
-> transferred it to somebody (20)
-> somebody (20) was busy or not available
-> someone (10) spoke with the incoming caller again
-> someone (10) transferred it to another extension (30)
-> another extension (30) picked up the phone and the call was transferred.

Do you have logs from the console? Can somebody check whether there are lines like
== Extension Changed 90[custom] new state Hold for Notify User 70
== Extension Changed 90[custom] new state Hold for Notify User 71
== Extension Changed 90[custom] new state Hold for Notify User 72
before your deadlocks?

I have tried to reproduce this, but when I test in the evening and people are not registered, there are no hints :(

Maybe it's not only a hint problem.
And maybe that would explain why it deadlocked twice a day in your case and only every 1-2 weeks in mine.

Thank you!
By: Francesco Segato (fsegatoz) 2011-02-04 09:48:27.000-0600

My system is fine now, thank you ;-)
I have a lot of console logs from system crashes, and they are not necessarily related to internal transfers. In a couple of cases, the system crashed when a single call entered from outside, making three phones ring.
So I don't suspect transfers as a cause of the crash.
By: Ronald Chan (loloski) 2011-02-19 14:04:58.000-0600

fsegatoz: Do you mean the problem cured itself? Could you please state your Asterisk/DAHDI/libpri versions?

Most of our Asterisk implementations function as SIP gateways; we are planning to deploy Asterisk in a very busy call center with lots of BLF and transfer use.

We would really love to know which 1.6 (or other) version is stable enough for this purpose. We have seen a lot of fixes for transfer issues in the tracker, but we would really appreciate some feedback. Thanks.

Regards

Ronald
By: Dieter Jansen (justintonation) 2011-02-19 18:51:25.000-0600

wufan: Sorry not to respond sooner - I have been working on other matters and missed the update emails. I did not notice the pattern you describe in our system although it's a bit hard to be 100% sure - the full logs are very useful but it's sometimes hard to spot patterns amongst the detail.

loloski: Our deadlock problems were resolved by reverting to 1.6.2.8 as suggested by sysreq in issue 18619 - since we reverted we have had several trouble-free weeks. Of course it's in the nature of things that all software has bugs so you may need a bit of functionality that is broken in 1.6.2.8 that we just don't notice. If you have nothing else to go on then 1.6.2.8 may be a good place to try to stabilise your site.

I expect to get some time back on Asterisk shortly and intend then to see if there is anything that looks like it has addressed our issue and will give it a try. I also want to try 1.8 but am waiting for more user experience on how well FreePBX coexists with it - early reports seem promising.
By: rsw686 (rsw686) 2011-02-19 19:33:21.000-0600

If you are looking for a stable version I am running 1.6.1.18 processing 4,000 calls per day with 200 phones. I have had to backport a few patches from later versions to address some issues I ran into. If you need a supported release 1.6.2.6 was released at the same time. However they did change the scheduler for 1.6.2.

I also have a 1.6.2.9 system with 230 days of uptime that has processed 10,000 calls. This is a small setup with 7 phones.

1.8 works fine with FreePBX. I have two systems running 1.8.2.2 with FreePBX 2.7. I'm not a fan of FreePBX 2.8's outbound route dial plan editor. One box has two phones and the other three phones.

To troubleshoot issues I like to find the version that is stable and the next version that is unstable. I look through the changelog between these versions and test reverting various patches until I find the problem patch.
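
When the gap between the last stable and the first unstable build spans many changes, the search described above can be shortened by bisecting instead of reverting patches one at a time. This is a swapped-in technique, not the exact procedure from the comment. The sketch below assumes an ordered list of candidate revisions whose first entry is known good and whose last entry is known bad, and a caller-supplied `is_bad` check (hypothetical; in practice it means running that build under load long enough to trust the result).

```python
def first_bad_revision(revisions, is_bad):
    """Return the earliest revision for which is_bad() is true.

    revisions -- ordered oldest to newest; revisions[0] is known good
                 and revisions[-1] is known bad (caller's responsibility)
    is_bad    -- callable taking a revision and returning True/False
    """
    lo, hi = 0, len(revisions) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid                    # fault was introduced at or before mid
        else:
            lo = mid                    # fault was introduced after mid
    return revisions[hi]


if __name__ == "__main__":
    # Purely illustrative revision numbers and predicate.
    candidates = [100, 110, 120, 130, 140, 150]
    print(first_bad_revision(candidates, lambda rev: rev >= 140))
```

Because this deadlock can take hours or days to appear, each "good" verdict is only as reliable as the soak time behind it, so a bisection step here is far slower than for a bug that reproduces on demand.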
By: Ronald Chan (loloski) 2011-02-19 21:38:43.000-0600

Guys, thanks for your feedback much appreciated

Regards
By: Gregory Hinton Nietsky (irroot) 2011-02-20 01:53:08.000-0600

Please see ASTERISK-17387, and WRT ASTERISK-17287 please try with only the DAHDI timer (see ASTERISK-17407).
By: Freddi Hansen (freddi_fonet) 2011-02-20 16:41:55.000-0600

I upgraded one of my production 1.6.2 servers from revision 299450 to 303455 and then started to get 2-4 deadlocks per day where SIP signalling freezes. I can use the console and make IAX calls, and established SIP calls stay up, but SIP registration fails and of course there are no new SIP calls. The deadlock clears itself after at most 15 minutes.

Someone else wrote that release 302265 should be OK too. Maybe others can chime in so we can narrow down which changes were made.
By: Francesco Segato (fsegatoz) 2011-02-21 05:19:30.000-0600

loloski: To get rid of the problem I downgraded Asterisk to 1.6.2.9 (from 1.6.2.12) and I reprogrammed the Snom BLFs so that phones no longer subscribe to themselves.
By: Ronald Chan (loloski) 2011-02-21 06:53:46.000-0600

fsegatoz: Noted. Another thing: when you say "I reprogrammed Snom BLFs in order to avoid phones to subscribe themselves", do you mean you prevent each phone from subscribing to or monitoring itself via hints?
By: Francesco Segato (fsegatoz) 2011-02-24 06:26:03.000-0600

loloski: correct
I know it's almost useless to make a phone monitor itself, but when you deploy several phones with a single centralised configuration you adopt the same monitoring config for all phones, so it may happen that a phone monitors itself.
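
One way to keep a single centralised configuration while avoiding self-monitoring is to generate each phone's BLF list from the shared extension list and drop the phone's own entry. The sketch below is only an illustration of that idea: the function names and the extension numbers are hypothetical, and it produces plain lists rather than any real phone provisioning format. Note that in this thread the self-BLF change was applied together with a downgrade, so its effect on the deadlock was never isolated.

```python
def blf_targets(own_extension, all_extensions):
    """Return the extensions a phone should monitor, excluding itself."""
    return [ext for ext in all_extensions if ext != own_extension]


def build_blf_plan(all_extensions):
    """Map every extension to its BLF list, derived from one shared list."""
    return {ext: blf_targets(ext, all_extensions) for ext in all_extensions}


if __name__ == "__main__":
    # Hypothetical extensions, used only to show the resulting plan.
    for phone, targets in build_blf_plan(["10", "20", "30"]).items():
        print(phone, "monitors", targets)
```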

By: Marco Marzetti (manzo_zeti) 2011-04-04 02:26:17

We had the same issue with Asterisk 1.6.2.11 and now with 1.8.1.1.
We have many (about 60) Snom phones with BLF enabled.

By: Anthony H (ahave) 2011-04-19 15:16:28

I have the same issue with asterisk 1.6.2.16 and 1.6.2.17.2
By: Volnikov Ivan (ivan) 2011-07-01 06:35:50.194-0500

I have the same issue with Asterisk 1.6.2.18 and 1.6.2.19.
It works normally in 1.6.2.13.
By: Anthony H (ahave) 2011-08-09 03:55:11.403-0500

This issue seems to be the same as ASTERISK-18166 and ASTERISK-18142.
By: Matt Jordan (mjordan) 2013-01-14 14:51:23.561-0600

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions, maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch, which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions. After testing with Asterisk 1.8, if you find this problem has not been resolved, please open a new issue against Asterisk 1.8.

In addition, it looks like this problem was resolved by ASTERISK-18166. If you find you still have problems with a deadlock in Asterisk 1.8, please contact a bug marshal in #asterisk-bugs.