ASTERISK-17245: Asterisk deadlocks with no errors

[Home]

Summary: ASTERISK-17245: Asterisk deadlocks with no errors

Reporter: wufan (wufan) Labels:

Date Opened: 2011-01-14 07:57:05.000-0600 Date Closed: 2011-06-07 14:05:00

Priority: Minor Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) backtrace.txt
( 1) backtrace-threads.txt
( 2) backtrace-threads-20110125.txt
( 3) core-show-locks.txt
( 4) core-show-locks-20110125.txt

Description: Hi,

we have every ~1-2 weeks a deadlock.
Max 10 cc. About 40 users.
Never get any error message or warning.

Last time it happend 2010-12-21 (with version 1.6.2.13) the console outputs some sip destroying packets and the peers go offline about 5 minutes later availible again and so on.. the asterisk didnt answered to the sip packets or answered after a few seconds.

This time it happened (1 hour ago) (with version 1.6.2.15) there are no such errors .. only nothing :)

When i wrote "sip show peers" it showed me the peers
but console dial 50@context did nothing.

Comments: By: David Woolley (davidw) 2011-01-14 08:28:15.000-0600

Which version of dahdi do you have? Does your machine use HPET to simulate the RTC? How close to 5 minutes is the dropout? Which kernel version and OS if not Linux?

By: wufan (wufan) 2011-01-14 08:42:27.000-0600

DAHDI Version: 2.4.0 Echo Canceller: MG2
how can i found out if my machine use hpet?
the dropoutS last time was:

[Dec 15 16:48:47] NOTICE[15769]: chan_sip.c:22986 sip_poke_noanswer: Peer '661' is now UNREACHABLE! Last qualify: 82
[Dec 15 16:48:47] NOTICE[15769]: chan_sip.c:18378 handle_response_peerpoke: Peer '661' is now Reachable. (146ms / 3000ms)
[Dec 15 16:54:35] NOTICE[15769]: chan_sip.c:22986 sip_poke_noanswer: Peer '661' is now UNREACHABLE! Last qualify: 87
[Dec 15 16:54:36] NOTICE[15769]: chan_sip.c:18378 handle_response_peerpoke: Peer '661' is now Reachable. (153ms / 3000ms)

and lots of chan_sip.c:3831 retrans_pkt: Maximum retries exceeded on transmission 3c3bcbfb13af-3zjooegz1v5b for seqno 1 (Critical Response) -- See doc/sip-retransmit.txt.
By: Stefan Schmidt (schmidts) 2011-01-14 08:45:26.000-0600

typical deadlock situation.

please try version 1.6.2.16rc1 cause a deadlock patch which could solve your problem is included it.

thx

stefan
By: David Woolley (davidw) 2011-01-14 08:45:42.000-0600

Dahdi 2.4.0 shouldn't have the problem I'm thinking about.
By: wufan (wufan) 2011-01-14 09:07:17.000-0600

sorry david. did just see it:
Linux asterisk-de 2.6.32-21-generic-pae ASTERISK-28-Ubuntu SMP
thank you!
By: David Woolley (davidw) 2011-01-14 09:14:34.000-0600

That version of Linux is too recent for the problem, as well. (There is a problem in some versions of Linux where an interrupt latency problem can cause an HPET wrap round in the RTC simulation. Typically that wrap round takes about 300 seconds, during which an Asterisk relying on dahdi-dummy has no internal timing. Dahdi 2.4.0 no longer uses the RTC and later versions of Linux handle the RTC simulation safely. Your mention of 5 minutes made me wonder if you had this problem.)

By: wufan (wufan) 2011-01-14 09:36:14.000-0600

ok thanks :)
maybe there are 2 different problems.
the last time (on 15.dec and 21.dec) it was 2.6.2.13 with dahdi 2.3.0.1 and the strange sip ON/OFLine notices. but the incoming calls were logged (dont know if it helps: dahdi_transcode wasnt enabled). Even a machine reboot didn´t solved the problem. after about 1 hour and 30 it worked again without any action from me.

This time nothing appered with no messages. it just a hangs.
thank you!!
By: Leif Madsen (lmadsen) 2011-01-17 08:35:36.000-0600

wufan: per schmidts, please try the most recent version of Asterisk.
By: Dieter Jansen (justintonation) 2011-01-18 17:37:44.000-0600

What deadlock patch is in 1.6.2.16rc1 that is not in 1.6.2.15? I couldn't see it in the ChangeLog.
By: wufan (wufan) 2011-01-19 03:08:16.000-0600

i couldn't see a deadlock patch in the changelog for 1.6.2.16 either.
By: Dieter Jansen (justintonation) 2011-01-19 04:58:49.000-0600

There is a fix for issue 0018310 in the pipeline for 1.6.2.17 which may be relevant to one or both of our problems wufan.

Are you using a lot of BLFs in your installation? We have most incoming calls go to a pool of three phones in the reception/admin area and almost every phone in the place has those three monitored on a BLF. The upshot is that there are a lot of SIP NOTIFYs (maybe 40-50 or so) each time a call comes in or when the call is answered.

I've go no evidence, but it looks like an invitation to exercise a race condition to me.

By: wufan (wufan) 2011-01-19 05:52:39.000-0600

thanks justintonation
i am hoping it is this issue :)
hope it gets released soon. have about 120 subscriptions and also snom 320 and snom 360.
20 phones has about 6 BLF each.
I did configured it as Extension (not as BLF) in the snom, but it worked.
any hints in the backtrace that this is this race condition issue?
By: Dieter Jansen (justintonation) 2011-01-19 06:53:30.000-0600

Sorry wufan - I'm really only guessing, I'm not in a position to credibly evaluate the contents of the backtrace. Whilst I've been in IT for more than 40 years, my serious developer days are a long time in the past and I'm only just starting to dabble with asterisk and linux. I hope to get our production setup duplicated on a server where I've built Asterisk from source so I can better document issues and also try patches (the FreePBX config that I've inherited is pretty baroque so this is not a trivial thing - the asterisk build has been the simplest bit so far!)
By: Stefan Schmidt (schmidts) 2011-01-19 07:12:30.000-0600

wufan i am sorry i meant a patch which was not really in release state but was submited in revision 302265 so there is a big chance for 1.6.2.17.

this should fix the race condition for this deadlock situation at all.
By: Leif Madsen (lmadsen) 2011-01-19 14:43:21.000-0600

Can you test the latest 1.6.2 branch? 1.6.2.17-rc1 does not contain this, but if it is fixed in the branch I will do an rc2 and get it imported into 1.6.2.17-rc2.
By: wufan (wufan) 2011-01-19 16:51:26.000-0600

lmadsen, stefan, thanks for your help!
can i use this branch for a production system?
the problem occurs "only" every 1-2 week so i dont know if it is worth it (for my case) or just wait for 1.6.2.17.
By: Sébastien Couture (sysreq) 2011-01-20 01:16:53.000-0600

We've had the same problem here, where Asterisk would stop processing SIP requests altogether. Unfortunately, we haven't been able to provide the output of a 'core show locks' since enabling the DEBUG_THREADS compilation flag on our servers would more than double the CPU utilization and affect calls through audio quality issues and such.

Although, we did notice that the issue seemed to arise from 1.6.2.10 up until 1.6.2 SVN r299242M (a little after 1.6.2.15, which had issues with blind transfers.. so we had to upgrade). I have 1.6.2.8 running on a server that used to hang once a day when using 1.6.2.15 and I haven't had any problems since about a week now.

Here's a couple bug tracker issues I stumbled upon while seeking for patches that might be related to the issue at hand:

https://issues.asterisk.org/view.php?id=18036
https://issues.asterisk.org/view.php?id=18511
https://issues.asterisk.org/view.php?id=18132
By: Sébastien Couture (sysreq) 2011-01-20 01:29:56.000-0600

Actually.. as Leif suggested, I'll try the latest 1.6.2 branch, and I'll keep you posted tomorrow.
By: Sébastien Couture (sysreq) 2011-01-20 12:15:49.000-0600

No issue on the two servers where I've upgraded to the latest 1.6.2 branch yet. I've finally had one of my 1.6.2 SVN r299242M servers with DEBUG_THREADS enabled hang this morning and was able to gather backtraces and 'core show locks' outputs.

Since we have a support plan with Digium and already had a ticket opened regarding this issue, I've sent them the debugging output files and told them about this bug tracker issue (as well as a possible link with ASTERISK-16961).

I'll keep you informed.
By: Dieter Jansen (justintonation) 2011-01-24 18:48:02.000-0600

@wufan: I've been meaning to follow up on your comment 0130662 that you were using an "Extension" function key defininition on your SNOM phones rather than a "BLF".

I had been using the term BLF generically and I've just checked our phones and they are also using "Extension" definitions just as yours are. I have no idea if this makes any difference to the protocol used but thanks for pointing it out.

@sysreq: I've tried reverting to 1.6.2.8 per your comments and so far it has not deadlocked. Thanks for sharing your observations.
By: Sébastien Couture (sysreq) 2011-01-24 19:09:02.000-0600

I haven't had any problems with what was the latest 1.6.2 branch (rev. 302265) five days ago either. I'll give it a few more days.
By: wufan (wufan) 2011-01-25 10:54:44.000-0600

A hour ago i had another crash (didnt upgraded. its 1.6.2.15)-
But the behavior was different.
The Sip didnt send sip packets or didnt get the answers.
The incoming dahdi calls were logged.
-- Started music on hold, class 'default', on DAHDI/4-1
== Using SIP RTP CoS mark 5
== Extension Changed 77[custom] new state Ringing for Notify User 12
== Extension Changed 77[custom] new state Ringing for Notify User 81
== Extension Changed 77[custom] new state Ringing for Notify User 78
== Extension Changed 77[custom] new state Ringing for Notify User 83
But there was no "SIP/77 is ringing"
and no response to outgoing calls.

Shortly before the "crash" there was a weired transfer action.
83 answered a call.
83 want to transfer it to 12.
12 was busy.
12 was idle.
12 calls 83 (back)
83 was busy.
12 was idle.
83 transfered it to 12.
Dont know if this was by accident or it is related to my other bug.
I only found some (snom) transfer bugs with 1.8.
i upload backtrace and locks.
thanks
By: Sébastien Couture (sysreq) 2011-01-26 09:37:49.000-0600

I still haven't ran into any problems with SVN 1.6.2 r302833M (sorry, wasn't 302265 as mentioned previously).. it's been a full week now.
By: wufan (wufan) 2011-01-26 09:48:14.000-0600

thanks for the info. - hope the best for you.
what me make a little bit perplex is that i had a crash "only" every 1-2 week. By over 1000 calls a day. and the other bugreporter have crashes twice a day.
I think i have 2 issues and this bugreport was an exception.
The default behaivor on a crash is like in comment 0131011 (without the weird transfer)..
i am a little bit confused right now .. :(
By: Leif Madsen (lmadsen) 2011-02-07 14:27:11.000-0600

I'm closing this issue for now as it looks like we might be moving into other issues that are not related to the initial issue, so if you continue having issues please open a new issue and provide the necessary information to move it forward (such as backtraces if you're getting crashes, etc..)

Thanks!