Summary: ASTERISK-16731: Asterisk freezes when reloading dialplan
Reporter: Michael Gaudette (bluefox)
Labels:
Date Opened: 2010-09-25 09:00:36
Date Closed: 2010-12-20 22:30:20.000-0600
Priority: Minor
Regression?: No
Status: Closed/Complete
Components: Applications/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
Description:

Hi,

This has occurred twice in the last 3 days. After doing 2 "dialplan reload" commands, Asterisk just freezes. It doesn't crash (and therefore doesn't restart). It just stays there until I restart it manually, and in the meantime it doesn't process calls or SIP registrations.

****** STEPS TO REPRODUCE ******

Cannot reproduce it reliably, but a dialplan reload (or many of them in quick succession) might do it.

Comments:

By: Leif Madsen (lmadsen) 2010-09-27 15:06:44

This sounds like a deadlock to me. Please follow the instructions below. It might also be useful to provide a backtrace of the running process when this happens.

~~~~~~~~~~~~~~~~~~~

Debugging deadlocks:

Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install)

This will then give you the console command:

core show locks

When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt

# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt

gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!
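The two capture steps above can also be run non-interactively, which helps when the freeze happens outside office hours. The following is a minimal Python sketch, not part of the official instructions. It assumes `asterisk` and `gdb` are on the PATH, that gdb's batch mode (`-batch` with `-ex`) is an acceptable substitute for typing the commands at the `gdb>` prompt, and that the caller already knows the Asterisk PID. The function names and the timeout value are illustrative only.

```python
import subprocess

LOCKS_FILE = "/tmp/core-show-locks.txt"   # same paths as in the instructions
BACKTRACE_FILE = "/tmp/backtrace.txt"


def locks_command():
    """Command line for dumping the lock table via the remote console."""
    return ["asterisk", "-rx", "core show locks"]


def gdb_command(pid):
    """Attach gdb to a running process and print all three backtraces.

    -batch makes gdb exit after the -ex commands have run, so nothing
    has to be typed at the gdb prompt.
    """
    return ["gdb", "-p", str(pid), "-batch",
            "-ex", "bt", "-ex", "bt full", "-ex", "thread apply all bt"]


def capture(cmd, out_path, timeout=60):
    """Run cmd and write its combined output to out_path.

    The timeout is an assumption: a frozen Asterisk may never answer
    the remote console, so the call must not block forever.
    """
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, timeout=timeout)
        data = result.stdout
    except subprocess.TimeoutExpired as exc:
        data = (exc.stdout or b"") + b"\n[command timed out]\n"
    with open(out_path, "wb") as handle:
        handle.write(data)
    return out_path


def collect(pid):
    """Capture both files for the given Asterisk PID (run as root)."""
    capture(locks_command(), LOCKS_FILE)
    capture(gdb_command(pid), BACKTRACE_FILE)
```

Calling `collect(<pid of asterisk>)` while the process is hung should leave both files in /tmp, ready to attach.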


~~~~~~~~~~~~~~


Thank you for your bug report. In order to move your issue forward, we require a backtrace from the core file produced after the crash. Please see the doc/backtrace.txt file in your Asterisk source directory.

Also, be sure you have DONT_OPTIMIZE enabled in menuselect within the Compiler Flags section, then:

make install

after enabling, reproduce the crash, and then execute the instructions in doc/backtrace.txt.

When complete, attach that file to this issue report. Thanks!

By: Michael Gaudette (bluefox) 2010-09-29 14:51:29

Sorry, trying to do this during off-hours, but can't right now.

By: Michael Gaudette (bluefox) 2010-09-29 14:58:46

Sorry, maybe it's been answered before, but what effect will DEBUG_THREADS and DONT_OPTIMIZE have on my server? (I can only reproduce this on a busy server.)

Are we talking about performance loss of 10-15%? 50%? Nothing?

By: Stefan Schmidt (schmidts) 2010-09-29 15:54:38

If your server is so busy that DEBUG_THREADS and DONT_OPTIMIZE have a noticeable side effect, it's time to buy bigger hardware ;)

I have not measured how many % you will lose, but it's not a big deal, because these changes mostly need more memory rather than more CPU time.

By: Michael Gaudette (bluefox) 2010-10-13 09:37:04

Could these attached files potentially show passwords, etc? If so, could you make this issue private so only asterisk dev see the files I will eventually upload?

I will try to reproduce this, it's very random but it did happen once in the last 2 weeks.

In an effort to automate this information gathering, is there a reliable way to programmatically see if Asterisk is frozen?

By: Stefan Schmidt (schmidts) 2010-10-13 09:57:21

If you don't remove the passwords, they would be readable by everyone, but we can make this private if you wish.

You can set up a cron job on another host to check whether Asterisk responds to an OPTIONS message sent by sipsak, for example.
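If sipsak is not available on the monitoring host, the same idea can be sketched in a few lines of Python: send one SIP OPTIONS request over UDP and treat any SIP response within a timeout as "alive". This is only a rough sketch under assumptions not stated in the thread: the server listens on UDP port 5060, the probe user name `probe` is arbitrary, and the function names and the 5 second timeout are made up for illustration.

```python
import socket
import uuid


def build_options(host, port, local_ip, local_port):
    """Build a minimal SIP OPTIONS request as bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex   # magic cookie prefix from RFC 3261
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def sip_alive(host, port=5060, timeout=5.0):
    """Return True if the host sends any SIP response to an OPTIONS probe."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))               # fixes the local address
            local_ip, local_port = sock.getsockname()
            sock.send(build_options(host, port, local_ip, local_port))
            reply = sock.recv(4096)
        except OSError:                              # timeout or ICMP error
            return False
    # Any status code counts: a reply means the SIP stack is still running.
    return reply.startswith(b"SIP/2.0 ")
```

A cron job could call `sip_alive("<asterisk host>")` every minute and trigger the lock and backtrace capture when it returns False. A single lost UDP packet would also produce False, so retrying once or twice before raising an alarm is probably wise.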

By: Michael Gaudette (bluefox) 2010-10-13 09:58:58

I usually remove passwords from config files when I attach them, but I'm not sure how to do that with a backtrace without losing relevant information.  I guess we'll see when the backtrace is produced.

Thanks, I will set up something with sipsak.

By: Stefan Schmidt (schmidts) 2010-10-13 13:52:58

Sorry, I misunderstood you. A backtrace would not show any passwords, so you can attach one without any problems.

By: Michael Gaudette (bluefox) 2010-10-13 21:09:17

I'm willing to help, but changing those two compiler flags (DEBUG_THREADS and DONT_OPTIMIZE) turned my system into something unworkable. "SIP qualifies" went from 10-50ms to 1000+ms. I had about 700 peers on that system and no calls.

Major culprit was user CPU usage.

Had to roll-back to a normal build.

What can I do to help now?  Anything that, for all practical purposes, won't shut down the system?

By: Leif Madsen (lmadsen) 2010-10-14 13:34:58

Unfortunately without that information there is very little, if anything, we can do to move this issue forward. The information is required.

By: Michael Gaudette (bluefox) 2010-10-14 13:58:28

I can't put my system down for a week in the hope of it freezing, unfortunately.  

I'll try to find a reliable way to reproduce it, and if I can, I will do what you asked and trigger the problem. Until then you can close this; I will re-open it if I can go further.

By: Stefan Schmidt (schmidts) 2010-10-14 15:34:08

I have made this private so you can attach your dialplan and tell me exactly what you do when this happens. Maybe I will be able to reproduce this with generated load on the system.

By: Michael Gaudette (bluefox) 2010-10-14 18:32:17

Well, I do nothing.  I simply "dialplan reload" through the CLI. Then a complete freeze.

I can do this 50 times without it freezing, but then for no reason it freezes.

This never happens just out of the blue, always after a dialplan reload (in the CLI or using the manager interface)

By: Michael Gaudette (bluefox) 2010-10-14 18:44:44

BTW, if it matters: this is the size of the dialplan

-= 1623 extensions (5477 priorities) in 361 contexts. =-

By: Stefan Schmidt (schmidts) 2010-10-14 18:56:05

How many hints do you have in your dialplan? Hints can be a bastard when it comes to locking. If you reload the dialplan at runtime, I could imagine a race condition where the hints are being reloaded from the dialplan and, meanwhile, a call is started which wants to use one of those hints.
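For readers unfamiliar with this kind of race, here is a generic illustration of how two threads that take the same two locks in opposite order can block each other. It is not Asterisk code: the lock names `contexts_lock` and `hints_lock` and the "reload" and "call" roles are hypothetical stand-ins for the scenario suspected above, and the timeouts exist only so the demonstration terminates instead of hanging.

```python
import threading


def lock_order_demo(timeout=0.2):
    """Show two threads failing to get each other's lock (no real deadlock).

    Returns a dict mapping thread name to whether it obtained its
    second lock within the timeout.
    """
    contexts_lock = threading.Lock()   # hypothetical: protects the dialplan
    hints_lock = threading.Lock()      # hypothetical: protects the hint list
    barrier = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            barrier.wait()             # both threads now hold their first lock
            ok = second.acquire(timeout=timeout)
            if ok:
                second.release()
            got_second[name] = ok
            barrier.wait()             # hold the first lock until both have tried

    threads = [
        # The "reload" thread takes contexts first, then wants hints.
        threading.Thread(target=worker,
                         args=("reload", contexts_lock, hints_lock)),
        # The "call" thread takes hints first, then wants contexts.
        threading.Thread(target=worker,
                         args=("call", hints_lock, contexts_lock)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return got_second
```

Both entries in the returned dict come back False. With plain blocking `acquire()` calls instead of timeouts, both threads would wait forever, which would look like the freeze described in this report. Whether the real problem has this shape is exactly what the "core show locks" output would show.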

By: Michael Gaudette (bluefox) 2010-10-14 18:57:43

656 hints, 291 subscriptions. Might be it.

By: Michael Gaudette (bluefox) 2010-10-14 19:07:29

Extra note: some of them, maybe 15, are parking hints (hints to see if a parked call is at a specific spot)

By: Stefan Schmidt (schmidts) 2010-10-15 02:02:50

I am going to try this on my system with some load tests. If it is the hints, it should be easy to reproduce.

By: Stefan Schmidt (schmidts) 2010-10-15 02:36:59

I have tested this with 2500 hints, 522 subscriptions on those hints, and 150 calls per second, while doing several dialplan reloads.
What I can see is that SIP processing does not work reliably during a dialplan reload. If I do 10 dialplan reloads directly one after another, I see several retransmits and also many lost packets in sipp, but the system does not get stuck at all. But that is my patched version, which has a little more power than plain 1.6.2.13.
I will try it again with a fresh 1.6.2.13 without any patches.
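The "several reloads directly one after another" part of such a test could be scripted roughly as follows. This is a sketch under assumptions: `asterisk` is on the PATH, and a hung reload shows up as the remote console call not returning within the timeout, which may not hold for every kind of freeze. The function names, the 30 second timeout, and the injectable `runner` parameter (there only so the loop can be exercised without a real Asterisk) are all illustrative.

```python
import subprocess
import time


def reload_command():
    """Command line for one dialplan reload via the remote console."""
    return ["asterisk", "-rx", "dialplan reload"]


def run_reloads(count=10, timeout=30.0, pause=0.0, runner=subprocess.run):
    """Issue `count` reloads back to back.

    Returns None if every reload returned in time, otherwise the
    1-based number of the first reload that did not return.
    """
    for attempt in range(1, count + 1):
        try:
            runner(reload_command(),
                   stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL,
                   timeout=timeout)
        except subprocess.TimeoutExpired:
            return attempt            # likely frozen: capture locks now
        time.sleep(pause)
    return None
```

Running this in a loop while sipp generates call load, and capturing "core show locks" as soon as it returns a number, would give the data requested earlier in this issue.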

By: Stefan Schmidt (schmidts) 2010-10-15 04:34:00

OK, testing this is not so easy, because I have found a change between 1.6.2.13 and 1.6.2.14 which slows down state handling (which blocks hints).

With plain 1.6.2.13 I can do 10 dialplan reloads and lose only 50 calls per second. With 1.6.2.14-rc1 I don't even have to do a dialplan reload to lose this amount of calls :(

But I still think there could be a race condition which causes the lock you ran into.

By: Michael Gaudette (bluefox) 2010-10-15 07:35:57

I'll definitely try 1.6.2.14 when it comes out.  Until then, here might be a lead: one of the 4-5 times it happened, I was under some sort of bot attack (you know, trying exten 1001,1002, 1003....) and I reloaded the dialplan.

The other times I wasn't, but that might be an easy way to reproduce (you have any SIP-attack software handy?)

By: Stefan Schmidt (schmidts) 2010-10-15 08:31:19

SIPVicious is the scanner you mean, and it is slow compared to sipp ;)

With sipp you can send several hundred SIP messages (INVITEs, for example) per second. I didn't hit any problems with dialplan reload that way: only some lost messages, but no locking at all.

By: Michael Gaudette (bluefox) 2010-12-20 14:39:33.000-0600

It hasn't happened since I moved to a version with a deadlock fix (1.6.2.15SVN, 1.6.2.16rc-1 is fine too).

You can close this, I imagine.

By: snuffy (snuffy) 2010-12-20 22:30:19.000-0600

Reporter claims it's fixed in a later revision.