ASTERISK-10182: Asterisk crashes on reload of pbx

Summary: ASTERISK-10182: Asterisk crashes on reload of pbx_config.so

Reporter: dtyoo (dtyoo)
Labels:
Date Opened: 2007-08-27 21:21:22
Date Closed: 2007-11-09 16:14:32.000-0600
Priority: Critical
Regression? No
Status: Closed/Complete
Components: PBX/pbx_config
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) reload-bt-btfull.txt

Description: After a recent upgrade from 1.2 to 1.4 we are getting intermittent crashes when we issue a "reload pbx_config.so" from the asterisk console. Notably, our dialplan is quite large (~50MB). This did not appear to be a problem under 1.2. bt and bt full attached. Since it's not always reproducible, I'm hoping that the bt will prove useful / informative to someone.

****** ADDITIONAL INFORMATION ******

Dell 1950, CentOS 4.5, 2 x Xeon 3GHz, 2GB RAM

Comments: By: Steve Murphy (murf) 2007-08-30 15:18:42

Well, my gut instinct is that this is some thread related thing. When the dialplan gets big, it takes a long time to reload... and other things going on at the same time.... the code is supposed to handle this well, but some (imp)robable event might... well... cause a problem.

We'll see.

The backtrace says it died while "destroying" (freeing up) a context, and an exten, and died in the free func. So, something is getting freed twice, or mem is corrupted, or something...!

Is there any chance at all, that two reloads are being run in parallel? I know, I know, I know, that should be safe....
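For illustration only, here is a minimal sketch of why overlapping reloads could lead to a double free, and how a single lock around the pointer swap avoids it. Nothing below is taken from the backtrace or from pbx_config; every name (demo_context, demo_exten, demo_reload and so on) is hypothetical and the structures are simplified stand-ins, not the real Asterisk ones. The idea is that only the thread that detaches the old dialplan under the lock is allowed to free it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for a dialplan context and its extensions;
 * these are NOT the real Asterisk structures. */
struct demo_exten {
    int priority;
    struct demo_exten *next;
};

struct demo_context {
    char *name;
    struct demo_exten *root;
};

/* One lock guards the "current dialplan" pointer and the counter. */
static pthread_mutex_t demo_reload_lock = PTHREAD_MUTEX_INITIALIZER;
static struct demo_context *demo_current = NULL;
static int demo_retired = 0; /* old dialplans detached so far */

/* Free a context and all of its extensions. The caller must own ctx
 * exclusively; calling this twice on the same pointer is a double free. */
static void demo_destroy(struct demo_context *ctx)
{
    if (!ctx)
        return;
    assert(ctx->name != NULL); /* invariant set up by demo_build() */
    while (ctx->root) {
        struct demo_exten *e = ctx->root;
        ctx->root = e->next;
        free(e);
    }
    free(ctx->name);
    free(ctx);
}

/* Build a fresh context holding n_extens extensions. */
static struct demo_context *demo_build(const char *name, int n_extens)
{
    struct demo_context *ctx = calloc(1, sizeof(*ctx));
    if (!ctx)
        return NULL;
    ctx->name = malloc(strlen(name) + 1);
    if (!ctx->name) {
        free(ctx);
        return NULL;
    }
    strcpy(ctx->name, name);
    for (int i = 0; i < n_extens; i++) {
        struct demo_exten *e = calloc(1, sizeof(*e));
        if (!e) {
            demo_destroy(ctx);
            return NULL;
        }
        e->priority = i + 1;
        e->next = ctx->root;
        ctx->root = e;
    }
    return ctx;
}

/* Build the new dialplan first, then swap it in under the lock.
 * Exactly one caller ever receives a given old pointer, so it is
 * freed exactly once even if several reloads overlap. */
int demo_reload(const char *name, int n_extens)
{
    struct demo_context *fresh = demo_build(name, n_extens);
    if (!fresh)
        return -1;

    pthread_mutex_lock(&demo_reload_lock);
    struct demo_context *old = demo_current;
    demo_current = fresh;
    if (old)
        demo_retired++;
    pthread_mutex_unlock(&demo_reload_lock);

    demo_destroy(old); /* safe: no other thread can still reach old here */
    return 0;
}

/* Number of old dialplans that have been swapped out and freed. */
int demo_retired_count(void)
{
    pthread_mutex_lock(&demo_reload_lock);
    int n = demo_retired;
    pthread_mutex_unlock(&demo_reload_lock);
    return n;
}

/* Number of extensions in the dialplan that is currently live. */
int demo_current_exten_count(void)
{
    int n = 0;
    pthread_mutex_lock(&demo_reload_lock);
    if (demo_current)
        for (struct demo_exten *e = demo_current->root; e; e = e->next)
            n++;
    pthread_mutex_unlock(&demo_reload_lock);
    return n;
}
```

Calling demo_reload("default", 3) and then demo_reload("default", 5) leaves one retired dialplan and five live extensions. Without the lock, two threads could both read the same old pointer and both pass it to demo_destroy(), which is the "freed twice" case described above. The sketch ignores readers that may still be walking the old dialplan, which a real implementation would also have to handle.
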

Is there any way you can send, perhaps, a sanitized version of that 50 meg of config file so we can see if we can reproduce the problem?
By: dtyoo (dtyoo) 2007-08-31 10:54:52

murf-

I'm pretty sure there weren't 2 reloads going on simultaneously.

I've been trying to reproduce this issue, and it's definitely not trivial to reproduce. We've had a lot of weird crashes since upgrading from 1.2 to 1.4, and only on our production servers that are under load. My suspicion is that there is some sort of underlying memory issue that is manifesting itself in different sorts of ways.

I'll see what I can do about sanitizing our files for you, but this is going to be a bit of a job in and of itself.

It's a little off topic, but I wanted to ask your opinion. Given the size of our dialplan, reloads, particularly of chan_sip, are causing some problems for us. It seems that asterisk stops processing sip for some amount of time during reloads. This causes weird dialing behavior for our end users trying to make / receive calls during the reloads, and some of our sip endpoints start re-registering to other, backup servers when the reloads occur. Do you have any suggestions / thoughts on this? I was going to start looking at realtime as a possible solution, but given the size and complexity of our dialplan, and integration with our existing backend systems, this is probably not going to be a quick fix.
By: dtyoo (dtyoo) 2007-09-11 11:31:15

murf-

I privately sent over our configs to you. Let me know if you need anything else.
By: Steve Murphy (murf) 2007-11-07 16:23:23.000-0600

OK, I've pretty much spent the whole day working on this one; I've gone thru 1.4 with valgrind, to find any memory leaks or gross uninitialized var refs, and found a few minor leaks and repaired them in 1.4 and trunk. But I have also previously cleaned up the config code as far as freeing up extension/contexts/priorities. Since your crash is related to possible corruption in that part of the code, I am now quite curious to see if that fixed the problem.

BTW, I have been playing with your extensions, and find that on my little test machine, your config takes anywhere from 3.5 minutes to 5 minutes to "extensions reload" !! Someday, I'm going to try to find out why it's so slow. The initial load is much faster. (Maybe something slow in merge_and_delete()).
By: dtyoo (dtyoo) 2007-11-08 09:50:11.000-0600

Murf-

Awesome on the memory leak fixes. I'm sure these will improve long term stability for us as we still reload our dialplan a lot. In general though, we haven't been seeing this specific crash lately. We are running 1.4 trunk r87739 these days.

Other than this crash, the most serious issue for us is the chan_sip.so reload issue I had previously mentioned. With 5000 sip peers on a single asterisk server, sip call processing is stopped for approx 45 sec when chan_sip reloads on a dual CPU Dell 1950. We are in the process of migrating our sip peers to realtime to try to avoid this issue. You may want to check this out since you have our dialplan. Extension / pbx_config reloads are much less of an issue because calls do not stop processing on these reloads. We have had to back off our sip reloads to a bare minimum to avoid causing service issues, but this limits our ability to be responsive to necessary config changes.
By: Steve Murphy (murf) 2007-11-09 16:14:31.000-0600

OK, if you have been using the latest 1.4, and see no more crashes, then let's close this bug.

As to the sip reload causing a halt in processing, I've had Corydon76 say that your move to realtime may not be a bad one, as it will "mean that we don't spend any time reloading peers for which we have no activity".
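To make that point concrete, here is a small sketch of the load-on-demand idea, under loud assumptions: the "backend" is just a hard-coded array standing in for a realtime database table, and all names (lazy_backend, lazy_find_peer, lazy_reload) are hypothetical. It is not how chan_sip or realtime is actually implemented. It only shows that a reload can become "forget the cache" instead of "re-read every peer", so the cost is paid per active peer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LAZY_MAX_PEERS 4

/* Hypothetical backing store standing in for a realtime database table. */
static const char *const lazy_backend[LAZY_MAX_PEERS] = {
    "alice", "bob", "carol", "dave"
};

/* In-memory cache: a slot is filled only when that peer is first needed. */
static const char *lazy_cache[LAZY_MAX_PEERS];
static int lazy_backend_loads = 0; /* how often we had to hit the backend */

/* "Reload": drop cached peers rather than re-reading all of them. */
void lazy_reload(void)
{
    for (size_t i = 0; i < LAZY_MAX_PEERS; i++)
        lazy_cache[i] = NULL;
}

/* Look a peer up, loading it from the backend only on a cache miss. */
const char *lazy_find_peer(const char *name)
{
    assert(name != NULL);

    /* Fast path: the peer was already active since the last reload. */
    for (size_t i = 0; i < LAZY_MAX_PEERS; i++)
        if (lazy_cache[i] && strcmp(lazy_cache[i], name) == 0)
            return lazy_cache[i];

    /* Slow path: fetch just this one peer from the backend. */
    for (size_t i = 0; i < LAZY_MAX_PEERS; i++) {
        if (strcmp(lazy_backend[i], name) == 0) {
            lazy_backend_loads++;
            lazy_cache[i] = lazy_backend[i];
            return lazy_cache[i];
        }
    }
    return NULL; /* unknown peer */
}

/* Number of backend fetches performed so far. */
int lazy_load_count(void)
{
    return lazy_backend_loads;
}
```

Looking up "bob" twice costs one backend fetch; after lazy_reload() the next lookup of "bob" costs one more, while peers that never show up cost nothing at all.
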

Furthermore, I have opened bug 11210 in your name, to report the sip reload locking up the sip network processing.