ASTERISK-08338: Call processing stops during reload on systems with large dialplan


Summary: ASTERISK-08338: Call processing stops during reload on systems with large dialplan

Reporter: callguy (callguy) Labels:

Date Opened: 2006-12-12 16:10:43.000-0600 Date Closed: 2008-03-07 13:23:28.000-0600

Priority: Major Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) FastAstPerf.png
( 1) patch1

Description: We have a relatively large dialplan (6719 extensions (31876 priorities) in 934 contexts). Whenever we reload to update for a configuration change, call processing stops until the reload is finished.

As the dialplan has grown, the delay has become severe, causing peers to become unreachable and other call delivery problems.

We have seen this against many versions, but have most recently been using:

SVN-branch-1.2-r48356

Comments: By: Serge Vecher (serge-v) 2006-12-12 16:28:31.000-0600

what about the latest 1.4?
By: Donny Kavanagh (donnyk) 2006-12-12 23:00:28.000-0600

Try doing your reload with debug and verbose both set to 0.
By: Russell Bryant (russell) 2006-12-13 01:55:41.000-0600

Yes, I would definitely say that is a large dialplan. Are all of your extensions actually unique in such a way that you can't combine them using patterns?

Anyway, I would say this is a "known issue" given the current design of how the dialplan is stored internally. It is going to take some significant architecture changes to improve this situation. Some of the necessary work is already being done, but it's not something that will be fixed extremely quickly, and certainly not in the 1.2 branch.

In the meantime, to improve performance, I would combine as much of your dialplan into patterns as possible. Also, you would probably get better performance keeping the configuration in a database as opposed to the text file. I'm referring to the "static" database configuration method, not realtime extensions.
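To illustrate the suggestion about patterns, here is a minimal sketch of how one pattern extension can stand in for many literal ones. It is not Asterisk's matcher: the function name `matches` is hypothetical, and it assumes only the `X` (0-9), `Z` (1-9) and `N` (2-9) classes plus a trailing `.` wildcard, ignoring bracket sets and `!`.

```python
def matches(pattern: str, dialed: str) -> bool:
    """Simplified Asterisk-style extension match (hypothetical helper).

    Assumption: only X, Z, N and '.' are supported; literals match themselves.
    """
    if not pattern.startswith("_"):
        # No leading underscore means a literal extension.
        return pattern == dialed
    classes = {"X": "0123456789", "Z": "123456789", "N": "23456789"}
    body = pattern[1:]
    for i, p in enumerate(body):
        if p == ".":
            # '.' matches one or more remaining characters.
            return len(dialed) > i
        if i >= len(dialed):
            return False
        allowed = classes.get(p, p)  # a literal character only allows itself
        if dialed[i] not in allowed:
            return False
    return len(dialed) == len(body)


# One pattern replaces the hundred literal extensions 1200..1299.
covered = all(matches("_12XX", str(n)) for n in range(1200, 1300))
```

With this sketch, a context holding extensions 1200 through 1299 as separate entries could in principle be reduced to the single entry `_12XX`, provided all of them run the same priorities.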
By: callguy (callguy) 2006-12-13 06:50:56.000-0600

While there is a bit more optimization that could be done (I estimate around 10%), we are already using patterns extensively to reduce the size of the dialplan as much as possible. We are also adding quite a bit to it on an ongoing basis, so any optimization is basically a band-aid solution.

Static database may be an eventual option - but is there any difference in the handling of reloads in 1.4 vs 1.2?

Even if call processing still pauses during reload if the delay is short enough it would effectively be harmless. The only way to handle this robustly (that I can think of) would be to maintain checksums internally of the various configurations and only reload those that have changed.
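The checksum idea could be sketched roughly as follows. This is only an illustration of the proposal, not anything Asterisk does: the names `digest` and `changed_configs`, and the choice of SHA-256, are assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def digest(text: str) -> str:
    """Checksum of one configuration file's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_configs(configs: Dict[str, str], seen: Dict[str, str]) -> List[str]:
    """Return the names whose contents differ from the last recorded checksum.

    `seen` maps config name -> checksum from the previous reload and is
    updated in place, so a second call with unchanged input returns [].
    """
    changed = []
    for name, text in sorted(configs.items()):
        current = digest(text)
        if seen.get(name) != current:
            changed.append(name)
            seen[name] = current
    return changed


seen: Dict[str, str] = {}
configs = {"extensions.conf": "[default]\n", "sip.conf": "[general]\n"}
first = changed_configs(configs, seen)    # everything is new on first load
second = changed_configs(configs, seen)   # nothing changed
configs["sip.conf"] += "context=default\n"
third = changed_configs(configs, seen)    # only sip.conf changed
```

A reload would then only re-parse the files named in the returned list, so an edit to one small file would not force the large dialplan to be rebuilt.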
By: Donny Kavanagh (donnyk) 2006-12-13 08:34:12.000-0600

See my suggestion above: try your reload with the console output disabled. It won't solve the problem, but it should improve it.
By: Joshua C. Colp (jcolp) 2006-12-18 20:38:14.000-0600

I talked to murf today about possible optimizations we could do to make reloading better, and we came up with something that may appear in a separate branch for trunk. The suggestions provided, though, are probably the best we can do for 1.2.
By: callguy (callguy) 2006-12-18 20:44:23.000-0600

That's great news - if you have anything you'd like us to test, we're happy to do so. The suggestions do improve the situation, but we're also adding to the dialplan pretty consistently, so any of these items are really a stopgap.
By: Russell Bryant (russell) 2007-01-15 18:20:33.000-0600

Since this isn't really a bug, and just a downfall of using a text file for an extremely large dialplan, I'm going to go ahead and close this out. As mentioned before, the performance of the related code is already being worked on and will be improved in the future.
By: Steve Murphy (murf) 2007-01-15 19:20:22.000-0600

Ok, I'm digging into this. I need to get a feel for your dialplan.
You have a fairly large number of contexts. Tell me if they all contain the same number of extensions, or, if some are bigger than others, how many extensions are in the biggest contexts. Do all your extensions have the same number of priorities? If not, how big are the biggest extensions? Are the biggest extensions in the biggest contexts?

This may seem irrelevant, but it may not be. I'd like to get a feel where the engine is spending most of its time during a reload, with your dialplan.
By: Russell Bryant (russell) 2007-01-15 19:46:12.000-0600

Since murf has taken more interest in looking at this, I'm reopening it. :)
By: Steve Murphy (murf) 2007-01-15 23:53:09.000-0600

That'll teach me to wait an hour between reading the page and adding a comment!

This issue isn't as simple as I first thought, but I'd like to do an experiment or two to see if there is some solution, before we write it off.
By: callguy (callguy) 2007-01-16 14:07:01.000-0600

"show dialplan" gives:
-= 9462 extensions (43080 priorities) in 1086 contexts. =-

There are only a few contexts that are very large. Most contexts are small, e.g. fewer than 50 extensions. Most extensions have fewer than 10 priorities, averaging maybe 5 or 6. The largest context has:

-= 1711 extensions (10487 priorities) in 1 context. =-

Everything else will have far less than this.

The second largest context has:

-= 85 extensions (553 priorities) in 1 context. =-

then

-= 75 extensions (371 priorities) in 1 context. =-

Everything else has less than 100 extensions, and less than 500 priorities.
By: Steve Murphy (murf) 2007-01-16 19:20:29.000-0600

OK, callguy!

Can you share your dialplan? If not, could you "sanitize" it, modifying the sensitive stuff until it is unrecognizable? If not, can you give me a list of contexts, with the number of extensions in each context, the length of each extension name, whether it is a pattern or not, and, for each exten, how many priorities are in there? From this, I can generate a statistically equivalent test dialplan, and use that to find out where the lockups are happening and where the time is being spent, and plan a counterattack.

By: Peng Yong (ppyy) 2007-01-17 08:20:37.000-0600

Patterns and macros are your friends.
By: Steve Murphy (murf) 2007-01-17 16:54:13.000-0600

Callguy--

I'm simulating your load right now. I slapped together a 1110-context, 14K extension, 84k priority dp.

How long does it take for your dialplan to load? On a completely unloaded test system, I see .75 seconds. How long for you?
By: Steve Murphy (murf) 2007-01-17 18:12:54.000-0600

To aid you in measuring time delays, I have created a branch,
http://svn.digium.com/svn/asterisk/team/murf/bug8574-1.2

You can either build it or use it to get a patch for the debug messages sent to the console. --I use ast_log(LOG_NOTICE, ....) to print out NOTICE messages to the log/console.

Try seeing if you can get output on a fully loaded system. This will give me an idea of what is happening on your system. It looks to me like the principal troublemaker/target would be the context_destroy(NULL,registrar) func. Please post your NOTICE messages.
By: callguy (callguy) 2007-01-17 18:32:23.000-0600

we're seeing roughly .25 seconds, but the behavior differs quite a bit between an idle and loaded system (possibly due to the number of SIP messages getting interrupted, etc.).

I'll take a look at the debug branch you posted and see what it produces, but will need to do it against an idle machine to start - we can't easily do it on our production boxes, but should be able to set-up a test with simulated traffic.
By: Steve Murphy (murf) 2007-01-18 22:56:14.000-0600

callguy--

I'm an impatient cuss, I guess. I had an idea (or 5) about how I might reduce the time spent in the "critical section" that would lock up the dialplan engine too long. I ended up with two candidates, and I coded up the second of the two. The numbers look good on an unloaded system.

But, it may end up a total flop on a loaded system. I've attached a patch (patch1), or you can diff the team/murf/bug8574-1.2 branch yourself; PLEASE, test with these changes, and see if it helps!

BTW-- you might want to read about the experiment I did with large dialplans, and the work I'm doing (when I have a moment) in my "fast-ast" branch. If you have near 1000 contexts, then you are experiencing big slowdowns due to linear search algorithms gone exponential. See http://www.asterisk.org/node/112
There was a plot that used to be attached to that page, but in the web redesign, it got lost. So, I'll upload FastAstPerf.png. At over 1000 contexts, all extensions.conf processing will be slowed to roughly 16% of normal. Add to that the high number of extensions in a few of the contexts, with code in those extensions running maybe at .1% of normal, and you get a total speed of .016% of normal.

This slowdown represents the burning of large numbers of cpu cycles searching large linear lists for items, and will slow all aspects of asterisk.
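The cost of searching a linear list can be shown with a small sketch. It does not reproduce Asterisk's internal structures: `find_linear`, `contexts` and `index` are hypothetical names, and a Python dict stands in for whatever hashed or indexed lookup a faster engine might use.

```python
from typing import Dict, List, Optional, Tuple

Context = Tuple[str, List[str]]  # (context name, extensions); illustrative only


def find_linear(contexts: List[Context], name: str) -> Tuple[Optional[List[str]], int]:
    """Scan the list front to back, counting string comparisons."""
    comparisons = 0
    for ctx_name, extens in contexts:
        comparisons += 1
        if ctx_name == name:
            return extens, comparisons
    return None, comparisons


# 1000 contexts, roughly the scale discussed in this report.
contexts: List[Context] = [(f"ctx{i}", [f"{i}00"]) for i in range(1000)]

# Hashed alternative: one dictionary probe regardless of table size.
index: Dict[str, List[str]] = dict(contexts)

extens, cost = find_linear(contexts, "ctx999")  # worst case: last entry
hashed = index["ctx999"]
```

In the worst case the linear scan performs one comparison per context, and the same kind of scan is repeated for the extensions inside the context, so the per-lookup cost grows with the size of the dialplan, whereas the hashed lookup stays roughly constant.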

So, you'll absolutely LOVE the work I'm doing in fastAst. I get rid of the linear list searches, and make response flat, no matter how big the dp. You'll doubtless notice that with the fastAst improvements, your dialplan will run 1000x faster, which will probably mean a great improvement in call volume capability.

The downside: the fastAst work won't appear in Asterisk until the 1.6 release, iff I can find time to finish it up and solidify it.
By: Steve Murphy (murf) 2007-01-23 13:45:28.000-0600

callguy!! (prod) (prod) I'm dying of curiosity! did my patch help?? (prod) (prod!)
By: Steve Murphy (murf) 2007-02-09 12:45:25.000-0600

Seems like a shame to have done all that work, and have it go for naught.

I'm closing this bug until callguy can find some time and test it. Feel free to re-open when you have some results.
By: Steve Murphy (murf) 2008-03-07 13:23:27.000-0600

6002 required a major overhaul of merge_contexts_and_delete, which is where this bug required changes, also. So, when I fixed the problems in 6002, as a side effect, I HAD to fix the problem reported here. I would tend to think that the methods I used this time around are probably much superior to what I did previously, so this is a win-win. Write-lock time is now a minimum of 3-6 microseconds, affected only by the number of hints to save/restore.
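One general way to keep a write lock that short is to build the new table outside the lock and hold the lock only for the pointer swap. The sketch below shows that technique under stated assumptions; it is not the actual merge_contexts_and_delete code, the `Dialplan` class and its methods are hypothetical, and the saving and restoring of hints is left out.

```python
import threading
from typing import Dict, Optional


class Dialplan:
    """Toy dialplan table: context name -> {extension -> application}."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._contexts: Dict[str, Dict[str, str]] = {}

    def lookup(self, context: str, exten: str) -> Optional[str]:
        # Call processing takes the lock only for the duration of a lookup.
        with self._lock:
            return self._contexts.get(context, {}).get(exten)

    def reload(self, parsed: Dict[str, Dict[str, str]]) -> None:
        # Slow part: build the complete new table with no lock held,
        # so lookups against the old table keep working meanwhile.
        new_contexts = {name: dict(extens) for name, extens in parsed.items()}
        # Fast part: the lock is held only while the reference is swapped.
        with self._lock:
            old, self._contexts = self._contexts, new_contexts
        # The old table is released after the lock has been dropped.
        del old


dp = Dialplan()
dp.reload({"default": {"100": "Dial(SIP/100)"}})
before = dp.lookup("default", "100")
dp.reload({"default": {"200": "Hangup()"}})
after_old = dp.lookup("default", "100")
after_new = dp.lookup("default", "200")
```

The design choice is that the time call processing can be blocked no longer depends on the size of the dialplan, only on the swap itself.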

Sorry, these fixes applied only to trunk. (1.6.something, eventually)
They might have gotten in sooner if the reporter had tested the mods over a year ago.