[Home]

Summary:ASTERISK-16299: [patch] Asterisk "locks up" the system when an external process is called from the 'h' extension with a lower priority than Aste
Reporter:Ramon Peek-Fares (ramonpeek)Labels:
Date Opened:2010-06-29 05:47:48Date Closed:2010-11-23 16:36:31.000-0600
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Resources/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20100630__issue17564.diff.txt
( 1) backtrace.txt
( 2) core-show-locks.txt
Description:We noticed that Asterisk "locks up" the system when an external process is called from the 'h' extension with a lower priority than Asterisk.
This issue is a follow up on issue: ASTERISK-15239

See Additional Information Field for more info....

****** STEPS TO REPRODUCE ******

Create an "h" extension that is called when the calling party hangs up.
Make sure that the "h" extension calls an external process like ODBC or via the System() command, so for example:


exten = h,1,Set(NAME=${ODBC_xxx(SELECT name from extensions WHERE extension = 100)})

or

exten = h,1,Set(NAME=System(cat /tmp/ext_name.txt))


Now let Asterisk run at a higher priority by using the "nice" command in the safe_asterisk script or using "renice <prio> <pid>".

When running at a priority less than -5 difference from the system processes the system will work and not "lock-up", but it will be noticeably slower then when executing system-command from the 'h' extension. If the priority is raised to -10 or higher the system will again "lock-up".
This behaviour is somewhat depend on the type of kernel you are using. But we have tried with the latest vanilla kernel and were still able to reproduce the problem, although the problem is worse when using for example 2.6.9 kernels.

Now to reproduce the error I used "renice" to lower the priority of our postgreSQL database process, but you could also raise the priority of asterisk a bit, the result is the same.

When our external process is running at priority 0 (the same as asterisk by default) then there is no problem. However when we run the external process at priority 5 (lower that asterisk) the system becomes slow on hangups.

Renice command used:
renice 0 $(ps -e |grep postmaster | cut -d? -f1) or
renice 5 $(ps -e |grep postmaster | cut -d? -f1)


****** ADDITIONAL INFORMATION ******

During an update to Asterisk 1.4.22 or higher we encountered this problem.
After speaking to Russell about this issue we initially thought of a kernel problem, and indeed a new kernel seemed to resolve the issue. But this was an incorrect assumption because we later on learn that we were running asterisk as non-root at highpriority which apparently should not be attempted ;-)

Further testing taught us that Asterisk would become unstable and the system would "lock-up" on all external events called from the 'h' extension.
And if we tweaked this process scheduling priority using the "renice" command we could simply get the system to run slow or lock up on demand.

See steps to reproduce field for how we did this.
And see issue ASTERISK-15239 for traces etc..
Comments:By: David Woolley (davidw) 2010-06-29 06:23:34

Do you mean a lower priority or do you mean less niceness?  I don't think Asterisk is normally run at a real time priority.  If it is unstable just as the result of niceness, it may be unstable even without niceness.

By: Ramon Peek-Fares (ramonpeek) 2010-06-29 06:39:55

Asterisk in our case is running at the default priority "0" as set in the safe_asterisk script. This uses "nice" to set the priority, so less niceness is the answer to your question.

Note:
We use nice to have external processes, such as PostgreSQL, to run them at a "nicer" priority in contrast to Asterisk. And it doesn't really matter whether asterisk is running at a "less nicer" negative priority lower than "0" or not.

By: Paul Belanger (pabelanger) 2010-06-29 06:42:04

Debugging deadlocks:

Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install)

This will then give you the console command:

core show locks

When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt

# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt

gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!

By: Ramon Peek-Fares (ramonpeek) 2010-06-29 07:32:13

At first our system would become fully unresponsive, that's why I listed this issue at "crash". Just recently I've found ways to make the system "not-crash" on this issue.

I'll try to get to get a GDB trace during the worst kind of "lock-up" I can replicate without having the system become fully unresponsive...
To be continued...

By: Ramon Peek-Fares (ramonpeek) 2010-06-29 08:05:37

After recompiling asterisk with DONT_OPTIMIZE the problem got a lot less.
Meaning Asterisk slowed down noticeable, but didn't "lock-up" my system unless I ran Asterisk with the -p option (but that's normally not what we do!)

Anyway because I still needed to reproduce the problem I set our PostgreSQL process to run at a "nicer" level of 19 and left Asterisk at the default level of 0.
To create the traces I create a little bash script to execute all command in a row and use "nice -19 gbd...." to give gdb a higher priority.
The funny thing was that the Asterisk process fully locked up whilst gdb was running. After I typed "quit" in gdb Asterisk would continue its dialplan.
So there is a good chance this gdb will actually show what we are looking for.

By: David Woolley (davidw) 2010-06-29 08:08:18

I'd suggest setting processor affinity, to limit Asterisk to one virtual processor, if you want to reliably reproduce scheduling related problems.

By: Ramon Peek-Fares (ramonpeek) 2010-06-29 08:37:49

I'm only running one CPU in this system, so processor affinity should be relevant:.

cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping        : 9
cpu MHz         : 2790.855
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 5585.13

By: Tilghman Lesher (tilghman) 2010-06-29 12:59:36

How are you running your "external process"?  System?  AGI?  SHELL()?  Something else?

This:
exten = h,1,Set(NAME=System(cat /tmp/ext_name.txt))
does not actually run anything external.



By: David Woolley (davidw) 2010-06-29 13:47:28

What does "ht" mean.  Some Pentium 4s are hyperthreading and therefore count as two for scheduling.

In any case, it is likely to be relevant for someone trying to reproduce the problem.

By: Ramon Peek-Fares (ramonpeek) 2010-06-30 03:03:48

@tilghman:
In the tests used for these traces I'm using the res_odbc resource to write or read data from my PostgreSQL database. In my dialplan I execute many SQL queries with no problems, but when executing them from the 'h' extension the execution is very slow or even "locks up" the system.
Please note that this behavior was introduced somewhere in Asterisk 1.4.22 and that the DONT_OPTIMIZE option reduces the problem tremendously.

In other tests I've also tried calling other external processes using the System() command, which showed exactly the same behavior. Which is why I don't think this problem is related to the ODBC driver.

PS: Hardware should also not be the problem, since we run similar images of this system on multiple systems at our customers with 4 different types of hardware; single core processors and dual core, they all show the same problem.

@davidw:
The "ht" option in cpuinfo indeed stands for the capability of the processor to support HyperThreading, however hyperthreading support had been turned of in the kernel and BIOS of this system. So there is only 1 CPU available in this test-system. Please note that other systems running similar setups and multiple cores show the same behavior.

By: Ramon Peek-Fares (ramonpeek) 2010-06-30 03:11:18

Extra info:
-------------
If I run Asterisk using the "-p" option the system ALWAYS locks up when calling an ODBC function or a simple "exten = h,1,Set(NAME=System(cat /tmp/ext_name.txt))", BUT ONLY IN THE 'H' EXTENSION.

However if I run Asterisk without the "-p" option and use "nice" to prioritize Asterisk (less nicer) AND  de-prioritize other processes such as PostgreSQL making them nicer to asterisk then Asterisk slows down when calling such processes, BUT ONLY IN THE 'H' EXTENSION.
(PS: No other large processes are active on the machine other than the usual)

If I do not prioritize any process, so all processes are running default, then no problem occurs... But he... that not really what we want..



By: Ramon Peek-Fares (ramonpeek) 2010-06-30 04:28:50

Good news!

We've found the culprit!!
In asterisk 1.4.22 changes made to autoservice.c  [r122713] cause this behavior.
If I revert this change the problem completely disappears!
So even running asterisk at "highpriority" is possible again.

Change listed in Asterisk Change Log:
2008-06-13 21:44 +0000 [r122713]  Mark Michelson <mmichelson@digium.com>


* main/autoservice.c: Short circuit the loop in autoservice_run if there are no channels to poll. If we continued, then the result would be calling poll() with a NULL pollfd array. While this is fine with POSIX's poll(2) system call, those who use Asterisk's internal poll mechanism (Darwin systems) would have a failed assertion occur when poll is called. (related to issue ASTERISK-9989)

Question remains: "Is it wise to revert this change?"



By: Ramon Peek-Fares (ramonpeek) 2010-06-30 07:17:49

As far as I can see the added code in autoservice.c, is indeed the problem!:

Because we are in the 'h' extension the value of 'as->chan->_softhangup' is true and variable x is not yet set when it reaches this new piece of code:

+        if (!x) {
+           continue;
+        }

This is normally not a problem because the asterisk process is running at the same kernel_priority as the process run in the 'h  extension so that process gets the chance to be executed.
However when asterisk runs at a higher kernel_priority then the process running in the 'h' extension, asterisk will 'loop' this part of the code and the the process in the 'h' extension will hardly get a chance to execute. Which causes the system to slow down.
And the more priority is given to asterisk, the more likely this is going to be a problem. Which is very likely the reason that running Asterisk at highpriority (-p) causes the whole system to "lock up".

A simple solution could be to introduce a 50ms sleep before we continue the loop, but I guess that's considered bad programming ;-)



By: Tilghman Lesher (tilghman) 2010-06-30 12:43:49

I'd go with a 10ms loop, but I'm going to ask Mark to verify if he thinks that would cause a problem.

By: Mark Michelson (mmichelson) 2010-06-30 13:14:23

I'd say the 10ms sleep for the if (!x) case would probably do the trick. It may even work with a smaller sleep time.

It may also be worth investigating the idea of not adding channels to autoservice if their _softhangup flag is set when calling ast_autoservice_start(). The sleep would still be needed for cases where the _softhangup gets set after autoservice is started, though.

By: Ramon Peek-Fares (ramonpeek) 2010-07-01 01:56:40

Yes this patch works fine!
Thanks everyone, case may be considered resolved ;-)

PS: Watch out for similar issues, I've seen a few that might be related...



By: Digium Subversion (svnbot) 2010-07-02 12:09:46

Repository: asterisk
Revision: 273717

U   branches/1.4/main/autoservice.c

------------------------------------------------------------------------
r273717 | tilghman | 2010-07-02 12:09:46 -0500 (Fri, 02 Jul 2010) | 8 lines

Autoservice loop optimization causes a busy loop, when channels are serviced while in hangup.

(closes issue ASTERISK-16299)
Reported by: ramonpeek
Patches:
      20100630__issue17564.diff.txt uploaded by tilghman (license 14)
Tested by: ramonpeek

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=273717

By: Digium Subversion (svnbot) 2010-07-02 12:10:59

Repository: asterisk
Revision: 273718

_U  trunk/
U   trunk/main/autoservice.c

------------------------------------------------------------------------
r273718 | tilghman | 2010-07-02 12:10:59 -0500 (Fri, 02 Jul 2010) | 15 lines

Merged revisions 273717 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
 r273717 | tilghman | 2010-07-02 12:09:47 -0500 (Fri, 02 Jul 2010) | 8 lines
 
 Autoservice loop optimization causes a busy loop, when channels are serviced while in hangup.
 
 (closes issue ASTERISK-16299)
  Reported by: ramonpeek
  Patches:
        20100630__issue17564.diff.txt uploaded by tilghman (license 14)
  Tested by: ramonpeek
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=273718

By: Digium Subversion (svnbot) 2010-07-02 12:11:37

Repository: asterisk
Revision: 273719

_U  branches/1.6.2/
U   branches/1.6.2/main/autoservice.c

------------------------------------------------------------------------
r273719 | tilghman | 2010-07-02 12:11:37 -0500 (Fri, 02 Jul 2010) | 22 lines

Merged revisions 273718 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
 r273718 | tilghman | 2010-07-02 12:10:59 -0500 (Fri, 02 Jul 2010) | 15 lines
 
 Merged revisions 273717 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.4
 
 ........
   r273717 | tilghman | 2010-07-02 12:09:47 -0500 (Fri, 02 Jul 2010) | 8 lines
   
   Autoservice loop optimization causes a busy loop, when channels are serviced while in hangup.
   
   (closes issue ASTERISK-16299)
    Reported by: ramonpeek
    Patches:
          20100630__issue17564.diff.txt uploaded by tilghman (license 14)
    Tested by: ramonpeek
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=273719