
Summary: ASTERISK-06681: [patch] System freeze: responds to ping, accepts connections on ports (5060, 80, etc.) but does not respond back
Reporter: Marcel Barbulescu (marcelbarbulescu)
Labels:
Date Opened: 2006-04-02 21:42:08
Date Closed: 2006-05-01 17:03:20
Priority: Blocker
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 20060424__no_realtime_priority.diff.txt
( 1) kernel-log.txt
Description: The system freezes at least once in 24h. When it's frozen it responds to ping, accepts connections on ports (5060, 80, 22, etc.) but does not respond back. There are no logs saved on the disk from the moment it freezes until it's rebooted. The Linux kernel is configured to reboot on panic but the system does not reboot itself after freezing.

There is no telephony hardware installed, SIP only, no MeetMe, no Voicemail, same happens with g729 or without, ztdummy loaded or not. System is mostly idle. System is stable if asterisk is not loaded.

The same happens with similar software configurations on a physical Pentium 4 3.0 HT with 1GB of RAM or on a VMware 5.5 virtual machine with 512MB of RAM (tested both running on a Pentium 4 3.2 HT with 2GB of RAM and on a Pentium 3 800 MHz with 1GB of RAM).

The software configuration is basically a CentOS server install with latest updates plus ntpd, httpd, mysql, shorewall and asterisk installed.

The same behaviour is reported with both kernel 2.6.9-34.EL and 2.6.9-22.EL. If SysRq + k is performed on a frozen system, it comes back to life after killing all the running processes.

****** ADDITIONAL INFORMATION ******

Attached is a log of the kernel messages from boot time and the output from SysRq + p, SysRq + m and SysRq + t on a frozen machine.
Comments:

By: adomjan (adomjan) 2006-04-03 02:13:53

Is it running with realtime priority?
This can happen in that case.

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-03 02:19:26

It is started using the included Redhat style init.d script. I don't think that it has the realtime priority.

By: adomjan (adomjan) 2006-04-03 02:21:58

check it!

ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-03 02:23:57

This is the output for asterisk:

2083  2083 TS       -   0  24   0  0.0 S    wait           safe_asterisk
2126  2126 RR      10   -  50   0  0.5 Sl   -              asterisk
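
As a rough illustration of what the `class` and `rtprio` columns above report, the same information can be read from C with the POSIX calls `sched_getscheduler()` and `sched_getparam()`. This is only a sketch: `policy_label()` and `is_realtime()` are hypothetical helper names, not part of Asterisk or ps.

```c
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: map a scheduling policy to the class label ps prints. */
const char *policy_label(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "TS";  /* default time-sharing class */
    case SCHED_FIFO:  return "FF";  /* realtime, first-in first-out */
    case SCHED_RR:    return "RR";  /* realtime, round-robin */
    default:          return "?";
    }
}

/*
 * Hypothetical helper: report whether a process runs under a realtime policy.
 * pid 0 means the calling process.
 * Returns 1 for SCHED_FIFO/SCHED_RR, 0 for any other policy, -1 on error.
 * If rtprio_out is not NULL it receives the static priority (0 for
 * non-realtime policies).
 */
int is_realtime(pid_t pid, int *rtprio_out)
{
    struct sched_param param;
    int policy = sched_getscheduler(pid);

    if (policy == -1)
        return -1;
#ifdef SCHED_RESET_ON_FORK
    /* Linux may OR this flag into the returned policy value. */
    policy &= ~SCHED_RESET_ON_FORK;
#endif
    if (sched_getparam(pid, &param) == -1)
        return -1;
    if (rtprio_out != NULL)
        *rtprio_out = param.sched_priority;

    return policy == SCHED_FIFO || policy == SCHED_RR;
}
```

For the asterisk line above (class `RR`, rtprio `10`), `is_realtime(2126, &prio)` would be expected to return 1 with `prio` set to 10, assuming the caller is allowed to query that process.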

By: adomjan (adomjan) 2006-04-03 02:27:12

Your asterisk runs at realtime priority 10; check asterisk.conf.

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-03 02:35:14

I just changed "highpriority" to "no" in asterisk.conf. I assume that now even if asterisk enters an infinite loop it will not block the whole system.

Do you have an explanation of what caused the behaviour? What's wrong with the high priority?

By: adomjan (adomjan) 2006-04-03 02:51:19

The high priority is not the bugnote topic; the bugnote topic is that your asterisk enters an infinite loop.
When your asterisk runs at realtime priority and enters an infinite loop, the other processes running at normal priority won't get CPU time anymore.
If you want to use realtime priority, try the asterisk_safe script from chan-ss7.

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-03 11:08:19

adomjan: Thanks a lot for the insights. I can bet that at least now the server will not freeze completely anymore. And I expect even the asterisk to work fine in lower priority mode. I'll report back after running it for a while.

However, one question still remains: what causes Asterisk to block in high priority mode with this configuration?

By: Joshua C. Colp (jcolp) 2006-04-05 15:38:41

That's hard to answer without seeing what Asterisk is doing at the time.

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-05 15:41:11

How can I debug that?

I have a snapshot of a locked machine. I can generate a crash with SysRq-c and provide the core dump, if it helps in any way.

By: Joshua C. Colp (jcolp) 2006-04-05 15:42:59

A core dump of Asterisk would be just peachy!

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-05 16:37:08

Unfortunately I cannot convince the locked machine to dump the core over the net (using a preinstalled netdump). It worked before but in this case it's just printing out the tasks and the memory info and that's all.

However, I can probably reproduce it within 24h, so if you give me detailed instructions on how to capture debug info, I will probably be able to do it.

PS: Just in case somebody missed it: there are some kernel logs attached to the initial reporting, with boot log, task list, memory info, etc...

By: Mikael Kuisma (mikael) 2006-04-24 14:25:58

I had the same problem. It was caused by an external program called from within asterisk, and the fact that ast_safe_system() does not change back to normal priority before exec'ing the subprocess.

This patch solves that problem (1.2.7.1, rev 19351):

<code>
*** asterisk.c.orig     2006-04-11 23:55:51.000000000 +0200
--- asterisk.c  2006-04-24 10:52:05.000000000 +0200
***************
*** 442,449 ****
--- 442,452 ----

       pid = fork();

       if (pid == 0) {
+               /* Do not run in real time queue */
+               if (option_highpriority)
+                       ast_set_priority(0);
               /* Close file descriptors and launch system command */
               for (x = STDERR_FILENO + 1; x < 4096; x++)
                       close(x);
               execl("/bin/sh", "/bin/sh", "-c", s, NULL);
</code>
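
The idea behind the patch can be sketched outside Asterisk as follows. This is a minimal, self-contained illustration, not the committed code: `drop_realtime_priority()` and `run_command_normal_priority()` are hypothetical names, the sketch calls `sched_setscheduler()` directly as a stand-in for `ast_set_priority(0)` (whose body is not shown in this thread), and it leaves out the file-descriptor closing loop that appears in the hunk.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: move the calling process back to the default
 * time-sharing scheduler. Returns 0 on success, -1 on failure.
 */
int drop_realtime_priority(void)
{
    struct sched_param param;

    param.sched_priority = 0;  /* SCHED_OTHER only accepts priority 0 */
    return sched_setscheduler(0, SCHED_OTHER, &param);
}

/*
 * Hypothetical stand-in for the patched fork/exec path: run a shell
 * command in a child that has left the realtime queue before exec.
 * Returns the command's exit status, or -1 on error.
 */
int run_command_normal_priority(const char *cmd)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: do not run the external program in the realtime queue. */
        if (drop_realtime_priority() == -1)
            perror("sched_setscheduler");
        execl("/bin/sh", "/bin/sh", "-c", cmd, (char *)NULL);
        _exit(127);  /* reached only if exec failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Resetting the policy in the child, after `fork()` and before `execl()`, leaves the parent's own scheduling untouched; only the external program loses the inherited realtime class, so a runaway script can no longer starve the rest of the system.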



By: Tilghman Lesher (tilghman) 2006-04-24 16:40:14

Might as well do that for ALL external processes, this time.  Test the patch, please, to ensure that it solves your issue.

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-24 19:27:52

I'm recompiling Asterisk right now on one of the servers that was exhibiting the problem. I'll report back the result.

By: Marcel Barbulescu (marcelbarbulescu) 2006-04-30 23:42:16

Almost 5 days have passed without a freeze. I strongly suspect that the problem has been fixed. I'll report back in about a week if Corydon decides not to close the bug yet.



By: Tilghman Lesher (tilghman) 2006-05-01 15:45:07

Committed to 1.2, as of revision 24019.

By: Tilghman Lesher (tilghman) 2006-05-01 17:03:20

Changes merged to trunk at revision 24053.