Summary:ASTERISK-10348: Asterisk suddenly slows down, and eats 100% cpu
Reporter:Atis Lezdins (atis)Labels:
Date Opened:2007-09-20 16:36:44Date Closed:2011-06-07 14:03:00
Versions:Frequency of
Environment:Attachments:( 0) backtrace.txt
( 1) backtrace2.txt
( 2) backtrace3.txt
( 3) bt_4.txt
( 4) cli.txt
( 5) cli2.txt
( 6) locks_4.txt
( 7) normal.txt
( 8) report
( 9) report.txt
(10) report2.txt
(11) threads_4.txt
(12) top_4.txt
Description:I'm running 1.4.10 (with reverted res_features from ASTERISK-10052). Approximately once or twice per week (restarted daily) asterisk suddenly starts acting slowly, and eats 100% cpu. Call volume drops as it takes really long time for customers to get to agents, but this doesn't improve situation in any way.

Last week i recompiled with DONT_OPTIMIZE, DEBUG_THREADS and DEBUG_LOCKS, so now i can attach debug output.

I killed asterisk with "killall -5 asterisk", so that core is dumped.

Attached files:
backtrace.txt - backtraces taken from core

report.txt - various information ("top", "core show threads", "core show locks", "core show channels") right before i killed asterisk.

normal.txt - top, and channel count when under normal load (30 minutes after kill)

cli.txt - multiple consecutive executions of "core show threads" and "core show locks", when problem raised.

Comments:By: Atis Lezdins (atis) 2007-09-20 16:43:11

Sorry, i missed top output before kill, please delete file "report"

By: Atis Lezdins (atis) 2007-09-20 17:21:53

Ok, seems that this happens "within some interval of day", or some other similar conditions. I got two more slowdowns today. The same was also week ago - wednesday - two slowdowns/day, and two weeks ago. Other days - we didn't had any problems.

I'll provide third backtrace tomorrow, if necessary.

If there is some more info that i could collect, please ask.

By: Eliel Sardanons (eliel) 2007-09-21 08:27:32

Please, check if asterisk is rotating the logs while this condition happen

By: Atis Lezdins (atis) 2007-09-21 08:37:32

No, it is not. My logs are rotaded in night (4am), so that there aren't any calls.

I looked into backtraces, and it seams to me that there is some problem with some read()'s in main function of main thread. Could this be somehow related to having safe_asterisk with default settings (TTY=9, console=yes).

I'm in doubt that something could have caused something to be written in tty9, however a test of "cat /dev/random > /dev/tty9" slowed down asterisk a little. I just disabled them, but will have to wait week or more to see that nothing happens.

By: Tilghman Lesher (tilghman) 2007-09-21 21:30:20

I'd kind of like to see the output of valgrind, when running an unoptimized build of Asterisk with valgrind.  That should give me a good idea of what routines are getting stuck in a busy loop, which would cause the 99% CPU issue.

By: Atis Lezdins (atis) 2007-09-22 00:25:11

Is there any manual on getting that valgrind output? As far as i have tried (from voip-info.org), "make valgrind" doesn't exist. Also, can i attach it to existing core?

By: Tilghman Lesher (tilghman) 2007-09-22 08:07:56

No, you cannot attach to a running process.  You must invoke Asterisk with valgrind, i.e.

valgrind asterisk -vvvvcg

By: Atis Lezdins (atis) 2007-09-24 03:44:26

You don't expect me to run that on production, and wait for problem (did i said that it happens once per week)? It's almost as slow as the problem itself.

Plus, running valgrind on test machine didn't revealed me anything happening inside, just some error and leak summaries. I used "sudo valgrind asterisk -vvvvvdcg".

By: Atis Lezdins (atis) 2007-09-24 04:53:11

Ok, i digged into sources and found that it's all related to last lines of code in main/asterisk.c. There's infinite loop "for(;;)", that reads something from console. I don't know what have caused this - i double checked all my system looking for tty9, there's nothing that could have caused it. As solution that fits me - i disabled console in safe_asterisk (and would recommend to do so in  distribution).

However the problem is somewhere there, i'm just not enough familiar with code to find it.Maybe because of "num_read", that i suppose means characters read? Why it's 5 megabytes? Could this slow down everything? Note, that it's the same value in all 3 backtraces.

#1  0x00000000004db0d7 in read_char (el=0x96ac70, cp=0x7fffc6d954c6 "") at read.c:296^
 num_read = 5116828^
 tried = 0^

By: Tilghman Lesher (tilghman) 2007-09-24 07:39:10

Are you saying that you AREN'T using safe_asterisk in production?  In that case, I would strongly recommend that you do so, because I suspect I know what the problem is, and, if it is that problem, running safe_asterisk will solve the problem.

The issue may be that the terminal running Asterisk is getting shut down.  We know about that issue, and it is handled with several commands within safe_asterisk which set up the environment in a way that prevents this problem from occurring.

By: Atis Lezdins (atis) 2007-09-27 10:31:26

No i'm using, i just was frightened by having asterisk console somewhere on tty9. However, turing it off didn't helped. Actually that was my mistake, assuming that asterisk uses separate processes, that can be monitored the same way as httpd for example.

So, you should ignore what i said about main thread. Threads are not visible when running regular "top". Now i'm using "top -H" instead. Unfortuneatly crashes are getting more often now (maybe because of increased load to manager), but this gives me more time for inspection (and more sleepless nights).

I monitored that whenever problem occurs, either few threads (two or three) are having high load that sums to 100%, or there are some dozen of them, each taking  some 5% of CPU - load is distributed quite evenly, but it all sums up to 100%.

However i did found some similar backtrace with similar problem in google:

I noticed that i have the same version of glibc - pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0 related to IAX2 functions. So, as i use IAX for connecting to Hylafax, i re-routed IAXmodem to another machine, and i'm currently waiting to see some improvements (loadavg already have dropped a little).

I'll let you know on further situation regarding this, as soon as i'll have it. Meanwhile i would appreciate if you could confirm my theory as possible.

By: Atis Lezdins (atis) 2007-09-28 13:03:09

Another problem today (yesterday went fine).

Again i see pthread_cond_wait@@GLIBC_2.3.2, but this time on devicestate.c - i believe there is no simple way how i can disable that one.

Admin will try to upgrade libc as soon as possible, but it will probably require recompiling a lot of software, so probably it will be at some weekend, but that's the only idea we have now.

By: Atis Lezdins (atis) 2007-10-09 11:41:27

A little update.

Upgrading glibc (on fedora core 6) from 2.5-10.fc6 to 2.5-18.fc6 and disabling IAX seems have helped. No slowdowns yet. I'd like to wait another week, to be sure that problem is solved, and this is compatibility issue.

By: Jason Parker (jparker) 2007-10-15 17:35:43

It's been about a week.  Do we want to call it?

By: Atis Lezdins (atis) 2007-10-17 06:43:18

Last two weeks i had no problems (glibc upgrade + disabling IAX). This week i enabled IAX back (very low traffic comparing to SIP), and next day i got report from night shift - that asterisk was unresponsive again. Unfortuneately nobody took core dump and debug info, just killed it.. I suspect that this might be not actually the same, as at the same time we had other performance issues. I'll give an update in end of this week. However, if it will be repeating - that would be IAX problem.

Anyway - i think it's good to add warning in docs about GLIBC 2.5.10 on Fedora Core 6.

By: Atis Lezdins (atis) 2007-10-22 03:17:54

Another week without problems.

Also, it's mentioned by some other users that GLIBC 2.5 can be problematic.

Can you please add compatibility warning?

By: Atis Lezdins (atis) 2007-11-07 04:59:34.000-0600

Ok, almost a month, and no such problems so far. Can you close the issue (and update some docs)

By: Jason Parker (jparker) 2007-11-07 13:11:40.000-0600

No longer an issue.