
Summary: ASTERISK-13059: CPU Usage Increases and then Asterisk Crashes
Reporter: A.R. Nasir Qureshi (nasirq)
Labels:
Date Opened: 2008-11-12 08:45:39.000-0600
Date Closed: 2011-06-07 14:00:41
Priority: Critical
Regression? No
Status: Closed/Complete
Components: General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) debug1-optimized.txt
( 1) Lock-Log-1-4-2009.txt
( 2) Lock-Log-30-4-2009.txt
Description: I am using Asterisk in a call center environment. I am using AgentCallbackLogin for agents. The agents are on SIP phones (Polycom).

I am using the Asterisk::Manager Perl API to connect to the manager interface for CTI. One Perl script runs (using Asterisk::Manager) and does two things. First, it sends Action: Status every 500 ms to get the list of currently active channels and stores them in a MySQL table. Second, it watches a directory for files, each of which contains Manager actions. It reads the file, sends the actions to Asterisk and deletes the file. This way I can send many Manager actions while having only one manager connection to Asterisk.

I have used this approach to reduce the load on Asterisk if many clients need to read the Events or send actions.
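
For readers who want to picture the file-based dispatch described above, here is a minimal Python sketch (the reporter's actual script is Perl and is not shown in this issue). The function name `drain_spool`, the `send` callback and the one-action-per-file layout are all assumptions made for illustration.

```python
import os
from typing import Callable, List


def drain_spool(spool_dir: str, send: Callable[[str], None]) -> List[str]:
    """Send each action file found in spool_dir via `send`, then delete it.

    Assumption: every file holds exactly one Manager action, written as
    "Key: Value" lines. `send` stands in for the single manager connection.
    """
    handled = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", encoding="utf-8") as fh:
            lines = [ln.strip() for ln in fh.read().splitlines() if ln.strip()]
        if lines:
            # Manager actions use CRLF line endings and end with a blank line.
            send("\r\n".join(lines) + "\r\n\r\n")
        os.remove(path)
        handled.append(name)
    return handled
```

Calling this in the same loop that sends Action: Status every 500 ms would keep everything on one connection, which is the design the reporter describes.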

The problem I am facing is that sometimes (and I am unable to find the steps to reproduce) when sending actions, this interface seems to hang. No more actions get processed and the CPU usage starts to climb. Using top I have seen more than two, and sometimes five to six, asterisk threads using a lot of CPU. Sometimes new calls go through and sometimes not. Sometimes after the active calls are hung up everything gets back to normal, and sometimes it gets worse. Using soft hangup on the CLI does not work, and even restart now does nothing. The only option left is killall -9 asterisk.

At this stage of this issue, I need expert guidance to enable me to further dig in to find out what could be wrong when this occurs.

Please help me out.

Comments:

By: Leif Madsen (lmadsen) 2008-11-12 14:16:44.000-0600

Can you please enable DONT_OPTIMIZE in menuselect, then rebuild and reinstall Asterisk, and when you get a core dump (after asterisk crashes), can you attach the backtrace? Follow the instructions in the doc/backtrace.txt file located in your Asterisk source directory.

Thanks!

By: A.R. Nasir Qureshi (nasirq) 2008-11-12 23:43:36.000-0600

I'll do that.

And what if asterisk does not crash, but uses a lot of CPU and becomes non-responsive? Will killing it with -9 create a core dump? If not, how can I know what is happening then, or what part of asterisk is causing the problem?

By: has1084 (hasitha) 2008-11-14 05:58:46.000-0600

Hi, I'm having exactly the same issue. The service crashed again this morning after running for 6 hours. When it crashed, the CPU time of the process was 96 hours and CPU usage was 130%.
I had to use 'kill -9' (as usual) because there was no other way to restart the service.

I compiled asterisk with DONT_OPTIMIZE enabled and it was running with option -g when it crashed.

But I can't find any core dump files in /tmp directory.

Please advise what else can be done.

By: has1084 (hasitha) 2008-11-17 03:34:46.000-0600

Hi,
Asterisk does not produce any core dump files. Actually, it does not crash and stop. The process just freezes and stops responding to any SIP messages.

Does anyone have any idea about this problem?

Thanks!

By: Leif Madsen (lmadsen) 2009-02-02 14:58:19.000-0600

Does this still happen to be an issue?

By: A.R. Nasir Qureshi (nasirq) 2009-02-09 03:33:37.000-0600

I have not tested the procedures that caused the problems for some time.

Let me do it and then I'll know if it is happening again or not.

BTW, I need a way to know where asterisk is stuck when it does not crash (and produce a core dump) but just uses a lot of CPU and becomes non-responsive.

By: Leif Madsen (lmadsen) 2009-02-09 10:47:16.000-0600

I've assigned to file for now.

file: can you answer the reporter's question? I'm thinking that at this point we're looking for the issue to be reproduced in valgrind?

By: Joshua C. Colp (jcolp) 2009-02-09 10:48:44.000-0600

Sounds like it could be spinning 'round and 'round... attaching gdb using:

gdb asterisk --pid=`pidof asterisk`

And doing "thread apply all bt" would give the needed information.
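
If the hang has to be captured quickly on a busy box, the same gdb steps can be scripted. The following is only a sketch: the helper names `gdb_backtrace_cmd` and `capture_backtrace` are made up here, and it assumes gdb is installed and that the caller looks up the asterisk PID itself (for example with pidof).

```python
import subprocess
from typing import List


def gdb_backtrace_cmd(pid: int) -> List[str]:
    """Build a non-interactive gdb command that dumps all thread backtraces."""
    # -batch exits after running the -ex command, detaching from the process.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid: int, out_path: str, timeout: float = 120.0) -> None:
    """Run gdb against `pid` and write its output to `out_path`."""
    with open(out_path, "w") as out:
        subprocess.run(
            gdb_backtrace_cmd(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            timeout=timeout,
            check=False,  # gdb may return non-zero; keep whatever was written
        )
```

Running the capture two or three times a few seconds apart makes it easier to see which threads are spinning rather than merely blocked.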

By: Joshua C. Colp (jcolp) 2009-02-25 11:06:23.000-0600

nasirq: Have you been able to get the needed information?

By: A.R. Nasir Qureshi (nasirq) 2009-02-25 23:15:03.000-0600

Not yet. The system in question has now been moved to production, so I am unable to simulate the conditions. I have set up another testing system similar to the first one, and I will start testing soon. Will post the results soon.

By: Joshua C. Colp (jcolp) 2009-03-09 16:15:29

nasirq: Any results yet?

By: Dominic Böttger (dominic) 2009-03-10 06:11:09

Same issue in our production environment :-(
Attached the debug output of gdb. Can't recompile asterisk at the moment (production). Didn't have this issue for 9 weeks. Today it happened again. In the CLI I was able to execute "core show channels" just one time. Then I had to exit the CLI and connect again to execute "core show channels" again.

I am using asterisk 1.4.22, can't upgrade to 1.4.23, because there are issues with Siemens Openstage phones ....

I am not using AgentCallBackLogin. I am using the manager interface to read the device states for a device state server.

By: Joshua C. Colp (jcolp) 2009-03-11 14:40:44

dominic: Going to need Asterisk compiled with DONT_OPTIMIZE and DEBUG_THREADS in menuselect / Compiler Flags. Attach the output of core show locks taken when it has deadlocked.
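
Since the CLI itself may stop answering once the deadlock sets in, it can help to grab the lock list with a timeout instead of typing it by hand. A rough sketch follows, assuming the asterisk binary is on PATH and was built with DEBUG_THREADS; `core_show_locks_cmd` and `capture_locks` are hypothetical helper names.

```python
import subprocess
from typing import List


def core_show_locks_cmd() -> List[str]:
    """Command that asks the running Asterisk for its held locks."""
    # -r connects to the running instance, -x runs one CLI command and exits.
    return ["asterisk", "-rx", "core show locks"]


def capture_locks(timeout: float = 15.0) -> str:
    """Return the 'core show locks' output, or '' if the CLI hangs."""
    try:
        result = subprocess.run(
            core_show_locks_cmd(),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The remote console never answered; fall back to a gdb backtrace.
        return ""
    return result.stdout
```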

By: Leif Madsen (lmadsen) 2009-03-23 12:32:49

dominic: ping? any luck on getting the requested information, or determining an ETA for it?

By: A.R. Nasir Qureshi (nasirq) 2009-04-01 01:34:06

Today we faced this problem again, and I was not doing anything (sending actions) that I described when I opened this report.

First I was able to do a "show channels" and get results, and then I did the gdb thing. After that, show channels was not working (output of asterisk CLI and gdb attached). I did gdb again (both outputs attached). The asterisk process was using a lot of CPU:
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
3283 root      20   0  608m  16m 7284 S 83.8  0.2 128:21.42 asterisk

To get out of the situation I had to do a "killall -9 asterisk".

My asterisk is only compiled with DONT_OPTIMIZE. I am recompiling it to your specs, and will do the changeover on the weekend. I didn't read the post carefully, so I was not able to do core show locks. Will remember it next time.



By: A.R. Nasir Qureshi (nasirq) 2009-04-01 01:55:11

One more thing. I tried to telnet into the manager interface. Doing Action: Status the first time showed the result, but after that it didn't work.

Action: Status

Response: Success
Message: Channel status will follow

Event: Status
Privilege: Call
Channel: SIP/807-01e6cd60
CallerID: 807
CallerIDNum: 807
CallerIDName: <unknown>
Account:
State: Up
Link: Local/807@extensions-96f8,2
Uniqueid: 1238563323.13158

Event: Status
Privilege: Call
Channel: Agent/1004
CallerID: unknown
CallerIDNum: unknown
CallerIDName: unknown
Account:
State: Up
Link: SIP/901-ec051ae0
Uniqueid: 1238563323.13157

Event: Status
Privilege: Call
Channel: Local/807@extensions-96f8,2
CallerID: unknown
CallerIDNum: unknown
CallerIDName: unknown
Account:
State: Up
Context: extensions
Extension: 807
Priority: 1
Seconds: 2975
Link: SIP/807-01e6cd60
Uniqueid: 1238563323.13156



Action: Status






Action: Status
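
A transcript like the one above can be split into its blank-line-separated blocks to make it obvious which Action: Status got a reply and which did not. This is a minimal sketch only; `parse_ami_blocks` is a name invented here, and it assumes plain "Key: Value" lines as shown in the paste.

```python
from typing import Dict, List


def parse_ami_blocks(text: str) -> List[Dict[str, str]]:
    """Split manager output into blocks of headers keyed by name."""
    blocks: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line terminates the current response or event.
            if current:
                blocks.append(current)
                current = {}
            continue
        # Split on the first colon only; values such as channel names
        # may contain further punctuation.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        blocks.append(current)
    return blocks
```

For the first request above this would yield one Response block followed by three Event: Status blocks; the later requests would yield nothing from the server, which matches the hang being reported.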

By: Leif Madsen (lmadsen) 2009-04-01 09:46:13

nasirq: if your note above isn't related to the issue here, then please open an alternate bug to track it. Thanks!

By: A.R. Nasir Qureshi (nasirq) 2009-04-02 01:21:48

This is related to the same issue, as this is the same machine, the same usage of the Manager interface and the same symptoms. I was just not sending actions this time. Since we do not know what caused the original issue, maybe sending actions was never the cause of the problem.

However if you still think that it is not related, I'll open another bug.

By: Frank Waller (explidous) 2009-04-07 08:52:55

We had similar issues when using the same Manager connection to send actions and receive events.
Using one Manager connection for listening and a separate connection for sending each set of dependent commands has solved our problems with this.
As a note, we use a separate thread for each manager connection as well...
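
As an illustration of the split described here, the sketch below opens one manager connection with events on (for listening) and another with events off (for sending). It is not the poster's code: `format_action` and `open_ami` are hypothetical helpers, and the host, port and credentials are placeholders. It also skips reading the greeting banner and the login response, which real code should check.

```python
import socket
from typing import Dict


def format_action(name: str, fields: Dict[str, str], action_id: str) -> bytes:
    """Serialize one Manager action with an ActionID for matching replies."""
    lines = [f"Action: {name}", f"ActionID: {action_id}"]
    lines += [f"{key}: {value}" for key, value in fields.items()]
    # CRLF line endings, terminated by a blank line.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def open_ami(host: str, port: int, username: str, secret: str,
             events: bool) -> socket.socket:
    """Open a manager connection; `events` selects listener vs. sender role."""
    sock = socket.create_connection((host, port), timeout=10)
    login = {
        "Username": username,
        "Secret": secret,
        "Events": "on" if events else "off",
    }
    sock.sendall(format_action("Login", login, "login"))
    return sock


# Hypothetical usage (placeholder host and credentials):
#   listener = open_ami("127.0.0.1", 5038, "cti", "secret", events=True)
#   sender = open_ami("127.0.0.1", 5038, "cti", "secret", events=False)
#   sender.sendall(format_action("Status", {}, "status-1"))
```

Keeping the sender's event stream off means its socket only ever carries responses to its own actions, so a slow reader on the listening side cannot stall command replies.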



By: Joshua C. Colp (jcolp) 2009-04-15 12:49:37

Pinging anyone at all. I need Asterisk compiled with DONT_OPTIMIZE and DEBUG_THREADS in menuselect / Compiler Flags and the output of core show locks attached to this issue to proceed.

By: A.R. Nasir Qureshi (nasirq) 2009-04-16 03:55:25

I have compiled asterisk with the required flags. As soon as we face the issue again, I will update this bug with the required info.

By: Joshua C. Colp (jcolp) 2009-04-27 09:34:11

nasirq: Glad to hear that! I'll be here waiting.

By: A.R. Nasir Qureshi (nasirq) 2009-04-30 05:26:25

We just got another lockup. I have attached the file with the output of core show locks, (gdb) thread apply all bt and show channels.

I had to kill asterisk using -9 again and then restart it to restore service.

By: Joshua C. Colp (jcolp) 2009-05-04 11:26:28

After examining this further, it has already been fixed in newer versions. It was fixed in revision 148912 from issue 13676.