[Home]

Summary:ASTERISK-12581: AEL while loop - high cpu usage and crash
Reporter:Arkadiusz Malka (yarns)Labels:
Date Opened:2008-08-14 14:34:32Date Closed:2011-06-07 14:00:50
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:PBX/pbx_ael
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk.zip
Description:extensions.ael

Set(CDR(accountcode)=${CUT(EXTEN,,2)});
MixMonitor(RECORD_${STRFTIME(${EPOCH},GMT+8,%C%y%m%d%H%M%S)}_${CALLERID(num)}_${CUT(EXTEN,,2-3)}_${CDR(uniqueid)}.gsm|W(1)b);
if (${GROUP_COUNT(IAX2/gsm)}<${Kanaly1})
{
   Set(GROUP()=IAX2/gsm);
   Dial(IAX2/gsm/${CUT(EXTEN,,3)});
try=0;
   while (${try} < 10)
   {
       Wait(0.25);
       if ("${DIALSTATUS}" = "CHANUNAVAIL")
       {
           Dial(IAX2/gsm/${CUT(EXTEN,,3)});
           try=${try}+1;
       }
       else
       {
           proba=20;
       };
   };
};

30 agents, quadcore cpu, 8GB RAM
while entering into while loop cpu goes to 135% (without loop - 12%), after 2 hours crash:
Aug 14 15:38:51 asterisk2 kernel: asterisk[27997]: segfault at 0000000000000418 rip 00002aaab1a66b22 rsp 00000000416d9ce0 error 4

If someone need more specific information just tell me what to do to get it.

Best regards
Comments:By: Sean Bright (seanbright) 2008-08-14 15:10:46

Asterisk shouldn't crash, obviously, but can you try changing your loop control variable from 'try' to 'attempt'?  Might be an issue that you are using a reserved word as an identifier.

By: Arkadiusz Malka (yarns) 2008-08-14 22:52:19

Changed variable name and asterisk still crash.

By: Sean Bright (seanbright) 2008-08-15 05:10:45

Well, it was worth a shot :-)

I'll take a look later on today if murf doesn't get to it first.

By: Steve Murphy (murf) 2008-08-18 15:15:20

I just tried to reproduce this. Runs 10 times thru the loop and exits the
loop, just as expected.

Hmm, could you "core set verbose 5" and capture the messages that come out on the console, and attach them to this bug? That should pretty much tell us what's going on. I don't have queues on my test machine, and had to set Kanaly1 to 10 to mimic the behavior you might be seeing. The only way this code could be going nuts is either if Wait() is not doing its thing, or if Dial is going nuts for some reason.

By: Steve Murphy (murf) 2008-08-18 15:26:46

Wait a minute! If dial fails, but the reason is NOT "Chanunavail",
then it doesn't increment the loop counter, and you get caught in an
infinite loop!

Solution: Look at what the dial command returns, and thoughtfully
take action based on that. Increment your loop counter no matter the
return value

By: Steve Murphy (murf) 2008-08-18 15:29:27

OK, I'm so confident that it's a bug in your dialplan code (as previously described), that I'm closing this bug with "no change required". If you strongly feel I'm wrong, feel free to re-open. Just increment your loop counter unconditionally in the while loop. And, find out what Dial() is actually returning for DIALSTATUS. And code appropriately.

By: Arkadiusz Malka (yarns) 2008-08-19 00:52:57

Hello,

i paste bad code - in else statment should be:
try=20 not proba=20
so if dialstatus is different than CHANUNAVAIL its break while loop.
Today will paste code from console output.

Best regards



By: Steve Murphy (murf) 2008-08-19 12:24:25

OK, then we go to plan B.

core set verbose 40

and then collect console output while the system is going crazy. Should only need a thousand lines of output or so to see the problem.

Attach the output to this bug.

By: Arkadiusz Malka (yarns) 2008-08-20 06:41:11

Added console output.

By: Sean Bright (seanbright) 2008-08-20 08:10:46

Please attach your extensions.ael and the output of 'dialplan show'

We can mark the bug private if you don't want that information public.

By: Arkadiusz Malka (yarns) 2008-08-20 09:24:07

Please mark it as private.

By: Sean Bright (seanbright) 2008-08-20 09:57:40

Done

By: Steve Murphy (murf) 2008-08-20 11:21:31

OK, I looked over the log, and it looks like your loop is running OK. It quits at the 10th iteration, which is to be expected. Here are my thoughts:

1. 10 dial attempts (and mixmonitor might or might not be a contributing factor) in 2.5 seconds must be what is putting the extra load on the CPU. Even if they
mostly fail.

2. I only spotted one successful dial (like on the 2nd try) in the log.

3. Dialing 10 times in quick succession might not be so bad, but if your system gets busy, and this is happening almost all the time, several calls in parallel, then yes, it might account for the cpu cycle usage.

4. You might try waiting longer than a quarter second. You might try every second or two. This will slow down the freqency of the Dial calls. Better yet, only try one other time, after 2.5 seconds, which is the amount of time you allow for a success...

5. You might pay better attention to the DIALSTATUS after the call, and adjust for all the different cases.  If the line is busy, one more try after 2.5 seconds might have success. For other reasons, ... you need to think it thru.
If you got 'NO ANSWER", then you've already spent maybe 10 or 20 or 30 seconds, so why try again at all, and so on....

By: Steve Murphy (murf) 2008-08-20 11:24:47

Ok, I'm going to close this bug, because it appears that the problem is not AEL related. It is doing its job. It appears to be a cpu cycle usage problem with the applications being called. I suggest the user reduce the number of calls to Dial from 10 to 1 and this should get rid of the loop entirely, and reduce the cpu cycle usage considerably.