Summary: | DAHLIN-00219: [patch] The TE122 and dahdi produce an unusually high load | ||
Reporter: | Joao Carvalho (foxfire) | Labels: | |
Date Opened: | 2010-10-14 08:33:06 | Date Closed: | 2011-01-20 23:31:36.000-0600 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | wcte12xp |
Versions: | 2.4.0 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ( 0) 0001-wcte12xp-Use-interruptible-waits-to-decrease-impact-.patch ( 1) now.png ( 2) stats-bad.png ( 3) stats-good.png | |
Description: | When dahdi_linux is started on a machine with a TE122 installed, the load average rises from 0.00 to over 0.50 within a few seconds. This always happens. I have no problem with any other Digium card, only the TE122. I stopped all services and started only dahdi. Disabling the mg2 echo canceller does not help. I consider this major because call quality is affected by the increased load.
****** ADDITIONAL INFORMATION ******
my /etc/dahdi/system.conf
loadzone=pt
defaultzone=pt
span=1,0,0,css,hdb3,crc4
bchan=1-15
dchan=16
bchan=17-31
echocanceller=mg2,1-15
echocanceller=mg2,17-31 | ||
Comments: | By: Joao Carvalho (foxfire) 2010-10-14 08:38:51
This does not happen with Zaptel and Asterisk 1.4; it only happens with Asterisk 1.6. But the load is there even before starting Asterisk. I was running dahdi-linux-2.2.1.1, so I upgraded to 2.4.0 to check whether the problem was solved, but it remains an issue. I have been able to set a machine aside for testing, so I can run any tests you like. I also tried booting the kernel with noacpi, nosmp and noapic, but nothing changed.
By: Shaun Ruffell (sruffell) 2010-10-14 09:48:50
foxfire: Most likely your best bet is to contact Digium technical support for help with triaging this issue. However, there are some things I wanted to ask first. If this only happens with Asterisk 1.6, what do you mean that the load is still there before loading Asterisk 1.6? How are you determining that the load is 50% (or was it 0.50%)?
By: Joao Carvalho (foxfire) 2010-10-14 10:03:12
If I shut down everything (mISDN, dahdi and Asterisk) the machine behaves normally, but as soon as I start dahdi the load appears. Load averages in Linux are not percentages; it is a weird number. Normally it should be 0.00, but once the load rises it goes over 0.50. If it goes over 1.00 the machine theoretically loses real-time capacity. Because the initial load is already high, this happens very often.
By: Shaun Ruffell (sruffell) 2010-10-14 10:07:11
What is the command you're using to get the load?
By: Joao Carvalho (foxfire) 2010-10-14 10:13:00
The simplest one, just running uptime:
#uptime
16:09:32 up 1:41, 1 user, load average: 0.58, 0.51, 0.48
Here is another example, from a machine that doesn't have a TE122:
bash-3.1# uptime
16:11:28 up 181 days, 48 min, 2 users, load average: 0.00, 0.00, 0.00
By: Shaun Ruffell (sruffell) 2010-10-14 10:34:17
What you're probably noticing is the fact that the wcte12xp driver has moved alarm polling out of the interrupt handler (the polling is now in a workqueue and gets scheduled every 100ms), combined with the statistical sampling of the run queues used for calculating those "loads". Those numbers also aren't meaningful without knowing how many processors are in your system. Have you detected any real audio quality differences?
By: Joao Carvalho (foxfire) 2010-10-14 10:57:12
I have uploaded 2 PNGs from two servers running the exact same software.
Bad is a PowerEdge 2950 with 1 TE122.
Good is a PowerEdge 2950 with 1 T4XXP.
Bad has a maximum of 20 simultaneous calls; Good has over 60. I have been informed that audio problems sometimes occur on Bad; I have no such complaints about Good.
By: Shaun Ruffell (sruffell) 2010-10-14 11:00:39
What version of the driver were you running when you collected those stats?
By: Joao Carvalho (foxfire) 2010-10-14 11:06:42
Production servers are running:
asterisk 1.6.1.18
dahdi 2.2.1.1
mISDN-1.1.9.2
libpri-1.4.10.2
As I said before, I installed and upgraded dahdi to the latest version, but the problem continues.
By: Shaun Ruffell (sruffell) 2010-10-14 11:29:19
I was under the impression that when you updated DAHDI the only thing you saw continue was the 0.50 load average. Do you have reports of audio problems after updating to dahdi-linux 2.4.0?
By: Shaun Ruffell (sruffell) 2010-10-14 11:32:46
Also, you will most likely want to run 2.3.0+ because of the following commit: http://svn.asterisk.org/view/dahdi?view=revision&revision=8455
By: Joao Carvalho (foxfire) 2010-10-15 04:03:31
I really can't upgrade Asterisk on the production servers without extensive tests. Is it safe to upgrade dahdi without upgrading the rest? I really need the load value to be fixed, because several programs react differently to high loads. Have you heard of a patch that might fix this? As I said before, I have a machine available to perform any test or recommendation you might have.
By: Shaun Ruffell (sruffell) 2010-10-15 13:15:38
foxfire: With regard to updating your production servers, I recommend you test everything out on your staging setup first. But again, I think in this case "load average" is a red herring and the CPU utilization numbers are a better indication.
The fundamental issue is that the wcte12xp uses a periodic kernel timer to schedule work on a workqueue to read the alarm states on the framer. Linux runs all the expired timers (including the one from the wcte12xp driver) *before* calculating load averages. This means that every 100ms the kernel will always see that there is a process (the wcte12xp workqueue) ready to run. This process doesn't run for long. It basically just goes back to sleep after queuing some commands, but it still shows up as runnable when loads are being checked in the timer tick.
You can adjust the "load" numbers without affecting CPU utilization by changing how often the alarms and signal bits are checked in drivers/dahdi/wcte12xp.c:timer_work_func. To make the load increase, change "jiffies + HZ/10" to "jiffies + HZ/500". Likewise, to make the load decrease, change it to "jiffies + HZ/5". The CPU utilization doesn't change linearly with the load since this function, again, spends most of its time sleeping.
The reason you didn't see this with Zaptel is that the alarms were checked directly in the interrupt handler; that checking was removed from the wcte12xp driver in DAHDI in http://svn.asterisk.org/view/dahdi?view=revision&revision=6525.
So I would instead focus on trying to update to 2.4.0, and then see what the characteristics (if any) are of the audio quality problems.
By: Joao Carvalho (foxfire) 2010-10-15 19:38:30
Thank you for the explanation. Even so, I would like to lower the virtual load, because many statistics depend on it. Can changing the frequency to, for example, HZ/2 create any problems? I believe the value of HZ/10 was not chosen at random, but only HZ/2 seems to make the load look normal.
By: Shaun Ruffell (sruffell) 2010-10-15 20:54:30
foxfire: That rate is used for checking alarms and collecting RBS bits from the framer. So HZ/2 means that you're only checking for alarms every 500ms. The alarms are probably not that big a deal, since most of the time you won't be in alarm, and you don't need to worry about the RBS bits since you appear to have a D-channel set up. Still, I recommend testing it as completely as possible in your staging environment.
By: Joao Carvalho (foxfire) 2010-10-16 13:17:24
Thank you, I will do so.
By: Shaun Ruffell (sruffell) 2010-10-20 20:52:06
Linking a note about high load averages on idle systems with 2.6.35 stable kernels. It sounded very similar to what foxfire mentioned when he said "many statistics depend on them."
"Even if it is only statistics, many supervision tools rely on the load avg, so for production environments, this is not a good thing."
http://thread.gmane.org/gmane.linux.kernel/1042346/focus=1051361
This way I might find it if another discussion regarding load averages comes up in the future.
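The sampling effect described in the 2010-10-15 comment can be illustrated with a small simulation. This is a minimal sketch, not the kernel's code: it assumes the 1-minute load average is an exponentially damped average updated about every 5 seconds, and the function names and the "counted at every other sample" pattern are hypothetical, chosen only for illustration.

```python
import math

SAMPLE_INTERVAL_S = 5.0   # assumed: the load is sampled roughly every 5 seconds
WINDOW_S = 60.0           # the 1-minute load average


def update_load(load, active, interval=SAMPLE_INTERVAL_S, window=WINDOW_S):
    """One damped-average step: old value decays, the new sample is blended in."""
    decay = math.exp(-interval / window)
    return load * decay + active * (1.0 - decay)


def simulate(samples, load=0.0):
    """Feed a sequence of 'tasks counted at sample time' values through the average."""
    for active in samples:
        load = update_load(load, active)
    return load


# Hypothetical pattern: one short-lived task happens to be counted
# at every other sample, although it uses almost no CPU time.
print(simulate([1, 0] * 200))
# An idle system for comparison.
print(simulate([0] * 200))
```

Under these assumptions, a task that is counted at every other sample keeps the 1-minute figure hovering around 0.5 even though it is asleep nearly all the time, which is why the load number and the CPU utilization can disagree.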
By: Shaun Ruffell (sruffell) 2010-12-02 08:51:08.000-0600
Just taking another note here: I was digging around the kernel sources for something unrelated when I came across a comment saying that threads in interruptible sleeps are not added to the load averages. I then looked at the kernel/sched.c:calc_load_fold_active function in the Linux kernel and saw that threads in uninterruptible sleeps are actually added to the load average.
Sooooo...it'll probably be worth changing the wait_for_completion_timeout calls in t1_getreg to wait_for_completion_interruptible_timeout in order to reduce the impact of the alarm checking thread on the global load average.
By: Shaun Ruffell (sruffell) 2010-12-06 11:17:16.000-0600
foxfire: I'll commit 0001-wcte12xp-Use-interruptible-waits-to-decrease-impact-.patch in the next day or two unless I hear from you.
By: Joao Carvalho (foxfire) 2010-12-06 11:21:29.000-0600
OK, I will test your patch when it becomes available and get the results back to you. Thank you.
By: Shaun Ruffell (sruffell) 2010-12-06 11:24:13.000-0600
The patch is now attached to this issue (I didn't know if that was clear from your message). Thanks!
(If you use the "wget patch" option on the issue, be sure to use "patch -p1" instead of "patch -p0", since it was made with git and has the potential commit message with it.)
By: Joao Carvalho (foxfire) 2010-12-06 16:28:53.000-0600
I have it running on one server already. I had to wait until after hours to change the module. So far it looks very promising. I will run it for 24 hours under some load so we can see how it behaves, and I will get back to you with the results. ;)
By: Joao Carvalho (foxfire) 2010-12-07 11:10:27.000-0600
Very nice, congrats. So far there is no problem and the load is low. This site had an average of 6 simultaneous DAHDI calls, 18 SIP and 5 IAX2. It looks realistic to me; that machine had loads which reached 10 and over. Thank you, I believe you fixed it.
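The observation from the 2010-12-02 comment, that tasks in uninterruptible sleep are counted toward the load average while tasks in interruptible sleep are not, can be sketched as follows. The state letters follow the ps convention (R runnable, D uninterruptible sleep, S interruptible sleep); the function name and the task snapshots are hypothetical and only illustrate the counting rule, not the driver itself.

```python
# States that a load sample counts: runnable and uninterruptible sleep.
COUNTED_STATES = {"R", "D"}


def active_count(task_states):
    """Number of tasks a load-average sample would count as active."""
    return sum(1 for state in task_states if state in COUNTED_STATES)


# Hypothetical snapshot taken while the alarm-checking worker waits
# for a framer read to complete; the other two tasks are idle daemons.
before_patch = ["S", "S", "D"]  # worker waits uninterruptibly
after_patch = ["S", "S", "S"]   # worker waits interruptibly

print(active_count(before_patch))  # 1
print(active_count(after_patch))   # 0
```

The worker does the same amount of work in both snapshots; only the sleep state it waits in changes, and with it whether a sample counts the worker.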
By: Digium Subversion (svnbot) 2010-12-07 19:56:57.000-0600
Repository: dahdi
Revision: 9512
U linux/trunk/drivers/dahdi/wcte12xp/base.c
U linux/trunk/include/dahdi/kernel.h
------------------------------------------------------------------------
r9512 | sruffell | 2010-12-07 19:56:57 -0600 (Tue, 07 Dec 2010) | 48 lines
wcte12xp: Use interruptible waits to decrease impact on load average.
The wcte12xp does all the checking for alarm in a user space workqueue. Most of this time is spent sleeping waiting for reads from the framer to complete. Tasks in uninterruptible sleeps are added to running tasks for the purposes of calculating load average. This change makes the sleeps interruptible so as to not affect the load average as much.
For example, the following command will load and configure the driver and then print the load average every 10 seconds.
]# modprobe wcte12xp && dahdi_cfg && ((x=12)); while [[ $x -gt 0 ]]; do cat /proc/loadavg; sleep 10; let x=$x-1; done
With this change:
0.29 0.10 0.02 1/101 29945
0.24 0.10 0.02 1/101 29967
0.20 0.09 0.02 1/101 30019
0.17 0.09 0.02 1/101 30041
0.15 0.09 0.02 1/101 30062
0.12 0.08 0.02 1/101 30085
0.10 0.08 0.02 1/101 30107
0.09 0.08 0.02 1/101 30129
0.07 0.08 0.02 1/101 30151
0.14 0.09 0.02 1/101 30173
0.12 0.09 0.02 1/101 30195
0.10 0.08 0.02 1/101 30217
(and I've seen it get down to 0.0)
Before this change:
0.57 0.22 0.07 1/101 31920
0.48 0.21 0.07 1/101 31942
0.48 0.22 0.07 1/101 31964
0.48 0.23 0.08 1/101 31986
0.41 0.22 0.07 1/101 32008
0.42 0.23 0.08 1/101 32030
0.43 0.24 0.08 1/101 32054
0.45 0.25 0.09 1/101 32076
0.45 0.25 0.09 1/101 32098
0.46 0.26 0.10 1/101 32120
0.47 0.27 0.10 1/101 32172
0.39 0.26 0.10 1/101 32194
(closes issue DAHLIN-219)
Reported by: foxfire
Tested by: foxfire
Signed-off-by: Shaun Ruffell <sruffell@digium.com>
------------------------------------------------------------------------
http://svn.digium.com/view/dahdi?view=rev&revision=9512
By: Digium Subversion (svnbot) 2011-01-20 23:31:36.000-0600
Repository: dahdi
Revision: 9683
U linux/branches/2.4/drivers/dahdi/wcte12xp/base.c
U linux/branches/2.4/include/dahdi/kernel.h
------------------------------------------------------------------------
r9683 | sruffell | 2011-01-20 23:31:36 -0600 (Thu, 20 Jan 2011) | 50 lines
wcte12xp: Use interruptible waits to decrease impact on load average.
The wcte12xp does all the checking for alarm in a user space workqueue. Most of this time is spent sleeping waiting for reads from the framer to complete. Tasks in uninterruptible sleeps are added to running tasks for the purposes of calculating load average. This change makes the sleeps interruptible so as to not affect the load average as much.
For example, the following command will load and configure the driver and then print the load average every 10 seconds.
]# modprobe wcte12xp && dahdi_cfg && ((x=12)); while [[ $x -gt 0 ]]; do cat /proc/loadavg; sleep 10; let x=$x-1; done
With this change:
0.29 0.10 0.02 1/101 29945
0.24 0.10 0.02 1/101 29967
0.20 0.09 0.02 1/101 30019
0.17 0.09 0.02 1/101 30041
0.15 0.09 0.02 1/101 30062
0.12 0.08 0.02 1/101 30085
0.10 0.08 0.02 1/101 30107
0.09 0.08 0.02 1/101 30129
0.07 0.08 0.02 1/101 30151
0.14 0.09 0.02 1/101 30173
0.12 0.09 0.02 1/101 30195
0.10 0.08 0.02 1/101 30217
(and I've seen it get down to 0.0)
Before this change:
0.57 0.22 0.07 1/101 31920
0.48 0.21 0.07 1/101 31942
0.48 0.22 0.07 1/101 31964
0.48 0.23 0.08 1/101 31986
0.41 0.22 0.07 1/101 32008
0.42 0.23 0.08 1/101 32030
0.43 0.24 0.08 1/101 32054
0.45 0.25 0.09 1/101 32076
0.45 0.25 0.09 1/101 32098
0.46 0.26 0.10 1/101 32120
0.47 0.27 0.10 1/101 32172
0.39 0.26 0.10 1/101 32194
(closes issue DAHLIN-219)
Reported by: foxfire
Tested by: foxfire
Signed-off-by: Shaun Ruffell <sruffell@digium.com>
Origin: http://svnview.digium.com/svn/dahdi?view=rev&rev=9512
------------------------------------------------------------------------
http://svn.digium.com/view/dahdi?view=rev&revision=9683 |
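To compare the two sample runs quoted in the commit message, the first field of each /proc/loadavg line (the 1-minute average) can be averaged. This is a minimal sketch: the helper name is hypothetical, and the sample lines are copied from the commit message above.

```python
from statistics import mean


def one_minute_loads(text):
    """Extract the first field (1-minute load average) from /proc/loadavg lines."""
    return [float(line.split()[0]) for line in text.splitlines() if line.strip()]


WITH_CHANGE = """\
0.29 0.10 0.02 1/101 29945
0.24 0.10 0.02 1/101 29967
0.20 0.09 0.02 1/101 30019
0.17 0.09 0.02 1/101 30041
0.15 0.09 0.02 1/101 30062
0.12 0.08 0.02 1/101 30085
0.10 0.08 0.02 1/101 30107
0.09 0.08 0.02 1/101 30129
0.07 0.08 0.02 1/101 30151
0.14 0.09 0.02 1/101 30173
0.12 0.09 0.02 1/101 30195
0.10 0.08 0.02 1/101 30217
"""

BEFORE_CHANGE = """\
0.57 0.22 0.07 1/101 31920
0.48 0.21 0.07 1/101 31942
0.48 0.22 0.07 1/101 31964
0.48 0.23 0.08 1/101 31986
0.41 0.22 0.07 1/101 32008
0.42 0.23 0.08 1/101 32030
0.43 0.24 0.08 1/101 32054
0.45 0.25 0.09 1/101 32076
0.45 0.25 0.09 1/101 32098
0.46 0.26 0.10 1/101 32120
0.47 0.27 0.10 1/101 32172
0.39 0.26 0.10 1/101 32194
"""

# Mean 1-minute load over each two-minute sample run.
print("with change:  ", mean(one_minute_loads(WITH_CHANGE)))
print("before change:", mean(one_minute_loads(BEFORE_CHANGE)))
```

Both runs start right after the driver is loaded, so the averages only summarize these two short samples and say nothing about long-run behaviour.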