DAHLIN-00271: MG2 echo canceller leads to severe IRQ misses

Summary: DAHLIN-00271: MG2 echo canceller leads to severe IRQ misses

Reporter: Birger "WIMPy" Harzenetter (wimpy)
Labels:
Date Opened: 2011-12-13 04:18:54.000-0600
Date Closed: 2011-12-15 12:15:06.000-0600
Priority: Critical
Regression? No
Status: Closed/Complete
Components: General
Versions: 2.6.0
Frequency of Occurrence:
Related Issues: is related to DAHLIN-00270 VPM400 no longer recognised
Environment: Digium PRI cards
Attachments:

Description: Having MG2 enabled causes severe IRQ misses for the PRI card.
I'm using an old dual PIII-1266 for testing. Not the fastest, but should be fast enough for that.
With MG2 enabled only about 4 channels are usable. Thereafter sound deteriorates and HDLC aborts appear.
Above 30 channels the whole thing goes completely belly up, dropping the interface.
I managed to keep it running for about a minute with 40 channels: load 0.01, about 98% idle, but dahdi_test goes well below 80%.
Without MG2 there are no issues at all with 60 channels in use.
There must be something horribly wrong.

Comments: By: Shaun Ruffell (sruffell) 2011-12-13 16:36:36.263-0600

The load you see is misleading. Only relatively recent kernels (2.6.39 since commit [abb74ce|http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=abb74cefa9c682fb38ba86c17ca3c86fed6cc464]) can do a good job of tracking the time spent in the interrupt handler, which is where software echocan occurs.

If you're able to rebuild your kernel on the system you're checking on, you can enable CONFIG_IRQ_TIME_ACCOUNTING and you will truly be able to see the time taken in the interrupt handler as you enable the software echocan on each additional channel.
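
As a rough illustration of the measurement suggested above, the sketch below samples the aggregate "cpu" line of /proc/stat twice and reports the share of time attributed to hardware interrupts. The field order (user, nice, system, idle, iowait, irq, softirq, steal) is the standard /proc/stat layout. The helper names and the five-second interval are assumptions made for this example, and on kernels without CONFIG_IRQ_TIME_ACCOUNTING the irq field is likely to under-report.

```python
import time


def parse_cpu_line(line):
    """Return the numeric fields of the aggregate 'cpu' line from /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def hardirq_share(before, after):
    """Percentage of elapsed CPU time spent in hardirq between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Sum user..steal only; guest time is already included in user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Index 5 is the 'irq' (hardware interrupt) counter.
    return 100.0 * deltas[5] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as stat_file:
        return stat_file.readline()


if __name__ == "__main__":
    start = read_cpu_line()
    time.sleep(5)  # hypothetical interval; add channels between runs and compare
    end = read_cpu_line()
    print(f"hardirq: {hardirq_share(start, end):.1f}%")
```

Running it once per channel count (for example at 4, 20 and 40 active channels) should show the hardirq share growing with each channel that has the software echocan enabled.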

So your results, on that CPU, are completely in line with what I would expect without knowing anything more about the system.

By: Shaun Ruffell (sruffell) 2011-12-13 16:38:40.752-0600

Reopening since hopefully you can change the kernel and report the results.

By: Birger "WIMPy" Harzenetter (wimpy) 2011-12-13 17:26:37.095-0600

I'm building 2.6.39.4.

But computationally intensive operations certainly must never be done within an interrupt handler. So the general issue stands.

By: Shaun Ruffell (sruffell) 2011-12-13 17:51:50.212-0600

Yes, I agree that echo cancelation (and mixing) should not be done in the interrupt handler, but those are architectural decisions that were made from the very beginning of Zaptel's days. I have some ideas about how to move that processing out of the interrupt handler but they will require fundamental changes to the data flow through the drivers.

In case you're curious, my understanding of why things are this way is that the interrupt handler is where channels can be naively bridged (with echocan) with the least amount of added latency / overhead. If you're making a virtual TDM bus in software you don't want to introduce any extra latency that you do not have to.

It is also the place "closest" to the actual hardware so that if you are not queuing up all samples going to the PSTN you have the best chance of having a good reference signal for the echocan.

By: Birger "WIMPy" Harzenetter (wimpy) 2011-12-13 19:18:59.237-0600

I can perfectly see why bridging is desirable in the interrupt handler. That makes perfect sense to me, but DSP functions are worlds away from that.

The upgrade from 2.6.38 to 2.6.39.4 seems to have made the situation considerably better. (Maybe because I left out some unnecessary bits?)

21 channels produce good audio now. At 31 channels audio is very distorted.

Status line from top and output of dahdi_test -c 24:

21 ch:
Cpu(s): 0.2%us, 0.1%sy, 0.0%ni, 74.3%id, 0.0%wa, 25.3%hi, 0.0%si, 0.0%st
Best: 99.995% -- Worst: 99.929% -- Average: 99.964994%
31 ch:
Cpu(s): 0.5%us, 0.4%sy, 0.0%ni, 68.5%id, 0.0%wa, 30.5%hi, 0.0%si, 0.0%st
Best: 99.997% -- Worst: 99.726% -- Average: 99.925618%
41 ch:
Cpu(s): 0.7%us, 0.5%sy, 0.0%ni, 51.7%id, 0.0%wa, 47.1%hi, 0.0%si, 0.0%st
Best: 99.971% -- Worst: 50.044% -- Average: 95.842440%
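
To put the three status lines above on a common scale, the sketch below divides the reported hardirq percentage by the channel count. It also converts the figure to a single-CPU equivalent, under the assumption that the Cpu(s) summary line of top averages over both CPUs of the dual PIII while the handler for one interrupt line runs on only one CPU at a time. The function names are made up for the example; the input strings are copied from the output above.

```python
import re

# Status lines from top as quoted above, keyed by active channel count.
SAMPLES = {
    21: "Cpu(s): 0.2%us, 0.1%sy, 0.0%ni, 74.3%id, 0.0%wa, 25.3%hi, 0.0%si, 0.0%st",
    31: "Cpu(s): 0.5%us, 0.4%sy, 0.0%ni, 68.5%id, 0.0%wa, 30.5%hi, 0.0%si, 0.0%st",
    41: "Cpu(s): 0.7%us, 0.5%sy, 0.0%ni, 51.7%id, 0.0%wa, 47.1%hi, 0.0%si, 0.0%st",
}


def hardirq_percent(status_line):
    """Extract the %hi (hardware interrupt) value from a top status line."""
    match = re.search(r"([\d.]+)%hi", status_line)
    if match is None:
        raise ValueError("no %hi field found")
    return float(match.group(1))


def per_channel_cost(samples):
    """Aggregate hardirq percentage divided by the number of channels."""
    return {ch: hardirq_percent(line) / ch for ch, line in samples.items()}


def single_cpu_equivalent(aggregate_percent, ncpus=2):
    """Scale an all-CPU average to the load on the one CPU doing the work.

    Assumption: the interrupt is serviced by a single CPU at any moment.
    """
    return aggregate_percent * ncpus


if __name__ == "__main__":
    for ch, cost in sorted(per_channel_cost(SAMPLES).items()):
        hi = hardirq_percent(SAMPLES[ch])
        print(f"{ch} ch: {cost:.2f}% per channel, "
              f"{single_cpu_equivalent(hi):.1f}% of one CPU")
```

If the assumption holds, the 41-channel run is already close to saturating one CPU even though top still reports about half the machine as idle.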

When playing with smp_affinity to route the wct4xxp interrupt to CPU1 and all other interrupts to CPU0, the situation gets a little better again. With 31 channels you can at least recognize what you're hearing.

21 ch:
Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 77.4%id, 0.0%wa, 22.4%hi, 0.0%si, 0.0%st
Best: 99.993% -- Worst: 99.937% -- Average: 99.972533%
31 ch:
Cpu(s): 0.3%us, 0.2%sy, 0.0%ni, 69.8%id, 0.0%wa, 29.7%hi, 0.0%si, 0.0%st
Best: 99.990% -- Worst: 99.944% -- Average: 99.971516%
41 ch:
Cpu(s): 0.4%us, 0.7%sy, 0.0%ni, 60.7%id, 0.0%wa, 38.1%hi, 0.0%si, 0.0%st
Best: 99.995% -- Worst: 99.309% -- Average: 99.930586%

Even though the 41ch test doesn't look bad, audio is very distorted.
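
The smp_affinity routing described in this comment can be sketched roughly as follows. /proc/interrupts and /proc/irq/<n>/smp_affinity are standard kernel interfaces and the latter takes a hexadecimal CPU bitmask. The helper names are hypothetical, and the sketch assumes the card shows up in /proc/interrupts under a name containing "wct4xxp". Writing the mask requires root, so the example only prints what it would write.

```python
def affinity_mask(cpus):
    """Hex bitmask for a list of CPU numbers (bit n set means CPU n).

    Only suitable for small CPU counts; large systems use comma-separated
    32-bit groups, which this sketch does not produce.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def irqs_for(name, interrupts_text):
    """IRQ numbers whose /proc/interrupts line mentions the given name."""
    irqs = []
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        # Skip the CPU header and non-numeric rows such as NMI: or LOC:.
        if sep and label.strip().isdigit() and name in rest:
            irqs.append(int(label.strip()))
    return irqs


def set_irq_affinity(irq, cpus):
    """Write the mask to /proc/irq/<irq>/smp_affinity (needs root)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as affinity_file:
        affinity_file.write(affinity_mask(cpus))


if __name__ == "__main__":
    with open("/proc/interrupts") as interrupts_file:
        text = interrupts_file.read()
    for irq in irqs_for("wct4xxp", text):
        # Dry run: CPU1 only for the card, as in the test described above.
        print(f"would write {affinity_mask([1])} to /proc/irq/{irq}/smp_affinity")
```

Pinning every other interrupt to CPU0 would use the mask for CPU 0 in the same way. Note that an irqbalance daemon, if running, may overwrite these settings.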

By: Shaun Ruffell (sruffell) 2011-12-15 12:15:06.960-0600

Feel free to reopen this if you believe I'm closing in error, but I don't see anything to do here that doesn't involve major architectural changes to the drivers. This is just a result of architectural decisions to allow echo cancelation while bridging channels with the least amount of added latency.

If you're interested in working on that and would like to discuss thoughts about why/what/how, we can discuss on the #asterisk-dev mailing list.

By: Birger "WIMPy" Harzenetter (wimpy) 2011-12-18 14:03:25.254-0600

Out of interest I re-tried with a dual HFC-E1 card and mISDN.
Although I'm not sure how fair the comparison between MG2 and OSLEC is, the result was quite similar, just with most of the load shifted from hardware interrupts to system time.
However, mISDN was at least able to keep the signalling stable at all times.

The bad thing is that neither architecture seems to be able to make use of the 2nd CPU.