|Summary:||ASTERISK-07662: Used B channels restart failure after E1 span comes back up|
|Reporter:||Serge Kruppa (sergejf)||Labels:|
|Date Opened:||2006-09-01 19:15:08||Date Closed:||2007-05-24 15:12:37|
|Environment:||Attachments:||( 0) channelrestarts.txt|
( 1) pstn_trace.txt
( 2) traz.cap
|Description:||Configuration: Fedora Core 3 Linux server with Sangoma A101 E1 card connected to the PSTN in Mexico City using PRI ISDN signaling. Auto-dialer application communicating via AMI with Asterisk. Ten (10) E1 timeslots active (fractional E1). Software package version numbers:|
Scenario: the auto-dialer is dialing out to customers with 8 channels used (1 to 8) when the E1 goes down (carrier issue) for a second and then recovers. Asterisk (chan_zap.c) attempts to restart the B channels that were in use when the E1 problem occurred (1 to 8). The first B channel is restarted successfully. In the meantime the auto-dialer is not aware of the E1 having gone down and keeps on making outdial attempts. These attempts fail because Asterisk is "unable to request channel". This scenario carries on for hours after the E1 recovery, resulting in tens of thousands of failed outcalls. B channels 3 to 8 never get a restart acknowledge and remain unavailable.
Counter-scenario: launching a small test campaign, unplugging the E1 and then plugging it back does not allow us to reproduce the problem, i.e. the B channels in use are properly restarted. The error occurs only under heavy load.
****** ADDITIONAL INFORMATION ******
Sangoma tech support ruled out a driver-related issue:
"The Sangoma A10x cards are T1/E1 interface cards, i.e. they only perform layers 1 and 2 in your OSI layer model, so basically they act as a data pump between the E1 and Zaptel. From the A10x and Wanpipe drivers' point of view, it does not make any difference whether a channel is in use or not, or whether a call is up on a channel or not. The Wanpipe drivers only know whether the E1 (physical layer) is connected or if it is in RED alarm."
"If you were able to make even a single call after the E1 RED alarm went off, then it means that the A10x and Wanpipe drivers recovered properly so the issue would be in Asterisk/libpri."
Furthermore, Sangoma adds:
"I am suspecting this is a system load issue where asterisk did not have enough time to reset the b-channels before your dialer tried to dial out, or this could be an issue with your dialer using the b-channels while the PRI is recovering. Only unused b-channels are restarted by asterisk."
|Comments:||By: Serge Vecher (serge-v) 2006-09-05 09:05:37|
Well, let's get the obvious stuff out of the way. Update to the latest Asterisk/zaptel/libpri releases (1.2.11, 1.2.8, 1.2.3).
By: alyed (alyed) 2006-09-26 18:49:06
The versions were updated, but the problem appeared once again. Asterisk logs just show:
-No D-channels available! Using Primary channel 16 as D-channel anyway!
-E1 LOS alarm is ON
Afterwards the log file gets a lot of
-Unable to request channel ZAP/g1/12345678
since the dialer tries to send new calls but none is available.
By: Matt O'Gorman (mogorman) 2006-10-03 17:29:33
alyed, I have a feeling it could be your auto-dialer, but I do have a question: if you are not passing calls when the PRI goes down, do all the channels come back successfully?
By: Andre Courchesne (acourchesne) 2006-10-05 23:31:30
Possibly the same as http://bugs.digium.com/view.php?id=8069
By: alyed (alyed) 2006-10-06 11:52:27
If no call is sent through the E1 while the channels are restarted, they all come back successfully.
According to the description of bug 8069, they place more call files in the "outgoing" directory than the actual number of available channels. Hence some calls were queued by Asterisk. Whether or not this queueing was responsible for the blocking of channels is not clear to me.
Our dialer uses the Manager API to place the calls via the Originate action. Also, we don't try to place more calls than the channels available at a given time. This is both because we discovered that channels became blocked when this happened (just as in bug 8069) and because we leave 2 channels free to make transfers between the number dialed and an agent.
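As a rough illustration of the channel budgeting described above, the sketch below caps new Originate attempts so that two channels always stay free for transfers. The function name, its parameters and the constant are hypothetical and are not taken from the dialer itself.

```python
# Hypothetical helper: names and defaults are illustrative assumptions.
RESERVED_FOR_TRANSFERS = 2  # the comment above says 2 channels are kept free


def originate_budget(total_channels: int, busy_channels: int,
                     reserved: int = RESERVED_FOR_TRANSFERS) -> int:
    """Return how many new outbound calls may be started right now."""
    if total_channels < 0 or busy_channels < 0 or reserved < 0:
        raise ValueError("channel counts must be non-negative")
    free = total_channels - busy_channels
    # Never go below zero, even if more channels are busy than expected.
    return max(0, free - reserved)
```

Assuming all ten active timeslots from the description are usable B channels, 8 calls in progress leave a budget of 0, so the dialer would not place another call until one of them ends.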
By: jmls (jmls) 2006-11-05 12:51:44.000-0600
alyed: are you still having these issues ?
By: Serge Kruppa (sergejf) 2006-11-05 13:44:47.000-0600
The issue still exists. We will offer a bounty of US$200 for the bug to be fixed, to be paid after on-site verification by us. Please contact me at firstname.lastname@example.org if you want to discuss in more detail.
By: Fernando Romo (el_pop) 2006-11-30 22:04:53.000-0600
The same happened with Digium hardware. Is Maxcom the carrier? We found the same error with 8 E1s, but when the same machine is connected to Telmex (another carrier), the equipment works fine.
We use the Asterisk 1.2 branch from the latest SVN, the same for libpri and zaptel, and the Sangoma stable driver 2.3.4-0, although a newer one exists: ftp://ftp.sangoma.com/linux/current_wanpipe/wanpipe-2.3.4-2.tgz
We note a total talk loss when the "No D-channels available! Using Primary channel 16 as D-channel anyway!" message appears.
We checked the trace from the carrier and noted that "SABME" (set asynchronous balanced mode extended) frames arrive many times without waiting for the "UA" (unnumbered acknowledgement) response from Asterisk.
The "SABME" command is handled by libpri in "q921.c", which implements the layer 2 ISDN protocol. In our case the Maxcom carrier sends a channel reset request and Asterisk obeys the request from the PSTN. This behavior is correct in most cases, but Maxcom sends that request many times without waiting or keeping a timer.
We tried to make a workaround by modifying q921.c, but I don't know the possible consequences, because the method is right and it is Maxcom that sends a wrong signal. The possible patch is to reset only the available channels and not the busy ones.
I am trying to figure out why Maxcom requests a channel reset instead of a supervision request.
You can try setting "resetinterval" in zapata.conf to "600" to force zaptel to send the SABME supervision request more often than the PSTN and so avoid the PSTN request. This does not correct the error, but we noted that the time between failures is a little longer. (Thanks to Roman Torres for the suggestion.)
Keep the electrical ground common between your * box and the communication equipment. The lack of a common electrical ground causes erratic behavior and strange errors.
Can you send me more info about it?
In our experience this appears to be a PSTN error. Our equipment ran for one year with one customer; Maxcom only "changed the configuration" to bring in new E1s and we saw 3 days of a real nightmare, while with another carrier it works fine, so the PSTN has something wrong. Well... obviously the PSTN blames the PBX equipment, but in this case I think differently.
By: Fernando Romo (el_pop) 2006-11-30 22:24:12.000-0600
In zapata.conf we read:
; ISDN Timers
; All of the ISDN timers and counters that are used are configurable. Specify
; the timer name, and its value (in ms for timers).
; pritimer => t200,1000
; pritimer => t313,4000
q921.c uses the t200 and t203 timers; maybe some tests with these values could help:
pritimer => t203,2000
pritimer => t200,2000
But the only way is a "trial and error" cycle.
By: Fernando Romo (el_pop) 2006-12-01 00:51:25.000-0600
Right now we are testing with Maxcom (the PSTN), and they asked us to put these values in the ISDN timers:
pritimer => t200,1000
pritimer => t203,10000
The only value with a real increase is the t203 timer. We will report the results of this test tomorrow.
By: Fernando Romo (el_pop) 2006-12-01 10:16:19.000-0600
We have had 8 hours of operation without errors.
The "resetinterval" parameter in zapata.conf is at its default value of "3600", and we only added the "t203" timer with a value of "10000". This change applies only to the Maxcom carrier. Ask your PSTN for their t203 and t200 timer values and test.
We will keep testing for a couple of days.
By: Fernando Romo (el_pop) 2006-12-01 13:53:27.000-0600
We had one error before reaching 10 hours of operation. We will try setting the t203 timer much higher; according to the Cisco manuals the setting could be 30 seconds (30,000 ms).
By: Serge Vecher (serge-v) 2006-12-01 15:35:12.000-0600
el_pop: if you are using Digium hardware, why are you using Sangoma drivers?
By: Fernando Romo (el_pop) 2006-12-01 17:40:29.000-0600
I tested with both kinds of hardware. I use Sangoma, but for test purposes we used a Digium card.
Our setup has 2 PBXs, each with a Sangoma A104d (4 x E1), on 2 Dell PowerEdge 2800 servers with dual Xeon at 3.6 GHz, 2 GB of RAM and plenty of disk on hardware RAID 5.
In one set of tests, Roman Torres used an HP machine with a Digium card, with the same results. The problem is not the hardware: with 110 agents our equipment reaches 15.5% utilization, so the problem is not hardware capacity.
The problem occurred when Maxcom "reconfigured" their E1 services, and we noted strange behavior in the layer 2 and layer 3 ISDN control protocols. I suppose it is an ISDN timer problem. Our setup had been in operation for one year until the Maxcom "reconfiguration".
We have "mirror" installations with other customers without any kind of problem, using Axtel, Telmex, Avantel and Maxcom E1 links, but this case is something strange for us.
Who is the carrier with the problematic installation?
By: Fernando Romo (el_pop) 2006-12-01 17:55:03.000-0600
I checked the changes in libpri from one year ago and can't find much difference in this library, but I noticed something strange with Zaptel: if I do not declare each pritimer explicitly, the values are shown but not used.
Could it be a variable initialization problem in zaptel.c? I mean the default values of the timers.
I note that q921.c uses the t200 and t203 timers, while q931.c uses the t305, t308 and t313 ISDN timers; the former are for layer 2 and the latter for layer 3.
The "resetinterval" could be a problem with the carrier; I tried setting it to "never" instead of "3600", but after the reconfiguration we used the default values without problems.
And to complete the report: the alarms appear with or without traffic.
We use a dialer via AMI, with one year of operation without problems until now. We updated to the latest Asterisk, Zaptel, libpri and Sangoma drivers with the same result. But maybe looking at Zaptel could point us toward resolving this issue.
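Since undeclared timers are reported above as shown but not used, it can help to list which timers a configuration actually declares. The following is a minimal sketch that reads "pritimer => name,value" lines in the form quoted earlier in this thread and tags each timer with the layer mentioned above. The function name and the returned structure are hypothetical, and this is not how Asterisk itself parses zapata.conf.

```python
import re

# Layer assignment as stated in the comment above (q921.c vs q931.c).
TIMER_LAYER = {"t200": 2, "t203": 2, "t305": 3, "t308": 3, "t313": 3}

# Matches lines such as "pritimer => t203,10000" (value in milliseconds).
_PRITIMER = re.compile(r"^\s*pritimer\s*=>\s*(\w+)\s*,\s*(\d+)\s*$",
                       re.IGNORECASE)


def parse_pritimers(text: str) -> dict:
    """Collect explicitly declared timers; lines starting with ';' are skipped."""
    timers = {}
    for line in text.splitlines():
        if line.lstrip().startswith(";"):
            continue  # commented-out sample line, not an active setting
        match = _PRITIMER.match(line)
        if match:
            name = match.group(1).lower()
            timers[name] = {"ms": int(match.group(2)),
                            "layer": TIMER_LAYER.get(name)}
    return timers
```

Any timer missing from the result is not declared explicitly in the text that was parsed. Lines with a trailing inline comment are not matched by this simple pattern.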
By: Fernando Romo (el_pop) 2006-12-02 18:07:42.000-0600
Using measurement equipment we have a couple of traces of the call behavior. Layers 2 and 3 look OK in the call process, but we discovered something strange in the D-channel control supervision.
In the file "traz.cap" we checked a typical call and did not see anything strange, but in the trace from the PSTN equipment we discovered something wrong:
Asterisk sends a "Rqst Remove all TEI Values" or "Identity Remove Action Indic.: 127." message, alerting the PSTN to the channels' disconnection. The PSTN sends SABME messages to check the status of the channels and Asterisk doesn't respond to the SABME requests, so the PSTN orders the channels reset and Asterisk obeys according to the protocol. Details are in the file "pstn_trace.txt".
The people at Lucent Technologies helped us obtain this trace from the PSTN switch.
By: alyed (alyed) 2006-12-03 21:20:06.000-0600
>>"We note a total talk loss when the "No D-channels available! Using
>>Primary channel 16 as D-channel anyway!" appear"
This is not the case with this bug. Sorry if it was not stated clearly before, but calls in progress aren't lost. Furthermore, only those channels which the dialer tried to use while the E1 was down become blocked; all others remain fully usable after the E1's RED alarm goes off.
el_pop: many thanks for your interest, but your problem looks more like a configuration issue than a bug. It would be more suitable to discuss these problems/findings on the Asterisk users list, where you will find lots of nice people who can shed better light on your problem.
By: Fernando Romo (el_pop) 2006-12-04 09:08:42.000-0600
alyed: it is not a configuration issue. If you read the traces, Asterisk doesn't respond to SABME messages according to the q921 specification (according to Cisco manuals). If you think it is a user question, please show me the light of your wisdom here.
And I forgot: the D-channel warning is the first event, and then you get your "Unable to create channel"; it is not an isolated event, it is a "cause and effect" issue.
If you read the trace from the PSTN, Asterisk requests a disconnection and the PSTN sends SABME, but Asterisk responds 45 seconds later. The PSTN's timers expire within the 30-second time frame, and the UA response comes too late to avoid the PSTN's reset channel command. You can say it is a "resetinterval" parameter issue, but with values of "never", "3600" or "10000000000" the problem persists, so... something seems like a bug.
If you care to read the whole thread, you will see that at first I thought the same as you: a configuration issue. But the scenario put me on the track of a bug. What do you think?
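The timing argument above can be written out as a small check. This is only a sketch of the arithmetic, using the two figures quoted from the PSTN trace (a UA sent 45 seconds after the SABME, against a 30-second window on the PSTN side). The function and constant names are hypothetical.

```python
def ua_arrives_in_time(ua_delay_s: float, peer_window_s: float) -> bool:
    """True if the UA reaches the peer before its waiting window expires."""
    return ua_delay_s <= peer_window_s


UA_DELAY_S = 45.0     # delay of Asterisk's UA, as reported from the trace
PEER_WINDOW_S = 30.0  # time frame after which the PSTN timers expire

# How far past the peer's window the UA was sent.
late_by_s = UA_DELAY_S - PEER_WINDOW_S
in_time = ua_arrives_in_time(UA_DELAY_S, PEER_WINDOW_S)
```

With these figures the UA misses the window by 15 seconds, which matches the reset command that follows in the trace.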
By: hosin (hosin) 2006-12-20 18:51:06.000-0600
One tentative solution is to skip handling of the error condition in chan_zap.
In my case I frequently got very short (possibly fake) yellow alarms, so Asterisk restarted the channels every time even though the alarm did not actually affect the connection. The situation here seems to involve a red alarm, so it may not fall into this case, though...
BTW, my simple question is why no one is suggesting getting the Asterisk log of "pri debug span x" for Q931 and "pri intense debug span x" for Q931+Q921. Is it a matter of the lengthy log size?
As the last reporter suggests, it seems that Asterisk's layer 2 is not responding from the viewpoint of the carrier's switch (who put the analyzer in place to get traz.cap?), but what happens on the Asterisk side? Was the log acquired near the carrier's switch, with the red alarm condition physically present and disconnecting the link (so that a SABME originated by Asterisk is not recognized at the carrier's switch)?
By: Fernando Romo (el_pop) 2007-01-25 13:19:35.000-0600
OK, nobody sent info. We found a temporary workaround using the HDLC D-channel handling available on Sangoma cards.
We use a couple of Sangoma A104 cards (4 x E1) with the hardware HDLC D-channel handler feature, the latest Zaptel 1.2 branch and the trunk version of libpri. The result: the PSTN links have been up for more than a month without failure.
So the problem is in how zaptel/libpri deal with layer 2 signalling.
Hosin: I attached the trace file (you can see it in the "Issue History" section of each bug report); the people at Lucent/Avaya helped me trace from the PSTN and determine the gap in the SABME message exchange.