Summary: ASTERISK-12120: [patch] Failure of resetting of a PRI B-Channel causes deadlock in process
Reporter: Edwin Groothuis (mavetju)
Labels:
Date Opened: 2008-05-31 01:31:31
Date Closed: 2011-06-07 14:03:12
Priority: Major
Regression? No
Status: Closed/Complete
Components: Channels/chan_zap
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) 20080710__bug12766.diff.txt
(1) patch-deadlock-channels.txt
Description: By default, Asterisk resets the channels of a PRI once every hour. The procedure is sequential: reset the first channel, wait for the reset to be acknowledged, then reset the next channel, and so on until all channels are done. This works fine as long as every reset is acknowledged, but when an acknowledgement isn't received, the procedure never advances to the next channel and the whole system is deadlocked for that PRI.
Comments:

By: Edwin Groothuis (mavetju) 2008-05-31 01:41:06

To overcome this deadlock I have made the following changes:

- Added the field "resettimeout" to the struct pri.
- Changed the type of the resetting field in struct zt_pvt from int:1 to time_t.
- In the function pri_check_restart(), pri->pvt->resetting now stores the time at which the reset was started instead of a flag.
- In the function pri_dchannel(), where it checks whether the PRI needs to be restarted, check whether the time elapsed since the reset of the last channel is bigger than resettimeout. If so, print a warning, stop waiting for the reset of the current channel and reset the next channel.
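
The check described in the last item could be sketched roughly as follows. This is not the attached patch: the sketch_* structures and the function name are hypothetical, simplified stand-ins for the real chan_zap structures, and only the fields named above (resetting, resettimeout) are taken from this report. The resetpos index is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for struct zt_pvt. */
struct sketch_pvt {
    time_t resetting;        /* 0 = idle, otherwise the time the reset was sent */
};

/* Hypothetical, simplified stand-in for the PRI structure. */
struct sketch_pri {
    int resetpos;            /* index of the channel being reset, -1 if none (assumed) */
    int numchans;            /* number of channels on the span */
    int resettimeout;        /* seconds to wait for an acknowledgement */
    struct sketch_pvt *pvts; /* array of numchans channels */
};

/*
 * Returns 1 if the pending reset was abandoned because no acknowledgement
 * arrived within resettimeout seconds, 0 if there is nothing to do yet.
 * On a timeout the walk moves on to the next channel instead of stalling.
 */
static int sketch_check_reset_timeout(struct sketch_pri *pri, time_t now)
{
    struct sketch_pvt *p;

    assert(pri != NULL && pri->pvts != NULL);

    if (pri->resetpos < 0 || pri->resetpos >= pri->numchans)
        return 0;            /* no reset in progress */

    p = &pri->pvts[pri->resetpos];
    if (p->resetting == 0)
        return 0;            /* this channel is not waiting for an ack */

    if (difftime(now, p->resetting) <= (double)pri->resettimeout)
        return 0;            /* still within the allowed waiting time */

    /* Timed out: give up on this channel and continue with the next one. */
    p->resetting = 0;
    pri->resetpos++;
    if (pri->resetpos >= pri->numchans)
        pri->resetpos = -1;  /* the walk over the span is finished */
    else
        pri->pvts[pri->resetpos].resetting = now;
    return 1;
}
```

A caller would log the warning whenever the function returns 1. The point of the sketch is only that a timestamp, unlike a one-bit flag, lets the loop notice that it has waited too long.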

By: Edwin Groothuis (mavetju) 2008-05-31 01:47:05

This is what I get now:

   -- B-channel 0/7 successfully restarted on span 2
   -- B-channel 0/8 successfully restarted on span 2
   -- B-channel 0/9 successfully restarted on span 2
   -- B-channel 0/10 successfully restarted on span 2
   -- Span 2 - Recover from deadlock on reset of B-channel 11
   -- B-channel 0/12 successfully restarted on span 2
   -- B-channel 0/13 successfully restarted on span 2
   -- B-channel 0/14 successfully restarted on span 2

Please note that this doesn't resolve the issue with busy-backwards blocked channels, but at least it shows that there is something wrong.



By: Edwin Groothuis (mavetju) 2008-05-31 01:51:55

This is the change in /etc/asterisk/zapata.conf:

;
; PRI resetinterval: sets the time in seconds between restart of unused
; channels, defaults to 3600; minimum 60 seconds.  Some PBXs don't like
; channel restarts. so set the interval to a very long interval e.g. 100000000
; or 'never' to disable *entirely*.
+; PRI resettimeout: sets the time Asterisk will wait for an acknowledgement
+; on a restart of an unused channel, defaults to 3.
;
;resetinterval = 3600
+;resettimeout = 3
;
;

By: Edwin Groothuis (mavetju) 2008-05-31 04:33:15

According to Figure A.4/Q.931 of the Q.931 specification, this should be handled by the T316 timer, which is optional and isn't implemented in Asterisk. The suggested timeout is 2 minutes. See chapter 5.5 of the Q.931 specification.

By: Tilghman Lesher (tilghman) 2008-06-17 19:40:51

I need a couple of changes before this can go in:

1) In the first hunk of the patch, the comment on resettimeout is exactly the same as on resetinterval.  This needs to be changed to distinguish the two values.

2) You're using time() to get when the reset started.  Due to timing, that means that a value of 3 for the resettimeout could wait as little as 2.001 seconds before the timeout will fire, rather than a full 3 seconds.  I'd prefer if you switched this to using 'struct timeval' and ast_tvnow() (with ast_tvdiff_ms() to calculate the difference).
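
The granularity problem in point 2 can be illustrated with a small standalone sketch. It does not use the Asterisk headers: sketch_tvdiff_ms() is a hypothetical helper written here to play the role of ast_tvdiff_ms(), and the two comparison functions are illustrative only. The whole-second variant assumes the timeout fires once the difference in seconds reaches the configured value.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical stand-in for ast_tvdiff_ms(): end - start in milliseconds. */
static int64_t sketch_tvdiff_ms(struct timeval end, struct timeval start)
{
    return ((int64_t)end.tv_sec - (int64_t)start.tv_sec) * 1000
         + ((int64_t)end.tv_usec - (int64_t)start.tv_usec) / 1000;
}

/* Whole-second comparison, which is all that time() can offer. */
static int sketch_timed_out_seconds(struct timeval start, struct timeval now,
                                    int timeout_s)
{
    assert(timeout_s >= 0);
    return (now.tv_sec - start.tv_sec) >= timeout_s;
}

/* Millisecond comparison based on struct timeval. */
static int sketch_timed_out_ms(struct timeval start, struct timeval now,
                               int timeout_s)
{
    assert(timeout_s >= 0);
    return sketch_tvdiff_ms(now, start) >= (int64_t)timeout_s * 1000;
}
```

With a reset started at 10.999 s and a check at 13.000 s, only 2001 ms have passed. The whole-second comparison sees 13 - 10 = 3 and already reports a timeout for resettimeout = 3, while the millisecond comparison keeps waiting.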

3) Instead of calling it 'recover from deadlock', I'd like to see something more along the lines of 'provider did not acknowledge reset -- channel may not get used'.

By: Tilghman Lesher (tilghman) 2008-07-02 16:42:44

mavetju: ping.  Any progress on this?

By: Tilghman Lesher (tilghman) 2008-07-10 20:04:43

Since there was no response, I went ahead and updated the patch.  This will need some testing that I cannot do very well here, as I have no PRI.

By: Tilghman Lesher (tilghman) 2008-08-06 12:44:52

mavetju: ping

By: Edwin Groothuis (mavetju) 2008-08-06 16:29:49

Due to a change in jobs I have lost access to all my Asterisk infrastructure and am thus not able to do any testing or approvals. Sorry.

By: Tilghman Lesher (tilghman) 2008-09-17 16:51:53

Since there is no further demand for this change, I am suspending it.