Summary: | DAHLIN-00268: Flag "stuck" channels as in-use and optionally reset firmware. | ||
Reporter: | Shaun Ruffell (sruffell) | Labels: | |
Date Opened: | 2011-11-30 13:24:12.000-0600 | Date Closed: | 2014-08-19 11:07:49 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | wctc4xxp |
Versions: | 2.5.0.2 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ( 0) 0001-wctc4xxp-Fail-gracefully-on-Failed-to-create-channel.patch ( 1) 0003-wctc4xxp-Fail-gracefully-on-Failed-to-create-channel.patch ( 2) kern.log | |
Description: | seanbright reported on IRC that the wctc4xxp driver can sometimes report: {noformat} [3133994.496258] wctc4xxp 0000:07:08.0: Failed to create channel in timeslot 52. Response from DTE was (ffbd). {noformat} When this happens, the card typically has that channel stuck and returns an error on any new channel creation until the driver is reloaded. While it is unclear how the firmware gets into this state, the driver could be more proactive about minimizing the impact on a running system by marking such channels as in use and not trying them again until after a reload, and optionally reloading once the channel use count has been at 0 for some specific amount of time. | ||
Comments: |
By: Shaun Ruffell (sruffell) 2012-07-05 16:15:00.480-0500 Today I heard from someone else who had this happen once on a system that has been running pretty solid.
By: Russ Meyerriecks (rmeyerriecks) 2012-08-15 15:03:32.755-0500 Attached potential fix for this issue. Needs testing.
By: Russ Meyerriecks (rmeyerriecks) 2012-09-21 13:29:41.410-0500 According to user report in dahdi-990 the fix does not work. Needs further development.
By: Russ Meyerriecks (rmeyerriecks) 2012-10-10 12:12:06.113-0500 Filed a bug/info request with Mindspeed regarding the 0xffbd (ERR_TDMDRV_INVTS) error we see occasionally. No support contract with Mindspeed, so this avenue is not likely to return much. But worth a shot.
By: Sean Bright (seanbright) 2013-02-13 06:13:14.790-0600 I'm curious what the user in dahdi-990 means when he says "the fix does not work." Is this patch safe to test?
By: Russ Meyerriecks (rmeyerriecks) 2013-02-13 10:46:20.427-0600 That patch should be safe to test. Customer was applying it incorrectly. I would actually really like to get some feedback for that patch.
By: Shaun Ruffell (sruffell) 2013-02-13 10:58:29.997-0600 I don't think the patch attached to this issue is correct. I think you wanted the one that defined the WEDGED state so that busy isn't cleared when the file descriptor is closed.
By: Sean Bright (seanbright) 2013-02-13 11:36:54.190-0600 Well, let me know which to test and I will roll it out to a few servers. This bug bit me yesterday.
By: Russ Meyerriecks (rmeyerriecks) 2013-02-13 11:48:04.301-0600 So the 0003 patch I just attached attempts to flag the "wedged" echo can timeslots so dahdi will no longer attempt to use them. This is sort of a workaround instead of a permanent fix, since the wedged timeslot will no longer be useful until the driver is reloaded, but it should theoretically keep dahdi from becoming hung up.
By: Sean Bright (seanbright) 2013-02-13 11:52:05.580-0600 We unload and reload the driver each night, so this will help us get through the day. I will roll it out tonight.
By: Russ Meyerriecks (rmeyerriecks) 2013-02-13 11:54:56.250-0600 Ugh, I hate to hear that. Are there reasons that you reload the driver every night other than this problem?
By: Sean Bright (seanbright) 2013-02-13 11:58:11.890-0600 No, that's it. I haven't been tracking how often this happens, but because we reload the driver each night, it seems to happen infrequently. This is the first time it's happened in a few months. Without info from Mindspeed it doesn't appear that this will ever be "fixed," right?
By: Sean Bright (seanbright) 2013-02-14 06:08:40.348-0600 This is now deployed on 3 servers. I'll let you know what happens over the next few weeks.
By: Sean Bright (seanbright) 2013-03-20 13:34:32.804-0500 Finally saw some failures. It doesn't seem to have "fixed" the problem. We still ended up having to bounce the machine. Log file attached.
By: Russ Meyerriecks (rmeyerriecks) 2013-03-25 10:20:16.822-0500 Sean, is that log output a dump from dmesg or the log file?
By: Sean Bright (seanbright) 2013-03-25 10:27:56.724-0500 /var/log/syslog - Sorry, I should have named it appropriately.
By: Shaun Ruffell (sruffell) 2014-08-19 11:07:50.100-0500 With the changes in DAHDI-Linux 2.9.2, I would be surprised if this is still an issue. Specifically, the wctc4xxp driver will now reset the firmware either immediately if a critical alert is received, or when all channels are closed if a non-critical alert is received. |
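For readers following the thread, the two ideas discussed above (a WEDGED state that is not cleared when a channel is closed, and the 2.9.2 reset policy of resetting immediately on a critical alert or once all channels are closed on a non-critical alert) can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the wctc4xxp driver code: every name here (`struct transcoder`, `alloc_slot`, `mark_wedged`, `release_slot`, `raise_alert`, `NUM_TIMESLOTS`) is hypothetical, and the real driver's data structures and locking are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_TIMESLOTS 8 /* hypothetical size, not taken from the driver */

enum slot_state { SLOT_FREE, SLOT_BUSY, SLOT_WEDGED };
enum alert_level { ALERT_NONE, ALERT_NONCRITICAL, ALERT_CRITICAL };

/* Hypothetical model of one transcoder card. */
struct transcoder {
    enum slot_state slots[NUM_TIMESLOTS];
    enum alert_level pending; /* alert waiting for all channels to close */
    unsigned int resets;      /* number of simulated firmware resets */
};

/* Count channels that are actually open; wedged slots do not count. */
static size_t open_count(const struct transcoder *tc)
{
    size_t n = 0;
    for (size_t i = 0; i < NUM_TIMESLOTS; i++)
        if (tc->slots[i] == SLOT_BUSY)
            n++;
    return n;
}

/* Simulated firmware reset: every slot, including wedged ones, is usable again. */
static void reset_firmware(struct transcoder *tc)
{
    for (size_t i = 0; i < NUM_TIMESLOTS; i++)
        tc->slots[i] = SLOT_FREE;
    tc->pending = ALERT_NONE;
    tc->resets++;
}

/* Hand out the first free slot, skipping busy and wedged ones; -1 if none. */
static int alloc_slot(struct transcoder *tc)
{
    for (size_t i = 0; i < NUM_TIMESLOTS; i++) {
        if (tc->slots[i] == SLOT_FREE) {
            tc->slots[i] = SLOT_BUSY;
            return (int)i;
        }
    }
    return -1;
}

/* Called when channel creation fails on a slot (e.g. a 0xffbd response). */
static void mark_wedged(struct transcoder *tc, int ts)
{
    assert(ts >= 0 && ts < NUM_TIMESLOTS);
    tc->slots[ts] = SLOT_WEDGED;
}

/* Closing a channel frees a busy slot but never clears WEDGED. */
static void release_slot(struct transcoder *tc, int ts)
{
    assert(ts >= 0 && ts < NUM_TIMESLOTS);
    if (tc->slots[ts] == SLOT_BUSY)
        tc->slots[ts] = SLOT_FREE;
    /* A deferred non-critical alert is serviced once nothing is open. */
    if (tc->pending == ALERT_NONCRITICAL && open_count(tc) == 0)
        reset_firmware(tc);
}

/* Critical alerts reset at once; non-critical ones wait for idle. */
static void raise_alert(struct transcoder *tc, enum alert_level level)
{
    if (level == ALERT_CRITICAL) {
        reset_firmware(tc);
    } else if (level == ALERT_NONCRITICAL) {
        tc->pending = ALERT_NONCRITICAL;
        if (open_count(tc) == 0)
            reset_firmware(tc);
    }
}
```

In this model a wedged timeslot stays out of rotation across close calls, which is the behavior the 0003 patch was aiming for, and it only becomes usable again after a reset. Excluding wedged slots from `open_count` is a design choice of the sketch so that a stuck slot cannot block the deferred reset indefinitely.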