Summary: | DAHLIN-00201: [patch] oops after driver is removed | ||
Reporter: | Tzafrir Cohen (tzafrir) | Labels: | |
Date Opened: | 2010-07-18 11:37:51 | Date Closed: | 2011-01-20 23:28:30.000-0600 |
Priority: | Minor | Regression? | No |
Status: | Closed/Complete | Components: | General |
Versions: | 2.3.0.1 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ( 0) 0001-dahdi-Better-tolerate-surprise-removal-of-channels.patch ( 1) 0002-dahdi-Sleep-a-bit-before-returning-ENODEV-from-read-.patch ( 2) 20100830__issue17669.diff.txt | |
Description: | The 'bind' and 'unbind' properties allow (un)binding a device to a driver at runtime: zavadi:~# dahdi_hardware pci:0000:00:09.0 wcb4xxp+ 1397:08b4 Junghanns QuadBRI ISDN card zavadi:~# echo -n 0000:00:09.0 >/sys/bus/pci/drivers/wcb4xxp/unbind zavadi:~# dahdi_hardware pci:0000:00:09.0 wcb4xxp- 1397:08b4 Junghanns QuadBRI ISDN card zavadi:~# echo -n 0000:00:09.0 >/sys/bus/pci/drivers/wcb4xxp/bind zavadi:~# dahdi_hardware pci:0000:00:09.0 wcb4xxp+ 1397:08b4 Junghanns QuadBRI ISDN card However if any of the channels used by the device had a channel that was open at unbind time, userspace is left with invalid file handles. You'll probably soon get something along the lines of: wcb4xxp 0000:00:09.0: Driver unloaded. BUG: unable to handle kernel NULL pointer dereference at 00000080 IP: [<cc98c11e>] :dahdi:dahdi_chan_ioctl+0x18/0x819 *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: dahdi_echocan_oslec echo ipv6 xpp_usb xpp wctdm wct4xxp firmware_class loop parport_pc parport snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_midi snd_seq_midi_event serio_raw pcspkr snd_rawmidi snd_seq snd_timer snd_seq_device psmouse snd i2c_viapro i2c_core soundcore button wcb4xxp dahdi crc_ccitt shpchp pci_hotplug via_agp agpgart evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom sd_mod via82cxxx ide_pci_generic ide_core sata_via floppy via_rhine mii ehci_hcd ata_generic uhci_hcd usbcore libata scsi_mod dock thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] Pid: 2827, comm: asterisk Not tainted (2.6.26-2-686 #1) EIP: 0060:[<cc98c11e>] EFLAGS: 00210286 CPU: 0 EIP is at dahdi_chan_ioctl+0x18/0x819 [dahdi] EAX: ffffffff EBX: b29608d8 ECX: b29608d8 EDX: 8004da08 ESI: 8004da08 EDI: c87d2280 EBP: 00000000 ESP: c629de24 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process asterisk (pid: 2827, ti=c629c000 task=cb4a6660 task.ti=c629c000) Stack: 00000000 00000000 00000000 00000000 00000000 00000000 8004da08 000000fe c87d2280 8004da08 cc98e068 ffffffff 00000009 b29608d8 b2960886 00000005 00000000 ffffffff ffffffff c1188fc0 ca3574e0 00000001 ca3574e8 c011be2d Call Trace: [<cc98e068>] dahdi_ioctl+0x1749/0x17e9 [dahdi] [<c011be2d>] push_rt_task+0xff/0x1ac [<c011bee4>] push_rt_tasks+0xa/0x10 [<c011b733>] try_to_wake_up+0xe8/ [<c0118511>] __wake_up_common+0x2 [<c011a6fb>] __wake_up+0x29/0x39 [<c013ab0a>] wake_futex+0x1c/0x26 [<c013baa2>] do_futex+0x303/0x70e [<c01dd39a>] __next_cpu+0x12/0x21 [<c011a0d3>] find_lowest_rq+0xb0/ [<cc98c91f>] dahdi_ioctl+0x0/0x17 [<c017e83c>] vfs_ioctl+0x1c/0x5d [<c017eac7>] do_vfs_ioctl+0x24a/0 [<c017eb1f>] sys_ioctl+0x41/0x5a [<c0103857>] sysenter_past_esp+0x [<c02b0000>] quirk_ali7101_acpi+0 ======================= Code: 00 89 95 98 05 00 00 31 f6 83 c4 60 89 f0 5b 5e 5f 5d c3 55 57 89 c7 56 89 d6 53 89 cb 83 ec 18 8b 44 24 2c 8b 2c 85 60 09 9b cc <83> bd 80 00 00 00 00 75 0f ba eb 14 00 00 b8 99 fe 98 cc e8 bd EIP: [<cc98c11e>] dahdi_chan_ioctl+0x18/0x819 [dahdi] SS:ESP 0068:c629de24 ---[ end trace 4cc6bfb9265c52fd ]--- Only tested with wcb4xxp, and not the latest, but I suspect it has not changed recently. | ||
Comments: | By: Shaun Ruffell (sruffell) 2010-08-01 14:21:48 Taking tzafrir's advice, I looked at this issue a bit yesterday since I am of the current belief that this issue is going to be directly related to rooting all the spans in the device tree under the actual device that is implementing the span (pci_dev, usb_dev or /sys/devices/virtual/dahdi as the case may be). I'm still looking at this, but a first cut at embedding a "struct device *parent" in the span, and then performing a "device_get" on channel open still didn't prevent the release method from being called when the the driver was unbinding from the device, which isn't what I would have expected. hmm.... By: Shaun Ruffell (sruffell) 2010-08-01 18:46:55 After digging into this some more, I noticed that the pci_dev->remove is *not* the function that is called when the embedded "struct device" ref count goes to 0, but is instead part of pci hotplug called at unbind time regardless (pci_dev has it's own remove along with struct device) So, what I'm thinking needs to happen now in order to handle surprise removal is that the callbacks in the file_operations need to check if the underlying channel has been removed from the system and error out instead of trying to keep the memory backing he device around. By: Shaun Ruffell (sruffell) 2010-08-04 12:32:50 Uploaded the patch that also sits out at http://github.com/sruffell/dahdi-linux/tree/issue-17669. I think this is the right way to go...comments? By: Tzafrir Cohen (tzafrir) 2010-08-04 14:19:03 Looks like the right direction. However here's what I see: (a system with an Astribank and a PCI card. Configured. Running Asterisk) * Once I unbind the PCI card, the load on the system jumps. I suspect this is a read loop in Asterisk. But ATM the computer becomes unresponsive too quickly. * If I unplug the Astribank, I get an immdiate oops for a NULL pointer dereference. The trace is something with dahdi_specfile_release. Looking into it now. By: Tzafrir Cohen (tzafrir) 2010-08-04 14:47:39 The xpp disconnecting issue is a separate regression. It has existed before this patch. By the looks of it: seems to be related to the dereferencing chan->span->ops->close . Aug 4 22:42:16 zavadi kernel: [ 382.330968] BUG: unable to handle kernel NULL pointer dereference at 0000016c Aug 4 22:42:16 zavadi kernel: [ 382.330979] IP: [<ccb1301d>] :dahdi:dahdi_specchan_release+0x5b/0x7f Aug 4 22:42:16 zavadi kernel: [ 382.330997] *pde = 00000000 Aug 4 22:42:16 zavadi kernel: [ 382.331007] Oops: 0000 [#1] SMP Aug 4 22:42:16 zavadi kernel: [ 382.331013] Modules linked in: xpp_usb xpd_bri xpd_pri xpd_fxo xpd_fxs xpp wcb4xxp dahdi_echocan_mg2 dahdi ipv6 echo firmware_class loop parport_pc parport snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_midi snd_seq_midi_event serio_raw snd_rawmidi snd_seq snd_timer snd_seq_device pcspkr psmouse i2c_viapro snd i2c_core soundcore button crc_ccitt shpchp pci_hotplug via_agp agpgart evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom sd_mod via82cxxx ide_pci_generic ide_core sata_via floppy via_rhine mii ehci_hcd ata_generic uhci_hcd usbcore libata scsi_mod dock thermal processor fan thermal_sys [last unloaded: dahdi] Aug 4 22:42:16 zavadi kernel: [ 382.331102] Aug 4 22:42:16 zavadi kernel: [ 382.331107] Pid: 4364, comm: asterisk Not tainted (2.6.26-2-686 #1) Aug 4 22:42:16 zavadi kernel: [ 382.331113] EIP: 0060:[<ccb1301d>] EFLAGS: 00210246 CPU: 0 Aug 4 22:42:16 zavadi kernel: [ 382.331125] EIP is at dahdi_specchan_release+0x5b/0x7f [dahdi] Aug 4 22:42:16 zavadi kernel: [ 382.331131] EAX: 00000000 EBX: ca574000 ECX: cb401100 EDX: 00000001 Aug 4 22:42:16 zavadi kernel: [ 382.331137] ESI: 00000000 EDI: 0000002c EBP: cb0f1d74 ESP: ca405f74 Aug 4 22:42:16 zavadi kernel: [ 382.331142] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Aug 4 22:42:16 zavadi kernel: [ 382.331149] Process asterisk (pid: 4364, ti=ca404000 task=ca346660 task.ti=ca404000) Aug 4 22:42:16 zavadi kernel: [ 382.331155] Stack: 00000008 ca8c4480 cad3e790 c01754ff cb41de00 ca8c4480 ca31a180 00000000 Aug 4 22:42:16 zavadi kernel: [ 382.331170] ca404000 c0172ec9 ca31a180 00000030 b551fe5c c0174003 00000030 0842de50 Aug 4 22:42:16 zavadi kernel: [ 382.331182] c0103857 00000030 00000001 b647f2f4 0842de50 b551fe5c b551fe48 00000006 Aug 4 22:42:16 zavadi kernel: [ 382.331195] Call Trace: Aug 4 22:42:16 zavadi kernel: [ 382.331209] [<c01754ff>] __fput+0x8a/0x135 Aug 4 22:42:16 zavadi kernel: [ 382.331233] [<c0172ec9>] filp_close+0x4d/0x53 Aug 4 22:42:16 zavadi kernel: [ 382.331248] [<c0174003>] sys_close+0x5b/0x8d Aug 4 22:42:16 zavadi kernel: [ 382.331259] [<c0103857>] sysenter_past_esp+0x78/0xb1 Aug 4 22:42:16 zavadi kernel: [ 382.331299] ======================= Aug 4 22:42:16 zavadi kernel: [ 382.331303] Code: f3 89 d8 e8 5e fd ff ff 8b 83 b8 00 00 00 85 c0 74 26 8b 80 6c 01 00 00 8b 50 24 85 d2 74 06 89 d8 ff d2 89 c6 8b 83 b8 00 00 00 <8b> 80 6c 01 00 00 8b 00 e8 63 ad 62 f3 8b 04 bd c0 ea b3 cc 85 Aug 4 22:42:16 zavadi kernel: [ 382.331354] EIP: [<ccb1301d>] dahdi_specchan_release+0x5b/0x7f [dahdi] SS:ESP 0068:ca405f74 Aug 4 22:42:16 zavadi kernel: [ 382.332916] ---[ end trace a14e1150f8dd63d1 ]--- Again, this is not related to the current issue. I'll look into it tommorow. By: Tzafrir Cohen (tzafrir) 2010-08-08 10:48:12 r9090 fixed that crash in the Astribank code. I can also no longer reproduce the 100% CPU issue. By: Shaun Ruffell (sruffell) 2010-08-29 01:05:28 I attached the patch "0002-dahdi-Sleep-a-bit-before-returning-ENODEV-from-read-.patch" in case a read loop is still a problem. The idea that this still may be needed was brought up in the following thread: http://lists.digium.com/pipermail/asterisk-dev/2010-August/045935.html By: Tilghman Lesher (tilghman) 2010-08-30 01:01:42 I was thinking more along the lines of what I've attached. If there's a tight userspace loop that refuses to accept that the channel doesn't exist, then soon returning what it wants to hear from the kernel driver will avoid the tight loop without introducing artificial delays in the kernel. By: Shaun Ruffell (sruffell) 2010-08-30 13:08:50 What is the advantage of returning dummy data versus forcing the caller to sleep a little? It isn't clear just from looking at the driver that returning dummy data wouldn't cause some remote endpoint from being flooded with packets with dummy data. On the other hand, if a caller is calling read/write (and probably poll should be added in there too) on a channel that has disappeared, there are going to be bigger problems to deal with than a 5ms sleep (which is still less than the typical 20ms chunk size that a file opened in blocking mode would normally wait). This sleep could even be reduced to 1ms and achieve the same effect. As an aside, to keep statistics on a per-channel basis about rates of calls would probably require using the file->private_data pointer to hold a pointer to another structure that contains a dahdi_chan pointer as opposed to the dahdi_chan pointer directly. Otherwise if you had two channels in the system and you hotswapped one out and userspace is alternating reads from each of them you might never get your 'last_badchan_count' up above 3. By: Digium Subversion (svnbot) 2010-09-20 15:32:30 Repository: dahdi Revision: 9353 U linux/trunk/drivers/dahdi/dahdi-base.c ------------------------------------------------------------------------ r9353 | sruffell | 2010-09-20 15:32:29 -0500 (Mon, 20 Sep 2010) | 27 lines dahdi: Be more tolerant of surprise removal of channels. Enable DAHDI to detect if an operation on a file handle refers to a channel that may have been unregistered. This can occur, for example, when a board driver is hot-swapped out in a live system. This patch ensures that file->private_data is always properly set for any open channel, and it's set back to NULL when a channel is unregistered. This way file->private_data can be used to check whether it's valid to perform an operation on the channel. (NOTE: There is still a race condition here if the driver was unbound on one processor during the window of time between when file->private_data was checked and the system call finishes). Also, since DAHDI should only return -ENODEV on read or write when there was a surprise device removal on a running system this sleep can prevent the system from becoming unresponsive if the userspace application does not check for the -ENODEV error and constantly tries to call read with elevated privileges. (issue DAHLIN-201) Reported by: tzafrir Tested by: sruffell Review: https://reviewboard.asterisk.org/r/905/ Signed-off-by: Shaun Ruffell <sruffell@digium.com> ------------------------------------------------------------------------ http://svn.digium.com/view/dahdi?view=rev&revision=9353 By: Digium Subversion (svnbot) 2011-01-20 23:28:30.000-0600 Repository: dahdi Revision: 9662 U linux/branches/2.4/drivers/dahdi/dahdi-base.c ------------------------------------------------------------------------ r9662 | sruffell | 2011-01-20 23:28:29 -0600 (Thu, 20 Jan 2011) | 29 lines dahdi: Be more tolerant of surprise removal of channels. Enable DAHDI to detect if an operation on a file handle refers to a channel that may have been unregistered. This can occur, for example, when a board driver is hot-swapped out in a live system. This patch ensures that file->private_data is always properly set for any open channel, and it's set back to NULL when a channel is unregistered. This way file->private_data can be used to check whether it's valid to perform an operation on the channel. (NOTE: There is still a race condition here if the driver was unbound on one processor during the window of time between when file->private_data was checked and the system call finishes). Also, since DAHDI should only return -ENODEV on read or write when there was a surprise device removal on a running system this sleep can prevent the system from becoming unresponsive if the userspace application does not check for the -ENODEV error and constantly tries to call read with elevated privileges. (issue DAHLIN-201) Reported by: tzafrir Tested by: sruffell Review: https://reviewboard.asterisk.org/r/905/ Signed-off-by: Shaun Ruffell <sruffell@digium.com> Origin: http://svnview.digium.com/svn/dahdi?view=rev&rev=9353 ------------------------------------------------------------------------ http://svn.digium.com/view/dahdi?view=rev&revision=9662 |