[Home]

Summary:DAHLIN-00201: [patch] oops after driver is removed
Reporter:Tzafrir Cohen (tzafrir)Labels:
Date Opened:2010-07-18 11:37:51Date Closed:2011-01-20 23:28:30.000-0600
Priority:MinorRegression?No
Status:Closed/CompleteComponents:General
Versions:2.3.0.1 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 0001-dahdi-Better-tolerate-surprise-removal-of-channels.patch
( 1) 0002-dahdi-Sleep-a-bit-before-returning-ENODEV-from-read-.patch
( 2) 20100830__issue17669.diff.txt
Description:The 'bind' and 'unbind' properties allow (un)binding a device to a driver at runtime:

zavadi:~# dahdi_hardware
pci:0000:00:09.0     wcb4xxp+     1397:08b4 Junghanns QuadBRI ISDN card
zavadi:~# echo -n 0000:00:09.0  >/sys/bus/pci/drivers/wcb4xxp/unbind
zavadi:~# dahdi_hardware
pci:0000:00:09.0     wcb4xxp-     1397:08b4 Junghanns QuadBRI ISDN card
zavadi:~# echo -n 0000:00:09.0  >/sys/bus/pci/drivers/wcb4xxp/bind
zavadi:~# dahdi_hardware
pci:0000:00:09.0     wcb4xxp+     1397:08b4 Junghanns QuadBRI ISDN card

However if any of the channels used by the device had a channel that was open at unbind time, userspace is left with invalid file handles. You'll probably soon get something along the lines of:

wcb4xxp 0000:00:09.0: Driver unloaded.
BUG: unable to handle kernel NULL pointer dereference at 00000080
IP: [<cc98c11e>] :dahdi:dahdi_chan_ioctl+0x18/0x819
*pde = 00000000
Oops: 0000 [#1] SMP
Modules linked in: dahdi_echocan_oslec echo ipv6 xpp_usb xpp wctdm wct4xxp firmware_class loop parport_pc parport snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_midi snd_seq_midi_event serio_raw pcspkr snd_rawmidi snd_seq snd_timer snd_seq_device psmouse snd i2c_viapro i2c_core soundcore button wcb4xxp dahdi crc_ccitt shpchp pci_hotplug via_agp agpgart evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom sd_mod via82cxxx ide_pci_generic ide_core sata_via floppy via_rhine mii ehci_hcd ata_generic uhci_hcd usbcore libata scsi_mod dock thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]

Pid: 2827, comm: asterisk Not tainted (2.6.26-2-686 #1)
EIP: 0060:[<cc98c11e>] EFLAGS: 00210286 CPU: 0
EIP is at dahdi_chan_ioctl+0x18/0x819 [dahdi]
EAX: ffffffff EBX: b29608d8 ECX: b29608d8 EDX: 8004da08
ESI: 8004da08 EDI: c87d2280 EBP: 00000000 ESP: c629de24
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process asterisk (pid: 2827, ti=c629c000 task=cb4a6660 task.ti=c629c000)
Stack: 00000000 00000000 00000000 00000000 00000000 00000000 8004da08 000000fe
      c87d2280 8004da08 cc98e068 ffffffff 00000009 b29608d8 b2960886 00000005
      00000000 ffffffff ffffffff c1188fc0 ca3574e0 00000001 ca3574e8 c011be2d
Call Trace:
[<cc98e068>] dahdi_ioctl+0x1749/0x17e9 [dahdi]
[<c011be2d>] push_rt_task+0xff/0x1ac
[<c011bee4>] push_rt_tasks+0xa/0x10
[<c011b733>] try_to_wake_up+0xe8/

[<c0118511>] __wake_up_common+0x2

[<c011a6fb>] __wake_up+0x29/0x39
[<c013ab0a>] wake_futex+0x1c/0x26
[<c013baa2>] do_futex+0x303/0x70e
[<c01dd39a>] __next_cpu+0x12/0x21
[<c011a0d3>] find_lowest_rq+0xb0/

[<cc98c91f>] dahdi_ioctl+0x0/0x17

[<c017e83c>] vfs_ioctl+0x1c/0x5d
[<c017eac7>] do_vfs_ioctl+0x24a/0

[<c017eb1f>] sys_ioctl+0x41/0x5a
[<c0103857>] sysenter_past_esp+0x

[<c02b0000>] quirk_ali7101_acpi+0

=======================
Code: 00 89 95 98 05 00 00 31 f6 83 c4 60 89 f0 5b 5e 5f 5d c3 55 57 89 c7 56 89 d6 53 89 cb 83 ec 18 8b 44 24 2c 8b 2c 85 60 09 9b cc <83> bd 80 00 00 00 00 75 0f ba eb 14 00 00 b8 99 fe 98 cc e8 bd
EIP: [<cc98c11e>] dahdi_chan_ioctl+0x18/0x819 [dahdi] SS:ESP 0068:c629de24
---[ end trace 4cc6bfb9265c52fd ]---


Only tested with wcb4xxp, and not the latest, but I suspect it has not changed recently.
Comments:By: Shaun Ruffell (sruffell) 2010-08-01 14:21:48

Taking tzafrir's advice, I looked at this issue a bit yesterday since I am of the current belief that this issue is going to be directly related to rooting all the spans in the device tree under the actual device that is implementing the span (pci_dev, usb_dev or /sys/devices/virtual/dahdi as the case may be).

I'm still looking at this, but a first cut at embedding a "struct device *parent" in the span, and then performing a "device_get" on channel open still didn't prevent the release method from being called when the the driver was unbinding from the device, which isn't what I would have expected.  hmm....

By: Shaun Ruffell (sruffell) 2010-08-01 18:46:55

After digging into this some more, I noticed that the pci_dev->remove is *not* the function that is called when the embedded "struct device" ref count goes to 0, but is instead part of pci hotplug called at unbind time regardless (pci_dev has it's own remove along with struct device)

So, what I'm thinking needs to happen now in order to handle surprise removal is that the callbacks in the file_operations need to check if the underlying channel has been removed from the system and error out instead of trying to keep the memory backing he device around.

By: Shaun Ruffell (sruffell) 2010-08-04 12:32:50

Uploaded the patch that also sits out at http://github.com/sruffell/dahdi-linux/tree/issue-17669.

I think this is the right way to go...comments?

By: Tzafrir Cohen (tzafrir) 2010-08-04 14:19:03

Looks like the right direction.

However here's what I see: (a system with an Astribank and a PCI card. Configured. Running Asterisk)

* Once I unbind the PCI card, the load on the system jumps. I suspect this is a read loop in Asterisk. But ATM the computer becomes unresponsive too quickly.

* If I unplug the Astribank, I get an immdiate oops for a NULL pointer dereference. The trace is something with dahdi_specfile_release. Looking into it now.

By: Tzafrir Cohen (tzafrir) 2010-08-04 14:47:39

The xpp disconnecting issue is a separate regression. It has existed before this patch. By the looks of it: seems to be related to the dereferencing chan->span->ops->close .

Aug  4 22:42:16 zavadi kernel: [  382.330968] BUG: unable to handle kernel NULL pointer dereference at 0000016c
Aug  4 22:42:16 zavadi kernel: [  382.330979] IP: [<ccb1301d>] :dahdi:dahdi_specchan_release+0x5b/0x7f
Aug  4 22:42:16 zavadi kernel: [  382.330997] *pde = 00000000
Aug  4 22:42:16 zavadi kernel: [  382.331007] Oops: 0000 [#1] SMP
Aug  4 22:42:16 zavadi kernel: [  382.331013] Modules linked in: xpp_usb xpd_bri xpd_pri xpd_fxo xpd_fxs xpp wcb4xxp dahdi_echocan_mg2 dahdi ipv6 echo firmware_class loop parport_pc parport snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_midi snd_seq_midi_event serio_raw snd_rawmidi snd_seq snd_timer snd_seq_device pcspkr psmouse i2c_viapro snd i2c_core soundcore button crc_ccitt shpchp pci_hotplug via_agp agpgart evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom sd_mod via82cxxx ide_pci_generic ide_core sata_via floppy via_rhine mii ehci_hcd ata_generic uhci_hcd usbcore libata scsi_mod dock thermal processor fan thermal_sys [last unloaded: dahdi]
Aug  4 22:42:16 zavadi kernel: [  382.331102]
Aug  4 22:42:16 zavadi kernel: [  382.331107] Pid: 4364, comm: asterisk Not tainted (2.6.26-2-686 #1)
Aug  4 22:42:16 zavadi kernel: [  382.331113] EIP: 0060:[<ccb1301d>] EFLAGS: 00210246 CPU: 0
Aug  4 22:42:16 zavadi kernel: [  382.331125] EIP is at dahdi_specchan_release+0x5b/0x7f [dahdi]
Aug  4 22:42:16 zavadi kernel: [  382.331131] EAX: 00000000 EBX: ca574000 ECX: cb401100 EDX: 00000001
Aug  4 22:42:16 zavadi kernel: [  382.331137] ESI: 00000000 EDI: 0000002c EBP: cb0f1d74 ESP: ca405f74
Aug  4 22:42:16 zavadi kernel: [  382.331142]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Aug  4 22:42:16 zavadi kernel: [  382.331149] Process asterisk (pid: 4364, ti=ca404000 task=ca346660 task.ti=ca404000)
Aug  4 22:42:16 zavadi kernel: [  382.331155] Stack: 00000008 ca8c4480 cad3e790 c01754ff cb41de00 ca8c4480 ca31a180 00000000
Aug  4 22:42:16 zavadi kernel: [  382.331170]        ca404000 c0172ec9 ca31a180 00000030 b551fe5c c0174003 00000030 0842de50
Aug  4 22:42:16 zavadi kernel: [  382.331182]        c0103857 00000030 00000001 b647f2f4 0842de50 b551fe5c b551fe48 00000006
Aug  4 22:42:16 zavadi kernel: [  382.331195] Call Trace:
Aug  4 22:42:16 zavadi kernel: [  382.331209]  [<c01754ff>] __fput+0x8a/0x135
Aug  4 22:42:16 zavadi kernel: [  382.331233]  [<c0172ec9>] filp_close+0x4d/0x53
Aug  4 22:42:16 zavadi kernel: [  382.331248]  [<c0174003>] sys_close+0x5b/0x8d
Aug  4 22:42:16 zavadi kernel: [  382.331259]  [<c0103857>] sysenter_past_esp+0x78/0xb1
Aug  4 22:42:16 zavadi kernel: [  382.331299]  =======================
Aug  4 22:42:16 zavadi kernel: [  382.331303] Code: f3 89 d8 e8 5e fd ff ff 8b 83 b8 00 00 00 85 c0 74 26 8b 80 6c 01 00 00 8b 50 24 85 d2 74 06 89 d8 ff d2 89 c6 8b 83 b8 00 00 00 <8b> 80 6c 01 00 00 8b 00 e8 63 ad 62 f3 8b 04 bd c0 ea b3 cc 85
Aug  4 22:42:16 zavadi kernel: [  382.331354] EIP: [<ccb1301d>] dahdi_specchan_release+0x5b/0x7f [dahdi] SS:ESP 0068:ca405f74
Aug  4 22:42:16 zavadi kernel: [  382.332916] ---[ end trace a14e1150f8dd63d1 ]---

Again, this is not related to the current issue. I'll look into it tommorow.

By: Tzafrir Cohen (tzafrir) 2010-08-08 10:48:12

r9090 fixed that crash in the Astribank code. I can also no longer reproduce the 100% CPU issue.

By: Shaun Ruffell (sruffell) 2010-08-29 01:05:28

I attached the patch "0002-dahdi-Sleep-a-bit-before-returning-ENODEV-from-read-.patch" in case a read loop is still a problem.

The idea that this still may be needed was brought up in the following thread:
http://lists.digium.com/pipermail/asterisk-dev/2010-August/045935.html

By: Tilghman Lesher (tilghman) 2010-08-30 01:01:42

I was thinking more along the lines of what I've attached.  If there's a tight userspace loop that refuses to accept that the channel doesn't exist, then soon returning what it wants to hear from the kernel driver will avoid the tight loop without introducing artificial delays in the kernel.

By: Shaun Ruffell (sruffell) 2010-08-30 13:08:50

What is the advantage of returning dummy data versus forcing the caller to sleep a little?  It isn't clear just from looking at the driver that returning dummy data wouldn't cause some remote endpoint from being flooded with packets with dummy data.

On the other hand, if a caller is calling read/write (and probably poll should be added in there too) on a channel that has disappeared, there are going to be bigger problems to deal with than a 5ms sleep (which is still less than the typical 20ms chunk size that a file opened in blocking mode would normally wait).  This sleep could even be reduced to 1ms and achieve the same effect.

As an aside, to keep statistics on a per-channel basis about rates of calls would probably require using the file->private_data pointer to hold a pointer to another structure that contains a dahdi_chan pointer as opposed to the dahdi_chan pointer directly.  Otherwise if you had two channels in the system and you hotswapped one out and userspace is alternating reads from each of them you might never get your 'last_badchan_count' up above 3.

By: Digium Subversion (svnbot) 2010-09-20 15:32:30

Repository: dahdi
Revision: 9353

U   linux/trunk/drivers/dahdi/dahdi-base.c

------------------------------------------------------------------------
r9353 | sruffell | 2010-09-20 15:32:29 -0500 (Mon, 20 Sep 2010) | 27 lines

dahdi: Be more tolerant of surprise removal of channels.

Enable DAHDI to detect if an operation on a file handle refers to a
channel that may have been unregistered. This can occur, for example,
when a board driver is hot-swapped out in a live system.

This patch ensures that file->private_data is always properly set for
any open channel, and it's set back to NULL when a channel is
unregistered.  This way file->private_data can be used to check whether
it's valid to perform an operation on the channel.  (NOTE:  There is
still a race condition here if the driver was unbound on one processor
during the window of time between when file->private_data was checked
and the system call finishes).

Also, since DAHDI should only return -ENODEV on read or write when there
was a surprise device removal on a running system this sleep can prevent
the system from becoming unresponsive if the userspace application does
not check for the -ENODEV error and constantly tries to call read with
elevated privileges.

(issue DAHLIN-201)
Reported by: tzafrir
Tested by: sruffell

Review: https://reviewboard.asterisk.org/r/905/

Signed-off-by: Shaun Ruffell <sruffell@digium.com>
------------------------------------------------------------------------

http://svn.digium.com/view/dahdi?view=rev&revision=9353

By: Digium Subversion (svnbot) 2011-01-20 23:28:30.000-0600

Repository: dahdi
Revision: 9662

U   linux/branches/2.4/drivers/dahdi/dahdi-base.c

------------------------------------------------------------------------
r9662 | sruffell | 2011-01-20 23:28:29 -0600 (Thu, 20 Jan 2011) | 29 lines

dahdi: Be more tolerant of surprise removal of channels.

Enable DAHDI to detect if an operation on a file handle refers to a
channel that may have been unregistered. This can occur, for example,
when a board driver is hot-swapped out in a live system.

This patch ensures that file->private_data is always properly set for
any open channel, and it's set back to NULL when a channel is
unregistered.  This way file->private_data can be used to check whether
it's valid to perform an operation on the channel.  (NOTE:  There is
still a race condition here if the driver was unbound on one processor
during the window of time between when file->private_data was checked
and the system call finishes).

Also, since DAHDI should only return -ENODEV on read or write when there
was a surprise device removal on a running system this sleep can prevent
the system from becoming unresponsive if the userspace application does
not check for the -ENODEV error and constantly tries to call read with
elevated privileges.

(issue DAHLIN-201)
Reported by: tzafrir
Tested by: sruffell

Review: https://reviewboard.asterisk.org/r/905/

Signed-off-by: Shaun Ruffell <sruffell@digium.com>

Origin: http://svnview.digium.com/svn/dahdi?view=rev&rev=9353
------------------------------------------------------------------------

http://svn.digium.com/view/dahdi?view=rev&revision=9662