Bug Report: can't unload nvme module in case of disabled device
Max Gurtovoy
maxg at mellanox.com
Sun Aug 13 01:29:59 PDT 2017
On 8/10/2017 10:36 PM, Keith Busch wrote:
> On Thu, Aug 10, 2017 at 08:04:13PM +0300, Max Gurtovoy wrote:
>>
>> I'm using a PCIe controller.
>> Using 4.13-rc4+, I couldn't even run the simpler scenario of only
>> unloading the nvme module (with SAMSUNG MZPLL1T6HEHP-00003 and Intel
>> P3500/3700 devices):
>>
>> [ 369.997917] INFO: task modprobe:3709 blocked for more than 120 seconds.
>> [ 370.005215] Not tainted 4.13.0-rc4+ #21
>> [ 370.010017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 370.018647] modprobe D 0 3709 3654 0x00000000
>> [ 370.024695] Call Trace:
>> [ 370.027400] __schedule+0x1dc/0x780
>> [ 370.031261] schedule+0x36/0x80
>> [ 370.034756] blk_mq_freeze_queue_wait+0x4b/0xb0
>> [ 370.039750] ? remove_wait_queue+0x60/0x60
>> [ 370.044263] blk_freeze_queue+0x1a/0x20
>> [ 370.048489] blk_cleanup_queue+0x7f/0x150
>> [ 370.052927] nvme_dev_remove_admin+0x36/0x50 [nvme]
>> [ 370.058303] nvme_remove+0xa2/0x130 [nvme]
>> [ 370.062820] pci_device_remove+0x39/0xc0
>> [ 370.067142] device_release_driver_internal+0x141/0x200
>> [ 370.072898] driver_detach+0x3f/0x80
>> [ 370.076852] bus_remove_driver+0x55/0xd0
>> [ 370.081186] driver_unregister+0x2c/0x50
>> [ 370.085521] pci_unregister_driver+0x2a/0xa0
>> [ 370.090227] nvme_exit+0x10/0xb84 [nvme]
>> [ 370.094562] SyS_delete_module+0x171/0x250
>> [ 370.099101] ? exit_to_usermode_loop+0x5e/0x88
>> [ 370.103996] entry_SYSCALL_64_fastpath+0x1a/0xa5
>> [ 370.109096] RIP: 0033:0x7f146b5106b7
>> [ 370.113037] RSP: 002b:00007ffd2cae12e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
>> [ 370.121431] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f146b5106b7
>> [ 370.129295] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000000000223f5e8
>> [ 370.137167] RBP: 000000000223f580 R08: 00007f146b7d5060 R09: 00007f146b580a40
>> [ 370.145029] R10: 00007ffd2cae1070 R11: 0000000000000206 R12: 00007ffd2cae0310
>> [ 370.152890] R13: 0000000000000000 R14: 000000000223f5e8 R15: 0000000000000000
>>
>> the new scenario:
>> 1. modprobe nvme
>> 2. sleep 10
>> 3. modprobe -r nvme
>>
>> works on 4.11.0/4.12.0 but not on 4.13.0-rc4+.
>
> This I'm not able to reproduce. The stack trace is saying there are
> entered requests on the admin queue, but that shouldn't be possible at
> this point in nvme_remove. I'll keep looking.
>
After bisecting, I found that the following commit introduced the simple
nvme driver load/unload failure:
commit 1ad43c0078b79a76accd0fe64062e47b3430dc6b
Author: Ming Lei <ming.lei at redhat.com>
Date:   Wed Aug 2 08:01:45 2017 +0800

    blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed
Adding Ming to this thread.
I'm continuing to debug the new scenario (load nvme && sleep 10 &&
unload nvme).
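That scenario can be wrapped in a small reproducer script (a sketch; the
`run` wrapper and `RUN_FOR_REAL` knob are my additions, so the sequence
can be printed without root or NVMe hardware):

```shell
#!/bin/sh
# Reproducer sketch for the nvme load/unload hang. By default the
# commands are only printed; set RUN_FOR_REAL=1 and run as root on a
# machine with an NVMe device (nvme built as a module) to execute them.
run() {
    if [ "${RUN_FOR_REAL:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

run modprobe nvme
run sleep 10
# On 4.13-rc4+ this step hangs in blk_mq_freeze_queue_wait():
run modprobe -r nvme
```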
More information about the Linux-nvme mailing list