Oops when completing request on the wrong queue
Gabriel Krisman Bertazi
krisman at linux.vnet.ibm.com
Fri Aug 19 09:38:17 PDT 2016
Jens Axboe <axboe at kernel.dk> writes:
>> Some good detective work so far! I agree, this looks like a blk-mq core
>> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
>> req->tag != nvmeq->tags? I don't immediately see how this could happen,
>> the freezing should protect us from this, unless it's broken somehow.
>> I'll take a look at this.
Not much in the BUG() stack trace, at least to me. We are coming
from an async execution of __blk_mq_run_hw_queue:
This is from 4.7. I'll give it another run on 4.8-rc1.
[ 1953.910860] kernel BUG at ../drivers/nvme/host/pci.c:602!
cpu 0x48: Vector: 700 (Program Check) at [c000007870d3b7a0]
pc: d000000035aa2e88: nvme_queue_rq+0x1a8/0x7a0 [nvme]
lr: d000000035aa2e78: nvme_queue_rq+0x198/0x7a0 [nvme]
sp: c000007870d3ba20
msr: 900000010282b033
current = 0xc000007870d9e400
paca = 0xc000000002bd8800 softe: 0 irq_happened: 0x01
pid = 9205, comm = kworker/72:1H
kernel BUG at ../drivers/nvme/host/pci.c:602!
Linux version 4.7.0.mainline+ (root at iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2.1) ) #150 SMP Wed Aug 17 14:53:47 CDT 2016
enter ? for help
[c000007870d3bb10] c0000000005697b4 __blk_mq_run_hw_queue+0x384/0x640
[c000007870d3bc50] c0000000000e8cf0 process_one_work+0x1e0/0x590
[c000007870d3bce0] c0000000000e9148 worker_thread+0xa8/0x660
[c000007870d3bd80] c0000000000f2090 kthread+0x110/0x130
[c000007870d3be30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c
48:mon>
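For reference, the BUG() at pci.c:602 is the debug assertion in nvme_queue_rq()
checking that the request being issued really belongs to this nvme_queue's tag
set (the req->tag != nvmeq->tags condition you mention). Roughly, it amounts to
the sketch below, not necessarily the literal code:

	static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
	{
		struct nvme_queue *nvmeq = hctx->driver_data;

		/*
		 * Debug assertion (sketch): the tag set backing this hctx must
		 * be the one this nvme_queue was set up with in
		 * nvme_init_hctx().  If blk-mq runs the queue from the wrong
		 * hardware context, this fires.
		 */
		BUG_ON(*nvmeq->tags != hctx->tags);
		...
	}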
Another interesting data point is that we always reach the WARN_ON in
__blk_mq_run_hw_queue immediately before submitting the bad request.
Maybe we are touching some percpu variable from the wrong cpu?
See the WARN_ON trace below.
[ 369.953825] ------------[ cut here ]------------
[ 369.954011] WARNING: CPU: 142 PID: 9475 at ../block/blk-mq.c:840 __blk_mq_run_hw_queue+0x338/0x640
[ 369.954139] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_misc xfs ofpart cmdlinepart powernv_flash mtd ipmi_powernv ipmi_msghandler at24 opal_prd ibmpowernv nvmem_core powernv_rng input_leds joydev uio_pdrv_genirq mac_hid uio ib_iser rdma_cm sunrpc iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm hid_generic uas usb_storage usbhid hid nvme ahci libahci nvme_core drbg ansi_cprng
[ 369.955011] CPU: 142 PID: 9475 Comm: kworker/142:1H Not tainted 4.7.0.mainline+ #150
[ 369.955085] Workqueue: kblockd blk_mq_run_work_fn
[ 369.955153] task: c0000078cbb89a00 ti: c0000078cb024000 task.ti: c0000078cb024000
[ 369.955253] NIP: c000000000569768 LR: c000000000569668 CTR: 0000000000000000
[ 369.955314] REGS: c0000078cb027890 TRAP: 0700 Not tainted (4.7.0.mainline+)
[ 369.955372] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28002224 XER: 20000000
[ 369.955579] CFAR: c000000000569a18 SOFTE: 1
GPR00: c0000000005695b0 c0000078cb027b10 c0000000013d0b00 0000000000000008
GPR04: 0000000000000008 0000000000000040 0000000000000000 0000000000000001
GPR08: c0000078c345d800 0000000000000001 0000000001700000 9000000102803003
GPR12: 0000000000002200 c000000002bffe00 c0000078bb96b000 0000000000000000
GPR16: 0000000000000008 c0000078bb930000 c000007954aafd00 c0000078bb961800
GPR20: c000007954aafd08 c000000000da3728 c0000078bea80000 c0000078cb027b88
GPR24: 0000000000000000 c0000078bea80000 c0000078bb961808 fffffffffffffffb
GPR28: 0000000000000000 c0000078cb027b98 0000000000000000 c0000078bd299700
[ 369.956334] NIP [c000000000569768] __blk_mq_run_hw_queue+0x338/0x640
[ 369.956383] LR [c000000000569668] __blk_mq_run_hw_queue+0x238/0x640
[ 369.956433] Call Trace:
[ 369.956454] [c0000078cb027b10] [c0000000005695b0] __blk_mq_run_hw_queue+0x180/0x640 (unreliable)
[ 369.956535] [c0000078cb027c50] [c0000000000e8cf0] process_one_work+0x1e0/0x590
[ 369.956605] [c0000078cb027ce0] [c0000000000e9148] worker_thread+0xa8/0x660
[ 369.956664] [c0000078cb027d80] [c0000000000f2090] kthread+0x110/0x130
[ 369.956723] [c0000078cb027e30] [c0000000000095f0] ret_from_kernel_thread+0x5c/0x6c
[ 369.956791] Instruction dump:
[ 369.956821] 815f0108 e91302d8 81280000 7f895040 409d01e8 e9280060 794a1f24 7d29502a
[ 369.956920] 7d2afa78 312affff 7d295110 7d2907b4 <0b090000> e9210088 e95f0058 38810060
[ 369.957020] ---[ end trace 1398dd60e3057065 ]---
[ 369.957094] ------------[ cut here ]------------
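For reference, the check at block/blk-mq.c:840 that trips is the sanity check
at the top of __blk_mq_run_hw_queue(), which complains when the hardware queue
ends up being run from a CPU that is not in the hctx cpumask. Roughly, from the
4.7 sources (line number approximate):

	static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
	{
		...
		/*
		 * Warn if kblockd ran us on a CPU that is not mapped to this
		 * hardware context, unless the CPU we intended to run on has
		 * just gone offline.
		 */
		WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
			cpu_online(hctx->next_cpu));
		...
	}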
> Can you share what you ran to online/offline CPUs? I can't reproduce
> this here.
I was using the ppc64_cpu tool, which shouldn't do anything more than
write to sysfs, but I just reproduced it with the script below.
Note that this is ppc64le. I don't have an x86 machine at hand to attempt to
reproduce right now, but I'll look for one and see how it goes.
#!/bin/bash
# Repeatedly toggle CPUs online/offline through sysfs.  CPUs whose
# number is a multiple of 8 (and cpu0) are left alone so the system
# always keeps some CPUs online.
MAXCPUS=159
STATE=1
while true; do
    for i in $(seq 1 ${MAXCPUS}); do
        if (($i % 8)); then
            echo $STATE > /sys/devices/system/cpu/cpu$i/online
        fi
    done
    if [[ $STATE -eq 1 ]]; then
        STATE=0
    else
        STATE=1
    fi
done
--
Gabriel Krisman Bertazi