kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7

Yi Zhang yizhan at redhat.com
Wed Apr 19 23:03:39 PDT 2017


Hi

I reproduced two different kernel NULL pointer dereferences during reset_controller operations with IO running on 4.11.0-rc7. Here are the steps and kernel logs, thanks.

Reproduce steps:
1. Configure NVMe over RDMA on target
#nvmetcli restore rdma.json
2. Connect to the target on the client
#nvme connect-all -t rdma -a $IP -s 4420
3. Run reset_controller during IO on the client
#!/bin/bash
num=0
fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -bs_unaligned -runtime=300 -size=-group_reporting -name=mytest -numjobs=60 &
sleep 5
while [ $num -lt 50 ]
do
	echo 1 >/sys/block/nvme0n1/device/reset_controller
	[ $? -ne 0 ] && echo "reset_controller operation failed: $num" && exit 1
	((num++))
	sleep 0.5
done
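
The reset loop in step 3 could also be written as a small reusable function. This is only a sketch and not what was run for the logs below. The function name reset_loop and its arguments (attribute path, count, delay in seconds) are made up for illustration. It treats any failed write as an error and stops at the first one.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original reproducer): write 1 to a
# reset_controller sysfs attribute COUNT times, DELAY seconds apart.
# Stops and returns 1 on the first write that fails.
reset_loop() {
    attr=$1            # path to the reset_controller attribute
    count=$2           # number of resets to issue
    delay=${3:-0.5}    # pause between resets, in seconds
    i=0
    while [ "$i" -lt "$count" ]; do
        if ! echo 1 > "$attr"; then
            echo "reset_controller operation failed: $i" >&2
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 0
}
```

With fio already running in the background, it would be called as
reset_loop /sys/block/nvme0n1/device/reset_controller 50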

RDMA devices:
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Kernel log
[1]
[ 5968.515237] DMAR: DRHD: handling fault status reg 2
[ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[ 5968.519450] 00000000 00000000 00000000 00000000
[ 5968.519451] 00000000 00000000 00000000 00000000
[ 5968.519451] 00000000 00000000 00000000 00000000
[ 5968.519452] 00000000 02005104 00000316 a71710e3
[ 5968.546797] DMAR: [DMA Read] Request device [05:00.0] fault addr ab978000 [fault reason 06] PTE Read access is not set
[ 5999.693035] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[ 5999.701799] IP: 0x1
[ 5999.704142] PGD 0 
[ 5999.704143] 
[ 5999.708052] Oops: 0010 [#1] SMP
[ 5999.711562] Modules linked in: sch_mqprio bridge 8021q garp mrp stp llc nvme_rdma nvme_fabrics nvme_core rpcrdma ib_isert iscsi_target_mod ib_isee
[ 5999.791200]  drm tg3 devlink ahci libahci ptp libata crc32c_intel i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 5999.803558] CPU: 16 PID: 3839 Comm: kworker/16:1H Not tainted 4.11.0-rc7+ #1
[ 5999.811440] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[ 5999.819812] Workqueue: kblockd blk_mq_timeout_work
[ 5999.825848] task: ffff93c9258f8000 task.stack: ffffbc26a13f8000
[ 5999.833113] RIP: 0010:0x1
[ 5999.836684] RSP: 0018:ffffbc26a13fbca8 EFLAGS: 00010202
[ 5999.843170] RAX: ffff93c91a776000 RBX: ffff93c917a33600 RCX: ffff93ca3fa00000
[ 5999.851789] RDX: ffffbc26a13fbcb0 RSI: ffffbc26a13fbcb8 RDI: ffff93c91a777c00
[ 5999.860395] RBP: ffffbc26a13fbd08 R08: 000000000000ffff R09: 0000000000000000
[ 5999.869001] R10: 00000574e9925428 R11: 0000000000000020 R12: ffff93c92e22cbd0
[ 5999.877616] R13: ffff93da3e3fc000 R14: ffff93c9facd5000 R15: ffff93c9179efc00
[ 5999.886235] FS:  0000000000000000(0000) GS:ffff93ca3fa00000(0000) knlGS:0000000000000000
[ 5999.895934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5999.903021] CR2: 0000000000000001 CR3: 0000000fd5e72000 CR4: 00000000001406e0
[ 5999.911676] Call Trace:
[ 5999.915091]  ? nvme_rdma_unmap_data+0x126/0x1b0 [nvme_rdma]
[ 5999.922014]  nvme_rdma_complete_rq+0x1c/0xa0 [nvme_rdma]
[ 5999.928654]  __blk_mq_complete_request+0xb9/0x130
[ 5999.934615]  blk_mq_rq_timed_out+0x66/0x70
[ 5999.939900]  blk_mq_check_expired+0x37/0x60
[ 5999.945277]  bt_iter+0x48/0x50
[ 5999.949387]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[ 5999.955440]  ? blk_mq_rq_timed_out+0x70/0x70
[ 5999.960912]  ? blk_mq_rq_timed_out+0x70/0x70
[ 5999.966378]  blk_mq_timeout_work+0x88/0x170
[ 5999.971734]  process_one_work+0x165/0x410
[ 5999.976884]  worker_thread+0x137/0x4c0
[ 5999.981740]  kthread+0x109/0x140
[ 5999.986002]  ? rescuer_thread+0x3b0/0x3b0
[ 5999.991293]  ? kthread_park+0x90/0x90
[ 5999.996202]  ret_from_fork+0x2c/0x40
[ 6000.001023] Code:  Bad RIP value.
[ 6000.005395] RIP: 0x1 RSP: ffffbc26a13fbca8
[ 6000.010641] CR2: 0000000000000001
[ 6000.017674] ---[ end trace aefe12bb2d39bb6c ]---
[ 6000.025847] Kernel panic - not syncing: Fatal exception
[ 6000.032339] Kernel Offset: 0x7800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 6000.047129] ---[ end Kernel panic - not syncing: Fatal exception

[2]
[  181.885449] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.45.92:4420
[  182.051854] nvme nvme0: creating 40 I/O queues.
[  183.196669] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.45.92:4420
[  335.152533] DMAR: DRHD: handling fault status reg 2
[  335.155522] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  335.155523] 00000000 00000000 00000000 00000000
[  335.155523] 00000000 00000000 00000000 00000000
[  335.155524] 00000000 00000000 00000000 00000000
[  335.155524] 00000000 02005104 00000313 2d56a1e3
[  335.184087] DMAR: [DMA Read] Request device [05:00.0] fault addr afe64000 [fault reason 06] PTE Read access is not set
[  335.565825] nvme nvme0: creating 40 I/O queues.
[  335.946585] DMAR: DRHD: handling fault status reg 102
[  335.948848] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  335.948849] 00000000 00000000 00000000 00000000
[  335.948849] 00000000 00000000 00000000 00000000
[  335.948849] 00000000 00000000 00000000 00000000
[  335.948850] 00000000 02005104 0000033e 123982e2
[  335.978349] DMAR: [DMA Read] Request device [05:00.0] fault addr af0c6000 [fault reason 06] PTE Read access is not set
[  336.286112] nvme nvme0: creating 40 I/O queues.
[  336.976392] nvme nvme0: creating 40 I/O queues.
[  337.329610] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  337.335456] 00000000 00000000 00000000 00000000
[  337.340521] 00000000 00000000 00000000 00000000
[  337.345586] 00000000 00000000 00000000 00000000
[  337.350651] 00000000 93005204 0000038c 052a29e3
[  337.623917] nvme nvme0: creating 40 I/O queues.
[  338.286747] nvme nvme0: creating 40 I/O queues.
[  338.647457] DMAR: DRHD: handling fault status reg 202
[  338.649077] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  338.649078] 00000000 00000000 00000000 00000000
[  338.649079] 00000000 00000000 00000000 00000000
[  338.649079] 00000000 00000000 00000000 00000000
[  338.649080] 00000000 02005104 000003dc 096258e2
[  338.681899] DMAR: [DMA Read] Request device [05:00.0] fault addr adaf8000 [fault reason 06] PTE Read access is not set
[  339.003086] nvme nvme0: creating 40 I/O queues.
[  341.419403] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[  341.428698] IP: 0x1
[  341.431518] PGD 0 
[  341.431519] 
[  341.436353] Oops: 0010 [#1] SMP
[  341.440319] Modules linked in: sch_mqprio bridge 8021q garp mrp stp llc nvme_rdma nvme_fabrics nvme_core rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srpe
[  341.523752]  drm tg3 ahci devlink libahci ptp libata crc32c_intel i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[  341.536724] CPU: 29 PID: 859 Comm: kworker/u82:2 Not tainted 4.11.0-rc7+ #1
[  341.545128] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[  341.554122] Workqueue: writeback wb_workfn (flush-259:0)
[  341.560683] task: ffff94c637484380 task.stack: ffffb6d08ed1c000
[  341.567928] RIP: 0010:0x1
[  341.571481] RSP: 0018:ffffb6d08ed1f670 EFLAGS: 00010282
[  341.577959] RAX: ffff94c63e6cb800 RBX: ffff94b53080cd20 RCX: 0000000000000001
[  341.586589] RDX: ffffb6d08ed1f678 RSI: ffff94c62c3413a8 RDI: ffff94c63e6ce400
[  341.595227] RBP: ffffb6d08ed1f6c0 R08: ffff94c62c3413a8 R09: 0000000000000000
[  341.603848] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94b4e5620fc0
[  341.612459] R13: ffff94c63a39c000 R14: 0000000000000002 R15: ffff94c62c341200
[  341.621069] FS:  0000000000000000(0000) GS:ffff94c63f380000(0000) knlGS:0000000000000000
[  341.630754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  341.637822] CR2: 0000000000000001 CR3: 00000008d5c09000 CR4: 00000000001406e0
[  341.646457] Call Trace:
[  341.649862]  ? nvme_rdma_post_send+0x9b/0x100 [nvme_rdma]
[  341.656577]  nvme_rdma_queue_rq+0x2fb/0x680 [nvme_rdma]
[  341.663085]  blk_mq_try_issue_directly+0xbb/0x110
[  341.668995]  blk_mq_make_request+0x354/0x620
[  341.674424]  generic_make_request+0x110/0x2c0
[  341.679947]  submit_bio+0x75/0x150
[  341.684400]  submit_bh_wbc+0x141/0x180
[  341.689244]  __block_write_full_page+0x13d/0x3b0
[  341.695068]  ? I_BDEV+0x20/0x20
[  341.699244]  ? I_BDEV+0x20/0x20
[  341.703411]  block_write_full_page+0xdc/0x100
[  341.708946]  blkdev_writepage+0x18/0x20
[  341.713898]  __writepage+0x13/0x40
[  341.718357]  write_cache_pages+0x26f/0x510
[  341.723593]  ? compound_head+0x20/0x20
[  341.728447]  generic_writepages+0x51/0x80
[  341.733600]  ? __wake_up_common+0x55/0x90
[  341.738753]  blkdev_writepages+0x2f/0x40
[  341.743795]  do_writepages+0x1e/0x30
[  341.748429]  __writeback_single_inode+0x45/0x330
[  341.754221]  writeback_sb_inodes+0x280/0x570
[  341.759616]  __writeback_inodes_wb+0x8c/0xc0
[  341.765015]  wb_writeback+0x276/0x310
[  341.769742]  wb_workfn+0x19c/0x3b0
[  341.774178]  process_one_work+0x165/0x410
[  341.779274]  worker_thread+0x137/0x4c0
[  341.784061]  kthread+0x109/0x140
[  341.788243]  ? rescuer_thread+0x3b0/0x3b0
[  341.793279]  ? kthread_park+0x90/0x90
[  341.797905]  ret_from_fork+0x2c/0x40
[  341.802414] Code:  Bad RIP value.
[  341.806612] RIP: 0x1 RSP: ffffb6d08ed1f670
[  341.811663] CR2: 0000000000000001
[  341.815833] ---[ end trace 9a64941b3df0eb88 ]---
[  341.878226] Kernel panic - not syncing: Fatal exception
[  341.884533] Kernel Offset: 0x12800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  341.917376] ---[ end Kernel panic - not syncing: Fatal exception


Best Regards,
  Yi Zhang




