[bug report] rdma_rxe: kernel NULL pointer observed with blktests nvme/012 on ppc64le/aarch64

Yi Zhang yi.zhang at redhat.com
Wed Oct 21 01:55:06 EDT 2020



On 10/20/20 2:54 AM, Bob Pearson wrote:
> On 10/18/20 10:20 AM, Yi Zhang wrote:
>> Hello
>>
>> I found this bug with blktests nvme/012 on ppc64le/aarch64, could anyone help check it,
>> Let me know if you need any test for further investigation, thanks.
>>
>> ppc64le:
>> [  155.427446] run blktests nvme/012 at 2020-10-18 09:54:03
>> [  156.593195] rdma_rxe: loaded
>> [  156.614836] infiniband rxe0: set active
>> [  156.614843] infiniband rxe0: added env2
>> [  156.617421] lo speed is unknown, defaulting to 1000
>> [  156.617449] lo speed is unknown, defaulting to 1000
>> [  156.617484] lo speed is unknown, defaulting to 1000
>> [  156.619911] infiniband rxe1: set active
>> [  156.619916] infiniband rxe1: added lo
>> [  156.619987] lo speed is unknown, defaulting to 1000
>> [  156.640971] Rounding down aligned max_sectors from 4294967295 to 4294967168
>> [  156.641095] db_root: cannot open: /etc/target
>> [  156.655793] lo speed is unknown, defaulting to 1000
>> [  156.700482] loop: module loaded
>> [  156.744820] Loading iSCSI transport class v2.0-870.
>> [  156.805883] iscsi: registered transport (iser)
>> [  156.833559] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [  156.884254] nvmet_rdma: enabling port 0 (10.0.2.182:4420)
>> [  157.005564] RPC: Registered rdma transport module.
>> [  157.005573] RPC: Registered rdma backchannel transport module.
>> [  157.121039] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0c00c1b2-6456-4e5e-a0d4-9677000ca7bc.
>> [  157.121388] nvme nvme0: creating 16 I/O queues.
>> [  157.128894] nvme nvme0: mapped 16/0/0 default/read/poll queues.
>> [  157.130039] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.182:4420
>> [  158.196673] XFS (nvme0n1): Mounting V5 Filesystem
>> [  158.206313] XFS (nvme0n1): Ending clean mount
>> [  158.207141] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
>> [  190.087119] XFS (nvme0n1): Unmounting Filesystem
>> [  190.087284] BUG: Kernel NULL pointer dereference on read at 0x00000000
>> [  190.087289] Faulting instruction address: 0xc0000000000ae400
>> [  190.087294] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  190.087298] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>> [  190.087302] Modules linked in: ib_isert iscsi_target_mod rpcrdma ib_iser libiscsi scsi_transport_iscsi loop ib_srpt target_core_mod crc32_generic rdma_rxe ib_srp ib_ipoib rdma_ucm ib_uverbs ip6_udp_tunnel udp_tunnel nvme_rdma nvme_fabrics nvme_core ib_umad nvmet_rdma rdma_cm iw_cm ib_cm ib_core nvmet rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd bonding grace fscache rfkill sunrpc xts pseries_rng uio_pdrv_genirq vmx_crypto uio ip_tables xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp
>> [  190.087335] CPU: 9 PID: 56 Comm: ksoftirqd/9 Not tainted 5.9.0 #1
>> [  190.087339] NIP:  c0000000000ae400 LR: c008000002d1f77c CTR: 0000000000000007
>> [  190.087342] REGS: c00000033b9bf550 TRAP: 0380   Not tainted  (5.9.0)
>> [  190.087345] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48088280  XER: 20040000
>> [  190.087351] CFAR: c0000000000ae3a8 IRQMASK: 0
>> [  190.087351] GPR00: c008000002d1f77c c00000033b9bf7e0 c000000001842500 c00000033133b086
>> [  190.087351] GPR04: 0000000000000000 00000000000003c0 0000000000000007 0000000000000001
>> [  190.087351] GPR08: c00000033b9bfa70 c00000033133b086 0000000000000000 c008000002d243f0
>> [  190.087351] GPR12: c0000000000b8680 c00000001eca3200 c00000032d2f4000 0000000000000001
>> [  190.087351] GPR16: c00000032d2f4000 0000000000000400 0000000000000000 c008000007c3ba38
>> [  190.087351] GPR20: 0000000000000000 c000000338a06c00 c00000033b9bfa70 c00000033b9bfa70
>> [  190.087351] GPR24: 0000000000000001 c00000033133b086 c000000334083f00 000000005e7d553d
>> [  190.087351] GPR28: 0000000000000000 00000000000003c0 00000000000003c0 0000000000000001
>> [  190.087383] NIP [c0000000000ae400] memcpy_power7+0xa0/0x7e0
>> [  190.087388] LR [c008000002d1f77c] rxe_mem_copy+0x1f4/0x308 [rdma_rxe]
>> [  190.087391] Call Trace:
>> [  190.087394] [c00000033b9bf7e0] [c000000327d83158] 0xc000000327d83158 (unreliable)
>> [  190.087399] [c00000033b9bf8e0] [c008000002d1f77c] rxe_mem_copy+0x1f4/0x308 [rdma_rxe]
>> [  190.087404] [c00000033b9bf970] [c008000002d1fbe4] copy_data+0x11c/0x460 [rdma_rxe]
>> [  190.087409] [c00000033b9bfa00] [c008000002d131ac] rxe_requester+0x1054/0x1368 [rdma_rxe]
>> [  190.087414] [c00000033b9bfb50] [c008000002d21298] rxe_do_task+0x110/0x1d0 [rdma_rxe]
>> [  190.087418] [c00000033b9bfbe0] [c000000000157f40] tasklet_action_common.isra.18+0x1b0/0x1c0
>> [  190.087424] [c00000033b9bfc40] [c000000000d4e8bc] __do_softirq+0x15c/0x3b4
>> [  190.087427] [c00000033b9bfd30] [c0000000001577f4] run_ksoftirqd+0x64/0x90
>> [  190.087432] [c00000033b9bfd50] [c00000000018a794] smpboot_thread_fn+0x204/0x270
>> [  190.087436] [c00000033b9bfdb0] [c000000000184070] kthread+0x1a0/0x1b0
>> [  190.087440] [c00000033b9bfe20] [c00000000000d3d0] ret_from_kernel_thread+0x5c/0x6c
>> [  190.087443] Instruction dump:
>> [  190.087446] f9c10070 f9e10078 fa010080 fa210088 fa410090 fa610098 fa8100a0 faa100a8
>> [  190.087451] fac100b0 f8010110 78a6c9c2 7cc903a6 <e8040000> e8c40008 e8e40010 e9040018
>> [  190.087458] ---[ end trace 244246d4a62eb74b ]---
>> [  190.089228]
>> [  191.089236] Kernel panic - not syncing: Fatal exception
>>
>>
>> # gdb drivers/infiniband/sw/rxe/rdma_rxe.ko
>> Reading symbols from drivers/infiniband/sw/rxe/rdma_rxe.ko...done.
>> (gdb) l *(rxe_mem_copy+0x1f4)
>> 0xf7bc is in rxe_mem_copy (./include/linux/string.h:406).
>> 401			if (q_size < size)
>> 402				__read_overflow2();
>> 403		}
>> 404		if (p_size < size || q_size < size)
>> 405			fortify_panic(__func__);
>> 406		return __underlying_memcpy(p, q, size);
>> 407	}
>> 408	
>> 409	__FORTIFY_INLINE void *memmove(void *p, const void *q, __kernel_size_t size)
>> 410	{
>> (gdb)
>>
>> aarch64:
>> [  691.557907] r[  691.888082] lo speed is unknown, defaulting to 1000
>> [  691.892016] lo speed is unknown, defaulting to 1000
>> [  691.896920] lo speed is unknown, defaulting to 1000
>> [  691.904834] lo speed is unknown, defaulting to 1000
>> [  691.908762] lo speed is unknown, defaulting to 1000
>> [  691.913610] lo speed is unknown, defaulting to 1000
>> [  693.427251] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
>> [  730.178748] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>> [  730.186576] Mem abort info:
>> [  730.189356]   ESR = 0x96000006
>> [  730.192385]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [  730.197680]   SET = 0, FnV = 0
>> [  730.200725]   EA = 0, S1PTW = 0
>> [  730.203843] Data abort info:
>> [  730.206707]   ISV = 0, ISS = 0x00000006
>> [  730.210533]   CM = 0, WnR = 0
>> [  730.213479] user pgtable: 64k pages, 42-bit VAs, pgdp=00000017b7950000
>> [  730.219995] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000
>> [  730.230587] Internal error: Oops: 96000006 [#1] SMP
>> [  730.235441] Modules linked in: loop crc32_generic rdma_rxe ip6_udp_tunnel udp_tunnel nvme_rdma nvme_fabrics nvme_core nvmet_rdma nvmet rfkill vfat fat ib_isert iscsi_target_mod rpcrdma ib_srpt target_core_mod sunrpc ib_srp scsi_transport_srp rdma_ucm ib_umad ib_iser ib_ipoib libiscsi rdma_cm scsi_transport_iscsi ib_cm iw_cm dm_multipath mlx5_ib ib_uverbs ib_core acpi_ipmi crct10dif_ce ghash_ce sha2_ce sha256_arm64 ipmi_ssif sha1_ce sbsa_gwdt ipmi_devintf ipmi_msghandler ip_tables xfs libcrc32c sg mlx5_core sdhci_acpi mlxfw tls sdhci qcom_emac mmc_core ahci_platform libahci_platform hdma hdma_mgmt dm_mirror dm_region_hash dm_log dm_mod
>> [  730.291605] CPU: 24 PID: 131 Comm: ksoftirqd/24 Not tainted 5.8.0+ #2
>> [  730.298027] Hardware name: WIWYNN Qualcomm Centriq 2400 Reference Evaluation Platform CV90-LA115-P11/Qualcomm Centriq 2400 Customer Reference Board, BIOS
>> [  730.311830] pstate: 20400005 (nzCv daif +PAN -UAO BTYPE=--)
>> [  730.317392] pc : __memcpy+0x100/0x180
>> [  730.321038] lr : rxe_mem_copy+0x1fc/0x230 [rdma_rxe]
>> [  730.325978] sp : fffffe001588faf0
>> [  730.329277] x29: fffffe001588faf0 x28: fffffe00294b1080
>> [  730.334572] x27: fffffc17b5065086 x26: 00000000000003c0
>> [  730.339867] x25: 0000000000000000 x24: fffffe00113f3000
>> [  730.345162] x23: fffffc17ae07c0f0 x22: 00000000288c9f50
>> [  730.350457] x21: fffffe001588fc60 x20: 0000000000000001
>> [  730.355753] x19: 00000000000003c0 x18: 0000000000000002
>> [  730.361048] x17: 1df0130affff0000 x16: 0000000000000001
>> [  730.366343] x15: 0000000000000002 x14: 0000000000000000
>> [  730.371638] x13: 0000000000000000 x12: 0000000000000000
>> [  730.376933] x11: 0000000000000000 x10: 0000000000000000
>> [  730.382228] x9 : fffffe000a52dd44 x8 : 0000000000000000
>> [  730.387523] x7 : 0000000000000000 x6 : fffffc17b5065086
>> [  730.392818] x5 : fffffe001588fc60 x4 : 0000000000000000
>> [  730.398113] x3 : 00000000000003c0 x2 : 0000000000000340
>> [  730.403409] x1 : 0000000000000000 x0 : fffffc17b5065086
>> [  730.408705] Call trace:
>> [  730.411136]  __memcpy+0x100/0x180
>> [  730.414438]  copy_data+0xc4/0x318 [rdma_rxe]
>> [  730.418691]  rxe_requester+0xa58/0xe38 [rdma_rxe]
>> [  730.423379]  rxe_do_task+0x128/0x200 [rdma_rxe]
>> [  730.427892]  tasklet_action_common.isra.21+0xfc/0x130
>> [  730.432924]  tasklet_action+0x2c/0x38
>> [  730.436570]  __do_softirq+0x128/0x33c
>> [  730.440215]  run_ksoftirqd+0x40/0x58
>> [  730.443777]  smpboot_thread_fn+0x168/0x1b0
>> [  730.447856]  kthread+0x114/0x118
>> [  730.451066]  ret_from_fork+0x10/0x18
>> [  730.454627] Code: d503201f d503201f d503201f d503201f (a8c12027)
>> [  730.460720] ---[ end trace 7d9bff591d1280d9 ]---
>> [  730.465302] Kernel panic - not syncing: Fatal exception in interrupt
>> [  730.471648] SMP: stopping secondary CPUs
>> [  730.475614] Kernel Offset: disabled
>> [  730.479017] CPU features: 0x040002,61800418
>> [  730.483183] Memory Limit: none
>> [  730.486232] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>>
>>
>> Best Regards,
>>    Yi Zhang
>>
>>
> The code looks different than current. What kernel version are you testing?

The ppc64le log is from 5.9.0, and aarch64 log is from 5.8.0
> Is there tight timing on MR creation/destruction/invalidation and use?
No, I only ran the blktests nvme_trtype=rdma nvme/012

> Bob Pearson
>




More information about the Linux-nvme mailing list