nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

Hillf Danton hdanton at sina.com
Tue Nov 8 01:16:09 PST 2022


On 07 Nov 2022 18:28:16 +0100 Gerd Bayer <gbayer at linux.ibm.com>
> Hi,
> 
> our internal s390 CI pointed us to a potential racy "use after free" or similar 
> issue in drivers/nvme/host/pci.c by ending one of the tests in the following 
> kernel panic:
> 
> [ 1836.550881] nvme nvme0: pci function 0004:00:00.0
> [ 1836.563814] nvme nvme0: Shutdown timeout set to 15 seconds
> [ 1836.569587] nvme nvme0: 63/0/0 default/read/poll queues
> [ 1836.577114]  nvme0n1: p1 p2
> [ 1861.856726] nvme nvme0: pci function 0004:00:00.0
> [ 1861.869539] nvme nvme0: failed to mark controller CONNECTING
> [ 1861.869542] nvme nvme0: Removing after probe failure status: -16

The reset work callback exits with an error result here (-16 is -EBUSY).
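
For reference, both quoted messages come from the reset path in
drivers/nvme/host/pci.c.  A rough, paraphrased sketch of the failing branch
(not a verbatim copy of next-20221104):

static void nvme_reset_work(struct work_struct *work)
{
	struct nvme_dev *dev =
		container_of(work, struct nvme_dev, ctrl.reset_work);
	int result;

	/* ... controller shutdown and re-enable elided ... */

	/*
	 * The transition to CONNECTING fails because the controller is
	 * already being torn down by the concurrent unbind/remove.
	 */
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) {
		dev_warn(dev->ctrl.device,
			 "failed to mark controller CONNECTING\n");
		result = -EBUSY;	/* -16, as seen in the log */
		goto out;
	}

	/* ... rest of the reset elided ... */
	return;
out:
	/*
	 * The error path logs "Removing after probe failure status: -16"
	 * and ends up calling nvme_dev_disable(), which is where the oops
	 * below is hit.
	 */
}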

> [ 1861.869552] Unable to handle kernel pointer dereference in virtual kernel address space
> [ 1861.869554] Failing address: 0000000000000000 TEID: 0000000000000483
> [ 1861.869555] Fault in home space mode while using kernel ASCE.
> [ 1861.869558] AS:0000000135c4c007 R3:00000003fffe0007 S:00000003fffe6000 P:000000000000013d 
> [ 1861.869587] Oops: 0004 ilc:3 [#1] SMP 
> [ 1861.869591] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables
> nfnetlink mlx5_ib ib_uverbs uvdevice s390_trng ib_core vfio_ccw mdev vfio_iommu_type1 eadm_sch
>  vfio sch_fq_codel configfs dm_service_time mlx5_core ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes
> sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 nvme sha_common nvme_core zfcp scsi_transport_fc
> dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log pkey zcrypt rng_core autofs4
> [ 1861.869627] CPU: 4 PID: 2929 Comm: kworker/u800:0 Not tainted 6.1.0-rc3-next-20221104 #4
> [ 1861.869630] Hardware name: IBM 3931 A01 701 (LPAR)
> [ 1861.869631] Workqueue: nvme-reset-wq nvme_reset_work [nvme]

The reset work is re-scheduled and runs again here, which supports the
use-after-free theory above.

> [ 1861.869637] Krnl PSW : 0704c00180000000 0000000134f026d0 (mutex_lock+0x10/0x28)
> [ 1861.869643]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [ 1861.869646] Krnl GPRS: 0000000001000000 0000000000000000 0000000000000078 00000000a5f8c200
> [ 1861.869648]            000003800309601c 0000000000000004 0000000000000000 0000000088e64220
> [ 1861.869650]            0000000000000078 0000000000000000 0000000000000098 0000000088e64000
> [ 1861.869651]            00000000a5f8c200 0000000088e641e0 00000001349bdac2 0000038003ea7c20
> [ 1861.869658] Krnl Code: 0000000134f026c0: c0040008cfb8        brcl    0,000000013501c630
> [ 1861.869658]            0000000134f026c6: a7190000            lghi    %r1,0
> [ 1861.869658]           #0000000134f026ca: e33003400004        lg      %r3,832
> [ 1861.869658]           >0000000134f026d0: eb1320000030        csg     %r1,%r3,0(%r2)
> [ 1861.869658]            0000000134f026d6: ec160006007c        cgij    %r1,0,6,0000000134f026e2
> [ 1861.869658]            0000000134f026dc: 07fe                bcr     15,%r14
> [ 1861.869658]            0000000134f026de: 47000700            bc      0,1792
> [ 1861.869658]            0000000134f026e2: c0f4ffffffe7        brcl    15,0000000134f026b0
> [ 1861.869715] Call Trace:
> [ 1861.869716]  [<0000000134f026d0>] mutex_lock+0x10/0x28 
> [ 1861.869719]  [<000003ff7fc381d6>] nvme_dev_disable+0x1b6/0x2b0 [nvme] 
> [ 1861.869722]  [<000003ff7fc3929e>] nvme_reset_work+0x49e/0x6a0 [nvme] 
> [ 1861.869724]  [<0000000134309158>] process_one_work+0x200/0x458 
> [ 1861.869730]  [<00000001343098e6>] worker_thread+0x66/0x480 
> [ 1861.869732]  [<0000000134312888>] kthread+0x108/0x110 
> [ 1861.869735]  [<0000000134297354>] __ret_from_fork+0x3c/0x58 
> [ 1861.869738]  [<0000000134f074ea>] ret_from_fork+0xa/0x40 
> [ 1861.869740] Last Breaking-Event-Address:
> [ 1861.869741]  [<00000001349bdabc>] blk_mq_quiesce_tagset+0x2c/0xc0
> [ 1861.869747] Kernel panic - not syncing: Fatal exception: panic_on_oops
> 
> On a stock kernel from
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20221104
> we have been able to reproduce this at will with
> this small script 
> 
> #!/usr/bin/env bash
> 
> echo $1 > /sys/bus/pci/drivers/nvme/unbind
> echo $1 > /sys/bus/pci/drivers/nvme/bind
> echo 1 > /sys/bus/pci/devices/$1/remove
> 
> when passing the NVMe drive's PCI identifier as the argument.
> 
> We believe this to be a race condition somewhere, since the same sequence does not produce the panic
> when executed interactively.
> 
> Could this be linked to the recent (refactoring) work by Christoph Hellwig?
> E.g. https://lore.kernel.org/all/20221101150050.3510-3-hch@lst.de/

The minimum change is to flush the reset work before the controller memory
is freed, instead of perhaps better options such as grabbing another
reference on the controller when the reset work is scheduled, which would
come with the bonus of figuring out where and why the work gets
re-scheduled.
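
For comparison, the reference-counting option could look roughly like the
sketch below.  Untested and only illustrative: it assumes the reset is
scheduled via nvme_reset_ctrl() in drivers/nvme/host/core.c and reuses the
existing nvme_get_ctrl()/nvme_put_ctrl() helpers, with the matching put as
the last step of nvme_reset_work() in pci.c.

int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return -EBUSY;
	/*
	 * Pin the controller so nvme_pci_free_ctrl() cannot run while the
	 * reset work is still pending or executing.
	 */
	nvme_get_ctrl(ctrl);
	if (!queue_work(nvme_reset_wq, &ctrl->reset_work)) {
		nvme_put_ctrl(ctrl);	/* not queued, drop the pin again */
		return -EBUSY;
	}
	return 0;
}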

Just food for thought.

Hillf

+++ next-20221104/drivers/nvme/host/pci.c
@@ -2776,6 +2776,7 @@ static void nvme_pci_free_ctrl(struct nv
 	mempool_destroy(dev->iod_mempool);
 	put_device(dev->dev);
 	kfree(dev->queues);
+	flush_work(&dev->ctrl.reset_work);
 	kfree(dev);
 }
 


