WARN_ON hit on __blk_mq_run_hw_queue followed by an Oops

Masayoshi Mizuma m.mizuma at jp.fujitsu.com
Tue Jul 5 01:25:48 PDT 2016

On Wed, 15 Jun 2016 12:09:51 -0300 Gabriel Krisman Bertazi wrote:
> Hi,
>
> While doing stress test on a nvme device at the same time we offlined
> some cores to change the SMT configuration in a Power box, we started
> hitting this WARN_ON in __blk_mq_run_hw_queue:
>
> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
>
> We can easily reproduce this by disabling a few CPUs via sysfs, while
> doing intensive IO to the device.  We start getting warnings like this:
>
> [  271.372565] WARNING: CPU: 0 PID: 7501 at ../block/blk-mq.c:739 __blk_mq_run_hw_queue+0x98/0x490
> [  271.643816] CPU: 0 PID: 7501 Comm: kworker/28:1H Tainted: G        W       4.7.0-rc2.mainline+ #10
> [  271.675180] Workqueue: kblockd blk_mq_run_work_fn
> [  271.700182] task: c0000078cda45800 ti: c0000078ce794000 task.ti: c0000078ce794000
> [  271.713909] NIP: c000000000568b68 LR: c0000000000eb6d0 CTR: c000000000569460
> [  271.732379] REGS: c0000078ce7978d0 TRAP: 0700   Tainted: G        W        (4.7.0-rc2.mainline+)
> [  271.753510] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 24002824  XER: 20000000
> [  271.872673] CFAR: c000000000569498 SOFTE: 1
> [  272.052782] NIP [c000000000568b68] __blk_mq_run_hw_queue+0x98/0x490
> [  272.072649] LR [c0000000000eb6d0] process_one_work+0x1e0/0x590
> [  272.093302] Call Trace:
> [  272.093322] [c0000078ce797b50] [c000000000568d94] __blk_mq_run_hw_queue+0x2c4/0x490 (unreliable)
> [  272.144296] [c0000078ce797c50] [c0000000000eb6d0] process_one_work+0x1e0/0x590
> [  272.152311] [c0000078ce797ce0] [c0000000000ebdb0] worker_thread+0x330/0x660
> [  272.175452] [c0000078ce797d80] [c0000000000f4690] kthread+0x110/0x130
> [  272.196932] [c0000078ce797e30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
> [  272.222188] Instruction dump:
> [  272.222218] 391e0188 eb1e0298 fb810088 fac10078 a12d0008 792ad182 792906a0 794a1f24
> [  272.266424] 7d48502a 7d494c36 792907e0 69290001 <0b090000> e93e0080 792a07e1 40820320
> [  272.310400] ---[ end trace b9ad111e6335f7e6 ]---
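
For context, the check that fires sits at the top of __blk_mq_run_hw_queue(),
and the callers are supposed to guarantee it holds by only running the hw
queue on a CPU from hctx->cpumask, either inline or by queueing work on such
a CPU.  Roughly paraphrased from block/blk-mq.c of that era (simplified, so
the details may differ from the exact 4.7 source):

static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
{
        /* A hw queue must only ever be run from a CPU it is mapped to. */
        WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

        if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
                return;

        /* ... dispatch requests from the sw ctx lists to the driver ... */
}

void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
        if (!async) {
                int cpu = get_cpu();

                /* Run inline only if this CPU belongs to the hctx. */
                if (cpumask_test_cpu(cpu, hctx->cpumask)) {
                        __blk_mq_run_hw_queue(hctx);
                        put_cpu();
                        return;
                }
                put_cpu();
        }

        /* Otherwise punt to kblockd on a CPU picked from hctx->cpumask. */
        kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
                        &hctx->run_work, 0);
}

Note that your warning fires in kblockd context (Comm: kworker/28:1H) while
running on CPU 0, so the work item for the hctx apparently ended up executing
on a CPU that is not (or no longer) in hctx->cpumask.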

Does this WARN_ON still appear in 4.7-rc6?
If so, the following patch may fix the problem.

http://lists.infradead.org/pipermail/linux-nvme/2016-July/005303.html

The patch is part of the following series.

http://lists.infradead.org/pipermail/linux-nvme/2016-July/005294.html

- Masayoshi Mizuma

>
> Eventually, if we keep stressing the SMT settings, we hit an Oops for a
> bad memory access in __blk_mq_complete_request, coming from the IRQ
> path of the nvme device.  The Oops happens because of a NULL
> dereference: the request_queue pointer of the request (rq->q) is
> NULL.
>
> Sorry, I don't have the full log of the Oops; it got truncated by other
> output.  But the stack trace is as follows:
>
> 58:mon> e
> cpu 0x58: Vector: 300 (Data Access) at [c000007fff4aba80]
>      pc: c000000000567e30: __blk_mq_complete_request+0x30/0x1b0
>      lr: d000000035d30714: __nvme_process_cq+0xb4/0x230 [nvme]
>      sp: c000007fff4abd00
>     msr: 9000000000009033
>     dar: b0
>   dsisr: 40000000
>    current = 0xc0000073de79e400
>    paca    = 0xc000000002c01800   softe: 0        irq_happened: 0x01
>      pid   = 3199, comm = aio-stress
> Linux version 4.7.0-rc2.mainline+ (root at iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2.1) ) #10 SMP Mon Jun 13 13:12:59 CDT 2016
> 58:mon> t
> [c000007fff4abd40] d000000035d30714 __nvme_process_cq+0xb4/0x230 [nvme]
> [c000007fff4abde0] d000000035d3097c nvme_irq+0x3c/0xb0 [nvme]
> [c000007fff4abe10] c000000000140690 handle_irq_event_percpu+0xa0/0x310
> [c000007fff4abed0] c000000000140968 handle_irq_event+0x68/0xc0
> [c000007fff4abf00] c000000000145d0c handle_fasteoi_irq+0xec/0x2a0
> [c000007fff4abf30] c00000000013f744 generic_handle_irq+0x54/0x80
> [c000007fff4abf60] c000000000011290 __do_irq+0x80/0x1d0
> [c000007fff4abf90] c000000000024f70 call_do_irq+0x14/0x24
> [c000007887457540] c000000000011478 do_IRQ+0x98/0x140
> [c000007887457590] c000000000002594 hardware_interrupt_common+0x114/0x180
> --- Exception: 501 (Hardware Interrupt) at c000000000068800 __copy_tofrom_user_power7+0x3ec/0x7cc
> [c000007887457980] c0000000005b56dc iov_iter_copy_from_user_atomic+0xac/0x2b0
> [c0000078874579d0] c00000000024431c generic_perform_write+0x13c/0x280
> [c000007887457a70] c000000000246650 __generic_file_write_iter+0x200/0x240
> [c000007887457ad0] c0000000003d5ff4 ext4_file_write_iter+0x284/0x430
> [c000007887457b90] c000000000385b50 aio_run_iocb+0x350/0x3e0
> [c000007887457ce0] c000000000387548 do_io_submit+0x368/0x750
> [c000007887457e30] c000000000009204 system_call+0x38/0xec
> --- Exception: c01 (System Call) at 00003fff99720818
> SP (3fff91cce4d0) is in userspace
>
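
The faulting address (dar: b0) together with rq->q being NULL points at the
very first use of the queue pointer in the completion path.  A simplified
sketch of __blk_mq_complete_request() from around 4.7 (the exact field and
its offset are my assumption, based only on dar being a small value):

static void __blk_mq_complete_request(struct request *rq)
{
        struct request_queue *q = rq->q;        /* NULL in the Oops above */

        /*
         * With q == NULL, reading q->softirq_done_fn faults at a small
         * offset from address 0, before the request is ever completed.
         */
        if (!q->softirq_done_fn)
                blk_mq_end_request(rq, rq->errors);
        else
                blk_mq_ipi_complete_request(rq);
}

If so, the nvme IRQ handler presumably looked up a request (via the command
id in the completion entry) whose rq->q was already cleared or never set up,
which would make the Oops a use of a stale or uninitialized request rather
than anything to do with the queue-run path above.
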
> I've been working on this for a while now, but I haven't really gone
> anywhere, given I'm still very unfamiliar with the block layer.  My
> current understanding is that the Oops and the Warning are unrelated,
> even though they are both triggered by the same condition (SMT changes
> plus IO stress).
>
> At this point, I could use some help from you guys, whether you have
> ever observed a similar issue on x86, or have any debugging
> suggestions.  I'm currently trying to understand why the thread was
> scheduled on the wrong core.
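
Regarding the "wrong core" question: the CPU that the run_work is queued on
is chosen by blk_mq_hctx_next_cpu(), which only ever picks CPUs out of
hctx->cpumask, roughly like this (again simplified from the 4.7-era code):

static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
        if (hctx->queue->nr_hw_queues == 1)
                return WORK_CPU_UNBOUND;

        if (--hctx->next_cpu_batch <= 0) {
                int cpu = hctx->next_cpu, next_cpu;

                /* Round-robin through the CPUs mapped to this hctx. */
                next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
                if (next_cpu >= nr_cpu_ids)
                        next_cpu = cpumask_first(hctx->cpumask);

                hctx->next_cpu = next_cpu;
                hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;

                return cpu;
        }

        return hctx->next_cpu;
}

Nothing here re-checks the chosen CPU against cpu_online_mask, and if that
CPU goes offline before the queued work runs, the workqueue code executes
the pending item from an unbound worker on some other CPU.  That is only my
guess at how the worker ends up outside hctx->cpumask in your test, but it
would explain a kworker for CPU 28 running on CPU 0.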
>
> Thanks,
>


