WQ_UNBOUND workqueue warnings from multiple drivers

Tue Apr 2 16:50:28 PDT 2024

Sagi, Chaitanya,

Sorry for the delay, found your replies in the junk folder :(

> Was the test you were running read-heavy?
No, most of the failing fio tests were doing heavy writes. All were with 8 Controllers and 32 NS each. io-specs are below.

[1] bs=16k, iodepth=16, rwmixread=0, numjobs=16
Failed in ~1 min

Some others were:
[2] bs=8k, iodepth=16, rwmixread=5, numjobs=16 
[3] bs=8k, iodepth=16, rwmixread=50, numjobs=16 

Thanks,
Kamaljit

From: Chaitanya Kulkarni <chaitanyak at nvidia.com>
Date: Thursday, March 21, 2024 at 10:36
To: Sagi Grimberg <sagi at grimberg.me>, Kamaljit Singh <Kamaljit.Singh1 at wdc.com>
Cc: kbusch at kernel.org <kbusch at kernel.org>, linux-kernel at vger.kernel.org <linux-kernel at vger.kernel.org>, linux-nvme at lists.infradead.org <linux-nvme at lists.infradead.org>
Subject: Re: WQ_UNBOUND workqueue warnings from multiple drivers
CAUTION: This email originated from outside of Western Digital. Do not click on links or open attachments unless you recognize the sender and know that the content is safe.

On 3/20/24 02:11, Sagi Grimberg wrote:
>
>
> On 19/03/2024 0:33, Kamaljit Singh wrote:
>> Hello,
>>
>> After switching from Kernel v6.6.2 to v6.6.21 we're now seeing these
>> workqueue
>> warnings. I found a discussion thread about the the Intel drm driver
>> here
>> https://lore.kernel.org/lkml/ZO-BkaGuVCgdr3wc@slm.duckdns.org/T/
>>
>> and this related bug report
>> https://gitlab.freedesktop.org/drm/intel/-/issues/9245
>> but that that drm fix isn't merged into v6.6.21. It appears that we
>> may need the same
>> WQ_UNBOUND change to the nvme host tcp driver among others.
>>   [Fri Mar 15 22:30:06 2024] workqueue: nvme_tcp_io_work [nvme_tcp]
>> hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
>> [Fri Mar 15 23:44:58 2024] workqueue: drain_vmap_area_work hogged CPU
>> for >10000us 4 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 09:55:27 2024] workqueue: drain_vmap_area_work hogged CPU
>> for >10000us 8 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 17:51:18 2024] workqueue: nvme_tcp_io_work [nvme_tcp]
>> hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 23:04:14 2024] workqueue: nvme_tcp_io_work [nvme_tcp]
>> hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
>> [Sun Mar 17 21:35:46 2024] perf: interrupt took too long (2707 >
>> 2500), lowering kernel.perf_event_max_sample_rate to 73750
>> [Sun Mar 17 21:49:34 2024] workqueue: drain_vmap_area_work hogged CPU
>> for >10000us 16 times, consider switching to WQ_UNBOUND
>> ...
>> workqueue: drm_fb_helper_damage_work [drm_kms_helper] hogged CPU for
>> >10000us 32 times, consider switching to WQ_UNBOUND
>
> Hey Kamaljit,
>
> Its interesting that this happens because nvme_tcp_io_work is bound to
> 1 jiffie.
> Although in theory we do not stop receiving from a socket once we
> started, so
> I guess this can happen in some extreme cases. Was the test you were
> running
> read-heavy?
>
> I was thinking that we may want to optionally move the recv path to
> softirq instead to
> get some latency improvements, although I don't know if that would
> improve the situation
> if we end up spending a lot of time in soft-irq...
>
>>     Thanks,
>> Kamaljit Singh
>
>

we need a regular test for this in blktests as it doesn't look like we
caught this in
regular testing ...

Kamaljit, can you please provide details of the tests you are running so
we can
reproduce ?

-ck