WQ_UNBOUND workqueue warnings from multiple drivers

Thu Mar 21 10:36:15 PDT 2024

On 3/20/24 02:11, Sagi Grimberg wrote:
>
>
> On 19/03/2024 0:33, Kamaljit Singh wrote:
>> Hello,
>>
>> After switching from kernel v6.6.2 to v6.6.21 we're now seeing these
>> workqueue warnings. I found a discussion thread about the Intel drm
>> driver here:
>> https://lore.kernel.org/lkml/ZO-BkaGuVCgdr3wc@slm.duckdns.org/T/
>>
>> and this related bug report:
>> https://gitlab.freedesktop.org/drm/intel/-/issues/9245
>> but that drm fix isn't merged into v6.6.21. It appears that we may
>> need the same WQ_UNBOUND change in the nvme host tcp driver, among
>> others.
>>
>> [Fri Mar 15 22:30:06 2024] workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
>> [Fri Mar 15 23:44:58 2024] workqueue: drain_vmap_area_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 09:55:27 2024] workqueue: drain_vmap_area_work hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 17:51:18 2024] workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 23:04:14 2024] workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
>> [Sun Mar 17 21:35:46 2024] perf: interrupt took too long (2707 > 2500), lowering kernel.perf_event_max_sample_rate to 73750
>> [Sun Mar 17 21:49:34 2024] workqueue: drain_vmap_area_work hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
>> ...
>> workqueue: drm_fb_helper_damage_work [drm_kms_helper] hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
>
> Hey Kamaljit,
>
> It's interesting that this happens, because nvme_tcp_io_work is bounded
> to 1 jiffy. Although in theory we do not stop receiving from a socket
> once we have started, so I guess this can happen in some extreme cases.
> Was the test you were running read-heavy?
>
> I was thinking that we may want to optionally move the recv path to
> softirq instead to get some latency improvements, although I don't know
> if that would improve the situation if we end up spending a lot of time
> in softirq...
>
>> Thanks,
>> Kamaljit Singh
>
>
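
To make Sagi's point about the time bound concrete, here is a small
userspace Python model (not the driver code) of a work function whose
time budget is only checked between passes. The names io_work, JIFFY_US
and pass_costs_us are hypothetical, the 4000 us tick is an assumption
(HZ=250), and 10000 us is the threshold printed in the warnings above.
It is only a sketch of how one long receive pass could overrun the budget:

```python
JIFFY_US = 4000         # assumed tick length (HZ=250); depends on kernel config
WARN_THRESH_US = 10000  # threshold shown in the "hogged CPU for >10000us" lines


def io_work(pass_costs_us, budget_us=JIFFY_US):
    """Model a work item that loops over send/recv passes.

    pass_costs_us lists how long each pass takes, in microseconds.
    The budget is checked only after a pass has finished, so a single
    long pass can run well beyond it.
    Returns (elapsed_us, passes_left_for_requeue).
    """
    elapsed = 0
    done = 0
    for cost in pass_costs_us:
        elapsed += cost          # one pass; nothing interrupts it in this model
        done += 1
        if elapsed > budget_us:  # budget exhausted: stop and requeue the rest
            break
    return elapsed, len(pass_costs_us) - done


def would_warn(elapsed_us, thresh_us=WARN_THRESH_US):
    """True if one invocation ran longer than the warning threshold."""
    return elapsed_us > thresh_us
```

In this model, twenty passes of 500 us each stop shortly after the budget
is used up, while a single 12000 us pass overruns the budget and also
crosses the warning threshold.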

We need a regular test for this in blktests, as it doesn't look like we
caught this in regular testing ...

Kamaljit, can you please provide details of the tests you are running so
we can reproduce this?
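
Until there is a blktests case, one rough way to check a run for this
condition is to scan the kernel log for the warning text. The sketch
below is a hypothetical helper (summarize_hogs and HOG_RE are made-up
names, not part of blktests) and assumes each warning sits on a single
line, as the kernel prints it before any mail wrapping:

```python
import re

# Matches lines such as:
#   workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 4 times, ...
# The "[module]" part is optional, as in the drain_vmap_area_work lines.
HOG_RE = re.compile(
    r"workqueue: (?P<func>\S+)(?: \[(?P<module>[^\]]+)\])? "
    r"hogged CPU for >(?P<thresh_us>\d+)us (?P<count>\d+) times"
)


def summarize_hogs(log_text):
    """Return {work function: highest reported count} for dmesg-style text."""
    worst = {}
    for line in log_text.splitlines():
        m = HOG_RE.search(line)
        if m is None:
            continue  # unrelated line, e.g. the perf message
        func = m.group("func")
        worst[func] = max(worst.get(func, 0), int(m.group("count")))
    return worst
```

Feeding it the saved output of dmesg from a test run gives one entry per
offending work function; an empty result means no such warning was logged.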

-ck