Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)

Tue Jul 11 14:39:17 PDT 2023

Hello,

On Tue, Jul 11, 2023 at 04:06:22PM +0200, Geert Uytterhoeven wrote:
> On Tue, Jul 11, 2023 at 3:55 PM Geert Uytterhoeven <geert at linux-m68k.org> wrote:
> >
> > Hi Tejun,
> >
> > On Fri, May 12, 2023 at 9:54 PM Tejun Heo <tj at kernel.org> wrote:
> > > Workqueue now automatically marks per-cpu work items that hog CPU for too
> > > long as CPU_INTENSIVE, which excludes them from concurrency management and
> > > prevents stalling other concurrency-managed work items. If a work function
> > > keeps running over the thershold, it likely needs to be switched to use an
> > > unbound workqueue.
> > >
> > > This patch adds a debug mechanism which tracks the work functions which
> > > trigger the automatic CPU_INTENSIVE mechanism and report them using
> > > pr_warn() with exponential backoff.
> > >
> > > v2: Drop bouncing through kthread_worker for printing messages. It was to
> > >     avoid introducing circular locking dependency but wasn't effective as it
> > >     still had pool lock -> wci_lock -> printk -> pool lock loop. Let's just
> > >     print directly using printk_deferred().
> > >
> > > Signed-off-by: Tejun Heo <tj at kernel.org>
> > > Suggested-by: Peter Zijlstra <peterz at infradead.org>
> >
> > Thanks for your patch, which is now commit 6363845005202148
> > ("workqueue: Report work funcs that trigger automatic CPU_INTENSIVE
> > mechanism") in v6.5-rc1.
> >
> > I guess you are interested to know where this triggers.
> > I enabled CONFIG_WQ_CPU_INTENSIVE_REPORT=y, and tested
> > the result on various machines...
> 
> > OrangeCrab/Linux-on-LiteX-VexRiscV with ht16k33 14-seg display and ssd130xdrmfb:
> >
> >   workqueue: check_lifetime hogged CPU for >10000us 4 times, consider
> > switching to WQ_UNBOUND
> >   workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 1024
> > times, consider switching to WQ_UNBOUND
> >   workqueue: fb_flashcursor hogged CPU for >10000us 128 times,
> > consider switching to WQ_UNBOUND
> >   workqueue: ht16k33_seg14_update hogged CPU for >10000us 128 times,
> > consider switching to WQ_UNBOUND
> >   workqueue: mmc_rescan hogged CPU for >10000us 128 times, consider
> > switching to WQ_UNBOUND
> 
> Got one more after a while:
> 
> workqueue: neigh_managed_work hogged CPU for >10000us 4 times,
> consider switching to WQ_UNBOUND

I wonder whether the right thing to do here is somehow scaling the threshold
according to the relative processing power. It's difficult to come up with a
threshold which works well across the latest & fastest and really tiny CPUs.
I'll think about it some more but if you have some ideas, please feel free
to suggest.

Thanks.

-- 
tejun