[PATCH RFC v4 12/18] riscv_cbqri: resctrl: Add L3 cache occupancy monitoring

Sat May 16 15:39:53 PDT 2026

On Tue, May 12, 2026 at 05:00:00AM +0000, sashiko-bot at kernel.org wrote:
> >     resctrl_arch_reset_rmid() re-arms CONFIG_EVENT with EVT_ID=Occupancy.
> >     CONFIG_EVENT both resets the counter to 0 and selects the event, so
> >     re-arming with the same event keeps the MCID counting after reset rather
> >     than relying on sticky-last-event semantics that the CBQRI register
> >     definition does not guarantee.
> 
> The resctrl core only calls mon_event_read() with first=true (which
> triggers the arch reset hook) for Memory Bandwidth Monitoring events,
> never for L3 Occupancy.
> 
> Because resctrl_arch_reset_rmid() is never invoked for L3 Occupancy during
> RMID recycle, will the counter fail to re-arm when a new monitoring group
> is created, causing it to stop counting or inherit dirty data from previous
> users?

cbqri_init_mon_counters() pre-configures the MCID slot at domain init,
and occupancy is a current state read rather than an accumulator, so a
recycled MCID returns the live counter. I'll update the commit message
and drop the occupancy branch in resctrl_arch_reset_rmid() since it is
unreachable.

> > +/*
> > + * Per-event controller table. Only events CBQRI can back occupy a
> > + * slot, so other events do not bloat the array.
> > + */
> > +#define CBQRI_MAX_EVENT QOS_L3_OCCUP_EVENT_ID
> > +static struct cbqri_controller *cbqri_resctrl_counters[CBQRI_MAX_EVENT + 1];
> 
> This array is allocated and initialized in cbqri_resctrl_pick_counters()
> and nulled out during teardown, but its contents appear to never be accessed
> by any logic in the driver. 
> 
> Is this array just dead code?

No, cbqri_resctrl_counters[QOS_L3_MBM_TOTAL_EVENT_ID] is read in
cbqri_resctrl.c to decide whether to enable mbm_total_bytes. 

> > @@ -205,10 +225,101 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
> >  			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
> >  			   void *arch_priv, u64 *val, void *arch_mon_ctx)
> >  {
> > -	return -ENODATA;
> > +	struct cbqri_resctrl_dom *hw_dom;
> > +	struct cbqri_controller *ctrl;
> > +	struct rdt_ctrl_domain *d;
> > +	u64 ctr_val;
> > +	int err;
> > +
> > +	resctrl_arch_rmid_read_context_check();
> > +
> > +	/*
> > +	 * Each branch takes a sleeping mutex. Bail if called with IRQs
> > +	 * disabled (e.g. smp_call_function_any() from nohz_full CPUs).
> > +	 */
> > +	if (irqs_disabled())
> > +		return -EIO;
> 
> The resctrl core intentionally reads remote domains via
> smp_call_function_any() when the target CPU is nohz_full, running the
> callback in hardirq context.
> 
> Will rejecting calls in atomic context completely break cache occupancy
> monitoring for nohz_full CPUs, as any attempt to read their statistics
> will always fail with -EIO?

This matches MPAM's __read_mon(). fs/resctrl explicitly handles -EIO
from this class of drivers when dispatching via smp_call_function_any()
for nohz_full CPUs. I'll update the comment to clarify.

> > +	err = resctrl_online_mon_domain(res, &mon_dom->hdr);
> > +	if (err)
> > +		goto err_listdel;
> > +
> > +	err = cbqri_init_mon_counters(ctrl);
> 
> During domain initialization, resctrl_online_mon_domain() internally
> calls resctrl_arch_reset_rmid_all(), which loops over all max_rmid to
> issue CONFIG_EVENT MMIO operations.
> 
> Immediately after this returns, cbqri_init_mon_counters() is called,
> which loops over ctrl->mcid_count to issue the exact same CONFIG_EVENT
> MMIO operations again.
> 
> Since each CBQRI MMIO operation can sleep and take up to 1ms, does
> doing this twice per domain initialization introduce strictly redundant
> and significant latency during CPU hotplug?

Yes. I'll drop cbqri_init_mon_counters() from the L3 mon attach path,
since resctrl_arch_reset_rmid_all() called from
resctrl_online_mon_domain() already covers the same occupancy
configuration.

> > +	if (err)
> > +		goto err_offline;
> > +
> > +	return 0;
> > +
> > +err_offline:
> > +	cancel_delayed_work_sync(&mon_dom->cqm_limbo);
> > +	cancel_delayed_work_sync(&mon_dom->mbm_over);
> 
> The resctrl core only initializes mbm_over if resctrl_is_mbm_enabled()
> is true. Since the CBQRI driver only enables the L3 Occupancy event and
> not MBM, mbm_over remains completely zero-initialized from allocation.
> 
> Can calling cancel_delayed_work_sync() on a zeroed structure pass an
> uninitialized timer to del_timer_sync() and trigger kernel panics?

Yes. The fix is to guard the mbm_over cancel on
resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID). resctrl only
INIT_DELAYED_WORK's mbm_over when MBM is enabled, so a system without a
paired BC will no longer cancel a zeroed work struct.

> > +static int cbqri_attach_cpu_to_cap_ctrl(struct cbqri_controller *ctrl,
> > +					unsigned int cpu)
> > +{
> 
> The hotplug callback cbqri_resctrl_online_cpu() unconditionally skips
> controllers if !ctrl->alloc_capable.
> 
> Will this cause controllers that are only capable of monitoring to be
> ignored entirely, meaning no CPUs will ever be attached and no monitoring
> domains will be created on monitoring-only hardware?

This implementation does not support a capacity controller that supports
mon_capable but not alloc_capable. I'll add pr_warn_once() in
cbqri_resctrl_pick_caches() to make the limitation visible. 

> > -	domain = cbqri_create_ctrl_domain(ctrl, res, cpu, dom_id);
> > -	if (IS_ERR(domain))
> > -		return PTR_ERR(domain);
> > +	if (ctrl->mon_capable && ctrl->cache.cache_level == 3) {
> > +		err = cbqri_attach_cpu_to_l3_mon(ctrl, res, cpu);
> > +		if (err)
> > +			return err;
> 
> If cbqri_attach_cpu_to_l3_mon() fails here, the error is returned, but the
> CPU is not removed from the ctrl_domain mask and the domain structure is
> not freed.
> 
> Because the CPU hotplug core handles a startup failure by rolling back
> from the previous state, the offline callback for the current state
> (cbqri_resctrl_offline_cpu) is not invoked.
> 
> Does this leave the CPU permanently stranded in the ctrl_domain mask and
> leak the newly allocated domain structure?

Yes. I'll make cbqri_attach_cpu_to_cap_ctrl() roll back the
cpumask_set_cpu and tear down the new ctrl_domain when
cbqri_attach_cpu_to_l3_mon() fails.

> > +static void cbqri_detach_cpu_from_l3_mon(struct rdt_resource *res,
> > +					 unsigned int cpu)
> > +{
> > +	struct rdt_l3_mon_domain *mon_dom, *tmp;
> > +
> > +	lockdep_assert_held(&cbqri_domain_list_lock);
> > +
> > +	list_for_each_entry_safe(mon_dom, tmp, &res->mon_domains, hdr.list) {
> > +		if (!cpumask_test_cpu(cpu, &mon_dom->hdr.cpu_mask))
> > +			continue;
> > +		cpumask_clear_cpu(cpu, &mon_dom->hdr.cpu_mask);
> > +		if (cpumask_empty(&mon_dom->hdr.cpu_mask)) {
> > +			cancel_delayed_work_sync(&mon_dom->cqm_limbo);
> 
> This function is executed as a CPU hotplug callback, meaning the thread
> holds cpus_write_lock(). Concurrently, the cqm_handle_limbo() worker
> function in the resctrl core begins by acquiring cpus_read_lock().
> 
> If the worker starts executing while the hotplug thread is active, the
> worker will block on cpus_read_lock(), while the hotplug thread blocks in
> cancel_delayed_work_sync() waiting for the worker to finish.
> 
> Could this create an unresolvable AB-BA deadlock that hangs the system
> during CPU offline operations?

Yes, I will move the L3 mon_domain teardown path from
cancel_delayed_work_sync to cancel_delayed_work.

> > +			cancel_delayed_work_sync(&mon_dom->mbm_over);
> 
> (As mentioned above, mbm_over appears to be uninitialized here as well.)

I'll do the same fix here too.

-Drew