[RFC PATCH 3/5] RISC-V: hwprobe: Introduce which-cpus flag

Andrew Jones ajones at ventanamicro.com
Tue Oct 10 03:44:30 PDT 2023


On Mon, Oct 09, 2023 at 09:50:00AM -0700, Evan Green wrote:
> On Mon, Oct 9, 2023 at 8:39 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> >
> > On Thu, Oct 05, 2023 at 08:11:25PM +0200, Andrew Jones wrote:
> > > On Thu, Oct 05, 2023 at 10:12:14AM -0700, Evan Green wrote:
> > > > On Thu, Oct 5, 2023 at 6:23 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> > > > >
> > > > > On Mon, Sep 25, 2023 at 09:16:01AM -0700, Evan Green wrote:
> > > ...
> > > > > > So if I write out this algorithm, I get something like:
> > > > > >  * Create an array of every possible key, and dedupe the caller's list
> > > > > > of pairs into this array.
> > > > > >  * For each remaining cpu, go through this array and either confirm
> > > > > > the big array's element matches this cpu's value, or clear the cpu
> > > > > > from the result set.
> > > > > >
> > > > > > But why do we go to all the effort of de-duping the caller's array of
> > > > > > pairs? Can't they do that themselves (or pay a small performance
> > > > > > penalty for "yes" results)? Instead, couldn't it be something like:
> > > > > > For each pair in the user's set, for each remaining cpu in the set,
> > > > > > compare the values, or clear the cpu in the remaining set.
> > > > > >
> > > > > > Doing that would also take the runtime from O(keyspace * ncpus) to
> > > > > > O(query_length * ncpus).
> > > > >
> > > > > I want to de-dupe for two reasons:
> > > > >  * query_length is unbounded, but keyspace is bounded (and is currently
> > > > >    small)
> > > >
> > > > Ok, but remember that if we ship this behavior today, we're committed
> > > > to it forever. The keyspace is likely to grow, it would be unfortunate
> > > > if this step started to cause a noticeable performance delay.
> > >
> > > Maybe it's not too late to put a bound on pairs, i.e. pair_count greater
> > > than some number should return E2BIG.
> > >
> >
> > Scratch this idea. I went looking for precedent for limiting the length of
> > input arrays to syscalls, but couldn't find any. To the contrary, I found
> > that move_pages() used to return E2BIG when there were "too many pages to
> > move", but it hasn't done so since 2.6.29 and, even then, it appears the
> > concern was multiplication overflow, not having "too much" work. So, we
> > should leave the number of pairs unbounded, but, to me, that means we
> > should de-dupe, since we can then do the copy_from_user() just once for
> > each pair, and any user which decides to provide each bit separately for
> > each bitmask key type would get a tiny speedup.
> 
> I still don't get the usecase where usermode is going to be submitting
> this large jumble of keys, replete with many duplicates. And even if
> such a case did exist, why should the kernel be the one deduping,
> dragging in a runtime penalty for all other callers that didn't submit
> duplicates. Usermode could de-dupe on their own I'd think.

It's not about use cases, but about best handling of all allowed inputs.

If callers don't submit duplicates or make the call with a small
pair_count, then the de-dupe loop would have negligible overhead in
terms of time. However, thinking about this some more, we can avoid
issuing copy_from_user() multiple times by looping over cpus within
the loop over user input pairs. We'll do cpumask_set_cpu() and
cpumask_clear_cpu() on the one_cpu cpumask more times instead, but
that's just a couple loads and some arithmetic. I think this would
be the better approach, not because of time overhead, but because
we won't have to store pairs at all, avoiding memory allocation or
the assumption that we can store a pair per possible key on the stack.
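To make the loop structure concrete, here's a rough userspace sketch of that
approach: the outer loop walks the user's pairs and the inner loop walks the
remaining cpus, clearing any cpu whose value disagrees. A plain unsigned long
bitmask stands in for the cpumask, a local array stands in for the pairs
copied in via copy_from_user(), and value_for_cpu() is a made-up stand-in
for the per-cpu hwprobe resolution -- none of this is the actual kernel code,
just the shape of it:

```c
#include <stddef.h>

struct riscv_hwprobe {
	long long key;
	unsigned long long value;
};

#define NCPUS 4

/* Stand-in for resolving one key on one cpu (the real code would run the
 * hwprobe lookup against a one-cpu cpumask). Toy data: even cpus report
 * value 1 for every key, odd cpus report value 2. */
static unsigned long long value_for_cpu(long long key, int cpu)
{
	(void)key;
	return (cpu % 2 == 0) ? 1 : 2;
}

/* For each user pair, clear from *cpus every still-set cpu whose value
 * differs from the caller's expected value. No de-dupe pass and no stored
 * copy of the pairs: a duplicate pair just repeats the cheap inner loop. */
static void which_cpus(const struct riscv_hwprobe *pairs, size_t count,
		       unsigned long *cpus)
{
	for (size_t i = 0; i < count; i++) {		/* outer: user pairs */
		for (int cpu = 0; cpu < NCPUS; cpu++) {	/* inner: cpus */
			if (!(*cpus & (1UL << cpu)))
				continue;		/* already cleared */
			if (value_for_cpu(pairs[i].key, cpu) != pairs[i].value)
				*cpus &= ~(1UL << cpu);
		}
	}
}
```

Note the running time is O(pair_count * ncpus) with no allocation: each pair
is consumed as it arrives and never kept around, which is the point above
about not needing to store pairs on the stack or heap.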

Thanks,
drew



More information about the linux-riscv mailing list