[RFC PATCH 3/5] RISC-V: hwprobe: Introduce which-cpus flag

Andrew Jones ajones at ventanamicro.com
Tue Oct 10 09:22:47 PDT 2023


On Tue, Oct 10, 2023 at 08:14:16AM -0700, Evan Green wrote:
> On Tue, Oct 10, 2023 at 3:44 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> >
> > On Mon, Oct 09, 2023 at 09:50:00AM -0700, Evan Green wrote:
> > > On Mon, Oct 9, 2023 at 8:39 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> > > >
> > > > On Thu, Oct 05, 2023 at 08:11:25PM +0200, Andrew Jones wrote:
> > > > > On Thu, Oct 05, 2023 at 10:12:14AM -0700, Evan Green wrote:
> > > > > > On Thu, Oct 5, 2023 at 6:23 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> > > > > > >
> > > > > > > On Mon, Sep 25, 2023 at 09:16:01AM -0700, Evan Green wrote:
> > > > > ...
> > > > > > > > So if I write out this algorithm, I get something like:
> > > > > > > >  * Create an array of every possible key, and dedupe the caller's list
> > > > > > > > of pairs into this array.
> > > > > > > >  * For each remaining cpu, go through this array and either confirm
> > > > > > > > the big array's element matches this cpu's value, or clear the cpu
> > > > > > > > from the result set.
> > > > > > > >
> > > > > > > > But why do we go to all the effort of de-duping the caller's array of
> > > > > > > > pairs? Can't they do that themselves (or pay a small performance
> > > > > > > > penalty for "yes" results)? Instead, couldn't it be something like:
> > > > > > > > For each pair in the user's set, for each remaining cpu in the set,
> > > > > > > > compare the values, or clear the cpu in the remaining set.
> > > > > > > >
> > > > > > > > Doing that would also take the runtime from O(keyspace * ncpus) to
> > > > > > > > O(query_length * ncpus).
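
(Just to make that shape concrete, a sketch of the loop structure -- untested,
and value_for_key() is a made-up stand-in for however a key's value gets
computed for a single cpu:)

    /*
     * Sketch only: one pass over the user's pairs, inner pass over the
     * cpus still in the running. 'res' is the result cpumask.
     */
    cpumask_t res;
    size_t i;
    int cpu;

    for (i = 0; i < pair_count; i++) {
        struct riscv_hwprobe pair;

        if (copy_from_user(&pair, &pairs[i], sizeof(pair)))
            return -EFAULT;

        for_each_cpu(cpu, &res) {
            if (value_for_key(pair.key, cpu) != pair.value)
                cpumask_clear_cpu(cpu, &res);
        }
    }
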
> > > > > > >
> > > > > > > I want to de-dupe for two reasons:
> > > > > > >  * query_length is unbounded, but keyspace is bounded (and is currently
> > > > > > >    small)
> > > > > >
> > > > > > Ok, but remember that if we ship this behavior today, we're committed
> > > > > > to it forever. The keyspace is likely to grow, and it would be unfortunate
> > > > > > if this step started to cause a noticeable performance delay.
> > > > >
> > > > > Maybe it's not too late to put a bound on pairs, i.e., a pair_count
> > > > > greater than some number should return E2BIG.
> > > > >
> > > >
> > > > Scratch this idea. I went looking for precedent for limiting the length of
> > > > input arrays to syscalls, but couldn't find any. To the contrary, I found
> > > > that move_pages() used to return E2BIG when there were "too many pages to
> > > > move", but it hasn't done so since 2.6.29 and, even then, it appears the
> > > > concern was multiplication overflow, not having "too much" work. So, we
> > > > should leave the number of pairs unbounded, but, to me, that means we
> > > > should de-dupe, since we can do the copy_from_user() once for each pair at
> > > > that time, and any user which decides to provide each bit separately for
> > > > each bitmask key type should get a tiny speedup.
> > >
> > > I still don't get the use case where usermode is going to be submitting
> > > this large jumble of keys, replete with many duplicates. And even if
> > > such a case did exist, why should the kernel be the one deduping,
> > > dragging in a runtime penalty for all other callers that didn't submit
> > > duplicates? Usermode could de-dupe on its own, I'd think.
> >
> > It's not about use cases, but about the best handling of all allowed inputs.
> >
> > If callers don't submit duplicates, or they make the call with a small
> > pair_count, then the de-dupe loop would have negligible overhead in
> > terms of time. However, thinking about this some more, we can avoid
> > issuing copy_from_user() multiple times by looping over cpus within
> > the loop over user input pairs. We'll do cpumask_set_cpu() and
> > cpumask_clear_cpu() on the one_cpu cpumask more times instead, but
> > that's just a couple loads and some arithmetic. I think this would
> > be the better approach, not because of time overhead, but because
> > we won't have to store pairs at all, avoiding memory allocation or
> > the assumption that we can store a pair per possible key on the stack.
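
Roughly the loop structure I have in mind (untested sketch; it leans on
hwprobe_one_pair() filling in the value for a single-cpu mask and, IIRC,
flagging unknown keys by setting the key to -1; the match rule for bitmask
keys is a separate question, exact equality is shown here):

    cpumask_t res, one_cpu;
    size_t i;
    int cpu;

    /* 'res' starts out as the caller's cpu set, copied in earlier. */
    cpumask_clear(&one_cpu);

    for (i = 0; i < pair_count; i++) {
        struct riscv_hwprobe pair, tmp;

        if (copy_from_user(&pair, &pairs[i], sizeof(pair)))
            return -EFAULT;

        for_each_cpu(cpu, &res) {
            cpumask_set_cpu(cpu, &one_cpu);
            tmp = pair;
            hwprobe_one_pair(&tmp, &one_cpu);
            cpumask_clear_cpu(cpu, &one_cpu);

            if (tmp.key < 0) {
                /* unrecognized key: no cpu can match it */
                cpumask_clear(&res);
                break;
            }

            if (tmp.value != pair.value)
                cpumask_clear_cpu(cpu, &res);
        }
    }
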
> 
> That sounds good to me. If I understand the proposal right, you'll
> also be able to keep your optimization of looping only over the cpus
> that are "still in the running", which was a nice early exit for
> certain "no" answers.

Yup. If a cpu is dropped due to one pair, then it won't be checked against
later pairs. And, if a pair's key is not recognized, then no cpus will be
checked against later pairs, as that's a "clear all" condition.
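
FWIW, the caller side I'm picturing looks something like the below --
illustrative only; the flag name (RISCV_HWPROBE_WHICH_CPUS is assumed here)
and the exact ABI are whatever this series ends up defining:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <asm/hwprobe.h>     /* struct riscv_hwprobe, keys, flags */

    int main(void)
    {
        /* "Which of my cpus report the C extension?" */
        struct riscv_hwprobe pair = {
            .key = RISCV_HWPROBE_KEY_IMA_EXT_0,
            .value = RISCV_HWPROBE_IMA_C,
        };
        cpu_set_t cpus;
        long rc;

        CPU_ZERO(&cpus);
        sched_getaffinity(0, sizeof(cpus), &cpus);

        rc = syscall(__NR_riscv_hwprobe, &pair, 1, sizeof(cpus), &cpus,
                     RISCV_HWPROBE_WHICH_CPUS);
        if (rc)
            return 1;

        printf("%d matching cpus\n", CPU_COUNT(&cpus));
        return 0;
    }
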

Thanks,
drew


