[RFC PATCH 3/5] RISC-V: hwprobe: Introduce which-cpus flag

Evan Green evan at rivosinc.com
Tue Oct 10 08:14:16 PDT 2023


On Tue, Oct 10, 2023 at 3:44 AM Andrew Jones <ajones at ventanamicro.com> wrote:
>
> On Mon, Oct 09, 2023 at 09:50:00AM -0700, Evan Green wrote:
> > On Mon, Oct 9, 2023 at 8:39 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> > >
> > > On Thu, Oct 05, 2023 at 08:11:25PM +0200, Andrew Jones wrote:
> > > > On Thu, Oct 05, 2023 at 10:12:14AM -0700, Evan Green wrote:
> > > > > On Thu, Oct 5, 2023 at 6:23 AM Andrew Jones <ajones at ventanamicro.com> wrote:
> > > > > >
> > > > > > On Mon, Sep 25, 2023 at 09:16:01AM -0700, Evan Green wrote:
> > > > ...
> > > > > > > So if I write out this algorithm, I get something like:
> > > > > > >  * Create an array of every possible key, and dedupe the caller's list
> > > > > > > of pairs into this array.
> > > > > > >  * For each remaining cpu, go through this array and either confirm
> > > > > > > the big array's element matches this cpu's value, or clear the cpu
> > > > > > > from the result set.
> > > > > > >
> > > > > > > But why do we go to all the effort of de-duping the caller's array of
> > > > > > > pairs? Can't they do that themselves (or pay a small performance
> > > > > > > penalty for "yes" results)? Instead, couldn't it be something like:
> > > > > > > For each pair in the user's set, for each remaining cpu in the set,
> > > > > > > compare the values, or clear the cpu in the remaining set.
> > > > > > >
> > > > > > > Doing that would also take the runtime from O(keyspace * ncpus) to
> > > > > > > O(query_lengh * ncpus).
> > > > > >
> > > > > > I want to de-dupe for two reasons:
> > > > > >  * query_length is unbounded, but keyspace is bounded (and is currently
> > > > > >    small)
> > > > >
> > > > > Ok, but remember that if we ship this behavior today, we're committed
> > > > > to it forever. The keyspace is likely to grow, it would be unfortunate
> > > > > if this step started to cause a noticeable performance delay.
> > > >
> > > > Maybe it's not too late to put a bound on pairs, i.e. pair_count greater
> > > > than some number should return E2BIG.
> > > >
> > >
> > > Scratch this idea. I went looking for precedent for limiting the length of
> > > input arrays to syscalls, but couldn't find any. To the contrary, I found
> > > that move_pages() used to return E2BIG when there were "too many pages to
> > > move", but it hasn't done so since 2.6.29 and, even then, it appears the
> > > concern was multiplication overflow, not having "too much" work. So, we
> > > should leave the number of pairs unbounded, but, too me, that means we
> > > should de-dupe, since we can do the copy_from_user() once for each at that
> > > time, and any user which decides to provide each bit separately for each
> > > bitmask key type should get tiny speedup.
> >
> > I still don't get the usecase where usermode is going to be submitting
> > this large jumble of keys, replete with many duplicates. And even if
> > such a case did exist, why should the kernel be the one deduping,
> > dragging in a runtime penalty for all other callers that didn't submit
> > duplicates. Usermode could de-dupe on their own I'd think.
>
> It's not about usecases, but about best handling of all allowed inputs.
>
> If callers don't submit duplicates or make the call with a small
> pair_count, then the de-dupe loop would have negligible overhead in
> terms of time. However, thinking about this some more, we can avoid
> issuing copy_from_user() multiple times by looping over cpus within
> the loop over user input pairs. We'll do cpumask_set_cpu() and
> cpumask_clear_cpu() on the one_cpu cpumask more times instead, but
> that's just a couple loads and some arithmetic. I think this would
> be the better approach, not because of time overhead, but because
> we won't have to store pairs at all, avoiding memory allocation or
> the assumption that we can store a pair per possible key on the stack.

That sounds good to me. If I understand the proposal right, you'll
also be able to keep your optimization of looping only over the cpus
that are "still in the running", which was a nice early exit for
certain "no" answers.
-Evan



More information about the linux-riscv mailing list