[PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

Alexander Potapenko glider at google.com
Wed Jul 26 01:08:28 PDT 2023


On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov at gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > +                             unsigned long value,
> > +                             unsigned long start, unsigned long nbits)
> > +{
> > +     size_t index = BIT_WORD(start);
> > +     unsigned long offset = start % BITS_PER_LONG;
> > +     unsigned long space = BITS_PER_LONG - offset;
> > +
> > +     if (unlikely(!nbits))
> > +             return;
> > +     value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bounds type of error.

I can easily imagine someone passing -1 (or ~0) as the value while
only wanting to write the low nbits of it.
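
A minimal sketch of that scenario, assuming the bitmap_write()
implementation quoted above:

===========================================================
unsigned long map[1] = { 0 };

/* The caller passes ~0UL but only wants 4 bits written. */
bitmap_write(map, ~0UL, 8, 4);

/*
 * Thanks to the 'value &= GENMASK(nbits - 1, 0)' trim, only
 * bits 8..11 are set, i.e. map[0] == 0xf00. Without the trim,
 * 'value << offset' would also set bits 12..63.
 */
===========================================================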

>
> This is kind of gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
>
> > +     if (space >= nbits) {
> > +             map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

As noted in the other patch discussion, pulling the offset into
GENMASK() is not an identity transformation, because nbits + offset
may exceed BITS_PER_LONG, producing an invalid mask.
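
A hypothetical 64-bit example, with nbits == 64 and offset == 8:

===========================================================
GENMASK(nbits - 1, 0) << offset
        /* GENMASK(63, 0) << 8: well-defined, the topmost bits
         * of the mask are simply shifted out. */

GENMASK(nbits - 1 + offset, offset)
        /* GENMASK(71, 8): the high bound exceeds
         * BITS_PER_LONG - 1, so the shift inside GENMASK()
         * is undefined. */
===========================================================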

>
> > +             map[index] |= value << offset;
>
> Here you commit 2 reads and 2 writes for a word instead of one.

There won't be two reads and two writes.
The compiler will read map[index] into a register, apply the mask, OR
in value << offset, and then perform a single write back.
Unless map[] is volatile, any decent compiler optimizes the repeated
reads/writes away.
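
In effect, the &= / |= pair compiles to something like the following
sketch (not actual codegen, just the as-if transformation):

===========================================================
unsigned long w = map[index];              /* one read      */

w &= ~(GENMASK(nbits - 1, 0) << offset);   /* in a register */
w |= value << offset;

map[index] = w;                            /* one write     */
===========================================================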

>
> > +             return;
> > +     }
> > +     map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
>
> ~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
> ~25 bytes of .text on x86_64.
>
> > +     map[index] |= value << offset;
> > +     map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > +     map[index + 1] |= (value >> space);
> > +}
>
> With all that I think the implementation should look something like
> this:
>
>         unsigned long w, mask;
>
>         if (unlikely(nbits == 0))
>                 return 0;
>
>         map += BIT_WORD(start);
>         start %= BITS_PER_LONG;
>         end = start + nbits - 1;
>
>         w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
>         *map = w | (value << start);
>
>         if (end < BITS_PER_LONG)
>                 return;
>
>         w = *++map & BITMAP_FIRST_WORD_MASK(end);
>         *map = w | (value >> BITS_PER_LONG * 2 - end);
>
> It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
> it should be more efficient.
>
> Alexander, can you please try the above and compare?

I like the idea of sharing the first write between the branches, and
it can be made even shorter:

===========================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
                      unsigned long start, unsigned long nbits)
{
        unsigned long offset;
        unsigned long space;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        value &= GENMASK(nbits - 1, 0);
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
                             ~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
===========================================================
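
A quick sanity check (a hypothetical userspace harness, assuming
BIT_WORD()/GENMASK()/assert() are available) for a value straddling
the word boundary:

===========================================================
unsigned long map[2] = { 0, 0 };

/*
 * A 9-bit value starting at bit 60: 4 bits land in map[0],
 * the remaining 5 in map[1].
 */
bitmap_write_new(map, 0x1ff, 60, 9);

assert(map[0] == 0xf000000000000000UL);
assert(map[1] == 0x1f);
===========================================================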

According to Godbolt (https://godbolt.org/z/n5Te779bf),
bitmap_write_new() is 32 bytes shorter than your version under x86
Clang, and 8 bytes shorter under GCC (which, on the other hand, does a
poor job optimizing both).

Overall, given that there's currently a single user of these
functions, isn't it premature to optimize them without knowing
anything about their performance?

> In the previous iteration, I asked you to share disassembly listings
> for the functions. Can you please do that now?

Will godbolt work for you (see above)?


>
> Regarding the rest of the series:
>  - I still see Evgenii's name in mtecomp.c, and EA0 references;
Will fix, thanks!

>  - git-am throws warning about trailing line;
Interesting. I generate the patches using `git format-patch`, and I
thought `git am` was supposed to be its inverse. I'll check.

>  - checkpatch warns 7 times;

Indeed, there were warnings that I chose to ignore (e.g. one was
caused by adding extern symbols to the test module, as requested
during review). I think I can fix all of them.

>
> Can you fix all the above before sending the new version?
>
> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

BE64 works; I still need to check the 32-bit configurations.
Note that MTE is an ARM64 feature (but we still need to ensure
bitops.h works on 32-bit architectures).

>
> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

I'll provide the exact numbers with the next patch series. Last time I
checked, the order of magnitude was tens of megabytes.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

For out-of-line allocations the data can be derived from
/proc/slabinfo, but we don't currently account for inline allocations.
Agreed, a debugfs interface won't hurt.


