[PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

Sun Nov 8 15:14:50 EST 2020

On Sun, 8 Nov 2020 at 19:10, Arvind Sankar <nivedita at alum.mit.edu> wrote:
>
> On Sun, Nov 08, 2020 at 12:40:14PM -0500, Arvind Sankar wrote:
> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> > > Due to a Clang bug [1] neon autoloop vectorization does not happen or
> > > happens badly with no gains and considering previous GCC experiences
> > > which generated unoptimized code which was worse than the default asm
> > > implementation, it is safer to default clang builds to the known good
> > > generic implementation.
> > >
> > > The kernel currently supports a minimum Clang version of v10.0.1, see
> > > commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> > >
> > > When the bug gets eventually fixed, this commit could be reverted or,
> > > if the minimum clang version bump takes a long time, a warning could
> > > be added for users to upgrade their compilers like was done for GCC.
> > >
> > > [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> > >
> > > Signed-off-by: Adrian Ratiu <adrian.ratiu at collabora.com>
> > > ---
> > >  arch/arm/include/asm/xor.h | 3 ++-
> > >  arch/arm/lib/Makefile      | 3 +++
> > >  arch/arm/lib/xor-neon.c    | 4 ++++
> > >  3 files changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > > index aefddec79286..49937dafaa71 100644
> > > --- a/arch/arm/include/asm/xor.h
> > > +++ b/arch/arm/include/asm/xor.h
> > > @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
> > >             NEON_TEMPLATES;                 \
> > >     } while (0)
> > >
> > > -#ifdef CONFIG_KERNEL_MODE_NEON
> > > +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> > > +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
> > >
> > >  extern struct xor_block_template const xor_block_neon_inner;
> > >
> > > diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> > > index 6d2ba454f25b..53f9e7dd9714 100644
> > > --- a/arch/arm/lib/Makefile
> > > +++ b/arch/arm/lib/Makefile
> > > @@ -43,8 +43,11 @@ endif
> > >  $(obj)/csumpartialcopy.o:  $(obj)/csumpartialcopygeneric.S
> > >  $(obj)/csumpartialcopyuser.o:      $(obj)/csumpartialcopygeneric.S
> > >
> > > +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> > > +ifndef CONFIG_CC_IS_CLANG
> > >  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> > >    NEON_FLAGS                       := -march=armv7-a -mfloat-abi=softfp -mfpu=neon
> > >    CFLAGS_xor-neon.o                += $(NEON_FLAGS)
> > >    obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> > >  endif
> > > +endif
> > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > index e1e76186ec23..84c91c48dfa2 100644
> > > --- a/arch/arm/lib/xor-neon.c
> > > +++ b/arch/arm/lib/xor-neon.c
> > > @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
> > >   * Pull in the reference implementations while instructing GCC (through
> > >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> > >   * NEON instructions.
> > > +
> > > + * On Clang the loop vectorizer is enabled by default, but due to a bug
> > > + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> > > + * so xor-neon is disabled in favor of the default reg implementations.
> > >   */
> > >  #ifdef CONFIG_CC_IS_GCC
> > >  #pragma GCC optimize "tree-vectorize"
> > > --
> > > 2.29.0
> > >
> >
> > It's actually a bad idea to use #pragma GCC optimize. This is basically
> > the same as tagging all the functions with __attribute__((optimize)),
> > which GCC does not recommend for production use, as it _replaces_
> > optimization options rather than appending to them, and has been
> > observed to result in dropping important compiler flags.
> >
> > There've been a few discussions recently around other such cases:
> > https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@kernel.org/
> > https://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net/
> >
> > For this file, given that it is supposed to use -ftree-vectorize for the
> > whole file anyway, is there any reason it's not just added to CFLAGS via
> > the Makefile? This seems to be the only use of pragma optimize in the
> > kernel.
>
> Eg, this shows that the pragma results in dropping -fno-strict-aliasing.
> https://godbolt.org/z/1nfrKT
>
> The first function does not use vectorization because s and s->a might
> alias.
>

Thanks, Arvind. I wasn't aware of this issue at the time, but I agree
that we should replace the #pragma with a command line option in this
case.

And given that we already set CFLAGS_xor-neon.o in the Makefile,
adding it there would have been more straight-forward to begin with.