[LEDE-DEV] [PATCH] openssl: Enable assembler optimizations for aarch64
Baptiste Jonglez
baptiste at bitsofnetworks.org
Sat Oct 28 01:05:20 PDT 2017
The awesome AES performance was too good to be true: it seems to produce
incorrect results when encrypting on the pine64 and decrypting on an x86_64
machine :(
Possibly some of the assembler is optimized away by the compiler, which would
explain why it's so fast. Please don't merge this until I have investigated.
SHA does seem to give correct results though (and is really fast).
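
A known-answer test run separately on each machine can help narrow down a
mismatch like this, instead of a round trip between the two. The sketch
below is not part of the patch. It encrypts the first block of the
CBC-AES128 example from NIST SP 800-38A (F.2.1) with "openssl enc" and
compares the result with the published ciphertext. It assumes a POSIX shell
with od and tr available; the helper names are made up for illustration.

```sh
# Known-answer test for aes-128-cbc (NIST SP 800-38A, F.2.1, block 1).
KEY=2b7e151628aed2a6abf7158809cf4f3c
IV=000102030405060708090a0b0c0d0e0f
EXPECTED=7649abac8119b246cee98e9b12e9197d

# Emit the plaintext block 6bc1bee22e409f96e93d7e117393172a as raw bytes
# (octal escapes, since \x is not portable in printf).
kat_plaintext() {
    printf '\153\301\276\342\056\100\237\226\351\075\176\021\163\223\027\052'
}

# Hex-encode stdin on a single line.
to_hex() {
    od -An -tx1 | tr -d ' \n'
}

# Encrypt one block without padding and compare with the expected value.
# OPENSSL may point at a specific binary (assumption: defaults to "openssl").
aes_kat() {
    got=$(kat_plaintext |
        "${OPENSSL:-openssl}" enc -aes-128-cbc -K "$KEY" -iv "$IV" -nopad |
        to_hex)
    if [ "$got" = "$EXPECTED" ]; then
        echo "aes-128-cbc: OK"
    else
        echo "aes-128-cbc: MISMATCH (got $got)"
        return 1
    fi
}
```

Calling aes_kat on the pine64 and on the x86_64 machine would show which
side, if any, deviates from the reference value.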
On 27-10-17, Baptiste Jonglez wrote:
> OpenSSL is built with the generic linux settings for most targets,
> including aarch64. These generic settings are designed for 32-bit CPUs and
> provide no assembler optimization: this is far from optimal for aarch64.
>
> This patch simply switches to the aarch64 settings that are already
> available in OpenSSL.
>
> Here is the output of "openssl speed" before the optimization, with
> "(...)" representing build flags that didn't change:
>
> OpenSSL 1.0.2l 25 May 2017
> options:bn(64,32) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
> compiler: aarch64-openwrt-linux-musl-gcc (...)
>
> And after this patch, OpenSSL uses 64-bit mode and assembler optimizations:
>
> OpenSSL 1.0.2l 25 May 2017
> options:bn(64,64) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
> compiler: aarch64-openwrt-linux-musl-gcc (...) -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
>
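
One way to confirm which settings a given build picked up is to look for the
*_ASM defines in the compiler line, as in the two outputs quoted above. A
minimal sketch, assuming the flags are read from "openssl version -f"; the
helper name has_asm_flag is hypothetical:

```sh
# Hypothetical helper: succeed if the compiler flags read from stdin
# contain -D<NAME>_ASM, for example -DSHA256_ASM.
has_asm_flag() {
    grep -q -e "-D${1}_ASM"
}
```

For example, "openssl version -f | has_asm_flag SHA256 && echo enabled"
should print "enabled" only for a build like the patched one above.
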
> Here are some benchmarks on a pine64+ running latest LEDE master r5142-20d363aed3:
>
> before# openssl speed sha aes blowfish
> The 'numbers' are in 1000s of bytes per second processed.
> type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
> sha1              3918.89k    9982.43k   19148.03k   24933.03k   27325.78k
> sha256            4604.51k   10240.64k   17472.51k   21355.18k   22801.07k
> sha512            3662.19k   14539.41k   21443.16k   29544.11k   33177.60k
> blowfish cbc     16266.63k   16940.86k   17176.92k   17237.33k   17252.35k
> aes-128 cbc      19712.95k   21447.40k   22091.09k   22258.35k   22304.09k
> aes-192 cbc      17680.12k   19064.47k   19572.14k   19703.13k   19737.26k
> aes-256 cbc      15986.67k   17132.48k   17537.28k   17657.17k   17689.26k
>
> after# openssl speed sha aes blowfish
> type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
> sha1              6770.87k   26172.80k   86878.38k  205649.58k  345978.20k
> sha256           20913.93k   74663.85k  184658.18k  290891.09k  351032.66k
> sha512            7633.10k   30110.14k   50083.24k   71883.43k   82485.25k
> blowfish cbc     16224.93k   16933.55k   17173.76k   17234.94k   17252.35k
> aes-128 cbc      19425.74k   21193.31k   22065.74k   22304.77k   22380.54k
> aes-192 cbc      17452.29k   18883.84k   19536.90k   19741.70k   19800.06k
> aes-256 cbc      15815.89k   17003.01k   17530.03k   17695.40k   17746.60k
>
> For some reason AES and blowfish do not benefit, but SHA performance
> improves between 1.7x and 15x. SHA256 clearly benefits the most from the
> optimization (4.5x on small blocks, 15x on large blocks!).
>
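
The speedup factors quoted above are plain ratios of the "after" and
"before" columns. A small sketch to recompute them from the tables, assuming
awk is available; the function name speedup is made up for illustration:

```sh
# speedup BEFORE AFTER: print AFTER/BEFORE to one decimal place.
# A trailing "k", as printed by "openssl speed", is stripped first.
speedup() {
    awk -v b="${1%k}" -v a="${2%k}" 'BEGIN { printf "%.1f\n", a / b }'
}

speedup 4604.51k 20913.93k     # sha256, 16-byte blocks:   4.5
speedup 22801.07k 351032.66k   # sha256, 8192-byte blocks: 15.4
speedup 3918.89k 6770.87k      # sha1, 16-byte blocks:     1.7
```
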
> When using EVP (with "openssl speed -evp <algo>"):
>
> # Before, EVP mode
> type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
> sha1              3824.46k   10049.66k   19170.56k   24947.03k   27325.78k
> sha256            3368.33k    8511.15k   16061.44k   20772.52k   22721.88k
> sha512            2845.23k   11381.57k   19467.69k   28512.26k   33008.30k
> bf-cbc           15146.74k   16623.83k   17092.01k   17211.39k   17249.62k
> aes-128-cbc      17873.03k   20870.61k   21933.65k   22216.36k   22301.35k
> aes-192-cbc      16184.18k   18607.15k   19447.13k   19670.02k   19737.26k
> aes-256-cbc      14774.06k   16757.25k   17457.58k   17639.42k   17686.53k
>
> # After, EVP mode
> type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
> sha1              7056.97k   27142.10k   89515.86k  209155.41k  347419.99k
> sha256            7745.70k   29750.06k   95341.48k  211001.69k  332376.75k
> sha512            4550.47k   18086.06k   39997.10k   65880.75k   81431.21k
> bf-cbc           15129.20k   16619.03k   17090.56k   17212.76k   17246.89k
> aes-128-cbc      99619.74k  269032.34k  450214.23k  567353.00k  613933.06k
> aes-192-cbc      93180.74k  231017.79k  361766.66k  433671.51k  461731.16k
> aes-256-cbc      89343.23k  209858.58k  310160.04k  362234.88k  380878.85k
>
> Blowfish does not seem to have assembler optimization at all, and SHA
> still benefits (between 1.6x and 14.5x) but is generally slower than in
> non-EVP mode.
>
> However, AES performance improves by a factor of 5.5 to 27.5, which is
> really impressive! For aes-128-cbc on large blocks, a Core i7-6600U
> @ 2.60GHz is only twice as fast...
>
> Signed-off-by: Baptiste Jonglez <git at bitsofnetworks.org>
> ---
> package/libs/openssl/Makefile | 4 +++-
> package/libs/openssl/patches/110-optimize-for-size.patch | 3 ++-
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/package/libs/openssl/Makefile b/package/libs/openssl/Makefile
> index 7707c19431..d7037cb7c1 100644
> --- a/package/libs/openssl/Makefile
> +++ b/package/libs/openssl/Makefile
> @@ -11,7 +11,7 @@ PKG_NAME:=openssl
> PKG_BASE:=1.0.2
> PKG_BUGFIX:=l
> PKG_VERSION:=$(PKG_BASE)$(PKG_BUGFIX)
> -PKG_RELEASE:=1
> +PKG_RELEASE:=2
> PKG_USE_MIPS16:=0
>
> PKG_BUILD_PARALLEL:=0
> @@ -161,6 +161,8 @@ else
> OPENSSL_OPTIONS+=no-sse2
> ifeq ($(CONFIG_mips)$(CONFIG_mipsel),y)
> OPENSSL_TARGET:=linux-mips-openwrt
> + else ifeq ($(CONFIG_aarch64),y)
> + OPENSSL_TARGET:=linux-aarch64-openwrt
> else ifeq ($(CONFIG_arm)$(CONFIG_armeb),y)
> OPENSSL_TARGET:=linux-armv4-openwrt
> else
> diff --git a/package/libs/openssl/patches/110-optimize-for-size.patch b/package/libs/openssl/patches/110-optimize-for-size.patch
> index 0f174a3469..d6d4a21111 100644
> --- a/package/libs/openssl/patches/110-optimize-for-size.patch
> +++ b/package/libs/openssl/patches/110-optimize-for-size.patch
> @@ -1,11 +1,12 @@
> --- a/Configure
> +++ b/Configure
> -@@ -470,6 +470,12 @@ my %table=(
> +@@ -470,6 +470,13 @@ my %table=(
> "linux-alpha-ccc","ccc:-fast -readonly_strings -DL_ENDIAN::-D_REENTRANT:::SIXTY_FOUR_BIT_LONG RC4_CHUNK DES_INT DES_PTR DES_RISC1 DES_UNROLL:${alpha_asm}",
> "linux-alpha+bwx-ccc","ccc:-fast -readonly_strings -DL_ENDIAN::-D_REENTRANT:::SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_PTR DES_RISC1 DES_UNROLL:${alpha_asm}",
>
> +# OpenWrt targets
> +"linux-armv4-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${armv4_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
> ++"linux-aarch64-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${aarch64_asm}:linux64:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
> +"linux-x86_64-openwrt", "gcc:-m64 -DL_ENDIAN -DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHUNK DES_INT DES_UNROLL:${x86_64_asm}:elf:dlfcn:linux-shared:-fPIC:-m64:.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR):::64",
> +"linux-mips-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${mips32_asm}:o32:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
> +"linux-generic-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",