[PATCH v2 6/8] arm64: Import latest memcpy()/memmove() implementation
Marek Szyprowski
m.szyprowski at samsung.com
Tue Jun 8 05:21:19 PDT 2021
+ Kevin
On 08.06.2021 13:37, Robin Murphy wrote:
> Hi Marek,
>
> On 2021-06-08 12:15, Marek Szyprowski wrote:
>> Hi Robin,
>>
>> On 27.05.2021 17:34, Robin Murphy wrote:
>>> Import the latest implementation of memcpy(), based on the
>>> upstream code of string/aarch64/memcpy.S at commit afd6244 from
>>> https://github.com/ARM-software/optimized-routines, and subsuming
>>> memmove() in the process.
>>>
>>> Note that for simplicity Arm have chosen to contribute this code
>>> to Linux under GPLv2 rather than the original MIT license.
>>>
>>> Note also that the needs of the usercopy routines vs. regular memcpy()
>>> have now diverged so far that we abandon the shared template idea
>>> and the damage which that incurred to the tuning of LDP/STP loops.
>>> We'll be back to tackle those routines separately in future.
>>>
>>> Signed-off-by: Robin Murphy <robin.murphy at arm.com>
>>
>> This patch landed recently in linux-next as commit 285133040e6c ("arm64:
>> Import latest memcpy()/memmove() implementation"). Sadly, it causes
>> serious issues on the Khadas VIM3 board. Reverting it on top of linux
>> next-20210607 (together with commit 6b8f648959e5 and resolving the
>> conflict in the Makefile) fixes the issue. Here is the kernel log:
>>
>> Unable to handle kernel paging request at virtual address ffff8000136bd204
>> Mem abort info:
>> ESR = 0x96000061
>> EC = 0x25: DABT (current EL), IL = 32 bits
>> SET = 0, FnV = 0
>> EA = 0, S1PTW = 0
>> Data abort info:
>> ISV = 0, ISS = 0x00000061
>
> That's an alignment fault (DFSC 0x21 in the low bits of that ESR), which
> implies we're accessing something which isn't normal memory.
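>
> For anyone following along, the decode falls out like this (an
> illustrative C fragment, nothing more; the value is lifted straight
> from the oops above):
>
> 	unsigned long esr  = 0x96000061;         /* ESR from the oops */
> 	unsigned int  ec   = (esr >> 26) & 0x3f; /* 0x25: data abort, current EL */
> 	unsigned int  dfsc = esr & 0x3f;         /* 0x21: alignment fault */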
>
>> CM = 0, WnR = 1
>> swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000009da6000
>> [ffff8000136bd204] pgd=10000000f4806003, p4d=10000000f4806003,
>> pud=10000000f4805003, pmd=1000000000365003, pte=00680000ffe03713
>> Internal error: Oops: 96000061 [#1] PREEMPT SMP
>> Modules linked in: brcmfmac brcmutil cfg80211 dw_hdmi_i2s_audio
>> meson_gxl hci_uart btqca btbcm bluetooth panfrost ecdh_generic ecc
>> snd_soc_meson_axg_sound_card crct10dif_ce snd_soc_meson_card_utils
>> rfkill rtc_hym8563 gpu_sched dwmac_generic rc_khadas meson_gxbb_wdt
>> meson_ir pwm_meson snd_soc_meson_axg_tdmin snd_soc_meson_g12a_tohdmitx
>> rtc_meson_vrtc snd_soc_meson_axg_tdmout snd_soc_meson_axg_frddr
>> reset_meson_audio_arb snd_soc_meson_codec_glue axg_audio meson_rng
>> sclk_div dwmac_meson8b snd_soc_meson_axg_toddr mdio_mux_meson_g12a
>> clk_phase stmmac_platform rng_core snd_soc_meson_axg_fifo meson_dw_hdmi
>> stmmac meson_drm meson_canvas dw_hdmi pcs_xpcs display_connector
>> snd_soc_meson_axg_tdm_interface nvmem_meson_efuse adc_keys
>> snd_soc_meson_axg_tdm_formatter
>> CPU: 4 PID: 135 Comm: kworker/4:3 Not tainted 5.13.0-rc5-next-20210607
>> #10441
>> Hardware name: Khadas VIM3 (DT)
>> Workqueue: events request_firmware_work_func
>> pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
>> pc : __memcpy+0x2c/0x260
>> lr : sg_copy_buffer+0x90/0x118
>> ...
>> Call trace:
>> __memcpy+0x2c/0x260
>> sg_copy_to_buffer+0x14/0x20
>> meson_mmc_start_cmd+0xf4/0x2c8
>> meson_mmc_request+0x4c/0xb8
>> __mmc_start_request+0xa4/0x2a8
>> mmc_start_request+0x80/0xa8
>> mmc_wait_for_req+0x68/0xd8
>> mmc_io_rw_extended+0x1d4/0x2e0
>> sdio_io_rw_ext_helper+0xb0/0x1e8
>> sdio_memcpy_toio+0x20/0x28
>> brcmf_sdiod_skbuff_write.isra.18+0x2c/0x68 [brcmfmac]
>> brcmf_sdiod_ramrw+0xe0/0x230 [brcmfmac]
>> brcmf_sdio_firmware_callback+0xa8/0x7c8 [brcmfmac]
>> brcmf_fw_request_done+0x7c/0x100 [brcmfmac]
>> request_firmware_work_func+0x4c/0xd8
>> process_one_work+0x2a8/0x718
>> worker_thread+0x48/0x460
>> kthread+0x12c/0x160
>> ret_from_fork+0x10/0x18
>> Code: 540000c3 a9401c26 a97f348c a9001c06 (a93f34ac)
>> ---[ end trace be83fa283dc82415 ]---
>>
>> I hope the above log helps in fixing the issue. IIRC the SDHCI driver
>> on the VIM3 board uses internal SRAM for transferring data (instead of
>> DMA), so the issue is presumably related to that.
>
> Drivers shouldn't be using memcpy() on iomem mappings. Even if they
> happen to have got away with it sometimes ;)
>
> Taking a quick look at that driver,
>
> host->bounce_buf = host->regs + SD_EMMC_SRAM_DATA_BUF_OFF;
>
> is completely bogus, as Sparse will readily point out.
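>
> The shape a fix would take is to keep the SRAM window behind its
> __iomem annotation and only touch it with the io accessors. A rough
> sketch, untested ("linear_buf" is a made-up staging buffer here, and
> xfer_bytes stands in for whatever length the driver computes):
>
> 	void __iomem *sram = host->regs + SD_EMMC_SRAM_DATA_BUF_OFF;
>
> 	/* Stage the scatterlist data in normal memory first... */
> 	sg_copy_to_buffer(data->sg, data->sg_len, linear_buf, xfer_bytes);
> 	/* ...then move it into the SRAM window with the iomem-safe helper,
> 	 * rather than letting sg_copy_buffer() memcpy() straight into it. */
> 	memcpy_toio(sram, linear_buf, xfer_bytes);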
>
> Robin.
>
>>> ---
>>> arch/arm64/lib/Makefile | 2 +-
>>> arch/arm64/lib/memcpy.S | 272 ++++++++++++++++++++++++++++++++-------
>>> arch/arm64/lib/memmove.S | 189 ---------------------------
>>> 3 files changed, 230 insertions(+), 233 deletions(-)
>>> delete mode 100644 arch/arm64/lib/memmove.S
>>>
>>> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
>>> index d31e1169d9b8..01c596aa539c 100644
>>> --- a/arch/arm64/lib/Makefile
>>> +++ b/arch/arm64/lib/Makefile
>>> @@ -1,7 +1,7 @@
>>> # SPDX-License-Identifier: GPL-2.0
>>> lib-y := clear_user.o delay.o copy_from_user.o \
>>> copy_to_user.o copy_in_user.o copy_page.o \
>>> - clear_page.o csum.o memchr.o memcpy.o memmove.o \
>>> + clear_page.o csum.o memchr.o memcpy.o \
>>> memset.o memcmp.o strcmp.o strncmp.o strlen.o \
>>> strnlen.o strchr.o strrchr.o tishift.o
>>> diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
>>> index dc8d2a216a6e..31073a8304fb 100644
>>> --- a/arch/arm64/lib/memcpy.S
>>> +++ b/arch/arm64/lib/memcpy.S
>>> @@ -1,66 +1,252 @@
>>> /* SPDX-License-Identifier: GPL-2.0-only */
>>> /*
>>> - * Copyright (C) 2013 ARM Ltd.
>>> - * Copyright (C) 2013 Linaro.
>>> + * Copyright (c) 2012-2020, Arm Limited.
>>> *
>>> - * This code is based on glibc cortex strings work originally authored by Linaro
>>> - * be found @
>>> - *
>>> - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>>> - * files/head:/src/aarch64/
>>> + * Adapted from the original at:
>>> + * https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memcpy.S
>>> */
>>> #include <linux/linkage.h>
>>> #include <asm/assembler.h>
>>> -#include <asm/cache.h>
>>> -/*
>>> - * Copy a buffer from src to dest (alignment handled by the hardware)
>>> +/* Assumptions:
>>> + *
>>> + * ARMv8-a, AArch64, unaligned accesses.
>>> *
>>> - * Parameters:
>>> - * x0 - dest
>>> - * x1 - src
>>> - * x2 - n
>>> - * Returns:
>>> - * x0 - dest
>>> */
>>> - .macro ldrb1 reg, ptr, val
>>> - ldrb \reg, [\ptr], \val
>>> - .endm
>>> - .macro strb1 reg, ptr, val
>>> - strb \reg, [\ptr], \val
>>> - .endm
>>> +#define L(label) .L ## label
>>> - .macro ldrh1 reg, ptr, val
>>> - ldrh \reg, [\ptr], \val
>>> - .endm
>>> +#define dstin x0
>>> +#define src x1
>>> +#define count x2
>>> +#define dst x3
>>> +#define srcend x4
>>> +#define dstend x5
>>> +#define A_l x6
>>> +#define A_lw w6
>>> +#define A_h x7
>>> +#define B_l x8
>>> +#define B_lw w8
>>> +#define B_h x9
>>> +#define C_l x10
>>> +#define C_lw w10
>>> +#define C_h x11
>>> +#define D_l x12
>>> +#define D_h x13
>>> +#define E_l x14
>>> +#define E_h x15
>>> +#define F_l x16
>>> +#define F_h x17
>>> +#define G_l count
>>> +#define G_h dst
>>> +#define H_l src
>>> +#define H_h srcend
>>> +#define tmp1 x14
>>> - .macro strh1 reg, ptr, val
>>> - strh \reg, [\ptr], \val
>>> - .endm
>>> +/* This implementation handles overlaps and supports both memcpy and memmove
>>> +   from a single entry point. It uses unaligned accesses and branchless
>>> +   sequences to keep the code small, simple and improve performance.
>>> - .macro ldr1 reg, ptr, val
>>> - ldr \reg, [\ptr], \val
>>> - .endm
>>> +   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
>>> +   copies of up to 128 bytes, and large copies. The overhead of the overlap
>>> +   check is negligible since it is only required for large copies.
>>> - .macro str1 reg, ptr, val
>>> - str \reg, [\ptr], \val
>>> - .endm
>>> -
>>> - .macro ldp1 reg1, reg2, ptr, val
>>> - ldp \reg1, \reg2, [\ptr], \val
>>> - .endm
>>> -
>>> - .macro stp1 reg1, reg2, ptr, val
>>> - stp \reg1, \reg2, [\ptr], \val
>>> - .endm
>>> +   Large copies use a software pipelined loop processing 64 bytes per iteration.
>>> +   The destination pointer is 16-byte aligned to minimize unaligned accesses.
>>> +   The loop tail is handled by always copying 64 bytes from the end.
>>> +*/
>>> +SYM_FUNC_START_ALIAS(__memmove)
>>> +SYM_FUNC_START_WEAK_ALIAS_PI(memmove)
>>> SYM_FUNC_START_ALIAS(__memcpy)
>>> SYM_FUNC_START_WEAK_PI(memcpy)
>>> -#include "copy_template.S"
>>> + add srcend, src, count
>>> + add dstend, dstin, count
>>> + cmp count, 128
>>> + b.hi L(copy_long)
>>> + cmp count, 32
>>> + b.hi L(copy32_128)
>>> +
>>> + /* Small copies: 0..32 bytes. */
>>> + cmp count, 16
>>> + b.lo L(copy16)
>>> + ldp A_l, A_h, [src]
>>> + ldp D_l, D_h, [srcend, -16]
>>> + stp A_l, A_h, [dstin]
>>> + stp D_l, D_h, [dstend, -16]
>>> ret
>>> +
>>> + /* Copy 8-15 bytes. */
>>> +L(copy16):
>>> + tbz count, 3, L(copy8)
>>> + ldr A_l, [src]
>>> + ldr A_h, [srcend, -8]
>>> + str A_l, [dstin]
>>> + str A_h, [dstend, -8]
>>> + ret
>>> +
>>> + .p2align 3
>>> + /* Copy 4-7 bytes. */
>>> +L(copy8):
>>> + tbz count, 2, L(copy4)
>>> + ldr A_lw, [src]
>>> + ldr B_lw, [srcend, -4]
>>> + str A_lw, [dstin]
>>> + str B_lw, [dstend, -4]
>>> + ret
>>> +
>>> + /* Copy 0..3 bytes using a branchless sequence. */
>>> +L(copy4):
>>> + cbz count, L(copy0)
>>> + lsr tmp1, count, 1
>>> + ldrb A_lw, [src]
>>> + ldrb C_lw, [srcend, -1]
>>> + ldrb B_lw, [src, tmp1]
>>> + strb A_lw, [dstin]
>>> + strb B_lw, [dstin, tmp1]
>>> + strb C_lw, [dstend, -1]
>>> +L(copy0):
>>> + ret
>>> +
>>> + .p2align 4
>>> + /* Medium copies: 33..128 bytes. */
>>> +L(copy32_128):
>>> + ldp A_l, A_h, [src]
>>> + ldp B_l, B_h, [src, 16]
>>> + ldp C_l, C_h, [srcend, -32]
>>> + ldp D_l, D_h, [srcend, -16]
>>> + cmp count, 64
>>> + b.hi L(copy128)
>>> + stp A_l, A_h, [dstin]
>>> + stp B_l, B_h, [dstin, 16]
>>> + stp C_l, C_h, [dstend, -32]
>>> + stp D_l, D_h, [dstend, -16]
>>> + ret
>>> +
>>> + .p2align 4
>>> + /* Copy 65..128 bytes. */
>>> +L(copy128):
>>> + ldp E_l, E_h, [src, 32]
>>> + ldp F_l, F_h, [src, 48]
>>> + cmp count, 96
>>> + b.ls L(copy96)
>>> + ldp G_l, G_h, [srcend, -64]
>>> + ldp H_l, H_h, [srcend, -48]
>>> + stp G_l, G_h, [dstend, -64]
>>> + stp H_l, H_h, [dstend, -48]
>>> +L(copy96):
>>> + stp A_l, A_h, [dstin]
>>> + stp B_l, B_h, [dstin, 16]
>>> + stp E_l, E_h, [dstin, 32]
>>> + stp F_l, F_h, [dstin, 48]
>>> + stp C_l, C_h, [dstend, -32]
>>> + stp D_l, D_h, [dstend, -16]
>>> + ret
>>> +
>>> + .p2align 4
>>> + /* Copy more than 128 bytes. */
>>> +L(copy_long):
>>> + /* Use backwards copy if there is an overlap. */
>>> + sub tmp1, dstin, src
>>> + cbz tmp1, L(copy0)
>>> + cmp tmp1, count
>>> + b.lo L(copy_long_backwards)
>>> +
>>> + /* Copy 16 bytes and then align dst to 16-byte alignment. */
>>> +
>>> + ldp D_l, D_h, [src]
>>> + and tmp1, dstin, 15
>>> + bic dst, dstin, 15
>>> + sub src, src, tmp1
>>> + add count, count, tmp1 /* Count is now 16 too large. */
>>> + ldp A_l, A_h, [src, 16]
>>> + stp D_l, D_h, [dstin]
>>> + ldp B_l, B_h, [src, 32]
>>> + ldp C_l, C_h, [src, 48]
>>> + ldp D_l, D_h, [src, 64]!
>>> + subs count, count, 128 + 16 /* Test and readjust count. */
>>> + b.ls L(copy64_from_end)
>>> +
>>> +L(loop64):
>>> + stp A_l, A_h, [dst, 16]
>>> + ldp A_l, A_h, [src, 16]
>>> + stp B_l, B_h, [dst, 32]
>>> + ldp B_l, B_h, [src, 32]
>>> + stp C_l, C_h, [dst, 48]
>>> + ldp C_l, C_h, [src, 48]
>>> + stp D_l, D_h, [dst, 64]!
>>> + ldp D_l, D_h, [src, 64]!
>>> + subs count, count, 64
>>> + b.hi L(loop64)
>>> +
>>> + /* Write the last iteration and copy 64 bytes from the end. */
>>> +L(copy64_from_end):
>>> + ldp E_l, E_h, [srcend, -64]
>>> + stp A_l, A_h, [dst, 16]
>>> + ldp A_l, A_h, [srcend, -48]
>>> + stp B_l, B_h, [dst, 32]
>>> + ldp B_l, B_h, [srcend, -32]
>>> + stp C_l, C_h, [dst, 48]
>>> + ldp C_l, C_h, [srcend, -16]
>>> + stp D_l, D_h, [dst, 64]
>>> + stp E_l, E_h, [dstend, -64]
>>> + stp A_l, A_h, [dstend, -48]
>>> + stp B_l, B_h, [dstend, -32]
>>> + stp C_l, C_h, [dstend, -16]
>>> + ret
>>> +
>>> + .p2align 4
>>> +
>>> + /* Large backwards copy for overlapping copies.
>>> + Copy 16 bytes and then align dst to 16-byte alignment. */
>>> +L(copy_long_backwards):
>>> + ldp D_l, D_h, [srcend, -16]
>>> + and tmp1, dstend, 15
>>> + sub srcend, srcend, tmp1
>>> + sub count, count, tmp1
>>> + ldp A_l, A_h, [srcend, -16]
>>> + stp D_l, D_h, [dstend, -16]
>>> + ldp B_l, B_h, [srcend, -32]
>>> + ldp C_l, C_h, [srcend, -48]
>>> + ldp D_l, D_h, [srcend, -64]!
>>> + sub dstend, dstend, tmp1
>>> + subs count, count, 128
>>> + b.ls L(copy64_from_start)
>>> +
>>> +L(loop64_backwards):
>>> + stp A_l, A_h, [dstend, -16]
>>> + ldp A_l, A_h, [srcend, -16]
>>> + stp B_l, B_h, [dstend, -32]
>>> + ldp B_l, B_h, [srcend, -32]
>>> + stp C_l, C_h, [dstend, -48]
>>> + ldp C_l, C_h, [srcend, -48]
>>> + stp D_l, D_h, [dstend, -64]!
>>> + ldp D_l, D_h, [srcend, -64]!
>>> + subs count, count, 64
>>> + b.hi L(loop64_backwards)
>>> +
>>> + /* Write the last iteration and copy 64 bytes from the start. */
>>> +L(copy64_from_start):
>>> + ldp G_l, G_h, [src, 48]
>>> + stp A_l, A_h, [dstend, -16]
>>> + ldp A_l, A_h, [src, 32]
>>> + stp B_l, B_h, [dstend, -32]
>>> + ldp B_l, B_h, [src, 16]
>>> + stp C_l, C_h, [dstend, -48]
>>> + ldp C_l, C_h, [src]
>>> + stp D_l, D_h, [dstend, -64]
>>> + stp G_l, G_h, [dstin, 48]
>>> + stp A_l, A_h, [dstin, 32]
>>> + stp B_l, B_h, [dstin, 16]
>>> + stp C_l, C_h, [dstin]
>>> + ret
>>> +
>>> SYM_FUNC_END_PI(memcpy)
>>> EXPORT_SYMBOL(memcpy)
>>> SYM_FUNC_END_ALIAS(__memcpy)
>>> EXPORT_SYMBOL(__memcpy)
>>> +SYM_FUNC_END_ALIAS_PI(memmove)
>>> +EXPORT_SYMBOL(memmove)
>>> +SYM_FUNC_END_ALIAS(__memmove)
>>> +EXPORT_SYMBOL(__memmove)
>>> \ No newline at end of file
>>> diff --git a/arch/arm64/lib/memmove.S b/arch/arm64/lib/memmove.S
>>> deleted file mode 100644
>>> index 1035dce4bdaf..000000000000
>>> --- a/arch/arm64/lib/memmove.S
>>> +++ /dev/null
>>> @@ -1,189 +0,0 @@
>>> -/* SPDX-License-Identifier: GPL-2.0-only */
>>> -/*
>>> - * Copyright (C) 2013 ARM Ltd.
>>> - * Copyright (C) 2013 Linaro.
>>> - *
>>> - * This code is based on glibc cortex strings work originally authored by Linaro
>>> - * be found @
>>> - *
>>> - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>>> - * files/head:/src/aarch64/
>>> - */
>>> -
>>> -#include <linux/linkage.h>
>>> -#include <asm/assembler.h>
>>> -#include <asm/cache.h>
>>> -
>>> -/*
>>> - * Move a buffer from src to test (alignment handled by the hardware).
>>> - * If dest <= src, call memcpy, otherwise copy in reverse order.
>>> - *
>>> - * Parameters:
>>> - * x0 - dest
>>> - * x1 - src
>>> - * x2 - n
>>> - * Returns:
>>> - * x0 - dest
>>> - */
>>> -dstin .req x0
>>> -src .req x1
>>> -count .req x2
>>> -tmp1 .req x3
>>> -tmp1w .req w3
>>> -tmp2 .req x4
>>> -tmp2w .req w4
>>> -tmp3 .req x5
>>> -tmp3w .req w5
>>> -dst .req x6
>>> -
>>> -A_l .req x7
>>> -A_h .req x8
>>> -B_l .req x9
>>> -B_h .req x10
>>> -C_l .req x11
>>> -C_h .req x12
>>> -D_l .req x13
>>> -D_h .req x14
>>> -
>>> -SYM_FUNC_START_ALIAS(__memmove)
>>> -SYM_FUNC_START_WEAK_PI(memmove)
>>> - cmp dstin, src
>>> - b.lo __memcpy
>>> - add tmp1, src, count
>>> - cmp dstin, tmp1
>>> - b.hs __memcpy /* No overlap. */
>>> -
>>> - add dst, dstin, count
>>> - add src, src, count
>>> - cmp count, #16
>>> - b.lo .Ltail15 /*probably non-alignment accesses.*/
>>> -
>>> - ands tmp2, src, #15 /* Bytes to reach alignment. */
>>> - b.eq .LSrcAligned
>>> - sub count, count, tmp2
>>> - /*
>>> - * process the aligned offset length to make the src aligned firstly.
>>> - * those extra instructions' cost is acceptable. It also make the
>>> - * coming accesses are based on aligned address.
>>> - */
>>> - tbz tmp2, #0, 1f
>>> - ldrb tmp1w, [src, #-1]!
>>> - strb tmp1w, [dst, #-1]!
>>> -1:
>>> - tbz tmp2, #1, 2f
>>> - ldrh tmp1w, [src, #-2]!
>>> - strh tmp1w, [dst, #-2]!
>>> -2:
>>> - tbz tmp2, #2, 3f
>>> - ldr tmp1w, [src, #-4]!
>>> - str tmp1w, [dst, #-4]!
>>> -3:
>>> - tbz tmp2, #3, .LSrcAligned
>>> - ldr tmp1, [src, #-8]!
>>> - str tmp1, [dst, #-8]!
>>> -
>>> -.LSrcAligned:
>>> - cmp count, #64
>>> - b.ge .Lcpy_over64
>>> -
>>> - /*
>>> - * Deal with small copies quickly by dropping straight into the
>>> - * exit block.
>>> - */
>>> -.Ltail63:
>>> - /*
>>> - * Copy up to 48 bytes of data. At this point we only need the
>>> - * bottom 6 bits of count to be accurate.
>>> - */
>>> - ands tmp1, count, #0x30
>>> - b.eq .Ltail15
>>> - cmp tmp1w, #0x20
>>> - b.eq 1f
>>> - b.lt 2f
>>> - ldp A_l, A_h, [src, #-16]!
>>> - stp A_l, A_h, [dst, #-16]!
>>> -1:
>>> - ldp A_l, A_h, [src, #-16]!
>>> - stp A_l, A_h, [dst, #-16]!
>>> -2:
>>> - ldp A_l, A_h, [src, #-16]!
>>> - stp A_l, A_h, [dst, #-16]!
>>> -
>>> -.Ltail15:
>>> - tbz count, #3, 1f
>>> - ldr tmp1, [src, #-8]!
>>> - str tmp1, [dst, #-8]!
>>> -1:
>>> - tbz count, #2, 2f
>>> - ldr tmp1w, [src, #-4]!
>>> - str tmp1w, [dst, #-4]!
>>> -2:
>>> - tbz count, #1, 3f
>>> - ldrh tmp1w, [src, #-2]!
>>> - strh tmp1w, [dst, #-2]!
>>> -3:
>>> - tbz count, #0, .Lexitfunc
>>> - ldrb tmp1w, [src, #-1]
>>> - strb tmp1w, [dst, #-1]
>>> -
>>> -.Lexitfunc:
>>> - ret
>>> -
>>> -.Lcpy_over64:
>>> - subs count, count, #128
>>> - b.ge .Lcpy_body_large
>>> - /*
>>> - * Less than 128 bytes to copy, so handle 64 bytes here and then jump
>>> - * to the tail.
>>> - */
>>> - ldp A_l, A_h, [src, #-16]
>>> - stp A_l, A_h, [dst, #-16]
>>> - ldp B_l, B_h, [src, #-32]
>>> - ldp C_l, C_h, [src, #-48]
>>> - stp B_l, B_h, [dst, #-32]
>>> - stp C_l, C_h, [dst, #-48]
>>> - ldp D_l, D_h, [src, #-64]!
>>> - stp D_l, D_h, [dst, #-64]!
>>> -
>>> - tst count, #0x3f
>>> - b.ne .Ltail63
>>> - ret
>>> -
>>> - /*
>>> - * Critical loop. Start at a new cache line boundary. Assuming
>>> - * 64 bytes per line this ensures the entire loop is in one line.
>>> - */
>>> - .p2align L1_CACHE_SHIFT
>>> -.Lcpy_body_large:
>>> - /* pre-load 64 bytes data. */
>>> - ldp A_l, A_h, [src, #-16]
>>> - ldp B_l, B_h, [src, #-32]
>>> - ldp C_l, C_h, [src, #-48]
>>> - ldp D_l, D_h, [src, #-64]!
>>> -1:
>>> - /*
>>> - * interlace the load of next 64 bytes data block with store of the last
>>> - * loaded 64 bytes data.
>>> - */
>>> - stp A_l, A_h, [dst, #-16]
>>> - ldp A_l, A_h, [src, #-16]
>>> - stp B_l, B_h, [dst, #-32]
>>> - ldp B_l, B_h, [src, #-32]
>>> - stp C_l, C_h, [dst, #-48]
>>> - ldp C_l, C_h, [src, #-48]
>>> - stp D_l, D_h, [dst, #-64]!
>>> - ldp D_l, D_h, [src, #-64]!
>>> - subs count, count, #64
>>> - b.ge 1b
>>> - stp A_l, A_h, [dst, #-16]
>>> - stp B_l, B_h, [dst, #-32]
>>> - stp C_l, C_h, [dst, #-48]
>>> - stp D_l, D_h, [dst, #-64]!
>>> -
>>> - tst count, #0x3f
>>> - b.ne .Ltail63
>>> - ret
>>> -SYM_FUNC_END_PI(memmove)
>>> -EXPORT_SYMBOL(memmove)
>>> -SYM_FUNC_END_ALIAS(__memmove)
>>> -EXPORT_SYMBOL(__memmove)
>>
>> Best regards
>>
>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland