About ARM64 Kernel Instruction Trace

Steve Clevenger scclevenger at os.amperecomputing.com
Tue Aug 29 09:20:03 PDT 2023


Hi James,

On 8/29/2023 5:55 AM, James Clark wrote:
> 
> 
> On 29/08/2023 05:44, Steve Clevenger wrote:
>>
>> Ampere has been using the following command sequence to generate a 'perf
>> record' kernel instruction trace on an AmpereOne AARCH64 based systems
>> using a 6.3.0 Fedora distribution with CoreSight related patches.
>>
>>
>> [root at sut01sys-b212 kernel]# uname -a
>> Linux sut01sys-b212.scc-lab.amperecomputing.com 6.3.0-coresight-enabled+
>> #9 SMP PREEMPT_DYNAMIC Fri Jun 30 12:54:14 PDT 2023 aarch64 aarch64
>> aarch64 GNU/Linux
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# timeout 45s perf record --kcore -o kcore -e
>> cs_etm/@tmc_etr63/k --per-thread taskset --cpu-list 15 dd if=/dev/zero
>> of=/dev/null
>> [ perf record: Woken up 41 times to write data ]
>> [ perf record: Captured and wrote 144.056 MB kcore ]
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# ls -l ~/linux/vmlinux
>> -rwxr-xr-x. 1 root root 390426912 Jun 30 12:54 /home/stevec/linux/vmlinux*
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# timeout 45s perf script --input
>> ./kcore/data -s ../../scripts/arm-cs-trace-disasm.py -F
>> cpu,event,ip,addr,sym – -d objdump -k ~/linux/vmlinux > ./perf.itrace
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# ls -l perf.itrace
>> -rw-r--r--. 1 root root  485834744 Jul 17 14:18 perf.itrace
>>
>>
>> The executive summary is the code sections in vmlinux do not match the
>> actual code executing on the target due to self-modifying code
>> (alternate code sequences) in the kernel image. These sequences are
>> typically applied over nop place-holder instructions. This was verified
>> by comparing a location in the ./drivers/char/mem.c read_zero objdump
>> instruction stream to the memory resident instructions captured with the
>> TRACE32 ETM instruction trace feature. This particular instruction trace
>> example shows a disconnect. It displays a ‘b    ffff800008ab8a70
>> <read_zero+0x168>’ instruction at address 0xffff800008ab89e8 which did
>> not branch. The actual instruction present at 0xffff800008ab89e8 is a ‘nop’.
>>         .
>>         .
>>         ffff800008ab89a8:       1400000d        b       ffff800008ab89dc
>> <read_zero+0xd4>
>>               dd  8774/8774  [0015]         0.000000000  read_zero+0xa0
>>         ffff800008ab89dc <read_zero+0xd4>:
>>         ffff800008ab89dc:       9248f840        and     x0, x2,
>> #0xff7fffffffffffff
>>         ffff800008ab89e0:       aa1403e1        mov     x1, x20
>>         ffff800008ab89e4:       9418b6fb        bl      ffff8000090e65d0
>> <__arch_clear_user>
>>               dd  8774/8774  [0015]         0.000000000  read_zero+0xdc
>> ffff8000090e65d0 <__arch_clear_user>:
>>         ffff8000090e65d0:       d503245f        bti     c
>>         ffff8000090e65d4:       8b010002        add     x2, x0, x1
>>         ffff8000090e65d8:       f1002021        subs    x1, x1, #0x8
>>         ffff8000090e65dc:       54000104        b.mi    ffff8000090e65fc
>> <__arch_clear_user+0x2c>  // b.first
>>         ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
>>         ffff8000090e65e4:       91002000        add     x0, x0, #0x8
>>         ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
>>         ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
>> <__arch_clear_user+0x10>  // b.pmore
>>               dd  8774/8774  [0015]         0.000000000
>> __arch_clear_user+0x1c
>> ...vec/linux/arch/arm64/lib/clear_user.S   31         b.hi    1b
>>         ffff8000090e65e0 <__arch_clear_user+0x10>:
>>         ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
>>         ffff8000090e65e4:       91002000        add     x0, x0, #0x8
>>         ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
>>         ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
>> <__arch_clear_user+0x10>  // b.pmore
>>         .
>>         .
>>         ffff8000090e65f0:       f81f885f        sttr    xzr, [x2, #-8]
>>         ffff8000090e65f4:       d2800000        mov     x0, #0x0
>>                // #0
>>         ffff8000090e65f8:       d65f03c0        ret
>>               dd  8774/8774  [0015]         0.000000000
>> __arch_clear_user+0x28
>> ...vec/linux/arch/arm64/lib/clear_user.S   34         ret
>>         ffff800008ab89e8 <read_zero+0xe0>:
>>         ffff800008ab89e8:       14000022        b       ffff800008ab8a70
>> <read_zero+0x168>
>>         ffff800008ab89ec:       d503201f        nop
>>              dd  8774/8774  [0015]         0.000000000  read_zero+0xe4
>>                         /home/stevec/linux/drivers/char/mem.c     521
>>              left = clear_user(buf + cleared, chunk);
>>         ffff800008ab8a18 <read_zero+0x110>:
>>         ffff800008ab8a18:       8b14035a        add     x26, x26, x20
>>         ffff800008ab8a1c:       b5000360        cbnz    x0,
>> ffff800008ab8a88 <read_zero+0x180>
>>         ffff800008ab8a20:       f9400320        ldr     x0, [x25]
>>         ffff800008ab8a24:       cb140273        sub     x19, x19, x20
>>         .
>>         .
>>
>> Ostensibly, the ‘perf record –kcore’ option produces a representative
>> image of the kernel. But this option does not produce suitable output to
>> generate ARM64 kernel instruction trace. perf doesn’t disassemble ARM64,
>> so the arm-cs-trace-disasm.py python script is used with the native
>> objdump utility to provide ARM64 disassembly from CoreSight trace
>> capture. objdump itself requires an ELF/DWARF file image + symbols to
>> generate the symbolic (+ line information for mixed source) disassembly.
>> The linux vmlinux image + symbols file is typically used for this
>> purpose. The kallsyms file is not formatted for objdump use. As an
>> experiment, I patched the executable code sections in a local copy of
>> vmlinux with the corresponding executable code segments extracted from
>> the kcore image using an Ampere internal ELF patch utility.
>>
>> This patch utility leverages the (MIT Licensed) ELFIO open source
>> library API. These were the commands.
>>
>>
>> [root at sut01sys-b212 kernel]# timeout 30s perf record --kcore -o kcore -e
>> cs_etm/@tmc_etr63/k --per-thread taskset --cpu-list 15 dd if=/dev/zero
>> of=/dev/null
>> [ perf record: Woken up 25 times to write data ]
>> [ perf record: Captured and wrote 88.053 MB kcore ]
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# ls -l ~/linux/vmlinux
>> -rwxr-xr-x. 1 root root 390426912 Jun 30 12:54 /home/stevec/linux/vmlinux*
>> [root at sut01sys-b212 kernel]# cp ~/linux/vmlinux .
>> [root at sut01sys-b212 kernel]# patch-elf --help
>>
>> patch-elf overlays the kernel image from a local copy of
>> '/proc/kcore' to the  corresponding (by address) ELF sections
>> in a local copy of the vmlinux ELF file.
>> A local '/proc/kcore' is created by:
>>     'perf report --kcore -e cs_etm/@tmc_etr1/k ...'
>> The local (patched) vmlinux copy is used by:
>>     'perf script -s arm-cs-trace-disasm.py ...'
>> See the CoreSight Hardware-Assisted Trace Application Note for
>> use of the 'perf report' and 'perf script' commands.
>>
>> Usage: patch-elf <--verbose> kcore_file vmlinux_file
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# patch-elf kcore/kcore_dir/kcore ./vmlinux
>> ELF File kcore Properties
>> ELF file class:     ELF64
>> ELF file encoding:  Little endian
>> Machine:            ARM AArch64
>> Type:               Core file
>> Number of segments: 3
>> Number of sections: 0
>>
>> ELF File vmlinux Properties
>> ELF file class:     ELF64
>> ELF file encoding:  Little endian
>> Machine:            ARM AArch64
>> Type:               Shared object file
>> Number of segments: 3
>> Number of sections: 43
>>
>> Patching section[ 2] ffff800008010000  17997936 bytes
>>
>> Patching section[15] ffff800009a31000  20480 bytes
>>
>> Patching section[16] ffff800009a40000  612372 bytes
>>
>> Patching section[17] ffff800009ad5818  24752 bytes
>>
>> [root at sut01sys-b212 kernel]# ls -l ./vmlinux
>> -rwxr-xr-x. 1 root root 390426908 Jul 19 11:14 ./vmlinux*
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# timeout 45s perf script --input
>> ./kcore/data -s ../../scripts/arm-cs-trace-disasm.py -F
>> cpu,event,ip,addr,sym – -d objdump -k ./vmlinux > ./perf.itrace
>> [root at sut01sys-b212 kernel]#
>> [root at sut01sys-b212 kernel]# ls -l perf.itrace
>> -rw-r--r--. 1 root root  32142060 Jul 19 11:18 perf.itrace
>>
>> Here is the representative kernel instruction trace using a patched vmlinux.
>>
>>         .
>>         .
>>         ffff800008ab89a8:       1400000d        b       ffff800008ab89dc
>> <read_zero+0xd4>
>>               dd  8774/8774  [0015]         0.000000000  read_zero+0xa0
>>         ffff800008ab89dc <read_zero+0xd4>:
>>         ffff800008ab89dc:       9248f840        and     x0, x2,
>> #0xff7fffffffffffff
>>         ffff800008ab89e0:       aa1403e1        mov     x1, x20
>>         ffff800008ab89e4:       9418b6fb        bl      ffff8000090e65d0
>> <__arch_clear_user>
>>               dd  8774/8774  [0015]         0.000000000  read_zero+0xdc
>> ffff8000090e65d0 <__arch_clear_user>:
>>         ffff8000090e65d0:       d503245f        bti     c
>>         ffff8000090e65d4:       8b010002        add     x2, x0, x1
>>         ffff8000090e65d8:       f1002021        subs    x1, x1, #0x8
>>         ffff8000090e65dc:       54000104        b.mi    ffff8000090e65fc
>> <__arch_clear_user+0x2c>  // b.first
>>         ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
>>         ffff8000090e65e4:       91002000        add     x0, x0, #0x8
>>         ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
>>         ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
>> <__arch_clear_user+0x10>  // b.pmore
>>               dd  8774/8774  [0015]         0.000000000
>> __arch_clear_user+0x1c
>>         ffff8000090e65e0 <__arch_clear_user+0x10>:
>>         ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
>>         ffff8000090e65e4:       91002000        add     x0, x0, #0x8
>>         ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
>>         ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
>> <__arch_clear_user+0x10>  // b.pmore
>>         .
>>         .
>>         ffff8000090e65f0:       f81f885f        sttr    xzr, [x2, #-8]
>>         ffff8000090e65f4:       d2800000        mov     x0, #0x0
>>                // #0
>>         ffff8000090e65f8:       d65f03c0        ret
>>               dd  8774/8774  [0015]         0.000000000
>> __arch_clear_user+0x28
>> ...vec/linux/arch/arm64/lib/clear_user.S   34         ret
>>         ffff800008ab89e8 <read_zero+0xe0>:
>>         ffff800008ab89e8:       d503201f        nop
>>         ffff800008ab89ec:       1400000b        b       ffff800008ab8a18
>> <read_zero+0x110>
>>               dd  8774/8774  [0015]         0.000000000  read_zero+0xe4
>>                          /home/stevec/linux/drivers/char/mem.c     521
>>               left = clear_user(buf + cleared, chunk);
>>         ffff800008ab8a18 <read_zero+0x110>:
>>         ffff800008ab8a18:       8b14035a        add     x26, x26, x20
>>         ffff800008ab8a1c:       b5000360        cbnz    x0,
>> ffff800008ab8a88 <read_zero+0x180>
>>         ffff800008ab8a20:       f9400320        ldr     x0, [x25]
>>         ffff800008ab8a24:       cb140273        sub     x19, x19, x20
>>         .
>>         .
>>         .
>>
>> This begs the question what perf enhancements could be added to make
>> ARM64 kernel instruction trace easier to use? The process I’ve followed
>> is cumbersome, but could be done behind the scenes by perf. The caveat
>> is it requires a vmlinux which might not be available to an end user.
>> Here are 2 options.
>>
>> 1.	'perf report -kcore'could use the process I’ve used here
>> transparently in the background. The plus side is the objdump feature of
>> mixed disassembly is available based on the current vmlinux.
>>
> 
> Hi Steve,
> 
> What you're saying makes sense to me. I think #1 sounds best, I'm not
> sure of the use-case where you wanted to make actionable decisions from
> the trace but wouldn't have vmlinux available? Maybe an end-user could
> be missing it, but I can only imagine use cases where you are actively
> building and developing the kernel.

I agree the primary use case is for a kernel developer who will have a
vmlinux. There's a non-zero chance of complaints from a fringe element,
hence the second option. In my opinion the vmlinux solution makes better
sense and offers the best function.

> 
>> 2.	'perf report -kcore' generates an ELF + symbols file based on
>> kallsyms (and/or System.map). No vmlinux patching, so intermixed source
>> and disassembly wouldn’t be available. It’s a reasonable alternative
>> without relying on vmlinux.
>>
>> It makes sense performance-wise to use an ARM64 disassembler directly
>> through perf. perf-script use of the arm-cs-trace-disasm.py python
>> script can be slow. I’m unfamiliar with the Intel implementation, but
>> perf-annotate uses objdump. Unfortunately, I can't seem to get annotate
>> to work for me. A patch operation is still required if vmlinux is used.
>>
> 
> We have had other reports about arm-cs-trace-disasm.py being slow. I had
> a plan to make it work with disassembly directly from Perf as that is
> available in other modes. For example the annotate mode in perf already
> works on Arm. What error are you getting with it exactly? It might
> require installation of the aarch64 objdump tool if you are running on
> x86, or provide the --objdump option?

My self-hosted trace tests are done natively on Ampere systems. I don't
know if it's the case here, but python performance is compromised for
cases where the runtime is repeatedly loaded during use.

Nothing seems out-of-order with the 'perf annotate' command line, but I
admit I've not used it much. The complaint is a missing DSO (text below).

$ perf annotate --input=kcore/data -k ./vmlinux --itrace=iybxwpe

CS ETM Trace: Missing DSO. Use 'perf archive' or debuginfod to export
data from the traced system. Enable CONFIG_PROC_KCORE or use option -k
/path/to/vmlinux for kernel symbols.

> 
>> I suspect the CoreSight/perf communities are aware of these issues.
>> Is there any ongoing work not known to the outside world?
>>
>> Thanks and regards,
>> Steve C.



More information about the linux-arm-kernel mailing list