Re: ❌ FAIL: Test report for kernel 5.13.0-rc7 (arm-next, 8ab9b1a9)

Robin Murphy robin.murphy at arm.com
Fri Jun 25 04:15:46 PDT 2021


On 2021-06-25 12:09, Catalin Marinas wrote:
> On Fri, Jun 25, 2021 at 12:02:52PM +0100, Robin Murphy wrote:
>> On 2021-06-25 10:52, Veronika Kabatova wrote:
>> [...]
>>>>>          ❌ stress: stress-ng
>>>>
>>>> Oh no, this looks like another alignment fault in memcpy:
>>>>
>>>> [13330.651903] Unable to handle kernel paging request at virtual address ffff8000534705ff
>>>> [13330.651914] Mem abort info:
>>>> [13330.651918]   ESR = 0x96000021
>>>> [13330.651922]   EC = 0x25: DABT (current EL), IL = 32 bits
>>>> [13330.651928]   SET = 0, FnV = 0
>>>> [13330.651931]   EA = 0, S1PTW = 0
>>>> [13330.651933]   FSC = 0x21: alignment fault
>>>> [13330.651938] Data abort info:
>>>> [13330.651940]   ISV = 0, ISS = 0x00000021
>>>> [13330.651941]   CM = 0, WnR = 0
>>>> [13330.651943] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f3e6b000
>>>> [13330.651945] [ffff8000534705ff] pgd=1000008ffcfff003, p4d=1000008ffcfff003, pud=100000088e57d003, pmd=10000008d0aeb003, pte=006800008021370f
>>>> [13330.651956] Internal error: Oops: 96000021 [#1] SMP
>>>> [13330.651961] Modules linked in: unix_diag binfmt_misc fcrypt sm4_generic crc32_generic md4 michael_mic nhpoly1305_neon nhpoly1305 poly1305_generic libpoly1305 poly1305_neon rmd160 sha3_generic sm3_generic streebog_generic wp512 blowfish_generic blowfish_common cast5_generic des_generic libdes chacha_generic chacha_neon libchacha camellia_generic cast6_generic cast_common serpent_generic twofish_generic twofish_common dm_thin_pool dm_persistent_data dm_bio_prison nvme nvme_core ipmi_watchdog ipmi_poweroff loop tun af_key crypto_user scsi_transport_iscsi xt_multiport ip_gre ip_tunnel gre overlay xt_CONNSECMARK xt_SECMARK nft_counter xt_state xt_conntrack nft_compat ah6 ah4 nft_objref nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink jfs sctp ip6_udp_tunnel udp_tunnel dm_log_writes dm_flakey rfkill mlx5_ib ib_uverbs ib_core sunrpc coresight_etm4x i2c_smbus coresight_replicator coresight_tpiu coresight_tmc joydev mlx5_core acpi_ipmi psample ipmi_ssif mlxfw !
>>>>    ipmi_devintf
>>>> [13330.652076]  ipmi_msghandler coresight_funnel thunderx2_pmu coresight vfat fat fuse zram ip_tables xfs ast crct10dif_ce i2c_algo_bit ghash_ce drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm_ttm_helper ttm drm gpio_xlp i2c_xlp9xx uas usb_storage aes_neon_bs [last unloaded: nvmet]
>>>> [13330.652123] CPU: 115 PID: 188446 Comm: stress-ng Tainted: G           OEL    5.13.0-rc7 #1
>>>> [13330.652129] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.15 05/08/2020
>>>> [13330.652133] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
>>>> [13330.652139] pc : __memcpy+0x168/0x250
>>>> [13330.652150] lr : memory_read_from_buffer+0x58/0x80
>>>> [13330.652161] sp : ffff800063ef3c20
>>>> [13330.652163] x29: ffff800063ef3c20 x28: ffff0008b1380000 x27: 0000000000000000
>>>> [13330.652170] x26: 0000000000000000 x25: 0000000000000000 x24: ffff00080a960fe0
>>>> [13330.652176] x23: ffff800063ef3d28 x22: 000000000000063f x21: ffff800063ef3c88
>>>> [13330.652181] x20: 000000000000063f x19: 000000000000063f x18: 0000000000000000
>>>> [13330.652186] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>>>> [13330.652191] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
>>>> [13330.652196] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
>>>> [13330.652200] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
>>>> [13330.652206] x5 : ffff000d0fb0063f x4 : ffff80005347063f x3 : ffff000d0fb005c0
>>>> [13330.652212] x2 : ffffffffffffffef x1 : ffff800053470600 x0 : ffff000d0fb00000
>>>> [13330.652218] Call trace:
>>>> [13330.652221]  __memcpy+0x168/0x250
>>>> [13330.652225]  acpi_data_show+0x5c/0x8c
>>>> [13330.652232]  sysfs_kf_bin_read+0x78/0xa0
>>>> [13330.652238]  kernfs_file_read_iter+0x9c/0x1a4
>>>> [13330.652241]  kernfs_fop_read_iter+0x34/0x50
>>>> [13330.652244]  new_sync_read+0xdc/0x154
>>>> [13330.652253]  vfs_read+0x158/0x1e4
>>>> [13330.652260]  ksys_read+0x64/0xec
>>>> [13330.652266]  __arm64_sys_read+0x28/0x34
>>>> [13330.652273]  invoke_syscall+0x50/0x120
>>>> [13330.652280]  el0_svc_common.constprop.0+0x4c/0xd4
>>>> [13330.652284]  do_el0_svc+0x30/0x9c
>>>> [13330.652286]  el0_svc+0x2c/0x54
>>>> [13330.652294]  el0t_64_sync_handler+0x1a4/0x1b0
>>>> [13330.652296]  el0t_64_sync+0x19c/0x1a0
>>>> [13330.652303] Code: a984346c a9c4342c f1010042 54fffee8 (a97c3c8e)
>>>> [13330.652307] ---[ end trace 227d4380f57145d4 ]---
>>>>
>>>> So maybe this issue isn't limited to weird modules, after all...
>>>>
>>>
>>> It ran on the machine from the same set that we were able to reproduce
>>> it on previously. If you or anyone else have an idea on how to stabilize
>>> the reproducibility or have a debug patch we'll be happy to try it.
>>
>> Possibly it depends on the individual machines' firmware exactly how the
>> relevant bits of their ACPI tables are aligned in memory?
>>
>> I've started digging into that callstack - it may not be a "weird module"
>> but it's definitely crusty ACPI code... a238317ce818 ("ACPI: Clean up
>> acpi_os_map/unmap_memory() to eliminate __iomem.") looks frankly a bit
>> questionable in its decision to blithely cast away __iomem, but then the
>> rationale in aafc65c731fe ("ACPI: add arm64 to the platforms that use
>> ioremap") seems particularly dubious on top of that (especially given this
>> end result).
>>
>> At a wild guess, I'm wondering if this may be sufficient:
>>
>> ----->8-----
>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>> index 327e1b4eb6b0..f5d26b102fbe 100644
>> --- a/drivers/acpi/osl.c
>> +++ b/drivers/acpi/osl.c
>> @@ -277,7 +277,7 @@ acpi_map_lookup_virt(void __iomem *virt, acpi_size size)
>>          return NULL;
>>   }
>>
>> -#if defined(CONFIG_IA64) || defined(CONFIG_ARM64)
>> +#if defined(CONFIG_IA64)
>>   /* ioremap will take care of cache attributes */
>>   #define should_use_kmap(pfn)   0
>>   #else
>> -----8<-----
> 
> I thought the same but shouldn't acpi_os_ioremap() map it with the right
> attributes? It uses the EFI maps to check what kind of memory this is.

Oh crikey, I missed that branch of the rabbit hole... I guess that must 
mean that the tables being poked here are *not* covered by the EFI 
memory map, so page_is_ram() is unlikely to help either :(

Robin.


More information about the linux-arm-kernel mailing list